Unit 2
Word-level analysis in Natural Language Processing (NLP) focuses on understanding and processing individual
words in a text. This type of analysis is fundamental and serves as the basis for more complex language processing
tasks. Here are key components and techniques involved in word-level analysis:
Word-Level Analysis
1. Stemming: Reducing words to their root form (e.g., "running" to "run"). Examples: Porter Stemmer, Snowball Stemmer.
2. Lemmatization: Reducing words to their base or dictionary form, considering the context. Examples: WordNet Lemmatizer, spaCy's lemmatizer.
3. Part-of-Speech (POS) Tagging: Assigning parts of speech (e.g., noun, verb, adjective) to each word.
4. Named Entity Recognition (NER): Identifying and classifying proper nouns and entities in text (e.g., names, locations, organizations).
5. Word Sense Disambiguation (WSD): Determining the correct meaning of a word in context.
6. Stop-word Removal: Removing common words that do not carry significant meaning (e.g., "the", "is").
Here's a summary of the differences between lemmatization, stemming, and morphological analysis in NLP:
Stemming:
- Reduces words to their stem form
- Removes suffixes and other morphological endings
- Often uses simple rules or algorithms
- May not always produce a valid word
Lemmatization:
- Reduces words to their lemma or dictionary form
- Uses a dictionary or lexicon to determine the base form
- Produces a valid word
- More accurate than stemming
Morphological Analysis:
- Analyzes the internal structure of words (stems, prefixes, suffixes)
- Identifies morphemes and their grammatical properties
- More comprehensive than either stemming or lemmatization
In summary, stemming is a basic technique for reducing words to a stem, lemmatization is a more accurate technique for reducing words to their dictionary form, and morphological analysis is a more comprehensive technique for understanding the internal structure and meaning of words.
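The contrast above can be sketched with a toy suffix-stripping stemmer and a toy lemma dictionary. Both are illustrative assumptions, not real NLP tools; note how the crude stemmer produces non-words while the dictionary lookup always returns a valid form:

```python
def crude_stem(word):
    """Strip common suffixes by rule; may yield a non-word stem."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny stand-in for a real lexicon such as WordNet (toy assumption)
LEMMA_DICT = {
    "studies": "study",
    "running": "run",
    "better": "good",
}

def lemmatize(word):
    """Look the word up in a lexicon; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ("studies", "running"):
    # e.g. "studies" -> stem "stud" (not a word), lemma "study" (valid)
    print(w, "->", "stem:", crude_stem(w), "| lemma:", lemmatize(w))
```

A real stemmer such as the Porter Stemmer uses many more rules, but the trade-off shown here (speed and simplicity versus validity of the output) is the same.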
Regular expressions (regex) are powerful tools used for matching patterns in text. They provide a concise and
flexible means for identifying strings of text that follow a particular structure or format. Here's a comprehensive
overview of regular expressions:
1. Literal Characters
o Description: Regular characters in the pattern that match themselves.
o Example: The regex cat matches the string "cat" exactly.
2. Metacharacters
o Description: Special characters with unique meanings in regex.
o Examples:
. (dot): Matches any single character except newline.
^ (caret): Matches the start of a string.
$ (dollar sign): Matches the end of a string.
*, +, ?: Quantifiers that specify repetition (zero or more, one or more, zero or one
respectively).
3. Character Classes
o Description: Matches any one of a set of characters.
o Examples:
[abc]: Matches either 'a', 'b', or 'c'.
[a-z]: Matches any lowercase letter.
[^0-9]: Matches any character except digits.
4. Anchors
o Description: Specify positions in the text.
o Examples:
^ at the beginning of a pattern anchors it to the start of a string.
$ at the end of a pattern anchors it to the end of a string.
5. Quantifiers
o Description: Specify how many times a character or group can appear.
o Examples:
*: Matches zero or more occurrences.
+: Matches one or more occurrences.
?: Matches zero or one occurrence.
6. Grouping and Capturing
o Description: Enclose parts of a pattern in parentheses to group them together or capture matched
text.
o Examples:
(abc)+: Matches "abc", "abcabc", etc.
(a|b): Matches either 'a' or 'b'.
A common use is extracting email addresses. Note the top-level-domain class `[A-Za-z]{2,}`; writing it as `[A-Z|a-z]` would also (wrongly) match a literal `|`:

```python
import re
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
text = "Contact us at support@example.com for more information."
emails = re.findall(pattern, text)
print(emails)  # Output: ['support@example.com']
```
```python
pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
text = "Contact us at 123-456-7890 or 456.789.0123 for support."
phones = re.findall(pattern, text)
print(phones)  # Output: ['123-456-7890', '456.789.0123']
```
Tokenizing Text
```python
pattern = r'\b\w+\b'
text = "Tokenize this sentence into words."
tokens = re.findall(pattern, text)
print(tokens)  # Output: ['Tokenize', 'this', 'sentence', 'into', 'words']
```
- Compile Regex: Use re.compile(pattern) for efficiency if using the same pattern multiple times.
- Escape Special Characters: Use re.escape() for dynamic patterns where special characters need to be treated literally.
- Test Patterns: Use online regex testers (e.g., Regex101, RegExr) to visualize and debug patterns.
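A short sketch of the first two tips, using only the standard `re` module:

```python
import re

# Tip 1: precompile a pattern that will be reused many times
word_pattern = re.compile(r'\b\w+\b')
print(word_pattern.findall("Compile once, reuse many times."))
# Output: ['Compile', 'once', 'reuse', 'many', 'times']

# Tip 2: escape dynamic text so metacharacters match literally
user_input = "3.14"                 # '.' would otherwise match any char
safe = re.escape(user_input)        # becomes r'3\.14'
print(re.findall(safe, "pi is 3.14, not 3114"))  # Output: ['3.14']
```

Without `re.escape`, the pattern `3.14` would also match `3114`, since the unescaped dot matches any character.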
Regular expressions are versatile and indispensable tools for text processing tasks such as data validation, pattern
matching, and text manipulation in various programming languages, including Python.
Process of Morphological Parsing (Word-Level Analysis in NLP)
Morphological parsing breaks a word into its component morphemes and determines their grammatical properties. The process involves:
1. Tokenization and Morpheme Segmentation: The word is split into its stem and affixes.
2. Dictionary Lookup: Each segmented morpheme is looked up in a lexicon or dictionary to determine its meaning and grammatical properties.
3. Morphological Analysis: Based on the dictionary lookup, the system identifies the part-of-speech (POS) tags and other morphosyntactic features associated with each morpheme and the entire word.
Let's consider the word "unhappiest" and perform morphological parsing on it:
Input Word:
Word: unhappiest
Steps:
1. Tokenization:
o Tokenize the word into individual morphemes:
Stem: "happy"
Affixes: "un" (prefix), "est" (suffix)
2. Morpheme Segmentation:
o Stem: "happy"
o Affixes:
Prefix: "un-"
Suffix: "-est"
3. Dictionary Lookup:
o Look up each morpheme in a dictionary or lexicon:
Stem: "happy" (adjective, positive degree)
Prefix: "un-" (negative or reversing meaning)
Suffix: "-est" (superlative degree)
4. Morphological Analysis:
o Assign appropriate POS tags and morphosyntactic features:
"unhappiest" is analyzed as an adjective in superlative degree, indicating the most unhappy
state.
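The steps above can be sketched as a toy morphological parser. The affix and stem tables below are illustrative assumptions, not a real lexicon, and the y/i spelling repair is a single hard-coded rule:

```python
# Toy lexicon (assumptions for this example only)
PREFIXES = {"un": "negation"}
SUFFIXES = {"est": "superlative", "er": "comparative"}
STEMS = {"happy": "adjective"}

def parse(word):
    """Segment a word into prefix, stem, and suffix, then look each up."""
    analysis = {"word": word}
    for p, meaning in PREFIXES.items():          # strip known prefixes
        if word.startswith(p):
            analysis["prefix"] = (p, meaning)
            word = word[len(p):]
    for s, meaning in SUFFIXES.items():          # strip known suffixes
        if word.endswith(s):
            analysis["suffix"] = (s, meaning)
            word = word[: -len(s)]
    if word.endswith("i"):                       # orthographic repair: happi -> happy
        word = word[:-1] + "y"
    if word in STEMS:                            # dictionary lookup of the stem
        analysis["stem"] = (word, STEMS[word])
    return analysis

print(parse("unhappiest"))
```

Running this on "unhappiest" recovers the prefix "un-" (negation), the stem "happy" (adjective), and the suffix "-est" (superlative), mirroring the four steps above.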
Stemming and Lemmatization: Helps in reducing words to their base or dictionary forms, which is crucial
for tasks like information retrieval and text mining.
Grammatical Analysis: Provides insights into the syntactic and semantic structures of sentences, aiding in
parsing and understanding natural language.
Machine Translation: Ensures accurate translation by handling inflections and derivations in different
languages.
Text-to-Speech and Speech Recognition: Helps in generating correct pronunciations and understanding
spoken language by breaking down words into manageable units.
How does autocorrect work in NLP? Autocorrect first checks whether a given word is correct or not.
Spelling correction in NLP refers to detecting incorrect spellings and then correcting them.
There are many commercial and non-commercial spelling error detection and correction tools available for almost all popular languages. Virtually every tool works at the word level, with an integral dictionary or WordNet as the backend database for detection and correction. Every word in the text is looked up in the speller lexicon; when a word is not in the dictionary, it is flagged as an error. To correct the error, a spellchecker searches the dictionary or WordNet for the words most similar to the erroneous word. These words are then suggested to the user, who chooses the intended word.
TYPES OF SPELL ERRORS
Spelling-correction techniques are designed around observed spelling-error trends, also called error patterns; the most notable studies of these were performed by Damerau. According to these studies, spelling errors are generally divided into two types: typographic errors and cognitive errors.
Typographic errors (non-word errors): These occur when the correct spelling of the word is known but the word is mistyped by mistake. They are mostly related to the keyboard and therefore do not follow any linguistic criteria.
Cognitive errors (real-word errors): These occur when the correct spelling of the word is not known. In the case of cognitive errors, the pronunciation of the misspelled word is the same as or similar to the pronunciation of the intended correct word.
ERROR DETECTION
N-gram Analysis: N-gram analysis is a method for finding incorrectly spelled words in a mass of text. Instead of comparing each entire word in a text to a dictionary, just n-grams are checked, using an n-dimensional matrix in which real n-gram frequencies are stored. If a non-existent or rare n-gram is found, the word is flagged as a misspelling; otherwise it is not. An n-gram is a sequence of n consecutive characters taken from a string. This method is language independent, as it requires no knowledge of the language for which it is used. In this algorithm, each string involved in the comparison is split into sets of adjacent n-grams. The similarity between two strings is obtained by finding the number of unique n-grams they share and then calculating a similarity coefficient: the number of n-grams in common (intersection) divided by the total number of n-grams in the two words (union).
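The similarity coefficient just described can be sketched as a Jaccard-style ratio over character bigrams (the example words are illustrative):

```python
def ngrams(word, n=2):
    """Set of unique n-grams (consecutive character sequences) in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(a, b, n=2):
    """Shared unique n-grams divided by the union of n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# "necessary" and its common misspelling share 8 of 9 unique bigrams
print(round(similarity("necessary", "neccessary"), 2))  # Output: 0.89
```

A high coefficient between a flagged string and a dictionary word makes that word a good correction candidate.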
Dictionary Lookup: A dictionary or WordNet is a lexical resource that contains a list of the correct words of a particular language. Non-word errors can be easily detected by checking each word against such a dictionary. The drawback of this method is the difficulty of keeping the dictionary up to date and sufficiently extensive to cover all the words in a text. A large-scale lexical resource is given by a linguistic ontology, which covers many words of a language and has a hierarchical structure based on the relationships between concepts. WordNet is a widely used lexical resource; it covers nouns, verbs, adjectives, and adverbs.
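A minimal sketch of dictionary-lookup detection, assuming a toy lexicon in place of a full dictionary or WordNet:

```python
# Toy lexicon standing in for a real dictionary (assumption)
LEXICON = {"the", "cat", "sat", "on", "mat"}

def detect_errors(text):
    """Flag every token that is absent from the lexicon as a non-word error."""
    return [w for w in text.lower().split() if w not in LEXICON]

print(detect_errors("the cat zat on the mat"))  # Output: ['zat']
```

Note that real-word (cognitive) errors pass this check unnoticed, since e.g. "their" typed for "there" is still a dictionary word; that is exactly the limitation described above.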
ERROR CORRECTION
Error correction typically involves two steps:
1. Generation of candidate corrections: Retrieving from the dictionary the words that could be the intended spelling of the erroneous string.
2. Ranking of candidate corrections: The ranking process usually invokes some lexical similarity measure between the misspelled string and the candidates, or a probabilistic estimate of the likelihood of each correction, to rank-order the candidates.
These two steps are usually treated as separate processes executed in sequence. Some techniques omit the second step, however, leaving the ranking and final selection to the user.
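One common lexical similarity measure for ranking candidates is Levenshtein edit distance; here is a sketch, with an assumed toy candidate list:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def rank_candidates(misspelled, candidates):
    """Order candidate corrections by increasing edit distance."""
    return sorted(candidates, key=lambda c: edit_distance(misspelled, c))

print(rank_candidates("speling", ["splint", "spelling", "spieling"]))
```

The candidate closest in edit distance ("spelling", one insertion away) is ranked first; a probabilistic spellchecker would additionally weight candidates by their corpus frequency.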
Here is a simple Python example using the `textblob` library for spelling correction:
```python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```
This code will output:
```
This is an example of spelling error detection and correction.
```
For more advanced applications, you can use libraries like `pyspellchecker` for dictionary-based corrections or train machine learning models using frameworks like `transformers` to perform contextual corrections. Here's an example using `pyspellchecker`:
```python
from spellchecker import SpellChecker

def correct_spelling(text):
    spell = SpellChecker()
    words = text.split()  # naive split: punctuation stays attached to words
    # correction() may return None when no candidate is found,
    # so fall back to the original word in that case
    corrected_words = [spell.correction(word) or word for word in words]
    corrected_text = " ".join(corrected_words)
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```
This code also corrects spelling errors and provides the corrected text.
Words
Words are the basic units of language and text. They can be further categorized into various types based on their
function and meaning.
Words can be grouped into classes based on their grammatical roles in sentences. Here are the main parts of speech:
1. Nouns: Represent people, places, things, or ideas (e.g., "cat", "New York", "happiness").
2. Pronouns: Stand in for nouns (e.g., "he", "she", "it", "they").
3. Verbs: Describe actions, states, or occurrences (e.g., "run", "is", "seem").
4. Adjectives: Describe or modify nouns (e.g., "blue", "quick", "happy").
5. Adverbs: Modify verbs, adjectives, or other adverbs (e.g., "quickly", "very", "well").
6. Prepositions: Show relationships between nouns (or pronouns) and other words in a sentence (e.g., "in",
"on", "at").
7. Conjunctions: Connect words, phrases, or clauses (e.g., "and", "but", "because").
8. Determiners: Introduce nouns and specify them in some way (e.g., "the", "a", "this").
9. Interjections: Express emotions or reactions (e.g., "oh!", "wow!", "ouch!").
One of the core tasks in Natural Language Processing (NLP) is Part-of-Speech (PoS) tagging: assigning each word in a text a grammatical category, such as noun, verb, adjective, or adverb. By improving the comprehension of phrase structure and semantics, this technique enables machines to analyze and understand human language more accurately.
Parts of Speech
Noun (NN): Represents people, places, things, or concepts (e.g., "cat", "New York", "happiness").
Verb (VB): Represents actions or states (e.g., "run", "is").
Adjective (JJ): Describes or modifies nouns (e.g., "happy", "blue").
Adverb (RB): Modifies verbs, adjectives, or other adverbs (e.g., "quickly", "very").
Pronoun (PRP): Substitutes for nouns (e.g., "he", "they").
Preposition (IN): Shows relationships between nouns (e.g., "on", "in").
Conjunction (CC): Connects words, phrases, or clauses (e.g., "and", "but").
Determiner (DT): Introduces nouns (e.g., "the", "a").
Interjection (UH): Expresses emotion or exclamation (e.g., "oh", "wow").
POS tagging is a fundamental component of many NLP applications and plays a crucial role in understanding and
processing human language.
Syntactic analysis, also known as parsing, is a fundamental task in Natural Language Processing (NLP) that involves
analyzing the grammatical structure of sentences to establish how words relate to each other. This process is crucial
for understanding the syntax (sentence structure) of natural language and is typically performed using syntactic
parsers. Here's a detailed explanation of syntactic analysis in NLP:
Syntactic analysis, parsing, or syntax analysis is the third phase of NLP. Its purpose is to check that the text is well formed with respect to the rules of formal grammar, as a step toward drawing exact meaning from it. Note that a phrase like "hot ice-cream" is syntactically well formed but would be rejected by the semantic analyzer as meaningless.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of formal grammar. The word 'parsing' originates from the Latin word 'pars', which means 'part'.
1. Structural Understanding: Identifying the hierarchical structure of sentences, including phrases and
clauses.
2. Dependency Relations: Establishing relationships between words to determine which words modify or
depend on others.
3. Grammatical Role Assignment: Assigning syntactic roles such as subject, object, predicate, etc., to words
within a sentence.
Abbreviation  Meaning
CC    coordinating conjunction
CD    cardinal digit
DT    determiner
EX    existential there
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective (large)
JJR   adjective, comparative (larger)
JJS   adjective, superlative (largest)
LS    list marker
MD    modal (could, will)
NN    noun, singular (cat, tree)
NNS   noun, plural (desks)
NNP   proper noun, singular (Sarah)
NNPS  proper noun, plural (Indians or Americans)
PDT   predeterminer (all, both, half)
POS   possessive ending (parent's)
PRP   personal pronoun (hers, herself, him, himself)
PRP$  possessive pronoun (her, his, mine, my, our)
RB    adverb (occasionally, swiftly)
RBR   adverb, comparative (greater)
RBS   adverb, superlative (biggest)
RP    particle (about)
TO    infinitive marker (to)
UH    interjection (goodbye)
VB    verb, base form (ask)
VBD   verb, past tense (pleaded)
VBG   verb, gerund (judging)
VBN   verb, past participle (reunified)
VBP   verb, present tense, not 3rd person singular (wrap)
VBZ   verb, present tense, 3rd person singular (bases)
WDT   wh-determiner (that, what)
WP    wh-pronoun (who)
WRB   wh-adverb (how)
The above list contains all the NLTK POS tags. The NLTK POS tagger is used to assign grammatical information to each word of a sentence; using it requires installing NLTK, importing it, and downloading the required tagger packages.
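Real tagging would use `nltk.pos_tag` after downloading the tagger models. As a dependency-free illustration, the sketch below tags words by lookup in a toy table using the Penn Treebank abbreviations listed above; the table and the unknown-word default of NN are assumptions for this example:

```python
# Toy tag table (assumption); a real tagger learns this from corpora
TAG_TABLE = {
    "the": "DT", "a": "DT",
    "cat": "NN", "tree": "NN",
    "sat": "VBD", "sings": "VBZ",
    "on": "IN",
    "quickly": "RB",
}

def pos_tag(tokens):
    """Tag each token by lookup; unknown words default to NN (noun)."""
    return [(t, TAG_TABLE.get(t.lower(), "NN")) for t in tokens]

print(pos_tag("The cat sat on the tree".split()))
# Output: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
#          ('on', 'IN'), ('the', 'DT'), ('tree', 'NN')]
```

Lookup tagging ignores context, so ambiguous words (e.g. "book" as noun or verb) get a fixed tag; statistical taggers like NLTK's resolve this from surrounding words.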
1. Machine Translation: Understanding the grammatical structure of sentences in both source and target
languages is crucial for accurate translation.
2. Information Retrieval: Parsing helps in extracting structured information from text databases based on
syntactic patterns.
3. Question Answering: Analyzing syntactic structures aids in understanding questions and finding
appropriate answers from text.
4. Grammar Checking: Identifying syntactic errors and suggesting corrections in written text.
Syntactic Analysis Example: consider the sentence "She sings beautifully".
o Identify the subject: "She"
o Identify the verb phrase: "sings beautifully"
o Assign syntactic roles: "She" as the subject, "sings" as the verb, and "beautifully" as an adverb modifying "sings".
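This role assignment can be sketched for the one sentence pattern above. The lexicon and the fixed Subject-Verb-Adverb template are toy assumptions; real parsers (e.g. spaCy or a CFG parser) derive the structure from full grammars or statistical models:

```python
# Toy lexicon mapping words to POS tags (assumption for this example)
LEXICON = {"she": "PRP", "sings": "VBZ", "beautifully": "RB"}

def analyze(sentence):
    """Assign syntactic roles when the sentence fits Pronoun-Verb-Adverb."""
    tokens = sentence.split()
    tags = [LEXICON[t.lower()] for t in tokens]
    if tags == ["PRP", "VBZ", "RB"]:
        return {"subject": tokens[0],
                "verb": tokens[1],
                "modifier": (tokens[2], "modifies", tokens[1])}
    return None  # pattern not recognized by this toy template

print(analyze("She sings beautifully"))
# Output: {'subject': 'She', 'verb': 'sings',
#          'modifier': ('beautifully', 'modifies', 'sings')}
```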
In conclusion, syntactic analysis plays a critical role in NLP by enabling computers to understand the grammatical
structure of human language, facilitating a wide range of applications from information extraction to automated
translation and beyond.
Machine-readable dictionaries provide dictionary content in a structured form that programs can query. Examples include:
1. WordNet: A large lexical database of English developed at Princeton University. WordNet groups words
into sets of synonyms called synsets, provides short definitions and usage examples, and records various
semantic relations between these synonym sets.
2. Merriam-Webster's Dictionary API: Provides programmatic access to dictionary and thesaurus content
from Merriam-Webster, including definitions, synonyms, antonyms, and example sentences.
3. Oxford Dictionaries API: Offers access to Oxford's lexical data, including definitions, translations, and
usage examples across multiple languages.
Lexical Databases
Lexical databases are more comprehensive than dictionaries, often including not only word definitions but also
detailed information on morphological, syntactic, and semantic properties. Key examples include:
1. WordNet
WordNet is a large lexical database of English developed at Princeton University. It groups English words into sets
of synonyms called synsets, provides short definitions and usage examples, and records various semantic relations
between these synonym sets.
2. FrameNet
FrameNet is based on frame semantics and provides a rich set of frames (conceptual structures describing an event,
relation, or object) along with their associated roles (participants in the frames) and lexical units (words that evoke
the frames).
3. VerbNet
VerbNet is a verb lexicon for English that organizes verbs into classes based on their syntactic and semantic
properties. Each class is associated with a set of syntactic frames and thematic roles.
Verb Classes: Groups of verbs with similar syntactic and semantic behavior.
Syntactic Frames: Descriptions of syntactic constructions in which verbs can appear.
Thematic Roles: Roles played by participants in the actions described by the verbs (e.g., Agent, Patient).
4. BabelNet
BabelNet is a multilingual lexicalized semantic network and ontology, integrating information from WordNet and
various other resources, including Wikipedia. It covers multiple languages and provides rich lexical and semantic
information.
5. OntoNotes
OntoNotes is a large, multilingual annotated corpus that includes various layers of annotation, such as syntax,
semantics, and coreference. It is useful for training and evaluating NLP models.
Applications in NLP
1. Word Sense Disambiguation (WSD)
Determining which sense of a word is used in a given context. Resources like WordNet are crucial for providing the
necessary sense inventories.
2. Machine Translation
Improving the accuracy of translations by understanding the meanings and usage of words in different languages,
aided by resources like BabelNet.
3. Information Retrieval
Enhancing search engines and information retrieval systems by understanding the relationships between words and
concepts.
4. Text Mining and Information Extraction
Extracting meaningful information from large corpora of text by utilizing the structured knowledge provided by
lexical databases.
5. Natural Language Understanding
Building systems that can comprehend and generate human language by leveraging the rich semantic information in
resources like FrameNet and VerbNet.
6. Semantic Search
Improving search results by understanding the intent behind queries and the relationships between concepts, using
resources like WordNet and OntoNotes.
Machine-readable dictionaries and lexical databases are indispensable in NLP, providing the foundation for various
tasks that require a deep understanding of language and meaning.
Transition Networks
A transition network represents grammatical patterns (such as noun phrases) as states connected by labeled arcs. Consider the following noun phrases:
a. Big white fluffy clouds.
b. Our bright children.
c. A large beautiful white flower.
d. Large green leaves.
e. Buildings.
f. Boston's best seafood restaurants.
RTN Example
Consider the sentence "The stone was dark black". Each word can belong to one or more lexical categories:
The: ART
stone: ADJ, NOUN
was: VERB
dark: ADJ
black: ADJ, NOUN
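The noun-phrase patterns above can be sketched as a transition table: an optional article or possessive, any number of adjectives, then one or more nouns. The state names and the word lists are illustrative assumptions:

```python
# Toy lexicon covering the example phrases (assumption)
LEX = {
    "a": "ART", "the": "ART", "our": "POSS", "boston's": "POSS",
    "big": "ADJ", "white": "ADJ", "fluffy": "ADJ", "large": "ADJ",
    "beautiful": "ADJ", "green": "ADJ", "bright": "ADJ", "best": "ADJ",
    "clouds": "NOUN", "children": "NOUN", "flower": "NOUN",
    "leaves": "NOUN", "buildings": "NOUN",
    "seafood": "NOUN", "restaurants": "NOUN",
}

# Transition table: state -> {word category: next state}
NETWORK = {
    "S0": {"ART": "S1", "POSS": "S1", "ADJ": "S1", "NOUN": "S2"},
    "S1": {"ADJ": "S1", "NOUN": "S2"},
    "S2": {"NOUN": "S2"},  # allows noun-noun compounds (seafood restaurants)
}
FINAL = {"S2"}  # accepting state: a noun has been seen

def accepts(phrase):
    """Walk the network; accept if we end in a final state."""
    state = "S0"
    for word in phrase.lower().split():
        cat = LEX.get(word)
        if cat is None or cat not in NETWORK[state]:
            return False
        state = NETWORK[state][cat]
    return state in FINAL

for p in ("Big white fluffy clouds", "Buildings", "Boston's best seafood restaurants"):
    print(p, "->", accepts(p))  # all three are accepted
```

A recursive transition network (RTN) extends this idea by letting an arc label name another network, so that, for example, a sentence network can call the noun-phrase network.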
1. What are regular expressions, and how are they used in word-level text
processing?
2. How can you use regular expressions to extract specific patterns from a
given text?
3. Explain the difference between greedy and non-greedy quantifiers in
regular expressions.
4. What is a finite-state automaton (FSA), and how is it used in
morphological analysis?
5. How do deterministic and non-deterministic finite-state automata differ?
6. How can FSAs be used to recognize regular languages in natural
language processing?
7. What is morphological parsing, and why is it important in natural
language processing?
8. How does a morphological parser handle inflectional and derivational
morphology?
9. Explain the role of lexicons and finite-state transducers in
morphological parsing.
10. What are the common techniques for detecting spelling errors in a text?
11. Describe the role of a dictionary in both detecting and correcting
spelling errors.
12. What are word classes, and how are they categorized in natural language
processing?
13. How do open and closed word classes differ, and what are some
examples of each?
14. Why is understanding word classes important for tasks like part-of-
speech tagging?
15. What is part-of-speech tagging, and how is it typically performed?
16. What is a context-free grammar, and how is it used to represent the
syntax of natural languages?
17. Explain the concept of derivation trees in the context of CFGs.
18. How does Chomsky Normal Form relate to context-free grammars?