Unit 2
Word-level analysis in Natural Language Processing (NLP) focuses on understanding and processing individual
words in a text. This type of analysis is fundamental and serves as the basis for more complex language processing
tasks. Here are key components and techniques involved in word-level analysis:
Word-Level Analysis
1. Stemming: Reducing words to their root form (e.g., "running" to "run"). Examples: Porter Stemmer, Snowball Stemmer.
2. Lemmatization: Reducing words to their base or dictionary form, considering the context. Examples: WordNet Lemmatizer, spaCy's lemmatizer.
3. Part-of-Speech (POS) Tagging: Assigning parts of speech (e.g., noun, verb, adjective) to each word.
4. Named Entity Recognition (NER): Identifying and classifying proper nouns and entities in text (e.g., names, locations, organizations).
5. Word Sense Disambiguation (WSD): Determining the correct meaning of a word in context.
6. Stop-word Removal: Removing common words that do not carry significant meaning (e.g., "the", "is").
Here's a summary of the differences between lemmatization, stemming, and morphological analysis in NLP:
Stemming:
- Reduces words to their stem form
- Removes suffixes and other morphological endings
- Often uses simple rules or algorithms
- May not always produce a valid word
Lemmatization:
- Reduces words to their lemma or dictionary form
- Uses a dictionary or lexicon to determine the base form
- Produces a valid word
- More accurate than stemming
Morphological Analysis:
- Analyzes the internal structure of words (stems, prefixes, suffixes)
- Identifies morphemes and their grammatical properties
- More comprehensive than either stemming or lemmatization
In summary, stemming is a basic technique for reducing words to a stem, lemmatization is a more accurate technique for reducing words to their dictionary form, and morphological analysis is a more comprehensive technique for understanding the internal structure and meaning of words.
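The contrast above can be sketched with a toy suffix-stripping stemmer and a toy lemma dictionary. Both are illustrative assumptions, not real NLP tools; note how the crude stemmer produces non-words while the dictionary lookup always returns a valid form:

```python
def crude_stem(word):
    """Strip common suffixes by rule; may yield a non-word stem."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny stand-in for a real lexicon such as WordNet (toy assumption)
LEMMA_DICT = {
    "studies": "study",
    "running": "run",
    "better": "good",
}

def lemmatize(word):
    """Look the word up in a lexicon; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ("studies", "running"):
    # e.g. "studies" -> stem "stud" (not a word), lemma "study" (valid)
    print(w, "->", "stem:", crude_stem(w), "| lemma:", lemmatize(w))
```

A real stemmer such as the Porter Stemmer uses many more rules, but the trade-off shown here (speed and simplicity versus validity of the output) is the same.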
Regular expressions (regex) are powerful tools used for matching patterns in text. They provide a concise and
flexible means for identifying strings of text that follow a particular structure or format. Here's a comprehensive
overview of regular expressions:
1. Literal Characters
o Description: Regular characters in the pattern that match themselves.
o Example: The regex cat matches the string "cat" exactly.
2. Metacharacters
o Description: Special characters with unique meanings in regex.
o Examples:
. (dot): Matches any single character except newline.
^ (caret): Matches the start of a string.
$ (dollar sign): Matches the end of a string.
*, +, ?: Quantifiers that specify repetition (zero or more, one or more, zero or one
respectively).
3. Character Classes
o Description: Matches any one of a set of characters.
o Examples:
[abc]: Matches either 'a', 'b', or 'c'.
[a-z]: Matches any lowercase letter.
[^0-9]: Matches any character except digits.
4. Anchors
o Description: Specify positions in the text.
o Examples:
^ at the beginning of a pattern anchors it to the start of a string.
$ at the end of a pattern anchors it to the end of a string.
5. Quantifiers
o Description: Specify how many times a character or group can appear.
o Examples:
*: Matches zero or more occurrences.
+: Matches one or more occurrences.
?: Matches zero or one occurrence.
6. Grouping and Capturing
o Description: Enclose parts of a pattern in parentheses to group them together or capture matched
text.
o Examples:
(abc)+: Matches "abc", "abcabc", etc.
(a|b): Matches either 'a' or 'b'.
A common use is extracting email addresses. Note the top-level-domain class `[A-Za-z]{2,}`; writing it as `[A-Z|a-z]` would also (wrongly) match a literal `|`:

```python
import re
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
text = "Contact us at support@example.com for more information."
emails = re.findall(pattern, text)
print(emails)  # Output: ['support@example.com']
```
```python
pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
text = "Contact us at 123-456-7890 or 456.789.0123 for support."
phones = re.findall(pattern, text)
print(phones)  # Output: ['123-456-7890', '456.789.0123']
```
Tokenizing Text
```python
pattern = r'\b\w+\b'
text = "Tokenize this sentence into words."
tokens = re.findall(pattern, text)
print(tokens)  # Output: ['Tokenize', 'this', 'sentence', 'into', 'words']
```
- Compile Regex: Use re.compile(pattern) for efficiency if using the same pattern multiple times.
- Escape Special Characters: Use re.escape() for dynamic patterns where special characters need to be treated literally.
- Test Patterns: Use online regex testers (e.g., Regex101, RegExr) to visualize and debug patterns.
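A short sketch of the first two tips, using only the standard `re` module:

```python
import re

# Tip 1: precompile a pattern that will be reused many times
word_pattern = re.compile(r'\b\w+\b')
print(word_pattern.findall("Compile once, reuse many times."))
# Output: ['Compile', 'once', 'reuse', 'many', 'times']

# Tip 2: escape dynamic text so metacharacters match literally
user_input = "3.14"                 # '.' would otherwise match any char
safe = re.escape(user_input)        # becomes r'3\.14'
print(re.findall(safe, "pi is 3.14, not 3114"))  # Output: ['3.14']
```

Without `re.escape`, the pattern `3.14` would also match `3114`, since the unescaped dot matches any character.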
Regular expressions are versatile and indispensable tools for text processing tasks such as data validation, pattern
matching, and text manipulation in various programming languages, including Python.
Process of Morphological Parsing (Word-Level Analysis in NLP)
Morphological parsing breaks a word into its component morphemes and determines their grammatical properties. The process involves:
1. Tokenization and Morpheme Segmentation: The word is split into its stem and affixes.
2. Dictionary Lookup: Each segmented morpheme is looked up in a lexicon or dictionary to determine its meaning and grammatical properties.
3. Morphological Analysis: Based on the dictionary lookup, the system identifies the part-of-speech (POS) tags and other morphosyntactic features associated with each morpheme and the entire word.
Let's consider the word "unhappiest" and perform morphological parsing on it:
Input Word:
Word: unhappiest
Steps:
1. Tokenization:
o Tokenize the word into individual morphemes:
Stem: "happy"
Affixes: "un" (prefix), "est" (suffix)
2. Morpheme Segmentation:
o Stem: "happy"
o Affixes:
Prefix: "un-"
Suffix: "-est"
3. Dictionary Lookup:
o Look up each morpheme in a dictionary or lexicon:
Stem: "happy" (adjective, positive degree)
Prefix: "un-" (negative or reversing meaning)
Suffix: "-est" (superlative degree)
4. Morphological Analysis:
o Assign appropriate POS tags and morphosyntactic features:
"unhappiest" is analyzed as an adjective in superlative degree, indicating the most unhappy
state.
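The steps above can be sketched as a toy morphological parser. The affix and stem tables below are illustrative assumptions, not a real lexicon, and the y/i spelling repair is a single hard-coded rule:

```python
# Toy lexicon (assumptions for this example only)
PREFIXES = {"un": "negation"}
SUFFIXES = {"est": "superlative", "er": "comparative"}
STEMS = {"happy": "adjective"}

def parse(word):
    """Segment a word into prefix, stem, and suffix, then look each up."""
    analysis = {"word": word}
    for p, meaning in PREFIXES.items():          # strip known prefixes
        if word.startswith(p):
            analysis["prefix"] = (p, meaning)
            word = word[len(p):]
    for s, meaning in SUFFIXES.items():          # strip known suffixes
        if word.endswith(s):
            analysis["suffix"] = (s, meaning)
            word = word[: -len(s)]
    if word.endswith("i"):                       # orthographic repair: happi -> happy
        word = word[:-1] + "y"
    if word in STEMS:                            # dictionary lookup of the stem
        analysis["stem"] = (word, STEMS[word])
    return analysis

print(parse("unhappiest"))
```

Running this on "unhappiest" recovers the prefix "un-" (negation), the stem "happy" (adjective), and the suffix "-est" (superlative), mirroring the four steps above.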
Stemming and Lemmatization: Helps in reducing words to their base or dictionary forms, which is crucial
for tasks like information retrieval and text mining.
Grammatical Analysis: Provides insights into the syntactic and semantic structures of sentences, aiding in
parsing and understanding natural language.
Machine Translation: Ensures accurate translation by handling inflections and derivations in different
languages.
Text-to-Speech and Speech Recognition: Helps in generating correct pronunciations and understanding
spoken language by breaking down words into manageable units.
How does autocorrect work in NLP? Autocorrect first checks whether a given word is correct or not.
Spelling correction in NLP refers to detecting incorrect spellings and then correcting them.
There are many commercial and non-commercial spelling error detection and correction tools available for almost all popular languages. Virtually every tool works at the word level, with an integral dictionary or WordNet as the backend database for detection and correction. Every word in the text is looked up in the speller lexicon; when a word is not in the dictionary, it is flagged as an error. To correct the error, a spellchecker searches the dictionary or WordNet for the words most similar to the erroneous word. These words are then suggested to the user, who chooses the intended word.
TYPES OF SPELL ERRORS
Spelling-correction techniques are designed around observed spelling-error trends, also called error patterns; the most notable studies of these were performed by Damerau. According to these studies, spelling errors are generally divided into two types: typographic errors and cognitive errors.
Typographic errors (non-word errors): These occur when the correct spelling of the word is known but the word is mistyped by mistake. They are mostly related to the keyboard and therefore do not follow any linguistic criteria.
Cognitive errors (real-word errors): These occur when the correct spelling of the word is not known. In the case of cognitive errors, the pronunciation of the misspelled word is the same as or similar to the pronunciation of the intended correct word.
ERROR DETECTION
N-gram Analysis: N-gram analysis is a method for finding incorrectly spelled words in a mass of text. Instead of comparing each entire word in a text to a dictionary, just n-grams are checked, using an n-dimensional matrix in which real n-gram frequencies are stored. If a non-existent or rare n-gram is found, the word is flagged as a misspelling; otherwise it is not. An n-gram is a sequence of n consecutive characters taken from a string. This method is language independent, as it requires no knowledge of the language for which it is used. In this algorithm, each string involved in the comparison is split into sets of adjacent n-grams. The similarity between two strings is obtained by finding the number of unique n-grams they share and then calculating a similarity coefficient: the number of n-grams in common (intersection) divided by the total number of n-grams in the two words (union).
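The similarity coefficient just described can be sketched as a Jaccard-style ratio over character bigrams (the example words are illustrative):

```python
def ngrams(word, n=2):
    """Set of unique n-grams (consecutive character sequences) in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(a, b, n=2):
    """Shared unique n-grams divided by the union of n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# "necessary" and its common misspelling share 8 of 9 unique bigrams
print(round(similarity("necessary", "neccessary"), 2))  # Output: 0.89
```

A high coefficient between a flagged string and a dictionary word makes that word a good correction candidate.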
Dictionary Lookup: A dictionary or WordNet is a lexical resource that contains a list of the correct words of a particular language. Non-word errors can be easily detected by checking each word against such a dictionary. The drawback of this method is the difficulty of keeping the dictionary up to date and sufficiently extensive to cover all the words in a text. A large-scale lexical resource is given by a linguistic ontology, which covers many words of a language and has a hierarchical structure based on the relationships between concepts. WordNet is a widely used lexical resource; it covers nouns, verbs, adjectives, and adverbs.
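A minimal sketch of dictionary-lookup detection, assuming a toy lexicon in place of a full dictionary or WordNet:

```python
# Toy lexicon standing in for a real dictionary (assumption)
LEXICON = {"the", "cat", "sat", "on", "mat"}

def detect_errors(text):
    """Flag every token that is absent from the lexicon as a non-word error."""
    return [w for w in text.lower().split() if w not in LEXICON]

print(detect_errors("the cat zat on the mat"))  # Output: ['zat']
```

Note that real-word (cognitive) errors pass this check unnoticed, since e.g. "their" typed for "there" is still a dictionary word; that is exactly the limitation described above.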
ERROR CORRECTION
Error correction typically involves two steps:
1. Generation of candidate corrections: Retrieving from the dictionary the words that could be the intended spelling of the erroneous string.
2. Ranking of candidate corrections: The ranking process usually invokes some lexical similarity measure between the misspelled string and the candidates, or a probabilistic estimate of the likelihood of each correction, to rank-order the candidates.
These two steps are usually treated as separate processes executed in sequence. Some techniques omit the second step, however, leaving the ranking and final selection to the user.
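One common lexical similarity measure for ranking candidates is Levenshtein edit distance; here is a sketch, with an assumed toy candidate list:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def rank_candidates(misspelled, candidates):
    """Order candidate corrections by increasing edit distance."""
    return sorted(candidates, key=lambda c: edit_distance(misspelled, c))

print(rank_candidates("speling", ["splint", "spelling", "spieling"]))
```

The candidate closest in edit distance ("spelling", one insertion away) is ranked first; a probabilistic spellchecker would additionally weight candidates by their corpus frequency.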
Here is a simple Python example using the `textblob` library for spelling correction:
```python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```
This code will output:
```
This is an example of spelling error detection and correction.
```
For more advanced applications, you can use libraries like `pyspellchecker` for dictionary-based corrections or train machine learning models using frameworks like `transformers` to perform contextual corrections. Here's an example using `pyspellchecker`:
```python
from spellchecker import SpellChecker

def correct_spelling(text):
    spell = SpellChecker()
    words = text.split()  # naive split: punctuation stays attached to words
    # correction() may return None when no candidate is found,
    # so fall back to the original word in that case
    corrected_words = [spell.correction(word) or word for word in words]
    corrected_text = " ".join(corrected_words)
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```
This code also corrects spelling errors and provides the corrected text.
Words
Words are the basic units of language and text. They can be further categorized into various types based on their
function and meaning.
Words can be grouped into classes based on their grammatical roles in sentences. Here are the main parts of speech:
1. Nouns: Represent people, places, things, or ideas (e.g., "cat", "New York", "happiness").
2. Pronouns: Stand in for nouns (e.g., "he", "she", "it", "they").
3. Verbs: Describe actions, states, or occurrences (e.g., "run", "is", "seem").
4. Adjectives: Describe or modify nouns (e.g., "blue", "quick", "happy").
5. Adverbs: Modify verbs, adjectives, or other adverbs (e.g., "quickly", "very", "well").
6. Prepositions: Show relationships between nouns (or pronouns) and other words in a sentence (e.g., "in",
"on", "at").
7. Conjunctions: Connect words, phrases, or clauses (e.g., "and", "but", "because").
8. Determiners: Introduce nouns and specify them in some way (e.g., "the", "a", "this").
9. Interjections: Express emotions or reactions (e.g., "oh!", "wow!", "ouch!").
One of the core tasks in Natural Language Processing (NLP) is Part-of-Speech (PoS) tagging: assigning each word in a text a grammatical category, such as noun, verb, adjective, or adverb. By improving the comprehension of phrase structure and semantics, this technique enables machines to analyze and understand human language more accurately.
Parts of Speech
Noun (NN): Represents people, places, things, or concepts (e.g., "cat", "New York", "happiness").
Verb (VB): Represents actions or states (e.g., "run", "is").
Adjective (JJ): Describes or modifies nouns (e.g., "happy", "blue").
Adverb (RB): Modifies verbs, adjectives, or other adverbs (e.g., "quickly", "very").
Pronoun (PRP): Substitutes for nouns (e.g., "he", "they").
Preposition (IN): Shows relationships between nouns (e.g., "on", "in").
Conjunction (CC): Connects words, phrases, or clauses (e.g., "and", "but").
Determiner (DT): Introduces nouns (e.g., "the", "a").
Interjection (UH): Expresses emotion or exclamation (e.g., "oh", "wow").
POS tagging is a fundamental component of many NLP applications and plays a crucial role in understanding and
processing human language.
Syntactic analysis, also known as parsing, is a fundamental task in Natural Language Processing (NLP) that involves
analyzing the grammatical structure of sentences to establish how words relate to each other. This process is crucial
for understanding the syntax (sentence structure) of natural language and is typically performed using syntactic
parsers. Here's a detailed explanation of syntactic analysis in NLP:
Syntactic analysis, parsing, or syntax analysis is the third phase of NLP. Its purpose is to check that the text is well formed with respect to the rules of formal grammar, as a step toward drawing exact meaning from it. Note that a phrase like "hot ice-cream" is syntactically well formed but would be rejected by the semantic analyzer as meaningless.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of formal grammar. The word 'parsing' originates from the Latin word 'pars', which means 'part'.
1. Structural Understanding: Identifying the hierarchical structure of sentences, including phrases and
clauses.
2. Dependency Relations: Establishing relationships between words to determine which words modify or
depend on others.
3. Grammatical Role Assignment: Assigning syntactic roles such as subject, object, predicate, etc., to words
within a sentence.
Abbreviation  Meaning
CC    coordinating conjunction
CD    cardinal digit
DT    determiner
EX    existential there
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective (large)
JJR   adjective, comparative (larger)
JJS   adjective, superlative (largest)
LS    list marker
MD    modal (could, will)
NN    noun, singular (cat, tree)
NNS   noun, plural (desks)
NNP   proper noun, singular (Sarah)
NNPS  proper noun, plural (Indians or Americans)
PDT   predeterminer (all, both, half)
POS   possessive ending (parent's)
PRP   personal pronoun (hers, herself, him, himself)
PRP$  possessive pronoun (her, his, mine, my, our)
RB    adverb (occasionally, swiftly)
RBR   adverb, comparative (greater)
RBS   adverb, superlative (biggest)
RP    particle (about)
TO    infinitive marker (to)
UH    interjection (goodbye)
VB    verb, base form (ask)
VBD   verb, past tense (pleaded)
VBG   verb, gerund (judging)
VBN   verb, past participle (reunified)
VBP   verb, present tense, not 3rd person singular (wrap)
VBZ   verb, present tense, 3rd person singular (bases)
WDT   wh-determiner (that, what)
WP    wh-pronoun (who)
WRB   wh-adverb (how)
The above list contains all the NLTK POS tags. The NLTK POS tagger is used to assign grammatical information to each word of a sentence; using it requires installing NLTK, importing it, and downloading the required tagger packages.
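Real tagging would use `nltk.pos_tag` after downloading the tagger models. As a dependency-free illustration, the sketch below tags words by lookup in a toy table using the Penn Treebank abbreviations listed above; the table and the unknown-word default of NN are assumptions for this example:

```python
# Toy tag table (assumption); a real tagger learns this from corpora
TAG_TABLE = {
    "the": "DT", "a": "DT",
    "cat": "NN", "tree": "NN",
    "sat": "VBD", "sings": "VBZ",
    "on": "IN",
    "quickly": "RB",
}

def pos_tag(tokens):
    """Tag each token by lookup; unknown words default to NN (noun)."""
    return [(t, TAG_TABLE.get(t.lower(), "NN")) for t in tokens]

print(pos_tag("The cat sat on the tree".split()))
# Output: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
#          ('on', 'IN'), ('the', 'DT'), ('tree', 'NN')]
```

Lookup tagging ignores context, so ambiguous words (e.g. "book" as noun or verb) get a fixed tag; statistical taggers like NLTK's resolve this from surrounding words.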
1. Machine Translation: Understanding the grammatical structure of sentences in both source and target
languages is crucial for accurate translation.
2. Information Retrieval: Parsing helps in extracting structured information from text databases based on
syntactic patterns.
3. Question Answering: Analyzing syntactic structures aids in understanding questions and finding
appropriate answers from text.
4. Grammar Checking: Identifying syntactic errors and suggesting corrections in written text.
Syntactic Analysis Example: consider the sentence "She sings beautifully".
o Identify the subject: "She"
o Identify the verb phrase: "sings beautifully"
o Assign syntactic roles: "She" as the subject, "sings" as the verb, and "beautifully" as an adverb modifying "sings".
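This role assignment can be sketched for the one sentence pattern above. The lexicon and the fixed Subject-Verb-Adverb template are toy assumptions; real parsers (e.g. spaCy or a CFG parser) derive the structure from full grammars or statistical models:

```python
# Toy lexicon mapping words to POS tags (assumption for this example)
LEXICON = {"she": "PRP", "sings": "VBZ", "beautifully": "RB"}

def analyze(sentence):
    """Assign syntactic roles when the sentence fits Pronoun-Verb-Adverb."""
    tokens = sentence.split()
    tags = [LEXICON[t.lower()] for t in tokens]
    if tags == ["PRP", "VBZ", "RB"]:
        return {"subject": tokens[0],
                "verb": tokens[1],
                "modifier": (tokens[2], "modifies", tokens[1])}
    return None  # pattern not recognized by this toy template

print(analyze("She sings beautifully"))
# Output: {'subject': 'She', 'verb': 'sings',
#          'modifier': ('beautifully', 'modifies', 'sings')}
```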
In conclusion, syntactic analysis plays a critical role in NLP by enabling computers to understand the grammatical
structure of human language, facilitating a wide range of applications from information extraction to automated
translation and beyond.
Machine-readable dictionaries provide dictionary content in a structured form that programs can query. Examples include:
1. WordNet: A large lexical database of English developed at Princeton University. WordNet groups words
into sets of synonyms called synsets, provides short definitions and usage examples, and records various
semantic relations between these synonym sets.
2. Merriam-Webster's Dictionary API: Provides programmatic access to dictionary and thesaurus content
from Merriam-Webster, including definitions, synonyms, antonyms, and example sentences.
3. Oxford Dictionaries API: Offers access to Oxford's lexical data, including definitions, translations, and
usage examples across multiple languages.
Lexical Databases
Lexical databases are more comprehensive than dictionaries, often including not only word definitions but also
detailed information on morphological, syntactic, and semantic properties. Key examples include:
1. WordNet
WordNet is a large lexical database of English developed at Princeton University. It groups English words into sets
of synonyms called synsets, provides short definitions and usage examples, and records various semantic relations
between these synonym sets.
2. FrameNet
FrameNet is based on frame semantics and provides a rich set of frames (conceptual structures describing an event,
relation, or object) along with their associated roles (participants in the frames) and lexical units (words that evoke
the frames).
3. VerbNet
VerbNet is a verb lexicon for English that organizes verbs into classes based on their syntactic and semantic
properties. Each class is associated with a set of syntactic frames and thematic roles.
Verb Classes: Groups of verbs with similar syntactic and semantic behavior.
Syntactic Frames: Descriptions of syntactic constructions in which verbs can appear.
Thematic Roles: Roles played by participants in the actions described by the verbs (e.g., Agent, Patient).
4. BabelNet
BabelNet is a multilingual lexicalized semantic network and ontology, integrating information from WordNet and
various other resources, including Wikipedia. It covers multiple languages and provides rich lexical and semantic
information.
5. OntoNotes
OntoNotes is a large, multilingual annotated corpus that includes various layers of annotation, such as syntax,
semantics, and coreference. It is useful for training and evaluating NLP models.
Applications in NLP
1. Word Sense Disambiguation (WSD)
Determining which sense of a word is used in a given context. Resources like WordNet are crucial for providing the
necessary sense inventories.
2. Machine Translation
Improving the accuracy of translations by understanding the meanings and usage of words in different languages,
aided by resources like BabelNet.
3. Information Retrieval
Enhancing search engines and information retrieval systems by understanding the relationships between words and
concepts.
4. Text Mining and Information Extraction
Extracting meaningful information from large corpora of text by utilizing the structured knowledge provided by
lexical databases.
5. Natural Language Understanding
Building systems that can comprehend and generate human language by leveraging the rich semantic information in
resources like FrameNet and VerbNet.
6. Semantic Search
Improving search results by understanding the intent behind queries and the relationships between concepts, using
resources like WordNet and OntoNotes.
Machine-readable dictionaries and lexical databases are indispensable in NLP, providing the foundation for various
tasks that require a deep understanding of language and meaning.
Transition Networks
A transition network represents grammatical patterns (such as noun phrases) as states connected by labeled arcs. Consider the following noun phrases:
a. Big white fluffy clouds.
b. Our bright children.
c. A large beautiful white flower.
d. Large green leaves.
e. Buildings.
f. Boston's best seafood restaurants.
RTN Example
Consider the sentence "The stone was dark black". Each word can belong to one or more lexical categories:
The: ART
stone: ADJ, NOUN
was: VERB
dark: ADJ
black: ADJ, NOUN
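The noun-phrase patterns above can be sketched as a transition table: an optional article or possessive, any number of adjectives, then one or more nouns. The state names and the word lists are illustrative assumptions:

```python
# Toy lexicon covering the example phrases (assumption)
LEX = {
    "a": "ART", "the": "ART", "our": "POSS", "boston's": "POSS",
    "big": "ADJ", "white": "ADJ", "fluffy": "ADJ", "large": "ADJ",
    "beautiful": "ADJ", "green": "ADJ", "bright": "ADJ", "best": "ADJ",
    "clouds": "NOUN", "children": "NOUN", "flower": "NOUN",
    "leaves": "NOUN", "buildings": "NOUN",
    "seafood": "NOUN", "restaurants": "NOUN",
}

# Transition table: state -> {word category: next state}
NETWORK = {
    "S0": {"ART": "S1", "POSS": "S1", "ADJ": "S1", "NOUN": "S2"},
    "S1": {"ADJ": "S1", "NOUN": "S2"},
    "S2": {"NOUN": "S2"},  # allows noun-noun compounds (seafood restaurants)
}
FINAL = {"S2"}  # accepting state: a noun has been seen

def accepts(phrase):
    """Walk the network; accept if we end in a final state."""
    state = "S0"
    for word in phrase.lower().split():
        cat = LEX.get(word)
        if cat is None or cat not in NETWORK[state]:
            return False
        state = NETWORK[state][cat]
    return state in FINAL

for p in ("Big white fluffy clouds", "Buildings", "Boston's best seafood restaurants"):
    print(p, "->", accepts(p))  # all three are accepted
```

A recursive transition network (RTN) extends this idea by letting an arc label name another network, so that, for example, a sentence network can call the noun-phrase network.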
1. What are regular expressions, and how are they used in word-level text
processing?
2. How can you use regular expressions to extract specific patterns from a
given text?
3. Explain the difference between greedy and non-greedy quantifiers in
regular expressions.
4. What is a finite-state automaton (FSA), and how is it used in
morphological analysis?
5. How do deterministic and non-deterministic finite-state automata differ?
6. How can FSAs be used to recognize regular languages in natural
language processing?
7. What is morphological parsing, and why is it important in natural
language processing?
8. How does a morphological parser handle inflectional and derivational
morphology?
9. Explain the role of lexicons and finite-state transducers in
morphological parsing.
10. What are the common techniques for detecting spelling errors in a text?
11. Describe the role of a dictionary in both detecting and correcting
spelling errors.
12. What are word classes, and how are they categorized in natural language
processing?
13. How do open and closed word classes differ, and what are some
examples of each?
14. Why is understanding word classes important for tasks like part-of-
speech tagging?
15. What is part-of-speech tagging, and how is it typically performed?
16. What is a context-free grammar, and how is it used to represent the
syntax of natural languages?
17. Explain the concept of derivation trees in the context of CFGs.
18. How does Chomsky Normal Form relate to context-free grammars?