
Bhilai Institute of Technology, Durg

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Natural Language Processing


UNIT 2: Word level and syntactic analysis

Prepared by: Dr. Shikha Pandey


UNIT – II (CO2)
Word level and syntactic analysis: Word Level Analysis: Regular Expressions,
Finite-State Automata, Morphological Parsing, Spelling Error Detection and
Correction, Words and Word Classes, Part-of-Speech Tagging. Syntactic
Analysis: Context-free Grammar, Constituency, Parsing, Probabilistic Parsing.
Machine-readable dictionaries and lexical databases, RTN, ATN [7 Hrs]

WORD LEVEL ANALYSIS IN NLP

Word-level analysis in Natural Language Processing (NLP) focuses on understanding and processing individual
words in a text. This type of analysis is fundamental and serves as the basis for more complex language processing
tasks. Here are key components and techniques involved in word-level analysis:

Word-Level Analysis

1. Tokenization: Splitting text into individual words or tokens.

2. Morphological Analysis: Studying the structure of words to understand their base forms and affixes.

1. Stemming: Reducing words to their root form (e.g., "running" to "run"). Examples: Porter Stemmer,
   Snowball Stemmer.
2. Lemmatization: Reducing words to their base or dictionary form, considering the context. Examples:
   WordNet Lemmatizer, spaCy's lemmatizer.
3. Part-of-Speech (POS) Tagging: Assigning parts of speech (e.g., noun, verb, adjective) to each word.
4. Named Entity Recognition (NER): Identifying and classifying proper nouns and entities in
   text (e.g., names, locations, organizations).
5. Word Sense Disambiguation (WSD): Determining the correct meaning of a word in context.
6. Stop-word Removal: Removing common words that do not carry significant meaning (e.g., "the", "is").

Here's a summary of the differences between lemmatization, stemming, and morphological analysis in NLP:

Stemming:
- Reduces words to their stem form
- Removes suffixes and other morphological endings
- Often uses simple rules or algorithms
- May not always produce a valid word

Lemmatization:
- Reduces words to their lemma or dictionary form
- Uses a dictionary or lexicon to determine the base form
- Produces a valid word
- More accurate than stemming

Key differences:

- Stemming and lemmatization focus on reducing words to a base form


- Morphological analysis examines the internal structure and meaning of words
- Stemming is a simpler approach, while lemmatization and morphological analysis are more sophisticated techniques

In summary, stemming is a basic technique for reducing words to a stem, lemmatization is a more accurate technique for
reducing words to their dictionary form, and morphological analysis is a more comprehensive technique for understanding the
internal structure and meaning of words.
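
To make the contrast concrete, here is a minimal sketch using NLTK (this example is an addition for illustration; it assumes the WordNet corpus has been fetched once via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Assumes the WordNet data has been downloaded once: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better"]:
    # Stemming: rule-based suffix stripping; may yield non-words ("studi")
    # Lemmatization: dictionary lookup; yields a valid word ("study")
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```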

Note: Practical examples are provided in a separate sheet.


Regular Expressions

Regular expressions (regex) are powerful tools used for matching patterns in text. They provide a concise and
flexible means for identifying strings of text that follow a particular structure or format. Here's a comprehensive
overview of regular expressions:

Basics of Regular Expressions

1. Literal Characters
   o Description: Regular characters in the pattern that match themselves.
   o Example: The regex cat matches the string "cat" exactly.
2. Metacharacters
   o Description: Special characters with unique meanings in regex.
   o Examples:
     - . (dot): Matches any single character except newline.
     - ^ (caret): Matches the start of a string.
     - $ (dollar sign): Matches the end of a string.
     - *, +, ?: Quantifiers that specify repetition (zero or more, one or more, zero or one, respectively).
3. Character Classes
   o Description: Matches any one of a set of characters.
   o Examples:
     - [abc]: Matches either 'a', 'b', or 'c'.
     - [a-z]: Matches any lowercase letter.
     - [^0-9]: Matches any character except digits.
4. Anchors
   o Description: Specify positions in the text.
   o Examples:
     - ^ at the beginning of a pattern anchors it to the start of a string.
     - $ at the end of a pattern anchors it to the end of a string.
5. Quantifiers
   o Description: Specify how many times a character or group can appear.
   o Examples:
     - *: Matches zero or more occurrences.
     - +: Matches one or more occurrences.
     - ?: Matches zero or one occurrence.
6. Grouping and Capturing
   o Description: Enclose parts of a pattern in parentheses to group them together or capture matched text.
   o Examples:
     - (abc)+: Matches "abc", "abcabc", etc.
     - (a|b): Matches either 'a' or 'b'.

Matching Email Addresses

```python
import re

pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
text = "Contact us at support@example.com for more information."
emails = re.findall(pattern, text)
print(emails)  # Output: ['support@example.com']
```

Extracting Phone Numbers

```python
pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
text = "Contact us at 123-456-7890 or 456.789.0123 for support."
phones = re.findall(pattern, text)
print(phones)  # Output: ['123-456-7890', '456.789.0123']
```

Tokenizing Text

```python
pattern = r'\b\w+\b'
text = "Tokenize this sentence into words."
tokens = re.findall(pattern, text)
print(tokens)  # Output: ['Tokenize', 'this', 'sentence', 'into', 'words']
```

Using Regular Expressions in Python

In Python, regular expressions are implemented using the re module:

- re.findall(pattern, text): Finds all occurrences of pattern in text.
- re.search(pattern, text): Searches for the first occurrence of pattern anywhere in text.
- re.match(pattern, text): Matches pattern only at the beginning of text.
- re.sub(pattern, replacement, text): Replaces occurrences of pattern with replacement in text.
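
A quick illustration of these four functions (a minimal sketch; the sample text is arbitrary):

```python
import re

text = "NLP makes text processing easier."
print(re.findall(r'\b\w+\b', text))        # all word tokens
print(re.search(r'text', text).group())    # first match anywhere -> 'text'
print(re.match(r'NLP', text) is not None)  # pattern at start of string -> True
print(re.sub(r'easier', 'simpler', text))  # replace matches
```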

Tips for Using Regular Expressions

- Compile Regex: Use re.compile(pattern) for efficiency if using the same pattern multiple times.
- Escape Special Characters: Use re.escape() for dynamic patterns where special characters need to be treated literally.
- Test Patterns: Use online regex testers (e.g., Regex101, RegExr) to visualize and debug patterns.

Regular expressions are versatile and indispensable tools for text processing tasks such as data validation, pattern
matching, and text manipulation in various programming languages, including Python.
Process of Morphological Parsing

How is morphological parsing performed? An example is given below.

Morphological analysis performs the following tasks:


- Analyzes the internal structure of words
- Identifies the root, prefixes, and suffixes
- Determines the relationships between morphemes (smallest units of meaning)
- Provides a more detailed understanding of word formation
Morphological parsing, also known as morphological analysis, is the process of analyzing the structure of words to
determine their morphemes — the smallest meaningful units of language. This process is essential in natural
language processing (NLP) for tasks such as stemming, lemmatization, and understanding the grammatical structure
of words.

Steps in Morphological Parsing

1. Tokenization: Splitting the text into individual words or tokens.
2. Morpheme Segmentation: Splitting each word into its constituent morphemes (stem and affixes).
3. Dictionary Lookup: Each segmented morpheme is looked up in a lexicon or dictionary to
   determine its meaning and grammatical properties.
4. Morphological Analysis: Based on the dictionary lookup, the system identifies the part-of-speech (POS)
   tags and other morphosyntactic features associated with each morpheme and with the entire word.

Example of Morphological Parsing

Let's consider the word "unhappiest" and perform morphological parsing on it:

Input Word:

 Word: unhappiest

Steps:

1. Tokenization:
   o Tokenize the word into individual morphemes:
     - Stem: "happy"
     - Affixes: "un" (prefix), "est" (suffix)
2. Morpheme Segmentation:
   o Stem: "happy"
   o Affixes:
     - Prefix: "un-"
     - Suffix: "-est"
3. Dictionary Lookup:
   o Look up each morpheme in a dictionary or lexicon:
     - Stem: "happy" (adjective, positive degree)
     - Prefix: "un-" (negative or reversing meaning)
     - Suffix: "-est" (superlative degree)
4. Morphological Analysis:
   o Assign appropriate POS tags and morphosyntactic features:
     - "unhappiest" is analyzed as an adjective in the superlative degree, indicating the most unhappy state.
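
The steps above can be sketched in code. The following toy segmenter is illustrative only: the affix tables, the tiny lexicon, and the parse function are invented for this example, and real morphological parsers use finite-state transducers with full lexicons.

```python
# Toy morpheme segmenter (illustrative, not a real morphological parser):
# strips a few known affixes and looks the stem up in a tiny lexicon.
PREFIXES = {"un": "negation/reversal"}
SUFFIXES = {"est": "superlative degree", "er": "comparative degree"}
LEXICON = {"happy": "adjective"}

def parse(word):
    analysis = {"word": word}
    for prefix, meaning in PREFIXES.items():
        if word.startswith(prefix):
            analysis["prefix"] = (prefix, meaning)
            word = word[len(prefix):]
    for suffix, meaning in SUFFIXES.items():
        if word.endswith(suffix):
            analysis["suffix"] = (suffix, meaning)
            word = word[:-len(suffix)]
            break
    # Undo the simple orthographic change: happi + est -> happy
    stem = word[:-1] + "y" if word.endswith("i") else word
    analysis["stem"] = (stem, LEXICON.get(stem, "unknown"))
    return analysis

print(parse("unhappiest"))
# {'word': 'unhappiest', 'prefix': ('un', 'negation/reversal'),
#  'suffix': ('est', 'superlative degree'), 'stem': ('happy', 'adjective')}
```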

Importance of Morphological Parsing

- Stemming and Lemmatization: Helps in reducing words to their base or dictionary forms, which is crucial
  for tasks like information retrieval and text mining.
- Grammatical Analysis: Provides insights into the syntactic and semantic structures of sentences, aiding in
  parsing and understanding natural language.
- Machine Translation: Ensures accurate translation by handling inflections and derivations in different
  languages.
- Text-to-Speech and Speech Recognition: Helps in generating correct pronunciations and understanding
  spoken language by breaking down words into manageable units.

SPELLING ERROR DETECTION AND CORRECTION

How does autocorrect work in NLP? Autocorrect first tries to find whether a given word is correct or not.

Spelling correction in NLP refers to detecting incorrect spellings and then correcting them.
There are many commercial as well as non-commercial spelling error detection and correction
tools available for almost all popular languages, and every tool works at the word level with an integral
dictionary/WordNet as the backend database for detection and correction. Every word in the text is
looked up in the speller lexicon. When a word is not in the dictionary, it is flagged as an error. To
correct the error, a spellchecker searches the dictionary/WordNet for the words that most resemble the
erroneous word. These words are then suggested to the user, who chooses the intended word.

A spell-checking technique comprises two stages:

i. Error detection, and
ii. Error correction.

TYPES OF SPELLING ERRORS

Various techniques have been designed on the basis of spelling-error trends, also called error patterns;
the most notable studies of these were performed by Damerau. According to these studies, spelling errors
are generally divided into two types: typographic errors and cognitive errors.

Typographic errors (Non-Word Errors): These errors occur when the correct spelling of the word is known but
the word is mistyped by mistake. These errors are mostly related to the keyboard and therefore do not follow any
linguistic criteria.

Cognitive errors (Real-Word Errors): These errors occur when the correct spelling of the word is not known.
In the case of cognitive errors, the pronunciation of the misspelled word is the same as or similar to the
pronunciation of the intended correct word.

ERROR DETECTION

There are two main approaches: n-gram analysis and dictionary/WordNet lookup.

N-gram analysis: N-gram analysis is a method for finding incorrectly spelled words in a mass of
text. Instead of comparing each entire word in a text to a dictionary, just the n-grams are checked. A check is done
using an n-dimensional matrix in which real n-gram frequencies are stored. If a non-existent or rare n-gram is
found, the word is flagged as a misspelling; otherwise it is not. An n-gram is a set of consecutive characters taken
from a string, with a length of whatever n is set to. This method is language-independent, as it requires no
knowledge of the language for which it is used. In this algorithm, each string involved in the comparison
is split into sets of adjacent n-grams. The similarity between two strings is obtained by counting the
unique n-grams that they share and then calculating a similarity coefficient: the number of
n-grams in common (intersection) divided by the total number of n-grams in the two words (union).
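
A minimal sketch of that similarity coefficient, using character bigrams (n = 2; the sample words are arbitrary):

```python
def ngrams(s, n=2):
    # set of unique character n-grams of the string
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=2):
    # shared n-grams (intersection) divided by all n-grams (union)
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(similarity("speling", "spelling"))  # ~0.86: likely the intended word
print(similarity("speling", "banana"))    # 0.0: unrelated
```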

Dictionary lookup: A dictionary/WordNet is a lexical resource that contains a list of the correct words
of a particular language. Non-word errors can easily be detected by checking each word against the
dictionary. The drawback of this method is the difficulty of keeping such a dictionary up to date and
sufficiently extensive to cover all the words in a text. A large-scale lexical resource is given by a linguistic
ontology that covers many words of a language and has a hierarchical structure based on the relationships
between concepts. WordNet is a widely used lexical resource; it covers nouns, verbs, adjectives, and adverbs.

ERROR CORRECTION

Error correction consists of two steps:

The generation of candidate corrections: The candidate generation process usually makes use of a
precompiled table of legal n-grams to locate one or more potential correction terms.

The ranking of candidate corrections: The ranking process usually invokes some lexical similarity measure
between the misspelled string and the candidates, or a probabilistic estimate of the likelihood of each
correction, to rank-order the candidates.

These two steps are usually treated as separate processes and executed in sequence. Some techniques
omit the second step, however, leaving the ranking and final selection to the user.
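
As a toy illustration of ranking by lexical similarity, Python's standard difflib can stand in for the similarity measure (the dictionary list is invented for this example):

```python
import difflib

DICTIONARY = ["spelling", "error", "correction", "example", "detection"]

def rank_candidates(word, n=3):
    # ranks dictionary words by a string-similarity ratio and returns
    # the top n candidates above difflib's default similarity cutoff
    return difflib.get_close_matches(word, DICTIONARY, n=n)

print(rank_candidates("speling"))  # ['spelling']
```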

A simple Python example using the `textblob` library for spelling correction:

```python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```

This code will output:
```
This is an example of spelling error detection and correction.
```

For more advanced applications, you can use libraries like `pyspellchecker` for dictionary-based corrections or
train machine learning models using frameworks like `transformers` to perform contextual corrections. Here's an
example using `pyspellchecker`:

```python
from spellchecker import SpellChecker

def correct_spelling(text):
    spell = SpellChecker()
    words = text.split()
    # spell.correction() can return None when no candidate is found,
    # so fall back to the original word in that case
    corrected_words = [spell.correction(word) or word for word in words]
    corrected_text = " ".join(corrected_words)
    return corrected_text

# Example usage
text = "This is an exmple of speling eror detection and corection."
corrected_text = correct_spelling(text)
print(corrected_text)
```

This code also corrects spelling errors and provides the corrected text.

WORDS AND WORD CLASSES


In Natural Language Processing (NLP), words and word classes (also known as parts of speech or POS) are
fundamental concepts used for understanding and processing text. Here's an overview:

Words

Words are the basic units of language and text. They can be further categorized into various types based on their
function and meaning.

Word Classes (Parts of Speech)

Words can be grouped into classes based on their grammatical roles in sentences. Here are the main parts of speech:

1. Nouns: Represent people, places, things, or ideas (e.g., "cat", "New York", "happiness").
2. Pronouns: Stand in for nouns (e.g., "he", "she", "it", "they").
3. Verbs: Describe actions, states, or occurrences (e.g., "run", "is", "seem").
4. Adjectives: Describe or modify nouns (e.g., "blue", "quick", "happy").
5. Adverbs: Modify verbs, adjectives, or other adverbs (e.g., "quickly", "very", "well").
6. Prepositions: Show relationships between nouns (or pronouns) and other words in a sentence (e.g., "in",
"on", "at").
7. Conjunctions: Connect words, phrases, or clauses (e.g., "and", "but", "because").
8. Determiners: Introduce nouns and specify them in some way (e.g., "the", "a", "this").
9. Interjections: Express emotions or reactions (e.g., "oh!", "wow!", "ouch!").

NLP Techniques for Handling Words and Word Classes

1. Tokenization: Splitting text into individual words or tokens.
   o Example: "I love NLP." → ["I", "love", "NLP", "."]
2. POS Tagging: Assigning parts of speech to each word in a sentence.
   o Example: "I love NLP." → [("I", "PRP"), ("love", "VBP"), ("NLP", "NNP"), (".", ".")]
3. Lemmatization: Reducing words to their base or root form.
   o Example: "running" → "run"
4. Stemming: Trimming words to their base form (less accurate than lemmatization).
   o Example: "running" → "run"
5. Named Entity Recognition (NER): Identifying and classifying named entities (like people, organizations, locations).
   o Example: "Apple Inc. is based in Cupertino." → [("Apple Inc.", "ORG"), ("Cupertino", "LOC")]
PARTS OF SPEECH (POS) TAGGING

One of the core tasks in Natural Language Processing (NLP) is Parts of Speech (PoS) tagging, which assigns
each word in a text a grammatical category, such as noun, verb, adjective, or adverb. By improving
comprehension of phrase structure and semantics, this technique makes it possible for machines to
analyze and comprehend human language more accurately.

Parts of Speech

- Noun (NN): Represents people, places, things, or concepts (e.g., "cat", "New York", "happiness").
- Verb (VB): Represents actions or states (e.g., "run", "is").
- Adjective (JJ): Describes or modifies nouns (e.g., "happy", "blue").
- Adverb (RB): Modifies verbs, adjectives, or other adverbs (e.g., "quickly", "very").
- Pronoun (PRP): Substitutes for nouns (e.g., "he", "they").
- Preposition (IN): Shows relationships between nouns (e.g., "on", "in").
- Conjunction (CC): Connects words, phrases, or clauses (e.g., "and", "but").
- Determiner (DT): Introduces nouns (e.g., "the", "a").
- Interjection (UH): Expresses emotion or exclamation (e.g., "oh", "wow").

Example of POS Tagging


Consider the sentence: "The quick brown fox jumps over the lazy dog."
After performing POS tagging:

- "The" is tagged as determiner (DT)
- "quick" is tagged as adjective (JJ)
- "brown" is tagged as adjective (JJ)
- "fox" is tagged as noun (NN)
- "jumps" is tagged as verb (VBZ)
- "over" is tagged as preposition (IN)
- "the" is tagged as determiner (DT)
- "lazy" is tagged as adjective (JJ)
- "dog" is tagged as noun (NN)

By offering insights into the grammatical structure, this tagging aids machines in comprehending not just
individual words but also the connections between them inside a phrase.
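
A minimal sketch of this with NLTK (assuming the punkt tokenizer and the averaged-perceptron tagger data have been downloaded; a trained tagger's exact output may differ slightly from the hand-tagged example above):

```python
import nltk
# One-time downloads assumed:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ...]
```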

Applications of POS Tagging

- Text-to-Speech Systems: Helps in correct pronunciation and intonation.
- Information Extraction: Facilitates extracting specific types of information from text.
- Named Entity Recognition (NER): Assists in identifying proper nouns and specific entities.
- Parsing and Syntax Analysis: Aids in constructing syntactic structures of sentences.

POS tagging is a fundamental component of many NLP applications and plays a crucial role in understanding and
processing human language.

Workflow of POS Tagging in NLP


The following are the steps in a typical natural language processing (NLP) part-of-speech (POS)
tagging workflow:

- Tokenization: Divide the input text into discrete tokens, usually words or subwords.
  Tokenization is the first stage in NLP tasks.
- Loading Language Models: To use a library such as NLTK or spaCy, load the relevant language
  model. These models, trained on large amounts of linguistic data, offer a foundation for
  comprehending a language's grammatical structure.
- Text Preprocessing: If required, preprocess the text to handle special characters, convert it to
  lowercase, or eliminate superfluous information. Clean text aids correct PoS labeling.
- Linguistic Analysis: Determine the text's grammatical structure. This entails understanding each
  word's purpose inside the sentence, including whether it is an adjective, verb, noun, or other class.
- Part-of-Speech Tagging: Apply the tagger so that each token is assigned its PoS tag on the basis
  of the linguistic analysis and the loaded model.
- Results Analysis: Verify the accuracy and consistency of the PoS tagging results against the
  source text. Identify and correct any possible problems or mistagging.
SYNTACTIC ANALYSIS IN NLP

Syntactic analysis, also known as parsing, is a fundamental task in Natural Language Processing (NLP) that involves
analyzing the grammatical structure of sentences to establish how words relate to each other. This process is crucial
for understanding the syntax (sentence structure) of natural language and is typically performed using syntactic
parsers. Here's a detailed explanation of syntactic analysis in NLP:

Syntactic analysis (parsing, or syntax analysis) checks the text for grammaticality against the rules of a
formal grammar. Drawing exact, or dictionary, meaning from the text is the job of the subsequent
semantic-analysis phase; for example, a phrase like "hot ice-cream" is syntactically well-formed but would
be rejected by a semantic analyzer.

In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in
natural language in conformity with the rules of a formal grammar. The word 'parsing' originates from the
Latin word 'pars', which means 'part'.

Goals of Syntactic Analysis

1. Structural Understanding: Identifying the hierarchical structure of sentences, including phrases and
clauses.
2. Dependency Relations: Establishing relationships between words to determine which words modify or
depend on others.
3. Grammatical Role Assignment: Assigning syntactic roles such as subject, object, predicate, etc., to words
within a sentence.
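
Since the syllabus covers context-free grammars, here is a minimal constituency-parsing sketch with NLTK; the toy grammar and sentence are invented for illustration:

```python
import nltk

# A tiny illustrative CFG; real grammars are far larger
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ RB | VBZ
DT -> 'the'
NN -> 'dog'
VBZ -> 'barks'
RB -> 'loudly'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog barks loudly".split()):
    print(tree)
# (S (NP (DT the) (NN dog)) (VP (VBZ barks) (RB loudly)))
```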

NLTK POS tag examples are given below:

Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list item marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun, plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our)
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb, base form (ask)
VBG verb, gerund (judging)
VBD verb, past tense (pleaded)
VBN verb, past participle (reunified)
VBP verb, present tense, not 3rd person singular (wrap)
VBZ verb, present tense, 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh-pronoun (who)
WRB wh-adverb (how)

The above list contains the NLTK POS tags. The NLTK POS tagger is used to assign grammatical
information to each word of a sentence; installing NLTK, importing it, and downloading the required
packages completes the setup for POS tagging.

Tools and Libraries for Syntactic Analysis

- Stanford Parser: Implements probabilistic context-free grammars for syntactic parsing.
- spaCy: Provides dependency parsing and syntactic analysis capabilities, along with POS tagging and
  named entity recognition.
- NLTK: Offers various parsers and tools for syntactic analysis, including CFG parsers and interfaces to
  external parsers.
- SyntaxNet (now part of TensorFlow): A deep learning-based framework for dependency parsing.

Applications of Syntactic Analysis

1. Machine Translation: Understanding the grammatical structure of sentences in both source and target
languages is crucial for accurate translation.
2. Information Retrieval: Parsing helps in extracting structured information from text databases based on
syntactic patterns.
3. Question Answering: Analyzing syntactic structures aids in understanding questions and finding
appropriate answers from text.
4. Grammar Checking: Identifying syntactic errors and suggesting corrections in written text.

Example of Syntactic Analysis

Consider the sentence: "She sings beautifully."

- Syntactic Analysis:
  o Identify the subject: "She"
  o Identify the verb phrase: "sings beautifully"
  o Assign syntactic roles: "She" as the subject, "sings" as the verb, and "beautifully" as an adverb
    modifying "sings".
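
A minimal dependency-parsing sketch of the same sentence with spaCy (assuming the small English model has been installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She sings beautifully.")
for token in doc:
    # word, coarse POS, dependency relation, and the word it depends on
    print(token.text, token.pos_, token.dep_, token.head.text)
# 'She' -> nsubj of 'sings'; 'beautifully' -> advmod of 'sings'
```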

Challenges in Syntactic Analysis

- Ambiguity: Sentences often have multiple valid syntactic interpretations.
- Parsing Complexity: Syntax can vary widely between languages and can be context-dependent.
- Performance: Achieving real-time parsing for large datasets can be computationally intensive.

In conclusion, syntactic analysis plays a critical role in NLP by enabling computers to understand the grammatical
structure of human language, facilitating a wide range of applications from information extraction to automated
translation and beyond.

MACHINE READABLE DICTIONARIES AND LEXICAL DATABASES

A machine-readable dictionary (MRD) is a dictionary stored as machine-readable data instead of being
printed on paper. It is an electronic dictionary and lexical database: a dictionary in an electronic form that
can be loaded into a database and queried via application software. Machine-readable dictionaries and
lexical databases are essential resources in natural language processing (NLP). They provide structured and
accessible linguistic information, which can be used for a variety of tasks, such as part-of-speech tagging,
named entity recognition, sentiment analysis, and machine translation. Here are some key examples and
explanations:

Examples include:

1. WordNet: A large lexical database of English developed at Princeton University. WordNet groups words
into sets of synonyms called synsets, provides short definitions and usage examples, and records various
semantic relations between these synonym sets.
2. Merriam-Webster's Dictionary API: Provides programmatic access to dictionary and thesaurus content
from Merriam-Webster, including definitions, synonyms, antonyms, and example sentences.
3. Oxford Dictionaries API: Offers access to Oxford's lexical data, including definitions, translations, and
usage examples across multiple languages.

Lexical Databases

Lexical databases are more comprehensive than dictionaries, often including not only word definitions but also
detailed information on morphological, syntactic, and semantic properties. Key examples include:

1. WordNet

WordNet is a large lexical database of English developed at Princeton University. It groups English words into sets
of synonyms called synsets, provides short definitions and usage examples, and records various semantic relations
between these synonym sets.

- Synsets: Groups of synonyms that share the same meaning.
- Semantic Relations: Includes hypernyms (superordinate terms), hyponyms (subordinate terms), meronyms
  (part-whole relationships), and antonyms (opposites).
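
WordNet can be queried directly through NLTK; a minimal sketch (assuming the WordNet corpus has been downloaded via nltk.download('wordnet')):

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("car")[:2]:
    print(synset.name(), "-", synset.definition())
    print("  lemmas:   ", [l.name() for l in synset.lemmas()])
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```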

2. FrameNet

FrameNet is based on frame semantics and provides a rich set of frames (conceptual structures describing an event,
relation, or object) along with their associated roles (participants in the frames) and lexical units (words that evoke
the frames).

- Frames: Conceptual structures representing events or situations.
- Roles: Participants or properties in the frame (e.g., "Buyer" and "Seller" in a "Commercial_transaction"
  frame).
- Lexical Units: Words or phrases that evoke specific frames.

3. VerbNet

VerbNet is a verb lexicon for English that organizes verbs into classes based on their syntactic and semantic
properties. Each class is associated with a set of syntactic frames and thematic roles.

- Verb Classes: Groups of verbs with similar syntactic and semantic behavior.
- Syntactic Frames: Descriptions of syntactic constructions in which verbs can appear.
- Thematic Roles: Roles played by participants in the actions described by the verbs (e.g., Agent, Patient).

4. BabelNet

BabelNet is a multilingual lexicalized semantic network and ontology, integrating information from WordNet and
various other resources, including Wikipedia. It covers multiple languages and provides rich lexical and semantic
information.

- Multilingual Lexicon: Covers words and their meanings in multiple languages.
- Semantic Network: Connects words through various semantic relations.
- Ontology: Provides structured knowledge representation.

5. OntoNotes

OntoNotes is a large, multilingual annotated corpus that includes various layers of annotation, such as syntax,
semantics, and coreference. It is useful for training and evaluating NLP models.

- Annotated Corpus: Texts annotated with various linguistic information.
- Syntax and Semantics: Includes both syntactic trees and semantic roles.
- Coreference: Information about which expressions refer to the same entity in a text.

Applications in NLP
1. Word Sense Disambiguation (WSD)
Determining which sense of a word is used in a given context. Resources like WordNet are crucial for providing the
necessary sense inventories.
2. Machine Translation
Improving the accuracy of translations by understanding the meanings and usage of words in different languages,
aided by resources like BabelNet.
3. Information Retrieval
Enhancing search engines and information retrieval systems by understanding the relationships between words and
concepts.
4. Text Mining and Information Extraction
Extracting meaningful information from large corpora of text by utilizing the structured knowledge provided by
lexical databases.
5. Natural Language Understanding
Building systems that can comprehend and generate human language by leveraging the rich semantic information in
resources like FrameNet and VerbNet.
6. Semantic Search
Improving search results by understanding the intent behind queries and the relationships between concepts, using
resources like WordNet and OntoNotes.
Machine-readable dictionaries and lexical databases are indispensable in NLP, providing the foundation for various
tasks that require a deep understanding of language and meaning.
Transition Networks

Transition networks are used to represent formal and natural language structures. They are formed using
directed graphs and finite-state automata. A transition network consists of a number of nodes and labeled
arcs. The nodes represent different states in traversing a sentence, and the arcs represent rules or test
conditions required to make the transition from one state to the next. A path through a transition network
corresponds to a permissible sequence of word types for a given grammar.

Example phrases:

a. Big white fluffy clouds.
b. Our bright children.
c. A large beautiful white flower.
d. Large green leaves.
e. Buildings.
f. Boston's best seafood restaurants.

Top-Down Versus Bottom-Up Parsing

A top-down parser starts from the start symbol (S) and works downward toward the words of the sentence;
a bottom-up parser starts from the words and works upward toward the start symbol.

Deterministic Versus Non-Deterministic Parsers

A deterministic parser permits only one choice for each word category, so each arc has a different test
condition. Non-deterministic parsers permit different arcs to be labelled with the same test.
Recursive Transition Networks (RTN)

An RTN permits arc labels to refer to other networks. In an RTN, one state is specified as the start state.
A string is accepted by an RTN if a POP arc is reached and all the input has been consumed.

RTN Example

Consider the sentence "The stone was dark black", with the following word categories:

The: ART
Stone: ADJ, NOUN
Was: VERB
Dark: ADJ
Black: ADJ, NOUN

Top-Down Parsing with RTN

The algorithm for parsing with RTNs represents a parse state as follows:

- current node: the node at which you are located in the network
- current position: a pointer to the next word to be parsed
- return points: a stack of nodes in other networks where you will continue if you pop from the
  current network

At each node, you can leave the current node and traverse an arc in the following cases (a code sketch
follows this list):

Case 1: IF the arc is a word category and the next word in the sentence is in that category, THEN
(1) update the current position to start at the next word; (2) update the current node to the destination
of the arc.
Case 2: IF the arc is a push arc to a network N, THEN (1) add the destination of the arc onto the return
points; (2) update the current node to the starting node of network N.
Case 3: IF the arc is a pop arc and the return-points list is not empty, THEN (1) remove the first return
point and make it the current node.
Case 4: IF the arc is a pop arc, the return-points list is empty, and there are no words left, THEN
(1) the parse completes successfully.
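
The four cases above can be turned into a small backtracking recognizer. The networks, lexicon, and function below are invented for illustration and encode the "The stone was dark black" example; a real RTN parser would also build a parse structure rather than just accept or reject.

```python
# Toy RTN recognizer (illustrative only). An arc is (label, destination):
# 'cat:X' tests a word category, 'push:N' calls network N, 'pop' returns.
NETWORKS = {
    "S":  {"s0": [("push:NP", "s1")],
           "s1": [("cat:VERB", "s2")],
           "s2": [("push:NP", "s3"), ("pop", None)],
           "s3": [("pop", None)]},
    "NP": {"n0": [("cat:ART", "n1"), ("cat:ADJ", "n1"), ("cat:NOUN", "n2")],
           "n1": [("cat:ADJ", "n1"), ("cat:NOUN", "n2")],
           "n2": [("pop", None)]},
}
LEXICON = {"the": {"ART"}, "stone": {"ADJ", "NOUN"}, "was": {"VERB"},
           "dark": {"ADJ"}, "black": {"ADJ", "NOUN"}}

def accept(words, net="S", state="s0", pos=0, stack=()):
    for label, dest in NETWORKS[net].get(state, []):
        if label == "pop":
            if stack:  # Case 3: resume in the network we pushed from
                ret_net, ret_state = stack[-1]
                if accept(words, ret_net, ret_state, pos, stack[:-1]):
                    return True
            elif pos == len(words):  # Case 4: success
                return True
        elif label.startswith("cat:"):  # Case 1: word-category arc
            if pos < len(words) and label[4:] in LEXICON.get(words[pos], set()):
                if accept(words, net, dest, pos + 1, stack):
                    return True
        elif label.startswith("push:"):  # Case 2: push arc to a sub-network
            sub = label[5:]
            start = "s0" if sub == "S" else "n0"
            if accept(words, sub, start, pos, stack + ((net, dest),)):
                return True
    return False  # no arc applies: backtrack

print(accept("the stone was dark black".split()))  # True
```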
UNIT 2 Questions

Here are some questions related to the topics covered in this unit:

1. What are regular expressions, and how are they used in word-level text
processing?
2. How can you use regular expressions to extract specific patterns from a
given text?
3. Explain the difference between greedy and non-greedy quantifiers in
regular expressions.
4. What is a finite-state automaton (FSA), and how is it used in
morphological analysis?
5. How do deterministic and non-deterministic finite-state automata differ?
6. How can FSAs be used to recognize regular languages in natural
language processing?
7. What is morphological parsing, and why is it important in natural
language processing?
8. How does a morphological parser handle inflectional and derivational
morphology?
9. Explain the role of lexicons and finite-state transducers in
morphological parsing.
10. What are the common techniques for detecting spelling errors in a text?
11. Describe the role of a dictionary in both detecting and correcting
spelling errors.
12. What are word classes, and how are they categorized in natural language
processing?
13. How do open and closed word classes differ, and what are some
examples of each?
14. Why is understanding word classes important for tasks like part-of-
speech tagging?
15. What is part-of-speech tagging, and how is it typically performed?
16. What is a context-free grammar, and how is it used to represent the
syntax of natural languages?
17. Explain the concept of derivation trees in the context of CFGs.
18. How does Chomsky Normal Form relate to context-free grammars?

19. What is parsing, and why is it a crucial step in syntactic analysis?


20. How does probabilistic parsing differ from deterministic parsing?
21. Explain the concept of a probabilistic context-free grammar (PCFG) and
its application in parsing.
Machine-Readable Dictionaries and Lexical Databases:
23. What is a machine-readable dictionary, and how is it utilized in
computational linguistics?
24. How do electronic dictionaries differ from traditional printed
dictionaries in terms of structure and use?
25. Discuss the role of machine-readable dictionaries in natural language
processing tasks like word sense disambiguation.
26. What is a lexical database, and how is it different from a traditional
dictionary?
27. Describe the structure and purpose of WordNet as a lexical database.
28. How do lexical databases contribute to semantic analysis in natural
language processing?
29. What are recursive transition networks, and how do they model the
syntax of natural languages?
30. How does an RTN differ from a finite-state automaton (FSA)?
31. Provide an example of how RTNs are used in natural language parsing.
32. What are augmented transition networks, and how do they extend the
capabilities of RTNs?
33. Explain how ATNs handle non-determinism and recursion in syntactic
analysis.
34. What advantages do ATNs offer over traditional CFGs in parsing
natural language?
