NLP Practice Problems (2)
What are grammar-based language models?
Grammar-based language models are designed to generate natural language text that
adheres to specific grammatical rules and structures. Their primary purpose is to produce
coherent and grammatically correct sentences or paragraphs.
What do you understand by statistical language modelling?
Statistical language modelling is the process of using statistical methods to predict the
probability of sequences of words in a language, helping to determine the likelihood of a
given sentence or phrase based on observed patterns in a large text corpus.
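To make this concrete, the chain rule decomposes a sentence probability as:
P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1, ..., wn-1)
and an n-gram model approximates each factor using only the previous n - 1 words (for a bigram model, P(wn | wn-1)).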
What is regular grammar?
Regular grammar is a set of production rules that generate strings of symbols from a finite
set of terminals and non-terminals, following specific patterns and structures, used to
describe regular languages.
State the difference between deterministic finite automata and non-deterministic finite
automata.
The main difference between Deterministic Finite Automata (DFA) and Non-Deterministic Finite Automata (NFA) lies in how they handle transitions:
DFA:
1. For each state and input symbol, there is exactly one possible next state.
2. No ambiguity exists in the transitions; the machine follows a single, unique path for any given input string.
NFA:
1. For each state and input symbol, there can be multiple possible next states or no next state at all.
2. The machine can follow multiple paths simultaneously, and an input string is accepted if at least one path leads to an accepting state.
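The single-path behaviour of a DFA is easy to see in code. Here is a minimal Python sketch (the example DFA and function name are ours; this machine accepts strings over {a, b} of even length):

def dfa_accepts(s):
    # states: 0 = even length so far (accepting), 1 = odd length so far
    transitions = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}
    state = 0
    for ch in s:
        state = transitions[(state, ch)]  # exactly one next state per (state, symbol)
    return state == 0

print(dfa_accepts("abab"))  # True
print(dfa_accepts("aba"))   # False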
What is morphological parsing?
In morphological parsing, the structure of a word is analysed to identify its base form
(root or stem) and its morphological components, such as prefixes, suffixes, and
inflectional endings. This process helps to determine the word's meaning, grammatical
features (like tense, number, or case), and how it relates to other words in a sentence.
What are stems?
Stems are the core part of a word that carries its primary meaning, to which affixes (such
as prefixes or suffixes) can be added. In morphological analysis, the stem is what remains
after removing any inflectional or derivational affixes. For example, in the word
"unhappily," the stem is "happy."
What are affixes?
Affixes are morphemes that are attached to a stem or root word to modify its meaning or
create a new word. There are three main types of affixes:
Prefixes: Attached to the beginning of a word (e.g., "un-" in "unhappy").
Suffixes: Attached to the end of a word (e.g., "-ness" in "happiness").
Infixes: Inserted within a word, though infixes are rare in English.
Affixes can change the grammatical category, meaning, or tense of the original word.
What is lexicon?
A lexicon is the vocabulary of a language: a structured collection of words (or morphemes) together with linguistic information about them, such as part of speech, meaning, and pronunciation.
What is minimum edit distance?
The minimum edit distance is the smallest number of operations (insertions, deletions, or substitutions) required to transform one string into another. It measures how dissimilar two strings are, with the distance representing the minimum number of changes needed to make them identical.
What are similarity key techniques?
Similarity key techniques map each string to a key (for example, a Soundex code) such that similar strings map to the same or similar keys; candidate corrections for a misspelled word can then be retrieved by computing its key. More broadly, the similarity of strings or documents can be measured using metrics like edit distance, cosine similarity, and the Jaccard index.
What do you understand by n-gram based techniques?
N-gram based techniques are statistical methods used to analyse and model sequences of
items, such as words in text or characters in DNA sequences. An n-gram is a sequence of n
consecutive items. For example, a bigram (2-gram) is a sequence of two consecutive words,
and a trigram (3-gram) is a sequence of three consecutive words.
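In essence, n-grams can be unigrams, bigrams, trigrams, etc., depending on the value of 'n'. As a small illustration (a minimal sketch; the function name and sentence are ours), n-grams can be extracted from a tokenized sentence in a few lines of Python:

def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I want Chinese food".split()
print(ngrams(tokens, 2))  # bigrams: [('I', 'want'), ('want', 'Chinese'), ('Chinese', 'food')]
print(ngrams(tokens, 3))  # trigrams: [('I', 'want', 'Chinese'), ('want', 'Chinese', 'food')]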
What are rule-based techniques?
Rule-based techniques are a class of methods in natural language processing (NLP) that rely
on predefined rules or patterns to analyse and process text. These rules are typically hand-
crafted by experts, and they specify how different linguistic elements (such as words,
phrases, or sentences) should be combined or interpreted.
What is part-of-speech tagging?
Part-of-speech tagging is the process of assigning a grammatical category (or part of speech)
to each word in a sentence. These categories can include nouns, verbs, adjectives, adverbs,
prepositions, conjunctions, and more. The goal of part-of-speech tagging is to provide a
deeper understanding of the grammatical structure of a sentence.
Write a regular expression to represent a set of all strings over {a, b} of even length.
(aa|ab|ba|bb)*
Write a regular expression to represent a set of all strings over {a, b} of length 4. starting
with an a.
a(a|b)(a|b)(a|b)
Write a regular expression to represent a set of all strings over {a, b} containing at least one
a.
(b* a b*)+
Write a regular expression to represent a set of strings over {a, b} having exactly 3b’s.
a*ba*ba*ba*
Write a regular expression to represent a set of all strings over {a, b} having “abab” as a
substring.
(a|b)*abab(a|b)*
Write a regular expression to represent a set of all strings containing exactly 2a’s.
A regular expression for strings over the alphabet {a,b} containing exactly two 'a's is:
b*ab*ab*
This expression ensures that there are exactly two 'a's in the string, with any number of 'b's
before, between, and after them.
Write a regular expression to represent a set of all strings containing at least 2a’s.
A regular expression for strings over the alphabet {a,b} containing at least two 'a's is:
(a|b)*a(a|b)*a(a|b)*
This expression matches any string with at least two ‘a’s, with any combination of characters
(including ‘a’ and ‘b’) before, between, and after them.
Write a regular expression to represent a set of all strings containing at most 2a’s.
A regular expression for strings over the alphabet {a,b} containing at most two ‘a’s is:
b*a?b*a?b*
This expression allows for zero, one, or two occurrences of 'a', each potentially surrounded by any number of 'b's.
Write a regular expression to represent a set of all strings containing the substring aa.
A regular expression for strings over the alphabet {a,b} containing the substring "aa" is:
(a|b)*aa(a|b)*
This expression matches any string where "aa" appears as a substring, with any combination
of characters before and after it.
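These {a, b} patterns can be sanity-checked in Python, whose re syntax accepts the same notation (a minimal sketch; the test strings are our own):

import re

patterns = {
    "even length": r"(aa|ab|ba|bb)*",
    "exactly two a's": r"b*ab*ab*",
    "contains substring aa": r"(a|b)*aa(a|b)*",
}
tests = ["", "ab", "aab", "baba", "bab"]
for name, pat in patterns.items():
    accepted = [s for s in tests if re.fullmatch(pat, s)]
    print(name, "accepts:", accepted)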
Write a regular expression to represent all strings over {a, b} starting with any number of a's,
followed by one or more b's, followed by one or more a's, followed by a single b, followed
by any number of a's, followed by b and ending in any string of a's and b's.
The regular expression is:
a* b+ a+ b a* b (a|b)*
Pragmatic ambiguity refers to the phenomenon where a sentence or phrase can have multiple interpretations due to contextual or pragmatic factors, rather than solely due to lexical or syntactic ambiguity. The ambiguity arises from the context in which the sentence is used.
Example: "Break a leg!"
In everyday conversation, the idiomatic meaning ("good luck") is more likely intended, whereas taken literally it would be absurd to wish harm on someone. Pragmatic ambiguity highlights the importance of considering context, speaker intention, and shared knowledge in disambiguating language.
"Bank"
In this example, the word "bank" has two distinct meanings. WSD involves determining
which meaning is intended based on the context.
What is corpus?
A corpus (plural: corpora) is a large, structured collection of texts, often used for linguistic
research, language teaching, and natural language processing (NLP) tasks.
The K-Nearest Neighbours (KNN) classification algorithm involves the following steps:
1. Data Collection: Gather labelled data (input features and corresponding output labels).
2. Distance Calculation: Calculate the distance between the new input data and each training
data point.
3. K-Nearest Neighbour’s Selection: Select the K most similar data points (nearest
neighbours) based on the calculated distances.
4. Voting: Assign the label of the majority class among the K nearest neighbours.
5. Prediction: Output the predicted label.
What is lexical ambiguity?
Lexical ambiguity occurs when a single word has multiple meanings.
Example: "Bank"
In this example, the word "bank" has two distinct meanings (a financial institution or a riverbank), illustrating lexical ambiguity.
What is syntactic ambiguity?
Syntactic ambiguity arises when a sentence or phrase can be interpreted in multiple ways due
to its structure or syntax. This occurs when different grammatical structures can generate the
same sequence of words, leading to multiple possible meanings.
Example: "I saw her duck."
The ambiguity arises from the word "duck", which can be a noun (the bird) or a verb (to lower oneself).
Find the regular expression representing the set of all strings of the form a^m b^n c^p where m, n, p >= 1.
The regular expression is:
a+ b+ c+
Find the regular expression representing the set of all strings of the form a^n b a^(2m) b^2 where m >= 0, n >= 1.
The regular expression is:
a+ b (aa)* bb
What are the main challenges in NLP?
The main challenges in NLP include ambiguity, context understanding, data sparsity,
named entity recognition, language variability, sentiment analysis, sarcasm and irony,
long-term dependencies, multilingualism, and data privacy and ethics.
What is tokenization?
Tokenization is the process of dividing text into smaller units called tokens, which can be
words, phrases, or other meaningful elements.
“Rina loves her mother and Tina does too” - contains which type of ambiguity?
The sentence "Rina loves her mother and Tina does too" contains pragmatic ambiguity. It is
unclear whether "does too" refers to Tina also loving Rina's mother or to Tina loving her
own mother.
How many ambiguities exist in the given sentence “I know little Italian.”
The sentence "I know little Italian" contains two types of ambiguity:
Syntactic Ambiguity: It is unclear whether "little" modifies "Italian" (meaning the speaker
knows a small amount of Italian) or "know" (meaning the speaker knows very little about
Italian, possibly referring to the subject of Italian culture or the language itself).
Semantic Ambiguity: It is ambiguous whether "little" means the speaker has limited
proficiency in the Italian language or if "little" refers to the speaker's knowledge of Italian in
terms of its cultural context or other aspects.
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It helps in identifying and weighting significant words in a document by combining:
Term Frequency (TF): how often the term appears in the document.
Inverse Document Frequency (IDF): how rare the term is across the whole corpus.
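A common formulation (one of several variants) is:
tf-idf(t, d) = tf(t, d) × log(N / df(t))
where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. For example, if a word appears 3 times in a document and occurs in 10 of 1,000 documents, its weight is 3 × log(1000/10) = 3 × 2 = 6 (using log base 10).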
Natural languages are used for human communication and are inherently ambiguous,
evolving, and context-dependent. They are more flexible and less strictly defined compared
to formal languages.
What are the different models used for information extraction?
Rule-Based Models: Use predefined rules and patterns to identify and extract relevant
information from text.
Statistical Models: Leverage statistical methods and machine learning to identify and
extract information based on probabilities and patterns learned from training data. Examples
include Hidden Markov Models (HMM) and Conditional Random Fields (CRF).
Supervised Learning Models: Use labeled training data to train models that can classify
and extract information based on learned features. Examples include Support Vector
Machines (SVM) and neural networks.
Deep Learning Models: Utilize neural networks, particularly deep learning architectures
like Recurrent Neural Networks (RNNs) and Transformer models (e.g., BERT, GPT), to
learn complex patterns and perform extraction tasks.
Hybrid Models: Combine rule-based and machine learning approaches to leverage the
strengths of both methods for improved extraction performance.
How can you differentiate Artificial Intelligence, Machine Learning, and Natural Language
Processing?
AI is the overarching concept.
ML is a subfield of AI focused on learning from data.
NLP is a subfield of AI and ML focused on processing and understanding human language.
Which elements are typically removed or normalized during text preprocessing?
Stop words: Commonly occurring words like "and," "the," or "in" that do not carry significant
meaning.
Punctuation: Symbols like commas, periods, and exclamation marks that are not needed for
analysis.
Special characters: Non-alphanumeric characters that may be irrelevant to the task.
Typos and misspellings: Incorrect or inconsistent text that can affect analysis accuracy.
The goal is to clean and preprocess the text to focus on the most meaningful and relevant
information for downstream NLP tasks.
What are the stages in the lifecycle of a natural language processing (NLP) project?
The stages in the lifecycle of a natural language processing (NLP) project are:
1. Data acquisition: collecting raw text data relevant to the task.
2. Text cleaning and preprocessing: removing noise and normalizing the text.
3. Feature engineering: converting text into representations a model can use.
4. Modelling: selecting and training a model.
5. Evaluation: measuring performance with appropriate metrics.
6. Deployment: integrating the model into a production system.
7. Monitoring and updating: tracking performance and retraining as needed.
Note: These stages may vary depending on the specific project and requirements.
Common methods of collecting data for an NLP project include:
1. Public Datasets: Using existing datasets available from sources like Kaggle, UCI, or
academic repositories.
2. Web Scraping: Extracting data from websites using web scraping tools and
techniques.
3. APIs: Collecting data from online services and platforms through their APIs.
4. Crowdsourcing: Gathering data by outsourcing tasks to a large group of people via
platforms like Amazon Mechanical Turk.
5. Text Repositories: Accessing large collections of text from libraries, archives, and
data repositories.
6. Manual Collection: Creating custom datasets through surveys, interviews, or direct
data entry.
Describe F1 score.
The F1 score is a metric used to evaluate the accuracy of a classification model, particularly
in cases where the data is imbalanced. It combines precision and recall into a single measure
that captures the trade-off between the two.
F1 = (2 × Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0
indicates poor performance.
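For example, a classifier with precision 0.8 and recall 0.6 has:
F1 = (2 × 0.8 × 0.6) / (0.8 + 0.6) = 0.96 / 1.4 ≈ 0.686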
How do regular grammar and context-free grammar differ?
Regular Grammar and Context-Free Grammar (CFG) are both formal grammar types
used in computational linguistics and automata theory to define language syntax, but they
differ in expressive power and structure.
Regular Grammar: Defines regular languages with rules that can be expressed using
regular expressions; can be represented by finite automata.
Context-Free Grammar: CFGs are more powerful than regular grammars and can describe
context-free languages. They allow more complex structures, such as nested and hierarchical
patterns.
What are the meaning components of a language?
The meaning components of a language are the elements that contribute to conveying meaning
in communication. These components include:
1. Lexical Semantics: This relates to the meaning of individual words and phrases.
Lexical semantics explores word meanings, relationships between words (such as
synonyms and antonyms), and word senses (polysemy and homonymy).
2. Compositional Semantics: This examines how words combine to form phrases and
sentences, ensuring that the meaning of a larger expression is derived from the
meanings of its parts and their syntactic arrangement. It follows the principle that
sentence meaning arises from both word meanings and grammatical structure.
3. Pragmatics: Pragmatics deals with how context influences meaning. This includes
understanding implied meanings, conversational implicatures, and the roles of the
speaker and listener. Pragmatics allows us to interpret meanings based on social
context, tone, or cultural background.
4. Discourse Semantics: Discourse semantics examines how sentences relate to one
another within a larger text or conversation, creating coherence. It involves reference
(identifying pronouns and their antecedents), presupposition, and text structure to
maintain context and flow.
5. Thematic Roles: Thematic roles refer to the functions words or phrases play in
sentences, such as agent (doer of action), patient (receiver of action), or instrument
(means by which action is done). They clarify the relationships between verbs and
nouns in a sentence.
These meaning components work together to construct and interpret meaning in language,
allowing for clear and contextually appropriate communication.
Differentiate between the rationalist and empiricist approaches in NLP.
The rationalist approach in NLP uses predefined linguistic rules and grammar theories,
focusing on innate structures for language understanding. It is highly interpretable but less
adaptable to real-world language variability.
The empiricist approach relies on data-driven methods, using statistical models and machine
learning to learn patterns from large datasets. It’s highly adaptable but often lacks
interpretability and requires extensive labelled data.
Essentially, rationalists emphasize rules, while empiricists prioritize learning from data.
List the motivation behind the development of computational models of languages.
The motivation behind the development of computational models of languages includes:
1. Enabling language technologies such as machine translation, speech recognition, information retrieval, and question answering.
2. Testing linguistic theories by making them precise enough to implement.
3. Improving human-computer interaction through natural language interfaces.
4. Understanding how humans process and produce language.
What are the problems associated with the n-gram model? How are these problems handled?
Problems:
1. Data sparsity: many valid n-grams never occur in the training corpus, so they receive zero probability.
2. Limited context: an n-gram model only conditions on the previous n - 1 words, so it cannot capture long-range dependencies.
3. Storage and computation: the number of possible n-grams grows rapidly with n and vocabulary size.
Solutions:
1. Smoothing techniques (e.g., add-one/Laplace, Good-Turing, Kneser-Ney) assign non-zero probability to unseen n-grams.
2. Backoff and interpolation combine higher-order and lower-order models.
3. Restricting n (e.g., to bigrams or trigrams) or pruning rare n-grams controls model size.
In linguistics, lexical rules define the properties and behavior of words in a language. They specify the word's grammatical category (part of speech), its morphological forms, its subcategorization (the complements it can take), and its semantic properties.
This lexical entry provides information about the verb "write" for use in LFG. It includes
phonological, morphological, syntactic, semantic, and functional information. Here's a
complete entry for the verb "write":
Verb: write
- Phonology: /raɪt/
- Morphology: write (base), writes (3rd person singular), writing (present participle), wrote
(past), written (past participle)
- Syntax:
- Category: V (verb)
- Subcategorization: [NP] (transitive) or [] (intransitive)
- Argument Structure: <AGENT, THEME>
- Semantics:
- Sense: create_written_text
- Semantic Features: [+dynamic], [+volitional], [+communicative]
- Functional Information:
- Subject: AGENT
- Object: THEME
- Predicate: write
Define a finite automaton that accepts the following language: (aa)*(bb)*.
DFA: Do it by yourself (DIBY).
Compute the minimum edit distance between paecful and peaceful.
To compute the minimum edit distance between "paecful" and "peaceful", fill in the standard dynamic-programming table over the two strings. One optimal alignment is:
p a e c - f u l
p e a c e f u l
- Substitute 'a' with 'e' (position 2)
- Substitute 'e' with 'a' (position 3)
- Insert 'e' after 'c'
This gives a minimum edit distance of 3 (with unit cost for each insertion, deletion, and substitution).
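The table computation can be sketched in a few lines of Python (a minimal illustration; the function name is ours):

def min_edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[m][n]

print(min_edit_distance("paecful", "peaceful"))  # 3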
Construct a nondeterministic finite automaton accepting {ab, ba}, and use it to find a
deterministic automaton accepting the same set.
DIBY
Construct a nondeterministic finite automaton accepting the set of all strings over {a, b}
ending in aba. Use it to construct DFA accepting the same set of strings.
DIBY
Construct a transition system which can accept strings over the alphabet {a, b, ..., z} containing either "cat" or "rat".
DIBY
Find all strings of length 5 or less in the regular set represented by the following regular
expressions:
(a) (ab + a)*(aa + b)
(b) (a*b + b*a)*a
(a) (ab + a)*(aa + b):
Length 1: b
Length 2: aa, ab
Length 3: aaa, aab, abb
Length 4: aaaa, aaab, aabb, abaa, abab
Length 5: aaaaa, aaaab, aaabb, aabaa, aabab, abaaa, abaab, ababb
(b) (a*b + b*a)*a:
Since every single symbol a or b is itself an instance of a*b or b*a, the prefix (a*b + b*a)* generates every string over {a, b}; the language is therefore all strings ending in a.
Length 1: a
Length 2: aa, ba
Length 3: aaa, aba, baa, bba
Lengths 4 and 5: every string over {a, b} of that length ending in a (8 strings of length 4 and 16 of length 5).
Note: The '*' operator represents zero or more occurrences, so the expressions can generate strings of arbitrary length; the above lists only include strings of length 5 or less.
Describe, in the English language, the sets represented by the following regular expressions:
(a) a(a+b)*ab
(b) (aa+b)*(bb+a)*
(a) a(a+b)*ab
In other words, the set includes strings that start with 'a' and end with 'ab', with any
combination of 'a's and 'b's in between.
(b) (aa+b)*(bb+a)*
In other words, the set includes strings consisting of a first part built from any number of 'aa' blocks and single 'b's, followed by a second part built from any number of 'bb' blocks and single 'a's.
Find a derivation tree for a*b + a*b in the grammar G, where G is given by S → S + S | S * S, S → a | b.
DIBY
A context-free grammar G has the following productions:
S → 0S0 | 1S1 | A, A → 2B3, B → 2B3 | 3. Describe the language generated by these productions.
To analyze the language generated by the grammar G, let's examine the productions:
1. Productions of the Grammar G:
o S→0S0
o S→1S1
o S →A
o A→2B3
o B→2B3
o B→3
We can derive strings by repeatedly applying these rules, which will help us understand the
language L(G).
Analysis of Each Production Rule
1. Rules for S:
o S→0S0 and S→1S1:
These rules suggest that the string produced by S can have matching
pairs of 0's or 1's around a centre part derived from S.
This means that any string generated by these rules will be
palindromic (it reads the same forwards and backwards).
o S→A:
This rule allows S to generate a string that is derived from A, which we
analyze next.
2. Rules for A:
o A→2B3:
This rule requires that any string derived from A begins with 2,
followed by some string derived from B, and ends with 3.
Thus, any string generated by A will start with 2 and end with 3.
3. Rules for B:
o B→2B3:
This recursive rule allows B to generate strings with pairs of 2's and 3's
in the form 2…3.
Each application of this rule wraps the previous string in a new 2 and
3 pair, so B generates strings with balanced pairs of 2's and 3's.
o B→3:
This base case terminates the recursion for B, meaning B can produce
the string "3" alone or be part of a larger string with balanced pairs of
2's and 3's.
Putting these together: S wraps a matched frame of 0's and 1's (a palindromic shell w ... w^R) around a centre generated by A, and A generates 2^n 3^(n+1) for n >= 1. Hence L(G) = { w 2^n 3^(n+1) w^R | w ∈ {0,1}*, n >= 1 }.
Derivation of 12331:
S → 1S1 → 1A1 → 12B31 → 12331
Consider the following productions:
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
For the string aaabbabbba, find
(a) the leftmost derivation,
(b) the rightmost derivation,
and (c) the parse tree.
DIBY
DIBY
Show that the grammar S → aBb, A → aAB | a, B → ABb | b is ambiguous.
To show that the grammar is ambiguous, we need to find a string that has two different parse trees (equivalently, two different leftmost derivations). Consider the string aaabbbb.
Derivation 1 (expanding the inner B again):
S → aBb
→ aABbb (B → ABb)
→ aaBbb (A → a)
→ aaABbbb (B → ABb)
→ aaaBbbb (A → a)
→ aaabbbb (B → b)
Derivation 2 (expanding A with A → aAB):
S → aBb
→ aABbb (B → ABb)
→ aaABBbb (A → aAB)
→ aaaBBbb (A → a)
→ aaabBbb (B → b)
→ aaabbbb (B → b)
Since the same string aaabbbb has two structurally different leftmost derivations (and hence two parse trees), the grammar is ambiguous.
In Parse Tree 2, "stolen" is an adjective modifying the implicit noun "painting" in the phrase
"stolen painting", and the prepositional phrase "by tree" could be interpreted as the thief being
located by a tree, or the painting being stolen by someone near a tree.
The ambiguity arises from the fact that "stolen" can be interpreted as an adjective or a verb,
and the prepositional phrase "by tree" can be attached to different parts of the sentence.
Overall, semantic grammar offers improved language understanding and more accurate
parsing but comes with increased complexity and requirements for large datasets and context
awareness.
Discuss lexical ambiguity.
Lexical ambiguity, also known as word-sense ambiguity, occurs when a single word has
multiple related or unrelated meanings, making it difficult to determine the intended meaning
without additional context. This ambiguity can lead to confusion, misinterpretation, or
incorrect understanding of language.
1. Homographs: Words with the same spelling but different meanings (e.g., "bank" as a
financial institution or the side of a river).
2. Homophones: Words with the same pronunciation but different meanings (e.g., "to", "too",
and "two").
3. Polysemy: Words with multiple related meanings (e.g., "head" as a body part or the source
of a river).
4. Metaphorical extension: Words with extended meanings beyond their literal sense (e.g.,
"heart" as a symbol of love).
Examples:
"I saw the man with the telescope."
Could mean either "I used a telescope to see the man" or "The man I saw had a telescope."
"The old man the boats."
Could mean "The elderly people are the ones who man the boats" or "The old man is the one
who controls the boats."
Causes:
Ambiguous Grammar: The grammar rules allow multiple valid interpretations.
Lexical Ambiguity: Words can have more than one meaning or function.
Resolution:
Context: Additional information from the surrounding text or situation can clarify the intended
meaning.
Disambiguation Techniques: In natural language processing, various algorithms and
heuristics can help choose the most likely interpretation.
Discuss semantic ambiguity.
Semantic ambiguity occurs when a word, phrase, or sentence has multiple meanings or
interpretations due to the inherent ambiguity in the semantics (meaning) of the language.
Examples:
"Bank"
Discuss pragmatic ambiguity.
Pragmatic ambiguity occurs when the intended force or meaning of an utterance depends on context, speaker intention, or conversational norms.
Examples:
"Can you pass the salt?"
Pragmatic ambiguity arises if it is unclear whether the speaker is making a request or simply
asking if the listener is capable of passing the salt.
"I don't have any cash."
Could be ambiguous if it’s unclear whether the speaker means they don’t have cash on hand
right now or if they don’t have any cash at all.
"It’s cold in here."
Could be ambiguous whether the speaker is simply stating a fact or indirectly requesting that
someone adjust the temperature.
Causes:
Context Dependence: The meaning of a statement can vary depending on the situational
context, including the relationship between speakers and the specific circumstances.
Implicature: The intended meaning might rely on implied information or assumptions that are
not explicitly stated.
Speaker Intent: The ambiguity can stem from the speaker's intentions or the conversational
norms and expectations.
Resolution:
Contextual Understanding: Interpreting pragmatic ambiguity often requires understanding the
broader context of the conversation and the relationships between participants.
Clarification: Asking for clarification or providing additional information can help resolve
pragmatic ambiguity.
Conversational Norms: Understanding social norms and conventions can also aid in
interpreting ambiguous statements.
Discuss knowledge-based word sense disambiguation approaches.
Knowledge-based word sense disambiguation (WSD) approaches use external resources and
structured knowledge to determine the correct sense of a word in context. These approaches
leverage various types of linguistic and encyclopedic information. Here are some key types:
Dictionary-Based Approaches:
Use definitions and example sentences from dictionaries or thesauri (e.g., WordNet) to match
the context of a word with its possible senses.
Ontology-Based Approaches:
Utilize structured knowledge bases like ontologies (e.g., WordNet, FrameNet) to understand
the relationships between different senses and their contexts.
Knowledge Graph-Based Approaches:
Use large semantic networks and knowledge graphs (e.g., BabelNet) to measure graph connectivity and semantic relatedness between candidate senses and the surrounding context.
In contrast, corpus-based approaches include:
Statistical Methods:
Feature Extraction: Use context features (e.g., surrounding words) to train classifiers on
annotated data to predict word senses.
Unsupervised Learning:
Clustering: Group contexts into clusters and infer senses based on these clusters without
labeled data.
Word Embeddings:
Contextual Embeddings: Utilize models like Word2Vec or BERT to represent words in high-
dimensional space, aiding sense disambiguation based on context similarity.
These corpus-based approaches rely on analyzing large text datasets to capture contextual patterns for accurate sense identification, whereas knowledge-based approaches draw instead on curated lexical resources.
Discuss Lesk’s algorithm.
Lesk’s Algorithm is a method for word sense disambiguation that determines the most
appropriate sense of a word based on its context by comparing it with the definitions of
potential senses.
Key Steps:
Context Extraction: Collect words surrounding the ambiguous word in the sentence.
Sense Definitions: Retrieve definitions for each possible sense from a lexical resource (e.g.,
WordNet).
Overlap Calculation: Measure the overlap between words in the context and words in each
sense’s definition.
Sense Selection: Choose the sense with the highest overlap as the most likely meaning.
Strengths:
Simple and straightforward.
Does not require training data.
Limitations:
Dependent on the quality of definitions in the lexical resource.
May not handle complex contexts or nuanced meanings well.
Lesk’s Algorithm uses context and definitions to resolve ambiguities by finding the sense with
the greatest contextual overlap.
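A toy sketch of simplified Lesk in Python (the two-sense inventory for "bank" is invented for illustration; a real system would pull definitions from WordNet):

def simplified_lesk(context_words, senses):
    # senses: mapping from sense name to its definition (a string)
    best_sense, best_overlap = None, -1
    for sense, definition in senses.items():
        overlap = len(set(context_words) & set(definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "bank_financial": "an institution that accepts deposits and lends money",
    "bank_river": "the sloping land beside a river or stream",
}
context = "she sat by the river near the bank".split()
print(simplified_lesk(context, senses))  # bank_river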
Discuss Walker’s algorithm.
Walker's algorithm (1987) is a thesaurus-based approach to word sense disambiguation, in which each sense of an ambiguous word is associated with a subject category in a thesaurus (such as Roget's).
Key steps:
1. For each sense of the target word, find its thesaurus subject category.
2. For each word in the surrounding context, determine which subject categories it can belong to.
3. Score each sense by counting how many context words share its category.
4. Select the sense whose category receives the highest score.
Advantages: Simple, and requires no sense-annotated training data.
Disadvantages: Accuracy is limited by the coverage and granularity of the thesaurus, and broad subject categories may fail to separate fine-grained senses.
Like Lesk's algorithm, Walker's algorithm is a knowledge-based method; data-driven approaches are often preferred when annotated corpora are available.
Discuss the concept of Bayesian classification.
Bayesian classification is a statistical approach to classify data based on Bayes' theorem. It
predicts the probability of a data point belonging to a particular class, given its features.
Key components:
1. Prior probability P(C): the probability of each class before observing the data.
2. Likelihood P(X | C): the probability of observing features X given class C.
3. Posterior probability P(C | X): the probability of class C given the observed features.
How it works: Bayes' theorem gives P(C | X) = P(X | C) × P(C) / P(X); the classifier computes the posterior for each class and assigns the class with the highest value.
Advantages: Simple, computationally efficient, works well even with small training sets, and (in its naive form) scales to high-dimensional feature spaces.
Common applications:
1. Spam detection
2. Sentiment analysis
3. Image classification
4. Medical diagnosis
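A small worked example (numbers invented for illustration): suppose 20% of emails are spam, and the word "free" appears in 60% of spam emails but only 5% of non-spam emails. For an email containing "free":
P(spam | "free") = P("free" | spam) × P(spam) / P("free")
= (0.6 × 0.2) / (0.6 × 0.2 + 0.05 × 0.8) = 0.12 / 0.16 = 0.75
so the email is classified as spam.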
Discuss the K-Nearest Neighbours (KNN) algorithm.
KNN is a simple, non-parametric, instance-based learning algorithm: it stores the training data and classifies a new point by a majority vote among its K closest training examples.
How it works:
Example:
Suppose we want to classify a new person (age 29, weight 78) as "fit" or "unfit" based on their age and weight.
Training data (labels chosen here for illustration): (25, 70) → unfit, (30, 80) → fit, (28, 75) → fit, (35, 90) → unfit.
1. Calculate distances (Euclidean):
- Distance to (25, 70) = √((29-25)^2 + (78-70)^2) = √80 ≈ 8.94
- Distance to (30, 80) = √((29-30)^2 + (78-80)^2) = √5 ≈ 2.24
- Distance to (28, 75) = √((29-28)^2 + (78-75)^2) = √10 ≈ 3.16
- Distance to (35, 90) = √((29-35)^2 + (78-90)^2) = √180 ≈ 13.42
2. Select K=3 nearest neighbours: (30, 80), (28, 75), (25, 70)
3. Predict class: Majority vote = "fit" (2 out of 3 neighbours)
KNN is a simple, intuitive algorithm, but it can be slow for large datasets and high-
dimensional data.
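The same computation can be sketched in Python (using the illustrative labels above; math.dist requires Python 3.8+):

import math

def knn_predict(train, query, k=3):
    # train: list of ((age, weight), label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))
    labels = [label for _, label in nearest[:k]]
    return max(set(labels), key=labels.count)  # majority vote

train = [((25, 70), "unfit"), ((30, 80), "fit"),
         ((28, 75), "fit"), ((35, 90), "unfit")]
print(knn_predict(train, (29, 78)))  # fit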
Use systemic grammar to analyse the following sentences:
(a) Savitha will sing a song.
(b) Savitha sings a song.
(a) Savitha will sing a song.
- Clause structure: Subject (Savitha) - Finite (will) - Predicator (sing) - Object (a song)
- Transitivity: The process is transitive, with Savitha acting on the song.
- Modality: The sentence expresses future possibility, with "will" indicating a prediction.
(b) Savitha sings a song.
- Clause structure: Subject (Savitha) - Finite/Predicator (sings) - Object (a song)
- Transitivity: The process is transitive, with Savitha acting on the song.
- Tense: The simple present presents the action as a habitual or general fact, with no modal element.
In systemic grammar, the analysis focuses on the functional roles of elements in the sentence,
such as process, participant, and goal. This approach highlights the relationships between
elements and their contributions to the overall meaning.
Use the FUF grammar to build a fully unified FD for the following sentences:
(a) Savitha will sing a song.
(b) Savitha sings a song.
(a) Savitha will sing a song.
- FD (Functional Description):
- (Savitha [NP]) AGENT
- (will [AUX]) TENSE FUTURE
- (sing [V]) ACTION
- (a song [NP]) GOAL
- Unification:
- AGENT → Savitha
- TENSE → FUTURE
- ACTION → sing
- GOAL → a song
(b) Savitha sings a song.
- FD (Functional Description):
- (Savitha [NP]) AGENT
- (sings [V]) ACTION
- (a song [NP]) GOAL
- Unification:
- AGENT → Savitha
- ACTION → sings
- GOAL → a song
In FUF grammar, the FD represents the functional structure of the sentence, and unification
integrates the information into a single, unified representation. The main difference between
the two sentences is the presence of the auxiliary verb "will" in sentence (a), indicating future
tense.
Note: FUF grammar is a complex and abstract framework, and this analysis is a simplified
representation of the actual process.
In FUF (Functional Unification Grammar), the Functional Structure (FD) represents the
abstract, functional aspects of a sentence, while unification is the process of integrating
different pieces of information into a coherent, unified representation.
Example analysis: the two sentences analysed above illustrate this; their unified FDs differ only in the tense and modality information contributed by the auxiliary "will".
When should precision be used, and when recall?
Use:
Precision: Important when the cost of false positives is high (e.g., spam detection).
Recall: Crucial when the cost of false negatives is high (e.g., medical diagnoses).
Discuss the basic information retrieval process.
The basic information retrieval (IR) process includes:
Query Formulation: Users submit a query, which may be preprocessed for better matching.
Document Indexing: Documents are processed and indexed, creating an inverted index that
maps terms to document locations.
Matching: The system uses a retrieval model to match the query with indexed documents and
ranks them based on relevance.
Retrieval: Relevant documents are selected and retrieved based on their relevance scores.
Feedback: User interactions and feedback can help refine and improve the retrieval process.
This process aims to efficiently find and present the most relevant documents based on user
queries.
Discuss Zipf's law.
Zipf's law states that the frequency of a word in a corpus is inversely proportional to its rank in the frequency table:
f(r) = C / r
where f(r) is the frequency of the word with rank r, and C is a constant.
Implications:
A few words are very common, while many words are rare.
This distribution pattern is observed not only in language but also in other areas like
city populations and species distributions.
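As a quick numeric illustration (values invented): if the most frequent word (rank 1) occurs 1,000,000 times, Zipf's law predicts the rank-2 word occurs about 500,000 times and the rank-3 word about 333,000 times, since f(r) = C / r with C = 1,000,000.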
Here are the regular expressions for each of the given languages:
1. The set of all alphabetic strings:
o Regular Expression: [a-zA-Z]+
o Explanation: This expression matches any string made up of one or more alphabetic
characters (both lowercase a-z and uppercase A-Z). The + denotes one or more occurrences.
2. The set of all lowercase alphabetic strings ending in a b:
o Regular Expression: [a-z]*b
o Explanation: This expression matches any string consisting of lowercase
alphabetic characters (a-z) that ends with the letter b. The * allows for zero or
more occurrences of lowercase letters before the final b.
3. The set of all strings from the alphabet {a, b} such that each a is immediately
preceded by and immediately followed by a b:
o Regular Expression: (b+(ab+)*)? (equivalently, ε | b+(ab+)*)
o Explanation:
b+ matches an initial run of one or more b's.
(ab+)* matches zero or more occurrences of an a followed by at least one b, so every a is immediately preceded and followed by a b.
The outer ? also admits the empty string, and strings consisting only of b's are matched by b+ alone.
Write regular expressions for the following languages. By “word”, we mean an alphabetic
string separated from other words by whitespace, any relevant punctuation, line breaks, and
so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and
“the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that end at the end of
the line with a word;
3. all strings that have both the word grotto and the word raven in them (but not, e.g., words
like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a register. Deal with
punctuation
Here are the regular expressions for the specified languages:
1. Strings with two consecutive repeated words:
o Regular Expression: \b(\w+)\s+\1\b
o Explanation:
\b(\w+) matches a word and captures it in a group.
\s+ matches one or more whitespace characters separating the words.
\1 refers to the captured word, ensuring it matches the next word
exactly.
2. Strings that start with an integer and end with a word:
o Regular Expression: ^\d+\b.*\b[a-zA-Z]+$
o Explanation:
^\d+\b matches an integer at the beginning of the line.
.* matches any characters (including whitespace) in between.
\b[a-zA-Z]+$ matches an alphabetic word at the end of the line.
3. Strings containing both the word "grotto" and the word "raven":
o Regular Expression: \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
o Explanation:
\b: Matches a word boundary.
grotto: Matches the word "grotto".
.*: Matches any character zero or more times.
\braven\b: Matches the word "raven".
|: Logical OR.
The two alternatives ensure that both words are present in any order.
4. Pattern for the first word of an English sentence, handling punctuation:
o Regular Expression: ^[^a-zA-Z]*([a-zA-Z]+)
o Explanation:
^[^a-zA-Z]* skips any leading whitespace or punctuation (such as opening quotation marks) at the start of the sentence.
([a-zA-Z]+) captures the first word into register (group) 1.
These regular expressions handle the specified patterns and constraints, dealing with words, integers, and sentence structure.
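As a quick sanity check, pattern 1 can be exercised in Python (test strings taken from the problem statement):

import re

pattern = re.compile(r"\b(\w+)\s+\1\b")
print(bool(pattern.search("Humbert Humbert")))  # True
print(bool(pattern.search("the the bug")))      # True
print(bool(pattern.search("the big bug")))      # False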
What are f-structure and C-structure in LFG? Consider the following sentence and explain
both the structure, “She saw stars in the sky”
F-structure:
- PRED: 'see <SUBJ, OBJ>'
- TENSE: past
- SUBJ: [ PRED 'she' ]
- OBJ: [ PRED 'star', NUM plural ]
- ADJUNCT: [ PRED 'in <OBJ>', OBJ [ PRED 'sky', DEF + ] ]
C-structure:
-S
- NP (She)
- VP
- V (saw)
- NP (stars)
- PP (in the sky)
- P (in)
- NP (the sky)
In summary, f-structure highlights the functional relationships, while c-structure shows the
hierarchical constituent structure. Both structures are necessary for a comprehensive
understanding of sentence structure in LFG.
What are Karaka relations? Explain the difference between Karta and Agent.
Karaka relations are semantic roles in Sanskrit grammar that describe the relationships
between a verb and its arguments:
Karta: The doer or agent performing the action (e.g., "John" in "John eats an apple").
Karma: The object or recipient affected by the action (e.g., "apple" in "John eats an apple").
Difference:
Karta is a syntactico-semantic notion in Paninian grammar: the most independent participant in the action as presented by the sentence. Agent is a purely semantic (thematic) role: the conscious doer of the action. The two usually coincide, but they can come apart.
For instance:
In the sentence "Ram reads a book," "Ram" is both the Karta and the Agent. He is the doer of
the action of reading.
However, in the sentence "The book was read by Ram," "Ram" is still the Agent, as he is the
one who performs the action. However, grammatically, "book" is the Karta, as it is the subject
of the passive sentence.
Calculate the bigram probability for each of the words of the sentence “I want Chinese food.”
To calculate bigram probabilities, we need a large corpus of text data. This corpus will
provide the necessary frequency counts for both unigrams and bigrams.
Once we have the corpus, we can calculate the probability of a bigram using the following formula:
P(w2 | w1) = Count(w1, w2) / Count(w1)
Where:
P(w2 | w1) is the probability of word w2 given the previous word w1.
Count(w1, w2) is the number of times the bigram w1, w2 appears in the corpus.
Count(w1) is the number of times the word w1 appears in the corpus.
Example (using counts from a small toy corpus):
P(want | I) = 2/3
P(Chinese | want) = 1/2
P(food | Chinese) = 1/1 = 1
The probability of the whole sentence is then the product: P ≈ (2/3) × (1/2) × 1 = 1/3 (ignoring sentence-boundary terms).
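A minimal Python sketch of these maximum-likelihood estimates (the toy corpus below is invented so that the counts reproduce the numbers above):

from collections import Counter

corpus = ["I want Chinese food", "I want tea", "I like Chinese food"]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w2, w1):
    # maximum-likelihood estimate: P(w2 | w1) = Count(w1, w2) / Count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("want", "I"))        # 2/3 ≈ 0.667
print(p("Chinese", "want"))  # 1/2 = 0.5
print(p("food", "Chinese"))  # 1.0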
We are given the following corpus:
<s> I am sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include <s>
& </s> in your counts just like any other token.
To calculate P(Sam|am) using a bigram language model with add-one smoothing, follow these steps:
Vocabulary (treating "sam" and "Sam" as distinct tokens): <s>, </s>, I, am, sam, Sam, do, not, like, green, eggs, and → V = 12
- Smoothed denominator: Count(am) + V = 3 + 12 = 15
- Smoothed numerator: Count(am Sam) + 1 = 1 + 1 = 2
- P(Sam|am) = (Count(am Sam) + 1) / (Count(am) + V) = 2/15 ≈ 0.133
For comparison, the unsmoothed bigram model gives:
- Count(am Sam) = 1
- Count(am) = 3
- P(Sam|am) = Count(am Sam) / Count(am) = 1/3
Unigram model:
- Count(Sam) = 3
- Count(total) = 12
- P(Sam) = Count(Sam) / Count(total) = 3/12 = 1/4
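The same computation, sketched in Python (a minimal illustration over the given corpus):

from collections import Counter

sentences = [
    "<s> I am sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]
unigrams, bigrams = Counter(), Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # 12 distinct tokens (case-sensitive)
p_smoothed = (bigrams[("am", "Sam")] + 1) / (unigrams["am"] + V)
print(V, p_smoothed)  # 12 0.1333...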
Text is typically preprocessed for tagging in the following ways:
1. Tokenization: splitting the text into individual words or tokens.
2. Stopword removal: ignoring common words like "the", "and", etc. that do not carry much
meaning.
3. Stemming or Lemmatization: reducing words to their base form (e.g., "running" becomes
"run").
4. Feature extraction: representing words as numerical vectors (e.g., word embeddings like
Word2Vec).
5. Part-of-speech (POS) tagging: identifying the grammatical category of each word (e.g.,
noun, verb, adjective).
6. Named Entity Recognition (NER): identifying named entities like names, locations,
organizations.
8. Handling out-of-vocabulary (OOV) words: dealing with words not present in the training
data.
By handling words in these ways, the tagging process can effectively analyze and understand
the meaning of text.
Comment on the validity of the following statements:
a) Rule-based taggers are non-deterministic
b) Stochastic taggers are language independent
c) Brill’s tagger is a rule-based tagger
a) Invalid. Rule-based taggers are deterministic: they apply a set of predefined rules to assign tags, so the output is predictable for a given input.
b) Largely valid. Stochastic taggers learn tag probabilities from an annotated corpus, so the same technique can be applied to any language for which suitable training data exists.
c) Valid, with a caveat. Brill's tagger is a transformation-based tagger: it applies rules, but the rules are learned automatically from a corpus rather than written by hand.
Construct a transition system corresponding to the regular expressions (i) (ab + c*)*b (ii) a + bb + bab*a.
DIBY
Construct a finite automaton recognizing L(G), where G is the grammar S → aS | bA | b and A → aA | bS | a.
Hint:
To construct a finite automaton recognizing L(G), we need to first analyze the grammar G
and identify the patterns it generates.
The grammar G has two non-terminals, S and A, and two terminals, a and b. The productions
are:
S → aS | bA | b
A → aA | bS | a
- S can generate strings starting with 'a' followed by any string generated by S, or strings
starting with 'b' followed by any string generated by A, or just 'b'.
- A can generate strings starting with 'a' followed by any string generated by A, or strings
starting with 'b' followed by any string generated by S, or just 'a'.
Construct a finite automaton recognizing L(G), where G is the grammar S → aS | bS | aA, A → bB, B → aC and C → A.
The grammar has four non-terminals (S, A, B, C) and two terminals (a, b). The productions
are:
S → aS | bS | aA
A → bB
B → aC
C→A
- S can generate strings starting with 'a' or 'b' followed by any string generated by S, or strings
starting with 'a' followed by any string generated by A.
- A can generate strings starting with 'b' followed by any string generated by B.
- B can generate strings starting with 'a' followed by any string generated by C.
- C can generate strings generated by A.
DIBY
Construct a reduced grammar equivalent to the grammar
S → aAa, A → Sb | bCC | DaA, C → abb | DD, E → aC, D → aDA
DIBY
Let G be S → AB, A → a, B → C | b, C → D, D → E and E → a. Eliminate unit productions and get an equivalent grammar.
DIBY
Reduce the following grammar G to CNF. G is S → aAD, A → aB | bAB, B → b and D → d.
To convert G to Chomsky Normal Form (CNF), every production must be of the form A → BC (two non-terminals) or A → a (a single terminal).
Step 1: Replace terminals inside longer productions with new non-terminals:
- Introduce Ca → a and Cb → b
- S → aAD becomes S → Ca A D
- A → aB becomes A → Ca B
- A → bAB becomes A → Cb A B
Step 2: Break productions with more than two non-terminals on the right-hand side:
- S → Ca A D becomes S → Ca E1, with E1 → A D
- A → Cb A B becomes A → Cb E2, with E2 → A B
The resulting CNF grammar:
S → Ca E1
E1 → A D
A → Ca B | Cb E2
E2 → A B
B → b
D → d
Ca → a
Cb → b
Note that in CNF, all productions are of the form A → BC or A → a.