NLP Practice Problems

Very Short Questions

What is the purpose of a grammar-based language model?

Grammar-based language models are designed to generate natural language text that
adheres to specific grammatical rules and structures. Their primary purpose is to produce
coherent and grammatically correct sentences or paragraphs.
What do you understand by statistical language modelling?

Statistical language modelling is the process of using statistical methods to predict the
probability of sequences of words in a language, helping to determine the likelihood of a
given sentence or phrase based on observed patterns in a large text corpus.

Define regular grammar.

Regular grammar is a set of production rules that generate strings of symbols from a finite
set of terminals and non-terminals, following specific patterns and structures, used to
describe regular languages.

Define finite state automata.

A finite-state automaton (FSA) is a mathematical model used to represent systems that transition from one state to another based on input symbols. It is a simple but powerful model that can be used to recognize patterns in strings.

State the difference between deterministic finite automata and non-deterministic finite
automata.

The main difference between Deterministic Finite Automata (DFA) and Non-
Deterministic Finite Automata (NFA) lies in how they handle transitions:

Deterministic Finite Automata (DFA):

1. For each state and input symbol, there is exactly one possible next state.
2. No ambiguity exists in the transitions; the machine follows a single, unique path
for any given input string.

Non-Deterministic Finite Automata (NFA):

1. For each state and input symbol, there can be multiple possible next states or no
next state at all.
2. The machine can follow multiple paths simultaneously, and an input string is
accepted if at least one path leads to an accepting state.

What can be done in morphological parsing?

In morphological parsing, the structure of a word is analysed to identify its base form
(root or stem) and its morphological components, such as prefixes, suffixes, and
inflectional endings. This process helps to determine the word's meaning, grammatical
features (like tense, number, or case), and how it relates to other words in a sentence.
What are stems?

Stems are the core part of a word that carries its primary meaning, to which affixes (such
as prefixes or suffixes) can be added. In morphological analysis, the stem is what remains
after removing any inflectional or derivational affixes. For example, in the word
"unhappily," the stem is "happy."

What are affixes?

Affixes are morphemes that are attached to a stem or root word to modify its meaning or
create a new word. There are three main types of affixes:
Prefixes: Attached to the beginning of a word (e.g., "un-" in "unhappy").
Suffixes: Attached to the end of a word (e.g., "-ness" in "happiness").
Infixes: Inserted within a word, though infixes are rare in English.
Affixes can change the grammatical category, meaning, or tense of the original word.

What is a lexicon?

A lexicon is a collection or database of words and their meanings, including information about their grammatical properties, usage, and relationships to other words. In linguistics and
computational language processing, a lexicon often includes not only word definitions but
also details like pronunciation, part of speech, and morphological structure. It serves as a
fundamental resource for tasks like language analysis, translation, and natural language
processing (NLP).

What do you understand by morphotactics?

Morphotactics is the study of the rules governing the combination of morphemes in a language. Morphemes are the smallest units of meaning in a language, and they can be combined in various ways to form words. Morphotactics deals with the patterns and constraints that determine which combinations of morphemes are grammatically correct and meaningful in a given language.

What is the minimum edit distance?

The minimum edit distance is the smallest number of operations (insertions, deletions, or
substitutions) required to transform one string into another. It measures how dissimilar two
strings are, with the distance representing the minimum number of changes needed to make
them identical.

What do you understand by similarity key techniques?

Similarity key techniques are methods for measuring how similar two objects are, such as
strings or documents, using metrics like edit distance, cosine similarity, and Jaccard index.
What do you understand by n-gram based techniques?

N-gram based techniques are statistical methods used to analyse and model sequences of
items, such as words in text or characters in DNA sequences. An n-gram is a sequence of n
consecutive items. For example, a bigram (2-gram) is a sequence of two consecutive words,
and a trigram (3-gram) is a sequence of three consecutive words.
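As an illustration, the sketch below (a minimal Python example written for this note, not taken from any particular library) extracts unigrams, bigrams, and trigrams from a whitespace-tokenized sentence:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()  # naive whitespace tokenization
print(ngrams(tokens, 1))  # unigrams, e.g. ('I',), ('love',), ...
print(ngrams(tokens, 2))  # bigrams, e.g. ('I', 'love'), ('love', 'natural'), ...
print(ngrams(tokens, 3))  # trigrams
```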

What do you understand by rule-based techniques?

Rule-based techniques are a class of methods in natural language processing (NLP) that rely
on predefined rules or patterns to analyse and process text. These rules are typically hand-
crafted by experts, and they specify how different linguistic elements (such as words,
phrases, or sentences) should be combined or interpreted.

What is the part-of-speech tagging process?

Part-of-speech tagging is the process of assigning a grammatical category (or part of speech)
to each word in a sentence. These categories can include nouns, verbs, adjectives, adverbs,
prepositions, conjunctions, and more. The goal of part-of-speech tagging is to provide a
deeper understanding of the grammatical structure of a sentence.
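For illustration, here is a minimal sketch of POS tagging with NLTK's off-the-shelf statistical tagger; it assumes NLTK is installed and that the required tokenizer and tagger resources have already been downloaded with nltk.download().

```python
import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
print(nltk.pos_tag(tokens))            # assign a Penn Treebank tag to each token
# e.g. [('The', 'DT'), ('quick', 'JJ'), ...]
```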

Write a regular expression to represent a set of all strings over {a, b} of even length.

(aa|ab|ba|bb)*

Write a regular expression to represent a set of all strings over {a, b} of length 4, starting with an a.

a(a|b)(a|b)(a|b)

Write a regular expression to represent a set of all strings over {a, b} containing at least one
a.

(b* a b*)+

Write a regular expression to represent a set of strings over {a, b} having exactly 3b’s.

a*ba*ba*ba*

Write a regular expression to represent a set of all strings over {a, b} having “abab” as a
substring.

(a|b)*abab(a|b)*
Write a regular expression to represent a set of all strings containing exactly 2a’s.

A regular expression for strings over the alphabet {a,b} containing exactly two 'a's is:
b*ab*ab*
This expression ensures that there are exactly two 'a's in the string, with any number of 'b's
before, between, and after them.

Write a regular expression to represent a set of all strings containing at least 2a’s.

A regular expression for strings over the alphabet {a, b} containing at least two 'a's is:

(a|b)*a(a|b)*a(a|b)*

This expression matches any string with at least two 'a's, with any combination of 'a's and 'b's before, between, and after them.

Write a regular expression to represent a set of all strings containing at most 2a’s.

A regular expression for strings over the alphabet {a,b} containing at most two ‘a’s is:

b* | b*a b* | b*a b*a b*

This expression allows for zero, one, or two occurrences of ‘a’, each potentially surrounded
by any number of ‘b’s.

Write a regular expression to represent a set of all strings containing the substring aa.

A regular expression for strings over the alphabet {a, b} containing the substring "aa" is:

(a|b)*aa(a|b)*

This expression matches any string in which "aa" appears as a substring, with any combination of 'a's and 'b's before and after it.

Write a regular expression to represent all strings over {a, b} starting with any number of a's,
followed by one or more b's, followed by one or more a's, followed by a single b, followed
by any number of a's, followed by b and ending in any string of a's and b's.

The regular expression is:

a* b+ a+ b a* b (a|b)*

a* matches any number of 'a's at the beginning.
b+ matches one or more 'b's.
a+ matches one or more 'a's.
b matches a single 'b'.
a* matches any number of 'a's.
b matches a single 'b'.
(a|b)* matches any sequence of 'a's and 'b's at the end.
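The patterns above use formal-language notation over the alphabet {a, b}. As an optional sanity check, they can be tested with Python's re module (using fullmatch so the whole string must match); the test strings below are chosen purely for illustration.

```python
import re

even_length   = re.compile(r"(aa|ab|ba|bb)*")        # even-length strings over {a, b}
exactly_two_a = re.compile(r"b*ab*ab*")              # exactly two a's
at_most_two_a = re.compile(r"b*|b*ab*|b*ab*ab*")     # at most two a's

assert even_length.fullmatch("abba") and not even_length.fullmatch("aba")
assert exactly_two_a.fullmatch("baba") and not exactly_two_a.fullmatch("aaa")
assert at_most_two_a.fullmatch("bbb") and not at_most_two_a.fullmatch("aaab")
```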
What is pragmatic ambiguity?

Pragmatic ambiguity refers to the phenomenon where a sentence or phrase can have multiple
interpretations due to contextual or pragmatic factors, rather than solely due to lexical or
syntactic ambiguity. The ambiguity arises from the context in which the sentence is used. In
everyday conversation, the idiomatic meaning is more likely intended, whereas in a literal
sense, it would be absurd to wish harm on someone. Pragmatic ambiguity highlights the
importance of considering context, speaker intention, and shared knowledge in
disambiguating language.

Example:

"Break a leg!"

This sentence can be interpreted in two ways:

1. Literal meaning: "Cause physical harm to your leg."


2. Idiomatic meaning: "Good luck!" (often used to wish someone success before a
performance).

What do you understand by the term “word sense disambiguation”?


Word Sense Disambiguation (WSD) is the process of identifying the correct meaning of a
word in a given context, when that word has multiple possible meanings (polysemy or
homonymy).
Example:

"Bank"

1. Financial institution: "I'm going to the bank to deposit my paycheck."


2. Riverbank: "The park is located on the bank of the lake."

In this example, the word "bank" has two distinct meanings. WSD involves determining
which meaning is intended based on the context.

What is a corpus?
A corpus (plural: corpora) is a large, structured collection of texts, often used for linguistic
research, language teaching, and natural language processing (NLP) tasks.

Some notable corpora include:

1. Google's Ngram Corpus


2. Wikipedia Corpus
3. British National Corpus (BNC)
4. Corpus of Contemporary American English (COCA)
What do you understand by supervised learning?
Supervised learning is a type of machine learning where an algorithm learns from labelled
data to predict or classify outcomes. The algorithm is trained on a dataset with input features
and corresponding output labels, allowing it to learn patterns and relationships between the
inputs and outputs. The goal is to make accurate predictions on new, unseen data. Supervised
learning is commonly used for tasks such as image classification, speech recognition, and
text analysis.

Types of Supervised Learning:

1. Classification (e.g., spam vs. non-spam emails)


2. Regression (e.g., predicting house prices)

What do you understand by unsupervised learning?


Unsupervised learning is a machine learning approach where a model is trained on an
unlabelled dataset, meaning the data does not have predefined output labels. The model tries
to identify patterns, relationships, or structures within the data, such as clustering similar
items together or reducing dimensionality to reveal underlying features.

How does KNN work?

K-Nearest Neighbors (KNN) is a supervised learning algorithm for classification problems (predicting classes) that works as follows:

Steps:

1. Data Collection: Gather labelled data (input features and corresponding output labels).
2. Distance Calculation: Calculate the distance between the new input data and each training
data point.
3. K-Nearest Neighbour’s Selection: Select the K most similar data points (nearest
neighbours) based on the calculated distances.
4. Voting: Assign the label of the majority class among the K nearest neighbours.
5. Prediction: Output the predicted label.

What is the speciality of the Naïve Bayes classifier?


The Naïve Bayes classifier's specialty is its simplicity and efficiency. It is based on Bayes'
theorem with the assumption that features are conditionally independent given the class label,
which simplifies computations and requires less training data compared to other methods.
Despite its simplicity, it often performs well in various classification tasks, especially with
text data.

What is the purpose of rule-based taggers?


The purpose of rule-based taggers is to assign part-of-speech (POS) tags to words in a text
using predefined linguistic rules and patterns. These rules are based on grammatical and
contextual information, helping to identify the correct POS tags without relying on statistical
models or training data.
What is the purpose of stochastic taggers?
The purpose of stochastic taggers is to assign part-of-speech (POS) tags to words in a text
using probabilistic models. These taggers leverage statistical methods, such as Hidden
Markov Models or probabilistic context-free grammars, to predict the most likely tags based
on the likelihood of sequences of tags and words. They learn from annotated training data
and can handle ambiguity and context more flexibly compared to rule-based taggers.

What is the purpose of hybrid taggers?


Hybrid taggers combine rule-based and machine learning (ML) approaches to achieve more
accurate and efficient part-of-speech (POS) tagging, named entity recognition (NER), and
other natural language processing (NLP) tasks.

Define context-free grammar.


A context-free grammar (CFG) is a formal grammar where each production rule has a single
non-terminal symbol on the left-hand side and a sequence of terminals and/or non-terminals
on the right-hand side. CFGs are used to define the syntactic structure of languages,
especially in programming languages and natural language processing. They can generate all
context-free languages, which include many programming languages and some aspects of
natural languages.

What is the purpose of a parse tree?


The purpose of a parse tree is to visually represent the syntactic structure of a string according
to a formal grammar. It shows how a string is derived from the grammar's start symbol
through a series of production rules, illustrating the hierarchical relationship between
symbols and providing insights into the grammatical structure of the string.

When is a grammar called ambiguous?


A grammar is called ambiguous if there exists at least one string that can be generated by the
grammar in more than one distinct way, meaning the string has more than one valid parse
tree or derivation. Ambiguity in a grammar leads to multiple possible interpretations of the
same string, which can create challenges in parsing and understanding.

State the difference between top-down parsing and bottom-up parsing.


Top-down parsing starts from the start symbol and attempts to derive the input string, while
bottom-up parsing starts from the input string and attempts to build up to the start symbol.

What is lexical ambiguity?


Lexical ambiguity, also known as word-sense ambiguity, occurs when a single word has
multiple related or unrelated meanings, making its interpretation context-dependent.

Example:

"Bank"

1. Financial institution: "I'm going to the bank to deposit my pay check."


2. Riverbank: "The park is located on the bank of the lake."

In this example, the word "bank" has two distinct meanings, illustrating lexical ambiguity.
What is syntactic ambiguity?
Syntactic ambiguity arises when a sentence or phrase can be interpreted in multiple ways due
to its structure or syntax. This occurs when different grammatical structures can generate the
same sequence of words, leading to multiple possible meanings.

What is semantic ambiguity?


Semantic ambiguity occurs when a phrase or sentence can be interpreted in multiple ways
due to unclear or ambiguous meaning.

Example:

"They saw her duck."

This sentence has two possible interpretations:

1. They saw her (someone's) duck (the bird).


2. They saw her duck (lower herself to avoid something).

The ambiguity arises from the word "duck", which can be a noun (the bird) or a verb (to
lower oneself).

Find the regular expression representing the set of all strings of the form a^m b^n c^p where m, n, p >= 1.

The regular expression is:

a+ b+ c+

Find the regular expression representing the set of all strings of the form a^n b a^(2m) b^2 where m >= 0, n >= 1.

The regular expression is:

a+ b (aa)* bb

What are the main challenges in NLP?

The main challenges in NLP include ambiguity, context understanding, data sparsity,
named entity recognition, language variability, sentiment analysis, sarcasm and irony,
long-term dependencies, multilingualism, and data privacy and ethics.

Mention some areas of NLP application.


Areas of NLP application include machine translation, speech recognition, sentiment
analysis, text classification, named entity recognition, chatbots, information retrieval,
summarization, part-of-speech tagging, and question answering.

What is machine translation?


Machine translation is the automatic process of translating text or speech from one language
to another using computational models.
What is morphological segmentation?
Morphological segmentation is the process of breaking down a word into its constituent
morphemes (the smallest units of meaning), such as roots, prefixes, and suffixes.

What is tokenization?
Tokenization is the process of dividing text into smaller units called tokens, which can be
words, phrases, or other meaningful elements.

“I saw bats” – contains which type of ambiguity?


The phrase "I saw bats" contains lexical ambiguity, as the word "bats" can refer to either the
flying mammals or sports equipment used in baseball.

“Rina loves her mother and Tina does too” - contains which type of ambiguity?
The sentence "Rina loves her mother and Tina does too" contains pragmatic ambiguity. It is
unclear whether "does too" refers to Tina also loving Rina's mother or to Tina loving her
own mother.

How many ambiguities exist in the given sentence “I know little Italian.”
The sentence "I know little Italian" contains two types of ambiguity:

Syntactic Ambiguity: It is unclear whether "little" modifies "Italian" (meaning the speaker
knows a small amount of Italian) or "know" (meaning the speaker knows very little about
Italian, possibly referring to the subject of Italian culture or the language itself).

Semantic Ambiguity: It is ambiguous whether "little" means the speaker has limited
proficiency in the Italian language or if "little" refers to the speaker's knowledge of Italian in
terms of its cultural context or other aspects.

At which phase of language processing is a parse tree required?


The parse tree is required during the syntactic analysis of language processing. This phase
involves analysing the syntactic structure of a sentence to understand how the components
fit together according to a formal grammar. The parse tree represents the hierarchical
structure of the sentence, showing how it is derived from the grammar rules.

At which phase is the meaning of a word checked?


The meaning of the word is typically checked during the semantic analysis phase of language
processing. In this phase, the focus is on interpreting the meanings of words and their
relationships within the context of the sentence to ensure that the text makes sense and is
logically consistent.

What type of words are called “stop words”?


"Stop words" are common words like "the", "and", "a", "an", etc. that do not carry significant
meaning in a sentence and are often ignored in natural language processing and information
retrieval.
What is the use of TF-IDF?

TF-IDF is used to evaluate the importance of words in a document relative to a corpus, helping to identify key terms and improve text search and classification.

Explanation:
TF-IDF (Term Frequency-Inverse Document Frequency) is used to evaluate the importance
of a word in a document relative to a collection of documents (corpus). It helps in identifying
and weighting significant words in a document by combining:

Term Frequency (TF): How often a word appears in a document.


Inverse Document Frequency (IDF): How rare or common a word is across all documents in
the corpus.
TF-IDF is commonly used in text mining, information retrieval, and document classification
to improve search results and feature selection by highlighting important words while
minimizing the impact of commonly occurring words.
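A minimal sketch of TF-IDF weighting using scikit-learn's TfidfVectorizer is shown below (assuming a recent scikit-learn is installed; the three toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse document-term matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(tfidf.toarray().round(2))            # one row of weights per document
```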

What is the difference between formal and natural languages?


Formal languages are precisely defined by formal grammar rules and are used in areas like
mathematics and computer science. They have strict syntax and semantics.

Natural languages are used for human communication and are inherently ambiguous,
evolving, and context-dependent. They are more flexible and less strictly defined compared
to formal languages.

What do you understand by information extraction?


Information extraction (IE) is the process of automatically identifying and extracting
structured information from unstructured or semi-structured text. This often involves
extracting specific data such as names, dates, relationships, and events from large text
corpora, making it easier to analyse and use the information for tasks like database
population, summarization, or question answering.
What are the various models of information extraction?

Various models of information extraction include:

Rule-Based Models: Use predefined rules and patterns to identify and extract relevant
information from text.

Statistical Models: Leverage statistical methods and machine learning to identify and
extract information based on probabilities and patterns learned from training data. Examples
include Hidden Markov Models (HMM) and Conditional Random Fields (CRF).

Supervised Learning Models: Use labeled training data to train models that can classify
and extract information based on learned features. Examples include Support Vector
Machines (SVM) and neural networks.

Unsupervised Learning Models: Extract information without labeled training data by clustering and pattern recognition techniques. Examples include topic modeling and clustering algorithms.

Deep Learning Models: Utilize neural networks, particularly deep learning architectures
like Recurrent Neural Networks (RNNs) and Transformer models (e.g., BERT, GPT), to
learn complex patterns and perform extraction tasks.

Hybrid Models: Combine rule-based and machine learning approaches to leverage the
strengths of both methods for improved extraction performance.

How can you differentiate Artificial Intelligence, Machine Learning, and Natural Language
Processing?
AI is the overarching concept.
ML is a subfield of AI focused on learning from data.
NLP is a subfield of AI and ML focused on processing and understanding human language.

What is noise removal in NLP?


Noise removal in NLP refers to the process of eliminating irrelevant or extraneous
information from text data to improve the quality and effectiveness of text processing tasks.
This may include removing:

Stop words: Commonly occurring words like "and," "the," or "in" that do not carry significant
meaning.
Punctuation: Symbols like commas, periods, and exclamation marks that are not needed for
analysis.
Special characters: Non-alphanumeric characters that may be irrelevant to the task.
Typos and misspellings: Incorrect or inconsistent text that can affect analysis accuracy.

The goal is to clean and preprocess the text to focus on the most meaningful and relevant
information for downstream NLP tasks.
What are the stages in the lifecycle of a natural language processing (NLP) project?
The stages in the lifecycle of a natural language processing (NLP) project are:

1. Data Collection: Gathering text data for the project.


2. Data Preprocessing: Cleaning, tokenizing, and formatting the data.
3. Data Annotation: Labelling and annotating the data for training.
4. Model Training: Building and training the NLP model.
5. Model Evaluation: Testing and evaluating the model's performance.
6. Deployment: Integrating the model into a larger application or system.
7. Maintenance: Monitoring and updating the model as needed.

Note: These stages may vary depending on the specific project and requirements.

What is meant by data augmentation?


Data augmentation is a technique used to increase the diversity of a training dataset without
actually collecting new data. It is especially useful in fields like Natural Language Processing
(NLP) and computer vision, where obtaining large amounts of labelled data can be time-
consuming and expensive. In NLP, data augmentation helps improve model robustness and
generalization by creating variations of the existing data.

How can data be obtained for NLP projects?

Data for NLP projects can be obtained through:

1. Public Datasets: Using existing datasets available from sources like Kaggle, UCI, or
academic repositories.
2. Web Scraping: Extracting data from websites using web scraping tools and
techniques.
3. APIs: Collecting data from online services and platforms through their APIs.
4. Crowdsourcing: Gathering data by outsourcing tasks to a large group of people via
platforms like Amazon Mechanical Turk.
5. Text Repositories: Accessing large collections of text from libraries, archives, and
data repositories.
6. Manual Collection: Creating custom datasets through surveys, interviews, or direct
data entry.

What do you mean by Text Extraction and Clean-up?


Text Extraction is retrieving relevant text from a source. Text Clean-up is preprocessing the
extracted text by removing noise, normalizing, and standardizing it to improve its quality for
analysis.

What is the meaning of Text Normalization in NLP?


Text normalization in NLP is the process of converting text to a standard format, including
tasks like lowercasing, removing punctuation, and standardizing variations to ensure
consistency and improve processing.
What are the steps to follow when building a text classification system?
Data Collection, Data Preprocessing, Feature Extraction, Model Selection, Training,
Evaluation, Hyperparameter Tuning, Deployment.

What do you mean by a Bag of Words?


A Bag of Words (BoW) is a text representation method where a document is represented by
a set of its words, disregarding grammar and word order but keeping track of the frequency
of each word. This model transforms text into a numerical vector based on word counts or
term frequencies.
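A minimal bag-of-words sketch using scikit-learn's CountVectorizer (an assumed setup, shown only for illustration; word order is discarded and only per-document counts are kept):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love machine learning and NLP"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the vocabulary (single-letter tokens like "I" are dropped by default)
print(bow.toarray())                       # word-count vector for each document
```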

Which step is required as a pre-processing step in language processing?


Tokenization is a required pre-processing step in language processing.

What is unigram? Explain with example.


A unigram is a single unit or token from a sequence of text, representing each word
individually without considering its context or relationships with other words. It is the
simplest form of n-gram, where n = 1.

Example: For the sentence "I love NLP":

The unigrams are: "I", "love", "NLP"

What is bigram? Explain with example.


A bigram is a sequence of two adjacent words or tokens from a text. It captures pairs of
words to consider their relationships and context.

Example: For the sentence "I love NLP":

The bigrams are: "I love", "love NLP"

How do unigrams and bigrams differ?


Unigram represents single words, while bigram represents pairs of consecutive words.

How does an n-gram differ from a unigram and a bigram?


N-gram is a general term for sequences of ‘n’ adjacent words or tokens.

Unigram: A 1-gram, representing individual words.


Bigram: A 2-gram, representing pairs of consecutive words.

In essence, n-grams can be unigrams, bigrams, trigrams, etc., depending on the value of ‘n’.

Describe F1 score.
The F1 score is a metric used to evaluate the accuracy of a classification model, particularly
in cases where the data is imbalanced. It combines precision and recall into a single measure
that captures the trade-off between the two.
F1 = (2 × Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0
indicates poor performance.
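For example, with an illustrative precision of 0.80 and recall of 0.60, the formula gives:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.80, 0.60))  # 0.6857..., below the arithmetic mean of 0.70
```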
How do regular grammar and context-free grammar differ?

Regular Grammar and Context-Free Grammar (CFG) are both formal grammar types
used in computational linguistics and automata theory to define language syntax, but they
differ in expressive power and structure.

Regular Grammar: Defines regular languages with rules that can be expressed using
regular expressions; can be represented by finite automata.

Context-Free Grammar: CFGs are more powerful than regular grammars and can describe
context-free languages. They allow more complex structures, such as nested and hierarchical
patterns.

What is term frequency?


Term Frequency (TF) is a measure of how often a term appears in a document relative to the
total number of terms in that document. It helps assess the importance of a term within a
specific document.
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

What is inverse document frequency?


Inverse Document Frequency (IDF) is a metric used in text analysis to evaluate the
importance of a term within a collection of documents, often as part of the TF-IDF (Term
Frequency-Inverse Document Frequency) calculation. IDF helps reduce the weight of
common terms that appear across many documents while increasing the weight of rarer, more
distinctive terms that might be more meaningful.

The IDF of a term t in a document corpus D is calculated as:

IDF(t, D) = log(Total number of documents in D / Number of documents containing t)

Alternatively, a small constant (like 1) is often added to the denominator to avoid division by zero:

IDF(t, D) = log(Total number of documents in D / (Number of documents containing t + 1))

Why is term weighting important?


Term weighting is important because it helps prioritize significant terms over common ones,
improving the accuracy of text analysis and retrieval.

What is the purpose of Jaccard’s coefficient?


Jaccard’s coefficient measures the similarity between two sets by comparing the size of their
intersection to the size of their union. It quantifies how similar two sets are, with a coefficient
ranging from 0 (no similarity) to 1 (identical sets).
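A minimal sketch computing Jaccard's coefficient between two token sets (whole words are used here purely for illustration; character n-grams or shingles are also common):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

s1 = set("the cat sat on the mat".split())
s2 = set("the cat lay on the rug".split())
print(jaccard(s1, s2))  # 3 shared tokens out of 7 distinct tokens ≈ 0.429
```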
Short Questions:

Briefly discuss the meaning components of a language.

The meaning components of a language are the elements that contribute to conveying meaning
in communication. These components include:

1. Lexical Semantics: This relates to the meaning of individual words and phrases.
Lexical semantics explores word meanings, relationships between words (such as
synonyms and antonyms), and word senses (polysemy and homonymy).
2. Compositional Semantics: This examines how words combine to form phrases and
sentences, ensuring that the meaning of a larger expression is derived from the
meanings of its parts and their syntactic arrangement. It follows the principle that
sentence meaning arises from both word meanings and grammatical structure.
3. Pragmatics: Pragmatics deals with how context influences meaning. This includes
understanding implied meanings, conversational implicatures, and the roles of the
speaker and listener. Pragmatics allows us to interpret meanings based on social
context, tone, or cultural background.
4. Discourse Semantics: Discourse semantics examines how sentences relate to one
another within a larger text or conversation, creating coherence. It involves reference
(identifying pronouns and their antecedents), presupposition, and text structure to
maintain context and flow.
5. Thematic Roles: Thematic roles refer to the functions words or phrases play in
sentences, such as agent (doer of action), patient (receiver of action), or instrument
(means by which action is done). They clarify the relationships between verbs and
nouns in a sentence.

These meaning components work together to construct and interpret meaning in language,
allowing for clear and contextually appropriate communication.

Differentiate between the rationalist and empiricist approaches to natural language processing.

The rationalist approach in NLP uses predefined linguistic rules and grammar theories,
focusing on innate structures for language understanding. It is highly interpretable but less
adaptable to real-world language variability.

The empiricist approach relies on data-driven methods, using statistical models and machine
learning to learn patterns from large datasets. It’s highly adaptable but often lacks
interpretability and requires extensive labelled data.

Essentially, rationalists emphasize rules, while empiricists prioritize learning from data.
List the motivation behind the development of computational models of languages.
The motivation behind the development of computational models of languages includes:

Automating Language Tasks: To perform tasks like translation, summarization, and sentiment analysis efficiently.
Understanding Human Language: To gain insights into linguistic structures and cognitive
processes.
Enhancing Communication: To improve human-computer interaction through natural
language interfaces.
Processing Large Data: To analyze and extract information from vast amounts of textual
data.
Enabling AI Applications: To support AI systems in tasks requiring language understanding,
like virtual assistants and chatbots.

Give the representation of a sentence in d-structure and s-structure in GB.


In the context of syntax theory, D-Structure (Deep Structure) represents the underlying
syntactic structure of a sentence, capturing the core relationships between elements. S-
Structure (Surface Structure), on the other hand, represents the actual word order and
arrangement as it appears in spoken or written language. While D-Structure focuses on the
abstract, more fundamental organization, S-Structure reflects how those abstract elements are
realized in real-world language use.

Compare GB and PG. Why is PG called syntactico-semantic theory?

Government and Binding (GB) is a syntactic theory that describes sentence structure through abstract levels of representation (D-structure and S-structure) and general principles such as government and binding. It treats syntax as largely autonomous, with semantics handled separately.

Paninian Grammar (PG), on the other hand, is called a syntactico-semantic theory because it does not separate syntax from semantics. Its central notion, the karaka relations (such as karta, the agent-like participant, and karma, the theme-like participant), links the surface form of a sentence (word order, case endings, and postpositions, i.e. vibhakti) directly to the semantic roles that participants play in the action denoted by the verb. Analysis in PG therefore maps the syntactic cues of a sentence onto a level of representation that is already semantic in character, whereas GB stops at syntactic structure.

What are the problems associated with n-gram model? How are these problems handled?

Problems with n-gram Models:

1. Sparsity: Many unseen word sequences.


2. Limited Context: Only captures context within a fixed window.
3. Data Requirements: Needs large amounts of data for accurate estimates.
4. Scalability: High computational and memory costs with large n.

Solutions:

1. Smoothing: Techniques like Laplace, Good-Turing, and Kneser-Ney adjust probabilities for unseen sequences (a small Laplace-smoothing sketch follows this list).
2. Back-off and Interpolation: Use lower-order models when higher-order ones are not
available and combine probabilities from different models.
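As a small illustration of the smoothing idea, the sketch below applies add-one (Laplace) smoothing to bigram probabilities estimated from a toy corpus (the corpus is made up for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat lay on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(w1, w2):
    """P(w2 | w1) with add-one smoothing, so unseen bigrams still get a non-zero probability."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("the", "cat"))  # a bigram seen in the corpus
print(p_laplace("cat", "rug"))  # an unseen bigram, still non-zero
```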
What are lexical rules? Give the complete entry for a verb in the lexicon, to be used in LFG.

In linguistics, lexical rules define the properties and behavior of words in a language. They
specify:

1. Morphological structure (prefixes, suffixes, roots)


2. Syntactic category (part-of-speech)
3. Semantic features (meaning, sense)
4. Phonological representation (pronunciation)

Lexical Entry for a Verb in LFG (Lexical Functional Grammar):

This lexical entry provides information about the verb "write" for use in LFG. It includes
phonological, morphological, syntactic, semantic, and functional information. Here's a
complete entry for the verb "write":

Verb: write

- Phonology: /raɪt/
- Morphology: write (base), writes (3rd person singular), writing (present participle), wrote
(past), written (past participle)
- Syntax:
- Category: V (verb)
- Subcategorization: [NP] (transitive) or [] (intransitive)
- Argument Structure: <AGENT, THEME>
- Semantics:
- Sense: create_written_text
- Semantic Features: [+dynamic], [+volitional], [+communicative]
- Functional Information:
- Subject: AGENT
- Object: THEME
- Predicate: write
Define a finite automaton that accepts the following language: (aa)*(bb)*.

Do it by yourself (DIBY).
Compute the minimum edit distance between paecful and peaceful.
To compute the minimum edit distance between "paecful" and "peaceful", the allowed operations are:

Insertions: Add missing characters.
Deletions: Remove extra characters.
Substitutions: Replace characters.

Here is one minimal sequence of edits transforming "paecful" into "peaceful":

paecful → peaecful (insert 'e' after 'p')
peaecful → peaccful (substitute the second 'e' with 'c')
peaccful → peaceful (substitute the second 'c' with 'e')

Total edit distance: 3 (1 insertion + 2 substitutions)
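A minimal dynamic-programming sketch (unit costs for insertion, deletion, and substitution) can be used to check this result:

```python
def min_edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[m][n]

print(min_edit_distance("paecful", "peaceful"))  # 3
```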
Construct a finite automaton for the regular expression (a+b)*abb.
NFA and DFA: DIBY
Construct a nondeterministic finite automaton accepting {ab, ba}, and use it to find a
deterministic automaton accepting the same set.

DIBY

Construct a nondeterministic finite automaton accepting the set of all strings over {a, b}
ending in aba. Use it to construct DFA accepting the same set of strings.

DIBY
Construct a transition system which can accept strings over the alphabet a, b,….containing
either cat or rat.

DIBY
Find all strings of length 5 or less in the regular set represented by the following regular
expressions:
(a) (ab + a)*(aa + b)
(b) (a*b + b*a)*a

(a) (ab + a)*(aa + b)

Strings of length 5 or less: b, aa, ab, aaa, aab, abb, aaaa, aaab, aabb, abaa, abab, aaaaa, aaaab, aaabb, aabab, aabaa, abaab, abaaa, ababb

(b) (a*b + b*a)*a

Since both 'a' (from b*a with zero b's) and 'b' (from a*b with zero a's) can be generated as single units, (a*b + b*a)* generates every string over {a, b}. The whole expression therefore denotes all strings ending in 'a', so the strings of length 5 or less are exactly the strings over {a, b} of length 1 to 5 that end in 'a' (31 strings in all), e.g. a, aa, ba, aaa, aba, baa, bba, aaaa, abba, baba, bbba, aaaaa, ababa, bbbaa, bbbba, …

Note: The '*' operator represents zero or more occurrences, so the expressions can generate strings of arbitrary length. The lists above only cover strings of length 5 or less.

Describe, in the English language, the sets represented by the following regular expressions:
(a) a(a+b)*ab
(b) (aa+b)*(bb+a)*
(a) a(a+b)*ab

This regular expression represents the set of all strings that:

- Start with 'a'


- Followed by any number of 'a's and 'b's (including none)
- End with 'ab'

In other words, the set includes strings that start with 'a' and end with 'ab', with any
combination of 'a's and 'b's in between.

Examples: aab, aaab, abab, aabab, aabbab, ...

(b) (aa+b)*(bb+a)*

This regular expression represents the set of all strings that consist of:

- any sequence (possibly empty) of blocks 'aa' and single 'b's, followed by
- any sequence (possibly empty) of blocks 'bb' and single 'a's.

In other words, the first part contributes 'a's only in pairs (with 'b's interspersed), and the second part contributes 'b's only in pairs (with 'a's interspersed). The empty string is included.

Examples: aa, b, bb, a, ba, aabb, bba, bbaa, aabba, aabbaa, ...


Construct a finite automaton accepting all strings over {0, 1} ending in 010 or 0010.
DIBY

Find a derivation tree for a*b + a*b in the grammar G given by S → S + S | S * S, S → a | b.
DIBY
A context-free grammar G has the following productions:
S → 0S0 | 1S1 | A, A → 2B3, B → 2B3 | 3. Describe the language generated by this grammar.

To analyze the language generated by the grammar G, let's examine the productions:
1. Productions of the Grammar G:
o S→0S0
o S→1S1
o S →A
o A→2B3
o B→2B3
o B→3
We can derive strings by repeatedly applying these rules, which will help us understand the
language L(G).
Analysis of Each Production Rule
1. Rules for S:
o S→0S0 and S→1S1:
 These rules suggest that the string produced by S can have matching
pairs of 0's or 1's around a centre part derived from S.
 This means that any string generated by these rules will be
palindromic (it reads the same forwards and backwards).
o S→A:
 This rule allows S to generate a string that is derived from A, which we
analyze next.
2. Rules for A:
o A→2B3:
 This rule requires that any string derived from A begins with 2,
followed by some string derived from B, and ends with 3.
 Thus, any string generated by A will start with 2 and end with 3.
3. Rules for B:
o B→2B3:
 This recursive rule wraps the string derived from B in a matching 2 … 3 pair.
o B→3:
 This base case terminates the recursion, so B generates strings of the form 2^n 3^(n+1) for n ≥ 0 (a run of 2's followed by a run of 3's with one extra 3).
Consequently, A generates strings of the form 2^k 3^(k+1) with k ≥ 1, and S wraps such a string in a palindromic frame of 0's and 1's. The language generated is therefore
L(G) = { w 2^k 3^(k+1) w^R : w ∈ {0, 1}*, k ≥ 1 },
where w^R is the reverse of w.

Here is an example of derivations using the grammar G:

Derivation of 12331:
S → 1S1 → 1A1 → 12B31 → 12331
Consider the following productions:
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
For the string aaabbabbba, find
(a) the leftmost derivation,
(b) the rightmost derivation,
and (c) the parse tree.

DIBY

Show that the grammar S → a | abSb | aAb, A → bS | aAAb is ambiguous.


Hint: To show that the grammar is ambiguous, we need to find a string that has two different
parse trees.

DIBY
Show that the grammar S → aBb, A → aAB | a, B → ABb | b is ambiguous.
To show that the grammar is ambiguous, we need to find a string that has two different parse
trees or leftmost derivations.

Let's consider the string "aaabbbb".

Here are two different leftmost derivations for the string:

Derivation 1 (leftmost):

S → aBb
→ aABbb (B → ABb)
→ aaBbb (A → a)
→ aaABbbb (B → ABb)
→ aaaBbbb (A → a)
→ aaabbbb (B → b)

Derivation 2 (leftmost):

S → aBb
→ aABbb (B → ABb)
→ aaABBbb (A → aAB)
→ aaaBBbb (A → a)
→ aaabBbb (B → b)
→ aaabbbb (B → b)

Since the same string has two different leftmost derivations (and hence two different parse trees), the grammar is ambiguous.

Draw the two different parse trees.


Give two possible parse trees for the sentence, Stolen painting found by tree.

Hint: Draw the parse tree


In Parse Tree 1, the prepositional phrase "by tree" attaches to the verb "found" as a locative modifier: the stolen painting was found near a tree.

In Parse Tree 2, "by tree" is parsed as the agent phrase of the passive verb "found": on this (absurd) reading, the tree is the one that found the stolen painting.

The ambiguity arises from the attachment of the prepositional phrase "by tree", which can be read either as a locative modifier of "found" or as the agent of the passive verb.

What are the advantages and disadvantages of using semantic grammar?


Advantages of Semantic Grammar:

1. Improved language understanding: Semantic grammar focuses on the meaning of language, allowing for a deeper understanding of the context and intended message.
2. More accurate parsing: By considering the semantics of the language, parsing becomes
more accurate, and the risk of misinterpretation is reduced.
3. Better handling of ambiguity: Semantic grammar can resolve ambiguities in language by
considering the context and intended meaning.
4. Enhanced machine learning: Semantic grammar can be used to improve machine learning
models, enabling them to better understand and generate human-like language.
5. More natural language processing: Semantic grammar enables more natural and human-
like language processing, making it suitable for applications like chatbots and virtual
assistants.

Disadvantages of Semantic Grammar:

1. Complexity: Semantic grammar is more complex than traditional grammar, requiring a deeper understanding of language and its nuances.
2. Context dependence: Semantic grammar relies heavily on context, making it challenging
to apply in situations where context is limited or unclear.
3. Ambiguity in semantics: Semantics can be ambiguous, leading to challenges in interpreting
the intended meaning.
4. Limited scalability: Semantic grammar can be difficult to scale for large and complex
languages.
5. Requires large datasets: Training semantic grammar models requires large datasets of
annotated text, which can be time-consuming and expensive to create.

Overall, semantic grammar offers improved language understanding and more accurate
parsing but comes with increased complexity and requirements for large datasets and context
awareness.
Discuss lexical ambiguity.
Lexical ambiguity, also known as word-sense ambiguity, occurs when a single word has
multiple related or unrelated meanings, making it difficult to determine the intended meaning
without additional context. This ambiguity can lead to confusion, misinterpretation, or
incorrect understanding of language.

Types of lexical ambiguity:

1. Homographs: Words with the same spelling but different meanings (e.g., "bank" as a
financial institution or the side of a river).
2. Homophones: Words with the same pronunciation but different meanings (e.g., "to", "too",
and "two").
3. Polysemy: Words with multiple related meanings (e.g., "head" as a body part or the source
of a river).
4. Metaphorical extension: Words with extended meanings beyond their literal sense (e.g.,
"heart" as a symbol of love).

Resolving lexical ambiguity:

1. Context: Use surrounding words and phrases to disambiguate the meaning.


2. Syntax: Analyze sentence structure to determine the intended meaning.
3. Semantics: Consider the relationships between words and their meanings.
4. World knowledge: Apply real-world knowledge to understand the intended meaning.
5. Disambiguation techniques: Use algorithms or machine learning models to resolve
ambiguity in natural language processing applications.

Lexical ambiguity is a fundamental challenge in natural language processing, and resolving it is crucial for accurate language understanding and effective communication.
Discuss syntactic ambiguity.
Syntactic ambiguity occurs when a sentence or phrase can be parsed in multiple ways due to
its structure. This can happen when a sentence has more than one possible syntactic structure
or parse tree.

Examples:
"I saw the man with the telescope."

Could mean either "I used a telescope to see the man" or "The man I saw had a telescope."
"The old man the boats."

Could mean "The elderly people are the ones who man the boats" or "The old man is the one
who controls the boats."
Causes:
Ambiguous Grammar: The grammar rules allow multiple valid interpretations.
Lexical Ambiguity: Words can have more than one meaning or function.
Resolution:
Context: Additional information from the surrounding text or situation can clarify the intended
meaning.
Disambiguation Techniques: In natural language processing, various algorithms and
heuristics can help choose the most likely interpretation.
Discuss semantic ambiguity.
Semantic ambiguity occurs when a word, phrase, or sentence has multiple meanings or
interpretations due to the inherent ambiguity in the semantics (meaning) of the language.

Examples:
"Bank"

Can mean a financial institution or the side of a river.


"He went to the club."

Can refer to a social venue or a sports organization.


"The bat flew out of the cave."

Could refer to a flying mammal or a piece of sports equipment.


Causes:
Polysemy: A single word having multiple meanings (e.g., "bank" for a financial institution
and "bank" for a riverbank).
Homonymy: Different words with the same form but different meanings (e.g., "bat" as a
mammal and "bat" as sports equipment).
Context Dependence: Meaning can change based on the context in which a word or phrase is
used.
Resolution:
Contextual Clues: Additional information or context can help disambiguate the intended
meaning.
Disambiguation Techniques: Techniques in natural language processing, such as semantic
role labelling or word sense disambiguation, use context to infer the correct meaning.
Discuss pragmatic ambiguity.
Pragmatic ambiguity arises when the meaning of a statement is unclear due to the context or
the speaker’s intentions, rather than due to the syntactic or semantic structure of the statement
itself.

Examples:
"Can you pass the salt?"

Pragmatic ambiguity arises if it is unclear whether the speaker is making a request or simply
asking if the listener is capable of passing the salt.
"I don't have any cash."

Could be ambiguous if it’s unclear whether the speaker means they don’t have cash on hand
right now or if they don’t have any cash at all.
"It’s cold in here."

Could be ambiguous whether the speaker is simply stating a fact or indirectly requesting that
someone adjust the temperature.
Causes:
Context Dependence: The meaning of a statement can vary depending on the situational
context, including the relationship between speakers and the specific circumstances.
Implicature: The intended meaning might rely on implied information or assumptions that are
not explicitly stated.
Speaker Intent: The ambiguity can stem from the speaker's intentions or the conversational
norms and expectations.
Resolution:
Contextual Understanding: Interpreting pragmatic ambiguity often requires understanding the
broader context of the conversation and the relationships between participants.
Clarification: Asking for clarification or providing additional information can help resolve
pragmatic ambiguity.
Conversational Norms: Understanding social norms and conventions can also aid in
interpreting ambiguous statements.
Discuss knowledge-based word sense disambiguation approaches.
Knowledge-based word sense disambiguation (WSD) approaches use external resources and
structured knowledge to determine the correct sense of a word in context. These approaches
leverage various types of linguistic and encyclopedic information. Here are some key types:

Dictionary-Based Approaches:

Use definitions and example sentences from dictionaries or thesauri (e.g., WordNet) to match
the context of a word with its possible senses.
Ontology-Based Approaches:

Utilize structured knowledge bases like ontologies (e.g., WordNet, FrameNet) to understand
the relationships between different senses and their contexts.
Knowledge Graph-Based Approaches:

Employ knowledge graphs or semantic networks to capture relationships between concepts and disambiguate word senses based on contextual relevance.
Rules and Heuristics:

Apply predefined rules or heuristics based on linguistic knowledge or patterns observed in data to determine the most likely word sense.
These approaches generally rely on rich, external knowledge resources to resolve ambiguities
by comparing the context in which a word appears to the structured information about its
possible meanings.

Discuss corpus-based word sense disambiguation approaches.


Corpus-based word sense disambiguation (WSD) approaches use statistical and machine
learning methods to determine word senses based on context within large text corpora. Key
types include:

Statistical Methods:

Co-occurrence Frequencies: Analyze patterns of word co-occurrence to determine the most likely sense based on context.
Supervised Learning:

Feature Extraction: Use context features (e.g., surrounding words) to train classifiers on
annotated data to predict word senses.
Unsupervised Learning:

Clustering: Group contexts into clusters and infer senses based on these clusters without
labeled data.
Word Embeddings:

Contextual Embeddings: Utilize models like Word2Vec or BERT to represent words in high-
dimensional space, aiding sense disambiguation based on context similarity.
These approaches rely on analyzing large text datasets to capture and use contextual patterns
for accurate sense identification.
Discuss Lesk’s algorithm.
Lesk’s Algorithm is a method for word sense disambiguation that determines the most
appropriate sense of a word based on its context by comparing it with the definitions of
potential senses.

Key Steps:
Context Extraction: Collect words surrounding the ambiguous word in the sentence.
Sense Definitions: Retrieve definitions for each possible sense from a lexical resource (e.g.,
WordNet).
Overlap Calculation: Measure the overlap between words in the context and words in each
sense’s definition.
Sense Selection: Choose the sense with the highest overlap as the most likely meaning.
Strengths:
Simple and straightforward.
Does not require training data.
Limitations:
Dependent on the quality of definitions in the lexical resource.
May not handle complex contexts or nuanced meanings well.
Lesk’s Algorithm uses context and definitions to resolve ambiguities by finding the sense with
the greatest contextual overlap.
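A simplified Lesk sketch using WordNet glosses is shown below; it assumes NLTK and its WordNet corpus are installed (NLTK also ships its own implementation in nltk.wsd.lesk):

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    """Pick the sense whose dictionary definition overlaps most with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sentence = "I deposited my paycheck at the bank near the river".split()
print(simplified_lesk("bank", sentence))  # prints the highest-overlap WordNet synset
```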
Discuss Walker’s algorithm.
Walker's algorithm is a thesaurus-based approach to word sense disambiguation. Each sense of an ambiguous word is associated with a subject category (for example, the subject codes of a thesaurus such as Roget's), and the surrounding context is used to decide which category, and hence which sense, is intended.

Key steps:

1. For each sense of the ambiguous word, identify the thesaurus subject category to which that sense belongs.
2. For every word in the context, check which of these subject categories it can belong to, and add to the score of each matching category.
3. Select the sense whose subject category receives the highest total score.

Strengths:

1. Simple and does not require sense-annotated training data.
2. Works well when the competing senses belong to clearly different subject areas.

Limitations:

1. Depends on the coverage and quality of the thesaurus categories.
2. Performs poorly when different senses fall under the same or closely related categories, or when the context words are topic-neutral.

Like Lesk's algorithm, Walker's algorithm is a knowledge-based method: it relies on an external lexical resource rather than on learned statistical models.
Discuss the concept of Bayesian classification.
Bayesian classification is a statistical approach to classify data based on Bayes' theorem. It
predicts the probability of a data point belonging to a particular class, given its features.

Key components:

1. Prior probability: Initial probability of a class before observing the data.


2. Likelihood: Probability of observing the data given a class.
3. Posterior probability: Updated probability of a class after observing the data.

How it works:

1. Calculate prior probabilities for each class.


2. Calculate likelihoods for each class given the data.
3. Apply Bayes' theorem to update posterior probabilities.
4. Classify data points based on the highest posterior probability.

Advantages:

1. Handles uncertainty and ambiguity.


2. Incorporates prior knowledge.
3. Flexible and interpretable.

Common applications:

1. Spam detection
2. Sentiment analysis
3. Image classification
4. Medical diagnosis

Bayesian classification provides a probabilistic framework for classification, allowing for nuanced and informed decision-making.
Discuss the concept of Naïve Bayesian classification.
Naïve Bayesian (NB) classification is a simple, probabilistic approach to classification based
on Bayes' theorem. It assumes independence between features, making it "naïve".

Key characteristics:

1. Independence assumption: Features are assumed to be independent given the class.


2. Probabilistic: Predicts probabilities of class membership.
3. Simple and efficient: Computationally fast and easy to implement.

How it works:

1. Calculate prior probabilities for each class.


2. Calculate likelihoods for each feature given each class.
3. Apply Bayes' theorem to update posterior probabilities.
4. Classify data points based on the highest posterior probability.

Advantages:

1. Handles high-dimensional data.


2. Robust to noise and missing values.
3. Fast training and prediction times.

Disadvantages:

1. Independence assumption can be unrealistic.


2. Sensitive to feature scaling.

Common applications:

1. Text classification (spam vs. non-spam)


2. Sentiment analysis
3. Medical diagnosis
4. Image classification

Naïve Bayesian classification provides a simple, effective, and interpretable approach to classification, despite its independence assumption limitation.
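A minimal text-classification sketch with scikit-learn's MultinomialNB is shown below; the four training texts and their spam/ham labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["win money now", "limited offer click here", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words features + Naive Bayes
model.fit(texts, labels)
print(model.predict(["free money offer", "see you at the meeting"]))  # likely ['spam' 'ham']
```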
Discuss k-nearest neighbour algorithm with example.
K-Nearest Neighbour (KNN) algorithm:

- A supervised learning algorithm used for classification and regression tasks.


- Based on the idea that similar instances are likely to have similar properties.

How it works:

1. Choose a value for K (number of nearest neighbours).


2. Calculate the distance between the new instance and all training instances.
3. Select the K most similar instances (nearest neighbours).
4. Predict the class or value based on the majority vote of the K neighbours.

Example:

Suppose we want to classify a new person as "fit" or "unfit" based on their age and weight.

Training data:

| Age | Weight | Class |


| --- | --- | --- |
| 25 | 70 | fit |
| 30 | 80 | unfit |
| 28 | 75 | fit |
| 35 | 90 | unfit |

New instance: Age = 29, Weight = 78

1. Calculate distances:
- Distance to (25, 70) = √((29-25)^2 + (78-70)^2) = √80 ≈ 8.94
- Distance to (30, 80) = √((29-30)^2 + (78-80)^2) = √5 ≈ 2.24
- Distance to (28, 75) = √((29-28)^2 + (78-75)^2) = √10 ≈ 3.16
- Distance to (35, 90) = √((29-35)^2 + (78-90)^2) = √180 ≈ 13.42
2. Select K=3 nearest neighbours: (30, 80), (28, 75), (25, 70)
3. Predict class: Majority vote = "fit" (2 out of 3 neighbours)

KNN is a simple, intuitive algorithm, but it can be slow for large datasets and high-
dimensional data.
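
The worked example above can be reproduced with plain Python; the sketch below simply recomputes the Euclidean distances and takes a majority vote (it is not a general-purpose implementation):

```python
import math
from collections import Counter

# Training data from the example: (age, weight) -> class
train = [((25, 70), "fit"), ((30, 80), "unfit"), ((28, 75), "fit"), ((35, 90), "unfit")]
new = (29, 78)
k = 3

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Pick the k training points closest to the new instance and vote on their labels.
neighbours = sorted(train, key=lambda item: euclidean(item[0], new))[:k]
votes = Counter(label for _, label in neighbours)

print(neighbours)            # [((30, 80), 'unfit'), ((28, 75), 'fit'), ((25, 70), 'fit')]
print(votes.most_common(1))  # [('fit', 2)] -> predicted class "fit"
```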
Use systemic grammar to analyse to handle the following sentences:
(a) Savitha will sing a song.
(b) Savitha sings a song.

Systemic grammar, developed by Michael Halliday, is a functional approach to analyzing language. Here's the analysis of the given sentences:

(a) Savitha will sing a song.

- Process: "will sing" (verbal process, future tense)


- Participant: "Savitha" (actor, performer)
- Goal: "a song" (affected entity, thing being sung)

- Clause structure: Subject (Savitha) - Finite (will) - Predicator (sing) - Object (a song)
- Transitivity: The process is transitive, with Savitha acting on the song.
- Modality: The sentence expresses future possibility, with "will" indicating a prediction.

(b) Savitha sings a song.

- Process: "sings" (verbal process, present tense)


- Participant: "Savitha" (actor, performer)
- Goal: "a song" (affected entity, thing being sung)

- Clause structure: Subject (Savitha) - Finite/Predicator (sings) - Object (a song)


- Transitivity: The process is transitive, with Savitha acting on the song.
- Modality: The sentence expresses present reality, with no modality marker.

In systemic grammar, the analysis focuses on the functional roles of elements in the sentence,
such as process, participant, and goal. This approach highlights the relationships between
elements and their contributions to the overall meaning.
Use the FUF grammar to build a fully unified FD for the following sentences:
(a) Savitha will sing a song.
(b) Savitha sings a song.

FUF (Functional Unification Formalism) grammar is a theoretical framework for natural language processing. Here's the analysis of the given sentences:

(a) Savitha will sing a song.

- FD (Functional Description):
- (Savitha [NP]) AGENT
- (will [AUX]) TENSE FUTURE
- (sing [V]) ACTION
- (a song [NP]) GOAL

- Unification:
- AGENT → Savitha
- TENSE → FUTURE
- ACTION → sing
- GOAL → a song

(b) Savitha sings a song.

- FD (Functional Description):
- (Savitha [NP]) AGENT
- (sings [V]) ACTION
- (a song [NP]) GOAL

- Unification:
- AGENT → Savitha
- ACTION → sings
- GOAL → a song

In FUF grammar, the FD represents the functional structure of the sentence, and unification
integrates the information into a single, unified representation. The main difference between
the two sentences is the presence of the auxiliary verb "will" in sentence (a), indicating future
tense.

Note: FUF grammar is a complex and abstract framework, and this analysis is a simplified
representation of the actual process.
In FUF, the Functional Description (FD) represents the abstract, functional aspects of a sentence, while unification is the process of integrating different pieces of information into a coherent, unified representation.

Example Analysis:
Consider two sentences:

(a) "She will go to the store."


(b) "She goes to the store."
Functional Structure (FD):
Sentence (a): The FD includes elements for future tense (indicated by "will") and the main
verb "go".
Sentence (b): The FD has the present tense verb "go" without an auxiliary.
Unification:
In Sentence (a): Unification integrates the future tense auxiliary "will" with the verb "go" to
indicate that the action is in the future.
In Sentence (b): Unification integrates only the present tense verb "go" with the subject and
object.
Main Difference:
The presence of the auxiliary verb "will" in sentence (a) indicates future tense, affecting the
functional structure. Unification processes the auxiliary and the main verb to reflect this future
tense accurately.
This process demonstrates how FUF grammar manages functional and syntactic information
to generate and analyze sentences.

Discuss the concept of precision and recall and their use.


Precision and recall are metrics for evaluating classification systems:
 Precision: Measures the accuracy of positive predictions. It is the ratio of true positives
to the total number of positive predictions (true positives + false positives). High
precision means fewer false positives.

Precision = True Positives / (True Positives + False Positives)

 Recall: Measures the ability to identify all relevant instances. It is the ratio of true
positives to the total number of actual positives (true positives + false negatives). High
recall means fewer false negatives.

Recall = True Positives / (True Positives + False Negatives)

Use:
 Precision: Important when the cost of false positives is high (e.g., spam detection).
 Recall: Crucial when the cost of false negatives is high (e.g., medical diagnoses).
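
A minimal Python helper that computes both metrics from raw confusion-matrix counts; the counts in the example call are hypothetical:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical spam-filter counts: 40 true positives, 10 false positives, 20 false negatives.
print(precision_recall(tp=40, fp=10, fn=20))   # (0.8, 0.666...)
```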
Discuss the basic information retrieval process.
The basic information retrieval (IR) process includes:

Query Formulation: Users submit a query, which may be preprocessed for better matching.

Document Indexing: Documents are processed and indexed, creating an inverted index that
maps terms to document locations.

Matching: The system uses a retrieval model to match the query with indexed documents and
ranks them based on relevance.

Retrieval: Relevant documents are selected and retrieved based on their relevance scores.

Presentation: Retrieved documents are displayed to the user in a ranked order.

Feedback: User interactions and feedback can help refine and improve the retrieval process.

This process aims to efficiently find and present the most relevant documents based on user
queries.

Discuss Zipf’s law.


Zipf’s Law is a statistical principle that describes the frequency distribution of words in a
language. It states that in a large corpus of text:
 The frequency of a word is inversely proportional to its rank. For example, the second
most frequent word occurs about half as often as the most frequent word, the third
most frequent word occurs about a third as often, and so on.
 Mathematically, it can be expressed as f(r) = C / r, where f(r) is the frequency of the
word with rank r and C is a constant.

Implications:
 A few words are very common, while many words are rare.
 This distribution pattern is observed not only in language but also in other areas like
city populations and species distributions.
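
A quick way to see the pattern empirically is to count word frequencies in any sizeable text and check that rank × frequency stays roughly constant; "corpus.txt" below is a placeholder path, not a file referenced by this document:

```python
from collections import Counter

# 'corpus.txt' is a placeholder; substitute any large plain-text file.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

for rank, (word, freq) in enumerate(Counter(words).most_common(10), start=1):
    print(rank, word, freq, rank * freq)   # the last column approximates the constant C
```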

Discuss term weighting.


Term weighting is a technique used to evaluate the importance of terms in documents. The
main methods include:
1. Term Frequency (TF): Measures how often a term appears in a document, with higher
frequencies indicating greater importance.
2. Inverse Document Frequency (IDF): Measures how common or rare a term is across
the entire document collection. Terms that appear in fewer documents are given higher
weight.
3. TF-IDF: Combines TF and IDF to balance term importance within a document and
across the collection, highlighting terms that are both frequent in a document and rare
in the overall collection.
Term weighting helps improve search results and text classification by emphasizing relevant
terms based on their significance.
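
A minimal sketch of the TF-IDF computation over an invented three-document collection; real systems typically use smoothed IDF variants and library implementations:

```python
import math

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]          # invented toy collection
tokenised = [d.split() for d in docs]
N = len(tokenised)

def tf(term, doc):
    return doc.count(term) / len(doc)       # term frequency within one document

def idf(term):
    df = sum(1 for doc in tokenised if term in doc)
    return math.log(N / df)                 # rarer across the collection -> larger IDF

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", tokenised[0]))          # ~0.18: frequent here, rare elsewhere
print(tf_idf("the", tokenised[0]))          # ~0.14: common across documents, lower weight
```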
Long Answer type questions:
What is the role of transformational rules in transformational grammar? Explain with the help
of examples.
In transformational grammar, transformational rules play a crucial role in generating and
understanding the complex structures of sentences by transforming one sentence structure into
another. These rules help in capturing the relationships between different sentence forms and
explaining how different structures can convey similar meanings.
Role of Transformational Rules
1. Sentence Generation:
o Transformational rules help generate sentences by transforming basic sentence
structures into more complex forms. For example, generating passive
constructions from active ones.
2. Structural Relationships:
o They capture the relationship between different sentence structures, such as
declarative, interrogative, and imperative forms.
3. Syntax and Semantics:
o Transformational rules provide insights into how syntactic structures
correspond to semantic meanings, revealing how different structures can
express similar ideas.
Examples
1. Active to Passive Transformation:
o Active Sentence: "The chef cooked the meal."
o Passive Sentence: "The meal was cooked by the chef."
o Rule: Move the object ("the meal") to the subject position, and the verb is
changed to its passive form ("was cooked").
2. Question Formation:
o Declarative Sentence: "You are reading a book."
o Interrogative Sentence: "Are you reading a book?"
o Rule: Move the auxiliary verb ("are") to the beginning of the sentence to form
a question.
3. Negation:
o Affirmative Sentence: "She is coming to the party."
o Negative Sentence: "She is not coming to the party."
o Rule: Insert the negation word ("not") before the main verb.
Transformational rules thus allow for the generation of a variety of sentence forms and help
in understanding the syntactic and semantic relationships between them. They are essential
for analyzing and constructing complex sentence structures in transformational grammar.
Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a,b such that each a is immediately preceded by and
immediately followed by a b;

Here are the regular expressions for each of the given languages:
1. The set of all alphabetic strings:
o Regular Expression: [a-zA-Z]*
o Explanation: This expression matches any string made up of alphabetic
characters (both lowercase a-z and uppercase A-Z), including the empty string.
The * denotes zero or more occurrences.
2. The set of all lowercase alphabetic strings ending in a b:
o Regular Expression: [a-z]*b
o Explanation: This expression matches any string consisting of lowercase
alphabetic characters (a-z) that ends with the letter b. The * allows for zero or
more occurrences of lowercase letters before the final b.
3. The set of all strings from the alphabet a,b such that each a is immediately
preceded by and immediately followed by a b:
o Regular Expression: (b+(ab+)*)?
o Explanation:
b+ matches an initial run of one or more b's.
(ab+)* matches zero or more blocks consisting of a single a followed by one or
more b's, so every a is immediately preceded and followed by a b. The outer ?
also admits the empty string, and strings consisting only of b's are allowed.
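
These patterns can be checked quickly with Python's re module, which uses the same character-class and repetition syntax:

```python
import re

lower_end_b = re.compile(r"[a-z]*b")        # language 2
a_between_b = re.compile(r"(b+(ab+)*)?")    # language 3

print(bool(lower_end_b.fullmatch("aab")))    # True  - lowercase string ending in b
print(bool(lower_end_b.fullmatch("aba")))    # False - does not end in b
print(bool(a_between_b.fullmatch("bab")))    # True  - the a sits between two b's
print(bool(a_between_b.fullmatch("bbabb")))  # True
print(bool(a_between_b.fullmatch("ab")))     # False - a is not preceded by a b
```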
Write regular expressions for the following languages. By “word”, we mean an alphabetic
string separated from other words by whitespace, any relevant punctuation, line breaks, and
so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and
“the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that end at the end of
the line with a word;
3. all strings that have both the word grotto and the word raven in them (but not, e.g., words
like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a register. Deal with
punctuation
Here are the regular expressions for the specified languages:
1. Strings with two consecutive repeated words:
o Regular Expression: \b(\w+)\s+\1\b
o Explanation:
 \b(\w+) matches a word and captures it in a group.
 \s+ matches one or more whitespace characters separating the words.
 \1 refers to the captured word, ensuring it matches the next word
exactly.
2. Strings that start with an integer and end with a word:
o Regular Expression: ^\d+\b.*\b[a-zA-Z]+$
o Explanation:
 ^\d+ matches one or more digits at the beginning of the line.
 .* matches any characters (including whitespace) in between.
 \b[a-zA-Z]+$ matches an alphabetic word at the end of the line.
3. Strings containing both the word "grotto" and the word "raven":
o Regular Expression: \bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b
o Explanation:
 \b: Matches a word boundary.
 grotto: Matches the word "grotto".
 .*: Matches any character zero or more times.
 \braven\b: Matches the word "raven".
 |: Logical OR.
 The two alternatives ensure that both words are present in any order.

4. Pattern for the first word of an English sentence, handling punctuation:
o Regular Expression: ^[^a-zA-Z]*([a-zA-Z]+)
o Explanation:
 ^[^a-zA-Z]*: Skips any leading whitespace, quotation marks, or other
punctuation at the start of the sentence.
 ([a-zA-Z]+): Captures the first alphabetic word in register (group) 1.

These regular expressions handle the specified patterns and constraints, dealing with
words, integers, and sentence structure.
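
A brief sanity check of two of these patterns in Python; the test strings are invented, and re.IGNORECASE is added so that "The the" counts as a repetition:

```python
import re

# 1. Two consecutive repeated words.
m = re.search(r"\b(\w+)\s+\1\b", "The the bug bit the big dog.", flags=re.IGNORECASE)
print(m.group(1) if m else None)            # 'The'

# 4. First word of a sentence captured in register (group) 1, skipping leading punctuation.
m = re.match(r"^[^a-zA-Z]*([a-zA-Z]+)", '"Well, that was odd," she said.')
print(m.group(1))                           # 'Well'
```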
What are f-structure and C-structure in LFG? Consider the following sentence and explain
both the structure, “She saw stars in the sky”

In Lexical Functional Grammar (LFG), f-structure (functional structure) and c-structure (constituent structure) are two complementary representations of a sentence's structure.

F-structure:

- Represents the functional relationships between elements, such as subject-verb-object.


- Focuses on the predicate-argument structure.
- Uses attributes like PRED (predicate), SUBJ (subject), OBJ (object), etc.

F-structure for "She saw stars in the sky":

- PRED: 'see <SUBJ, OBJ>'


- SUBJ: 'She'
- OBJ: 'stars'
- ADJUNCT: 'in the sky' (prepositional phrase modifying the verb)

C-structure:

- Represents the hierarchical constituent structure of a sentence.


- Focuses on the syntactic categories and their relationships.
- Uses categories like S (sentence), NP (noun phrase), VP (verb phrase), etc.

C-structure for "She saw stars in the sky":

- S
  - NP (She)
  - VP
    - V (saw)
    - NP (stars)
    - PP (in the sky)
      - P (in)
      - NP (the sky)

In summary, f-structure highlights the functional relationships, while c-structure shows the
hierarchical constituent structure. Both structures are necessary for a comprehensive
understanding of sentence structure in LFG.
What are Karaka relations? Explain the difference between Karta and Agent.
Karaka relations are semantic roles in Sanskrit grammar that describe the relationships
between a verb and its arguments:

Karta: The doer or agent performing the action (e.g., "John" in "John eats an apple").
Karma: The object or recipient affected by the action (e.g., "apple" in "John eats an apple").
Difference:

Karta: Specifically denotes the entity executing the action in a sentence.


Agent: A broader term often used in various linguistic contexts to describe the entity
performing an action, similar to Karta but with potentially wider applications.

For instance:

In the sentence "Ram reads a book," "Ram" is both the Karta and the Agent: he is the doer of
the action of reading.
The two notions can come apart, however. In "The key opened the lock," "the key" is the
Karta (the most independent participant with respect to the verb "open"), yet semantically it
is an instrument rather than an Agent. Similarly, in the passive "The book was read by Ram,"
"Ram" remains both the Agent and, in the Paninian analysis, the Karta (now marked by the
instrumental case), even though "book" has become the grammatical subject. Karta is thus a
verb-relative, syntactico-semantic role, while Agent is a purely semantic one.

Calculate the bigram probability for each of the words of the sentence “I want Chinese food.”

To calculate bigram probabilities, we need a large corpus of text data. This corpus will
provide the necessary frequency counts for both unigrams and bigrams.

Once we have the corpus, we can calculate the probability of a bigram using the following
formula:

P(w2 | w1) = Count(w1, w2) / Count(w1)

Where:

 P(w2 | w1) is the probability of word w2 given the previous word w1.
 Count(w1, w2) is the number of times the bigram w1, w2 appears in the corpus.
 Count(w1) is the number of times the word w1 appears in the corpus.

Example:

Let's assume a small corpus with the following sentences:

1. I want Chinese food.


2. I want Japanese food.
3. I like Italian food.

Based on this small corpus:

 P(want | I) = 2/3
 P(Chinese | want) = 1/2
 P(food | Chinese) = 1/1 = 1
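
These estimates can be reproduced by counting unigrams and bigrams over the three-sentence toy corpus, for example:

```python
from collections import Counter

corpus = ["I want Chinese food", "I want Japanese food", "I like Italian food"]
tokens = [s.split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]     # P(w2 | w1) = Count(w1, w2) / Count(w1)

print(p("want", "I"))         # 2/3
print(p("Chinese", "want"))   # 1/2
print(p("food", "Chinese"))   # 1/1 = 1.0
```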
We are given the following corpus:
<s> I am sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam</s>
Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include <s>
& </s> in your counts just like any other token.

To calculate P(Sam|am) using a bigram language model with add-one smoothing, follow these
steps:

1. Count the occurrences of "am" and "am Sam" in the corpus:

- Count(am) = 3 (from "I am sam", "Sam I am", and "I am Sam")
- Count(am Sam) = 1 (from "<s> I am Sam </s>" only; the counts are case-sensitive, so "sam"
in the first sentence is a different token)

2. Apply add-one smoothing, with vocabulary size V = 12 (the distinct tokens <s>, </s>, I,
am, sam, Sam, do, not, like, green, eggs, and):

- Count'(am) = Count(am) + V = 3 + 12 = 15
- Count'(am Sam) = Count(am Sam) + 1 = 1 + 1 = 2

3. Calculate the bigram probability:

- P(Sam|am) = Count'(am Sam) / Count'(am) = 2/15

So, P(Sam|am) = 2/15.

Note: Add-one smoothing adds 1 to the numerator and V to the denominator to avoid zero
probabilities and smooth the model.
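
The same result can be obtained programmatically; the sketch below keeps the case-sensitive convention used in the counts above, so "sam" and "Sam" are distinct types:

```python
from collections import Counter

# A space is inserted before </s> so str.split() treats it as a separate token.
sents = ["<s> I am sam </s>", "<s> Sam I am </s>", "<s> I am Sam </s>",
         "<s> I do not like green eggs and Sam </s>"]
tokens = [s.split() for s in sents]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
V = len(unigrams)                                    # vocabulary size = 12 types

def p_add_one(w2, w1):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(V, p_add_one("Sam", "am"))                     # 12, 2/15 ~ 0.133
```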
We are given the following corpus:
<s> I am sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam</s>
If we use linear interpolation smoothing between a maximum-likelihood bi-gram model and
a maximum-likelihood unigram model with λ1 = 1/2 and λ2 = 1/2, what is P(Sam|am)? Include
in your counts just like any other token.
To calculate P(Sam|am) using linear interpolation smoothing, we need to calculate the
probabilities from both the maximum-likelihood bi-gram model and the maximum-likelihood
unigram model.

Bi-gram model:

- Count(am Sam) = 1
- Count(am) = 3
- P(Sam|am) = Count(am Sam) / Count(am) = 1/3

Unigram model:

- Count(Sam) = 3
- Count(total tokens) = 25 (every token in the corpus, counting <s> and </s> like any other token)
- P(Sam) = Count(Sam) / Count(total tokens) = 3/25

Linear interpolation smoothing:

- P(Sam|am) = λ1 * P_bigram(Sam|am) + λ2 * P_unigram(Sam)
= (1/2) * (1/3) + (1/2) * (3/25)
= 1/6 + 3/50
= 17/75 ≈ 0.227

So, P(Sam|am) = 17/75.
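
For completeness, the interpolated estimate can be checked the same way, again with case-sensitive counts and <s>/</s> treated as ordinary tokens:

```python
from collections import Counter

# A space is inserted before </s> so str.split() treats it as a separate token.
sents = ["<s> I am sam </s>", "<s> Sam I am </s>", "<s> I am Sam </s>",
         "<s> I do not like green eggs and Sam </s>"]
tokens = [s.split() for s in sents]
flat = [w for sent in tokens for w in sent]

unigrams = Counter(flat)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

p_bi = bigrams[("am", "Sam")] / unigrams["am"]    # 1/3
p_uni = unigrams["Sam"] / len(flat)               # 3/25 over 25 tokens
print(0.5 * p_bi + 0.5 * p_uni)                   # ~0.227 = 17/75
```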


How words can be handled in the tagging process?
In the tagging process, words can be handled in the following ways:

1. Tokenization: breaking down text into individual words or tokens.

2. Stopword removal: ignoring common words like "the", "and", etc. that do not carry much
meaning.

3. Stemming or Lemmatization: reducing words to their base form (e.g., "running" becomes
"run").

4. Feature extraction: representing words as numerical vectors (e.g., word embeddings like
Word2Vec).

5. Part-of-speech (POS) tagging: identifying the grammatical category of each word (e.g.,
noun, verb, adjective).

6. Named Entity Recognition (NER): identifying named entities like names, locations,
organizations.

7. Contextual analysis: considering the surrounding context to disambiguate word meanings.

8. Handling out-of-vocabulary (OOV) words: dealing with words not present in the training
data.

9. Subword modeling: breaking down words into subwords or character sequences.

10. Word sense induction: identifying different senses of words.

By handling words in these ways, the tagging process can effectively analyze and understand
the meaning of text.
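
Several of these steps, tokenization and POS tagging in particular, are available off the shelf; the sketch below assumes NLTK is installed and its "punkt" and "averaged_perceptron_tagger" resources have been downloaded:

```python
import nltk

text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)   # step 1: tokenization
tags = nltk.pos_tag(tokens)         # step 5: part-of-speech tagging

print(tags)   # e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```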
Comment on the validity of the following statements:
a) Rule-based taggers are non-deterministic
b) Stochastic taggers are language independent
c) Brill’s tagger is a rule-based tagger
Agree or disagree with the following statements:

a) Rule-based taggers are non-deterministic: DISAGREE

Rule-based taggers are deterministic, meaning they apply a set of predefined rules to assign
tags, and the output is predictable.

b) Stochastic taggers are language independent: DISAGREE

Stochastic taggers are language-dependent, as they rely on statistical models trained on specific language data.

c) Brill's tagger is a rule-based tagger: DISAGREE

Brill's tagger is actually a transformation-based tagger, which combines rule-based and stochastic approaches. It starts with an initial guess and applies a set of rules to transform and improve the tags.

In summary, rule-based taggers are deterministic, stochastic taggers are language-dependent, and Brill's tagger is a transformation-based tagger.

Construct a transition system corresponding to the regular expressions (i) (ab + c*)*b
(ii) a + bb + bab*a

DIBY
Construct a finite automaton recognizing L(G), where G is the grammar S → aS | bA | b and
A → aA | bS | a.

Hint:

To construct a finite automaton recognizing L(G), we need to first analyze the grammar G
and identify the patterns it generates.

The grammar G has two non-terminals, S and A, and two terminals, a and b. The productions
are:

S → aS | bA | b
A → aA | bS | a

From the productions, we can see that:

- S can generate strings starting with 'a' followed by any string generated by S, or strings
starting with 'b' followed by any string generated by A, or just 'b'.
- A can generate strings starting with 'a' followed by any string generated by A, or strings
starting with 'b' followed by any string generated by S, or just 'a'.

Remaining part: DIBY

Construct a deterministic finite automaton equivalent to the grammar S → aS | bS | aA,
A → bB, B → aC, C → A.

To construct a deterministic finite automaton (DFA) equivalent to the given grammar, we
need to analyze the grammar and identify the patterns it generates.

The grammar has four non-terminals (S, A, B, C) and two terminals (a, b). The productions
are:

S → aS | bS | aA
A → bB
B → aC
C→A

From the productions, we can see that:

- S can generate strings starting with 'a' or 'b' followed by any string generated by S, or strings
starting with 'a' followed by any string generated by A.
- A can generate strings starting with 'b' followed by any string generated by B.
- B can generate strings starting with 'a' followed by any string generated by C.
- C can generate strings generated by A.

Construct the DFA.


Find a reduced grammar equivalent to the grammar G whose productions are S → AB | CA,
B → BC | AB, A → a, C → aB | b

DIBY
Construct a reduced grammar equivalent to the grammar
S → aAa, A → Sb | bCC | DaA, C → abb | DD, E → aC, D → aDA

DIBY
Let G be S → AB, A → a, B → C | b, C → D, D → E and E → a. Eliminate unit productions and get
an equivalent grammar.

DIBY
Reduce the following grammar G to CNF. G is S → aAD, A → aB | bAB, B → b and D → d.
To reduce the grammar G to Chomsky Normal Form (CNF), we follow these steps:

Step 1: There are no ε-productions or unit productions to eliminate, so the grammar can be
used as it is.

Step 2: Replace terminals that occur in productions of length two or more with new
non-terminals. Introduce Ta → a. The terminal 'b' can be replaced by B and 'd' by D, since
B → b and D → d are their only productions:

- S → Ta A D
- A → Ta B | B A B
- B → b, D → d, Ta → a

Step 3: Break productions with more than two symbols on the right-hand side into binary
productions by introducing new non-terminals X1 and X2:

- S → Ta X1, X1 → A D
- A → Ta B | B X2, X2 → A B

The reduced grammar G' in CNF is:

S → Ta X1
X1 → A D
A → Ta B | B X2
X2 → A B
B → b
D → d
Ta → a

Note that in CNF, all productions are in one of the following forms:

- A → BC (where A, B, and C are non-terminals)
- A → a (where A is a non-terminal and 'a' is a terminal)

This grammar generates the same language as the original grammar G.
