NLP notes
(R18)
Natural Language Processing
Lecture Notes
Prepared by
Mrs. Swapna
(Professor & HOD-CSM)
Dept. CSE(AIML)
Course Objectives:
Introduce some of the problems and solutions of NLP and their relation to linguistics and statistics.
Course Outcomes:
Show sensitivity to linguistic phenomena and an ability to model them with formal grammars.
Understand and carry out proper experimental methodology for training and
evaluating empirical NLP systems
Able to manipulate probabilities, construct statistical models over strings and trees,
and estimate parameters using supervised and unsupervised training methods.
Able to design, implement, and analyze NLP algorithms
Able to design different language modeling Techniques.
UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and Challenges,
Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches,
Performances of the Approaches
UNIT - II
Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax,
Representation of Syntactic Structure, Parsing Algorithms, Models for Ambiguity Resolution in
Parsing, Multilingual Issues
UNIT - III
Semantic Parsing: Introduction, Semantic Interpretation, System Paradigms, Word Sense
Systems, Software.
UNIT - IV
Predicate-Argument Structure, Meaning Representation Systems, Software.
UNIT - V
Discourse Processing: Cohesion, Reference Resolution, Discourse Cohesion and Structure
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation,
Parameter Estimation, Language Model Adaptation, Types of Language Models, Language-
Specific Modeling Problems, Multilingual and Crosslingual Language Modeling
TEXT BOOKS:
1. Multilingual Natural Language Processing Applications: From Theory to Practice – Daniel M. Bikel and Imed Zitouni, Pearson Publication
2. Natural Language Processing and Information Retrieval: Tanveer Siddiqui, U.S. Tiwary
Unit-1
UNIT - I Finding the Structure of Words: Words and Their Components, Issues and Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the Approaches, Performances of the Approaches
NLP INTRODUCTION:
Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information (e.g., query generation, query execution, scanning results of the query to select items to read, reading non-relevant items).
When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments: relevant items that are retrieved, relevant items that are not retrieved, non-relevant items that are retrieved, and non-relevant items that are not retrieved.
Relevant items are those documents that contain information that helps the searcher in
answering his question.
Non-relevant items are those items that do not provide any directly useful information.
There are two possibilities with respect to each item: it can be retrieved or not retrieved by the user's query.
Two measures are used to judge the quality of the results:
Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
Where:
Number_Possible_Relevant is the number of relevant items in the database,
Number_Total_Retrieved is the total number of items retrieved from the query, and
Number_Retrieved_Relevant is the number of retrieved items that are relevant to the search.
Precision measures one aspect of information retrieval overhead for a user associated with a
particular search.
If a search has an 85 percent precision, then 15 percent of the user effort is overhead spent reviewing non-relevant items.
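The two measures can be computed directly from the retrieval counts. Below is a minimal sketch; the example numbers are chosen only for illustration:

```python
def precision(num_retrieved_relevant, num_total_retrieved):
    """Fraction of the retrieved items that are relevant."""
    return num_retrieved_relevant / num_total_retrieved

def recall(num_retrieved_relevant, num_possible_relevant):
    """Fraction of all relevant items in the database that were retrieved."""
    return num_retrieved_relevant / num_possible_relevant

# Suppose a query retrieves 20 items, 17 of them relevant, out of 40
# relevant items that exist in the whole database.
print(precision(17, 20))  # 0.85 -> 15% of the review effort is overhead
print(recall(17, 40))     # 0.425
```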
Recall gauges how well a system processing a particular query is able to retrieve the relevant items.
Functional Overview
A total Information Storage and Retrieval System is composed of four major functional processes:
1) Item Normalization
2) Selective Dissemination of Information (i.e., "Mail")
3) Document Database Search
4) Index Database Search, along with the Automatic File Build process that supports Index Files.
1) Item Normalization:
The first step in any integrated system is to normalize the incoming items to a
standard format. Item normalization provides logical restructuring of the item.
Additional operations during item normalization are needed to create a searchable
data structure: identification of processing tokens (e.g., words), characterization of the
tokens, and stemming (e.g., removing word endings) of the tokens.
The processing tokens and their characterization are used to define the
searchable text from the total received text. Figure 1.5 shows the normalization
process. Standardizing the input takes the different external formats of input data and
performs the translation to the formats acceptable to the system. A system may have a
single format for all items or allow multiple formats. One example of standardization
could be translation of foreign languages into Unicode. Every language has a different
internal binary encoding for the characters in the language; Unicode provides one standard encoding that can represent them all.
To assist the users in generating indexes, especially the professional indexers, the system provides a process called Automatic File Build (AFB).
3) Document Database Search
The Document Database Search Process provides the capability for a query to search
against all items received by the system. The Document Database Search process is
composed of the search process, user entered queries (typically ad hoc queries) and
the document database which contains all items that have been received, processed
and stored by the system. Typically items in the Document Database do not change
(i.e., are not edited) once received.
4) Index Database Search
When a user wishes to save an item of interest for future reference, this is done via the index process. In this process the user can logically store an item in a file along
with additional index terms and descriptive text the user wants to associate with the
item. The Index Database Search Process (see Figure 1.4) provides the capability to
create indexes and search them.
Every user can have one or more Private Index files leading to a very large number
of files. Each Private Index file references only a small subset of the total number of
items in the Document Database. Public Index files are maintained by professional
library services personnel and typically index every item in the Document Database.
There is a small number of Public Index files. These files have access lists (i.e., lists of
users and their privileges) that allow anyone to search or retrieve data. Private Index
files typically have very limited access lists. To assist the users in generating indexes,
especially the professional indexers, the system provides a process called Automatic
File Build shown in Figure 1.4 (also called Information Extraction).
The major capabilities provided to the user are Search Capabilities, Browse Capabilities, and Miscellaneous Capabilities, supported by Standards.
Search Capabilities
The search capabilities address both Boolean and Natural Language queries. The
algorithms used for searching are called Boolean, natural language processing and
probabilistic. Probabilistic algorithms use frequency of occurrence of processing
tokens (words) in determining similarities between queries and items and also in
predictors on the potential relevance of the found item to the searcher.
The newer systems such as TOPIC, RetrievalWare, and INQUERY all allow for natural
language queries.
Browse functions to assist the user in filtering the search results to find relevant
information are very important.
A user may weight search terms in either a Boolean or natural language interface. Given the following natural language query statement, the importance of a particular search term is indicated by a value in parentheses between 0.0 and 1.0, with 1.0 being the most important.
The search statement may apply to the complete item or contain additional
parameters limiting it to a logical division of the item (i.e., to a zone). Based upon the
algorithms used in a system many different functions are associated with the system’s
understanding the search statement. The functions define the relationships between
the terms in the search statement (e.g., Boolean, Natural Language, Proximity,
Contiguous Word Phrases, and Fuzzy Searches) and the interpretation of a particular
word (e.g., Term Masking, Numeric and Date Range, Contiguous Word Phrases, and
Concept/Thesaurus expansion).
Boolean Logic
Boolean logic allows a user to logically relate multiple concepts together to define
what information is needed. Typically the Boolean functions apply to processing
tokens identified anywhere within an item. The typical Boolean operators are AND,
OR, and NOT. These operations are implemented using set intersection, set union and
set difference procedures. A
few systems introduced the concept of “exclusive or” but it is equivalent to a slightly
more complex query using the other operators and is not generally useful to users
since most users do not understand it.
A special type of Boolean search is called “M of N” logic. The user lists a set of
possible search terms and identifies, as acceptable, any item that contains a subset of
the terms. For example, “Find any item containing any two of the following terms:
“AA,” “BB,” “CC.” This can be expanded into a Boolean search that performs an AND
between all combinations of two terms and “OR”s the results together ((AA AND BB)
or (AA AND CC) or (BB AND CC)).
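As a small illustration of this expansion, the sketch below builds the equivalent Boolean expression and performs set-style term matching; the term names AA, BB, CC are taken from the example above:

```python
from itertools import combinations

def expand_m_of_n(terms, m):
    """Expand 'M of N' logic into the equivalent Boolean expression."""
    groups = [" AND ".join(combo) for combo in combinations(terms, m)]
    return " OR ".join("(" + g + ")" for g in groups)

def m_of_n_match(item_tokens, terms, m):
    """True if the item contains at least m of the n listed terms."""
    return sum(term in item_tokens for term in terms) >= m

print(expand_m_of_n(["AA", "BB", "CC"], 2))
# (AA AND BB) OR (AA AND CC) OR (BB AND CC)
print(m_of_n_match({"AA", "CC", "DD"}, ["AA", "BB", "CC"], 2))  # True
```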
Proximity
Proximity is used to restrict the distance allowed within an item between two search
terms. The semantic concept is that the closer
two terms are found in a text the more likely they are related in the description of a
particular concept. Proximity is used to increase the precision of a search. If the terms
COMPUTER and DESIGN are found within a few words of each other then the item is
more likely to be discussing the design of computers than if the words are paragraphs
apart. The typical format for proximity is: TERM1 within "m" "units" of TERM2, where the distance operator "m" is an integer and the units are characters, words, sentences, or paragraphs.
A special case of the Proximity operator is the Adjacent (ADJ) operator that normally
has a distance operator of one and a forward only direction (i.e., in WAIS). Another
special case is where the distance is set to zero meaning within the same semantic
unit.
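A minimal sketch of a proximity check over a tokenized item is shown below; the sample document and distance value are assumptions for illustration:

```python
def within_proximity(tokens, term1, term2, max_distance, directed=False):
    """Check whether term1 and term2 occur within max_distance words of each
    other; directed=True requires term2 to follow term1 (adjacency-style)."""
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    for i in pos1:
        for j in pos2:
            dist = (j - i) if directed else abs(j - i)
            if 0 < dist <= max_distance:
                return True
    return False

doc = "the design of the computer was revised last year".split()
print(within_proximity(doc, "computer", "design", 3))  # True: likely related
```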
A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a
special search operator. A Contiguous Word Phrase is two or more words that are
treated as a single semantic unit. An example of a CWP is “United States of America.” It
is four words that specify a search term representing a single specific semantic
concept (a country) that can be used with any of the operators discussed above. Thus a
query could specify “manufacturing” AND “United States of America” which returns
any item that contains the word “manufacturing” and the contiguous words “United
States of America.”
A contiguous word phrase also acts like a special search operator that is similar
to the proximity (Adjacency) operator but allows for additional specificity. If two
terms are specified, the contiguous word phrase and the proximity operator using
directional one word parameters or the Adjacent operator are identical. For
contiguous word phrases of more than two terms the only way of creating an
equivalent search statement using proximity and Boolean operators is via nested
Adjacencies which are not found in most commercial systems. This is because
Proximity and Boolean operators are binary operators but contiguous word phrases
are an “N”ary operator where “N” is the number of words in the CWP.
Contiguous Word Phrases are called Literal Strings in WAIS and Exact Phrases in
RetrievalWare. In WAIS multiple Adjacency (ADJ) operators are used to define a
Literal String (e.g., “United” ADJ “States” ADJ “of” ADJ “America”).
Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are similar to
the entered search term. This function is primarily used to compensate for errors in
spelling of words. Fuzzy searching increases recall at the expense of decreasing
precision (i.e., it can erroneously identify terms as the search term). In the process of
expanding a query term fuzzy searching includes other terms that have similar
spellings, giving more weight (in systems that rank output) to words in the database
that have similar word lengths and position of the characters as the entered term. A
Fuzzy Search on the term “computer” would automatically include the following
words from the information database: “computer,” “compiter,” “conputer,”
“computter,” “compute.”
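A rough approximation of fuzzy searching can be obtained with a string-similarity ranking, as in the sketch below using Python's standard difflib; the vocabulary list and cutoff are illustrative assumptions:

```python
import difflib

vocabulary = ["computer", "compiter", "conputer", "computter", "compute",
              "commuter", "compulsion"]

# get_close_matches ranks candidates by a similarity ratio; lowering the
# cutoff increases recall at the expense of precision.
print(difflib.get_close_matches("computer", vocabulary, n=10, cutoff=0.8))
```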
Term Masking
Term masking is the ability to expand a query term by masking a portion of the term
and accepting as valid any processing token that maps to the unmasked portion of the
term. The value of term masking is much higher in systems that do not perform
stemming or only provide a very simple stemming algorithm. There are two types of
search term masking: fixed length and variable length. Sometimes they are called fixed and variable length "don't care" functions.
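Fixed and variable length masking can be approximated with ordinary wildcard patterns, as in the sketch below; the token list and patterns are assumptions for illustration:

```python
import fnmatch

tokens = ["compulsion", "compulsive", "compulsory", "computer",
          "multiply", "multinational"]

# Variable-length masking: a trailing "*" stands for any number of characters.
print(fnmatch.filter(tokens, "compul*"))   # compulsion, compulsive, compulsory

# Fixed-length masking: each "?" stands for exactly one character.
print(fnmatch.filter(tokens, "mult?ply"))  # multiply
```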
Numeric and Date Ranges
Term masking is useful when applied to words, but it does not work for finding ranges of numbers or numeric dates. To find numbers larger than "125," using a term "125*" will not find any number except those that begin with the digits "125."
Concept/Thesaurus Expansion
Associated with both Boolean and Natural Language Queries is the ability to expand the search terms via Thesaurus or Concept Class database reference tool. A
Thesaurus is typically a one-level or two-level expansion of a term to other terms that are similar in meaning. A Concept Class is a tree structure that expands each
meaning of a word into potential concepts that are related to the initial term (e.g., in the TOPIC system). Concept classes are sometimes implemented as a network
structure that links word stems (e.g., in the RetrievalWare system). Examples of Thesaurus and Concept Class structures are shown in Figure 2.4 (Thesaurus-93) and Figure 2.5.
Thesauri are either semantic or based upon statistics. A semantic thesaurus is a listing of words and then other words that are semantically similar.
The problem with thesauri is that they are generic to a language and can introduce many search terms that are not found in the document database. An alternative uses the database, or a representative sample of it, to create statistically related terms. It is conceptually a thesaurus in that the words in it are statistically related to other words by frequently occurring together in the same items. This type of thesaurus is very dependent upon the database being searched and may not be portable to other databases.
Natural language interfaces improve the recall of systems with a decrease in precision when negation is required.
Browse Capabilities
Once the search is complete, Browse capabilities provide the user with the capability to determine which items are of interest and select those to be displayed. There
are two ways of displaying a summary of the items that are associated with a query: line item status and data visualization. From these summary displays, the user
can select the specific items and zones within the items for display.
Ranking
Typically relevance scores are normalized to a value between 0.0 and 1.0. The highest value of 1.0 is interpreted that the system is sure that the item is relevant to the
search statement. In addition to ranking based upon the characteristics of the item and the database, in many circumstances collaborative filtering provides an option for selecting and ordering output.
Collaborative filtering has been very successful in sites such as AMAZON.COM, MovieFinder.com, and CDNow.com in deciding what products to display to users based upon their queries.
Rather than limiting the number of items that can be assessed by the number of lines on a screen, other graphical visualization techniques showing the
relevance
relationships of the hit items can be used. For example, a two or three dimensional graph can be displayed where points on the graph represent items and the
location of the points represent their relative relationship between each other and the user’s query. In some cases color is also used in this representation. This
technique allows a user to see the clustering of items by topics and browse through a cluster or move to another topical cluster.
Zoning
Related to zoning for use in minimizing what an end user needs to review from a hit item is the idea of locality and passage based search and retrieval.
Highlighting
Most systems allow the display of an item to begin with the first highlight within the item and allow subsequent jumping to the next highlight. The DCARS system, which acts as a user front end to the RetrievalWare search system, allows the user to browse an item in the order of the paragraphs or individual words that contributed
most to the rank value associated with the item. The highlighting may vary by introducing colors and intensities to indicate the relative importance of a particular
word in the item in the decision to retrieve the item.
Vocabulary Browse provides the capability to display, in alphabetically sorted order, words from the document database. Logically, all unique words (processing tokens) in the database are kept in sorted order along with a count of the number of unique items in which the word is found. The user can enter a word or word fragment and the system will begin to display the dictionary around the entered text.
It helps the user determine the impact of using a fixed or variable length mask on a search term and potential misspellings. The user can determine that entering the search term “compul*” in effect is searching for “compulsion” or “compulsive” or “compulsory.” It also shows that someone probably entered the word “computen” when they really meant “computer.”
Iterative Search and Search History Log
Frequently a search returns a Hit file containing many more items than the user wants to review. Rather than typing in a complete new query, the results of the
previous search can be used as a constraining list to create a new query that is applied against it. This has the same effect as taking the original query and adding
additional search statement against it in an AND condition. This process of refining the results of a previous search to focus on relevant items is called iterative
search. This also applies when a user uses relevance feedback to enhance a previous search. The search history log is the capability to display all the previous
searches that were executed during the current session.
Canned Query
The capability to name a query and store it to be retrieved and executed during a later user session is called canned or stored queries. A canned query allows a user to
create and refine a search that focuses on the user’s general area of interest one time and then retrieve it to add additional search criteria to retrieve data that is
currently needed. Canned query features also allow for variables to be inserted into the query and bound to specific values at execution time.
Difficulties in NLP:
Syntactic ambiguity − For example, “He lifted the beetle with red cap.” − Did he use the cap to lift the beetle, or did he lift a beetle that had a red cap?
Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
One input can have different meanings.
NLP Terminology
Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in the sentence and in phrases.
Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and sentences.
Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of the sentence is affected.
Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
To make interactions between computers and humans, computers need to understand natural languages used by
humans.
Natural language processing is all about making computers learn, understand, analyse, manipulate and interpret
natural(human) languages.
NLP stands for Natural Language Processing, which is a part of Computer Science, Human Language, and Artificial Intelligence.
Processing of Natural Language is required when you want an intelligent system like a robot to perform as per your instructions, when you want to hear a decision from a dialogue-based clinical expert system, etc.
The ability of machines to interpret human language is now at the core of many applications that we use every day - chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.
The input and output of an NLP system can be Speech or Written Text
Components of NLP
There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU) maps the given input in natural language into useful representations and analyzes the different aspects of the language.
Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural language representation.
NLP Terminology
Phonology − It is study of organizing sound systematically.
Steps in NLP
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis –
This phase scans the input text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.
Syntactic Analysis (Parsing) –
Syntactic Analysis is used to check grammar and word arrangements, and it shows the relationship among the words.
A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
Semantic Analysis –
Semantic Analysis is concerned with the meaning representation; it focuses mainly on the literal meaning of words, phrases, and sentences.
Discourse Integration –
Discourse Integration depends upon the sentences that precede it and also invokes the meaning of the sentences that follow it.
Pragmatic Analysis –
It involves deriving those aspects of language which require real world knowledge.
We use it to express our thoughts, and through language, we receive information and infer its
meaning.
The point of morphology, for instance, is to study the variable forms and functions of words, while syntax is concerned with the arrangement of words into phrases, clauses, and sentences.
The meaning of a linguistic expression is its semantics, and etymology and lexicology cover especially the evolution of words and explain the
semantic, morphological, and other links among them.
Words are perhaps the most intuitive units of language, yet they are in general tricky to define.
Knowing how to work with them allows, in particular, the development of syntactic and
semantic abstractions and simplifies other advanced views on language.
Here, first we explore how to identify words of distinct types in human languages, and how the internal structure of words can be modelled in
connection with the grammatical properties and lexical concepts the words should represent.
The discovery of word structure is called morphological parsing.
In many languages, words are delimited in the orthography by whitespace and punctuation. But in many other languages, the writing system leaves it up to the reader to tell words apart, and words themselves are built from smaller meaningful units, the morphemes.
Tokens
Suppose, for a moment, that words in English are delimited only by whitespace and punctuation (the marks, such as full stop, comma, and brackets).
Example: Will you read the newspaper? Will you read it? I won’t read it.
If we confront our assumption with insights from syntax, we notice two problematic words here: newspaper and won't.
For reasons of generality, linguists prefer to analyze won't as two syntactic words, or tokens, each of which has its independent role and can be reverted to its normalized form.
In English, this kind of tokenization and normalization may apply to just a limited set of
cases, but in other languages, these phenomena have to be treated in a less trivial manner.
Tokens behaving in this way can be found in various languages and are often called clitics.
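A minimal, rule-based sketch of this kind of tokenization and normalization is shown below; the contraction table is a small assumption, not an exhaustive resource:

```python
import re

# Known clitic/contraction forms mapped to their normalized tokens.
CONTRACTIONS = {"won't": ["will", "not"], "didn't": ["did", "not"]}

def tokenize(text):
    tokens = []
    for raw in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if raw in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[raw])        # normalize listed clitics
        elif raw.endswith("n't"):
            tokens.extend([raw[:-3], "not"])        # generic n't rule
        else:
            tokens.append(raw)
    return tokens

print(tokenize("Will you read the newspaper? I won't read it."))
# ['will', 'you', 'read', 'the', 'newspaper', '?', 'i', 'will', 'not', 'read', 'it', '.']
```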
Lexemes
By the term word, we often denote not just the one linguistic form in the given context but also the concept behind the form and the set of
alternative forms that can express it.
Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns, adjectives, adverbs, and so on. The citation form of a lexeme, by which it is commonly identified, is also called its lemma.
When we convert a word into its other forms, such as turning the singular mouse into the plural mice or mouses, we say we inflect the lexeme.
When we transform a lexeme into another one that is morphologically related, regardless of its lexical category, we say we derive the
lexeme: for instance, the nouns receiver and reception are derived from the verb to receive.
Example: Did you see him? I didn’t see him. I didn’t see anyone.
• Example presents the problem of tokenization of didn’t and the investigation of the internal structure of anyone.
Morphemes
Morphological theories differ on whether and how to associate the properties of word forms with their structural components.
These components are usually called segments or morphs.
The morphs that by themselves represent some aspect of the meaning of a word are called morphemes of some function.
• Human languages employ a variety of devices by which morphs and morphemes are combined into word forms.
cats = cat-s
unfriendly = un-friend-ly
Two types of morphemes:
i. Stems: play, cat, friend
ii. Affixes: -ed, -s, un-, -ly
play = play
replayed = re-play-ed
computerized = comput-er-ize-d
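The segmentations above can be approximated by naive affix stripping, as in the sketch below; real morphological analyzers rely on lexicons and rules rather than this simple prefix/suffix matching:

```python
PREFIXES = ["un", "re"]
SUFFIXES = ["ized", "ize", "ly", "ed", "er", "s"]

def segment(word):
    """Split a word into a crude stem plus prefix/suffix morphs."""
    morphs, suffixes = [], []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphs.append(p + "-")
            word = word[len(p):]
            break
    stripped = True
    while stripped:
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, "-" + s)
                word = word[:-len(s)]
                stripped = True
                break
    return morphs + [word] + suffixes

for w in ["cats", "unfriendly", "replayed", "computerized"]:
    print(w, "=", segment(w))
```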
Problems in morphological processing
Inflectional morphology: inflected forms are constructed from base forms and inflectional
affixes.
Derivational morphology: words are constructed from roots (or stems) and derivational affixes (e.g., the noun receiver is derived from the verb receive).
Typology
Morphological typology divides languages into groups by characterizing the prevalent morphological phenomena in those languages.
Isolating, or analytic, languages include no or relatively few words that would comprise more than one morpheme.
Morphological parsing tries to eliminate the variability of word forms to provide higher-level linguistic units whose lexical and morphological properties are explicit and well defined.
It attempts to remove unnecessary irregularity and give limits to ambiguity, both of which are present inherently in human language.
By irregularity, we mean the existence of forms and structures that are not described appropriately by a prototypical linguistic model. By ambiguity, we mean word forms that resemble each other but have distinct functions or meanings in the language.
Morphological modelling also faces the problem of productivity and creativity in language, by which unconventional but perfectly meaningful new words or new senses are coined.
Irregularity
Morphological parsing is motivated by the quest for generalization and abstraction in the world of words.
Immediate descriptions of given linguistic data may not be the ultimate ones, due to either their inadequate accuracy or inappropriate complexity, and better formulations may be needed.
The design principles of the morphological model are therefore very important.
With the proper abstractions made, irregular morphology can be seen as merely enforcing
some extended rules, the nature of which is phonological, over the underlying or prototypical
regular word forms.
Morphophonemic templates capture morphological processes by just organizing stem patterns and generic affixes without any context-dependent
variation of the affixes or ad hoc modification of the stems.
The merge rules, indeed very neatly or effectively concise, then ensure that such structured
representations can be converted into exactly the surface forms, both orthographic and
phonological, used in the natural language.
Applying the merge rules is independent of and irrespective of any grammatical parametersor information other than that contained in a template.
Ambiguity
Morphological ambiguity is the possibility that word forms be understood in multiple ways out of the context of their discourse. Word forms that look the same but have distinct functions or meanings are called homonyms.
Ambiguity is present in all aspects of morphological processing and language
processing at large.
Productivity
Is the inventory of words in a language finite, or is it unlimited?
This question leads directly to discerning two fundamental approaches to language, summarized in the distinction between langue and parole, or in the competence versus performance duality.
In one view, language can be seen as simply a collection of utterances (parole) actually pronounced or written (performance).
This ideal data set can in practice be approximated by linguistic corpora, which are finite collections of linguistic data that are studied with empirical methods and can be used for comparison when linguistic models are developed.
Yet, if we consider language as a system (langue), we discover in it structural devices like recursion, iteration, or compounding that allow speakers (competence) to produce an infinite set of concrete linguistic utterances.
This general potential holds for morphological processes as well and is called morphological productivity.
We denote the set of word forms found in a corpus of a language as its vocabulary.
The members of this set are word types, whereas every original instance of a word form is a word token.
The distribution of words or other elements of language follows the “80/20 rule,” also known as the law of the vital few.
It says that most of the word tokens in a given corpus (a collection of written texts) can be identified with just a couple of word types in its vocabulary, and words from the rest of the vocabulary occur much less commonly, if not rarely, in the corpus.
Furthermore, new, unexpected words will always appear as the collection of linguistic data is
enlarged.
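The type/token distinction and the skewed distribution can be observed even on a tiny corpus, as in the sketch below (the corpus here is just the preceding paragraph, used for illustration):

```python
import re
from collections import Counter

corpus = """Most of the word tokens in a given corpus can be identified with
just a couple of word types in its vocabulary, and words from the rest of the
vocabulary occur much less commonly if not rarely in the corpus."""

tokens = re.findall(r"[a-z]+", corpus.lower())
counts = Counter(tokens)                      # vocabulary with frequencies

print("word tokens:", len(tokens))
print("word types (vocabulary size):", len(counts))

# The few most frequent types cover a disproportionate share of all tokens.
top5 = sum(c for _, c in counts.most_common(5))
print("share of tokens covered by the top 5 types:", top5 / len(tokens))
```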
Morphological Models
There are many possible approaches to designing and implementing morphological models.
Over time, computational linguistics has witnessed the development of a number of formalisms and frameworks, in particular grammars of different
kinds and expressive power, with which to address whole classes of problems in processing natural as well as formal languages.
Let us now look at the most prominent types of computational approaches to morphology.
Dictionary Lookup
Morphological parsing is a process by which word forms of a language are associated with
Morphological systems that specify these associations by merely enumerating(is the act or process of making or stating a list of things one
after another) them case by case do not offer any generalization means.
In information retrieval systems, the dictionary and inversion list structure works as follows:
1. The dictionary is a sorted list of all processing tokens in the system, with a pointer to the location of each token's inversion list.
2. For each word, a list of the documents in which the word is found is stored (an inversion of the document collection).
3. Each document is given a unique numerical identifier that is stored in the inversion list; the dictionary is used to locate the inversion list for a particular word.
4. The dictionary can also store other information used in query optimization, such as the length of each inversion list, to increase precision.
The inversion list consists of the document identifier for each document in which the word is found.
Ex: the entry bit 1(10), 1(12), 1(18) means the word bit occurs at word positions 10, 12, and 18 of document #1.
When a search is performed, the inversion lists for the terms in the query are located and the appropriate logic is applied between the inversion lists.
Words with special characteristics can be stored in their own dictionary (e.g., dates, which require date ranging, and numbers).
When the inversion lists are stored in a tree structure, they may be kept at the leaf level or referenced in higher-level pointers.
All leaves are at the same level or differ by at most one level.
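A minimal sketch of the dictionary and inversion list structure is shown below; the two sample documents are assumptions for illustration:

```python
from collections import defaultdict

documents = {
    1: "the bit is set and the bit is cleared",
    2: "a different bit pattern",
}

# Dictionary: token -> inversion list of (document id, word position) entries.
index = defaultdict(list)
for doc_id, text in documents.items():
    for position, token in enumerate(text.split(), start=1):
        index[token].append((doc_id, position))

dictionary = sorted(index)            # sorted list of processing tokens
print(index["bit"])                   # [(1, 2), (1, 7), (2, 3)]

def docs_with(term):
    return {doc for doc, _ in index[term]}

# Boolean AND between two terms = set intersection over their inversion lists.
print(docs_with("bit") & docs_with("pattern"))   # {2}
```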
Finite-State Morphology
By finite-state morphological models, we mean those in which the specifications written by human programmers are directly compiled into finite-state transducers.
The two most popular tools supporting this approach are XFST (Xerox Finite-State Tool) and LexTools.
A finite-state transducer consists of a finite set of nodes connected by directed edges labeled with pairs of input and output symbols.
The set of possible sequences accepted by the transducer defines the input language; the set of possible sequences emitted by the transducer defines the output language.
For example, a finite-state network can accept and relate all of the words in the infinite regular language defined by grandson, great-grandson, great-great-grandson, and so on.
In finite-state computational morphology, it is common to refer to the input word forms as surface strings and to
the output descriptions as lexical strings, if the transducer is used for morphological analysis, or vice versa, if it is
used for morphological generation.
•In English, a finite-state transducer could analyze the surface string children into the lexical
string child [+plural], for instance, or generate women from woman [+plural].
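The analysis and generation directions can be illustrated with a toy lookup table rather than a compiled transducer; the lexicon below is a small assumption, not an XFST or LexTools grammar:

```python
# Irregular surface forms listed explicitly; regular plurals handled by a rule.
IRREGULAR = {"children": "child +N +Pl", "women": "woman +N +Pl",
             "mice": "mouse +N +Pl"}
NOUN_STEMS = {"cat", "dog", "computer", "child", "woman", "mouse"}

def analyze(surface):
    """Surface string -> lexical string (morphological analysis)."""
    if surface in IRREGULAR:
        return IRREGULAR[surface]
    if surface in NOUN_STEMS:
        return surface + " +N +Sg"
    if surface.endswith("s") and surface[:-1] in NOUN_STEMS:
        return surface[:-1] + " +N +Pl"
    return None

def generate(lexical):
    """Lexical string -> surface string (morphological generation)."""
    candidates = list(IRREGULAR) + list(NOUN_STEMS) + [s + "s" for s in NOUN_STEMS]
    for surface in candidates:
        if analyze(surface) == lexical:
            return surface
    raise ValueError("cannot generate " + lexical)

print(analyze("children"))         # child +N +Pl
print(analyze("cats"))             # cat +N +Pl
print(generate("woman +N +Pl"))    # women
```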
Relations on languages can also be viewed as functions. Let us have a relation R, and let us denote by [Σ] the set of all sequences over some set of symbols Σ, so that the domain and the range of R are subsets of [Σ].
We can then consider R as a function mapping an input string into a set of output strings, formally denoted as R : [Σ] → 2^[Σ].
A theoretical limitation of finite-state models of morphology is the problem of capturing reduplication of words or their parts, which occurs in a number of natural languages.
Unification-Based Morphology
The concepts and methods of these formalisms are often closely connected to those
of logic programming.
In finite-state morphological models, both surface and lexical forms are by themselves unstructured strings of atomic symbols. In unification-based morphology, linguistic information is instead expressed by data structures that can include complex values or can be recursively nested if needed.
Morphological parsing P thus associates linear forms φ with alternatives of structured content ψ. In this approach to morphological modelling, word forms are best captured by regular expressions, while the linguistic content is best described through typed feature structures.
Morphological models of this kind are typically formulated as logic programs, and unification is used to solve the system of constraints imposed by the model.
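Unification over feature structures can be tried directly with NLTK's FeatStruct class, as in the sketch below; the feature names are illustrative assumptions:

```python
from nltk import FeatStruct

# Content contributed by the stem "cat" and by the plural suffix "-s".
stem = FeatStruct("[LEMMA='cat', POS='noun']")
plural_suffix = FeatStruct("[POS='noun', AGR=[NUM='pl']]")

unified = stem.unify(plural_suffix)
print(unified)                     # merged structure with AGR, LEMMA and POS

# Conflicting information fails to unify and returns None.
print(unified.unify(FeatStruct("[AGR=[NUM='sg']]")))   # None
```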
Many functional morphology implementations are embedded in a general-purpose programming language, which gives programmers more freedom
with advanced programming techniques and allows them to develop full-featured, real-world applications for their models.
It influenced the functional morphology framework, with which morphologies of Latin, Swedish, Spanish, Urdu, and other languages have been implemented.
The notation then constitutes a so-called domain-specific embedded language, which makes programming even
more fun.
Even without the options provided by general-purpose programming languages, functional morphology models are useful on their own.
Morphological grammars in Grammatical Framework can be extended with descriptions of the syntax and semantics of a language.
Grammatical Framework itself supports multilinguality, and models of more than a dozen languages are available in
it as open-source software.
2. Finding the Structure of Documents
Introduction
In human language, words and sentences do not appear randomly but have structure.
For example, combinations of words form sentences - meaningful grammatical units such as statements, requests, and commands.
Automatic extraction of structure of documents helps subsequent NLP tasks: for example, parsing, machine translation, and semantic role labelling
use sentences as the basic processing unit.
The task of deciding where sentences start and end, given a sequence of characters (made of words and typographical cues), is called sentence boundary detection.
Topic segmentation is the task of determining when a topic starts and ends in a sequence of sentences.
We focus on statistical classification approaches that try to find the presence of sentence and topic boundaries given human-annotated training data for segmentation.
These methods base their predictions on features of the input: local characteristics that give evidence toward the presence or absence of a boundary.
Features are the core of classification approaches and require careful design and selection in order to be successful.
Although most statistical approaches described here are language independent, every language is a challenge in itself.
For example, for processing of Chinese documents, the processor may need to first segment the character sequences into words, as the words are not delimited by whitespace.
Similarly, for morphologically rich languages, the word structure may need to be analyzed to extract additional features.
Such processing is usually done in a pre-processing step, where a sequence of tokens is determined.
Tokens can be word or sub-word units, depending on the task and language.
In written text in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period, a question mark, or an exclamation mark.
In addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns, periods are used in abbreviations, and numbers and punctuation marks are used inside proper names.
The period at the end of an abbreviation can mark a sentence boundary at the same time.
Compare, for example, "I spoke with Dr. Smith." and "The clinic is on Mountain Dr." In the first sentence, the abbreviation Dr. does not end a sentence, and in the second it does.
Quoted sentences are especially problematic, as the speakers may have uttered multiple sentences, and the sentence boundaries inside the quotation interact with the surrounding punctuation marks.
An automatic method that outputs word boundaries as ending sentences according to the
presence of such punctuation marks would result in cutting some sentences incorrectly.
Ambiguous abbreviations and capitalizations are not the only problems for sentence segmentation in written text.
Spontaneously written texts, such as short message service (SMS) texts or instant
messaging(IM) texts, tend to be nongrammatical and have poorly used or missing punctuation, which makes sentence segmentation even more
challenging.
Similarly, if the text input to be segmented into sentences comes from an automatic system,
such as optical character recognition (OCR) or ASR, that aims to translate images of handwritten, type written, or printed text or spoken utterances into
machine editable text, thefinding of sentences boundaries must deal with the errors of those systems as well.
On the other hand, for conversational speech or text or multiparty meetings with
ungrammatical sentences and disfluencies, in most cases it is not clear where the boundaries
are.
Code switching -that is, the use of words, phrases, or sentences from multiple languages bymultilingual speakers- is another problem that can
affect the characteristics of sentences.
For example, when switching to a different language, the writer can either keep the
punctuation rules from the first language or resort to the code of the second language.
Rule-based segmenters use lists of end-of-sentence punctuation marks to identify potential ends of sentences and lists of abbreviations for disambiguating them.
For example, if the word before the boundary is a known abbreviation, such as “Mr.” or “Gov.,” the text is not segmented at that position, even though some periods after abbreviations do end sentences.
Given training data where all sentence boundaries are marked, we can instead train a classifier to recognize them.
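A minimal sketch of such a classifier is shown below, using hand-crafted local features and logistic regression; the training examples, abbreviation list, and feature set are illustrative assumptions, not the textbook's exact setup:

```python
from sklearn.linear_model import LogisticRegression

ABBREVIATIONS = {"mr.", "dr.", "gov.", "etc."}

def features(prev_word, next_word):
    return [
        1.0 if prev_word.endswith(".") else 0.0,
        1.0 if prev_word.lower() in ABBREVIATIONS else 0.0,
        1.0 if next_word[:1].isupper() else 0.0,
        float(len(prev_word)),
    ]

# (word before the candidate boundary, word after it, is_boundary) examples.
train = [("home.", "The", 1), ("Dr.", "Smith", 0), ("Mr.", "Brown", 0),
         ("late.", "She", 1), ("Gov.", "Davis", 0), ("now.", "We", 1)]

X = [features(p, n) for p, n, _ in train]
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)

print(clf.predict([features("done.", "Then"), features("Dr.", "Jones")]))
```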
Segmentation (discourse or text segmentation) is the task of automatically dividing a stream of text or speech into topically homogeneous blocks.
That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change.
Topic segmentation is an important task for various language understanding applications, such as information extraction and retrieval and text
summarization.
For example, in information retrieval, if a long document can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.
During the late 1990s, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the Topic Detection and Tracking (TDT) program to further the state of the art in finding and following new topics in a stream of broadcast news stories.
One of the tasks in the TDT effort was segmenting a news stream into individual stories.
Methods
Sentence segmentation and topic segmentation have both been treated as a boundary classification problem.
Given a boundary candidate (between two word tokens for sentence segmentation and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary (sentence or topic boundary).
Formally, let x ∈ X be the vector of features (the observation) associated with a candidate and y ∈ Y be the label predicted for the candidate, where Y covers the boundary and non-boundary classes.
As an alternative to the binary classification problem, it is possible to model boundary types using finer-grained categories.
Sentence segmentation in text can then be framed as a three-class problem: sentence boundary with an abbreviation, sentence boundary without an abbreviation, and abbreviation not acting as a boundary.
Similarly, for spoken language, a three-way classification can be made between non-boundaries, statement boundaries, and question boundaries.
• The natural unit of sentence segmentation is the word, and of topic segmentation the sentence, as we can assume that topics typically do not change in the middle of a sentence.
The words or sentences are then grouped into stretches belonging to one sentence or topic - that is, word or sentence boundaries are classified into sentence or topic boundaries and non-boundaries.
The classification can be done at each potential boundary i (local modelling); then, the aim is to estimate the most probable boundary type ŷ_i for each candidate x_i:

ŷ_i = argmax_{y_i ∈ Y} P(y_i | x_i)

Here, the ^ is used to denote estimated categories, and a variable without a ^ is used to show possible categories.
However, the consecutive types can be related to each other. For example, in broadcast news speech, two consecutive sentences boundaries that
form a single word sentence are very infrequent.
In local modelling, features can be extracted from the context surrounding the candidate boundary.
• It is also possible to see the candidate boundaries as a sequence and to search for the most probable sequence of boundary types:

Ŷ = argmax_Y P(Y | X)
We categorize the methods into local and sequence classification.
Another categorization of methods is done according to the type of the machine learning algorithm: generative versus
discriminative.
Generative sequence models estimate the joint distribution P(X, Y) of the observations (words, punctuation) and the labels (sentence boundary, topic boundary).
Discriminative sequence models, however, focus on features that characterize the differences between the possible labelings of the examples.
The most commonly used generative sequence classification method for topic and sentence segmentation is the hidden Markov model (HMM), in which the model is formulated according to Bayes' rule.
HMM means: a hidden Markov model is a statistical model that can be used to describe the evolution of observable events that depend on internal factors which are not directly observable.
Such generative models can be handled by HELMs (hidden event language models), which can be trained on data sets that are multiple orders of magnitude larger.
The probability in the sequence equation above is rewritten as follows, using Bayes' rule:

Ŷ = argmax_Y P(Y | X) = argmax_Y P(X | Y) P(Y) / P(X) = argmax_Y P(X | Y) P(Y)

Here Ŷ is the predicted class (boundary) label sequence.
P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the argument of max.
The most important distinction is that the class densities P(x|y) are model assumptions in generative approaches, whereas discriminative approaches model the posterior P(y|x) directly.
A number of discriminative classification approaches are used, such as support vector machines, boosting, maximum entropy, and regression. These are based on different machine learning algorithms used to discriminate sentence boundaries from non-boundaries.
For sentence segmentation, supervised learning methods – have primarily been applied to
newspaper articles.
Supervised learning methods are those in which a model is trained on labelled examples and then applied to new data; many such methods have been used for this task.
Stamatatos, Fakotakis, and Kokkinakis used transformation-based learning (TBL) to infer rules for finding sentence boundaries.
Many supervised classifiers have been tried for the sentence boundary task, such as regression trees, neural networks, maximum entropy classifiers, support vector machines, and naïve Bayes classifiers.
The TextTiling method is widely used for topic segmentation; it uses a lexical cohesion metric (the binding of words to one another) in a word vector space as an indicator of topic similarity.
Originally, two methods for computing the similarity scores were proposed: block
comparison and vocabulary introduction.
The first, block comparison, compares adjacent blocks of text to see how similar they are according to how many words the adjacent blocks have in common.
Given two blocks, b1 and b2, each having k tokens (sentences or paragraphs), the similarity (or topical cohesion) score between the two blocks is computed as the cosine of their term-weight vectors:

sim(b1, b2) = Σ_t w_{t,b1} · w_{t,b2} / sqrt( Σ_t w_{t,b1}² · Σ_t w_{t,b2}² )

where w_{t,b} is the weight assigned to term t in block b. The weights can be binary or may be computed using other information retrieval metrics, such as term frequency.
The second method, the vocabulary introduction method, assigns a score to a token-sequence gap on the basis of how many new words are seen in the interval in which it is the midpoint.
Similar to the block comparison formulation, given two consecutive blocks b1 and b2, each with an equal number of words w, the score of the gap between them is the number of new terms first seen in the two blocks, divided by the total number of tokens, 2w.
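The block comparison score above amounts to a cosine similarity between the term-weight vectors of adjacent blocks. The sketch below uses raw term frequencies as weights; the sample blocks are assumptions for illustration. (NLTK also ships a full implementation as nltk.tokenize.TextTilingTokenizer.)

```python
import math
from collections import Counter

def block_similarity(block1, block2):
    """Cosine similarity between the term-frequency vectors of two blocks."""
    w1, w2 = Counter(block1), Counter(block2)
    dot = sum(w1[t] * w2[t] for t in set(w1) & set(w2))
    norm = math.sqrt(sum(c * c for c in w1.values()) *
                     sum(c * c for c in w2.values()))
    return dot / norm if norm else 0.0

b1 = "the cat sat on the mat and the cat purred".split()
b2 = "the cat chased the mouse across the mat".split()
b3 = "stock prices fell sharply in early trading".split()

print(block_similarity(b1, b2))   # relatively high: same topic
print(block_similarity(b1, b3))   # zero: a likely topic boundary
```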
Complexity of the Approaches
The approaches described here have advantages and disadvantages.
Local classification methods are simpler, while sequence methods usually perform better because consecutive boundary decisions depend on each other.
Discriminative approaches tend to outperform generative ones when trained and tested on matched data sets.
However, the training of discriminative approaches is more expensive than that of generative ones because they require multiple passes over the training data to adjust for feature weights.
Generative models, on the other hand, can be handled by HELMs (hidden event language models), which can be trained on data sets that are multiple orders of magnitude larger.
Their disadvantage is that they work with only a few features.
Unit-2
UNIT - II Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms, Models for Ambiguity Resolution in Parsing, Multilingual Issues
Natural Language Processing (NLP) is a field of study that deals with understanding, interpreting, and manipulating human spoken languages using computers.
Since most of the significant information is written down in natural languages such as English, French, German, etc. thus, NLP helps computers communicate with
humans in their own languages and perform other language-related tasks.
In conclusion, NLP makes it possible for computers to read the text, hear speech, interpret and realize it, understand the sentiment, and identify important parts of a text
or speech.
What is Syntax?
A natural language typically follows a hierarchical structure, and contains the following components:
Sentences
Clauses
Phrases
Words
Syntax refers to the set of rules, principles, and processes that govern the structure of sentences in a natural language. One basic description of syntax is how different words such as Subjects, Verbs, Nouns, Noun Phrases, etc. are sequenced in a sentence.
Some of the syntactic categories of a natural language are as follows:
Sentence(S)
Noun Phrase(NP)
Determiner(Det)
Verb Phrase(VP)
Prepositional Phrase(PP)
Verb(V)
Noun(N)
Syntax Tree:
A Syntax tree or a parse tree is a tree representation of different syntactic categories of a sentence. It helps us to understand the syntactical structure of a sentence.
Example:
The syntax tree for the sentence given below is as follows:
I drive a car to my college.
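The tree for this sentence can be generated with a small hand-written context-free grammar, as in the sketch below; the grammar itself is an assumption made for illustration:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PRP | Det N | Poss N
VP -> V NP PP
PP -> P NP
PRP -> 'I'
Det -> 'a'
Poss -> 'my'
N -> 'car' | 'college'
V -> 'drive'
P -> 'to'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I drive a car to my college".split()):
    tree.pretty_print()   # draws the syntax tree as ASCII art
```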
One of the important use cases of Natural Language Processing (NLP) is generative text. Generative text is predicting what word must come next in a sentence.
Applications of generative text are question-answering chatbots, sentence or word autocorrection, and autocompletion, grammar check, and these cases have now become
indispensable and part of our day-to-day lives.
To help us know what word will come next, we need to learn as much as we can about what words previously came in a sentence. To fulfil this need and to understand what words came before is why parts of speech and syntactic parsing are very important and integral topics in NLP.
Language Syntax
The language syntax is fundamental for generative text and sets the foundation for parts of speech and parse trees.
The word syntax originates from the Greek word syntaxis, meaning “arrangement”, and refers to how the words are arranged together. Henceforth, language syntax means
how the language is structured or arranged.
How are words arranged together?
There are many different ways to categorize these structures or arrangements. One way to classify how the words are arranged is by grouping them as the words behave as
a single unit or phrase, which is also known as a constituent.
A sentence can have different language rules applied to it and have different types of structure, as different parts of the sentence follow the grammar rules of a noun phrase, verb phrase, or prepositional phrase.
S = NP + VP + PP
The different word groups that exist according to English grammar rules include the noun phrase (NP), verb phrase (VP), and prepositional phrase (PP).
For instance, consider the sentence: The boy ate the pancakes. This sentence has the following structure: [The boy]NP [ate the pancakes]VP.
However, now take another sentence: The boy ate the pancakes under the door.
Here, the preposition under is followed by the noun phrase the door, which is syntactically correct but not correct contextually.
A sentence in which the preposition is instead followed by a verb phrase, such as jumping the table, is syntactically incorrect.
Such ill-formed phrases are marked and cut off from the parse tree.
In the figure below, the offending part of the sentence has been removed from the tree; you can observe that it is circled.
Two main approaches to syntax analysis are used to construct treebanks:
1. Dependency graphs
2. Phrase structure (constituency) trees
The word 'parsing', whose origin is the Latin word 'pars' (which means 'part'), is used to draw exact meaning or dictionary meaning from the text. It is also called syntactic analysis or syntax analysis. Using the rules of a formal grammar, syntax analysis checks the text for meaningfulness. A sentence like "Give me hot ice cream", for example, would be rejected by a parser or syntactic analyzer.
In this sense, we can define parsing or syntactic analysis or syntax analysis as follows −
It may be defined as the process of analyzing the strings of symbols in natural language conforming to the rules of formal grammar.
The term Dependency Parsing (DP) refers to the process of examining the dependencies between the words of a sentence in order to determine its grammatical structure. A sentence is divided into many sections based mostly on this. The process is based on the assumption that there is a direct relationship between each linguistic unit in a sentence. These links are called dependencies.
Consider the following statement: “I prefer the morning flight through Denver.”
In a written dependency structure, the relationships between each linguistic unit, or word, in the sentence are expressed by directed arcs. The root of the tree, "prefer," forms the head of the sentence, as labelled in the illustration.
A dependency tag indicates the relationship between two words. For example, the word "Denver" modifies the meaning of the noun "flight." As a result, you may identify a dependency from flight -> Denver, where flight is the head and Denver is the child or dependent. It is represented by nmod, which stands for the nominal modifier.
Some common dependency relation tags are:
aux: auxiliary
cc: coordinating conjunction
cc:preconj: preconjunct
clf: classifier
compound: compound
conj: conjunct
cop: copula
det: determiner
expl: expletive
flat: flat (names)
list: list
mark: marker
obj: object
orphan: orphan
parataxis: parataxis
punct: punctuation
root: root
vocative: vocative
The Natural Language Toolkit (NLTK) package can be used for dependency parsing; it is a set of libraries and programs for statistical natural language processing (NLP) of human language.
1. Probabilistic, projective dependency parser: these parsers predict new sentences by using human language data acquired from hand-parsed sentences. They are known to make mistakes and work with a limited collection of training data.
2. Stanford parser: a Java-based natural language parser. You would need the Stanford CoreNLP parser to perform dependency parsing. The parser supports a number of languages, including English, Chinese, German, and Arabic.
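As an alternative to the tools above, the sketch below obtains a dependency parse of the example sentence with spaCy; it assumes the small English model has been installed (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I prefer the morning flight through Denver")

for token in doc:
    # token.dep_ is the dependency label, token.head is the governing word
    print(f"{token.text:10} {token.dep_:10} head={token.head.text}")
```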
Constituency Parsing is based on context-free grammars: context-free grammars are used to parse the text. Here the parse tree breaks the sentence down into sub-phrases, each of which belongs to a grammar category. A terminal node is a linguistic unit or word that has a parent node and a part-of-speech tag.
For example, consider the following sentence: "I shot an elephant in my pajamas." The constituency parse tree is shown graphically as follows:
The entire sentence is broken down into sub-phrases until only terminal phrases remain. VP stands for verb phrase, whereas NP stands for noun phrase.
Dependency Parsing vs. Constituency (Phrase Structure) Parsing
The Stanford parser can also be used to do constituency parsing. It begins by parsing a phrase with the constituency parser and then transforms the constituency parse tree into a dependency tree.
If your main objective is to break a sentence into sub-phrases, constituency parsing is the better choice. However, dependency parsing is the best method for discovering the dependencies between the words in a sentence.
A constituency parse tree denotes the subdivision of a text into sub-phrases. The tree's non-terminals are different types of phrases, the terminals are the sentence's words, and the edges are unlabeled. A constituency parse for the simple statement "John sees Bill" would be: (S (NP John) (VP (V sees) (NP Bill))).
A projective dependency tree is one in which, if we put the words in linear order based on the sentence with the root symbol in the first position, the dependency arcs can be drawn above the words without any crossing arcs.
A non-projective dependency tree, by contrast, contains at least one pair of crossing arcs when drawn this way.
As discussed, a parser is basically a procedural interpretation of grammar. It finds an optimal tree for the given sentence after searching through the space of a variety of
trees. Let us see some of the available parsers below −
1.Shift-reduce parser
Following are some important points about shift-reduce parser −
It follows a simple bottom-up process.
It tries to find a sequence of words and phrases that corresponds to the right-hand side of a grammar production and replaces them with the left-hand side of that production.
This attempt to find such a sequence of words continues until the whole sentence is reduced.
In other words, the shift-reduce parser starts from the input symbols and tries to construct the parse tree up to the start symbol. A small sketch using NLTK follows.
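A minimal sketch of NLTK's built-in shift-reduce parser on a toy grammar (the grammar rules below are invented for illustration, not taken from the notes):

from nltk import CFG
from nltk.parse import ShiftReduceParser

# Toy grammar covering only the sentence "John sees Bill".
grammar = CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'Bill'
VP -> V NP
V -> 'sees'
""")

sr_parser = ShiftReduceParser(grammar)
for tree in sr_parser.parse("John sees Bill".split()):
    print(tree)   # (S (NP John) (VP (V sees) (NP Bill)))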
In the creation of PAT trees, each position in the input string is the anchor point for a sub-string (sistring) that starts at that point and extends to the end of the input.
PAT trees are a binary-tree structure commonly used for prefix search; because the keys are kept in sorted (lexicographic) order, they also support range search and are more precise than inverted files.
PAT trees therefore provide an alternative structure for supporting string search.
Examples of sistrings
• The key values are stored at the leaf nodes (bottom nodes) of the PAT tree.
• For a text input of size n there are n leaf nodes and at most n-1 higher-level (internal) nodes.
• It is possible to place additional constraints on the sistrings used for the leaf nodes.
If the binary representation of "h" is (100), "o" is (110), "m" is (001) and "e" is (101), then the word "home" produces the input string 100110001101.
Using the sistrings
• Here we want to provide a model that matches the intuition that the second tree above is preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivations of the following CFG:
• We assign scores or probabilities to the rules in the CFG in order to provide a score or probability for each derivation.
• The example above is complex to calculate; the tree shown below is an example of a PCFG calculation.
• The rule probabilities can be derived from a treebank. Consider a treebank with three trees t1, t2 and t3:
• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times and t3 occurred 50 times, then the PCFG we obtain from this treebank is as follows. (A small NLTK sketch of such a PCFG is given below.)
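A minimal NLTK sketch of such a PCFG for the earlier example sentence; the rule probabilities below are invented relative frequencies standing in for counts read off a treebank (count(rule) / count(left-hand side)):

from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
S -> NP VP [1.0]
VP -> V NP [0.6] | V NP PP [0.4]
NP -> Det N [0.4] | Det N PP [0.2] | 'I' [0.4]
PP -> P NP [1.0]
Det -> 'an' [0.5] | 'my' [0.5]
N -> 'elephant' [0.5] | 'pajamas' [0.5]
V -> 'shot' [1.0]
P -> 'in' [1.0]
""")

# The Viterbi parser returns the highest-probability parse, resolving the ambiguity.
parser = ViterbiParser(grammar)
for tree in parser.parse("I shot an elephant in my pajamas".split()):
    print(tree.prob(), tree)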
Generative models: a generative parser defines a sequence of decisions that builds the parse tree, and the weights (probabilities) attached to those decisions determine which parse tree is retrieved.
• To find the most plausible parse tree, the parser has to choose between the possible derivations.
• Let each derivation be D = d1, d2, ..., dn, the sequence of decisions used to build the parse tree.
• Then, for an input sentence x, the output parse tree y is defined by the sequence of steps in the derivation.
• The probability of a derivation is P(D) = P(d1) x P(d2 | d1) x ... x P(dn | d1, ..., dn-1).
Depending on the score of each candidate tree evaluated with the help of the PCFG, the next candidate tree is shortlisted, thus reducing the ambiguity.
Discriminative models:
• Having the feature representation Φ(x, y)·w and the candidate set GEN(x) specified, we choose the highest-scoring candidate: y* = argmax over y in GEN(x) of Φ(x, y)·w.
Unit 3 & 4
SEMANTIC PARSING
A semantic system brings entities, concepts, relations and predicates together to provide more context
to language so machines can understand text data with more accuracy. Semantic analysis derives
meaning from language and lays the foundation for a semantic system to help machines interpret
meaning.
Machines lack a reference system to understand the meaning of words, sentences and documents.
Word sense disambiguation and meaning recognition can provide a better understanding of language
data for machines. Here is how each part of semantic analysis works:
Lexical analysis is the process of reading a stream of characters, identifying the lexemes and
converting them into tokens that machines can read.
Grammatical analysis correlates the sequence of lexemes (words) and applies formal grammar to
them so part-of-speech tagging can occur.
Syntactical analysis analyzes or parses the syntax and applies grammar rules to provide context to
meaning at the word and sentence level.
Semantic analysis uses all of the above to understand the meaning of words and interpret sentence
structure so machines can understand language as humans do.
Why Is Semantic Analysis Important to NLP?
Language data is a rich source of business intelligence. However, many organizations struggle to
capitalize on it because of their inability to analyze unstructured data. This challenge is a frequent
roadblock for artificial intelligence (AI) initiatives that tackle language-intensive processes.
Every type of communication — be it a tweet, LinkedIn post, or review in the comments section of a
website — may contain potentially relevant and even valuable information that companies must
capture and understand to stay ahead of their competition. Capturing the information is the easy part
but understanding what is being said (and doing this at scale) is a whole different story.
To understand how NLP and semantic processing work together, consider this:
Word Sense Disambiguation basically solves the ambiguity that arises in determining the meaning of
the same word used in different situations.
Challenges in Word Sense Disambiguation
WSD faces a lot of challenges and problems.
The most common problem is the difference between various dictionaries or text corpora. Different dictionaries have different definitions for words, which causes the senses of words to be perceived differently. There is also a huge amount of text out there, and often it is not possible to process all of it properly.
Different applications need different algorithms and that is often a challenge for WSD.
Another problem is that words cannot always be divided into discrete meanings; words often have related meanings, and this causes a lot of problems.
2.Semantic Interpretation
Semantic parsing can be considered part of semantic interpretation, which involves various components that together define a representation of text that can be fed into a computer to allow further computation, manipulation and search, all of which are prerequisites for any language-understanding system or application. We start by discussing the structure of a semantic theory.
A Semantic theory should be able to:
1. Explain sentences having ambiguous meaning: "The bill is large" is ambiguous in the sense that it could refer to money or to the beak of a bird.
2. Resolve the ambiguities of words in context: in "The bill is large but need not be paid", the theory should be able to disambiguate the monetary meaning of bill.
3. Identify sentences that are syntactically well-formed yet semantically meaningless.
Rule-based approaches rely on:
Dictionary lookups,
POS Tagging,
Dependency Parsing.
For Example,
Date: Thursday, Time: night, Location: Chateau Marmot, Person: Cate Blanchett
Now, we can start our discussion on Named Entity Recognition (NER),
1. Named Entity Recognition is one of the key entity detection methods in NLP.
2. Named entity recognition is a natural language processing technique that can automatically scan
entire articles and pull out some fundamental entities in a text and classify them into predefined
categories. Entities may be,
Organizations,
Quantities,
Monetary values,
Percentages, and more.
People’s names
Company names
Geographic locations (Both physical and political)
Product names
Dates and times
Amounts of money
Names of events
3. In simple words, Named Entity Recognition is the process of detecting the named entities such as
person names, location names, company names, etc from the text.
For Example,
5. With the help of named entity recognition, we can extract key information to understand the text,
or merely use it to extract important information to store in a database.
Automated Chatbots,
Content Analyzers,
Consumer Insights, etc.
Phrase Classification
In this classification step, we classify all the extracted noun phrases from the above step into their respective categories. To disambiguate locations, the Google Maps API can be a very good option, and to identify person names or company names, the open databases from DBpedia and Wikipedia can be used. Apart from this, we can also build lookup tables and dictionaries by combining information from different sources.
Entity Disambiguation
Sometimes what happens is that entities are misclassified, hence creating a validation layer on top of
the results becomes useful. The use of knowledge graphs can be exploited for this purpose. Some of
the popular knowledge graphs are:
Google Knowledge Graph,
IBM Watson,
Wikipedia, etc.
The blue cells represent the nouns. Some of these nouns describe real things present in the world.
For Example, From the above, the following nouns represent physical places on a map.
Therefore, the goal of NER is to detect and label these nouns with the real-world concepts that they
represent.
So, when we run each token present in the sentence through a NER tagging model, our sentence looks as follows:
NER systems are not just doing a simple dictionary lookup. Instead, they use the context in which a word appears in the sentence and a statistical model to guess which type of noun that particular word represents.
Since NER makes it easy to grab structured data out of the text, therefore it has tons of uses. It’s one
of the easiest methods to quickly get insightful value out of an NLP pipeline.
If you want to try out NER yourself, then refer to the link.
Sentence: Sundar Pichai, the CEO of Google Inc. is walking in the streets of California.
From the above sentence, we can identify three types of entities: (Named Entities)
But to do the same thing with the help of computers, we need to help them recognize entities first so
that they can categorize them. So, to do so we can take the help of machine learning and Natural
Language Processing (NLP).
Let’s discuss the role of both these things while implementing NER using computers:
NLP: It studies the structure and rules of language and forms intelligent systems that are capable of
deriving meaning from text and speech.
Machine Learning: It helps machines learn and improve over time.
To learn what an entity is, a NER model needs to be able to detect a word or string of words that form
an entity (e.g. California) and decide which entity category it belongs to.
So, as a concluding step we can say that the heart of any NER model is a two-step process:
So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and feed
a NER model relevant training data.
Then, by tagging some samples of words and phrases with their corresponding entities, we’ll
eventually teach our NER model to detect the entities and categorize them.
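As an illustration of this two-step idea, here is a minimal sketch using spaCy's pretrained English pipeline (our assumption; the notes do not prescribe a specific NER library) on the example sentence above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai, the CEO of Google Inc. is walking in the streets of California.")

# Each detected span comes with a label such as PERSON, ORG, or GPE (geo-political entity).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)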
Coreference resolution (CR) is the task of finding all linguistic expressions (called mentions) in a
given text that refer to the same real-world entity. After finding and grouping these mentions we can
resolve them by replacing, as stated above, pronouns with noun phrases.
Coreference resolution is an exceptionally versatile tool and can be applied to a variety of NLP tasks
such as text understanding, information extraction, machine translation, sentiment analysis, or
document summarization. It is a great way to obtain unambiguous sentences which can be much more
easily understood by computers.
Meaning Representation
The final step of semantic interpretation is to build a semantic representation, or meaning representation, that can then be manipulated by algorithms in various applications to make better sense of the text. This representation is sometimes called the deep representation.
The following two quantifiers are the basic building blocks used in such representations:
The phrase "for every x'' (sometimes "for all x'') is called a universal quantifier and is denoted by ∀x.
The phrase "there exists an x such that'' is called an existential quantifier and is denoted by ∃x.
3.SystemParadigms
It is important to get a perspective on the primary dimensions along which the problem of semantic interpretation has been tackled. The approaches generally fall into three categories: 1. System architecture, 2. Scope, 3. Coverage.
System Architectures
a.Knowledge based: These systems use a predefined set of rules or a knowledge base to obtain
a solution to a new problem.
b.Supervised :
AI Chatbots and AI Virtual Assistants using Supervised Learning are trained using data that is well-
labeled (or tagged). During training, those systems learn the best mapping function between known
data input and the expected known output. Supervised NLP models then use the best approximating
mapping learned during training to analyze unforeseen input data (never seen before) to accurately
predict the corresponding output.
Usually, Supervised Learning models require extensive and iterative optimization cycles to adjust the
input-output mapping until they converge to an expected and well-accepted level of performance.
This type of learning keeps the word "supervised" because its way of learning from training data
mimics the same process of a teacher supervising the end-to-end learning process. Supervised
Learning models are typically capable of achieving excellent levels of performance but only when
enough labeled data is available.
For example, a typical task delivered by a supervised learning model for AI chatbots / virtual assistants is the classification (via a variety of algorithms such as Support Vector Machines, Random Forests, or classification trees) of an input user utterance into a known class of user intents.
The precision achieved by those techniques is really remarkable though the shortfall is limited
coverage of intent classes to only those for which labeled data is available for training.
c.Unsupervised Learning
To overcome the limitations of Supervised Learning, academia, and industry started pivoting towards
the more advanced (but more computationally complex) Unsupervised Learning which promises
effective learning using unlabeled data (no labeled data is required for training) and no human
supervision (no data scientist or high-technical expertise is required). This is an important advantage
compared to Supervised Learning, as unlabeled text in digital form is in abundance, but labeled
datasets are usually expensive to construct or acquire, especially for common NLP tasks like PoS
tagging or Syntactic Parsing.
Unsupervised Learning models are equipped with all the needed intelligence and automation to work
on their own and automatically discover information, structure, and patterns from the data itself. This
allows for the Unsupervised NLP to shine.
Advancing AI with Unsupervised Learning
Unsupervised Learning is also used for association rules mining which aims at discovering
relationships between features directly from data. This technique is typically used to
automatically extract existing dependencies between named entities from input user utterances,
dependencies of intents across a set of user utterances part of the same user/system session, or
dependencies of questions and answers from conversational logs capturing the interactions
between users and live agents during the problem troubleshooting process.
2.Scope:
a.Domain Dependent: These systems are specific to certain domains, such as air travel
reservations or simulated football coaching.
b. Domain Independent: These systems are general enough that the techniques can be applied to multiple domains with little or no change.
3.Coverage:
a. Shallow: These systems tend to produce an intermediate representation that can then be converted to one that a machine can base its actions on.
b. Deep: These systems usually create a terminal representation that is directly consumed by a
machine or application.
Word Sense Systems:
Rule-based or knowledge-based systems use dictionary definitions of the senses of words. Much of this information is historical and cannot readily be translated and made available for building systems today, but some of the techniques and algorithms are still available.
The simplest and oldest dictionary-based sense disambiguation algorithm was introduced by Lesk.
The core of the algorithm is to choose the dictionary sense whose terms most closely overlap with the terms in the context.
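A minimal sketch of this dictionary-overlap idea using NLTK's built-in simplified Lesk implementation (assuming the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.wsd import lesk

# Pick the WordNet sense of "bill" whose gloss overlaps most with the context words.
context = "The bill is large but need not be paid".split()
sense = lesk(context, "bill")
if sense is not None:
    print(sense.name(), "-", sense.definition())  # prints the chosen synset and its gloss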
The word is mapped to the signature file structure present in the database. The signature file structure is given below:
A word signature is a fixed-length code with a fixed number of bits set to 1. The bit positions that are set to one are determined via a hash function of the word. The word signatures are ORed together to create the signature of an item.
• The block size is set at five words, the code length is 16 bits, and the number of bits that are allowed to be "1" for each word is five.
• TEXT: "Computer Science graduate students study" (assume a block size of five words).
Word | Signature
• This study used Roget's Thesaurus categories and classified unseen words into one of these 1,042 categories based on a statistical analysis of 100-word concordances for each category.
Finally, in the third step, the unseen words in the test set are classified into the category that has the maximum weight, and the information is retrieved according to the rank. Which sense of the ambiguous word is retrieved is determined by the weight and ranking calculated with Roget's formula, and a concept/thesaurus entry is generated for the specific word.
2. Supervised Learning
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data (data not seen during training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides, and predicts the output.
Text classification is an automated process of classifying text into predefined categories. We can classify emails into spam or non-spam, and news articles into different categories and clusters like Politics, Stock Market, Sports, etc. This can be done with the help of Natural Language Processing and different classification algorithms like Naive Bayes, Support Vector Machines and even neural networks in Python. The following are common types of features used in supervised learning for word sensing:
a)Features: Here we discuss a more commonly found subset of features that have been useful in
supervised learning of word sense.
b) Lexical context: the feature comprises the words and lemmas of words occurring in the entire paragraph or, more commonly, a smaller window of usually five words.
c) Parts of speech: the feature comprises the POS information for the words surrounding the word that is being sense-tagged.
d) Bag of words context: this feature comprises an unordered set of words in the context window.
e) Local collocations: local collocations are an ordered sequence of phrases near the target word that provide semantic context for disambiguation. Usually a very small window of about three tokens on each side of the target word, most often in contiguous pairs or triplets, is added as a list of features.
f)Syntactic relations: if the parse of the sentence containing the target word is available, then we can use
syntactic features.
g) Topic features: the broad topic, or domain, of the article that the word belongs to is also a good indicator of which sense of the word is likely to be most frequent in that specific domain. (A small classification sketch using bag-of-words features follows below.)
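A minimal sketch of supervised sense classification built on bag-of-words context features and a Naive Bayes classifier, using scikit-learn (an assumption on our part; the tiny training set below is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Context windows around the ambiguous word "bank", labelled with their sense.
contexts = [
    "the bank approved my loan application",       # financial institution
    "she deposited cash at the bank counter",      # financial institution
    "they had a picnic on the river bank",         # river side
    "fish swam near the muddy bank of the river",  # river side
]
senses = ["finance", "finance", "river", "river"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(contexts, senses)

print(model.predict(["he withdrew money from the bank"]))  # likely ['finance']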
Chen and Palmer observed that word-sense systems run into ambiguity and confusion in sensing a word when they are unable to identify the following:
2. Presence of subject/object: the system is unable to identify the subject or object even when given a large amount of training data.
3.Sentential complement: Sentential complementation is a kind of sentence in which one of the
arguments of a verb is a clause. That clausal argument is called a complement clause.
examples
The term complement clause is extended by some analysts to include clauses selected
by nouns or adjectives.
Examples:
The above are the rules for selecting the domain of the word; a concept thesaurus is generated accordingly with POS tags, and the proper word sense is retrieved.
3. Unsupervised learning:
As discussed earlier, unsupervised learning promises effective learning using unlabeled data (no labeled data is required for training) and no human supervision (no data scientist or highly technical expertise is required). This is an important advantage compared to supervised learning, as unlabeled text in digital form is abundant, but labeled datasets are usually expensive to construct or acquire, especially for common NLP tasks like PoS tagging or syntactic parsing.
Unsupervised learning models are equipped with all the needed intelligence and automation to work on their own and automatically discover information, structure, and patterns from the data itself.
Clustering is typically used to automatically group semantically similar user utterances together
to accelerate the derivation and verification of an underneath common user intent (notice
derivation of a new class, not classification into an existing class).
Unsupervised learning is also used for association-rule mining, which aims at discovering relationships between features directly from data. This technique is typically used to automatically extract dependencies between named entities in input user utterances, dependencies of intents across a set of user utterances within the same user/system session, or dependencies of questions and answers from conversational logs.
In unsupervised learning, for a specific topic domain the depth of the tree is calculated by conceptual density (CD).
The formula is given below:
The depth of the tree is examined and the best word sense is retrieved; the ambiguity is resolved by examining the depth of the word with reference to resources like WordNet (antonyms, synonyms and meronyms).
Software:
Several software programs are made available by the research community for word sense disambiguation. A few are listed below.
Predicate argument structure is based on the function features of lexical items (most often verbs). The
function features determine the thematic roles to be played by the other words in the sentence.
However, function features and thematic roles don't always coincide
Shallow semantic parsing, or semantic role labelling, is the process of identifying the various arguments of predicates in a sentence.
In linguistics, predicate refers to the main verb in the sentence. Predicate takes arguments.
The role of Semantic Role Labelling (SRL) is to determine how these arguments are
semantically related to the predicate.
Examples
The sun (subject) / was shining brightly (predicate).
The dogs (subject) / were barking loudly (predicate).
The pretty girl (subject) / was wearing a blue frock (predicate).
My younger brother (subject) / serves in the army (predicate).
The man and his wife (subject) / were working in their garden (predicate).
Generally, this process can be defined as the identification of who did what to whom,
where, why and how. This is shown with help of diagram
These two grammar structures are used to identify semantic role Labeling (subject and predicate):
Phrase structure grammar, also known as constituency grammar, is a way of representing the syntactic
structure of natural language sentences using hierarchical trees(refer to unit 2 )
In natural language processing (NLP), phrase structure grammar can be used to analyze, parse, and generate
natural language texts and semantic role labeling uses this structure
FrameNet : is a linguistic knowledge graph containing information about lexical and predicate
argument semantics of the English language. FrameNet contains two distinct entity classes: frames
and lexical units, where a frame is a meaning and a lexical unit is a single meaning for a word.
FrameNet is based on the theory of frame semantics, where a given predicate invokes a semantic frame, thus instantiating some or all of the possible semantic roles belonging to that frame.
FrameNet contains frame-specific semantic annotation of a number of predicates in English.
It contains tagged sentences extracted from British National Corpus (BNC: Search the British National
Corpus online. Various online services offer the possibility to search and explore the BNC via different
interfaces which is a trusted search platform).
The process of FrameNet annotation consists of identifying specific semantic frames and
creating a set of frame-specific roles called frame elements.
Then, a set of predicates that instantiate the semantic frame, irrespective of their grammatical
category, are identified, and a variety of sentences are labelled for those predicates.
The labelling process entails identifying the frame that an instance of the predicate lemma invokes, then identifying semantic arguments for that instance and tagging them with one of the predetermined set of frame elements for that frame.
The combination of the predicate lemma and the frame that its instance invokes is called a
lexical unit (LU).
This is therefore the pairing of a word with its meaning.
Using: The child frame presupposes the parent frame as background, e.g. the "Speed" frame "uses" (or presupposes) the "Motion" frame; however, not all parent FEs need to be bound to child FEs.
Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the "Criminal_process" frame has subframes of "Arrest", "Arraignment", "Trial", and "Sentencing".
Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the "Hiring" and "Get_a_job" frames, which perspectivize the "Employment_start" frame from the Employer's and the Employee's point of view, respectively.
Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database
similar to a thesaurus, grouping together semantically similar words.
Resources of framenet
The Berkeley FrameNet project is creating an on-line lexical resource for English,
based on frame semantics and supported by corpus evidence. The aim is to
document the range of semantic and syntactic combinatory possibilities (valences) of
each word in each of its senses, through computer-assisted annotation of example
sentences and automatic tabulation and display of the annotation results. The major
product of this work, the FrameNet lexical database, currently contains more than
10,000 lexical units (defined below), more than 6,100 of which are fully annotated, in
more than 825 semantic frames, exemplified in more than 135,000 annotated
sentences. It has gone through three releases, and is now in use by hundreds of
researchers, teachers, and students around the world (see FrameNet Users). Active
research projects are now seeking to produce comparable frame-semantic lexicons
for other languages and to devise means of automatically labeling running text with
semantic frame information.
2. PropBank (Proposition Bank):
PropBank was developed with the idea of serving as training data for machine learning-based semantic role labeling systems.
Representation of PropBank
Example: for a frame such as Commerce, PropBank can connect to the verbs that realise it in different scenarios. For each probable verb to which the frame can connect, a concept thesaurus for that frame is generated and linked to the verb's different argument structures. The framesets for a particular concept are identified, the associated verb is looked up in PropBank according to its argument structure, and from there it maps onto the proper treebank entry from which the information is retrieved.
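As an illustration of PropBank-style labelling, here is a hedged sketch written as a plain Python dictionary; the sentence, the roleset id buy.01 and the role glosses are our own example, not taken from the notes:

# An invented PropBank-style annotation: the predicate roleset plus numbered arguments.
srl_annotation = {
    "sentence": "John bought a car from Mary",
    "predicate": "buy.01",      # PropBank roleset (frameset) id for the verb "buy"
    "arguments": {
        "ARG0": "John",         # buyer
        "ARG1": "a car",        # thing bought
        "ARG2": "Mary",         # seller
    },
}
print(srl_annotation["predicate"], srl_annotation["arguments"])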
1. https://wasp-sweden.org/ai-graduate-school-courses/
2. KRISP: https://www.cs.utexas.edu/~ml/krisp/
3. CHILL: http://www.cs.utexas.edu/ml/hill
Discourse Processing:
Language Modeling:
Introduction, N-Gram Models, Language Model Evaluation, Parameter Estimation, Language Model
Adaptation, Types of Language Models, Language-Specific Modeling Problems, Multilingual and
Crosslingual Language Modeling
Cohesion is a term in linguistics that refers to how the structure and content of a sentence or text is
linked together to create meaning.
Cohesion needs to be achieved in a sentence, within a paragraph and across paragraphs for a text to
make sense.
It means related ideas are kept together and flow logically from one to another
It helps to express their ideas in a way that the reader will clearly understand
They must then write appropriate sentences, which organise their ideas and follow a logical sequence.
Adverbials
Expressing place, time or manner, adverbials can help create cohesion in or across paragraphs. Take a
look at this example:
Without adverbials: "Tim left home. He realised he had forgotten his homework."
With adverbials:
"Tim left home early in the morning. Later that day, he realised he had forgotten his homework."
The adverbial of time keeps the sentences in a logical order and makes it easier for the reader to follow
the text.
Ellipsis
Ellipsis is another cohesive device that can improve the flow of a sentence. Ellipsis refers to the removal
of superfluous words, as the meaning can be inferred from the preceding clause. For example:
Without ellipsis:
With ellipsis:
In this case, the sentence with ellipsis avoids repeating words, but still makes sense.
Repetition
Repetition can also be used as a cohesive device to highlight or emphasise important information. For
example:
“Peter was running late. He had promised not to be late this time. If only he could get to places on time!”
In this example, the repetition of the adjective/adverb ‘late’ helps create a sense of urgency across the
sentences.
Pronouns
Using pronouns in a sentence allows you to refer back to a noun without having to repeat it. For example:
With pronoun:
In this example, using a pronoun helps the sentence flow more smoothly, while still making sense.
This is also known as an anaphoric reference. A cataphoric reference is the opposite of this, referring to
something later in a text. For example:
“Jess couldn’t wait to see him, but Jack would not be back until next week.”
Remember...
It is also important to use the same tense throughout a text in order for it to achieve cohesion.
The most difficult problem of AI is to process the natural language by computers or in other words natural
language processing is the most difficult problem of artificial intelligence. If we talk about the major
problems in NLP, then one of the major problems in NLP is discourse processing − building theories and
models of how utterances stick together to form coherent discourse. Actually, language always consists of collocated, structured and coherent groups of sentences rather than isolated and unrelated sentences. These coherent groups of sentences are referred to as discourse.
Concept of Coherence
Coherence and discourse structure are interconnected in many ways. Coherence, along with property of
good text, is used to evaluate the output quality of natural language generation system. The question that
arises here is: what does it mean for a text to be coherent? Suppose we collected one sentence from every page of the newspaper; would that be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties −
Discourse structure
An important question regarding discourse is what kind of structure the discourse must have. The answer to this question depends upon the segmentation we apply to the discourse. Discourse segmentation may be defined as determining the types of structures for large discourse. It is quite difficult to implement discourse segmentation, but it is very important for applications such as information retrieval, text summarization and information extraction.
Text Coherence
Lexical repetition is one way to find structure in a discourse, but it does not by itself satisfy the requirement of coherent discourse. To achieve coherent discourse, we must focus on coherence relations specifically. As we know, a coherence relation defines the possible connection between utterances in a discourse. Hobbs proposed the following kinds of relations −
We are taking two terms S0 and S1 to represent the meaning of the two related sentences −
Result
It infers that the state asserted by term S0 could cause the state asserted by S1. For example, two
statements show the relationship result: Ram was caught in the fire. His skin burned.
Explanation
It infers that the state asserted by S1 could cause the state asserted by S0. For example, two statements
show the relationship − Ram fought with Shyam’s friend. He was drunk.
Parallel
It infers p(a1,a2,…) from assertion of S0 and p(b1,b2,…) from assertion S1. Here ai and bi are similar for
all i. For example, two statements are parallel − Ram wanted car. Shyam wanted money.
Elaboration
It infers the same proposition P from both the assertions − S0 and S1 For example, two statements show
the relation elaboration: Ram was from Chandigarh. Shyam was from Kerala.
Occasion
It happens when a change of state can be inferred from the assertion of S0, the final state of which can be inferred from S1.
Reference Resolution
Interpretation of the sentences from any discourse is another important task and to achieve this we need
to know who or what entity is being talked about. Here, interpretation reference is the key
element. Reference may be defined as the linguistic expression to denote an entity or individual. For
example, in the passage "Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him", the linguistic expressions Ram, his and he are references.
On the same note, reference resolution may be defined as the task of determining what entities are
referred to by which linguistic expression.
Pronouns
It is a form of definite reference. For example, Ram laughed as loud as he could. The word he represents
pronoun referring expression.
Demonstratives
These demonstrate and behave differently than simple definite pronouns. For example, this and that are
demonstrative pronouns.
Names
It is the simplest type of referring expression. It can be the name of a person, an organization or a location. For example, in the above examples, Ram is the name referring expression.
Coreference Resolution
It is the task of finding referring expressions in a text that refer to the same entity. In simple words, it is the task of clustering together all mentions that refer to the same entity.
N-Gram Models
A language model assigns a probability to a sequence of words. But how do we calculate it? The answer lies in the chain rule of probability:
P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) ... P(X_n | X_1, X_2, ..., X_{n-1})
P(w_1 w_2 ... w_n) = \prod_i P(w_i | w_1 w_2 ... w_{i-1})
For a unigram model: P(w_1 w_2 ... w_n) ≈ \prod_i P(w_i)
For a bigram model: P(w_1 w_2 ... w_n) ≈ \prod_i P(w_i | w_{i-1})
An N-gram is a sequence of N words in NLP modeling. Consider the example statement "I love reading history books and watching documentaries". In a one-gram or unigram model there are one-word sequences; for the above statement these are "I", "love", "reading", and so on. Given the previous N-1 words, the N-gram model predicts the most frequent words that can follow the
sequences. The model is the probabilistic language model which is trained on the collection of
the text. This model is useful in applications i.e. speech recognition, and machine translations. A
simple model has some limitations that can be improved by smoothing, interpolations, and back
off. So, the N-gram language model is about finding probability distributions over the sequences
of the word. Consider the sentences i.e. "There was heavy rain" and "There was heavy flood". By
using experience, it can be said that the first statement is good. The N-gram language model tells
that the "heavy rain" occurs more frequently than the "heavy flood". So, the first statement is
more likely to occur and it will then be selected by this model. In the one-gram (unigram) model, the model relies only on how often a word occurs, without considering the previous words. In a 2-gram (bigram) model, only the previous word is considered when predicting the current word.
As it is not practical to calculate the full conditional probability, using the "Markov Assumption" it is approximated by the bi-gram model as [4]:
P (“There was heavy rain”) ~ P (“There”) P (“was” |“'There”) P (“heavy” |“was”) P (“rain”
|“heavy”)
In speech recognition, the input can be noisy. This noise can make a wrong speech to the text
conversion. The N-gram language model corrects the noise by using probability knowledge.
Likewise, this model is used in machine translations for producing more natural statements in
target and specified languages. For spelling error correction, the dictionary alone is sometimes not enough. For instance, in "in about fifteen minuets", 'minuets' is a valid word according to the dictionary but it is incorrect in the phrase. The N-gram language model can rectify this type of error.
The N-gram language model is generally at the word levels. It is also used at the character levels
for doing the stemming i.e. for separating the root words from a suffix. By looking at the N-gram
model, the languages can be classified or differentiated between the US and UK spellings. Many
applications get benefit from the N-gram model including tagging of part of the speech, natural
language generations, word similarities, and sentiments extraction. [4].
One of the main challenges of language modeling benchmarks is ensuring that the
data used to train and test the models is of high quality and diversity. Data quality
refers to the accuracy, completeness, and consistency of the data, while data
diversity refers to the variety and coverage of the data in terms of domains, genres,
languages, and styles. Poor data quality and diversity can lead to overfitting, bias,
and generalization errors in language models. For example, some benchmarks may
contain noisy, outdated, or irrelevant data that do not reflect the real-world usage of
language. Some benchmarks may also be skewed towards certain topics, domains,
or languages that limit the applicability and robustness of language models to other
scenarios.
There are many different forms of multilingual NLP, but in general, it enables computational
software to understand the language of certain texts, along with contextual nuances. Multilingual
NLP is also capable of obtaining specific data and delivering key insights. In short, multilingual NLP technology makes the impossible possible: processing and analyzing large amounts of text across many languages.
One of the biggest obstacles preventing multilingual NLP from scaling quickly is relating to low
availability of labelled data in low-resource languages.
Among the 7,100 languages that are spoken worldwide, each of them has its own linguistic rules
and some languages simply work in different ways. For instance, there are undeniable
similarities between Italian, French and Spanish, whilst on the other hand, these three languages
are totally different from a specific Asian language group, that is Chinese, Japanese, and Korean
which share some similar symbols and ideographs.
The outcome from this leads to the need to have various techniques to generate language models
that can work with all these languages. In short, different languages often require different vector
spaces, even if there are existing pre-trained language embeddings.
Even though pre-trained word embeddings in different languages exist, it is possible that all of
them are in different vector spaces. This means that similar words can signify different vector
representations, basically due to the natural characteristics of a certain language.
This is why scaling multilingual NLP applications can be challenging. They use large amounts of
labelled data, process it, learn patterns, and generate prediction models. When building NLP on a
text comprising different languages, it is best to consider multilingual NLP.
When we need to build NLP on a text containing different languages, we may look at
multilingual word embeddings for NLP models that have the potential to scale effectively.
First of all, when machines can analyze a language, they encode and decode not only the
linguistic structure but also the culture that this language is connected. On the other hand, the
world is far from being a place where nations live in isolation, it is very global. The whole
world interacts with each other. This increases the need for multilingual NLP, and this need shows up in many areas, for example:
· Finance: review and analysis of financial records in different languages, credit processes
Many more important aspects and needs, such as the use of multilingual NLP can be given
as examples of Multilingual NLP tasks.
The first suggested solution is to train an NLP model for a specific language. A well-known
example would be a few new versions of Bidirectional Encoder Representations from
Transformers (BERT) that have been trained in numerous languages.
However, the biggest problem with this approach is that it scales poorly. It takes a lot of time and money to train a new model, let alone many models. NLP systems require various large models, so the process can be very expensive and time-consuming.
This technique also does not scale effectively in terms of inference. Using NLP in different
languages means the business would have to sustain different models and provision several
servers and GPUs. Again, this can be extremely costly for the business.
The past years have seen that new emerging multilingual NLP models can be incredibly
accurate, at times even more accurate than specific, dedicated non-English language models.
Whilst there are several high-quality pre-trained models for text classification, so far there has
not been a multilingual model for text generation with impressive performance.
The last solution some businesses benefit from is to use translation. Companies can translate
their non-English content to English, provide the NLP model with that English content, then
translate the result back to the needed language.
Although studies in the field of NLP, or Natural Language Processing, have been carried out for years, most of that work has been in English. The vast majority of sentences that machines could understand, or rather perceive and encode, were in English. Breaking this English-
dominated orientation and enabling machines to perceive and encode almost every language that
exists in a global manner is called Cross-Lingual Language studies.
Cross-Lingual Language NLP is a very difficult and complex process. The reason for this
complexity and difficulty lies in the fundamental differences between languages. All of the
more than 5000 languages spoken around the world have different rules and vectors. So
machines need to be trained to recognize these languages
Many different models make this possible. Different approaches and diversity in the field
increase the chances of success. But some models stand out because they are more popular or
more successful than others. The most important factor in the popularity of these models is their
ease of use. Models that can be developed without the need for too much time and financial
resources are of course preferred by many people. Let's take a brief look at some of these popular
models.
· Masked Language Modeling (MLM): Here, the model randomly masks 15% of the words in the sentence or text, and the model has to predict these masked words. This distinguishes it from other ways of working, such as RNNs, because it does not learn the words one after the other. (A small sketch follows after this list.)
· Next Sentence Prediction (NSP): Here the model combines two masked sentences as
input. These sentences may or may not be ordered in the text. The model then needs to predict
whether these sentences follow one after the other.
Through them, the model gains insight into how languages work. It can perceive languages
without human supervision.
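A minimal sketch of masked-word prediction using the Hugging Face transformers library with the multilingual BERT checkpoint bert-base-multilingual-cased (both are our assumptions; the notes do not prescribe a specific toolkit):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The model predicts the token hidden behind [MASK], exactly as in Masked Language Modeling.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))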
XLM
XLM is a model that combines several pre-training objectives: Causal Language Modeling (CLM), Masked Language Modeling (MLM) and Translation Language Modeling (TLM).
XLM uses two different kinds of pre-training, which can be divided into unsupervised (monolingual text only) and supervised (parallel data, with one sentence in a source language and the other in a target language). XLM is released with many different pretrained checkpoints from which the appropriate one can be chosen.
MultiFiT
MultiFiT is a model that works differently from the other two models: it is based on subword tokenization.
Let's briefly explain subword tokenization. Morphology is the study of the structure, formation, and inflection of words. Therefore, working only on whole words does not give accurate results in morphologically rich languages.
The tokenization of these subwords allows the machines to detect words that are not very
common.