Unit - 1 Introduction
INTRODUCTION
1. Contextual words and phrases and homonyms
The same words and phrases can have different meanings according to the context of a sentence, and many words – especially in English – have the exact same pronunciation but totally different meanings.
For example:
I ran to the store because we ran out of milk.
Can I run something past you real quick?
The house is looking really run down.
These are easy for humans to understand because we read the context of the sentence and we understand
all of the different definitions. And, while NLP language models may have learned all of the definitions,
differentiating between them in context can present problems.
Homonyms – two or more words that are pronounced the same but have different definitions – can be
problematic for question answering and speech-to-text applications because spoken input provides no
written form to distinguish them. Usage of their and there, for example, is a common problem even for humans.
2. Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many different words
to express the same idea. Furthermore, some of these words may convey exactly the same meaning,
while others may differ in degree (small, little, tiny, minute), and different people use synonyms
to denote slightly different meanings within their personal vocabulary.
So, for building NLP systems, it’s important to include all of a word’s possible meanings and all
possible synonyms. Text analysis models may still occasionally make mistakes, but the more relevant
training data they receive, the better they will be able to understand synonyms.
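One common way to enumerate a word's senses and synonyms is WordNet, available through NLTK. Below is a minimal sketch, assuming the NLTK data can be downloaded; the word "small" is an arbitrary choice.

    import nltk
    nltk.download("wordnet", quiet=True)  # one-time corpus download

    from nltk.corpus import wordnet as wn

    synonyms = set()
    for synset in wn.synsets("small"):       # each synset is one sense of "small"
        for lemma in synset.lemma_names():   # lemmas within a synset are synonyms
            synonyms.add(lemma.replace("_", " "))

    print(sorted(synonyms))  # includes e.g. 'little', 'minor', 'modest'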
3. Irony and sarcasm (Verbal irony is a figure of speech that communicates the opposite of what is
said, while sarcasm is a form of irony that is directed at a person, with the intent to criticize.)
Irony and sarcasm present problems for machine learning models because they generally use words
and phrases that, strictly by definition, may be positive or negative, but actually connote the opposite.
Models can be trained on certain cues that frequently accompany ironic or sarcastic phrases, like
"yeah right" or "whatever", and on word embeddings (where words that have similar meanings have a
similar representation), but it's still a tricky process.
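As a sketch of the word-embedding idea, gensim's Word2Vec can be trained so that words appearing in similar contexts get similar vectors. The toy corpus and parameters below are purely illustrative; a usable model needs far more data.

    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens (illustrative only).
    corpus = [
        ["yeah", "right", "that", "went", "great"],
        ["whatever", "that", "was", "awesome"],
        ["the", "movie", "was", "great"],
        ["the", "movie", "was", "awesome"],
    ]

    model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=200)

    # Words used in similar contexts end up with similar vectors.
    print(model.wv.most_similar("great", topn=3))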
4. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.
A classic example is "I saw the man with the telescope": did the speaker use the telescope, or was the man holding it? Even for humans, such a sentence is difficult to interpret without the context of surrounding text.
POS (part of speech) tagging is one NLP solution that can help solve the problem, somewhat.
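As a quick sketch of POS tagging with NLTK (assuming its tokenizer and tagger data are installed), the tagger assigns each word one part of speech, although it cannot by itself settle the attachment ambiguity above.

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("I saw the man with the telescope")
    print(nltk.pos_tag(tokens))
    # [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('man', 'NN'),
    #  ('with', 'IN'), ('the', 'DT'), ('telescope', 'NN')]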
5. Errors in text and speech
Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.
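As one simplistic illustration, Python's standard difflib can suggest corrections for common misspellings by matching against a known vocabulary; the vocabulary below is a stand-in for a real dictionary.

    import difflib

    vocabulary = ["received", "separate", "definitely", "occurrence"]

    for typo in ["recieved", "seperate", "definately"]:
        # get_close_matches ranks candidates by sequence similarity
        suggestion = difflib.get_close_matches(typo, vocabulary, n=1, cutoff=0.8)
        print(typo, "->", suggestion)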
With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine
to understand. However, as language databases grow and smart assistants are trained by their individual
users, these issues can be minimized.
6. Colloquialisms and slang
Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP
– especially for models intended for broad use. Unlike formal language, colloquialisms may have
no "dictionary definition" at all, and these expressions may even have different meanings in different
geographic areas. Furthermore, cultural slang is constantly morphing and expanding, so new words pop
up every day.
This is where training and regularly updating custom models can be helpful, although it oftentimes
requires quite a lot of data.
7. Domain-specific language
Different businesses and industries often use very different language. An NLP model built
for healthcare, for example, would be very different from one used to process legal documents. These
days there are a number of analysis tools trained for specific fields, but extremely niche
industries (a specialized segment of the market for a particular kind of product or service) may need to build
or train their own models.
8. Low-resource languages
Machine learning NLP applications have largely been built for the most common, widely used
languages, and it's downright amazing how accurate translation systems for them have become. However,
many languages, especially those spoken by people with less access to technology, often go overlooked
and underprocessed. For example, by some estimations (depending on what counts as a language vs. a dialect),
there are over 3,000 languages in Africa alone. There simply isn't very much data on many of these languages.
However, new techniques, like multilingual transformers (such as Google's BERT, "Bidirectional
Encoder Representations from Transformers") and multilingual sentence embeddings, aim to identify
and leverage universal similarities that exist between languages.
Machine learning requires a lot of data to perform at its best – billions of pieces of training data.
The more data NLP models are trained on, the smarter they become. That said, data (and human
language!) is only growing by the day, as are new machine learning techniques and custom algorithms.
All of the problems above will require more research and new techniques in order to improve on them.
Advanced practices like artificial neural networks and deep learning allow a multitude of NLP
techniques, algorithms, and models to work progressively, much like the human mind does. As they
grow and strengthen, we may have solutions to some of these challenges in the near future.
There are different types of n-gram models, such as:
unigrams, which evaluate each word independently;
bigrams, which consider the probability of a word given the previous word;
trigrams, which consider the probability of a word given the two previous words; and so on.
N-grams are relatively simple and efficient, but they do not consider the long-term context of the
words in a sequence.
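To make the bigram case concrete, here is a minimal sketch that estimates P(next word | previous word) from raw counts over a toy corpus; real models add smoothing for unseen bigrams, which is omitted here.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate".split()

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def bigram_prob(prev, nxt):
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][nxt] / total if total else 0.0

    print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once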
The parenthesis and number operators can also be used to specify that a certain string or expression must occur
twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”,
where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses.
So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they
will be.
The number operator can be used with other numbers:
if you match two different sets of parentheses, \2 means whatever matched the second set. For example
/the (.*)er they (.*), the \1er they \2/
will match The bigger they were, the bigger they were but not The bigger they were, the bigger they will be.
These numbered memories are called registers (e.g., register 1, register 2, register 3, etc.). This
memory feature is not part of every regular expression language and is often considered an “extended” feature
of regular expressions.
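The same capture-and-backreference behavior can be tried in Python's re module, which uses the identical \1 and \2 syntax; a minimal sketch:

    import re

    pattern = r"the (.*)er they (.*), the \1er they \2"

    print(bool(re.search(pattern, "the bigger they were, the bigger they were")))    # True
    print(bool(re.search(pattern, "the bigger they were, the bigger they will be"))) # False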
Since multiple substitutions can apply to a given input, substitutions are assigned a rank
and applied in order.
1.4 Finite State Automata
[The transition diagram and transition table for this section's example did not survive extraction; only part of the worked solution remains.]
Solution:
States: Q = {q0, q1, q2, q3}
Initial state: q0
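Since the original transition table is lost, the following is a minimal sketch of a hypothetical DFA over the same four states; this machine (an assumption, not the original example) accepts strings over {a, b} that end in "abb".

    # Transition table as a dict: (state, input symbol) -> next state.
    TRANSITIONS = {
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q1", ("q1", "b"): "q2",
        ("q2", "a"): "q1", ("q2", "b"): "q3",
        ("q3", "a"): "q1", ("q3", "b"): "q0",
    }
    ACCEPTING = {"q3"}  # hypothetical accepting state

    def accepts(string):
        state = "q0"  # initial state, as in the solution above
        for symbol in string:
            state = TRANSITIONS[(state, symbol)]
        return state in ACCEPTING

    print(accepts("aabb"))  # True: ends in "abb"
    print(accepts("abba"))  # False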
Deletion – This means that a letter or letters should be missing from the string.
Insertion – This means that a certain letter or letters need to be added.
Substitution – This means that a certain letter or letters in the string need to be replaced by
another letter or letters.
Transposition – This means that all the letters are present but in a different order. The letters
in the string could be swapped with each other to get a meaningful word.
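A standard way to combine all four operations is Damerau-Levenshtein distance. Below is a minimal sketch of its "optimal string alignment" variant; the test words are illustrative.

    def edit_distance(a, b):
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i              # i deletions
        for j in range(n + 1):
            d[0][j] = j              # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(
                    d[i - 1][j] + 1,          # deletion
                    d[i][j - 1] + 1,          # insertion
                    d[i - 1][j - 1] + cost,   # substitution (or match)
                )
                # transposition of two adjacent letters
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[m][n]

    print(edit_distance("caht", "chat"))   # 1: a single transposition
    print(edit_distance("spel", "spell"))  # 1: a single insertion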
Semantic similarity refers to the degree to which two pieces of text share the same meaning. Whereas
lexical similarity focuses on the structure and surface resemblance of words and phrases, semantic
similarity delves into the understanding and meaning of the content. The aim is to measure how closely
related or analogous the concepts, ideas, or information conveyed in two texts are.
Example: The word 'crash' can mean an accident, a drop in the stock market, or attending
a party without an invitation.
In NLP, semantic similarity is used in various tasks such as:
1. Question answering – Enhances QA systems by measuring semantic similarity between user queries and document content.
2. Recommendation systems – Matches users to available content via semantic similarity.
3. Summarization – Helps in identifying and condensing similar content.
4. Corpus clustering – Helps in grouping documents with similar content.
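As a sketch of how semantic similarity is computed in practice, the snippet below uses the sentence-transformers library with one commonly used small model ("all-MiniLM-L6-v2"); the query and documents are illustrative. Note that the top match shares almost no surface words with the query.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "How do I reset my password?"
    documents = [
        "Steps to recover a forgotten login credential.",
        "The stock market crashed yesterday.",
    ]

    # Embed the query and documents, then score documents by cosine similarity.
    scores = util.cos_sim(model.encode(query), model.encode(documents))[0]
    for doc, score in zip(documents, scores):
        print(f"{float(score):.2f}  {doc}")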