
UNIT – 1

INTRODUCTION

Origins and Challenges of NLP:


Artificial intelligence has become part of our everyday lives – Alexa and Siri, text and email
autocorrect, customer service chatbots. They all use machine learning algorithms and Natural
Language Processing (NLP) to process, “understand”, and respond to human language, both written
and spoken.


Natural Language Processing (NLP) Challenges


NLP is a powerful tool with huge benefits, but there are still a number of Natural Language Processing
limitations and problems:

 Contextual words and phrases and homonyms
 Synonyms
 Irony and sarcasm
 Ambiguity
 Errors in text or speech
 Colloquialisms and slang
 Domain-specific language
 Low-resource languages
 Lack of research and development

1. Contextual words and phrases and homonyms

The same words and phrases can have different meanings according to the context of a sentence, and many words – especially in English – have the exact same pronunciation but totally different meanings.

For example:

I ran to the store because we ran out of milk.


Can I run something past you real quick?

The house is looking really run down.

These are easy for humans to understand because we read the context of the sentence and we understand
all of the different definitions. And, while NLP language models may have learned all of the definitions,
differentiating between them in context can present problems.

Homonyms – two or more words that are pronounced the same but have different definitions – can be
problematic for question answering and speech-to-text applications because they aren’t written in text
form. Usage of their and there, for example, is even a common problem for humans.

2. Synonyms

Synonyms can lead to issues similar to contextual understanding because we use many different words to express the same idea. Furthermore, some of these words may convey exactly the same meaning, while others convey different degrees of the same idea (small, little, tiny, minute), and different people use synonyms to denote slightly different meanings within their personal vocabulary.

So, for building NLP systems, it’s important to include all of a word’s possible meanings and all
possible synonyms. Text analysis models may still occasionally make mistakes, but the more relevant
training data they receive, the better they will be able to understand synonyms.

3. Irony and sarcasm (Verbal irony is a figure of speech that communicates the opposite of what is
said, while sarcasm is a form of irony that is directed at a person, with the intent to criticize.)

Irony and sarcasm present problems for machine learning models because they generally use words
and phrases that, strictly by definition, may be positive or negative, but actually connote the opposite.

Models can be trained with certain cues that frequently accompany ironic or sarcastic phrases, like
“yeah right,” “whatever” etc., and word embeddings (where words that have the same meaning have a
similar representation), but it’s still a tricky process.

4. Ambiguity

Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.

 Lexical ambiguity: a word that could be used as a verb, noun, or adjective.
 Semantic ambiguity: the interpretation of a sentence in context. For example: I saw the boy on the beach with my binoculars. This could mean that I saw a boy through my binoculars, or that the boy had my binoculars with him.
 Syntactic ambiguity: in the sentence above, this is what creates the confusion of meaning. The phrase with my binoculars could modify the verb, "saw," or the noun, "boy."

Even for humans this sentence alone is difficult to interpret without the context of surrounding text.
POS (part of speech) tagging is one NLP solution that can help solve the problem, somewhat.
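As a quick illustration, part-of-speech tags for the ambiguous sentence above can be obtained with an off-the-shelf tagger. The sketch below uses NLTK and assumes the package and its tokenizer/tagger resources are installed; the exact tags may vary by model version.

# A minimal POS-tagging sketch with NLTK (assumes nltk and its
# tokenizer/tagger resources are installed; tags may vary by version).
import nltk

sentence = "I saw the boy on the beach with my binoculars."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Typical output: [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('boy', 'NN'), ...]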
5. Errors in text and speech

Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.

With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine
to understand. However, as language databases grow and smart assistants are trained by their individual
users, these issues can be minimized.

6. Colloquialisms and slang (Colloquialism is informal, everyday language that is used by a specific geographical region)

Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP – especially for models intended for broad use. Unlike formal language, colloquialisms may have no “dictionary definition” at all, and these expressions may even have different meanings in different geographic areas. Furthermore, cultural slang is constantly morphing and expanding, so new words pop up every day.

This is where training and regularly updating custom models can be helpful, although it oftentimes
requires quite a lot of data.

Colloquial Expressions List

Colloquial Expression | Formal Equivalent          | Meaning
Wanna                 | Want to                    | Want to
Gonna                 | Going to                   | Going to
Kinda                 | Kind of                    | Kind of
Gotta                 | Got to                     | Have to
Lemme                 | Let me                     | Let me
Cuz                   | Because                    | Because
Y’all                 | You all                    | You all
Ain’t                 | Am not / Is not / Are not  | Am not / Is not / Are not
Dunno                 | Don’t know                 | Don’t know
Outta                 | Out of                     | Out of
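As a simple illustration, the table above can be turned into a small lookup-based normalizer. The dictionary and function below are hypothetical helpers, not a standard library API; real systems use much richer resources.

# Hypothetical lookup-based normalizer built from the table above.
COLLOQUIAL_TO_FORMAL = {
    "wanna": "want to", "gonna": "going to", "kinda": "kind of",
    "gotta": "got to", "lemme": "let me", "cuz": "because",
    "y'all": "you all", "ain't": "am not", "dunno": "don't know",
    "outta": "out of",
}

def normalize(text):
    # Replace each known colloquial token with its formal equivalent.
    return " ".join(COLLOQUIAL_TO_FORMAL.get(w, w) for w in text.lower().split())

print(normalize("I dunno if y'all gonna make it"))
# i don't know if you all going to make it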

7. Domain-specific language

Different businesses and industries often use very different language. An NLP model needed for healthcare, for example, would be very different from one used to process legal documents. These days, however, there are a number of analysis tools trained for specific fields, but extremely niche industries (a specialized segment of the market for a particular kind of product or service) may need to build or train their own models.
8. Low-resource languages

Machine learning NLP applications have largely been built for the most common, widely used languages, and it is remarkable how accurate translation systems have become. However, many languages, especially those spoken by people with less access to technology, often go overlooked and underprocessed. For example, by some estimations (depending on what counts as a language vs. a dialect), there are over 3,000 languages in Africa alone. There simply isn’t very much data on many of these languages.

However, new techniques, like multilingual transformers (using Google’s BERT “Bidirectional
Encoder Representations from Transformers”) and multilingual sentence embeddings aim to identify
and leverage universal similarities that exist between languages.

9. Lack of research and development

Machine learning requires a lot of data to function to its outer limits – billions of pieces of training data.
The more data NLP models are trained on, the smarter they become. That said, data (and human
language!) is only growing by the day, as are new machine learning techniques and custom algorithms.
All of the problems above will require more research and new techniques in order to improve on them.

Advanced practices like artificial neural networks and deep learning allow a multitude of NLP
techniques, algorithms, and models to work progressively, much like the human mind does. As they
grow and strengthen, we may have solutions to some of these challenges in the near future.
There are different types of n-gram models such as
 unigrams that evaluate each word independently;
 bigrams that consider the probability of a word given the previous word;
 trigrams that consider the probability of a word given the two previous words; and so on.

N-grams are relatively simple and efficient, but they do not consider the long-term context of the
words in a sequence.
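A minimal sketch of a bigram model, estimating P(word | previous word) from raw counts over a toy corpus (no smoothing, purely illustrative):

# Bigram probabilities from raw counts: P(word | prev) = C(prev, word) / C(prev)
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))   # 2/3: "the" is followed by "cat" twice and "mat" once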
The parenthesis and number operators can also be used to specify that a certain string or expression must occur
twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”,
where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses.
So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they
will be.
The number operator can be used with other numbers:
If you match two different sets of parentheses, \2 means whatever matched the second set. For example
/the (.*)er they (.*), the \1er they \2/
will match: The bigger they were, the bigger they were but not the bigger they were, the bigger they will be.
These numbered memories are called registers (e.g., register 1, register 2, register 3, etc.). This
memory feature is not part of every regular expression language and is often considered an “extended” feature
of regular expressions.
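The same register behaviour can be reproduced with Python’s re module, which supports back-references; a small sketch:

# Back-reference \1 in Python's re module: both X's must match the same string.
import re

pattern = r"the (.*)er they were, the \1er they will be"
print(bool(re.search(pattern, "The bigger they were, the bigger they will be", re.I)))  # True
print(bool(re.search(pattern, "The bigger they were, the faster they will be", re.I)))  # False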

Substitutions using memory are useful in implementing simple natural-language understanding programs like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian psychologist by carrying on conversations like the following:
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.
ELIZA works by having a cascade of regular expression substitutions, each of which matches some part of the input line and changes it. The first substitutions change all instances of my to YOUR, and I’m to YOU ARE, and so on. The next set of substitutions matches and replaces other patterns in the input. Here are some examples:
s/.* YOU ARE (depressed | sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Since multiple substitutions can apply to a given input, substitutions are assigned a rank
and applied in order.
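A minimal sketch of this cascade in Python (the patterns below are simplified versions of the rules above, not Weizenbaum’s original program):

# ELIZA-style cascaded regular-expression substitutions (simplified).
import re

def eliza_respond(line):
    # Stage 1: flip the point of view.
    line = re.sub(r"\bI'm\b", "YOU ARE", line, flags=re.I)
    line = re.sub(r"\bmy\b", "YOUR", line, flags=re.I)
    # Stage 2: ranked response patterns; the first rule that matches is applied.
    rules = [
        (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* all .*", "IN WHAT WAY"),
        (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, response in rules:
        if re.match(pattern, line, re.I):
            return re.sub(pattern, response, line, flags=re.I)
    return line

print(eliza_respond("Men are all alike."))                        # IN WHAT WAY
print(eliza_respond("He says I'm depressed much of the time."))   # I AM SORRY TO HEAR YOU ARE depressed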

1.4 Finite State Automata
 Transition Diagram:
States: 0, 1, 2; Initial State: 0; Final State: 2; Input Symbols: a, b

 Transition Table:

Example − Design Finite Automata which accepts string "abb".

Solution:
States: Q= {q0,q1,q2,q3}

Input Symbols: ∑ = {a,b}

Transition Function δ: δ(q0, a) = q1, δ(q1, b) = q2, δ(q2, b) = q3

Initial State: q0

Final State (F): {q3}
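The automaton above can be simulated directly from its transition function; a minimal sketch:

# Simulating the DFA that accepts exactly the string "abb".
delta = {("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "b"): "q3"}
start, finals = "q0", {"q3"}

def accepts(s):
    state = start
    for ch in s:
        if (state, ch) not in delta:   # no transition defined: reject
            return False
        state = delta[(state, ch)]
    return state in finals

print(accepts("abb"))   # True
print(accepts("ab"))    # False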


English Morphology
Finite State Automata for Morphological Analysis
Finite State Transducers

An FST is a machine that reads one string and generates another.

An FST can be thought of in four ways:
 as a recognizer, which takes a pair of strings as input and accepts or rejects it;
 as a generator, which outputs pairs of strings of the language;
 as a translator, which reads one string and outputs another;
 as a set relater, which computes a relation between two sets of strings.

An FST can be formally defined with 7 parameters: Q (a finite set of states), Σ (a finite input alphabet), Δ (a finite output alphabet), q0 (the start state), F (the set of final states), δ(q, w) (the transition function), and σ(q, w) (the output function).


Parsing upwards, generating downwards
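As a toy illustration of reading one (surface) string while emitting another (lexical) string, the sketch below hard-codes a tiny transducer for the single word "cats"; real morphological FSTs are compiled from large lexicons and rule sets.

# Toy FST: each transition reads one input symbol and emits an output string.
# (state, input symbol) -> (next state, output string)
transitions = {
    ("s0", "c"): ("s1", "c"),
    ("s1", "a"): ("s2", "a"),
    ("s2", "t"): ("s3", "t"),
    ("s3", "s"): ("s4", "+N+PL"),   # the surface "s" maps to the plural feature
}
finals = {"s4"}

def transduce(word):
    state, output = "s0", []
    for ch in word:
        if (state, ch) not in transitions:
            return None               # no path through the machine: reject
        state, out = transitions[(state, ch)]
        output.append(out)
    return "".join(output) if state in finals else None

print(transduce("cats"))   # cat+N+PL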
Lexicon Free FST – The Porter Stemmer
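The Porter stemmer needs no lexicon; it strips suffixes with a cascade of rewrite rules. A sketch using NLTK’s implementation (assumes the nltk package is installed; the outputs shown are typical, and stems need not be real words):

# Rule-based suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "hopping"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, relational -> relat, hopping -> hop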
The process of breaking text down into smaller parts (tokens) is known as tokenization.

Word and Sentence Tokenization


Tokenization – segmenting text into words and sentences.
Generally, the separator is white space.
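A minimal sketch with NLTK (assumes the nltk package and its sentence-tokenizer resource are installed); note that splitting on white space or periods alone would mishandle the abbreviation "Mr.":

# Word and sentence tokenization with NLTK.
import nltk

text = "Mr. Smith bought a book. It cost $12.50!"
print(nltk.sent_tokenize(text))
# Typical output: ['Mr. Smith bought a book.', 'It cost $12.50!']
print(nltk.word_tokenize(text))
# Typical output: ['Mr.', 'Smith', 'bought', 'a', 'book', '.', 'It', 'cost', '$', '12.50', '!']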
Detecting and Correcting Spelling Errors

 Deletion – a letter (or letters) that should be present is missing from the string.
 Insertion – an extra letter (or letters) has been added to the string and needs to be removed.
 Substitution – a letter (or letters) in the string needs to be replaced by another letter (or letters).
 Transposition – all the letters are present but in a different order; the letters in the string must be swapped to get a meaningful word.
Semantic similarity refers to the degree to which two pieces of text share the same meaning. Whereas lexical similarity focuses on the structure and surface resemblance of words and phrases, semantic similarity delves into the understanding and meaning of the content. The aim of a similarity measure is to quantify how closely related or analogous the concepts, ideas, or information conveyed in two texts are.

Example: The word 'crash' can mean an accident, a drop in the stock market, or attending
a party without an invitation.
In NLP, semantic similarity is used in various tasks such as the following (a small embedding-based sketch follows the list):
1. Question Answering – enhances QA systems by measuring semantic similarity between user queries and document content.
2. Recommendation systems – measures semantic similarity between a user’s preferences and the available content.
3. Summarization – helps in summarizing similar content.
4. Corpus clustering – helps in grouping documents with similar content.
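One common way to measure semantic similarity is cosine similarity between sentence embeddings. The sketch below assumes the sentence-transformers package and the pretrained model named here are available; any comparable embedding model would do.

# Cosine similarity between sentence embeddings (illustrative sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("The market crashed yesterday.")
b = model.encode("Stock prices dropped sharply.")
print(util.cos_sim(a, b))   # a high score despite little word overlap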

Phonetic Similarity: Sounds which represent different pronunciations of the same phoneme must be phonetically similar. We consider two sounds to be phonetically similar when they share the same place or manner of articulation.
Example: bow and bough, Nike and Mike
Minimum Edit Distance Algorithm
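Minimum edit distance is the smallest number of insertions, deletions, and substitutions needed to turn one string into another, and it is computed by dynamic programming. A standard sketch with unit costs for all three operations (some textbooks give substitution a cost of 2 instead):

# Minimum edit distance (Levenshtein) with unit insertion/deletion/substitution costs.
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                 # delete everything from source[:i]
    for j in range(m + 1):
        dist[0][j] = j                 # insert everything in target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m]

print(min_edit_distance("intention", "execution"))   # 5 with unit costs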
