Unit - 1 Introduction
INTRODUCTION
1. Contextual words and phrases and homonyms
The same words and phrases can have different meanings according to the context of a sentence, and many words – especially in English – have the exact same pronunciation but totally different meanings.
For example:
I ran to the store because we ran out of milk.
Can I run something past you real quick?
The house is looking really run down.
These are easy for humans to understand because we read the context of the sentence and we understand
all of the different definitions. And, while NLP language models may have learned all of the definitions,
differentiating between them in context can present problems.
Homonyms – two or more words that are pronounced the same but have different definitions – can be
problematic for question answering and speech-to-text applications because spoken input provides no
written form to distinguish them. Usage of their and there, for example, is a common problem even for humans.
2. Synonyms
Synonyms can lead to issues similar to contextual understanding because we use many different words
to express the same idea. Furthermore, some of these words may convey exactly the same meaning,
while others may differ in degree (small, little, tiny, minute), and different people use synonyms
to denote slightly different meanings within their personal vocabulary.
So, for building NLP systems, it’s important to include all of a word’s possible meanings and all
possible synonyms. Text analysis models may still occasionally make mistakes, but the more relevant
training data they receive, the better they will be able to understand synonyms.
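One common way to enumerate a word's senses and synonyms is WordNet, available through NLTK. Below is a minimal sketch, assuming the NLTK data can be downloaded; the word "small" is an arbitrary choice.

    import nltk
    nltk.download("wordnet", quiet=True)  # one-time corpus download

    from nltk.corpus import wordnet as wn

    synonyms = set()
    for synset in wn.synsets("small"):       # each synset is one sense of "small"
        for lemma in synset.lemma_names():   # lemmas within a synset are synonyms
            synonyms.add(lemma.replace("_", " "))

    print(sorted(synonyms))  # includes e.g. 'little', 'minor', 'modest'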
3. Irony and sarcasm (Verbal irony is a figure of speech that communicates the opposite of what is
said, while sarcasm is a form of irony that is directed at a person, with the intent to criticize.)
Irony and sarcasm present problems for machine learning models because they generally use words
and phrases that, strictly by definition, may be positive or negative, but actually connote the opposite.
Models can be trained on certain cues that frequently accompany ironic or sarcastic phrases, like
"yeah right" or "whatever", and on word embeddings (where words that have similar meanings have a
similar representation), but it's still a tricky process.
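As a sketch of the word-embedding idea, gensim's Word2Vec can be trained so that words appearing in similar contexts get similar vectors. The toy corpus and parameters below are purely illustrative; a usable model needs far more data.

    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens (illustrative only).
    corpus = [
        ["yeah", "right", "that", "went", "great"],
        ["whatever", "that", "was", "awesome"],
        ["the", "movie", "was", "great"],
        ["the", "movie", "was", "awesome"],
    ]

    model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=200)

    # Words used in similar contexts end up with similar vectors.
    print(model.wv.most_similar("great", topn=3))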
4. Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible
interpretations.
A classic example is "I saw the man with the telescope": did the speaker use the telescope, or was the man holding it? Even for humans, such a sentence is difficult to interpret without the context of surrounding text.
POS (part of speech) tagging is one NLP solution that can help solve the problem, somewhat.
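As a quick sketch of POS tagging with NLTK (assuming its tokenizer and tagger data are installed), the tagger assigns each word one part of speech, although it cannot by itself settle the attachment ambiguity above.

    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("I saw the man with the telescope")
    print(nltk.pos_tag(tokens))
    # [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('man', 'NN'),
    #  ('with', 'IN'), ('the', 'DT'), ('telescope', 'NN')]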
5. Errors in text and speech
Misspelled or misused words can create problems for text analysis. Autocorrect and grammar
correction applications can handle common mistakes, but don’t always understand the writer’s
intention.
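As one simplistic illustration, Python's standard difflib can suggest corrections for common misspellings by matching against a known vocabulary; the vocabulary below is a stand-in for a real dictionary.

    import difflib

    vocabulary = ["received", "separate", "definitely", "occurrence"]

    for typo in ["recieved", "seperate", "definately"]:
        # get_close_matches ranks candidates by sequence similarity
        suggestion = difflib.get_close_matches(typo, vocabulary, n=1, cutoff=0.8)
        print(typo, "->", suggestion)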
With spoken language, mispronunciations, different accents, stutters, etc., can be difficult for a machine
to understand. However, as language databases grow and smart assistants are trained by their individual
users, these issues can be minimized.
6. Colloquialisms and slang
Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP
– especially for models intended for broad use. Unlike formal language, colloquialisms may have
no "dictionary definition" at all, and these expressions may even have different meanings in different
geographic areas. Furthermore, cultural slang is constantly morphing and expanding, so new words pop
up every day.
This is where training and regularly updating custom models can be helpful, although it oftentimes
requires quite a lot of data.
7. Domain-specific language
Different businesses and industries often use very different language. An NLP model built
for healthcare, for example, would be very different from one used to process legal documents. These
days there are a number of analysis tools trained for specific fields, but extremely niche
industries (a specialized segment of the market for a particular kind of product or service) may need to build
or train their own models.
8. Low-resource languages
Machine learning NLP applications have largely been built for the most common, widely used
languages, and it's downright amazing how accurate translation systems for them have become. However,
many languages, especially those spoken by people with less access to technology, often go overlooked
and underprocessed. For example, by some estimations (depending on what counts as a language vs. a dialect),
there are over 3,000 languages in Africa alone. There simply isn't very much data on many of these languages.
However, new techniques, like multilingual transformers (such as Google's BERT, "Bidirectional
Encoder Representations from Transformers") and multilingual sentence embeddings, aim to identify
and leverage universal similarities that exist between languages.
Machine learning requires a lot of data to perform at its best – billions of pieces of training data.
The more data NLP models are trained on, the smarter they become. That said, data (and human
language!) is only growing by the day, as are new machine learning techniques and custom algorithms.
All of the problems above will require more research and new techniques in order to improve on them.
Advanced practices like artificial neural networks and deep learning allow a multitude of NLP
techniques, algorithms, and models to work progressively, much like the human mind does. As they
grow and strengthen, we may have solutions to some of these challenges in the near future.
There are different types of n-gram models, such as:
unigrams, which evaluate each word independently;
bigrams, which consider the probability of a word given the previous word;
trigrams, which consider the probability of a word given the two previous words; and so on.
N-grams are relatively simple and efficient, but they do not consider the long-term context of the
words in a sequence.
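To make the bigram case concrete, here is a minimal sketch that estimates P(next word | previous word) from raw counts over a toy corpus; real models add smoothing for unseen bigrams, which is omitted here.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate".split()

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def bigram_prob(prev, nxt):
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][nxt] / total if total else 0.0

    print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once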
The parenthesis and number operators can also be used to specify that a certain string or expression must occur
twice in the text. For example, suppose we are looking for the pattern “the Xer they were, the Xer they will be”,
where we want to constrain the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses.
So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they
will be.
The number operator can be used with other numbers:
if you match two different sets of parentheses, \2 means whatever matched the second set. For example
/the (.*)er they (.*), the \1er they \2/
will match The bigger they were, the bigger they were but not The bigger they were, the bigger they will be.
These numbered memories are called registers (e.g., register 1, register 2, register 3, etc.). This
memory feature is not part of every regular expression language and is often considered an “extended” feature
of regular expressions.
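The same capture-and-backreference behavior can be tried in Python's re module, which uses the identical \1 and \2 syntax; a minimal sketch:

    import re

    pattern = r"the (.*)er they (.*), the \1er they \2"

    print(bool(re.search(pattern, "the bigger they were, the bigger they were")))    # True
    print(bool(re.search(pattern, "the bigger they were, the bigger they will be"))) # False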
Since multiple substitutions can apply to a given input, substitutions are assigned a rank
and applied in order.
1.4 Finite State Automata
[The transition diagram and transition table for this section's example did not survive extraction; only part of the worked solution remains.]
Solution:
States: Q = {q0, q1, q2, q3}
Initial state: q0
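Since the original transition table is lost, the following is a minimal sketch of a hypothetical DFA over the same four states; this machine (an assumption, not the original example) accepts strings over {a, b} that end in "abb".

    # Transition table as a dict: (state, input symbol) -> next state.
    TRANSITIONS = {
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q1", ("q1", "b"): "q2",
        ("q2", "a"): "q1", ("q2", "b"): "q3",
        ("q3", "a"): "q1", ("q3", "b"): "q0",
    }
    ACCEPTING = {"q3"}  # hypothetical accepting state

    def accepts(string):
        state = "q0"  # initial state, as in the solution above
        for symbol in string:
            state = TRANSITIONS[(state, symbol)]
        return state in ACCEPTING

    print(accepts("aabb"))  # True: ends in "abb"
    print(accepts("abba"))  # False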
Deletion – This means that a letter or letters should be missing from the string.
Insertion – This means that a certain letter or letters need to be added.
Substitution – This means that a certain letter or letters in the string need to be replaced by
another letter or letters.
Transposition – This means that all the letters are present but in a different order. The letters
in the string could be swapped with each other to get a meaningful word.
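A standard way to combine all four operations is Damerau-Levenshtein distance. Below is a minimal sketch of its "optimal string alignment" variant; the test words are illustrative.

    def edit_distance(a, b):
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i              # i deletions
        for j in range(n + 1):
            d[0][j] = j              # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(
                    d[i - 1][j] + 1,          # deletion
                    d[i][j - 1] + 1,          # insertion
                    d[i - 1][j - 1] + cost,   # substitution (or match)
                )
                # transposition of two adjacent letters
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[m][n]

    print(edit_distance("caht", "chat"))   # 1: a single transposition
    print(edit_distance("spel", "spell"))  # 1: a single insertion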
Semantic similarity refers to the degree to which two pieces of text share the same meaning. Whereas
lexical similarity focuses on the structure and surface resemblance of words and phrases, semantic
similarity delves into the understanding and meaning of the content. The aim is to measure how closely
related or analogous the concepts, ideas, or information conveyed in two texts are.
Example: The word 'crash' can mean an accident, a drop in the stock market, or attending
a party without an invitation.
In NLP, semantic similarity is used in various tasks such as:
1. Question answering – Enhances QA systems by measuring semantic similarity between user queries and document content.
2. Recommendation systems – Matches users to available content via semantic similarity.
3. Summarization – Helps in identifying and condensing similar content.
4. Corpus clustering – Helps in grouping documents with similar content.
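As a sketch of how semantic similarity is computed in practice, the snippet below uses the sentence-transformers library with one commonly used small model ("all-MiniLM-L6-v2"); the query and documents are illustrative. Note that the top match shares almost no surface words with the query.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "How do I reset my password?"
    documents = [
        "Steps to recover a forgotten login credential.",
        "The stock market crashed yesterday.",
    ]

    # Embed the query and documents, then score documents by cosine similarity.
    scores = util.cos_sim(model.encode(query), model.encode(documents))[0]
    for doc, score in zip(documents, scores):
        print(f"{float(score):.2f}  {doc}")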