
19CSE453 – Natural Language Processing

Part of Speech tagging

By
Ms. Kavitha C.R.
Dept. of Computer Science and Engineering

Amrita School of Engineering, Bengaluru


Amrita Vishwa Vidyapeetham
Part-of-Speech (POS) tagging

Parts of Speech: How many?

POS Examples

POS Tagging: Choosing a tag set

UPenn Treebank POS Tag set

Using the UPenn tag set

DT – determiner, JJ – adjective, VBD – verb (past tense), IN – preposition, NN – noun (singular or mass), NNS – noun (plural)
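The slide's tagged example sentence is not preserved in this extract; as an illustration only (an assumed example, not necessarily the one on the slide), a sentence annotated with exactly these tags could read:

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS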

Why is POS tagging hard?

Example from the slide's figure: the same word can be tagged as an adjective, a noun in singular, an adverb, or a verb in base form, depending on context.

Ambiguous word types in the Brown Corpus

How bad is the ambiguity problem?

Deciding the correct POS

NNP – proper noun, singular
PRP – personal pronoun
RB – adverb
TO – to
VBD – verb, past tense
VBG – verb, gerund or present participle
VBN – verb, past participle
VBZ – verb, 3rd person singular present
CD – cardinal number
NN – noun, singular or mass
RP – particle
VB – verb, base form
IN – preposition

Relevant Knowledge for POS tagging

POS Tagging: Two Approaches

Transformation Based Taggers

TBL Tagger - Transformation Based Taggers

MD – modal
VBD – verb, past tense
NN – noun, singular or mass
VBN – verb, past participle

Probabilistic Tagging: Two different families of models

Generative vs Conditional Models

Generative vs Discriminative Models

Probabilistic Tagging

By Bayesian Inference

Further Simplification

Computing The Probability Values

NN – noun, singular or mass
VBZ – verb, 3rd person singular present

Disambiguating “race”

Disambiguating “race”

What is this model?

Hidden Markov Model (HMM)

Markov Chain = First–Order Markov Model

Markov Chain Transition Table

States: Sunny, Rainy, Foggy

Using Markov Chain

Hidden Markov Model (HMM)

• The next level of complexity that can be introduced into a stochastic tagger combines the previous two approaches, using both tag sequence probabilities and word frequency measurements. This is known as the Hidden Markov Model (HMM).
• Before proceeding to what a Hidden Markov Model is, let us first look at what a Markov Model is. That will help clarify the meaning of the term "Hidden" in HMMs.

Example

Say that there are only three kinds of weather conditions, namely
• Rainy
• Sunny
• Cloudy
• How do we predict today's weather based on what the weather has been for the past N days?
• Say we have a sequence, something like this:
• Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy
• So, the weather on any given day can be in any one of the three states.

Example

• Let's say we decide to use a Markov Chain Model to solve this problem. Now, using the data that we have, we can construct the following state diagram with the labelled probabilities.
• In order to compute the probability of today's weather given N previous observations, we will use the Markovian property.
A Markov Chain is essentially the simplest Markov model, in that it obeys the Markov property.

The Markov property says that the distribution of a random variable in the future depends solely on its distribution in the current state; none of the previous states has any impact on the future states.
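As a small illustration of the Markov property, here is a minimal Python sketch that scores a weather sequence by multiplying first-order transition probabilities. The transition values are made-up placeholders, not the numbers from the slide's state diagram.

# Markov chain sketch: score a weather sequence using only first-order
# transitions P(next | current). The numbers below are illustrative
# placeholders, not the values from the slide's figure.
transition = {
    "Sunny":  {"Sunny": 0.7, "Rainy": 0.2, "Cloudy": 0.1},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.5, "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

def sequence_probability(states, start_prob=1.0):
    # P(s1, ..., sn) = P(s1) * product of P(s_t | s_t-1), by the Markov property
    prob = start_prob
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob

print(sequence_probability(["Sunny", "Rainy", "Cloudy", "Cloudy", "Sunny"]))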

Hidden Markov Model

There are two kinds of probabilities in the state diagram.
• One is the emission probabilities, which represent the probabilities of making certain observations given a particular state. For example, we have P(noise | awake) = 0.5. This is an emission probability.
• The other is the transition probabilities, which represent the probability of transitioning to another state given a particular state. For example, we have P(asleep | awake) = 0.4. This is a transition probability.

• We usually observe longer stretches of the child being awake and being asleep. If Peter is awake now, the probability of him staying awake is higher than that of him going to sleep. Hence the 0.6 and 0.4 in the diagram: P(awake | awake) = 0.6 and P(asleep | awake) = 0.4.
• The transition probabilities matrix
• The emission probabilities matrix (a small code sketch of both matrices follows below)

• Before actually trying to solve the problem at hand using HMMs, let’s relate this
model to the task of Part of Speech Tagging
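A minimal sketch of the two matrices for this example in Python. Only P(awake | awake) = 0.6, P(asleep | awake) = 0.4 and P(noise | awake) = 0.5 are stated above; the remaining entries are assumed placeholder values, chosen only so that each row sums to 1.

# Transition probabilities P(next state | current state)
transition_probs = {
    "awake":  {"awake": 0.6, "asleep": 0.4},   # values stated on the slide
    "asleep": {"awake": 0.2, "asleep": 0.8},   # assumed placeholders
}

# Emission probabilities P(observation | state)
emission_probs = {
    "awake":  {"noise": 0.5, "quiet": 0.5},    # P(noise | awake) = 0.5 from the slide
    "asleep": {"noise": 0.1, "quiet": 0.9},    # assumed placeholders
}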

HMMs for Part of Speech Tagging

• We know that to model any problem using a Hidden Markov Model, we need a set of observations and a set of possible states. The states in an HMM are hidden.
• In the part-of-speech tagging problem, the observations are the words themselves in the given sequence.
• As for the states, which are hidden, these would be the POS tags for the words.
• The transition probabilities would be somewhat like P(VP | NP), that is, the probability of the current word having the tag Verb Phrase given that the previous tag was Noun Phrase.
• Emission probabilities would be P(john | NP) or P(will | VP), that is, the probability that the word is, say, John given that the tag is Noun Phrase.
• Note that this is just an informal modeling of the problem, to provide a very basic understanding of how the part-of-speech tagging problem can be modeled using an HMM.

Why do we need an HMM for a POS Tagger?

• If you look closely, the words in a sentence are the observable states (given to us in the data), while their POS tags are the hidden states, and hence we use an HMM for estimating POS tags.
Note: we refer to the observable states as 'Observations' and to the hidden states simply as 'States'.
• A Hidden Markov Model has the following components:

HMM components example

• Q: the set of possible tags.
• A: the A matrix contains the tag transition probabilities P(ti|ti−1), which represent the probability of a tag occurring given the previous tag.
  Example: calculating A[Verb][Noun]:
  P(Noun|Verb) = Count(Verb, Noun) / Count(Verb)
• O: the sequence of observations (the words in the sentence).
• B: the B emission probabilities, P(wi|ti), represent the probability, given a tag (say Verb), that it will be associated with a given word (say Playing).
  The emission probability B[Verb][Playing] is calculated using:
  P(Playing | Verb) = Count(Verb, Playing) / Count(Verb)
• It must be noted that we get all these Count() values from the training corpus itself (a small counting sketch is given below).
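A minimal Python sketch of how these counts could be collected from a tagged corpus and turned into the A and B probabilities by maximum-likelihood estimation. The tiny corpus below is an assumed toy example, not data from the slides.

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (assumed example data).
corpus = [
    [("john", "Noun"), ("is", "Verb"), ("playing", "Verb")],
    [("mary", "Noun"), ("likes", "Verb"), ("playing", "Verb")],
]

tag_count = defaultdict(int)         # Count(tag)
transition_count = defaultdict(int)  # Count(previous tag, tag)
emission_count = defaultdict(int)    # Count(tag, word)

for sentence in corpus:
    prev_tag = "<s>"                 # sentence-start pseudo-tag
    for word, tag in sentence:
        tag_count[tag] += 1
        transition_count[(prev_tag, tag)] += 1
        emission_count[(tag, word)] += 1
        prev_tag = tag

def transition_prob(prev_tag, tag):
    # A[prev_tag][tag] = Count(prev_tag, tag) / Count(prev_tag)
    return transition_count[(prev_tag, tag)] / tag_count[prev_tag] if tag_count[prev_tag] else 0.0

def emission_prob(tag, word):
    # B[tag][word] = Count(tag, word) / Count(tag)
    return emission_count[(tag, word)] / tag_count[tag] if tag_count[tag] else 0.0

print(transition_prob("Verb", "Verb"))   # P(Verb | Verb) on the toy data: 0.5
print(emission_prob("Verb", "playing"))  # P(playing | Verb) on the toy data: 0.5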

A sample HMM with both the 'A' and 'B' matrices will look like this:

MD – modal (e.g., can, should)
VB – verb, base form
NN – noun, singular or mass

Here, the black continuous arrows represent values of the transition matrix 'A', while the dotted black arrows represent the emission matrix 'B', for a system with Q: {MD, VB, NN}.
Decoding using HMM
• Given as input an HMM (transition matrix, emission matrix) and a sequence of observations O = o1, o2, …, oT (the words in the sentences of a corpus), find the most probable sequence of states Q = q1 q2 q3 … qT (the POS tags, in our case).

The two major assumptions followed while decoding a tag sequence using an HMM (combined into a single scoring formula below):
• The probability of a word appearing depends only on its own tag and is independent of neighboring words and tags.
• The probability of a tag depends only on the previous tag (bigram HMM) rather than on the entire previous tag sequence, i.e. it shows the Markov property. (We can be flexible with this and condition on a longer tag history.)
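Put together, these two assumptions reduce the decoding objective to the standard bigram-HMM form, written here in the same notation as above:

best tag sequence t1 … tn = argmax over t1 … tn of P(t1 … tn | w1 … wn)
                          ≈ argmax over t1 … tn of ∏ (i = 1 … n) P(wi|ti) · P(ti|ti−1)

where P(wi|ti) are the emission probabilities (the B matrix) and P(ti|ti−1) are the tag transition probabilities (the A matrix).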

• So the problem is to find out which sequence of tags should be used for this sentence.
• What is the approach to this problem, given the state transition graph that already exists?
• To simplify, first find out all the possible part-of-speech tags each word can take.
• As already discussed, some words can take multiple part-of-speech tags, but no word goes beyond about 7 tags, and most ambiguous words have only 2 or 3 part-of-speech tags.
• First of all, we need to set up a probability matrix called a lattice, where the columns are our observables (the words of the sentence, in the same order as in the sentence) and the rows are the hidden states (all the possible POS tags, which are known).

• There are many possibilities, and each word can take any of its candidate tags.
• So there are 32 possible part-of-speech tag sequences here (for example, five words with two candidate tags each already gives 2^5 = 32 sequences).
• Among these, we have to find out which is the most likely sequence of part-of-speech tags.
• One option is to enumerate all the possibilities and compute the probability of each one separately: for 32 possibilities, compute 32 different probabilities.
• That is one possibility, but it is not very efficient; think of sentences with, say, 15–20 words. The number of sequences grows exponentially with the number of words.
So this is not a good solution.

• So we need a solution whose cost does not grow exponentially with the length of the sentence. What would be a good algorithm for doing that?
• Ideally, we want to come up with the actual part-of-speech tag sequence, which here is VBD TO VB DT NN, out of all the possibilities.

• This is exactly what the Viterbi algorithm does.

Viterbi Decoding for HMM
• The decoding algorithm used for HMMs is called the Viterbi algorithm, devised by Andrew Viterbi, a co-founder of Qualcomm, an American MNC.
• The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMMs).
• Next, let us see how to compute these probabilities using the HMM.
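Concretely, for the bigram HMM above, the dynamic program fills a lattice of values v_t(j), the probability of the best tag sequence that ends in tag tj at word position t (the standard Viterbi recurrence, stated in the slides' notation):

v_1(j) = P(start tag = tj) · P(w1|tj)
v_t(j) = max over i of [ v_(t−1)(i) · P(tj|ti) · P(wt|tj) ], for t = 2 … T

The best final tag is argmax over j of v_T(j), and the full tag sequence is recovered by following back-pointers from it. A code sketch of this recurrence, applied to the example that follows, is given after the example's probability tables.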

Finding the best path: Viterbi Algorithm

Example
Suppose you want to use an HMM tagger to tag the phrase "the light book", where we have the following probabilities:
P(the|Det) = 0.3, P(the|Noun) = 0.1, P(light|Noun) = 0.003, P(light|Adj) = 0.2, P(light|Verb) = 0.06, P(book|Noun) = 0.03, P(book|Verb) = 0.001
P(Verb|Det) = 0.00001, P(Noun|Det) = 0.5, P(Adj|Det) = 0.3, P(Noun|Noun) = 0.2, P(Adj|Noun) = 0.002, P(Noun|Adj) = 0.2, P(Noun|Verb) = 0.3, P(Verb|Noun) = 0.3, P(Verb|Adj) = 0.001, P(Verb|Verb) = 0.1
Work out in detail the steps of the Viterbi algorithm. You can use a table to show the steps. Assume all other conditional probabilities, not mentioned above, to be zero. Also, assume that all tags have the same probability of appearing at the beginning of a sentence.
Example

Tag transition probability matrix P(ti|ti−1) (row = previous tag, column = current tag):

          Det    Noun   Adj    Verb
  Det     0      0.5    0.3    0.00001
  Noun    0      0.2    0.002  0.3
  Adj     0      0.2    0      0.001
  Verb    0      0.3    0      0.1

Word emission probability matrix P(wi|ti):

          the    light  book
  Det     0.3    0      0
  Noun    0.1    0.003  0.03
  Adj     0      0.2    0
  Verb    0      0.06   0.001
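A short Python sketch (one possible implementation, not the exact table layout worked out on the slides) that runs the Viterbi recurrence above on "the light book" with these numbers. With a uniform start probability of 0.25 for every tag, it recovers the path Det, Adj, Noun.

# Viterbi decoding for the "the light book" example. Probabilities not listed
# in the tables above are taken as zero; every tag gets the same start probability.
tags = ["Det", "Noun", "Adj", "Verb"]
words = ["the", "light", "book"]

trans = {   # trans[prev][curr] = P(curr | prev)
    "Det":  {"Noun": 0.5, "Adj": 0.3, "Verb": 0.00001},
    "Noun": {"Noun": 0.2, "Adj": 0.002, "Verb": 0.3},
    "Adj":  {"Noun": 0.2, "Verb": 0.001},
    "Verb": {"Noun": 0.3, "Verb": 0.1},
}
emit = {    # emit[tag][word] = P(word | tag)
    "Det":  {"the": 0.3},
    "Noun": {"the": 0.1, "light": 0.003, "book": 0.03},
    "Adj":  {"light": 0.2},
    "Verb": {"light": 0.06, "book": 0.001},
}

def viterbi(words, tags, trans, emit, start_prob=0.25):
    # v[t][tag] = probability of the best tag sequence ending in `tag` at word position t
    v = [{t: start_prob * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tags:
            # choose the best previous tag for reaching tag t at word i
            best_prev, best_score = None, 0.0
            for p in tags:
                score = v[i - 1][p] * trans[p].get(t, 0.0)
                if score > best_score:
                    best_prev, best_score = p, score
            v[i][t] = best_score * emit[t].get(words[i], 0.0)
            back[i][t] = best_prev
    # follow back-pointers from the best final tag
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), v[-1][last]

path, prob = viterbi(words, tags, trans, emit)
print(path, prob)   # ['Det', 'Adj', 'Noun'] with probability 2.7e-05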

• The most suitable POS tagging for the given phrase is Determiner, Adjective and Noun, respectively.

P(tj|ti) = C(ti, tj) / C(ti)   (maximum likelihood estimate, MLE)
Exercise

• Suppose you want to use an HMM tagger to tag the phrase "the beautiful lady", where we have the following probabilities:
• P(the|Det) = 0.3, P(the|Noun) = 0.1, P(beautiful|Noun) = 0.01, P(beautiful|Adj) = 0.07, P(beautiful|Verb) = 0.001, P(lady|Noun) = 0.08, P(lady|Verb) = 0.01
• P(Verb|Det) = 0.00001, P(Noun|Det) = 0.5, P(Adj|Det) = 0.4
• P(Noun|Noun) = 0.2, P(Adj|Noun) = 0.002, P(Noun|Adj) = 0.2
• P(Noun|Verb) = 0.3, P(Verb|Noun) = 0.3, P(Verb|Adj) = 0.001
• P(Verb|Verb) = 0.1
• Work out in detail the steps of the Viterbi algorithm. Assume all other conditional probabilities, not mentioned above, to be zero. Also, assume that all tags have the same probability of appearing at the beginning of a sentence.

Thank You
