NLP Asgn2

The document describes Assignment 2 which involves performing bag-of-words, tf-idf, and Word2Vec embeddings on a dataset from Kaggle. The learning objectives are to learn and apply embedding techniques like bag-of-words, tf-idf, and Word2Vec on a real dataset. The learning outcomes are to demonstrate the techniques and select the appropriate one for a given problem. The document then provides detailed explanations and examples of bag-of-words, tf-idf, and Word2Vec embeddings.

Assignment 2

Title: Perform bag-of-words approach (count occurrence, normalized count occurrence), tf-idf on data.
Create embeddings using Word2Vec. Dataset to be used:
https://www.kaggle.com/datasets/CooperUnion/cardataset

Learning Objectives:
 To learn embedding techniques like bag-of-words, TF-IDF, and Word2Vec.
 To use embedding techniques like bag-of-words, TF-IDF, and Word2Vec on a real dataset and analyze the results.

Learning Outcome:

 To demonstrate embedding techniques like bag-of-words, TF-IDF, and Word2Vec.


 To select the appropriate technique by understanding a real-world problem statement.

Theory:

Bag of Words (BoW)


Bag of words is a Natural Language Processing technique for text modelling.

A problem with modeling text is that it is messy, and techniques like machine learning algorithms
prefer well-defined, fixed-length inputs and outputs. Machine learning algorithms cannot work with raw
text directly; the text must be converted into numbers, specifically vectors of numbers. This is called
feature extraction or feature encoding.

The bag-of-words model is a way of representing text data when modeling text with machine
learning algorithms. It is a popular and simple method of feature extraction from text data.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It involves two things:

1. A vocabulary of known words.


2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the
document is discarded. The model is only concerned with whether known words occur in the
document, not where in the document.
The most common kind of characteristic, or feature calculated from the Bag-of-words model is term
frequency, which is essentially the number of times a term appears in the text. Term frequency is not
necessarily the best representation for the text, but it still does find successful applications in areas
like email filtering. Term frequency isn’t the best representation of the text because common words
such as "the", "a", "to" are almost always the terms with highest frequency in the text. This shows
that having a high raw count does not necessarily indicate that the corresponding word is more
important.
Advantages of BoW Approach
The most significant advantage of the bag-of-words model is its simplicity and ease of use. It can be
used to create an initial draft model before proceeding to more sophisticated word embeddings.

Disadvantages of BoW Approach

 Vocabulary: The vocabulary requires careful design, most specifically in order to manage
the size, which impacts the sparsity of the document representations.
 Sparsity: Sparse representations are harder to model both for computational reasons (space
and time complexity) and also for information reasons, where the challenge is for the models
to harness so little information in such a large representational space.
 Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the
document (semantics). Context and meaning can offer a lot to the model; if modeled, they could
tell the difference between the same words arranged differently (“this is interesting” vs
“is this interesting”), synonyms (“old bike” vs “used bike”), and much more.

Bag-of-words example

Let's assume we have the following three sentences in our corpus.

Sentence 1: Data science is fun and interesting

Sentence 2: Data science is fun

Sentence 3: science is interesting

The unique words in the sentences are: [data, science, is, fun, and, interesting]. Hence, the bag-of-words
vectors for the above sentences will be

Sentence 1: [1, 1, 1, 1, 1, 1]

Sentence 2: [1, 1, 1, 1, 0, 0]

Sentence 3: [0, 1, 1, 0, 0, 1]
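Note that bag-of-words vectors hold raw counts, not just presence: a hypothetical fourth sentence such as "data science is fun fun" would be encoded as [1, 1, 1, 2, 0, 0] under the same vocabulary.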

TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in
natural language processing and information retrieval. It measures how important a term is within a
document relative to a collection of documents (i.e., relative to a corpus). Words within a text
document are transformed into importance numbers by a text vectorization process. There are many
different text vectorization scoring schemes, with TF-IDF being one of the most common.

Term Frequency: TF of a term or word is the number of times the term appears in a document
compared to the total number of words in the document.

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus
that contain the term. Words unique to a small percentage of documents (e.g., technical jargon
terms) receive higher importance values than words common across all documents (e.g., a, the,
and).
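
In one common formulation (exact conventions vary between textbooks and libraries), for a term t, a document d, and a corpus of N documents:

TF(t, d) = (number of times t appears in d) / (total number of words in d)

IDF(t) = log(N / number of documents containing t)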
The TF-IDF of a term is calculated by multiplying TF and IDF scores.

TF-IDF = TF * IDF

Importance of a term is high when it occurs a lot in a given document and rarely in others. In short,
commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the
corpus.
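
As a small worked example with the three sentences from the bag-of-words section, using base-10 logarithms (the same convention as the code later in this document): "interesting" occurs once in Sentence 3, which has 3 words, so TF = 1/3 ≈ 0.333; it occurs in 2 of the 3 sentences, so IDF = log10(3/2) ≈ 0.176, giving TF-IDF ≈ 0.059. By contrast, "is" occurs in every sentence, so its IDF = log10(3/3) = 0 and its TF-IDF is 0 in all three documents.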

TF-IDF is useful in many natural language processing applications. For example, Search Engines
use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text
classification, text summarization, and topic modeling.

word2vec Approach
BoW and TF-IDF consist of a set of words (vocabulary) and a metric like frequency or term
frequency-inverse document frequency (TF-IDF) to describe each word’s value in the corpus. That
means BoW and TF-IDF can result in sparse matrices and high dimensional vectors that consume a
lot of computer resources if the vocabulary is very large.

Developed by a team of researchers at Google, word2vec attempts to solve the issues with the BoW
approach:

 High-dimension vectors
 Words assumed completely independent of each other

Using a neural network with only a couple layers, word2vec tries to learn relationships between
words and embeds them in a lower-dimensional vector space. To do this, word2vec trains words
against other words that neighbor them in the input corpus, capturing some of the meaning in the
sequence of words. The researchers devised two novel approaches:

 Continuous bag of words (CBoW)


 Skip-gram

The CBoW architecture predicts the current word based on the context while the skip-gram predicts
surrounding words given the current word.

CBoW: Given a set of neighbouring (context) words, predict the single word that is likely to occur
alongside them.

Skip-gram: Given the single word being analyzed, predict its likely neighbouring words.

Both approaches result in a vector space that maps word-vectors close together based on
contextual meaning. That means, if two word-vectors are close together, those words should have
similar meaning based on their context in the corpus.
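A well-known illustration of this property from the original word2vec experiments is word analogies: the vector for "king" minus the vector for "man" plus the vector for "woman" lands close to the vector for "queen", showing that the learned space captures semantic relationships. A gensim-based implementation sketch is given near the end of this document.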
Bag of Words Algorithm Implementation
'''vectorize() takes the list of words in a sentence as input
and returns a vector the size of filtered_vocab. It puts 0 if the
word is not present in tokens and the count of the token if present.'''
def vectorize(tokens):
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

'''unique() returns a list in which the original order is preserved
and no item repeats. Using set() directly does not preserve the
original ordering, so it is not used here.'''
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# create a list of stopwords. You can import stopwords from nltk too
stopwords = ["to", "was", "a"]
# list of special characters. You can use regular expressions too
special_char = [",", ":", " ", ";", ".", "?"]

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# convert the strings to lower case
string1 = string1.lower()
string2 = string2.lower()
string3 = string3.lower()

# split the sentences into tokens
tokens1 = string1.split()
tokens2 = string2.split()
tokens3 = string3.split()

print(tokens1)
print(tokens2)
print(tokens3)

# create a vocabulary list
vocab = unique(tokens1 + tokens2 + tokens3)
print(vocab)

# filter the vocabulary list
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)
print("Final filtered vocabulary: ", filtered_vocab)

# convert sentences into vectors
vector1 = vectorize(tokens1)
print("Sentence 1 vector :", vector1)
vector2 = vectorize(tokens2)
print("Sentence 2 vector :", vector2)
vector3 = vectorize(tokens3)
print("Sentence 3 vector :", vector3)
Creating Bag of Words using sklearn library

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# a single document containing all three sentences (used in the next sections)
doc = " ".join([string1, string2, string3])

CountVec = CountVectorizer(ngram_range=(1, 1))
# fit the vectorizer and transform the sentences into count vectors
Count_data = CountVec.fit_transform([string1, string2, string3])

# create a dataframe with one row per sentence and one column per word
cv_dataframe = pd.DataFrame(Count_data.toarray(),
                            columns=CountVec.get_feature_names_out())
print(cv_dataframe)
Note that CountVectorizer sorts the vocabulary alphabetically before generating vectors.
Count Occurrence

count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
# build a (word, count) table from the single-document count vector
count_occur_df = pd.DataFrame(
    (word, count) for word, count in zip(
        count_vec.get_feature_names_out(),
        count_occurs.toarray().tolist()[0]))
count_occur_df.columns = ['Word', 'Count']
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()
Normalized Count Occurrence

from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=False with norm='l2' gives L2-normalized count vectors
norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    (word, count) for word, count in zip(
        norm_count_vec.get_feature_names_out(),
        norm_count_occurs.toarray().tolist()[0]))
norm_count_occur_df.columns = ['Word', 'Count']
norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
norm_count_occur_df.head()
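
With use_idf=False and norm='l2', TfidfVectorizer effectively divides each document's raw count vector by its Euclidean (L2) length, so every value in the 'Count' column lies between 0 and 1 and the squares of a document's values sum to 1; the ranking of words stays the same as with raw counts.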

import pandas as pd
import numpy as np

corpus = ['Natural language processing is fun and interesting',
          'Natural language processing is fun',
          'Hindi language is interesting']
#creating a word set for the corpus
words_set = set()

for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

Computing Term Frequency

# create a dataframe with one row per document in the corpus and one column
# per word in the word set, and use it to compute the term frequency (TF)
n_docs = len(corpus)          # number of documents in the corpus
n_words_set = len(words_set)  # number of unique words in the corpus

df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # words in the document
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))

df_tf

Computing Inverse Document Frequency

print("IDF of: ")

idf = {}

for w in words_set:
k = 0 # number of documents in the corpus that contain this word

for i in range(n_docs):
if w in corpus[i].split():
k += 1

idf[w] = np.log10(n_docs / k)

print(f'{w:>15}: {idf[w]:>10}' )

df_tf_idf = df_tf.copy()

for w in words_set:
for i in range(n_docs):
df_tf_idf[w][i] = df_tf[w][i] * idf[w]

df_tf_idf
TF-IDF using sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer

tr_idf_model = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

words_set = tr_idf_model.get_feature_names_out()

print(words_set)

df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)

df_tf_idf
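
Note that sklearn's TF-IDF values will not match the manual computation above exactly: by default TfidfVectorizer uses a smoothed, natural-log IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each row, whereas the manual version used log10(N / df(t)) with no normalization.

Creating Word2Vec embeddings using gensim

The assignment also calls for Word2Vec embeddings. The following is a minimal sketch using the gensim library (assumed to be installed), not prescribed code: it trains on the small example corpus above, so the resulting vectors are only illustrative. For the Kaggle car dataset you would instead tokenize a text column loaded with pandas. The hyperparameters vector_size, window, min_count, and sg below are illustrative choices.

from gensim.models import Word2Vec

# tokenized sentences; replace with tokenized text from the Kaggle car dataset
sentences = [
    ['natural', 'language', 'processing', 'is', 'fun', 'and', 'interesting'],
    ['natural', 'language', 'processing', 'is', 'fun'],
    ['hindi', 'language', 'is', 'interesting'],
]

# sg=0 trains CBoW, sg=1 trains skip-gram
w2v_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# dense embedding vector for a single word
print(w2v_model.wv['language'])

# words whose vectors are most similar to 'language' in this toy corpus
print(w2v_model.wv.most_similar('language', topn=3))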

Conclusion:
 Bag-of-words, TF-IDF, and Word2Vec are word embedding techniques.
 Words are transformed into numerical form, which is necessary before machine learning
algorithms can be applied.
