NLP Asgn2

The document describes Assignment 2 which involves performing bag-of-words, tf-idf, and Word2Vec embeddings on a dataset from Kaggle. The learning objectives are to learn and apply embedding techniques like bag-of-words, tf-idf, and Word2Vec on a real dataset. The learning outcomes are to demonstrate the techniques and select the appropriate one for a given problem. The document then provides detailed explanations and examples of bag-of-words, tf-idf, and Word2Vec embeddings.

Assignment 2

Title: Perform bag-of-words approach (count occurrence, normalized count occurrence), tf-idf on data.
Create embeddings using Word2Vec. Dataset to be used:
https://www.kaggle.com/datasets/CooperUnion/cardataset

Learning Objectives:
 To learn embedding techniques like bag-of-words, TF-IDF, and Word2Vec.
 To use embedding techniques like bag-of-words, TF-IDF, and Word2Vec on a real dataset and analyze the results.

Learning Outcome:

 To demonstrate embedding techniques like bag-of-words, TF-IDF, and Word2Vec.


 To select the appropriate technique by understanding a real-world problem statement.

Theory:

Bag of Words (BoW)


Bag of words is a Natural Language Processing technique for text modelling.

A problem with modeling text is that it is messy, and techniques like machine learning algorithms
prefer well-defined, fixed-length inputs and outputs. Machine learning algorithms cannot work with raw
text directly; the text must be converted into numbers, specifically vectors of numbers. This is called
feature extraction or feature encoding.

The bag-of-words model is a way of representing text data when modeling text with machine
learning algorithms. It is a popular and simple method of feature extraction from text data.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It involves two things:

1. A vocabulary of known words.


2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the
document is discarded. The model is only concerned with whether known words occur in the
document, not where in the document.
The most common kind of characteristic, or feature calculated from the Bag-of-words model is term
frequency, which is essentially the number of times a term appears in the text. Term frequency is not
necessarily the best representation for the text, but it still does find successful applications in areas
like email filtering. Term frequency isn’t the best representation of the text because common words
such as "the", "a", "to" are almost always the terms with highest frequency in the text. This shows
that having a high raw count does not necessarily indicate that the corresponding word is more
important.
Advantages of BoW Approach
The most significant advantage of the bag-of-words model is its simplicity and ease of use. It can be
used to create an initial draft model before proceeding to more sophisticated word embeddings.

Disadvantages of BoW Approach

 Vocabulary: The vocabulary requires careful design, most specifically in order to manage
the size, which impacts the sparsity of the document representations.
 Sparsity: Sparse representations are harder to model both for computational reasons (space
and time complexity) and also for information reasons, where the challenge is for the models
to harness so little information in such a large representational space.
 Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the
document (semantics). Context and meaning can offer a lot to the model; if modeled, they could
tell the difference between the same words arranged differently (“this is interesting” vs
“is this interesting”), synonyms (“old bike” vs “used bike”), and much more.

Bag-of-words example

Let's assume we have the following three sentences in our corpus.

Sentence 1: Data science is fun and interesting

Sentence 2: Data science is fun

Sentence 3: science is interesting

The unique words in the sentences are: [data, science, is, fun, and, interesting]. Hence, the bag-of-words
vectors for the above sentences will be

Sentence 1: [1, 1, 1, 1, 1, 1]

Sentence 2: [1, 1, 1, 1, 0, 0]

Sentence 3: [0, 1, 1, 0, 0, 1]
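Note that bag-of-words vectors hold raw counts, not just presence: a hypothetical fourth sentence such as "data science is fun fun" would be encoded as [1, 1, 1, 2, 0, 0] under the same vocabulary.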

TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in
natural language processing and information retrieval. It measures how important a term is within a
document relative to a collection of documents (i.e., relative to a corpus). Words within a text
document are transformed into importance numbers by a text vectorization process. There are many
different text vectorization scoring schemes, with TF-IDF being one of the most common.

Term Frequency: TF of a term or word is the number of times the term appears in a document
compared to the total number of words in the document.

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus
that contain the term. Words unique to a small percentage of documents (e.g., technical jargon
terms) receive higher importance values than words common across all documents (e.g., a, the,
and).
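
In one common formulation (exact conventions vary between textbooks and libraries), for a term t, a document d, and a corpus of N documents:

TF(t, d) = (number of times t appears in d) / (total number of words in d)

IDF(t) = log(N / number of documents containing t)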
The TF-IDF of a term is calculated by multiplying TF and IDF scores.

TF-IDF = TF * IDF

Importance of a term is high when it occurs a lot in a given document and rarely in others. In short,
commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the
corpus.
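
As a small worked example with the three sentences from the bag-of-words section, using base-10 logarithms (the same convention as the code later in this document): "interesting" occurs once in Sentence 3, which has 3 words, so TF = 1/3 ≈ 0.333; it occurs in 2 of the 3 sentences, so IDF = log10(3/2) ≈ 0.176, giving TF-IDF ≈ 0.059. By contrast, "is" occurs in every sentence, so its IDF = log10(3/3) = 0 and its TF-IDF is 0 in all three documents.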

TF-IDF is useful in many natural language processing applications. For example, Search Engines
use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text
classification, text summarization, and topic modeling.

word2vec Approach
BoW and TF-IDF consist of a set of words (vocabulary) and a metric like frequency or term
frequency-inverse document frequency (TF-IDF) to describe each word’s value in the corpus. That
means BoW and TF-IDF can result in sparse matrices and high dimensional vectors that consume a
lot of computer resources if the vocabulary is very large.

Developed by a team of researchers at Google, word2vec attempts to solve the issues with the BoW
approach:

 High-dimension vectors
 Words assumed completely independent of each other

Using a neural network with only a couple layers, word2vec tries to learn relationships between
words and embeds them in a lower-dimensional vector space. To do this, word2vec trains words
against other words that neighbor them in the input corpus, capturing some of the meaning in the
sequence of words. The researchers devised two novel approaches:

 Continuous bag of words (CBoW)


 Skip-gram

The CBoW architecture predicts the current word based on the context while the skip-gram predicts
surrounding words given the current word.

CBoW: Given a set of neighbouring (context) words, predict the single word that is likely to occur
alongside them.

Skip-gram: Given the single word being analyzed, predict its likely neighbouring words.

Both approaches result in a vector space that maps word-vectors close together based on
contextual meaning. That means, if two word-vectors are close together, those words should have
similar meaning based on their context in the corpus.
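A well-known illustration of this property from the original word2vec experiments is word analogies: the vector for "king" minus the vector for "man" plus the vector for "woman" lands close to the vector for "queen", showing that the learned space captures semantic relationships. A gensim-based implementation sketch is given near the end of this document.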
Bag of Words Algorithm Implementation
'''vectorize() takes the list of words in a sentence as input
and returns a vector the size of filtered_vocab. It puts 0 if the
word is not present in tokens and the count of the token if present.'''
def vectorize(tokens):
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

'''unique() returns a list in which the original order is preserved
and no item repeats. Using set() directly does not preserve the
original ordering, so it is not used here.'''
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# create a list of stopwords. You can import stopwords from nltk too
stopwords = ["to", "was", "a"]
# list of special characters. You can use regular expressions too
special_char = [",", ":", " ", ";", ".", "?"]

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# convert the strings to lower case
string1 = string1.lower()
string2 = string2.lower()
string3 = string3.lower()

# split the sentences into tokens
tokens1 = string1.split()
tokens2 = string2.split()
tokens3 = string3.split()

print(tokens1)
print(tokens2)
print(tokens3)

# create a vocabulary list
vocab = unique(tokens1 + tokens2 + tokens3)
print(vocab)

# filter the vocabulary list
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)
print("Final filtered vocabulary: ", filtered_vocab)

# convert sentences into vectors
vector1 = vectorize(tokens1)
print("Sentence 1 vector :", vector1)
vector2 = vectorize(tokens2)
print("Sentence 2 vector :", vector2)
vector3 = vectorize(tokens3)
print("Sentence 3 vector :", vector3)
Creating Bag of Words using sklearn library

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# a single document containing all three sentences (used in the next sections)
doc = " ".join([string1, string2, string3])

CountVec = CountVectorizer(ngram_range=(1, 1))
# fit the vectorizer and transform the sentences into count vectors
Count_data = CountVec.fit_transform([string1, string2, string3])

# create a dataframe with one row per sentence and one column per word
cv_dataframe = pd.DataFrame(Count_data.toarray(),
                            columns=CountVec.get_feature_names_out())
print(cv_dataframe)
Note that CountVectorizer sorts the vocabulary alphabetically before generating vectors.
Count Occurrence

count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
# build a (word, count) table from the single-document count vector
count_occur_df = pd.DataFrame(
    (word, count) for word, count in zip(
        count_vec.get_feature_names_out(),
        count_occurs.toarray().tolist()[0]))
count_occur_df.columns = ['Word', 'Count']
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()
Normalized Count Occurrence

from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=False with norm='l2' gives L2-normalized count vectors
norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    (word, count) for word, count in zip(
        norm_count_vec.get_feature_names_out(),
        norm_count_occurs.toarray().tolist()[0]))
norm_count_occur_df.columns = ['Word', 'Count']
norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
norm_count_occur_df.head()
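
With use_idf=False and norm='l2', TfidfVectorizer effectively divides each document's raw count vector by its Euclidean (L2) length, so every value in the 'Count' column lies between 0 and 1 and the squares of a document's values sum to 1; the ranking of words stays the same as with raw counts.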

import pandas as pd
import numpy as np

corpus = ['Natural language processing is fun and interesting',
          'Natural language processing is fun',
          'Hindi language is interesting']
#creating a word set for the corpus
words_set = set()

for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

Computing Term Frequency

# create a dataframe with one row per document in the corpus and one column
# per word in the word set, and use it to compute the term frequency (TF)
n_docs = len(corpus)          # number of documents in the corpus
n_words_set = len(words_set)  # number of unique words in the corpus

df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # words in the document
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))

df_tf

Computing Inverse Document Frequency

print("IDF of: ")

idf = {}

for w in words_set:
k = 0 # number of documents in the corpus that contain this word

for i in range(n_docs):
if w in corpus[i].split():
k += 1

idf[w] = np.log10(n_docs / k)

print(f'{w:>15}: {idf[w]:>10}' )

df_tf_idf = df_tf.copy()

for w in words_set:
for i in range(n_docs):
df_tf_idf[w][i] = df_tf[w][i] * idf[w]

df_tf_idf
TF-IDF using sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer

tr_idf_model = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

words_set = tr_idf_model.get_feature_names_out()

print(words_set)

df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)

df_tf_idf
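
Note that sklearn's TF-IDF values will not match the manual computation above exactly: by default TfidfVectorizer uses a smoothed, natural-log IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each row, whereas the manual version used log10(N / df(t)) with no normalization.

Creating Word2Vec embeddings using gensim

The assignment also calls for Word2Vec embeddings. The following is a minimal sketch using the gensim library (assumed to be installed), not prescribed code: it trains on the small example corpus above, so the resulting vectors are only illustrative. For the Kaggle car dataset you would instead tokenize a text column loaded with pandas. The hyperparameters vector_size, window, min_count, and sg below are illustrative choices.

from gensim.models import Word2Vec

# tokenized sentences; replace with tokenized text from the Kaggle car dataset
sentences = [
    ['natural', 'language', 'processing', 'is', 'fun', 'and', 'interesting'],
    ['natural', 'language', 'processing', 'is', 'fun'],
    ['hindi', 'language', 'is', 'interesting'],
]

# sg=0 trains CBoW, sg=1 trains skip-gram
w2v_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# dense embedding vector for a single word
print(w2v_model.wv['language'])

# words whose vectors are most similar to 'language' in this toy corpus
print(w2v_model.wv.most_similar('language', topn=3))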

Conclusion:
 Bag-of-words, TF-IDF, and Word2Vec are word embedding techniques.
 Words are transformed into numerical form, which is necessary before machine learning
algorithms can be applied.
