
A

SEMINAR REPORT
On
Machine Learning

Submitted to Rajasthan Technical University


In partial fulfillment of the requirement for the award of the degree
of
Bachelor of Technology
in
COMPUTER SCIENCE & ENGINEERING

Submitted By-
Amrit Kumar Sah (16EVJCS020)

Under the Guidance of


Mr. Bharat Bhushan Singhal
(Asst. Professor, Department of CSE)
at

VIVEKANANDA INSTITUTE OF TECHNOLOGY, JAIPUR


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
RAJASTHAN TECHNICAL UNIVERSITY, KOTA
July, 2019

Certificate Copy

Candidate’s Declaration

I, Amrit Kumar Sah [16EVJCS020], B.Tech. (Semester VII) of Vivekananda Institute of
Technology, Jaipur, hereby declare that the Seminar Report entitled "Machine Learning" is an original
work and that the data provided in the study is authentic to the best of my knowledge. This report has not
been submitted to any other institute for the award of any other degree.

Amrit Kumar Sah


Roll No. 16EVJCS020

Place: Jaipur
Date:

ACKNOWLEDGEMENT

I take this opportunity to express my deepest gratitude to those who generously shared their valuable
knowledge and expertise with me during my training.
It is a great pleasure to present this report on the project entitled "Machine Learning", undertaken by
me as part of my B.Tech (CSE) curriculum. I am thankful to Vivekananda Institute of Technology for
offering me such a wonderful and challenging opportunity.
It is a pleasure to pen down these lines to express my sincere thanks to the people who helped me
along the way in completing this project; words are inadequate to express my gratitude towards them.
I express my sincere gratitude to Prof. (Dr.) N. K. Agarwal (Principal, VIT) for providing me the
opportunity to undergo this project as part of the curriculum. I am thankful to Miss Kirti
Gupta for her support, co-operation and motivation during the training, and for her constant
inspiration, presence and blessings. I would also like to thank my H.O.D., Mr. Tushar Vyas, for his
valuable suggestions, which helped a lot in the completion of this project.
Lastly, I would like to thank the Almighty and my parents for their moral support, and my friends, with
whom I share my day-to-day experiences and receive many suggestions that improve the quality of my work.

Name : Amrit Kumar Sah


Roll No. : 16EVJCS020

ABSTRACT
In this project, we were asked to experiment with a real world dataset, and to explore how machine learning
algorithms can be used to find the patterns in data. We were expected to gain experience using a common
data-mining and machine learning library, Weka, and were expected to submit a report about the dataset and
the algorithms used. After performing the required tasks on a dataset of my choice, herein lies my final report.

Keywords: Machine Learning, Pattern Recognition, Classification, Supervised learning, Artificial Intelligence.

TABLE OF CONTENTS

CONTENT PAGE NO

Declaration i

Acknowledgement ii
Abstract iii

Table of content iv

Contents v

Contents

ACKNOWLEDGEMENT....................................................................................................... iv

ABSTRACT........................................................................................................................... v

CHAPTER 1 INTRODUCTION……………………………………………………………10

1.1 Objectives…………………………………………………………………………………11
1.1.1 Supervised learning…………………………………………………………………….11
1.1.2 Unsupervised learning…………………………...……………………………………….11
1.1.3 Decision time……………………………………………………………………………12
1.2 Motivation...................................................................................................................... 12
1.3 Internship Goals ............................................................................................................ 14
1.4 Report Layout................................................................................................................ 14

CHAPTER 2 INTERNSHIP ENTERPRISE ..................................................................... 16

2.1 About the Company ...................................................................................................... 16


2.2 Head Office .................................................................................................................... 16
2.3 IT Services Offered ........................................................................................................ 16
2.4 Roles in Job Market ...................................................................................................... 17
CHAPTER 3 INTERNSHIP ROLES AND RESPONSIBILITIES ......................................... 18

3.1 Training Attended ......................................................................................................... 18

3.2 Assigned Responsibilities .............................................................................................. 18

3.3 Work Environment ....................................................................................................... 18

3.4 Data Analyst Responsibilities ......................................................................................... 18


3.5 Data Analyst Job Duties………………………………………………………………….19
3.6 Responsibilities…………………………………………………………………………...19
3.7 System Design…………………………………………………………………………….19
3.8 Performed Tasks………………………………………………………………………….21

CHAPTER 4 INTERNSHIP OUTCOMES ....................................................................... 24

4.1 Problem & Solution………………………………………………………………………24

4.1.1 Problems with their Solutions .................................................................................... 24

4.2 Learning Outcomes ....................................................................................................... 26


4.2.1 Python Programming ..................................................................................................... 26
4.2.2 NumPy ........................................................................................................................ 26
4.2.3 Pandas…………………………………………………………………………………….27
4.2.4 Data Visualisation ........................................................................................................... 29
4.2.5 Basic Stats And Regression .......................................................................................... 30
4.2.6 Machine Learning & ML project ................................................................................ 31
4.2.7 NLP & NLP project........................................................................................................... 31
5.1 Conclusion ..................................................................................................................... 33
5.2 Future Scopes ................................................................................................................ 33
REFERENCE ...................................................................................................................... 35

LIST OF FIGURES
FIGURES
Figure 1: ML
Figure 2: ML
Figure 3: ML
Figure 4: ML
Figure 5: NLP
Figure 6: NLP
Figure 7: NLP
Figure 8: NLP

Chapter 1
Introduction
What is Machine Learning? A definition

Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide. The primary aim is to allow
computers to learn automatically, without human intervention or assistance, and to adjust
actions accordingly.

Some machine learning methods

Machine learning algorithms are often categorized as supervised or unsupervised.

 Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis of
a known training dataset, the learning algorithm produces an inferred function to make
predictions about the output values. The system is able to provide targets for any new
input after sufficient training. The learning algorithm can also compare its output with
the correct, intended output and find errors in order to modify the model accordingly.
 In contrast, unsupervised machine learning algorithms are used when the information
used to train is neither classified nor labeled. Unsupervised learning studies how systems
can infer a function to describe a hidden structure from unlabeled data. The system
doesn’t figure out the right output, but it explores the data and can draw inferences from
datasets to describe hidden structures from unlabeled data.
 Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training –
typically a small amount of labeled data and a large amount of unlabeled data. The
systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when the available labeled data requires
skilled and relevant resources to train from or learn from, whereas
acquiring unlabeled data generally doesn't require additional resources.
 Reinforcement machine learning is a learning method that interacts with
its environment by producing actions and discovering errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically determine
the ideal behavior within a specific context in order to maximize their performance.
Simple reward feedback is required for the agent to learn which action is best; this is
known as the reinforcement signal.
Machine learning enables analysis of massive quantities of data. While it generally
delivers faster, more accurate results in order to identify profitable opportunities or
dangerous risks, it may also require additional time and resources to train it properly.
Combining machine learning with AI and cognitive technologies can make it even more
effective in processing large volumes of information.

1.1 Objectives
The purpose of machine learning is to discover patterns in your data and then make
predictions based on often complex patterns to answer business questions, detect and
analyse trends and help solve problems.

Machine learning is effectively a method of data analysis that works by automating the
process of building data models.

1.1.1 Supervised learning


Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. It infers a function
from labeled training data consisting of a set of training examples. In supervised
learning, each example is a pair consisting of an input object (typically a vector)
and a desired output value (also called the supervisory signal). A supervised
learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow
for the algorithm to correctly determine the class labels for unseen instances.
This requires the learning algorithm to generalize from the training data to
unseen situations in a "reasonable" way (see inductive bias).

The parallel task in human and animal psychology is often referred to as concept
Learning.

The aim of supervised machine learning is to build a model that makes predictions based on
evidence in the presence of uncertainty. Supervised learning uses classification and regression
techniques to develop predictive models.
• Classification techniques predict discrete responses
• Regression techniques predict continuous responses
Using Supervised Learning to Predict Heart Attack.
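To make the classification workflow above concrete, here is a minimal supervised-learning sketch with scikit-learn; the file name heart.csv, its target column, and the choice of logistic regression are illustrative assumptions, not part of the original training material:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical labeled dataset: patient features plus a 0/1 "target" column.
data = pd.read_csv("heart.csv")
X = data.drop(columns=["target"])      # input objects (feature vectors)
y = data["target"]                     # desired outputs (supervisory signal)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # a simple classification technique
model.fit(X_train, y_train)                 # infer a function from labeled examples
print(accuracy_score(y_test, model.predict(X_test)))   # check generalization on unseen data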

1.1.2 Unsupervised learning


Unsupervised learning is a type of self-organized Hebbian
learning that helps find previously unknown patterns in a data set without pre-
existing labels. It is also known as self-organization and allows
modeling probability densities of given inputs.[1] It is one of the three main
categories of machine learning, along with supervised and reinforcement
learning. Semi-supervised learning has also been described, and is a
hybridization of supervised and unsupervised techniques.
Two of the main methods used in unsupervised learning are principal
component analysis and cluster analysis. Cluster analysis is used in unsupervised
learning to group, or segment, datasets with shared attributes in order to
extrapolate algorithmic relationships.[2] Cluster analysis is a branch
of machine learning that groups the data that has not been labelled,
classified or categorized. Instead of responding to feedback, cluster
analysis identifies commonalities in the data and reacts based on the
presence or absence of such commonalities in each new piece of data. This
approach helps detect anomalous data points that do not fit into either
group.
A central application of unsupervised learning is in the field of density
estimation in statistics,[3] though unsupervised learning encompasses
many other domains involving summarizing and explaining data features.

The aim of unsupervised machine learning is to find hidden patterns or


intrinsic structures in data. It is used to draw inferences from datasets
consisting of input data without labeled responses. Unsupervised learning uses
clustering techniques to develop models.
• Clustering is the most common unsupervised learning technique. It is used
for exploratory data analysis to find hidden patterns or groupings in data.

Applications for clustering include gene sequence analysis, market research,


and object recognition.
Using Unsupervised Learning to Predict Heart Attack.
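For comparison, a minimal clustering sketch in the same spirit; the data file and the three feature columns are again assumptions for illustration only:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical unlabeled patient data: no target column this time.
data = pd.read_csv("heart.csv")
features = data[["age", "chol", "trestbps"]]          # assumed feature columns

scaled = StandardScaler().fit_transform(features)      # put features on a common scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
data["cluster"] = kmeans.fit_predict(scaled)           # group patients with shared attributes

# Inspect what each discovered group looks like on average.
print(data.groupby("cluster")[["age", "chol", "trestbps"]].mean())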

1.1.3 Decision Time


How to decide which algorithm to use?
• Choosing the right algorithm can seem overwhelming
• There is no best method or one size fits all. Finding the right algorithm is
partly just trial and error
• Algorithm selection also depends on the size and type of data you’re
working with, the insights you want to get from the data, and how those
insights will be used.

1.2 Motivation
To me it was motivating to learn because finally I could see how all the
math I had studied at university is applied in real life, and it's not only
interesting, but also very useful.

Also, just the thought that, given the data, you can extract something
useful from it is already very motivating. For example, if you measure
your weight every day, then, once you accumulate enough data, you
can extract some helpful insights from it that you otherwise would not be able
to learn.
Another motivation could be money. Data science is quite a hot topic
nowadays and data scientists are paid quite well - companies have
tons and tons of data and they need people who know how to mine
something useful from this data. And there are more and more "data
driven" companies that need people who can mine insight from the
raw floods of information.

1.3 Internship Goals
1. Gain more professional connections.
We can’t stress this enough—the people you meet at your internship are important. Exchange
contact info, connect on LinkedIn, and make sure you’re not saying goodbye for good when
you walk out the door on your last day! These co-workers are your future references, mentors,
and friends that can alert you to any new job opportunities. You wouldn’t wanna miss out on
that, would you?

2. Develop skills you can add to your resume.


It’s definitely a big win if you can add at least one hard skill to your resume, such as mastering
a certain computer program, learning how to analyze data, or something else that’s
measurable. As for your soft skills? Think of things like effective communication, your ability
to work in a team, and your problem-solving skills.

3. Learn what you do and don’t like.


You’re not only at your internship to learn skills; you also want to learn about yourself!
Sometimes you’ll find that you actually hate working on something you thought you’d enjoy,
or you’ll realize there’s an aspect of your job you’d absolutely love to do more of! This will
help you when you’re searching for future opportunities—you’ll know what kinds of job
descriptions you’ll want to avoid, and vice versa.

4. Gain more confidence in a professional setting.


It’s easy to feel a bit sheepish as an intern, but this is your chance to gain confidence. If right
now you shy away from things like sharing your opinions or speaking in front of large groups,
make it a goal to conquer those fears by the end of your internship. It’ll do you good if you
embrace opportunities that initially scare you!

5. Learn about your own working style.


Here’s another important chance to learn a thing or two about yourself. Are you most
productive in the mornings? Maybe that means you should show up a bit early each day! Do
you work best by collaborating with others? For your next opportunity, you can search for a
role involving a lot of teamwork. It’ll be easier steering your career in the right direction once
you’ve got your working style down pat.

1.4 Report Layout


Layout includes such things as the chapter objectives, the type of paper chosen, the
margins, the line spacing, the pagination, and the incorporation of equations, illustrations,
and references. For each report that I create, I can assign a report layout. Via the report layout I define the layout
features of the report, such as the page format of the report and the numeric format of the
report data. When I use the layout functions, I can set parameters for the report. When I
define a report layout, I define a layout specifically for that report, and I can change the
standard layout assigned to it.
 In Chapter 1, I introduce machine learning and its objectives, covering supervised
learning, unsupervised learning and how to decide which algorithm to use, followed by my
motivation, the internship goals and this report layout.
 Chapter 2 describes the internship enterprise: about the company, its head office, the IT
services offered, and roles in the job market.
 Chapter 3 contains the internship roles and responsibilities: training attended, assigned
responsibilities, the work environment, data analyst responsibilities and duties, the system
design, and the performed tasks.
 Chapter 4 describes the internship outcomes: problems with their solutions, and learning
outcomes covering Python programming, NumPy, Pandas, data visualisation, basic statistics
and regression, machine learning, and NLP.
 In Chapter 5, I simply present the discussion, conclusion and future scope.

Chapter 2
Internship Enterprise

2.1 About the Company


EduGrad
We are an edtech company with a strong belief in the "Learn by Doing" approach, building
optimum learning experiences and developing competencies in the area of emerging
technologies.

Journey
EduGrad was born in May 2018 in a bid to teach niche technologies through a learn-by-doing
approach. In Aug '18 our first course, in Data Analytics, was rolled out. By the end of December
our batches were running across 20 colleges spread across NCR, Hyderabad and Delhi.

Vision
Our vision is to empower students with problem-solving skills.

Mission
Our mission is to prepare our learners to face real-world challenges by giving them a
world-class learning experience, expert mentor guidance and a learn-by-doing approach.

2.2 Head Office


Registered Address: 3rd Floor, Vakula Mansion, beside HP Petrol Pump, Telecom Nagar,
Gachibowli, Hyderabad - 500032

Business Address: H-85, 3rd Floor, Sector 63, Noida, Uttar Pradesh - 201301; CIN:
U72900TG2018PTC124426
www.edugrad.com

2.3 IT Services Offered


Technologies Languages
 Learn Data Analytics using Python
 Master Python for Data Science and Machine Learning
 Introduction to GIT
 Learn Web Scraping using Python
 Intro to Database Tools for Data Science
 Presentation Skills for Data Scientists
 Machine Learning

2.4 Roles in Job Market
The top respondents for the job title Data Analyst are from the companies
Accenture, EduGrad, Tata Consultancy Services Limited and EY (Ernst & Young).
Reported salaries are highest at HSBC, where the average pay is Rs 687,845.
Other companies that offer high salaries for this role include Accenture and Tata
Consultancy Services Limited, at around Rs 484,711 and Rs 464,577,
respectively. eClerx pays the lowest, at around Rs 204,419. Genpact and EY
(Ernst & Young) also pay on the lower end of the scale, at Rs 350,000 and
Rs 423,728, respectively.

Chapter 3
INTERNSHIP ROLES AND RESPONSIBILITIES
3.1 Training Attended
Several trainings had to be attended, because building a project requires combining multiple
skills, such as Python, OOP concepts and Python libraries. The Anaconda / Jupyter
notebook combines the results of the interpreted and executed Python code, which may be
any type of data, including images, with the generated analytical page. Python code may
also be executed with a command-line interface (CLI) and can be used to implement
standalone graphical applications. The trainings I attended at EduGrad are given below:
1. Python Programming
2. NumPy
3. Pandas
4. Data Visualisation
5. Basic Stats & Regression Models
6. ML Overview & ML Project
7. NLP Overview & NLP Project Completed

3.2 Assigned Responsibilities:

Instructions -

1. Please read the questions carefully before attempting them.


2. Solve all the questions in a SINGLE Jupyter notebook file.
3. In case the name of the variable to be used is mentioned in the question, use the
same name while coding (marks are associated with it).
4. In your answers, include your descriptions as and when mentioned. Think of
yourself as a data analyst who needs to suggest and explain solutions to the
client based on the data.

3.3 Work Environment


Front-end Developer:
The front-end developer generally works on the client side, dealing with the web page design and
graphics that are accessible to the user.
Back-end Developer:
The back-end developer is the person responsible for the back-end development that
interacts with the server. This type of web developer specializes in languages like
Python.

3.4 Data Analyst Responsibilities:


 Interpreting data, analyzing results using statistical techniques

 Developing and implementing data analyses, data collection systems and other
strategies that optimize statistical efficiency and quality
 Acquiring data from primary or secondary data sources and maintaining databases

3.5 Data Analyst Job Duties


Data analyst responsibilities include conducting full lifecycle analysis to
include requirements, activities and design. Data analysts will develop
analysis and reporting capabilities. They will also monitor performance
and quality control plans to identify improvements.

3.6 Responsibilities
 Interpret data, analyze results using statistical techniques and
provide ongoing reports
 Develop and implement databases, data collection systems, data
analytics and other strategies that optimize statistical efficiency and
quality
 Acquire data from primary or secondary data sources and maintain
databases/data systems
 Identify, analyze, and interpret trends or patterns in complex data
sets
 Filter and “clean” data by reviewing computer reports, printouts, and
performance indicators to locate and correct code problems
 Work with management to prioritize business and information needs
 Locate and define new process improvement opportunities

3.7 System Design


This section explains our methodology and the system architecture. Fig. 3 gives a
graphical representation of our prototype system. It consists of two main modules: one
that is language dependent and another that is language independent. The following
subsections explain the individual system components in detail.

A. Data pre-processing. This component is part of the language-dependent system module.
We designed the preprocessor in such a way that a change in the input language does not
affect the rest of the system components. First, we tokenize the raw survey questions with
a tool that depends on the survey's source language. For Latin-character-based
languages such as Spanish, German, and French, we build the tokenizers using the Python
Natural Language Toolkit (NLTK) [8] and predefined regular expressions. For Asian
languages such as Japanese, we use morphology-based segmenters (e.g., MeCab and
TinySegmenter for Japanese text) to tokenize the survey text. Second, we standardize
tokens by removing noise terms and stop-words; we used language-dependent stop-word
lists for this purpose. Third, we represent each survey or question as a document in a
sparse bag-of-words format, after building a vocabulary of corpus words (separately for
each language we used). Finally, we use the documents as input to the topic learning
model, which, in turn, learns clusters from the term co-occurrence frequencies of the
corresponding documents. See Fig. 3 for more details.

B. Topic learning. As discussed earlier, topic models have the ability to learn semantic
relationships of words from an observed text collection. In this system, topic modeling is
used for three main purposes: i) categorizing and ranking surveys, ii) survey
sub-categorization and ranking, and iii) clustering of survey questions under an identified
survey sub-cluster. Survey ranking is performed to identify relevant surveys that belong to
general (top-level) topics such as market research, education, and sports. To perform
ranking, we first compute the topic mixtures of the survey documents, which are formed by
combining survey questions. To estimate the topical structure from the survey documents,
we use HDP [3], which can learn the number of topics automatically (this is one of our
primary goals) along with the topic model from large document collections. A detailed
theoretical review of HDP and its inference methods is presented by Teh et al. [3]. We use
a modified version of the HDP implementation by Wang and Blei [9] in our experiments.
The major components of a learned HDP model are the corpus-level topic-word association
counts and the document-level topic mixtures. Each topic in the estimated model is
represented by its topic-word probabilities. These words are used by language experts to
name survey categories. The document-level topic mixtures give an idea of the topicality of
a particular survey to a given topic. This is also quite useful in finding similar surveys and
grouping them together. From the observations of the top-level survey categorization
explained above, we found that some of the topics found by the HDP estimation process
can be further divided into subtopics, and the corresponding surveys can be ranked by
subtopic relevance. For modeling survey subtopics, we use the original LDA model [2]
because it is more accurate and less computationally expensive than HDP. We use the
Gensim package's [10] online variational inference implementation for the model
estimation process. Conventional topic modeling algorithms are designed to work on larger
documents than survey questions (section II); the chance of a term re-occurring in the same
question is quite low compared to the typical documents used in the topic modeling
literature. So, to cluster questions to build question banks, we represent questions in a much
simpler format, such as TF-IDF, and perform LSI, which helps to represent the questions in
the smaller LSI space rather than the vocabulary space.

C. Survey relevance ranking. We use survey relevance ranking to group together surveys
belonging to an estimated topic (Fig. 1). We use the individual surveys' estimated document
topic mixtures, \hat{\theta}_d, to rank them on relevance given a topic or set of topics. For a
given topic set T ⊂ K, we calculate

m(d) = \sum_{k \in T} \ln \hat{\theta}_{d,k} + \sum_{j \notin T} \ln(1 - \hat{\theta}_{d,j})    (1)

for all surveys d = 1, 2, ..., D in the corpus and sort them to rank their relevance. Here, we
assume that the document topic mixtures \hat{\theta}_d satisfy the multinomial property
\sum_{j=1}^{K} \hat{\theta}_{d,j} = 1. Intuitively, we can see that this equation maximizes
the score of a topic set T ⊂ K given a document. A document with a high value of this
score is a highly relevant document for that topic set.

D. Question clustering and ranking. One of the goals of this project is to design a system
that can recommend useful, relevant survey questions, given a selected survey topic (e.g.,
education), for building question banks. Once we have the surveys that belong to a given
topic, we group similar survey questions into question groups and rank them within each
group based on several ranking scores. We first apply fuzzy C-means (FCM) clustering [4],
[11] to the set of survey questions represented in LSI space (section III-B). Second, we rank
the questions that belong to a given cluster based on measures such as string matching,
fuzzy set matching [12], and distance from the cluster centroid. Finally, we remove
duplicate questions and present the ranked questions to survey designers.

3.8 Performed Tasks

TOPIC MODELLING

Analytics is all about obtaining useful Information from the data. With the growing
amount of data in recent years, which is mostly unstructured, it’s difficult to obtain the
relevant and desired information. But, with the help of technology, powerful methods
can be deployed to mine through the data and fetch the information that we are looking
for.

One such technique in the field of text mining/data mining is Topic Modelling. As the
name suggests, it is a process to automatically identify topics present in any text object
and to derive hidden patterns exhibited by the text corpus. This helps in assisting better
decision making.

Topic Models are very useful for document clustering, organizing large
blocks of textual data, information retrieval from unstructured text and feature selection.

While searching online for any news article on the web, by passing only some topics the
entire news article can be retrieved and displayed.

Therefore, each news article or document can be divided into several topics through
which that entire document can be recreated.

This project deals with extracting topics for a couple of news articles and also extracts
details such as person name, location and organization for each story.

Project Credit: - Startup Byte

Dataset – The dataset contains two files Startup_data.xlsx and cities_r2.csv

The cities_r2.csv file helps in finding the cities for each startup.

The attributes of Startup_data.xlsx are:

STARTUP NEWS: The news article posted online on the website

SUMMARY: A short summary of the news article

POSTED BY: Name of the person who posted the startup news

DESCRIPTION: The complete information or story of that news

The data was collected from different sources and stored into CSV files.

TASK 1: Loading the dataset

1. Load the necessary libraries into Python.
2. Load the dataset Startup_data.xlsx into Python as a pandas data frame and
name it startup_data.
3. Print the top 5 rows of the data frame and perform exploratory analysis of the
data.
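A minimal sketch of Task 1, assuming the Excel file is in the working directory (reading .xlsx requires an Excel engine such as openpyxl):

import pandas as pd

# Load the dataset into a data frame named startup_data, as the task requires.
startup_data = pd.read_excel("Startup_data.xlsx")

print(startup_data.head())       # top 5 rows
startup_data.info()              # column types and missing values (printed directly)
print(startup_data.describe())   # quick summary statistics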

TASK 2: Data Cleaning and Wrangling

1. Combine the Startup_News, Summary and Description columns into a new
column Content.
2. Convert the Content column to a list using a suitable method.
3. Clean the data by removing Unicode characters and blank spaces.
4. Make a function named clean that accepts a string and returns a string
containing only numbers, alphabets and permitted special characters.
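One possible sketch of Task 2, continuing from the Task 1 sketch; the column names are taken from the task description and may differ in the actual file:

import re

# 1-2. Combine the three text columns into Content, then convert it to a list.
startup_data["Content"] = (startup_data["Startup_News"].fillna("") + " "
                           + startup_data["Summary"].fillna("") + " "
                           + startup_data["Description"].fillna(""))
stories = startup_data["Content"].tolist()

# 3-4. A clean() function that keeps only plain characters and trims blank runs.
def clean(text):
    text = text.encode("ascii", errors="ignore").decode()     # drop Unicode characters
    text = re.sub(r"[^A-Za-z0-9 .,!?'-]", " ", text)           # keep letters, digits, basic punctuation
    return re.sub(r"\s+", " ", text).strip()                   # collapse blank spaces

stories = [clean(s) for s in stories]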

TASK 3: Natural Language Processing

1. After cleaning the text, apply natural language processing to each story.
2. Tokenize each story and remove stopwords.
3. Also remove punctuation marks and store the lemmatized words in the final
result.
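A sketch of the Task 3 preprocessing with NLTK, continuing from the previous sketch; the required NLTK resources are downloaded on first use:

import string
from nltk import download
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (safe to re-run).
for pkg in ("punkt", "stopwords", "wordnet"):
    download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(story):
    tokens = word_tokenize(story.lower())                                  # tokenize each story
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]      # drop stopwords and punctuation
    return [lemmatizer.lemmatize(t) for t in tokens]                       # keep lemmatized words

processed_stories = [preprocess(s) for s in stories]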

TASK 4: Text Visualization

After applying natural language processing to each text, take the top 5 stories and visualize
the most frequent words in those 5 stories using the Wordcloud library in Python.
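A sketch of Task 4 using the wordcloud and matplotlib packages (both assumed to be installed):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join the tokens of the top 5 processed stories and plot their most frequent words.
text = " ".join(word for story in processed_stories[:5] for word in story)

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()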

TASK 5: Topic Modelling

Apply the LDA topic modeling algorithm to every news article, extract 10 topics for
each article, and store them in a new column corresponding to each news item.
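One possible way to carry this out with Gensim, continuing from the Task 3 sketch; the choice of a 10-topic corpus-wide model and the way topic terms are summarized into a column are illustrative assumptions:

from gensim import corpora, models

# Build a bag-of-words corpus from the processed stories and fit a 10-topic LDA model.
dictionary = corpora.Dictionary(processed_stories)
corpus = [dictionary.doc2bow(tokens) for tokens in processed_stories]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)

def topic_terms(bow, n_words=3):
    # Top words of the topics most strongly present in one article.
    doc_topics = sorted(lda.get_document_topics(bow), key=lambda t: -t[1])
    return ", ".join(word for topic_id, _ in doc_topics
                          for word, _ in lda.show_topic(topic_id, topn=n_words))

startup_data["Topics"] = [topic_terms(bow) for bow in corpus]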

TASK 6: Categorizing each news article

Categorize each story into one of the following categories:

1. Games
2. Startup
3. Fund
4. Science
5. Women

Make use of the topics extracted in the step above and apply regular expressions over them to
categorize each news item.
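A possible regular-expression categorizer; the keyword patterns are assumptions and would need tuning against the topics actually extracted in Task 5:

import re

patterns = {
    "Games":   r"\b(game|gaming|esports)\b",
    "Startup": r"\b(startup|founder|incubat\w*)\b",
    "Fund":    r"\b(fund|funding|investment|investor)\b",
    "Science": r"\b(science|research|scientist)\b",
    "Women":   r"\b(women|woman|female)\b",
}

def categorize(topic_terms):
    # Return the first category whose pattern matches the extracted topic terms.
    for category, pattern in patterns.items():
        if re.search(pattern, topic_terms, flags=re.IGNORECASE):
            return category
    return "Other"

startup_data["Category"] = startup_data["Topics"].apply(categorize)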

TASK 7: Finding more insights in the data

Use NLP to find the person name, location and organization name for each news article.

(Hint: For person names and organizations make use of Named Entity
Recognition (NER), whereas for locations make use of the CSV file cities_r2.csv.)
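A sketch of Task 7 using NLTK's named-entity chunker for person and organization names and the cities file for locations; the cities_r2.csv column name used here (name_of_city) is an assumption:

import nltk
import pandas as pd

# One-time resource downloads for the POS tagger and NE chunker.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

cities = set(pd.read_csv("cities_r2.csv")["name_of_city"].str.lower())   # assumed column name

def extract_entities(story):
    # Returns (persons, organizations, locations) found in one story.
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(story)))
    persons, orgs = [], []
    for node in tree:
        if hasattr(node, "label"):                       # named-entity subtrees only
            name = " ".join(token for token, _ in node.leaves())
            if node.label() == "PERSON":
                persons.append(name)
            elif node.label() == "ORGANIZATION":
                orgs.append(name)
    locations = [w for w in nltk.word_tokenize(story) if w.lower() in cities]
    return persons, orgs, locations

entities = [extract_entities(s) for s in stories]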

Chapter 4

Internship Outcomes

4.1 Problem & Solution


What is Machine Learning? We can read authoritative definitions of machine
learning, but really, machine learning is defined by the problem being solved.
Therefore the best way to understand machine learning is to look at some example
problems.
In this section we will first look at some well-known and understood examples of
machine learning problems in the real world. We will then look at a taxonomy
(naming system) for standard machine learning problems and learn how to identify
a problem as one of these standard cases. This is valuable, because knowing the
type of problem we are facing allows us to think about the data we need and the
types of algorithms to try.

4.1.1 Problems with their Solutions


Machine learning problems abound. They make up core or difficult parts of the
software you use on the web or on your desktop every day. Think of the "do you
want to follow" suggestions on Twitter and the speech understanding in Apple's Siri.

Below are 10 examples of machine learning that really ground what machine
learning is all about.

 Spam Detection: Given email in an inbox, identify those email messages that are
spam and those that are not. Having a model of this problem would allow a program
to leave non-spam emails in the inbox and move spam emails to a spam folder. We
should all be familiar with this example.
 Credit Card Fraud Detection: Given credit card transactions for a customer in a
month, identify those transactions that were made by the customer and those that
were not. A program with a model of this decision could refund those transactions
that were fraudulent.
 Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for
each handwritten character. A model of this problem would allow a computer
program to read and understand handwritten zip codes and sort envelopes by
geographic region.
 Speech Understanding: Given an utterance from a user, identify the specific
request made by the user. A model of this problem would allow a program to
understand and make an attempt to fulfil that request. The iPhone with Siri has this
capability.
 Face Detection: Given a digital photo album of many hundreds of digital
photographs, identify those photos that include a given person. A model of this
decision process would allow a program to organize photos by person. Some
cameras and software like iPhoto have this capability.
 Product Recommendation: Given a purchase history for a customer and a
large inventory of products, identify those products in which that customer will
be interested and likely to purchase. A model of this decision process would
allow a program to make recommendations to a customer and motivate
product purchases. Amazon has this capability. Also think of Facebook,
GooglePlus and LinkedIn that recommend users to connect with you after you
sign-up.
 Medical Diagnosis: Given the symptoms exhibited in a patient and a
database of anonymized patient records, predict whether the patient is likely
to have an illness. A model of this decision problem could be used by a
program to provide decision support to medical professionals.
 Stock Trading: Given the current and past price movements for a stock,
determine whether the stock should be bought, held or sold. A model of this
decision problem could provide decision support to financial analysts.
 Customer Segmentation: Given the pattern of behaviour by a user during a
trial period and the past behaviours of all users, identify those users that will
convert to the paid version of the product and those that will not. A model of
this decision problem would allow a program to trigger customer interventions
to persuade the customer to convert early or better engage in the trial.
 Shape Detection: Given a user hand drawing a shape on a touch screen and
a database of known shapes, determine which shape the user was trying to
draw. A model of this decision would allow a program to show the platonic
version of that shape the user drew to make crisp diagrams.
The Instaviz iPhone app does this.
These 10 examples give a good sense of what a machine learning problem looks
like. There is a corpus of historic examples, there is a decision that needs to be
modelled and a business or domain benefit to having that decision modelled and
efficaciously made automatically.

Some of these problems are some of the hardest problems in Artificial Intelligence,
such as Natural Language Processing and Machine Vision (doing things that humans
do easily). Others are still difficult, but are classic examples of machine learning such
as spam detection and credit card fraud detection.

Think about some of your interactions with online and offline software in the last week.
I’m sure you could easily guess at another ten or twenty examples of machine
learning you have directly or indirectly used.

4.2 Learning Outcomes

4.2.1 Python Programming

Python is a powerful multi-purpose programming language created by Guido


van Rossum.

It has a simple, easy-to-use syntax, making it the perfect language for someone
trying to learn computer programming for the first time.

This is a comprehensive guide on how to get started in Python, why you should
learn it and how you can learn it.

However, if you have knowledge of other programming languages and want to


quickly get started with Python, visit the Python tutorial page.

4.2.2 NumPy
NumPy is the fundamental package for scientific computing with Python. It
contains among other things:
 a powerful N-dimensional array object
 sophisticated (broadcasting) functions
 tools for integrating C/C++ and Fortran code
 useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be defined.
This allows NumPy to seamlessly and speedily integrate with a wide variety of
databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions.
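A tiny sketch of the N-dimensional array object and broadcasting mentioned above:

import numpy as np

a = np.arange(12).reshape(3, 4)          # a 3x4 N-dimensional array object
row_means = a.mean(axis=1)               # reduce along one axis

# Broadcasting: subtract each row's mean from that row without writing a loop.
centered = a - row_means[:, np.newaxis]

print(centered)
print(np.linalg.norm(centered))          # linear algebra helpers live in np.linalg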

Getting Started
To install NumPy, we strongly recommend using a scientific Python distribution.
See Installing the SciPy Stack for details.
Many high quality online tutorials, courses, and books are available to get
started with NumPy. For a quick introduction to NumPy we provide the NumPy
Tutorial. We also recommend the SciPy Lecture Notes for a broader introduction
to the scientific Python ecosystem.
For more information on the SciPy Stack (for which NumPy provides the
fundamental array data structure), see scipy.org.

Documentation
The most up-to-date NumPy documentation can be found at Latest
(development) version. It includes a user guide, full reference documentation, a
developer guide, meta information, and “NumPy Enhancement Proposals”
(which include the NumPy Roadmap and detailed plans for major new features).
A complete archive of documentation for all NumPy releases (minor versions;
bug fix releases don’t contain significant documentation changes) since 2009
can be found at https://numpy.org/doc/
NumPy Enhancement Proposals (NEPs) can be found at https://numpy.org/neps

Support NumPy
If you have found NumPy to be useful in your work, research or company, please
consider making a donation to the project commensurate with your resources.
Any amount helps! All donations will be used strictly to fund the development of
NumPy’s open source software, documentation and community.
NumPy is a Sponsored Project of NumFOCUS, a 501(c)(3) nonprofit charity in the
United States. NumFOCUS provides NumPy with fiscal, legal, and administrative
support to help ensure the health and sustainability of the project.
Visit numfocus.org for more information.
Donations to NumPy are managed by NumFOCUS. For donors in the United
States, your gift is tax-deductible to the extent provided by law. As with any
donation, you should consult with your tax adviser about your particular tax
situation.
NumPy’s Steering Council will make the decisions on how to best use any funds
received. Technical and infrastructure priorities are documented on the NumPy
Roadmap.

4.2.3 Pandas
pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming
language.
pandas is a NumFOCUS sponsored project. This will help ensure the success of
development of pandas as a world-class open-source project, and makes it
possible to donate to the project.

v0.25.1 Final (August 22, 2019)

This is a minor bug-fix release in the 0.25.x series and includes some regression
fixes and bug fixes. We recommend that all users upgrade to this version.
See the full whatsnew for a list of all the changes.
The release can be installed with conda from the defaults and conda-forge
channels:
conda install pandas
Or via PyPI:
python -m pip install --upgrade pandas
v0.25.0 Final (July 18, 2019)
This is a major release from 0.24.2 and includes a number of API changes, new
features, enhancements, and performance improvements along with a large
number of bug fixes.
Highlights include:
 Dropped Python 2 support

 Groupby aggregation with relabeling


 Better repr for MultiIndex
 Better truncated repr for Series and DataFrame
 Series.explode to split list-like values to rows
The release can be installed with conda from conda-forge or the default channel:
conda install pandas
Or via PyPI:
python3 -m pip install --upgrade pandas
See the full whatsnew for a list of all the changes.
Best way to Install
The best way to get pandas is via conda
conda install pandas
Packages are available for all supported python versions on Windows, Linux, and
MacOS.
Wheels are also uploaded to PyPI and can be installed with
pip install pandas
Quick vignette
What problem does pandas solve?
Python has long been great for data munging and preparation, but less so for data
analysis and modeling. pandas helps fill this gap, enabling you to carry out your
entire data analysis workflow in Python without having to switch to a more domain
specific language like R.
Combined with the excellent IPython toolkit and other libraries, the environment
for doing data analysis in Python excels in performance, productivity, and the
ability to collaborate.
pandas does not implement significant modeling functionality outside of linear and
panel regression; for this, look to statsmodels and scikit-learn. More work is still
needed to make Python a first class statistical modeling environment, but we are
well on our way toward that goal.
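A tiny sketch of the kind of data-analysis step pandas makes easy; the table is made up for illustration, and the named aggregation used here is the "groupby aggregation with relabeling" feature highlighted in the 0.25.0 release notes above:

import pandas as pd

# A small made-up table of daily sales records.
df = pd.DataFrame({
    "city":  ["Jaipur", "Noida", "Jaipur", "Hyderabad"],
    "sales": [120, 90, 150, 200],
})

# Group, aggregate and sort: a typical pandas analysis step.
summary = (df.groupby("city")["sales"]
             .agg(total="sum", average="mean")
             .sort_values("total", ascending=False))
print(summary)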

4.2.4 Data Visualisation
Data visualization is a general term that describes any effort to help people
understand the significance of data by placing it in a visual context. Patterns,
trends and correlations that might go undetected in text-based data can be
exposed and recognized more easily with data visualization software.

Today's data visualization tools go beyond the standard charts and graphs used
in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such
as infographics, dials and gauges, geographic maps, sparklines, heat maps, and
detailed bar, pie and fever charts. The images may include interactive
capabilities, enabling users to manipulate them or drill into the data for querying
and analysis. Indicators designed to alert users when data has been updated or
predefined conditions occur can also be included.

Importance of data visualization

Data visualization has become the de facto standard for modern business intelligence (BI).
The success of the two leading vendors in the BI space, Tableau and Qlik -- both of which
heavily emphasize visualization -- has moved other vendors toward a more visual
approach in their software. Virtually all BI software has strong data visualization
functionality.

Data visualization tools have been important in democratizing data and analytics and
making data-driven insights available to workers throughout an organization. They are
typically easier to operate than traditional statistical analysis software or earlier versions
of BI software. This has led to a rise in lines of business implementing data visualization
tools on their own, without support from IT.

Data visualization software also plays an important role in big data


and advanced analytics projects. As businesses accumulated massive troves of
data during the early years of the big data trend, they needed a way to quickly
and easily get an overview of their data. Visualization tools were a natural fit.

Visualization is central to advanced analytics for similar reasons. When a data


scientist is writing advanced predictive analytics or machine learning algorithms, it
becomes important to visualize the outputs to monitor results and ensure that models are
performing as intended. This is because visualizations of complex algorithms are
generally easier to interpret than numerical outputs.

Examples of data visualization

Data visualization tools can be used in a variety of ways. The most common use today is
as a BI reporting tool. Users can set up visualization tools to generate
automatic dashboards that track company performance across key performance
indicators and visually interpret the results.

Many business departments implement data visualization software to track their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through
rate and conversion rate.

As data visualization vendors extend the functionality of these tools, they are increasingly
being used as front ends for more sophisticated big data environments. In this setting, data
visualization software helps data engineers and scientists keep track of data sources and
do basic exploratory analysis of data sets prior to or after more detailed advanced analyses.
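As a small illustration of the dashboard-style charts described above, here is a matplotlib sketch with made-up email-campaign metrics:

import matplotlib.pyplot as plt

# Made-up email-campaign metrics of the kind a marketing dashboard would track.
metrics = {"Open rate": 42.0, "Click-through rate": 11.5, "Conversion rate": 3.2}

plt.bar(list(metrics.keys()), list(metrics.values()), color="steelblue")
plt.ylabel("Percent")
plt.title("Email campaign performance")
plt.show()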

4.2.5 Basic Stats & Regression Models


What is Regression?
• Regression is a statistical way to establish a relationship between a dependent variable
and a set of independent variable(s). e.g., if we say that Age = 5 + Height * 10 + Weight
* 13
• Here we are establishing a relationship between Height & Weight of a person with his/
her Age. This is a very basic example of Regression.
• Here Age is a dependent variable which depends upon height and weight.
• Height and Weight are independent variables, i.e., they do not depend upon any other
variable. In other words, we predict the value of the dependent variable using the
independent variables.

What is Linear Regression?


• “Linear Regression” is a statistical method to regress data with the dependent variable
having continuous values, whereas the independent variables can have either continuous or
categorical values.
• It is a method to predict dependent variable (Y) based on values of independent variables
(X).
• This technique can be used for the cases where we need to predict some continuous
quantity. E.g., Predicting traffic in a retail store, predicting rainfall in a region.

Multiple Linear Regression
• If we have more than one independent variable the procedure for fitting a best fit line is
known as “Multiple Linear Regression”
• Fundamentally there is no difference between ‘Simple’ and ‘Multiple’ linear regression.
Both work on the OLS principle, and the procedure to get the best line is also similar. In the
case of the latter, the regression equation takes a shape like: Y = B0 + B1X1 + B2X2 + B3X3 + ...
• Where Bi are the different coefficients and Xi the various independent variables.
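A small sketch, on synthetic data following the Age example above, of fitting a multiple linear regression with scikit-learn and recovering the coefficients B0, B1, B2:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data following Age = 5 + Height * 10 + Weight * 13 (plus a little noise).
height = rng.uniform(1.4, 2.0, size=200)
weight = rng.uniform(40, 90, size=200)
age = 5 + 10 * height + 13 * weight + rng.normal(0, 1, size=200)

X = np.column_stack([height, weight])       # independent variables X1, X2
model = LinearRegression().fit(X, age)      # ordinary least squares (OLS) fit

print(model.intercept_)   # close to 5         (B0)
print(model.coef_)        # close to [10, 13]  (B1, B2)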

4.2.6 Machine Learning & ML Project


What is machine learning?
It is as much about ‘Learning’ as it is about ‘Machines’.
• Getting computers to program themselves.
• For simplicity, some literature defines it as ‘Automation +’, meaning ‘automating the
automation’.
• Machine learning uses algorithms that learn from data, continuously improving the
prediction of future consumer behaviour, with increasing levels of forecast accuracy as
the volumes of data increase.
“Learning is a process by which a system improves its performance by experience.” - Herbert Simon.
Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that
• Improve their performance P
• At some task T
• With experience E

ML in a Nutshell: a hard nut to crack!


Machine learning teaches computers to do what comes naturally to humans and animals:
learn from experience. Machine learning algorithms use computational methods to
“learn” information directly from data without relying on a predetermined equation as a
model. The algorithms adaptively improve their performance as the number of samples
available for learning increases.

4.2.7 NLP & NLP Project


Natural language processing (NLP) is a subfield of linguistics, computer
science, information engineering, and artificial intelligence concerned with the
interactions between computers and human (natural) languages, in particular
how to program computers to process and analyze large amounts of natural
language data.
Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language generation.
Everything that we express, either verbally or in writing, carries huge amounts of
information. The topic that we choose, our selection of words, our tone: everything adds
some type of information that can be interpreted, and some value can be extracted from it.

Theoretically, we can understand and even predict human behavior using that information.
But there is one problem: a person may generate hundreds or thousands of words in a
declaration, each sentence with its corresponding complexity. If one wants to scale this up
and analyze several hundreds, thousands or millions of people or declarations in a given
geography, the situation becomes daunting and unmanageable. The data generated from
different conversations, declarations or even tweets is unstructured data. Unstructured data
can't be represented in the row-and-column structure of relational databases, but the irony is
that most of the world's data is unstructured. It is messy and hard to manipulate. According
to some estimates, 95% of the world's data is unstructured; such data is hard to use for
analysis and is regarded as dark data.

Areas of Use
Simply put, NLP can be used for the automatic handling of natural human language, like speech
or text. NLP can be used for recognizing and predicting diseases based on e-health
records and the patient's own speech. This capability has been explored in severe health
conditions that range from cardiovascular diseases to depression and even schizophrenia. It
enables organizations to easily determine what customers are saying about a service or
product by identifying and extracting information from sources like social media using
sentiment analysis. This analysis can provide a lot of information about customers'
choices and their decisions. Also, an inventor at IBM developed a cognitive assistant using
NLP that works like a personalized search engine: it learns all about you and then reminds
you of a name, a song, or anything you can't remember the moment you need it.
Companies like Yahoo and Google filter and classify your emails as SPAM or HAM (non-SPAM)
using NLP, thereby protecting our privacy and security from hackers. The NLP
Group at MIT developed a new system to determine fake sources by identifying whether a source
is accurate or politically biased, thereby detecting whether a news source can be trusted.
Amazon's Alexa and Google Home are examples of intelligent voice-driven interfaces
that extensively use NLP to respond to vocal prompts and do everything from finding a
particular shop, telling us the weather and suggesting the best route to a place, to controlling
the lights at home. NLP is also being used in talent identification, recruitment and automated
report or minutes-of-meeting generation. NLP is simply booming in the healthcare industry. This
technology is used in improving care delivery and disease diagnosis and in bringing down
costs. Question answering, as used by IBM Watson to answer a query, is another application.

5.1 Conclusion
This report has introduced Machine Learning and Natural Language
Processing. Now, I know that Machine Learning is a technique of training
machines to perform the activities a human brain can do, albeit a bit faster
and better than an average human being. Today we have seen that
machines can beat human champions in games such as chess and Go (AlphaGo),
which are considered very complex. We have seen that machines can be
trained to perform human activities in several areas and can aid humans
in living better lives.
Machine Learning can be supervised or unsupervised. If we have a smaller
amount of clearly labelled data for training, we opt for supervised
learning. Unsupervised learning would generally give better performance
and results for large data sets. If we have a huge data set easily available,
we go for deep learning techniques. I have also learned about Reinforcement
Learning and Deep Reinforcement Learning. I now know what Neural
Networks are, along with their applications and limitations.
Finally, when it comes to the development of machine learning models of
my own, I looked at the choices of various development languages, IDEs
and platforms. The next thing I need to do is start learning and
practicing each machine learning technique. The subject is vast; there is
width, but if you consider the depth, each topic can be learned
in a few hours. Each topic is independent of the others. I need to take
one topic at a time, learn it, practice it and implement the
algorithm(s) in it using a language of my choice. This is the best way to
start studying Machine Learning. Practicing one topic at a time, I would
very soon acquire the width that is eventually required of a Machine
Learning expert.

5.2 Future Scope


 Improved cognitive services
With the help of machine learning services like SDKs and APIs, developers are able to
include and hone intelligent capabilities in their applications. This will empower
machines to apply the various things they come across, and accordingly carry out an
array of duties like vision recognition, speech detection, and understanding of speech
and dialect. Alexa is already talking to us, and our phones are already listening to our
conversations— how else do you think the machine “wakes up” to run a google search
on 9/11 conspiracies for you? Those improved cognitive skills are something we could
not have ever imagined happening a decade ago, yet, here we are. Being able to engage
humans efficiently is under constant alteration to serve and understand the human
species better. We already spend so much time in front of screens that our mobiles have
become an extension of us, and through cognitive learning, it has literally become the
case. Your machine learns all about you, and then accordingly alters your results. No
two people's Google search results are the same. Why? Cognitive learning.
 The Rise of Quantum Computing
“Quantum computing”— sounds like something straight out of a science fiction movie,
no? But it has become a genuine phenomenon. Satya Nadella, the chief executive of
Microsoft Corp., calls it one of the three technologies that will reshape our
world. Quantum algorithms have the potential to transform and innovate the field of
machine learning. It could process data at a much faster pace and accelerate the ability to
draw insights and synthesize information.

Heavy-duty computation will finally be done in a jiffy, saving a great deal of time and
resources. The increased performance of machines will open so many doorways that will
elevate and take evolution to the next level. Something as basic as two numbers, 0 and 1,
changed the way of the world; imagine what could be achieved if we ventured into a
whole new realm of computers and physics.

 Rise of Robots
With machine learning on the rise, it is only natural that the medium gets a face on it—
robots! The sophistication of machine learning is not a ‘small wonder’ if you know what
I mean.

Multi-agent learning, robot vision, self-supervised learning all will be accomplished


through robotisation. Drones have already become a normality, and have now even
replaced human delivery men. With the rapid speed technology is moving forward, even
the sky is not the limit. Our childhood fantasies of living in an era of the Jetsons will
soon become reality. The smallest of tasks will be automated, and human beings will no
longer have to be self-reliant because you will have a bot following you like a shadow at
all times.

Career Opportunities in the field?


Now that you are aware of the reach of machine learning and how it can single-handedly
change the course of the world, how can you become a part of it?

Here are some job options that you can potentially think of opting –

1. Machine Learning Engineer – They are sophisticated programmers who develop the
systems and machines that learn and apply knowledge without having any specific lead
or direction.

2. Deep Learning Engineer – Similar to computer scientists, they specialise in using deep
learning platforms to develop tasks related to artificial intelligence. Their main goal is to
be able to mimic and emulate brain functions.
3. Data Scientist – Someone who extracts meaning from data and analyses and interprets
it. It requires both methods, statistics, and tools.
4. Computer Vision Engineer – They are software developers who
create vision algorithms for recognising patterns in images.
Machine learning already is changing, and will continue to change, the
course of the world in the coming decade. Let's eagerly prepare for what
the future holds. Let's hope that machines do not get the bright idea of
taking over the world, because not all of us are Arnold Schwarzenegger.
Fingers crossed!

References

www.edugrad.com

www.google.com

www.python.org

www.wikipedia.org

www.tutorialspoint.com

List of Figures

Figures 1-4: ML
Figures 5-8: NLP
