A Seminar Report On Machine Learning
SEMINAR REPORT
On
Machine Learning
Submitted By:
Amrit Kumar Sah (16EVJCS020)
Certificate Copy
Candidate’s Declaration
Place: Jaipur
Date:
ACKNOWLEDGEMENT
I take this opportunity to express my deepest gratitude to those who have generously helped me by providing valuable knowledge and expertise during my training.
It is a great pleasure to present this report on the project named "Machine Learning", undertaken by me as part of my B.Tech (CSE) curriculum. I am thankful to Vivekananda Institute of Technology for offering me such a wonderful and challenging opportunity.
It is a pleasure to pen down these lines to express my sincere thanks to the people who helped me along the way in completing this project. I find words inadequate to express my sincere gratitude towards them.
I express my sincere gratitude to Prof. (Dr.) N.K Agarwal (Principal, VIT) for providing me an opportunity to undergo this project as part of the curriculum. I am thankful to Miss Kirti Gupta for her support, co-operation and motivation during the training, and for her constant inspiration, presence and blessings. I would also like to thank my H.O.D Mr. Tushar Vyas for his valuable suggestions, which helped a lot in the completion of this project.
Lastly, I would like to thank the almighty and my parents for their moral support, and my friends with whom I shared my day-to-day experiences and received lots of suggestions that improved my quality of work.
ABSTRACT
In this project, we were asked to experiment with a real-world dataset and to explore how machine learning algorithms can be used to find patterns in data. We were expected to gain experience using a common data-mining and machine learning library, Weka, and to submit a report about the dataset and the algorithms used. Having performed the required tasks on a dataset of my choice, I present my final report herein.
Keywords: Machine Learning, Pattern Recognition, Classification, Supervised learning, Artificial Intelligence.
TABLE OF CONTENTS
Candidate's Declaration
Acknowledgement
Abstract
Table of Contents
List of Figures
Chapter 1: Introduction
1.1 Objectives
1.1.1 Supervised Learning
1.1.2 Unsupervised Learning
1.1.3 Decision Time
1.2 Motivation
1.3 Internship Goals
1.4 Report Layout
Chapter 2: Internship Enterprise
Chapter 3: Internship Roles and Responsibilities
Chapter 4: Internship Outcomes
Chapter 5: Discussion and Conclusion
References
LIST OF FIGURES
FIGURES
Figure 1: ML
Figure 2: ML
Figure 3: ML
Figure 4: ML
Figure 5: NLP
Figure 6: NLP
Figure 7: NLP
Figure 8: NLP
Chapter 1
Introduction
What is Machine Learning? A Definition
Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide. The primary aim is to allow
computers to learn automatically, without human intervention or assistance, and to
adjust actions accordingly.
Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis of
a known training dataset, the learning algorithm produces an inferred function to make
predictions about the output values. The system is able to provide targets for any new
input after sufficient training. The learning algorithm can also compare its output with
the correct, intended output and find errors in order to modify the model accordingly.
In contrast, unsupervised machine learning algorithms are used when the information
used to train is neither classified nor labeled. Unsupervised learning studies how systems
can infer a function to describe a hidden structure from unlabeled data. The system
doesn’t figure out the right output, but it explores the data and can draw inferences from
datasets to describe hidden structures from unlabeled data.
Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training –
typically a small amount of labeled data and a large amount of unlabeled data. The
systems that use this method are able to considerably improve learning accuracy.
Usually, semi-supervised learning is chosen when the acquired labeled data requires
skilled and relevant resources in order to train from it or learn from it, whereas
acquiring unlabeled data generally doesn't require additional resources.
Reinforcement learning is a learning method that interacts with its environment by
producing actions and discovering errors or rewards. Trial-and-error search and
delayed reward are the most relevant characteristics of reinforcement learning. This
method allows machines and software agents to automatically determine the ideal
behaviour within a specific context in order to maximize their performance.
Simple reward feedback is required for the agent to learn which action is best; this is
known as the reinforcement signal.
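As an illustration of this reward-driven learning, here is a minimal sketch of tabular Q-learning on a toy environment; the environment, states and rewards are invented purely for illustration and are not part of the report:

import random

# Hypothetical toy environment (not from the report): 5 states in a row,
# reward only when the agent reaches the rightmost state.
N_STATES, ACTIONS = 5, [0, 1]          # action 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: value estimate per (state, action)

for episode in range(300):
    state = 0
    while state < N_STATES - 1:
        # Explore sometimes (or when nothing is known yet), otherwise exploit.
        if random.random() < epsilon or Q[state][0] == Q[state][1]:
            action = random.choice(ACTIONS)
        else:
            action = int(Q[state][1] > Q[state][0])
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0   # the reinforcement signal
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # after training, action 1 (move right) scores higher in every state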
Machine learning enables analysis of massive quantities of data. While it generally
delivers faster, more accurate results in order to identify profitable opportunities or
dangerous risks, it may also require additional time and resources to train it properly.
Combining machine learning with AI and cognitive technologies can make it even more
effective in processing large volumes of information.
1.1 Objectives
The purpose of machine learning is to discover patterns in your data and then make
predictions based on often complex patterns to answer business questions, detect and
analyse trends and help solve problems.
Machine learning is effectively a method of data analysis that works by automating the
process of building data models.
The parallel task in human and animal psychology is often referred to as concept learning.
1.1.1 Supervised Learning
The aim of supervised machine learning is to build a model that makes predictions based on
evidence in the presence of uncertainty. Supervised learning uses classification and regression
techniques to develop predictive models.
• Classification techniques predict discrete responses
• Regression techniques predict continuous responses
Using Supervised Learning to Predict Heart Attack.
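To make this concrete, here is a minimal sketch of such a supervised classification workflow using scikit-learn; the synthetic data below stands in for real patient records and is not taken from the report:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in for a labeled medical dataset: 500 "patients", 8 features,
# and a 0/1 label (heart attack risk). Real data would come from patient records.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)

# Hold out part of the data so the model is judged on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)   # a simple classification technique
model.fit(X_train, y_train)                 # learn from labeled examples

predictions = model.predict(X_test)         # predict discrete responses for new cases
print("Accuracy on unseen data:", accuracy_score(y_test, predictions))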
1.1.2 Unsupervised Learning
Unsupervised learning is a type of machine learning that helps find previously
unknown patterns in a data set without pre-existing labels. It is also known as
self-organization and allows modeling probability densities of given inputs.[1]
It is one of the three main categories of machine learning, along with supervised
and reinforcement learning. Semi-supervised learning has also been described, and
is a hybridization of supervised and unsupervised techniques.
Two of the main methods used in unsupervised learning are principal component
analysis and cluster analysis. Cluster analysis is used in unsupervised learning to
group, or segment, datasets with shared attributes in order to extrapolate
algorithmic relationships.[2] Cluster analysis is a branch of machine learning that
groups data that has not been labelled, classified or categorized. Instead of
responding to feedback, cluster analysis identifies commonalities in the data and
reacts based on the presence or absence of such commonalities in each new piece of
data. This approach helps detect anomalous data points that do not fit into either
group.
A central application of unsupervised learning is in the field of density
estimation in statistics,[3] though unsupervised learning encompasses many other
domains involving summarizing and explaining data features.
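A minimal sketch of these two methods using scikit-learn, on synthetic data invented purely for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: two blobs in 4-dimensional space (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)),
               rng.normal(5.0, 1.0, size=(100, 4))])

# Principal component analysis: compress the data to 2 dimensions.
X_2d = PCA(n_components=2).fit_transform(X)

# Cluster analysis: group the unlabeled points by shared structure.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_2d)
print(kmeans.labels_[:10])      # cluster assignment for the first few points
print(kmeans.cluster_centers_)  # the two discovered group centres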
1.2 Motivation
To me it was motivating to learn because finally I could see how all the math I had
studied at university is applied in real life, and it's not only interesting, but
also very useful.
Also, just the thought that given the data you can extract something useful from it
is already very motivating. For example, if you measure your weight every day,
then, once you accumulate enough data, you can extract some helpful insights from
it that otherwise you wouldn't be able to learn.
Another motivation could be money. Data science is quite a hot topic
nowadays and data scientists are paid quite well - companies have
tons and tons of data and they need people who know how to mine
something useful from this data. And there are more and more "data
driven" companies that need people who can mine insight from the
raw floods of information.
1.3 Internship Goal
1. Gain more professional connections.
We can't stress this enough: the people you meet at your internship are important. Exchange
contact info, connect on LinkedIn, and make sure you're not saying goodbye for good when
you walk out the door on your last day! These co-workers are your future references, mentors,
and friends who can alert you to any new job opportunities. You wouldn't want to miss out on
that, would you?
1.4 Report Layout
This report is organized into five chapters, followed by references. For each report that I
create, I can assign a report layout. Via the report layout I define the layout features of
the report, such as the page format and the numeric format of the report data. When I use the
layout functions, I can set parameters for the report, and when I define a report layout, I
define a layout specifically for that report and can change the standard layout assigned to it.
Chapter 1 introduces machine learning and states the objectives of the work, such as saving
time and resources, reducing the workload and making information easy to search and record.
It also covers the motivation for learning machine learning, the goals of the internship and
this report layout, which gives a short overview of the full report chapters.
Chapter 2 describes the internship enterprise: the company, the IT services offered, roles in
the job market and responsibilities.
Chapter 3 contains the internship roles and responsibilities, such as the training attended,
the assigned responsibilities, the work environment and the tasks performed, with figures
where relevant.
Chapter 4 describes the internship outcomes: problems and solutions encountered, learning
outcomes and challenges.
In Chapter 5 I present the discussion and conclusion of the internship.
Chapter 2
Internship Enterprise
Journey
EduGrad was born in May 2018 in a bid to offer niche technologies through a learn-by-doing
approach. In August 2018 our first course, in Data Analytics, was rolled out. By the end of
December our batches were running across 20 colleges spread across NCR, Hyderabad and Delhi.
Vision
Our vision is to empower students with problem-solving skills.
Mission
Our mission is to prepare our learners to face real-world challenges by giving them a
world-class learning experience, expert mentor guidance and a learn-by-doing approach.
Business Address: H-85, 3rd Floor, Sector 63, Noida, Uttar Pradesh 201301. CIN:
U72900TG2018PTC124426
www.edugrad.com
2.4 Roles in Job Market
The top respondents for the job title Data Analyst are from companies such as
Accenture, EduGrad, Tata Consultancy Services Limited and EY (Ernst & Young).
Reported salaries are highest at HSBC, where the average pay is Rs 687,845.
Other companies that offer high salaries for this role include Accenture and Tata
Consultancy Services Limited, at around Rs 484,711 and Rs 464,577,
respectively. eClerx pays the lowest at around Rs 204,419. Genpact and EY
(Ernst & Young) also pay on the lower end of the scale, paying Rs 350,000 and
Rs 423,728, respectively.
Chapter 3
INTERNSHIP ROLES AND RESPONSIBILITIES
3.1 Training Attended
To build a project, more than one component is needed; we can combine Python, OOP concepts
and Python libraries. Anaconda and the Jupyter notebook combine the results of the interpreted
and executed Python code, which may be any type of data, including images, with the generated
analytical page. Python code may also be executed via a command-line interface (CLI) and can
be used to implement standalone graphical applications. The trainings I attended at EduGrad
are given below:
1. Python Programming
2. NumPy
3. Pandas
4. Data Visualisation
5. Basic Stats & Regression Models
6. ML Overview & ML Project
7. NLP Overview & NLP Project Completed
Instructions:
• Developing and implementing data analyses, data collection systems and other
strategies that optimize statistical efficiency and quality
• Acquiring data from primary or secondary data sources and maintaining databases
3.6 Responsibilities
• Interpret data, analyze results using statistical techniques and provide ongoing reports
• Develop and implement databases, data collection systems, data analytics and other
strategies that optimize statistical efficiency and quality
• Acquire data from primary or secondary data sources and maintain databases/data systems
• Identify, analyze, and interpret trends or patterns in complex data sets
• Filter and “clean” data by reviewing computer reports, printouts, and performance
indicators to locate and correct code problems
• Work with management to prioritize business and information needs
• Locate and define new process improvement opportunities
We standardize tokens by removing noise terms and stop-words. We used language-dependent
stop-word lists for this purpose. Third, we represent each survey or question as a document
in a sparse bag-of-words format, after building a vocabulary of corpus words (separately for
each language we used). Finally, we use the documents as input to the topic learning model
which, in turn, learns clusters from the term co-occurrence frequencies of the corresponding
documents. See Fig. 3 for more details.

B. Topic Learning
As discussed earlier, topic models have the ability to learn semantic relationships of words
from an observed text collection. In this system, topic modeling is used for three main purposes: i)
categorizing and ranking surveys, ii) survey sub-categorization and ranking, and iii)
clustering of survey questions under an identified survey sub-cluster. Survey ranking is
performed to identify relevant surveys that belong to general (top-level) topics such as
market research, education, and sports. To perform ranking, we first compute the topic
mixtures of the survey documents, which are formed by combining survey questions. To
estimate the topical structure from the survey documents, we use HDP [3], which can learn
the number of topics automatically (this is one of our primary goals) along with the topic
model from large document collections. A detailed theoretical review of HDP and its
inference methods is presented by Teh et al [3]. We use a modified version of the HDP
implementation by Wang and Blei [9] in our experiments. The major components of a
learned HDP model are the corpus-level topic word association counts and document-
level topic mixtures. Each topic in the estimated model is represented by its topic-word-
probabilities. These words are used by language experts to name survey categories. The
document level topic mixtures give an idea of the topicality of a particular survey to a
given topic. This is also quite useful in finding similar surveys and grouping them together.
From the observations of the top-level survey categorization explained above, we found
that some of the topics found by the HDP estimation process can be further divided into
subtopics and the corresponding surveys can be ranked by subtopic relevance. For
modeling survey subtopics, we use the original LDA model [2] because it is more accurate
and less computationally expensive than HDP. We use the Gensim package’s [10] online
variational inference implementation for the model estimation process. Conventional
topic modeling algorithms are designed to work on larger documents compared to survey-
questions (section II). The chance of a term re-occurrence in the same question is quite
low compared to typical documents used in the topic modeling literature. So, to cluster
questions to build question banks, we represent questions in a much simpler format such
as TF-IDF and perform LSI, which helps to represent the questions in the smaller LSI
space rather than the vocabulary space.

C. Survey Relevance Ranking
We use survey relevance ranking to group together surveys belonging to an estimated topic
(Fig. 1). We use individual surveys' estimated document-topic mixtures, θ̂_d, to rank them on
relevance given a topic or set of topics. For a given topic set T (a subset of the K estimated
topics), we calculate

m(d) = Σ_{k ∈ T} ln θ̂_{d,k} + Σ_{j ∉ T} ln(1 − θ̂_{d,j})        (1)

for all surveys d = 1, 2, ..., D in the corpus and sort them to rank their relevance. Here, we
assume that the document-topic mixtures θ̂_d satisfy the multinomial property
Σ_{j=1}^{K} θ̂_{d,j} = 1. Intuitively, we can see that this equation maximizes the score of a
topic set T given a document. A document with a high value of this score is a highly relevant
document for that topic set.
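As an illustration only, a short sketch of how equation (1) could be computed and used for ranking, with made-up document-topic mixtures:

import numpy as np

def relevance_score(theta_d, topic_set):
    """Equation (1): log-probabilities of topics in T plus log(1 - probability)
    for topics outside T."""
    in_T = np.zeros(len(theta_d), dtype=bool)
    in_T[list(topic_set)] = True
    return np.log(theta_d[in_T]).sum() + np.log(1.0 - theta_d[~in_T]).sum()

# Illustrative document-topic mixtures for 3 documents over K = 4 topics (rows sum to 1).
theta = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.05, 0.05, 0.45, 0.45],
                  [0.25, 0.25, 0.25, 0.25]])

T = {0}  # rank relevance with respect to topic 0
scores = [relevance_score(theta[d], T) for d in range(theta.shape[0])]
ranking = np.argsort(scores)[::-1]        # most relevant document first
print(scores, ranking)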
D. Question Clustering and Ranking
One of the goals of this project is to design a system that can recommend useful, relevant
survey questions, given a selected survey topic (e.g., education), for building question
banks. Once we have the surveys that belong to a given topic, we group similar survey
questions into question groups and rank them within each group based on several ranking
scores. We first apply fuzzy C-means (FCM) clustering [4], [11] to the set of survey
questions represented in LSI space (section III-B). Second, we rank the questions that
belong to a given cluster based on measures such as string matching, fuzzy set matching
[12], and distance from the cluster centroid. Finally, we remove duplicate questions and
present the ranked questions to survey designers.
TOPIC MODELLING
Analytics is all about obtaining useful information from data. With the growing
amount of data in recent years, which is mostly unstructured, it’s difficult to obtain the
relevant and desired information. But, with the help of technology, powerful methods
can be deployed to mine through the data and fetch the information that we are looking
for.
One such technique in the field of text mining/data mining is Topic Modelling. As the
name suggests, it is a process to automatically identify topics present in any text object
and to derive hidden patterns exhibited by the text corpus. This assists in better
decision making.
Topic models are very useful for document clustering, organizing large blocks of textual
data, information retrieval from unstructured text and feature selection. While searching
online for any news article on the web, passing only a few topics is enough for the entire
news article to be displayed.
Therefore, each news article or document can be divided into several topics, through which
that entire document can be recreated.
This project deals with extracting topics for a couple of news articles and also extracts
details such as person name, location and organization for each story.
The cities_r2.csv file helps in finding the cities for each startup.
The attributes of Startup_data.xlsx are:
The data is collected from different sources and stored into CSV files.
After applying natural language processing to each text, take the top 5 stories and visualize
the most frequent words in those 5 stories using the WordCloud library in Python.
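A minimal sketch of that visualization step, assuming the wordcloud and matplotlib libraries; the five sample stories are invented stand-ins for the project data:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Stand-in for the top 5 stories; in the project these would come from the news data.
top5_stories = [
    "The startup raised a new funding round to expand its data science team.",
    "Researchers announced a breakthrough in renewable energy science.",
    "The women-led venture launched a learning platform for rural schools.",
    "Investors poured fresh funds into gaming and e-sports startups.",
    "A science fair in Hyderabad showcased student robotics projects.",
]

# Build one word cloud over the combined text of the 5 stories.
text = " ".join(top5_stories)
cloud = WordCloud(stopwords=STOPWORDS, background_color="white",
                  width=800, height=400).generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()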
Apply the LDA topic modeling algorithm to each and every news article, extract 10 topics
for each article and store them into a new column corresponding to each news item. The
target news categories are:
1. Games
2. Startup
3. Fund
4. Science
5. Women
Make use of the topics extracted in the above step and apply regular expressions over them
to categorize each news article into one of the above categories.
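A minimal sketch of the LDA extraction and regex categorization steps, assuming the gensim library; the sample articles, the dominant-topic simplification and the regex patterns are illustrative assumptions, not taken from the project files:

import re
from gensim import corpora, models
from gensim.utils import simple_preprocess

# Stand-in articles; in the project these would be the scraped news stories.
articles = [
    "The gaming startup raised seed funding to build mobile e-sports titles.",
    "A women in science initiative announced research grants for young scientists.",
    "Venture funds are investing heavily in early-stage fintech startups.",
]

tokens = [simple_preprocess(text) for text in articles]
dictionary = corpora.Dictionary(tokens)                      # vocabulary
corpus = [dictionary.doc2bow(doc) for doc in tokens]         # bag-of-words per article

# LDA topic model; num_topics=10 mirrors the "extract 10 topics" step.
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10, random_state=0)

def dominant_topic_words(bow, topn=10):
    """Top words of the article's most probable topic (a simplification)."""
    topic_id = max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
    return [word for word, _ in lda.show_topic(topic_id, topn=topn)]

topic_words = [dominant_topic_words(bow) for bow in corpus]

# Regex-based categorization over the extracted topic words.
patterns = {"Games": r"gam|sport", "Startup": r"startup|founder", "Fund": r"fund|invest",
            "Science": r"science|research", "Women": r"women"}

def categorize(words):
    joined = " ".join(words)
    for label, pattern in patterns.items():
        if re.search(pattern, joined):
            return label
    return "Other"

for article, words in zip(articles, topic_words):
    print(categorize(words), "-", article)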
Use NLP to find the person name, location and organization name for each news article.
(Hint: for person and organization names make use of Named Entity Recognition (NER),
whereas for location make use of the csv file cities_r2.csv.)
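One possible sketch of this step using spaCy for NER; the model name, the sample sentence and the stand-in city list are assumptions for illustration (in the project the city list would come from cities_r2.csv):

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Stand-in city list; in the project this would be loaded from cities_r2.csv.
cities = {"bengaluru", "hyderabad", "noida", "jaipur", "delhi"}

def extract_entities(text):
    doc = nlp(text)
    persons = {ent.text for ent in doc.ents if ent.label_ == "PERSON"}
    orgs = {ent.text for ent in doc.ents if ent.label_ == "ORG"}
    # Locations are matched against the city list rather than relying on NER labels.
    locations = {token.text for token in doc if token.text.lower() in cities}
    return persons, orgs, locations

print(extract_entities(
    "The Bengaluru based startup Flipkart was co-founded by Sachin Bansal."))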
Chapter 4
Internship Outcomes
Below are 10 examples of machine learning that really ground what machine
learning is all about.
Spam Detection: Given email in an inbox, identify those email messages that are
spam and those that are not. Having a model of this problem would allow a program
to leave non-spam emails in the inbox and move spam emails to a spam folder. We
should all be familiar with this example.
Credit Card Fraud Detection: Given credit card transactions for a customer in a
month, identify those transactions that were made by the customer and those that
were not. A program with a model of this decision could refund those transactions
that were fraudulent.
Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for
each handwritten character. A model of this problem would allow a computer
program to read and understand handwritten zip codes and sort envelopes by
geographic region.
Speech Understanding: Given an utterance from a user, identify the specific
request made by the user. A model of this problem would allow a program to
understand and make an attempt to fulfil that request. The iPhone with Siri has this
capability.
Face Detection: Given a digital photo album of many hundreds of digital
photographs, identify those photos that include a given person. A model of this
decision process would allow a program to organize photos by person. Some
cameras and software, like iPhoto, have this capability.
Product Recommendation: Given a purchase history for a customer and a
large inventory of products, identify those products in which that customer will
be interested and likely to purchase. A model of this decision process would
allow a program to make recommendations to a customer and motivate
product purchases. Amazon has this capability. Also think of Facebook,
GooglePlus and LinkedIn that recommend users to connect with you after you
sign-up.
Medical Diagnosis: Given the symptoms exhibited in a patient and a
database of anonymized patient records, predict whether the patient is likely
to have an illness. A model of this decision problem could be used by a
program to provide decision support to medical professionals.
Stock Trading: Given the current and past price movements for a stock,
determine whether the stock should be bought, held or sold. A model of this
decision problem could provide decision support to financial analysts.
Customer Segmentation: Given the pattern of behaviour by a user during a
trial period and the past behaviours of all users, identify those users that will
convert to the paid version of the product and those that will not. A model of
this decision problem would allow a program to trigger customer interventions
to persuade the customer to convert early or better engage in the trial.
Shape Detection: Given a user hand drawing a shape on a touch screen and
a database of known shapes, determine which shape the user was trying to
draw. A model of this decision would allow a program to show the platonic
version of that shape the user drew to make crisp diagrams.
The Instaviz iPhone app does this.
These 10 examples give a good sense of what a machine learning problem looks
like. There is a corpus of historic examples, there is a decision that needs to be
modelled and a business or domain benefit to having that decision modelled and
efficaciously made automatically.
Some of these problems are some of the hardest problems in Artificial Intelligence,
such as Natural Language Processing and Machine Vision (doing things that humans
do easily). Others are still difficult, but are classic examples of machine learning such
as spam detection and credit card fraud detection.
Think about some of your interactions with online and offline software in the last week.
I’m sure you could easily guess at another ten or twenty examples of machine
learning you have directly or indirectly used.
4.2 Learning Outcomes
4.2.1 Python
Python has a simple, easy-to-use syntax, making it the perfect language for someone
trying to learn computer programming for the first time.
This is a comprehensive guide on how to get started in Python, why you should
learn it and how you can learn it.
4.2.2 NumPy
NumPy is the fundamental package for scientific computing with Python. It
contains among other things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be defined.
This allows NumPy to seamlessly and speedily integrate with a wide variety of
databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions.
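A brief, illustrative sketch of the N-dimensional array object and the broadcasting behaviour mentioned above (the numbers are arbitrary):

import numpy as np

# A 2-D (N-dimensional) array object.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the (2, 1) column of row means is stretched across all columns.
row_means = a.mean(axis=1, keepdims=True)
centered = a - row_means                 # subtract each row's mean

# A taste of the linear algebra and random number capabilities.
print(centered)
print(np.linalg.norm(centered))          # Frobenius norm
print(np.random.default_rng(0).normal(size=3))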
Getting Started
To install NumPy, we strongly recommend using a scientific Python distribution.
See Installing the SciPy Stack for details.
Many high quality online tutorials, courses, and books are available to get
started with NumPy. For a quick introduction to NumPy we provide the NumPy
Tutorial. We also recommend the SciPy Lecture Notes for a broader introduction
to the scientific Python ecosystem.
For more information on the SciPy Stack (for which NumPy provides the
fundamental array data structure), see scipy.org.
Documentation
The most up-to-date NumPy documentation can be found at Latest
(development) version. It includes a user guide, full reference documentation, a
developer guide, meta information, and “NumPy Enhancement Proposals”
(which include the NumPy Roadmap and detailed plans for major new features).
A complete archive of documentation for all NumPy releases (minor versions;
bug fix releases don’t contain significant documentation changes) since 2009
can be found at https://numpy.org/doc/
NumPy Enhancement Proposals (NEPs) can be found at https://numpy.org/neps
Support NumPy
If you have found NumPy to be useful in your work, research or company, please
consider making a donation to the project commensurate with your resources.
Any amount helps! All donations will be used strictly to fund the development of
NumPy’s open source software, documentation and community.
NumPy is a Sponsored Project of NumFOCUS, a 501(c)(3) nonprofit charity in the
United States. NumFOCUS provides NumPy with fiscal, legal, and administrative
support to help ensure the health and sustainability of the project.
Visit numfocus.org for more information.
Donations to NumPy are managed by NumFOCUS. For donors in the United
States, your gift is tax-deductible to the extent provided by law. As with any
donation, you should consult with your tax adviser about your particular tax
situation.
NumPy’s Steering Council will make the decisions on how to best use any funds
received. Technical and infrastructure priorities are documented on the NumPy
Roadmap.
4.2.3 Pandas
pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming
language.
pandas is a NumFOCUS sponsored project. This will help ensure the success of
development of pandas as a world-class open-source project, and makes it
possible to donate to the project.
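A small, self-contained illustration of the data structures and analysis tools pandas provides (the values are invented for the example):

import pandas as pd

# A DataFrame: a labeled, table-like data structure.
df = pd.DataFrame({
    "city": ["Noida", "Jaipur", "Hyderabad"],
    "interns": [12, 7, 20],
    "stipend": [10000, 8000, 12000],
})

# Easy-to-use analysis tools: filtering, aggregation, derived columns.
print(df[df["interns"] > 10])            # filter rows
print(df["stipend"].mean())              # aggregate a column
df["stipend_per_intern"] = df["stipend"] / df["interns"]
print(df)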
This is a minor bug-fix release in the 0.25.x series and includes some regression
fixes and bug fixes. We recommend that all users upgrade to this version.
See the full whatsnew for a list of all the changes.
The release can be installed with conda from the defaults and conda-forge
channels:
conda install pandas
Or via PyPI:
python -m pip install --upgrade pandas
v0.25.0 Final (July 18, 2019)
This is a major release from 0.24.2 and includes a number of API changes, new
features, enhancements, and performance improvements along with a large
number of bug fixes.
Highlights include:
Dropped Python 2 support
4.2.4 Data Visualisation
Data visualization is a general term that describes any effort to help people
understand the significance of data by placing it in a visual context. Patterns,
trends and correlations that might go undetected in text-based data can be
exposed and recognized more easily with data visualization software.
Today's data visualization tools go beyond the standard charts and graphs used
in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such
as infographics, dials and gauges, geographic maps, sparklines, heat maps, and
detailed bar, pie and fever charts. The images may include interactive
capabilities, enabling users to manipulate them or drill into the data for querying
and analysis. Indicators designed to alert users when data has been updated or
predefined conditions occur can also be included.
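As a small illustration of this kind of chart-based exploration, here is a sketch using matplotlib with made-up monthly figures (matplotlib is our choice here; the tools named below are commercial BI products):

import matplotlib.pyplot as plt

# Made-up monthly sign-up numbers, purely to illustrate a simple visualization.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 160, 150, 190, 240]

plt.plot(months, signups, marker="o")
plt.title("Monthly sign-ups")       # the trend is easier to see here than in a table
plt.xlabel("Month")
plt.ylabel("Sign-ups")
plt.grid(True)
plt.show()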
Data visualization has become the de facto standard for modern business intelligence (BI).
The success of the two leading vendors in the BI space, Tableau and Qlik -- both of which
heavily emphasize visualization -- has moved other vendors toward a more visual
approach in their software. Virtually all BI software has strong data visualization
functionality.
Data visualization tools have been important in democratizing data and analytics and
making data-driven insights available to workers throughout an organization. They are
typically easier to operate than traditional statistical analysis software or earlier versions
of BI software. This has led to a rise in lines of business implementing data visualization
tools on their own, without support from IT.
performing as intended. This is because visualizations of complex algorithms are
generally easier to interpret than numerical outputs.
Data visualization tools can be used in a variety of ways. The most common use today is
as a BI reporting tool. Users can set up visualization tools to generate
automatic dashboards that track company performance across key performance
indicators and visually interpret the results.
Many business departments implement data visualization software to track their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through
rate and conversion rate.
As data visualization vendors extend the functionality of these tools, they are increasingly
being used as front ends for more sophisticated big data environments. In this setting, data
visualization software helps data engineers and scientists keep track of data sources and
do basic exploratory analysis of data sets prior to or after more detailed advanced analyses.
Multiple Linear Regression
• If we have more than one independent variable the procedure for fitting a best fit line is
known as “Multiple Linear Regression”
• Fundamentally, there is no difference between 'simple' and 'multiple' linear regression.
Both work on the OLS principle, and the procedure to get the best line is also similar. In
the case of the latter, the regression equation takes a shape like: Y = B0 + B1X1 + B2X2 + B3X3 + ...
• Where Bi are the different coefficients and Xi are the various independent variables.
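A short sketch of fitting such a model by ordinary least squares with scikit-learn; the data is synthetic, generated so that the true coefficients are known:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Y = 2 + 1.5*X1 - 3.0*X2 + 0.5*X3 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                       # three independent variables X1..X3
y = 2 + X @ np.array([1.5, -3.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)                # OLS fit of the best hyperplane
print("B0 (intercept):", model.intercept_)          # should be close to 2
print("B1..B3 (coefficients):", model.coef_)        # close to [1.5, -3.0, 0.5]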
Natural Language Processing (NLP)
Theoretically, we can understand and even predict human behavior using that information.
But there is one problem: a person may generate hundreds or thousands of words in a
declaration, each sentence with its corresponding complexity. If one wants to scale up and
analyze several hundreds, thousands or millions of people or declarations in a given
geography, then the situation becomes daunting and unmanageable. The data that is
generated from different conversations, declarations or even tweets is a type of
unstructured data. Unstructured data can't be represented in the row-and-column structure
of relational databases, but the irony is that most of the world's data is unstructured.
It is messy and hard to manipulate. According to some estimates, 95% of the world's data
is unstructured; it can't readily be used for analysis and is regarded as dark data.
Areas of Use
Simply put, NLP can be used for automatic handling of natural human language, like speech
or text. NLP can be used for recognition and prediction of diseases based on e-health
records and a patient's own speech. This capability has been explored in severe health
conditions ranging from cardiovascular diseases to depression and even schizophrenia. It
enables organizations to easily determine what customers are saying about a service or
product by identifying and extracting information from sources like social media using
sentiment analysis. This analysis can provide a lot of information about customers'
choices and their decisions. Also, an inventor at IBM developed a cognitive assistant
using NLP that works like a personalized search engine. It learns all about you and then
reminds you of a name, a song, or anything that you can't remember the moment you need it.
Companies like Yahoo and Google filter and classify your emails as SPAM or HAM (non-SPAM)
using NLP, thereby protecting our privacy and security from hackers. The NLP Group at MIT
developed a new system to detect fake news sources by identifying whether a source is
accurate or politically biased, thereby detecting if a news source can be trusted or not.
Amazon's Alexa and Google Home are examples of intelligent voice-driven interfaces that
extensively use NLP to respond to vocal prompts and do everything from finding a particular
shop, telling us the weather and suggesting the best route to a place, to controlling the
lights at home. NLP is also being used in talent identification, recruitment and automated
report generation or minutes of meetings. NLP is booming in the healthcare industry, where
this technology is used in improving care delivery, disease diagnosis and bringing down
costs. Question answering, as used by IBM Watson to answer a query, is another application.
Chapter 5
Discussion and Conclusion
5.1 Conclusion
This report has introduced Machine Learning and Natural Language Processing. I now
know that Machine Learning is a technique of training machines to perform the
activities a human brain can do, albeit a bit faster and better than an average
human being. Today we have seen that machines can beat human champions in games
such as Chess and Go (as AlphaGo did), which are considered very complex. We have
also seen that machines can be trained to perform human activities in several areas
and can aid humans in living better lives.
Machine learning can be supervised or unsupervised. If we have a smaller amount of
clearly labelled data for training, we opt for supervised learning. Unsupervised
learning would generally give better performance and results for large data sets.
If we have a huge data set easily available, we go for deep learning techniques. I
have also learned about Reinforcement Learning and Deep Reinforcement Learning. I
now know what Neural Networks are, along with their applications and limitations.
Finally, when it comes to the development of machine learning models of our own, I
looked at the choices of various development languages, IDEs and platforms. The
next thing to do is to start learning and practicing each machine learning
technique. The subject is vast, meaning that there is breadth, but if you consider
the depth, each topic can be learned in a few hours. Each topic is independent of
the others. I need to take one topic at a time, learn it, practice it and implement
the algorithm(s) in it using a language of my choice. This is the best way to start
studying Machine Learning. Practicing one topic at a time, very soon I would
acquire the breadth that is eventually required of a Machine Learning expert.
species better. We already spend so much time in front of screens that our mobiles have
become an extension of us, and through cognitive learning, this has literally become the
case. Your machine learns all about you, and then accordingly alters your results. No
two people's Google search results are the same: why? Cognitive learning.
The Rise of Quantum Computing
“Quantum computing”— sounds like something straight out of a science fiction movie,
no? But it has become a genuine phenomenon. Satya Nadella, the chief executive of
Microsoft Corp., calls it one of the three technologies that will reshape our
world. Quantum algorithms have the potential to transform and innovate the field of
machine learning. It could process data at a much faster pace and accelerate the ability to
draw insights and synthesize information.
Heavy-duty computation will finally be done in a jiffy, saving so much time and
resources. The increased performance of machines will open so many doorways that will
elevate and take evolution to the next level. Something as basic as two numbers, 0 and 1,
changed the way of the world; imagine what could be achieved if we ventured into a
whole new realm of computers and physics.
Rise of Robots
With machine learning on the rise, it is only natural that the medium gets a face on it—
robots! The sophistication of machine learning is not a ‘small wonder’ if you know what
I mean.
Here are some job options that you can potentially think of opting for:
1. Machine Learning Engineer – They are sophisticated programmers who develop the
systems and machines that learn and apply knowledge without having any specific lead
or direction.
2. Deep Learning Engineer – Similar to computer scientists, they specialise in using deep
learning platforms to develop tasks related to artificial intelligence. Their main goal is to
be able to mimic and emulate brain functions.
3. Data Scientist – Someone who extracts meaning from data and analyses and interprets
it. This requires a combination of methods, statistics, and tools.
4. Computer Vision Engineer – They are software developers who
create vision algorithms for recognising patterns in images.
Machine learning is already changing, and will continue to change, the course of
the world in the coming decade. Let's eagerly prepare for what the future holds.
Let's hope that machines do not get the bright idea of taking over the world,
because not all of us are Arnold Schwarzenegger. Fingers crossed!
References
www.edugrad.com
www.google.com
www.python.org
www.wikipedia.org
www.tutorialspoint.com