DSA2324 Lecture 01 Introduction To Data Science
Outline
1. Course introduction
Course prerequisites
It is strongly suggested to have a good knowledge of the following topics:
• Linear algebra
• Statistics
• Calculus 1 and Calculus 2
• Dynamical systems (minor)
Please fill out the following questionnaire to assess your knowledge of the prerequisites:
https://forms.gle/LpvgCqZNM1oEYuwV8
How to «refresh» the prerequisites?
• Linear algebra: UniBg course, course by Gilbert Strang @MIT on YouTube, addendum provided among the materials for this course
• Statistics: brief review at the beginning of this course, UniBg course
• Dynamical systems: UniBg course
Evaluation
+
• Data science project with discussion (up to 10 points)
• You will receive more information on the project during the course
Educational objectives
At the end of the course, you will be able to:
Teaching materials
Provided materials
• Lessons’ slides
All the course materials are available in the following Microsoft Teams team:
https://teams.microsoft.com/l/team/19%3ATl9qqZhFmx62dyo_Crb2ntZWlv9Uqdb-0oSPCIN4GBM1%40thread.tacv2/conversations?groupId=0c35adb4-0a2f-4c2f-aa25-1dd394f1f490&tenantId=
Suggested books
• Foster Provost, Tom Fawcett. Data Science for Business: What you need to know about data mining and data-analytic thinking, O'Reilly Media, Inc. (2013)
• T. Hastie, R. Tibshirani, J. Friedman. The elements of statistical learning: data mining, inference, and prediction, 2nd Edition, Springer (2009)
• Cole Nussbaumer Knaflic. Storytelling with data: a data visualization guide for business professionals, Wiley (2015)
• I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, The MIT Press (2016)
Interactions and feedback
• During the course I will give you activities to do and tests to answer
• They are optional but they help you assess your level of understanding before the
exam
• In addition, they will give a bonus of (at most) +3 points to the final grade
Syllabus
1. Introduction to data science
10. Decision trees
«Data is the new oil»
[Figure: the oil value chain and its data counterpart]
• Oil: crude oil extraction → refinement process → barrels of oil → fuels, oils, asphalt, … → automobiles, planes, generators, engines, infrastructures, streets, …
• Data: data acquisition → descriptive analytics and reporting, modeling → DATA → product quality (machine parameters optimization), goods forecast (production and purchasing management), process optimization (reduction of materials used)
Data is the new oil and data science is «sexy»
The data scientist role has been deemed the sexiest job of the 21st century [7]
Job positions that involve data
Data analyst
• Data retrieval (database queries)
• Spot trends and patterns in the data
• Visualize the data and produce reports to present information to third parties
• …
Data scientist
• Use different machine learning techniques to derive insights from data to guide business decisions
• Make predictions on products, assets and consumer behavior based on past data
• …
Data engineer
• Design and maintain data management systems
• Data collection and management
• Make data accessible to the other members of the data science team
• …
Machine learning engineer
• Design and implementation of machine learning methods
• Extend existing machine learning frameworks and libraries
• …
Often, career opportunities require a good mix of all the aforementioned skills
What is data science?
Data science is a set of fundamental principles, processes and techniques that guide the
extraction of knowledge from data with the goal of improving decision-making
Data mining is the extraction of knowledge from data, via technologies that incorporate
data science principles
The data-driven company
Data-driven decision-making (DDD) refers to the
practice of basing decisions on the analysis of data, rather
than purely on intuition [1, 2]
• Some decisions can be made automatically (finance,
recommendations)
Anti-HiPPO culture (HiPPO: highest paid person's opinion)
The road to becoming data-driven
1. Data Denial: data are not used and are viewed with distrust
2. Data Indifference: there is no interest to acquire or use data
3. Data Aware: data are collected and used for monitoring, but no decisions are made based on them
4. Data Informed: data are mainly used by managers in decision-making
5. Data-Driven: data play a central role in the most disparate decisions that are made in the various business sectors
Why become data-driven?
• Data-driven companies are 5% more productive [2]
• $1 invested in analytics pays back $13 [3]
Business value created by Artificial Intelligence by 2030 [4]: $13 trillion
• Retail: $0.8T
• Travels: $480B
• Logistics: $475B
• Automotive & assembly: $405B
• Materials: $300B
• Advanced electronics & semiconductors: $291B
• Healthcare systems & services: $267B
• High tech: $267B
• Telecom: $174B
• Oil & gas: $173B
• Agriculture: $164B
It is difficult to find an industrial sector that will not benefit from artificial intelligence in
the near future
What are data?
Types of data: structured vs unstructured
Structured data
Data that are organized following a predefined scheme and stored in tabular formats (Excel sheets, SQL …)
House area [feet²]   # bedrooms   Price [k$]
523                  1            115
645                  1            150
Unstructured data
Data that can have an internal structure
Examples: audio files, text files, video files, image files
Types of data: quantitative vs qualitative
• Nominal qualitative data cannot be ordered
• Ordinal qualitative data can be ordered. Other examples: low/high income, age ranges…
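The distinction matters as soon as qualitative data have to be turned into numbers. The sketch below is only an illustration: the category names and the two helper functions are hypothetical, and one-hot (indicator) encoding is just one common choice for nominal data, not a method prescribed by the slides.

```python
# Hypothetical categories, chosen only for illustration.
INCOME_LEVELS = ["low", "medium", "high"]   # ordinal: the order is meaningful
COLORS = ["blue", "green", "red"]           # nominal: no meaningful order

def encode_ordinal(value, ordered_levels):
    """Map an ordinal category to its rank (0, 1, 2, ...)."""
    return ordered_levels.index(value)

def encode_nominal(value, categories):
    """Map a nominal category to an indicator (one-hot) vector."""
    return [1 if value == c else 0 for c in categories]

income_codes = [encode_ordinal(v, INCOME_LEVELS) for v in ["high", "low", "medium"]]
color_codes = [encode_nominal(v, COLORS) for v in ["red", "blue"]]
print(income_codes)  # [2, 0, 1]
print(color_codes)   # [[0, 0, 1], [1, 0, 0]]
```

Ranks preserve the ordering of ordinal categories, while indicator vectors avoid suggesting an order among nominal ones.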
Data are dirty
Common data problems:
• Missing values
• Unlikely values (outliers)
• Inconsistent formats
• …
House area [feet²]   # bedrooms   Completion date   Price [k$]
523                  1            23/06/1998        115
645                  1            01/07/2000        0.001
708                  unknown      19/01/1980        210
1034                 3            31-Jan-2001       unknown
unknown              4            17/12/2005        355
2545                 unknown      14/02/1999        440
⋮                    ⋮            ⋮                 ⋮
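A minimal sketch of how the three problems could be flagged on the rows of the example table. The reference date format and the price threshold are assumptions made for illustration, as are the helper names; a real project would choose them from domain knowledge.

```python
from datetime import datetime

# Rows of the example table: (area, bedrooms, completion date, price in k$).
rows = [
    ("523", "1", "23/06/1998", "115"),
    ("645", "1", "01/07/2000", "0.001"),
    ("708", "unknown", "19/01/1980", "210"),
    ("1034", "3", "31-Jan-2001", "unknown"),
    ("unknown", "4", "17/12/2005", "355"),
    ("2545", "unknown", "14/02/1999", "440"),
]

EXPECTED_DATE_FORMAT = "%d/%m/%Y"  # assumed reference format
MIN_PLAUSIBLE_PRICE = 1.0          # assumed threshold in k$, for illustration only

def to_float(text):
    """Return the number, or None when the value is missing or unparsable."""
    try:
        return float(text)
    except ValueError:
        return None

def has_expected_date_format(text):
    """Check whether the date follows the assumed reference format."""
    try:
        datetime.strptime(text, EXPECTED_DATE_FORMAT)
        return True
    except ValueError:
        return False

missing, unlikely, bad_format = [], [], []
for idx, (area, bedrooms, date, price) in enumerate(rows):
    values = [to_float(area), to_float(bedrooms), to_float(price)]
    if any(v is None for v in values):
        missing.append(idx)
    price_value = values[2]
    if price_value is not None and price_value < MIN_PLAUSIBLE_PRICE:
        unlikely.append(idx)
    if not has_expected_date_format(date):
        bad_format.append(idx)

print("missing:", missing)        # [2, 3, 4, 5]
print("unlikely:", unlikely)      # [1]
print("bad format:", bad_format)  # [3]
```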
What are we going to do with data?
In this course, we will use data for:
Supervised vs unsupervised learning
Many data science tasks can be tackled either by supervised or unsupervised learning methods
• Supervised learning: predict the values of one or more dependent variables (output(s))
based on the values of one or more independent variables (input(s))
Inputs $\boldsymbol{\varphi}$ (features) → Outputs $\boldsymbol{y}$ (targets)
Typically, we will focus on supervised learning problems with only one output
• Unsupervised learning: there are no outputs! The goal may be to discover groups of similar
entities within the data or to project the data from a high-dimensional space (#inputs > 3)
down to two or three dimensions for the purpose of visualization
Data science tasks
• Regression*: predict the values assumed by the continuous output(s) from the input(s)
Example: ➢ Predict the prices of houses based on their area ($\varphi \in \mathbb{R}$, $y \in \mathbb{R}$)
➢ Predict the prices of houses based on their area and number of bedrooms ($\boldsymbol{\varphi} \in \mathbb{R}^{2 \times 1}$, $y \in \mathbb{R}$)
House area [feet²]   # bedrooms   Price [k$]
523                  1            115
645                  1            150
708                  2            210
⋮                    ⋮            ⋮
*: covered in this course
• Classification*: predict the values assumed by the categorical output(s) from the input(s)
Example: ➢ Develop an application that recognizes cats in images
Input: an image, $\varphi \in \mathbb{N}^{W \times H \times D}$
Output: the class label, $y \in \{\text{cat}, \text{dog}\}$ (single output)
[Figure: cats and dogs plotted against their weight [kg]]
Technical note: regression and classification are based on correlation, whereas causal modeling is based on causality
[Figure: customers described by $\boldsymbol{\varphi} \in \mathbb{R}^{2 \times 1}$ (customer age and amount spent) and grouped into young people with low budget, middle-aged people with high budget and older people with medium budget. Output: none]
Clustering looks at the similarity between entities based on their features, whereas co-occurrence grouping considers the similarity of entities based on their appearing together in transactions (e.g., “a keyboard is not similar to a mouse, although they are typically bought together”)
➢ Profile the typical wait time of customers who call into a call center (output: none)
[Figure: distribution of the wait time; vertical axis: proportion of calls]
Inputs:
• Movie title
• Year of release
• User id
• User rating
• Rating date
Inputs:
• Song titles
• Song genres
• Audio signals
• ⋮
• User ratings
• ⋮
Output: none (in this example)
Clustering is used for exploratory data analysis (“can we partition the data into different groups of similar entities?”), whereas similarity matching has the specific goal of finding similar entities
In this course, we will study methods for solving different data science tasks
Syllabus
1. Introduction to data science
9. Decision trees (regression and classification)
Models in supervised learning
Most supervised learning methods rely on mathematical models that describe the
relationship between the inputs and the outputs
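Data-generating system: inputs $\boldsymbol{\varphi}$ → $\mathcal{S}$ → outputs $\boldsymbol{y}$
Supervised learning methods estimate $\mathcal{M}$ from data: inputs $\boldsymbol{\varphi}$ → $\mathcal{M}$ → estimated outputs $\hat{\boldsymbol{y}}$
We want $\hat{\boldsymbol{y}} \approx \boldsymbol{y}$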
We view both $\mathcal{S}$ and $\mathcal{M}$ as mathematical functions that map inputs (features) to outputs (targets)
• $\mathcal{S}$: inputs $\boldsymbol{\varphi}$ → $f(\cdot)$ → outputs $\boldsymbol{y}$, i.e. $\boldsymbol{y} = f(\boldsymbol{\varphi})$
• $\mathcal{M}$: inputs $\boldsymbol{\varphi}$ → $\hat{f}(\cdot)$ → estimated outputs $\hat{\boldsymbol{y}}$, i.e. $\hat{\boldsymbol{y}} = \hat{f}(\boldsymbol{\varphi})$
Dataset notation
Before moving on, we introduce the following notation that we will use for any dataset
Static systems (and models)
A system whose outputs can be determined directly from the inputs is said to be a static system (“memoryless” system)
Inputs $\boldsymbol{\varphi}(i)$ → $f(\cdot)$ → Outputs $\boldsymbol{y}(i)$
Example: a resistor $R$ with voltage $V(t)$ and current $I(t)$
$$I(t) = \frac{V(t)}{R}$$
The output $I(t)$ at time $t$ only depends on the input $V(t)$ at the same time instant
[Figure: $f(\varphi(i))$ plotted against $\varphi(i)$]
We can view each voltage/current measurement by itself (i.e. as an observation $(\varphi(i), y(i))$ in its own right); we do not need to consider $V(t)$ and $I(t)$ as signals
“The time $t$ can be omitted”
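To make the "memoryless" idea concrete, the sketch below treats each voltage/current pair as an independent observation and recovers the resistance from them. The resistance value and the measurements are hypothetical, and the least-squares slope through the origin is just one simple way to estimate $1/R$.

```python
# Hypothetical measurements generated from I = V / R with R = 10 ohm.
TRUE_R = 10.0
voltages = [1.0, 2.0, 5.0, 3.0, 4.0]       # inputs  phi(i)
currents = [v / TRUE_R for v in voltages]  # outputs y(i)

# Because the system is static, the order of the observations is irrelevant:
# shuffling the (phi(i), y(i)) pairs would not change the estimate.
# Least-squares slope through the origin: g = sum(V*I) / sum(V^2), with g = 1/R.
g_hat = sum(v * i for v, i in zip(voltages, currents)) / sum(v * v for v in voltages)
r_hat = 1.0 / g_hat
print(round(r_hat, 6))  # 10.0
```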
Static systems need not describe only physical phenomena
House area [feet²]   # bedrooms   Price [k$]
523                  1            115
645                  1            150
708                  2            210
⋮                    ⋮            ⋮
[Figure: images paired with their labels: Cat, Not cat, Cat]
Learning static systems
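In the regression setting, the simplest model that can be used to describe static systems (but also dynamical systems!) is the linear model
$$y(i) = \theta_0 + \theta_1 \varphi_1(i) + \cdots + \theta_{d-1} \varphi_{d-1}(i) + \epsilon(i) = \sum_{j=0}^{d-1} \theta_j \varphi_j(i) + \epsilon(i) = \boldsymbol{\varphi}(i)^\top \boldsymbol{\theta} + \epsilon(i)$$
where $i$ denotes the $i$-th observation, $\boldsymbol{\varphi}(i)^\top$ is $1 \times d$, $\boldsymbol{\theta}$ is $d \times 1$, $\epsilon(i)$ is $1 \times 1$, and
• $\varphi_0 = 1$
• $\boldsymbol{\varphi}(i) = [\varphi_0\ \varphi_1(i)\ \cdots\ \varphi_{d-1}(i)]^\top \in \mathbb{R}^{d \times 1}$
• $\boldsymbol{\theta} = [\theta_0\ \theta_1\ \cdots\ \theta_{d-1}]^\top \in \mathbb{R}^{d \times 1}$
• $y(i) \in \mathbb{R}$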
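To “learn” means to estimate the values of the parameters in $\boldsymbol{\theta} = [\theta_0\ \theta_1\ \cdots\ \theta_{d-1}]^\top$
Key idea: find the values of $\boldsymbol{\theta}$ that minimize a “cost” (or “loss”), i.e. an “error” or “something bad” → it is good to minimize something bad
• This is achieved through optimization
Cost function:
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \epsilon(i)^2, \qquad \epsilon(i) = y(i) - \boldsymbol{\varphi}(i)^\top \boldsymbol{\theta}$$
With this cost, we are minimizing the sum of the squared errors (scaled by $1/N$) between the observed outputs (i.e. those reported in our dataset) and the outputs estimated by the linear model
Minimizer of the cost function:
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$
Estimated outputs:
$$\hat{y}(i) = \hat{f}(\boldsymbol{\varphi}(i)) = \boldsymbol{\varphi}(i)^\top \hat{\boldsymbol{\theta}}$$
[Figure: cost function for a scalar (single) parameter $\theta$ and for multiple parameters $\boldsymbol{\theta}$]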
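The cost minimization can be sketched numerically as follows. The data are synthetic and all numerical values are hypothetical; `np.linalg.lstsq` is used here as one convenient way to compute the minimizer of the squared-error cost for a linear model, and is not necessarily the procedure adopted later in the course.

```python
import numpy as np

# Synthetic data from a known linear model (all values are hypothetical).
rng = np.random.default_rng(0)
N = 200
theta_true = np.array([20.0, 0.3, 15.0])             # [theta_0, theta_1, theta_2]
area = rng.uniform(400.0, 2500.0, size=N)
bedrooms = rng.integers(1, 5, size=N).astype(float)
Phi = np.column_stack([np.ones(N), area, bedrooms])  # row i is phi(i)^T, with phi_0 = 1
y = Phi @ theta_true + rng.normal(0.0, 1.0, size=N)  # y(i) = phi(i)^T theta + eps(i)

def cost(theta):
    """J(theta) = (1/N) * sum_i (y(i) - phi(i)^T theta)^2."""
    residuals = y - Phi @ theta
    return float(np.mean(residuals ** 2))

# Minimizer of J, computed with a least-squares solver.
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta_hat                              # estimated outputs
print(theta_hat, cost(theta_hat))
```

By construction, no other parameter vector (including the one that generated the data) achieves a lower cost on this dataset than `theta_hat`.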
Dynamical systems (and models)
A system whose outputs (at a certain time instant) cannot be determined directly from the inputs (at the same time instant) is said to be a dynamical system
Input signals $\boldsymbol{u}(t)$ → $\mathcal{S}$ → Output signals $\boldsymbol{y}(t)$
Dynamical models are mathematical models that describe the future evolution of the variables involved as a function of their past trend
Dynamical systems usually involve the time: the outputs $\boldsymbol{y}(t)$ at a certain time $t$ depend on the outputs at previous times
This dependency on the past endows the model with a “memory” (i.e. the dynamics)
Example: an electric motor with input voltage $V(t)$ and output angular velocity $\omega(t)$
[Figure: voltage [V] plotted over time]
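Dynamical systems can be defined in continuous-time or in discrete-time
Example: resistor-capacitor circuit with input voltage $V(t)$ and capacitor voltage $V_C(t)$
$$V(t) = R \cdot i(t) + V_C(t)$$
Since the capacitor current is $i(t) = C \dot{V}_C(t)$, dividing by $RC$ gives
$$\dot{V}_C(t) + \frac{1}{RC} V_C(t) = \frac{1}{RC} V(t)$$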
However, computers can only manage a finite amount of data. Thus, signals $s(t)$ should be sampled at a sampling time $T_s$ so that we can store a finite amount of data corresponding to the time instants $kT_s$, $k = 1, \ldots, N$, i.e.
$$s(t) \;\rightarrow\; s(0),\ s(T_s),\ s(2T_s),\ s(3T_s),\ \ldots$$
• $t$ → continuous-time
• $k$ → discrete-time, with the shorthand $s(k) = s(kT_s)$
Example: resistor-capacitor circuit (continuous-time → discrete-time)
$$\dot{V}_C(t) + \frac{1}{RC} V_C(t) = \frac{1}{RC} V(t)$$
Numerical differentiation at $t = kT_s$:
$$\dot{V}_C(kT_s) \approx \frac{V_C((k+1)T_s) - V_C(kT_s)}{T_s}$$
With the shorthand $s(k) = s(kT_s)$:
$$\frac{V_C(k+1) - V_C(k)}{T_s} + \frac{1}{RC} V_C(k) = \frac{1}{RC} V(k)$$
Shift back by 1 step and re-organize the equation:
$$V_C(k) = \left(1 - \frac{T_s}{RC}\right) V_C(k-1) + \frac{T_s}{RC} V(k-1)$$
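The discretized equation can be checked with a short simulation. The component values, the sampling time and the step input below are hypothetical; the comparison uses the well-known exact step response of an RC circuit, $1 - e^{-t/(RC)}$.

```python
import math

# Hypothetical component values and sampling time.
R = 1_000.0  # ohm
C = 1e-3     # farad, so the time constant is RC = 1 s
Ts = 0.01    # s, chosen much smaller than RC
N = 500      # number of samples (5 s of simulated time)

a = 1.0 - Ts / (R * C)  # coefficient of V_C(k-1)
b = Ts / (R * C)        # coefficient of V(k-1)

v_in = [1.0] * N        # unit step input voltage
v_c = [0.0] * N         # capacitor initially discharged
for k in range(1, N):
    v_c[k] = a * v_c[k - 1] + b * v_in[k - 1]

# Exact continuous-time step response, for comparison: 1 - exp(-t / RC).
exact = [1.0 - math.exp(-k * Ts / (R * C)) for k in range(N)]
max_error = max(abs(d - e) for d, e in zip(v_c, exact))
print(f"max |discrete - exact| = {max_error:.4f}")
```

With a sampling time this small compared to $RC$, the discrete-time model stays close to the continuous-time response; a larger `Ts` would make the approximation visibly worse.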
From signals to feature vectors
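Signals: input signals $\boldsymbol{u}(t)$ → $\mathcal{S}$ → output signals $\boldsymbol{y}(t)$
Feature vectors: inputs (features) $\boldsymbol{\varphi}(k)$ → $f(\cdot)$ → outputs $\boldsymbol{y}(k)$
Example: resistor-capacitor circuit
Continuous-time model:
$$\dot{V}_C(t) + \frac{1}{RC} V_C(t) = \frac{1}{RC} V(t)$$
Discrete-time model:
$$V_C(k) = \left(1 - \frac{T_s}{RC}\right) V_C(k-1) + \frac{T_s}{RC} V(k-1)$$
≡
$$y(k) = f(\boldsymbol{\varphi}(k)) = \boldsymbol{\varphi}(k)^\top \boldsymbol{\theta}$$
• $\boldsymbol{\varphi}(k) = [V_C(k-1)\ \ V(k-1)]^\top$
• $\boldsymbol{\theta} = \left[1 - \frac{T_s}{RC}\ \ \frac{T_s}{RC}\right]^\top$
• $y(k) = V_C(k)$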
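Once the signals are arranged as feature vectors, the parameters can be estimated like in the static case. The sketch below generates noise-free data from a hypothetical circuit (all numerical values are assumptions), stacks $\boldsymbol{\varphi}(k) = [V_C(k-1)\ V(k-1)]^\top$ row by row, and recovers $\boldsymbol{\theta}$ with a least-squares solver.

```python
import numpy as np

# Hypothetical "true" circuit and sampling time, used only to generate data.
R, C, Ts = 1_000.0, 1e-3, 0.01
theta_true = np.array([1.0 - Ts / (R * C), Ts / (R * C)])

# Simulate the discrete-time model driven by a random input voltage.
rng = np.random.default_rng(1)
N = 300
V = rng.uniform(-5.0, 5.0, size=N)
Vc = np.zeros(N)
for k in range(1, N):
    Vc[k] = theta_true[0] * Vc[k - 1] + theta_true[1] * V[k - 1]

# Feature vectors phi(k) = [V_C(k-1), V(k-1)]^T and outputs y(k) = V_C(k), k = 1..N-1.
Phi = np.column_stack([Vc[:-1], V[:-1]])
y = Vc[1:]

theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
rc_hat = Ts / theta_hat[1]  # since the second parameter equals Ts / (RC)
print(theta_hat, rc_hat)
```

Because the data are noise-free, the estimate matches the generating parameters up to numerical precision; with measured signals the estimate would only be approximate.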
Static vs dynamical systems
Static systems: inputs $\boldsymbol{\varphi}(i)$ → $f(\cdot)$ → outputs $y(i)$
Dynamical systems: inputs $\boldsymbol{\varphi}(k)$ → $f(\cdot)$ → outputs $y(k)$
• For static systems, we will index the observations with the index 𝑖
• For dynamical systems, we will index the observations with the index 𝑘
𝑘 can be interpreted as the 𝑘-th sampling step
Machine Learning (ML), Artificial Intelligence (AI), Data Science and System Identification
[Figure: overlapping areas labeled “System identification and control”, “Data science”, “AI”, “ML” and “Deep learning”, with the topics: other tools, planning, search, reasoning, experiment design, subspace methods, frequency domain methods, time series, visualization]
Why do we need models?
All in all, we need a model to better understand the phenomena that are of interest to us.
Models are useful for:
• Decision-making: suppose that we are testing a new vaccine. We have two groups of
people. We give the vaccine to the first group (test group) and a placebo to the second
one (control group). Then, we measure some variables from the patients. How can we
determine if the vaccine was effective or not?
• Prediction: forecast the values that the output variables will assume, and on which we have no data, based on the values assumed by the input variables
House area [feet²]   # bedrooms   Price [k$]
523                  1            115
645                  1            150
708                  2            210
⋮                    ⋮            ⋮
How much does a 600 feet² house with 2 bedrooms cost?
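One possible way to answer the question, sketched under strong assumptions: a linear model with intercept is fitted to the three rows that are visible in the table. With three rows and three parameters the model reproduces the data exactly, so the number below only illustrates the mechanics of prediction; an estimate based on the full dataset would generally differ.

```python
import numpy as np

# The three rows visible in the table: area [feet^2], # bedrooms, price [k$].
area = np.array([523.0, 645.0, 708.0])
bedrooms = np.array([1.0, 1.0, 2.0])
price = np.array([115.0, 150.0, 210.0])

# Linear model with intercept: price = theta_0 + theta_1 * area + theta_2 * bedrooms.
Phi = np.column_stack([np.ones_like(area), area, bedrooms])
theta_hat, *_ = np.linalg.lstsq(Phi, price, rcond=None)

# Prediction for a house that is not in the dataset.
phi_new = np.array([1.0, 600.0, 2.0])
predicted_price = float(phi_new @ theta_hat)
print(f"predicted price: {predicted_price:.1f} k$")  # about 179.0 k$
```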
Prediction vs inference: prediction is not necessarily concerned with the structure of the model $\hat{f}(\cdot)$ and its complexity ($\hat{f}(\cdot)$ can be seen as a black-box), while inference uses the model to understand the relationship between each input and each output
• Simulation: we can simulate, with a computer, the response (outputs) of a model due
to certain inputs. By looking at the model’s response, we can get a better grasp of the
modeled system
[Figure: feedback loop in which a controller, fed with the difference between the signal $\boldsymbol{s}(t)$ and the output $\boldsymbol{y}(t)$, computes the input $\boldsymbol{u}(t)$ of the system $\mathcal{S}$]
• Fault diagnosis: we can check the presence of faults by comparing signals that come
from the real system with those simulated by the estimated model
[Figure: system $\mathcal{S}$ with input $\boldsymbol{u}(t)$, output $\boldsymbol{y}(t)$ and faults, monitored by a model-based fault diagnosis system]
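The comparison between real and simulated signals can be sketched as a residual check. Everything below is hypothetical: the first-order model, its coefficients, the size and onset of the fault, and the fixed threshold are assumptions made only to show the idea.

```python
# Hypothetical first-order model y(k) = a*y(k-1) + b*u(k-1), assumed already estimated.
A_HAT, B_HAT = 0.9, 0.1
THRESHOLD = 0.2   # assumed tolerance on the residual
FAULT_START = 30  # hypothetical sample at which an additive fault appears
FAULT_SIZE = 0.5
N = 50

u = [1.0] * N

# "Real" system output: same dynamics plus an additive sensor fault from FAULT_START on.
y_true = [0.0] * N
for k in range(1, N):
    y_true[k] = A_HAT * y_true[k - 1] + B_HAT * u[k - 1]
y_measured = [y + (FAULT_SIZE if k >= FAULT_START else 0.0) for k, y in enumerate(y_true)]

# Output simulated by the model, driven by the same input.
y_model = [0.0] * N
for k in range(1, N):
    y_model[k] = A_HAT * y_model[k - 1] + B_HAT * u[k - 1]

# Residual = measured - simulated; a fault is flagged when it exceeds the threshold.
residuals = [m - s for m, s in zip(y_measured, y_model)]
fault_flags = [abs(r) > THRESHOLD for r in residuals]
first_alarm = fault_flags.index(True) if any(fault_flags) else None
print("first alarm at sample", first_alarm)  # 30
```

In practice the model is never perfect and the measurements are noisy, so the threshold has to be tuned to balance false alarms against missed faults.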
Business problems as data science tasks
Each data-driven project is unique. First and foremost, decompose the business problem into data science subtasks that can be solved by existing methods
Business problem → (decompose) → Data science (sub)task(s) → (solve) → Algorithm(s)/method(s) → (analyze) → Analyze the results (to derive insights and drive business-related decisions)
Data science (sub)tasks:
• Regression
• Classification
• Causal modeling
• Clustering
• Co-occurrence grouping
• Profiling
• Link prediction
• Dimensionality reduction
• Similarity matching
[Figure annotations: “Machine learning engineers focus on these aspects” (next to the algorithms/methods); “Data scientists focus on these aspects” (next to the business problem and its decomposition)]
Selecting data-driven projects
Focus on data science and machine learning projects that are valuable and feasible
[Figure: Venn diagram in which candidate projects (x) lie in the intersection of “What data-driven methods can do” and “Valuable for your business”]
Think about automating tasks rather than automating jobs
Example: MANUFACTURING LINE MANAGER
Line steps: Mix clay → Shape mug → Add glaze → Fire kiln → Final inspection
[Figure: the same line under the headings “Data science” and “Machine learning”; inspected mugs are labeled NO DEFECT, NO DEFECT, DEFECT]
Example: RECRUITING
Recruiting steps: Email outreach → Phone screen → Onsite interview → Offer
[Figure: under the headings “Data science” and “Machine learning”; résumés (Mario Rossi: personal info, education, employment) are labeled YES or NO]
Example: MARKETING
[Figure: under the headings “Data science” and “Machine learning”; two variants, Version A and Version B, are compared]
CRISP-DM process
Cross Industry Standard Process for Data Mining (CRISP-DM)
[Figure: the CRISP-DM process. Picture taken from [1]]
CRISP-DM: Business understanding
Cast the business problem into one or more data science problems
• Regression
• Classification
• Causal modeling
• Clustering
• Co-occurrence grouping
• Profiling
• Link prediction
• Dimensionality reduction
• Similarity matching
Think carefully about the use scenario:
• What exactly do we want to do?
• How exactly would we do it?
• What parts of this use scenario constitute possible data mining models?
CRISP-DM: Data understanding
Identify the available and needed data
CRISP-DM: Data preparation
Clean and prepare the data for usage
Take care not to use historical data that will not be available when decisions need to be made
CRISP-DM: Modeling
Estimate a mathematical model to extract patterns from data
CRISP-DM: Evaluation
Assess the validity of the results
We could find patterns that exist only in the particular dataset that
we have at our disposal (overfitting)
The devised solution and the model’s decisions should be comprehensible to the stakeholders
Usually, evaluation is performed before deploying. In this case, build environments that
closely mimic the real use scenario
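The risk of finding patterns that exist only in the available dataset can be illustrated with a small experiment. The sketch below is an assumption-laden toy: the "true" relationship, the deterministic ±0.5 disturbance that stands in for noise, and the choice of polynomial models of degree 1 and 9 are all hypothetical. It compares the error on the data used for fitting with the error on held-out points.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_function(x):
    """Hypothetical "true" relationship between input and output."""
    return 2.0 * x + 1.0

# Small training set with a deterministic +/-0.5 disturbance (stands in for noise).
x_train = np.linspace(0.0, 1.0, 10)
noise = 0.5 * (-1.0) ** np.arange(10)
y_train = true_function(x_train) + noise

# Held-out points (midpoints between training inputs), never used for fitting.
x_test = (x_train[:-1] + x_train[1:]) / 2.0
y_test = true_function(x_test)

def mse(model, x, y):
    """Mean squared error of a fitted model on the pairs (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.3g}, "
          f"held-out MSE = {results[degree][1]:.3g}")
```

The degree-9 polynomial passes through every training point, so its training error is essentially zero, yet it does worse than the straight line on the held-out points: it has fitted the disturbance rather than the underlying relationship.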
CRISP-DM: Deployment
Put the model (or the data mining steps) into production
This step can require a notable investment in time. Usually, the data science team
builds a prototype that is then passed on to the development team
For this reason, it is suggested to include a member of the development team in the
early phases of the data science project
Deployment can involve not only the final model, but also previous phases (data
collection, model building, evaluation)
Workflow of a machine learning project
Example: build a home assistant device (Amazon Echo, Google Home, Apple Siri, Baidu DuerOS)
1. Collect data
• Audio clips (A) paired with the spoken words (B): Audio #1 → «Alexa», Audio #2 → «Hello»
2. Train model
• Iterate many times until good enough
3. Deploy model
• Get data back
• Maintain/update model
Workflow of a data science project
Example: optimize a manufacturing line (Mix clay → Shape mug → Add glaze → Fire kiln → Final inspection)
1. Collect data
Clay batch #   Supplier     Mixing time [minutes]
001            Supplier 1   35
034            Supplier 1   22
109            Supplier 2   28
2. Analyze data
• Iterate many times to get good insights
References
1. Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data
mining and data-analytic thinking”. O'Reilly Media, Inc., 2013. Chapters 1-2.
2. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. “Strength in numbers: How does data driven decision making
affect firm performance?”. Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011
3. Nucleus Research, 2014. http://bit.ly/XQFDbv.
4. Notes from the AI frontier: Modeling the impact of AI on the world economy, 2018.
5. Pyle, D. “Data Preparation for Data Mining”. Morgan Kaufmann, 1999. Chapter 1.
6. G. James, D. Witten, T. Hastie, R. Tibshirani. “An Introduction to Statistical Learning”. 2nd Edition, Springer, 2021. Chapters 1-2.
7. Data Scientist: The Sexiest Job of the 21st Century, 2012.
8. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020,
with forecasts from 2021 to 2025, 2022.
9. Correlation does not imply causation: 5 real-world examples, 2021.