1-Introduction To Data Science

The document discusses various data science topics including: 1. An introduction to data science including exploratory data analysis, machine learning, and dealing with large scales of data. 2. The skill sets needed for data science including an appreciation for data and asking insightful questions about data. 3. Examples of data sets like baseball statistics, movie data, Google Ngrams, and NYC taxi data and questions that could be answered by analyzing these data sets.

Uploaded by

Menna Saed

Data Science

Lecture 1: Introduction to Data Science

What is Data Science?
Like any emerging field, it isn’t yet well defined,
but incorporates elements of:
● Exploratory Data Analysis and Visualization
● Machine Learning and Statistics
● High-Performance Computing technologies for
dealing with scale.
Skill Sets for Data Science
Why data science, and why now? I see three reasons for this sudden burst of activity:

• New technology makes it possible to capture, annotate, and store vast amounts of social media, logging, and sensor
data. After you have amassed all this data, you begin to wonder what you can do with it.

• Computing advances make it possible to analyze data in novel ways and at ever-increasing scales. Cloud computing
architectures give even the little guy access to vast power when they need it. New approaches to machine learning
have led to amazing advances in longstanding problems, like computer vision and natural language processing.

• Prominent technology companies (like Google and Facebook) and quantitative hedge funds (like Renaissance
Technologies and Two Sigma) have proven the power of modern data analytics.
First, we will try to explain how good data scientists think, and how this differs from
the mindset of traditional programmers and software developers. Second, we will look
at data sets in terms of the potential for what they can be used for, and learn to ask the
broader questions they are capable of answering.
Appreciating Data
Computer Scientists do not naturally appreciate
data: it’s just stuff to run through a program.
The usual way to test algorithm performance is to
run the implementation on “random data”.
But interesting data sets are a scarce resource,
which requires hard work and imagination to
obtain.
Computer vs. Real Scientists (1)
● Scientists strive to understand the complicated
and messy natural world, while computer
scientists build their own clean and organized
virtual worlds. Thus:
● Nothing is ever completely true or false in
science, while everything is either true or
false in Computer Science / Mathematics.
Computer vs. Real Scientists (2)
● Scientists are data-driven, while computer
scientists are algorithm-driven.
● Scientists obsess about discovering things,
which computer scientists invent rather than
discover.
● Scientists are comfortable with the idea that
data has errors; computer scientists are not.
Examples of the cultural differences between computer science and real science include:

• Data vs. method centrism: Scientists are data driven, while computer scientists are algorithm
driven. Real scientists spend enormous amounts of effort collecting data to answer their
question of interest. They invent fancy measuring devices, stay up all night tending to
experiments, and devote most of their thinking to how to get the data they need.
By contrast, computer scientists obsess about methods: which algorithm is better than which
other algorithm, which programming language is best for a job, which program is better than
which other program. The details of the data set they are working on seem comparatively
unexciting.
• Concern about results: Real scientists care about answers. They analyze data to
discover something about how the world works. Good scientists care about
whether the results make sense, because they care about what the answers mean.
By contrast, bad computer scientists worry about producing plausible-looking
numbers. As soon as the numbers stop looking grossly wrong, they are presumed
to be right. This is because they are personally less invested in what can be
learned from a computation, as opposed to getting it done quickly and
efficiently.
• Robustness: Real scientists are comfortable with the idea that data has errors. In general, computer scientists are not. Scientists
think a lot about possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived
from them. Good programmers use strong data-typing and parsing methodologies to guard against formatting errors, but the
concerns here are different.

Becoming aware that data can have errors is empowering. Computer scientists chant “garbage in, garbage out” as a defensive
mantra to ward off criticism, a way to say that’s not my job. Real scientists get close enough to their data to smell it, giving it the
sniff test to decide whether it is likely to be garbage.

• Precision: Nothing is ever completely true or false in science, while everything is either true or false in computer science or
mathematics.

Generally speaking, computer scientists are happy printing floating point numbers to as many digits as possible: 8/13 =
0.61538461538. Real scientists will use only two significant digits: 8/13 ≈ 0.62. Computer scientists care what a number is, while
real scientists care what it means.
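
The contrast between the two ways of reporting 8/13 can be reproduced in a few lines. This is a minimal sketch; the helper name `round_sig` is hypothetical and not part of the lecture.

```python
from math import floor, log10

def round_sig(x: float, digits: int = 2) -> float:
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # The position of the leading digit decides how many decimals to keep.
    exponent = floor(log10(abs(x)))
    return round(x, digits - 1 - exponent)

ratio = 8 / 13
print(ratio)             # every digit the float representation offers
print(round_sig(ratio))  # two significant digits
```

Python's built-in format specifier gives the same two-digit view as a string: `format(ratio, ".2g")`.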
Genius vs. Wisdom
Software developers are hired to produce code.
Data Scientists are hired to produce insights.
Genius shows in finding the right answer.
Wisdom shows in avoiding the wrong answers.
Data science (like most things) benefits more
from wisdom than from genius.
Developing Wisdom
● Wisdom comes from experience.
● Wisdom comes from general knowledge.
● Wisdom comes from listening to others.
● Wisdom comes from humility, observing how
often you have been wrong and why/how.
I seek to pass on wisdom, drawn from my experience of
the difficulty of making good predictions.
Developing Curiosity
● The good data scientist develops a curiosity
about the domain/application they are working
in.
● They talk shop with the people whose data
they are working on.
● They read the newspaper every day, to get a
broader perspective on the world.
Asking Good Questions
Software developers are not encouraged to ask
questions, but data scientists are:
● What exciting things might you be able to
learn from a given data set?
● What things do you/your people really want to
know?
● What data sets might get you there?
Let’s Practice Asking Questions!
Who, What, Where, When, and Why on the
following datasets:
● Baseball-reference.com
● Internet Movie Database (IMDb)
● Google ngrams
● NYC taxi cab records
Baseball-Reference.com: biosketch
Statistical Record of Play
Summary statistics of each year’s batting, pitching, and fielding record,
with teams and awards.
Baseball Questions
● How best to measure an individual player’s skill,
value, or performance?
● How fairly do trades between teams work out?
● What is the trajectory of players’
performance as they mature and age?
● To what extent does batting performance
correlate with the position played?
Demographic Questions
● Do left-handed people have shorter lifespans
than right-handers?
● How often do people return to where they
were born?
● Do player salaries reflect past, present, or
future performance?
● Are heights and weights increasing in the
population?
There are two particular themes to be aware of here. First, the identifiers and reference
tags (i.e. the metadata) often prove more interesting in a data set than the stuff we are
supposed to care about, here the statistical record of play.
Second is the idea of a statistical proxy, where you use the data set you have to substitute
for the one you really want. The data set of your dreams likely does not exist, or may be
locked away behind a corporate wall even if it does. A good data scientist is a pragmatist,
seeing what they can do with what they have instead of bemoaning what they cannot get
their hands on.
IMDb: Movie Data
IMDb: Actor Data
Movie Questions
● Can we predict how well people will like a
movie? What about its gross?
● What does the social network of actors look
like? (Six degrees of Kevin Bacon)
● What is the age distribution of actors and
actresses in film?
● Do stars live longer or shorter lives than the bit
players or public?
• Which actors appeared in the most films? Earned the most money? Appeared in the lowest rated films? Had the longest career or the shortest lifespan?

• What was the highest rated film each year, or the best in each genre? Which movies lost the most money, had the highest-powered casts, or got the least favorable reviews?

• How well does movie gross correlate with viewer ratings or awards? Do customers instinctively flock to trash, or is virtue on the part of the creative team properly rewarded?

• How do Hollywood movies compare to Bollywood movies, in terms of ratings, budget, and gross? Are American movies better received than foreign films, and how does this differ between U.S. and non-U.S. reviewers?
Google Ngrams
● Presents an annual time series of the frequency
of every “popular” word or phrase of 1 to 5
words occurring in scanned books.
● “Popular” means it appears more than 40 times in total.
● Google has scanned about 15% of all books
ever published, making this resource quite
comprehensive.
Google Ngram Viewer
Ngram Questions
● How has the amount of cursing changed
over time?
● What is the lifespan of fame and
technologies? Is it increasing/decreasing?
● How often do new words emerge? Do
they stay in common usage?
● What words are associated with other words,
i.e. can you build a language model?
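
The last question, building a language model from word associations, can be sketched with bigram counts. This is a toy illustration on a made-up sentence, not the Google Ngrams data, and the function name `bigram_model` is hypothetical.

```python
from collections import Counter, defaultdict

def bigram_model(text):
    """Estimate P(next word | previous word) from adjacent word pairs."""
    words = text.lower().split()
    following = defaultdict(Counter)
    # Count how often each word follows each other word.
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    # Normalize the counts for each previous word into probabilities.
    model = {}
    for prev, counts in following.items():
        total = sum(counts.values())
        model[prev] = {word: count / total for word, count in counts.items()}
    return model

# Made-up example text, used only to exercise the function.
model = bigram_model("the cat sat on the mat the cat ran")
print(model["the"])
```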
NYC Taxi Cab Data
● Gives driver/owner, pickup/dropoff location,
and fare data for every taxi trip taken.
● Data obtained from NYC via a Freedom of
Information Act (FOIA) request
Taxicab Questions
● How much do drivers make each night?
● How far do they travel?
● How much slower is traffic during rush hour?
● Where are people traveling to/from at
different times of the day?
● Do faster drivers get tipped better?
● Where should drivers go to pick up their next
fare?
Properties of Data
1 Structured vs. Unstructured Data

Certain data sets are nicely structured, like the tables in a database or spreadsheet program. Others record information about
the state of the world, but in a more heterogeneous way. Perhaps it is a large text corpus with images and links like
Wikipedia, or the complicated mix of notes and test results appearing in personal medical records.

Data is often represented by a matrix, where the rows of the matrix represent distinct items or records, and the columns
represent distinct properties of these items. For example, a data set about U.S. cities might contain one row for each city, with
columns representing features like state, population, and area.

When confronted with an unstructured data source, such as a collection of tweets from Twitter, our first step is generally to
build a matrix to structure it. A bag of words model will construct a matrix with a row for each tweet, and a column for each
frequently used vocabulary word. Matrix entry M[i,j] then denotes the number of times tweet i contains word j.
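
The bag of words construction described above can be sketched as follows. The tweets are made up for illustration, and the function name `bag_of_words` and the `min_count` threshold (standing in for "frequently used") are assumptions, not part of the lecture.

```python
from collections import Counter

def bag_of_words(docs, min_count=1):
    """Return (vocab, M) where M[i][j] counts word vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    # Total frequency of every word across all documents.
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Keep only words used at least min_count times overall.
    vocab = sorted(word for word, count in totals.items() if count >= min_count)
    column = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            j = column.get(word)
            if j is not None:
                matrix[i][j] += 1
    return vocab, matrix

# Hypothetical tweets: one matrix row per tweet.
tweets = ["data science is fun", "big data is big", "science needs data"]
vocab, M = bag_of_words(tweets, min_count=2)
print(vocab)
for row in M:
    print(row)
```

Whitespace splitting is the simplest possible tokenizer; real tweets would need handling for punctuation, hashtags, and links.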
2 Quantitative vs. Categorical Data

Quantitative data consists of numerical values, like height and weight. Such data can be incorporated directly into algebraic formulas and mathematical models, or displayed in conventional graphs and charts.

By contrast, categorical data consists of labels describing the properties of the objects under investigation, like gender, hair color, and occupation. This descriptive information can be every bit as precise and meaningful as numerical data, but it cannot be worked with using the same techniques.

Categorical data can usually be coded numerically. For example, gender might be represented as male = 0 and female = 1. But things get more complicated when there are more than two categories per feature, especially when there is no implicit order between them. We may be able to encode hair colors as numbers by assigning each shade a distinct value like gray hair = 0, red hair = 1, and blond hair = 2. However, we cannot really treat these values as numbers for anything other than simple identity testing. Does it make any sense to talk about the maximum or minimum hair color? What is the interpretation of my hair color minus your hair color?
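
A short sketch of the hair color coding above shows why only identity tests are safe. The one-hot encoding at the end is a common alternative that is not described in the lecture; it is included as an assumption about how one might avoid implying an order. All names here are hypothetical.

```python
# Codes taken from the text: gray = 0, red = 1, blond = 2.
HAIR_CODE = {"gray": 0, "red": 1, "blond": 2}

def encode(colors):
    """Map each label to its integer code."""
    return [HAIR_CODE[c] for c in colors]

def one_hot(colors):
    """Map each label to a 0/1 vector with one position per category."""
    rows = []
    for c in colors:
        row = [0] * len(HAIR_CODE)
        row[HAIR_CODE[c]] = 1
        rows.append(row)
    return rows

people = ["gray", "blond", "red", "blond"]
codes = encode(people)

# Identity testing is meaningful: persons 1 and 3 share a hair color.
same = codes[1] == codes[3]

# Arithmetic runs without error but has no interpretation:
# "blond minus red" is not a hair color.
difference = codes[1] - codes[2]

vectors = one_hot(people)
print(codes, same, difference)
print(vectors)
```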

3 Big Data vs. Little Data

Data science has become conflated in the public eye with big data, the analysis of massive data sets resulting from computer logs and sensor devices. In principle, having more data is always better than having less, because you can always throw some of it away by sampling to get a smaller set if necessary.
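
Throwing data away by sampling can be as simple as the sketch below, which draws a uniform random subset without replacement. The function name `downsample` and the fixed seed are illustrative assumptions, and the list of integers stands in for real records.

```python
import random

def downsample(records, k, seed=0):
    """Return k records chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, min(k, len(records)))

# Stand-in for a large data set: one million record identifiers.
big = list(range(1_000_000))
small = downsample(big, 1000)
print(len(small))
```

A uniform sample keeps the overall distribution roughly intact, which is what makes the smaller set a reasonable substitute for exploration.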

The challenges of big data include:

• The analysis cycle time slows as data size grows: Computational operations on data sets take longer as their volume increases. Small spreadsheets provide instantaneous response, allowing you to experiment and play "what if?" But large spreadsheets can be slow and clumsy to work with, and massive-enough data sets might take hours or days to get answers from. Clever algorithms can permit amazing things to be done with big data, but staying small generally leads to faster analysis and exploration.

• Large data sets are complex to visualize: Plots with millions of points on them are impossible to display on computer screens or printed images, let alone conceptually understand. How can we ever hope to really understand something we cannot see?

• Simple models do not require massive data to fit or evaluate: A typical data science task might be to make a decision (say, whether I should offer this fellow life insurance) on the basis of a small number of variables: say age, gender, height, weight, and the presence or absence of existing medical conditions.
