
1-Introduction To Data Science

The document discusses various data science topics including: 1. An introduction to data science including exploratory data analysis, machine learning, and dealing with large scales of data. 2. The skill sets needed for data science including an appreciation for data and asking insightful questions about data. 3. Examples of data sets like baseball statistics, movie data, Google Ngrams, and NYC taxi data and questions that could be answered by analyzing these data sets.

Uploaded by

Menna Saed

Data Science

Lecture 1: Introduction to Data Science


What is Data Science?
Like any emerging field, it isn’t yet well defined,
but incorporates elements of:
● Exploratory Data Analysis and Visualization
● Machine Learning and Statistics
● High-Performance Computing technologies for
dealing with scale.
Skill Sets for Data Science
Why data science, and why now? I see three reasons for this sudden burst of activity:

• New technology makes it possible to capture, annotate, and store vast amounts of social media, logging, and sensor
data. After you have amassed all this data, you begin to wonder what you can do with it.

• Computing advances make it possible to analyze data in novel ways and at ever increasing scales. Cloud computing
architectures give even the little guy access to vast power when they need it. New approaches to machine learning
have led to amazing advances in longstanding problems, like computer vision and natural language processing.

• Prominent technology companies (like Google and Facebook) and quantitative hedge funds (like Renaissance
Technologies and TwoSigma) have proven the power of modern data analytics.
First, we will try to explain how good data scientists think, and how this differs from
the mindset of traditional programmers and software developers. Second, we will look
at data sets in terms of the potential for what they can be used for, and learn to ask the
broader questions they are capable of answering.
Appreciating Data
Computer Scientists do not naturally appreciate
data: it’s just stuff to run through a program.
The usual way to test algorithm performance is to
run the implementation on “random data”.
But interesting data sets are a scarce resource,
which requires hard work and imagination to
obtain.
Computer vs. Real Scientists (1)
● Scientists strive to understand the complicated
and messy natural world, while computer
scientists build their own clean and organized
virtual worlds. Thus:
● Nothing is ever completely true or false in
science, while everything is either true or
false in Computer Science / Mathematics.
Computer vs. Real Scientists (2)
● Scientists are data-driven, while computer
scientists are algorithm-driven.
● Scientists obsess about discovering things,
which computer scientists invent rather than
discover.
● Scientists are comfortable with the idea that
data has errors; computer scientists are not.
Examples of the cultural differences between computer science and real science include:

• Data vs. method centrism: Scientists are data driven, while computer scientists are algorithm
driven. Real scientists spend enormous amounts of effort collecting data to answer their
question of interest. They invent fancy measuring devices, stay up all night tending to
experiments, and devote most of their thinking to how to get the data they need.
By contrast, computer scientists obsess about methods: which algorithm is better than which
other algorithm, which programming language is best for a job, which program is better than
which other program. The details of the data set they are working on seem comparatively
unexciting.
• Concern about results: Real scientists care about answers. They analyze data to
discover something about how the world works. Good scientists care about
whether the results make sense, because they care about what the answers mean.
By contrast, bad computer scientists worry about producing plausible-looking
numbers. As soon as the numbers stop looking grossly wrong, they are presumed
to be right. This is because they are personally less invested in what can be
learned from a computation, as opposed to getting it done quickly and
efficiently.
• Robustness: Real scientists are comfortable with the idea that data has errors. In general, computer scientists are not. Scientists
think a lot about possible sources of bias or error in their data, and how these possible problems can affect the conclusions derived
from them. Good programmers use strong data-typing and parsing methodologies to guard against formatting errors, but the
concerns here are different.

Becoming aware that data can have errors is empowering. Computer scientists chant “garbage in, garbage out” as a defensive
mantra to ward off criticism, a way to say that’s not my job. Real scientists get close enough to their data to smell it, giving it the
sniff test to decide whether it is likely to be garbage.

• Precision: Nothing is ever completely true or false in science, while everything is either true or false in computer science or
mathematics.

Generally speaking, computer scientists are happy printing floating point numbers to as many digits as possible: 8/13 =
0.61538461538. Real scientists will use only two significant digits: 8/13 ≈ 0.62. Computer scientists care what a number is, while
real scientists care what it means.
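
The 8/13 example can be reproduced with a short sketch. The helper name `round_sig` is hypothetical, introduced here only for illustration; it rounds a value to a chosen number of significant digits.

```python
from math import floor, log10

def round_sig(x: float, digits: int = 2) -> float:
    """Round x to the given number of significant digits (hypothetical helper)."""
    if x == 0:
        return 0.0
    # The position of the leading digit decides how many decimals to keep.
    exponent = floor(log10(abs(x)))
    return round(x, digits - 1 - exponent)

ratio = 8 / 13
full = f"{ratio:.11f}"     # the "as many digits as possible" view
short = round_sig(ratio)   # the two-significant-digit view

print(full)   # 0.61538461538
print(short)  # 0.62
```

The same helper works away from the decimal point: `round_sig(1234.5)` keeps only the two leading digits and gives 1200.0.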
Genius vs. Wisdom
Software developers are hired to produce code.
Data Scientists are hired to produce insights.
Genius shows in finding the right answer.
Wisdom shows in avoiding the wrong answers.
Data science (like most things) benefits more
from wisdom than from genius.
Developing Wisdom
● Wisdom comes from experience.
● Wisdom comes from general knowledge.
● Wisdom comes from listening to others.
● Wisdom comes from humility, observing how
often you have been wrong and why/how.
I seek to pass on wisdom, through my experience of
the difficulty of making good predictions.
Developing Curiosity
● The good data scientist develops a curiosity
about the domain/application they are working
in.
● They talk shop with the people whose data
they are working on.
● They read the newspaper every day, to get a
broader perspective on the world.
Asking Good Questions
Software developers are not encouraged to ask
questions, but data scientists are:
● What exciting things might you be able to
learn from a given data set?
● What things do you/your people really want to
know?
● What data sets might get you there?
Let’s Practice Asking Questions!
Who, What, Where, When, and Why on the
following datasets:
● Baseball-reference.com
● Internet Movie Database (IMDb)
● Google ngrams
● NYC taxi cab records
Baseball-Reference.com: biosketch
Statistical Record of Play
Summary statistics of each year's batting, pitching, and
fielding record, with teams and awards.
Baseball Questions
● How can we best measure an individual player’s skill,
value, or performance?
● How fairly do trades between teams work out?
● What is the trajectory of a player’s
performance as they mature and age?
● To what extent does batting performance
correlate with the position played?
Demographic Questions
● Do left-handed people have shorter lifespans
than right-handers?
● How often do people return to where they
were born?
● Do player salaries reflect past, present, or
future performance?
● Are heights and weights increasing in the
population?
There are two particular themes to be aware of here. First, the identifiers and reference
tags (i.e. the metadata) often prove more interesting in a data set than the stuff we are
supposed to care about, here the statistical record of play.
Second is the idea of a statistical proxy, where you use the data set you have to substitute
for the one you really want. The data set of your dreams likely does not exist, or may be
locked away behind a corporate wall even if it does. A good data scientist is a pragmatist,
seeing what they can do with what they have instead of bemoaning what they cannot get
their hands on.
IMDb: Movie Data
IMDb: Actor Data
Movie Questions
● Can we predict how well people will like a
movie? What about its gross?
● What does the social network of actors look
like? (Six degrees of Kevin Bacon)
● What is the age distribution of actors and
actresses in film?
● Do stars live longer or shorter lives than the bit
players or public?
• Which actors appeared in the most films? Earned the most money? Appeared in the lowest rated
films? Had the longest career or the shortest lifespan?

• What was the highest rated film each year, or the best in each genre? Which movies lost the most
money, had the highest-powered casts, or got the least favorable reviews?

• How well does movie gross correlate with viewer ratings or awards? Do customers instinctively
flock to trash, or is virtue on the part of the creative team properly rewarded?

• How do Hollywood movies compare to Bollywood movies, in terms of ratings, budget, and gross?
Are American movies better received than foreign films, and how does this differ between U.S. and
non-U.S. reviewers?
Google Ngrams
● Presents an annual time series of how often
every “popular” word or phrase of 1 to 5
words occurs in scanned books.
● “Popular” means appearing more than 40 times in total.
● Google has scanned about 15% of all books
ever published, making this resource quite
comprehensive.
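
As a rough illustration of what such a table contains, the sketch below counts every phrase of 1 to `max_n` words in a token list and keeps those above a popularity threshold. The function names, the toy sentence, and the threshold of 1 are assumptions made for this example; the real resource uses the more-than-40 cutoff and also breaks counts down by year, which this sketch ignores.

```python
from collections import Counter

def count_ngrams(tokens, max_n=5):
    """Count every run of 1..max_n consecutive tokens (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def popular(counts, min_count):
    """Keep only the n-grams that appear more than min_count times."""
    return {gram: c for gram, c in counts.items() if c > min_count}

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens, max_n=2)

print(popular(counts, 1))  # {'the': 3, 'cat': 2, 'the cat': 2}
```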
Google Ngram Viewer
Ngram Questions
● How has the amount of cursing changed
over time?
● What is the lifespan of fame and
technologies? Is it increasing/decreasing?
● How often do new words emerge? Do
they stay in common usage?
● What words are associated with other words,
i.e. can you build a language model?
NYC Taxi Cab Data
● Gives driver/owner, pickup/dropoff location,
and fare data for every taxi trip taken.
● Data obtained from NYC via Freedom of
Information Act Request (FOIA)
Taxicab Questions
● How much do drivers make each night?
● How far do they travel?
● How much slower is traffic during rush hour?
● Where are people traveling to/from at
different times of the day?
● Do faster drivers get tipped better?
● Where should drivers go to pick up their next
fare?
Properties of Data
1 Structured vs. Unstructured Data

Certain data sets are nicely structured, like the tables in a database or spreadsheet program. Others record information about
the state of the world, but in a more heterogeneous way. Perhaps it is a large text corpus with images and links like
Wikipedia, or the complicated mix of notes and test results appearing in personal medical records.

Data is often represented by a matrix, where the rows of the matrix represent distinct items or records, and the columns
represent distinct properties of these items. For example, a data set about U.S. cities might contain one row for each city, with
columns representing features like state, population, and area.

When confronted with an unstructured data source, such as a collection of tweets from Twitter, our first step is generally to
build a matrix to structure it. A bag of words model will construct a matrix with a row for each tweet, and a column for each
frequently used vocabulary word. Matrix entry M[i,j] then denotes the number of times tweet i contains word j.
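
A minimal sketch of the bag of words construction follows. The function name `bag_of_words`, the whitespace tokenizer, the `min_count` cutoff for "frequently used" words, and the three toy tweets are all assumptions made for illustration; a real pipeline would need more careful tokenization.

```python
from collections import Counter

def bag_of_words(docs, min_count=1):
    """Return (vocab, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # Total frequency of each word across all documents.
    totals = Counter(word for tokens in tokenized for word in tokens)
    # One column per word used at least min_count times.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    column = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            if word in column:
                matrix[i][column[word]] += 1
    return vocab, matrix

# Toy tweets, invented for illustration only.
tweets = ["data data science", "science is fun", "fun with data"]
vocab, M = bag_of_words(tweets, min_count=2)

print(vocab)  # ['data', 'fun', 'science']
print(M)      # [[2, 0, 1], [0, 1, 1], [1, 1, 0]]
```

Here M[0][0] is 2 because the first tweet contains "data" twice, while "is" and "with" get no column because each appears only once overall.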
2 Quantitative vs. Categorical Data

Quantitative data consists of numerical values, like height and weight. Such data can be incorporated directly into algebraic formulas and
mathematical models, or displayed in conventional graphs and charts.

By contrast, categorical data consists of labels describing the properties of the objects under investigation, like gender, hair color, and occupation.
This descriptive information can be every bit as precise and meaningful as numerical data, but it cannot be worked with using the same techniques.

Categorical data can usually be coded numerically. For example, gender might be represented as male = 0 or female = 1. But things get more
complicated when there are more than two possible values per feature, especially when there is not an implicit order between them. We may be able to encode
hair colors as numbers by assigning each shade a distinct value like gray hair = 0, red hair = 1, and blond hair = 2. However, we cannot really treat
these values as numbers, for anything other than simple identity testing. Does it make any sense to talk about the maximum or minimum hair color?
What is the interpretation of my hair color minus your hair color?
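
One common way around this problem, not named in the text above, is one-hot encoding: each category gets its own 0/1 column, so no artificial order or arithmetic is implied. The sketch below is a minimal version under that assumption; the function name `one_hot` and the sample list are hypothetical.

```python
def one_hot(values):
    """Encode each label as a 0/1 row with a single 1 in its category's column."""
    categories = sorted(set(values))
    column = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[column[v]] = 1
        rows.append(row)
    return categories, rows

# Sample labels, reusing the hair colors from the text.
hair = ["gray", "red", "blond", "red"]
categories, encoded = one_hot(hair)

print(categories)  # ['blond', 'gray', 'red']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

With this representation the only meaningful comparison between two rows is whether they are equal, which matches the identity testing the text describes.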


3 Big Data vs. Little Data

Data science has become conflated in the public eye with big data, the analysis of massive data sets resulting from computer logs and sensor devices. In
principle, having more data is always better than having less, because you can always throw some of it away by sampling to get a smaller set if necessary.
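
Throwing data away by sampling can be sketched in a few lines with the standard library. The function name `sample_records`, the fixed seed, and the stand-in data set of one million integers are assumptions for illustration; uniform sampling without replacement is only one of several possible sampling schemes.

```python
import random

def sample_records(records, k, seed=0):
    """Draw k distinct records uniformly at random; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, k)

# Stand-in for a large data set, invented for illustration only.
records = list(range(1_000_000))
subset = sample_records(records, 1000)

print(len(subset))  # 1000
```

Fixing the seed means the same subset is drawn on every run, which keeps later analyses of the sample reproducible.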

The challenges of big data include:

• The analysis cycle time slows as data size grows: Computational operations on data sets take longer as their volume increases. Small spreadsheets
provide instantaneous response, allowing you to experiment and play “what if?” But large spreadsheets can be slow and clumsy to work with, and
massive-enough data sets might take hours or days to get answers from. Clever algorithms can permit amazing things to be done with big data, but
staying small generally leads to faster analysis and exploration.

• Large data sets are complex to visualize: Plots with millions of points on them are impossible to display on computer screens or printed images, let alone
conceptually understand. How can we ever hope to really understand something we cannot see?

• Simple models do not require massive data to fit or evaluate: A typical data science task might be to make a decision (say, whether I should offer this
fellow life insurance) on the basis of a small number of variables: say age, gender, height, weight, and the presence or absence of existing medical
conditions.
