Module 3_(Prepare Data for Exploration)

Uploaded by

lostbilla66

Google data analytics professional course

Week - 1
Collecting Data

Data collection considerations


● How the data will be collected
● Data sources
● Decide what data to use
● How much data to collect
● Select the right data type
● Determine the time frame for data collection
First-party data is data collected by an individual or group using their
own resources. Collecting first-party data is typically the preferred method
because you know exactly where it came from.
Second-party data is data collected by a group directly from its audience
and then sold. For example, if you aren't able to collect your own traffic
data, you might buy it from an organization that has led traffic pattern
studies in your city.
Third-party data is data collected from outside sources that did not
collect it directly. This data might have come from a number of different
sources before you investigated it.

Differentiate between data formats and data structures

Discover data formats


Two types of data
Quantitative
Qualitative
We can go even deeper into quantitative data and break it down into discrete
or continuous data.

Quantitative
● Discrete data:
This is data that's counted and has a limited number of values.
● Continuous data:
This is data that's measured, for example with a timer, and its value can
be shown as a decimal with several places. Eg: You could express a movie's
run time as 110.0356 minutes, and add more digits after the decimal point
if you needed to.

Qualitative

● Nominal data: categorized without a set order. Eg: yes or no type data
● Ordinal data: has a set order or scale. Eg: Rating between some values

Other types of data:


Internal data is data that lives within a company's own systems.
External data is data that lives and is generated outside of an
organization.
Structured data is data that's organized in a certain format, such as rows
and columns. This also makes data visualization easier, because structured
data can be applied directly to charts, graphs, heat maps, dashboards, and
most other visual representations of data.
Unstructured data is data that is not organized in any easily identifiable
manner. Audio and video files are examples of unstructured data because
there's no clear way to identify or organize their content.
Understanding structured data

Data model
A model that is used for organizing data elements and how they relate to
one another.

Data elements
They're pieces of information, such as people's names, account numbers, and
addresses.

The structure of data


Data modeling levels and techniques

Types of data modeling

Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but
two common methods are the Entity Relationship Diagram (ERD) and the
Unified Modeling Language (UML) diagram.

Explore data types, fields and values

Know the type of data you're working with

Data type
A data type is a specific kind of data attribute that tells what kind of value
the data is.
Basic data types:
● String
● Number
● Boolean

Understanding Boolean logic


● AND
● OR
● NOT
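As a small illustration of the three operators, the Python sketch below filters a few made-up records. The field names and values are assumptions invented for this example, not data from the course.

```python
# Hypothetical records; the field names and values are invented for illustration.
shoes = [
    {"name": "A", "color": "grey", "waterproof": True},
    {"name": "B", "color": "pink", "waterproof": False},
    {"name": "C", "color": "grey", "waterproof": False},
    {"name": "D", "color": "blue", "waterproof": True},
]

# AND: both conditions must be true.
grey_and_waterproof = [
    s["name"] for s in shoes if s["color"] == "grey" and s["waterproof"]
]

# OR: at least one condition must be true.
grey_or_pink = [
    s["name"] for s in shoes if s["color"] == "grey" or s["color"] == "pink"
]

# NOT: the condition must be false.
not_waterproof = [s["name"] for s in shoes if not s["waterproof"]]

print(grey_and_waterproof)  # ['A']
print(grey_or_pink)         # ['A', 'B', 'C']
print(not_waterproof)       # ['B', 'C']
```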
Additional Reading/Resources
● https://www.maa.org/press/periodicals/convergence/origins-of-boolean-algebra-in-the-logic-of-classes-george-boole-john-venn-and-c-s-peirce
● https://libguides.mit.edu/c.php?g=175963&p=1158594

Meet wide and long data

Transforming data
Data transformation is the process of changing the data’s format,
structure, or values. As a data analyst, there is a good chance you will need
to transform data at some point to make it easier for you to analyze it.

Data transformation usually involves:

● Adding, copying, or replicating data


● Deleting fields or records
● Standardizing the names of variables
● Renaming, moving, or combining columns in a database
● Joining one set of data with another
● Saving a file in a different format. For example, saving a spreadsheet
as a comma separated values (CSV) file.
Goals for data transformation might be:

● Data organization: better organized data is easier to use


● Data compatibility: different applications or systems can then use the
same data
● Data migration: data with matching formats can be moved from one
system to another
● Data merging: data with the same organization can be merged together
● Data enhancement: data can be displayed with more detailed fields
● Data comparison: apples-to-apples comparisons of the data can then be
made
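
Two of the transformations listed above, restructuring a table and saving it in a different format, can be sketched in Python with only the standard library. The table contents, the column names, and the helper `wide_to_long` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical "wide" table: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def wide_to_long(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(
    wide_rows, id_field="country", var_name="year", value_name="population"
)

# Save the long table as comma separated values (written to memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "year", "population"])
writer.writeheader()
writer.writerows(long_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The long form has one row per country and year, which is often easier to filter and chart than the wide form.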

Week - 2
Unbiased and objective data

Bias: From questions to conclusions


Bias has evolved to become a preference in favor of or against a person,
group of people, or things.
Data bias is a type of error that systematically skews results in a certain
direction.
Biased and unbiased data
Sampling bias is when a sample isn't representative of the population
as a whole. You can avoid this by making sure the sample is chosen at
random, so that all parts of the population have an equal chance of being
included.
Unbiased sampling results in a sample that's representative of the
population being measured.
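
A minimal Python sketch of choosing a sample at random, so that every member has the same chance of selection. The population of 1,000 numbered IDs is an assumption made for the example, and the fixed seed is there only to make the sketch repeatable.

```python
import random

# Hypothetical population: ID numbers for 1,000 people.
population = list(range(1, 1001))

# A convenience sample (the first 100 IDs) leaves out most of the population.
convenience_sample = population[:100]

# A random sample gives every ID the same chance of being picked.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, k=100)

print(max(convenience_sample))  # 100: IDs above 100 can never appear
print(len(random_sample), len(set(random_sample)))  # 100 100: no repeats
```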

Understanding bias in data


Besides sampling bias, there are three more types of data bias:
● observer bias (or Experimenter bias or research bias)
● interpretation bias
● confirmation bias
Observer bias
The tendency for different people to observe things differently.
Interpretation bias
The tendency to always interpret ambiguous situations in a positive or
negative way. Eg: A friend and a manager can read the same message and
understand it differently.
Confirmation bias
It is the tendency to search for, or interpret information in a way that
confirms preexisting beliefs.
Explore data credibility

Identifying good data sources

ROCCC

● R- Reliable
● O- Original
● C- Comprehensive
● C- Current
● C- Cited

For good data, stick with vetted public data sets, academic papers,
financial data, and governmental agency data.

What is "bad" data?

Bad data is data that is not ROCCC: not reliable, original, comprehensive,
current, or cited.

Data ethics and privacy

Introduction to data ethics

Ethics refers to well-founded standards of right and wrong that prescribe
what humans ought to do, usually in terms of rights, obligations, benefits to
society, fairness or specific virtues.
Data ethics refers to well-founded standards of right and wrong that
dictate how data is collected, shared, and used.

GDPR - General Data Protection Regulation of the European Union


There are lots of different aspects of data ethics but we'll cover six:
● ownership,
● transaction transparency,
● consent,
● currency,
● privacy, and
● openness.

Ownership
Individuals own the raw data they provide, and they have primary
control over its usage, how it's processed and how it's shared.

Transaction transparency
This is the idea that all data processing activities and algorithms should
be completely explainable and understood by the individual who provides
their data.

Consent
This is an individual's right to know explicit details about how and why
their data will be used before agreeing to provide it. They should know
answers to questions like why is the data being collected? How will it be
used? How long will it be stored? The best way to give consent is probably a
conversation between the person providing the data and the person
requesting it.
Currency
Individuals should be aware of financial transactions resulting from the
use of their personal data and the scale of these transactions.

Privacy means preserving a data subject's information and activity any
time a data transaction occurs. This includes:
● Protection from unauthorized access to our private data,
● Freedom from inappropriate use of our data,
● The right to inspect, update, or correct our data,
● Ability to give consent to use our data, and
● Legal right to access our data.

Data anonymization
Personally identifiable information (PII) is information that can be used
by itself or with other data to track down a person's identity.

Data anonymization
Data anonymization is the process of protecting people's private or
sensitive data by eliminating that kind of information. Typically, data
anonymization involves blanking, hashing, or masking personal information,
often by using fixed-length codes to represent data columns, or hiding data
with altered values.
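
The three techniques named above (blanking, hashing, and masking) can be sketched in a few lines of Python. The record, the salt, and the helper names are hypothetical. This is an illustration of the idea, not a vetted anonymization procedure; a real pipeline would need a secret salt and a review of re-identification risk.

```python
import hashlib


def mask_phone(phone: str) -> str:
    """Masking: hide everything except the last four characters."""
    return "*" * (len(phone) - 4) + phone[-4:]


def hash_value(value: str, salt: str) -> str:
    """Hashing: replace the value with a fixed-length SHA-256 hex code."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical record, invented for illustration.
record = {"name": "Jane Doe", "phone": "555-0100", "city": "Springfield"}

anonymized = {
    "name": "",                              # blanking
    "phone": mask_phone(record["phone"]),    # masking
    "customer_code": hash_value(record["name"], salt="example-salt"),  # hashing
    "city": record["city"],                  # left as-is
}

print(anonymized["phone"])               # ****0100
print(len(anonymized["customer_code"]))  # 64: the code length is fixed
```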

What types of data should be anonymized?


Healthcare and financial data are two of the most sensitive types of data.
Data in these two industries usually goes through de-identification, which
is a process used to wipe data clean of all personally identifying
information. Data anonymization is used in just about every industry.
Data that is often anonymized includes:
● Telephone numbers
● Names
● License plates and license numbers
● Social security numbers
● IP addresses
● Medical records
● Email addresses
● Photographs
● Account numbers

Understanding open data


Openness refers to free access, usage, and sharing of data, preferably by
downloading over the Internet in a convenient and modifiable form.
Eg: data.gov
To be considered open, data must:
● Be available and accessible to the public as a complete dataset
● Be provided under terms that allow it to be reused and redistributed
● Allow universal participation so that anyone can use, reuse, and
redistribute the data

Data interoperability is the ability of data systems and services to openly
connect and share data.
Sites and resources for open data
● https://www.data.gov/
● https://www.census.gov/data.html
● https://www.opendatanetwork.com/
● https://cloud.google.com/solutions/datasets
● https://datasetsearch.research.google.com/

Week-3
Working with databases

All about databases


Database
A database is a collection of data stored in a computer system.

Metadata
Metadata is data about data.

Database features
Relational database
A relational database is a database that contains a series of related tables
that can be connected via their relationships.

Primary key
A primary key is an identifier that references a column in which each value
is unique.
A table can only have one primary key.
Foreign key
A foreign key is a field within a table that's a primary key in another table.
A table in a relational database is allowed to have multiple foreign keys.
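
To make the two kinds of key concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical. SQLite is used only because it ships with Python, and it enforces foreign keys only after the `PRAGMA` shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# customer_id is the primary key: each value identifies exactly one customer.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# orders.customer_id is a foreign key: it points at the primary key of customers.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        item TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 'notebook')")

# The shared key is what lets the two tables be connected.
rows = conn.execute(
    """
    SELECT o.order_id, c.name, o.item
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    """
).fetchall()
print(rows)  # [(100, 'Ada', 'notebook')]

# An order that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 'pen')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```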

Managing data with metadata

Exploring metadata
Metadata summarizes basic information about data.
There are three common types of metadata:
● descriptive,
● structural, and
● administrative
Descriptive metadata:
It is metadata that describes a piece of data and can be used to
identify it at a later point in time.
Eg: The descriptive metadata of a book in a library would include the code
you see on its spine, known as a unique International Standard Book Number,
also called the ISBN.

Structural metadata
It is metadata that indicates how a piece of data is organized and
whether it's part of one or more than one data collection.
Eg: Index of a book

Administrative metadata
It is metadata that indicates the technical source of a digital asset.
Eg: The details of a photo include its size, time taken, height, width, etc.
Metadata is as important as the data itself
Metadata tells the who, what, when, where, which, how, and why of data.

Elements of metadata

● Title and description


● Tags and categories
● Who created it and when
● Who last modified it and when
● Who can access or update it
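
Some of these elements can be read straight from the file system. The sketch below creates a throwaway file and collects a few metadata fields about it with the standard library. The file name and contents are invented, and which fields count as useful metadata is an assumption for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical data file, created only so there is something to inspect.
    path = os.path.join(folder, "sales_2021.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("region,total\nnorth,10\n")

    info = os.stat(path)
    metadata = {
        "title": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

print(metadata["title"], metadata["size_bytes"])  # sales_2021.csv 22
```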

Using metadata as an analyst


A metadata repository is a database specifically created to store metadata.
Analysts use metadata because it tells them what the data is.

Metadata management
Data governance is a process to ensure the formal management of a
company’s data assets.

Accessing different data sources

Working with more data sources


● Internal
● External

Data from an internal source

Data from spreadsheet to spreadsheet
Eg:
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1utuuy9wrDP0g6TbkBzZgKIU6qVzPV7q2dLKm5urU_x4/edit#gid=0", "A1:O12")

From external source to a spreadsheet

Import from files such as CSV or XLSX,
or
use the spreadsheet function =IMPORTHTML("url", "table", 1)

GUIDE
● https://www.thedataschool.co.uk/anna-prosvetova/web-scraping-made-easy-import-html-tables-or-lists-using-google-sheets-and-excel/

TABLE IN WEBSITE
● https://en.wikipedia.org/wiki/Demographics_of_India

Eg:
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",1)
● IMPORTHTML can pull data only from a table or a list.
● The number is the index, starting at 1, that refers to the order of the
tables on a web page.
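
To show what the table index means, the Python sketch below collects every table in a small HTML snippet and returns the one at a given 1-based position. It is not how Google Sheets implements IMPORTHTML. The class `TableCollector`, the function `nth_table`, and the sample page are all hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table>, in the order they appear."""

    def __init__(self):
        super().__init__()
        self.tables = []       # each table is a list of rows (lists of cells)
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data.strip()


def nth_table(html, index):
    """Return the table at a 1-based index, the way IMPORTHTML counts."""
    parser = TableCollector()
    parser.feed(html)
    return parser.tables[index - 1]


# Hypothetical page with two tables.
page = """
<table><tr><th>Year</th></tr><tr><td>2011</td></tr></table>
<table><tr><th>State</th><th>Population</th></tr>
<tr><td>X</td><td>100</td></tr></table>
"""
print(nth_table(page, 2))  # [['State', 'Population'], ['X', '100']]
```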

Microsoft Excel
You can import data from web pages using the From Web option:

Step 1: Open a new or existing spreadsheet.

Step 2: Click Data in the main menu and select the From Web option.

Step 3: Enter the URL and click OK.

Step 4: In the Navigator, select which table to import.

Step 5: Click Load to load the data from the table into your spreadsheet.

Importing data from spreadsheets and databases


Data from the websites
https://www.who.int/data/gho/

Exploring public datasets


● https://cloud.google.com/solutions/datasets
● https://datasetsearch.research.google.com/
● https://www.kaggle.com/datasets
● https://cloud.google.com/bigquery/public-data

Public health datasets


● https://www.who.int/data/collections
● https://cloud.google.com/healthcare/docs/resources/public-datasets/tcia
● https://cloud.google.com/life-sciences/docs/resources/public-datasets/1000-genomes

Public climate datasets


● https://www.climate.gov/maps-data/all?listingMain=datasetgallery
● https://www.ncei.noaa.gov/weather-climate-links

Public social-political datasets

● https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/
● https://www.bls.gov/cps/tables.htm
● https://openpolicing.stanford.edu/

Sorting and filtering

Sort and filter in spreadsheet

Working with large datasets in SQL


BigQuery commands
SELECT
FROM
WHERE

SELECT
  COUNT(*) AS num_of_bikestrips -- or COUNT(duration), which counts only rows where duration is not NULL
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
  duration >= 1200;

SELECT
  name,
  count
FROM
  `babynames.names_2014`
WHERE
  gender = 'M'
ORDER BY
  count DESC -- count in descending order
LIMIT
  5;
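
The same SELECT / FROM / WHERE pattern can be tried without a BigQuery account by using Python's built-in `sqlite3` module. The table below only imitates the shape of `names_2014`: its rows and counts are invented for this sketch, and SQLite's dialect differs from BigQuery's in other respects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names_2014 (name TEXT, gender TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO names_2014 VALUES (?, ?, ?)",
    # Invented rows; these are not real baby-name counts.
    [("Noah", "M", 300), ("Emma", "F", 320), ("Liam", "M", 280), ("Mason", "M", 250)],
)

# Filter with WHERE, sort with ORDER BY, and keep the top rows with LIMIT.
top_names = conn.execute(
    """
    SELECT name, count
    FROM names_2014
    WHERE gender = 'M'
    ORDER BY count DESC  -- count in descending order
    LIMIT 2
    """
).fetchall()
print(top_names)  # [('Noah', 300), ('Liam', 280)]

# COUNT(*) returns the number of rows that pass the WHERE filter.
num_rows = conn.execute(
    "SELECT COUNT(*) FROM names_2014 WHERE count >= 280"
).fetchone()[0]
print(num_rows)  # 3
```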

In-depth BigQuery

Dialects
Vendors of SQL databases may use slightly different variations of SQL.
These variations are called SQL dialects.
● Some dialects compare text without regard to case. MySQL and SQL Server
do this under their default collations, so a search for
country_code = 'us' returns all entries that have 'us', 'uS', 'Us', and
'US'.
● BigQuery is case sensitive, so that same search would only return
entries where the country_code is exactly 'us'. PostgreSQL also compares
text case-sensitively by default.
● In BigQuery, string values can be written with single or double
quotation marks: ' ' or " ".
● Two hyphens (--) start a comment line.
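
The effect of case sensitivity on a text filter can be seen with Python's built-in `sqlite3` module, whose default text comparison is case sensitive. The `visits` table and its rows are hypothetical, and SQLite stands in here only because it needs no setup. Lowercasing the column with `LOWER()` is one common way to match regardless of case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (country_code TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [("us",), ("US",), ("Us",), ("uS",), ("ca",)],  # invented rows
)

# Case-sensitive comparison: only the exact spelling 'us' matches.
exact = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE country_code = 'us'"
).fetchone()[0]

# Lowercasing the column first matches every spelling of 'us'.
any_case = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE LOWER(country_code) = 'us'"
).fetchone()[0]

print(exact, any_case)  # 1 4
```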

Write neatly

Multi line command


Week-4
Effectively organize data
Benefits of organizing data
● It makes it easier to find and use,
● Helps you avoid making mistakes during your analysis and
● Helps to protect it.

Best practices you can use when organizing data, including


● Naming conventions,
● Foldering, and
● Archiving older files.
● Align your naming and storage practices with your team to avoid any
confusion
● Develop metadata practices

All about file naming


Naming conventions
Naming conventions are consistent guidelines that describe the content,
date, or version of a file in its name.
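
One way to keep file names consistent is to build them with a small helper. The Python sketch below is only an illustration: the pattern (project, description, date as YYYYMMDD, two-digit version) and the function `build_filename` are assumptions, not a standard prescribed by the course.

```python
import re
from datetime import date


def build_filename(project, description, file_date, version, extension):
    """Join content, date, and version into one consistent file name."""

    def slug(text):
        # Capitalize each word, then drop spaces and punctuation.
        return re.sub(r"[^A-Za-z0-9]+", "", text.title())

    return (
        f"{slug(project)}_{slug(description)}_"
        f"{file_date:%Y%m%d}_v{version:02d}.{extension}"
    )


name = build_filename("sales report", "west region", date(2021, 10, 9), 2, "csv")
print(name)  # SalesReport_WestRegion_20211009_v02.csv
```

Putting the date in year-month-day order means the files also sort chronologically by name.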

Securing data
Data security
Data Security means protecting data from unauthorized access or
corruption by adopting safety measures.
Balancing security and analytics

Encrypt data after analysis

Week-5
Create or enhance your online presence

Why an online presence is important


A professional online presence
● helps potential employers find you,
● lets you make connections with other data analysts in your field,
● lets you learn and share data findings, and
● maybe even lets you participate in community events.
Network building
● LinkedIn
● Instagram
● GitHub
● Kaggle
● Dataelixir https://dataelixir.com/newsletter-archives/
● Meetup https://www.meetup.com/topics/data-analytics/
● Tableau https://www.tableau.com/learn/series/how-we-do-data
● KDnuggets https://www.kdnuggets.com/meetings/index.html
● Conference https://www.digitalanalyticsassociation.org/
● Data Science Assn https://www.datascienceassn.org/
Quick Review
Week-1
★ Collect data

Week-2
★ Checking data (good or bad, biased or unbiased)

Week-3
★ Database

Week-4
★ Organize the data

Week-5
★ Network building

Dhamodharan
09/10/2021
