Module 3_(Prepare Data for Exploration)
Week-1
Collecting Data
Quantitative
● Discrete data:
This is data that's counted and has a limited number of values.
● Continuous data:
This is data that is measured and can take almost any numeric value. Time,
for example, can be measured using a timer, and its value can be shown as a
decimal with several places. Example: You could express a movie's run time
as 110.0356 minutes. You could even add more fractional digits after the
decimal point if you needed to.
Qualitative
Data model
A model that is used for organizing data elements and how they relate to
one another.
Data elements
They're pieces of information, such as people's names, account numbers, and
addresses.
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but
two common methods are the Entity Relationship Diagram (ERD) and the
Unified Modeling Language (UML) diagram.
Data type
A data type is a specific kind of data attribute that tells what kind of value
the data is.
Basic data types:
● String
● Number
● Boolean
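As a rough illustration, the three basic data types map onto Python values as shown below. The record and its field names are invented for this sketch and are not part of the course material.

```python
# Hypothetical record; field names and values are invented for illustration.
record = {
    "customer_name": "Asha",  # string: text
    "items_bought": 3,        # number: a whole count
    "total_paid": 24.75,      # number: a decimal value
    "is_member": True,        # boolean: True or False
}

# Map each field to the name of its Python type.
field_types = {field: type(value).__name__ for field, value in record.items()}
print(field_types)
```

Python splits "Number" into int and float; a spreadsheet or database may use different names for the same idea.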
Transforming data
Data transformation is the process of changing the data’s format,
structure, or values. As a data analyst, there is a good chance you will need
to transform data at some point to make it easier for you to analyze it.
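A minimal sketch of one such transformation, assuming the raw values arrive as text with dates in day/month/year order. The rows and field names are hypothetical.

```python
from datetime import datetime

# Hypothetical raw rows, as they might arrive from a spreadsheet export.
raw_rows = [
    {"date": "09/10/2021", "duration_min": "110.0356"},
    {"date": "15/10/2021", "duration_min": "95.5"},
]

def transform(row):
    """Change the date layout and convert the duration from text to a number."""
    parsed = datetime.strptime(row["date"], "%d/%m/%Y")  # assumes day/month/year
    return {
        "date": parsed.strftime("%Y-%m-%d"),         # format change: ISO 8601 layout
        "duration_min": float(row["duration_min"]),  # type change: string to number
    }

clean_rows = [transform(row) for row in raw_rows]
print(clean_rows)
```

After the transformation the dates sort correctly as text and the durations can be summed or averaged.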
Week-2
Unbiased and objective data
ROCCC
● R- Reliable
● O- Original
● C- Comprehensive
● C- Current
● C- Cited
For good data, stick with vetted public data sets, academic papers, financial
data and governmental agency data.
Ownership
Individuals own the raw data they provide, and they have primary
control over its usage, how it's processed and how it's shared.
Transaction transparency
The idea that all data processing activities and algorithms should
be completely explainable and understood by the individual who provides
their data.
Consent
This is an individual's right to know explicit details about how and why
their data will be used before agreeing to provide it. They should know
answers to questions like why is the data being collected? How will it be
used? How long will it be stored? The best way to give consent is probably a
conversation between the person providing the data and the person
requesting it.
Currency
Individuals should be aware of financial transactions resulting from the
use of their personal data and the scale of these transactions.
Data anonymization
Personally identifiable information - PII
Data anonymization
Data anonymization is the process of protecting people's private or
sensitive data by eliminating that kind of information. Typically, data
anonymization involves blanking, hashing, or masking personal information,
often by using fixed-length codes to represent data columns, or hiding data
with altered values.
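The three techniques named above can be sketched as follows. This is an illustration only, not a vetted anonymization scheme: the helper names, the salt, and the sample record are all assumptions made for the example.

```python
import hashlib

def blank_value(_value):
    """Blanking: drop the value entirely."""
    return ""

def hash_value(value, salt="demo-salt"):
    """Hashing: replace a value with a fixed-length SHA-256 hex code (64 characters)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_value(value, visible=4):
    """Masking: hide all but the last `visible` characters behind asterisks."""
    return "*" * max(len(value) - visible, 0) + value[-visible:]

# Hypothetical record containing personally identifiable information.
person = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "account_number": "1234567890",
}

anonymized = {
    "name": blank_value(person["name"]),
    "email": hash_value(person["email"]),
    "account_number": mask_value(person["account_number"]),
}
print(anonymized)
```

Hashing gives every value a code of the same length, which is the "fixed-length codes" idea mentioned above. Real projects usually need more than this, because simple hashes of guessable values can sometimes be reversed by trying candidates.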
Week-3
Working with databases
Metadata
Metadata is data about data.
Database features
Relational database
A relational database is a database that contains a series of related tables
that can be connected via their relationships.
Primary key
A primary key is an identifier that references a column in which each value
is unique.
A table can only have one primary key.
Foreign key
A foreign key is a field within a table that's a primary key in another table.
A table in a relational database is allowed to have multiple foreign keys.
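A small sketch of these key rules, using Python's built-in sqlite3 module purely for convenience. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each table has exactly one primary key.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, title TEXT)")

# The orders table has one primary key and two foreign keys.
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id INTEGER REFERENCES products(product_id)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO products VALUES (10, 'Notebook')")
conn.execute("INSERT INTO orders VALUES (100, 1, 10)")

# The keys are what let the related tables be connected.
joined = conn.execute("""
    SELECT o.order_id, c.name, p.title
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products AS p ON p.product_id = o.product_id
""").fetchall()
print(joined)

# A second row with the same primary key value is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```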
Exploring metadata
Metadata summarizes basic information about data.
There are three common types of metadata:
● descriptive,
● structural, and
● administrative
Descriptive metadata:
It is metadata that describes a piece of data and can be used to
identify it at a later point in time.
Eg: The descriptive metadata of a book in a library would include the code
you see on its spine, known as a unique International Standard Book Number,
also called the ISBN.
Structural metadata
It is metadata that indicates how a piece of data is organized and
whether it's part of one or more than one data collection.
Eg: Index of a book
Administrative metadata
It is metadata that indicates the technical source of a digital asset.
Eg: The details of a photo, which include its size, time, height, width, etc.
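To see this kind of metadata for a file, one option is Python's os.stat. The sketch below writes a throwaway temporary file (its contents are made up) and reads back a few technical details about it.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello metadata")
    path = handle.name

info = os.stat(path)  # data about the file, not the file's contents
file_metadata = {
    "size_bytes": info.st_size,
    "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    "extension": os.path.splitext(path)[1],
}
os.remove(path)  # clean up the temporary file
print(file_metadata)
```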
Metadata is as important as the data itself
Metadata tells the who, what, when, where, which, how, and why of data.
Elements of metadata
Metadata management
Data governance is a process to ensure the formal management of a
company’s data assets.
GUIDE
● https://www.thedataschool.co.uk/anna-prosvetova/web-scraping-made-easy-import-html-tables-or-lists-using-google-sheets-and-excel/
TABLE IN WEBSITE
● https://en.wikipedia.org/wiki/Demographics_of_India
Eg:
=IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",1)
● IMPORTHTML can pull data only from a table or a list on the page.
● The number is the index that refers to the order of the tables on a
web page.
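The index idea can be sketched in Python with the standard-library HTML parser. This is not how Google Sheets implements IMPORTHTML; the class, the function, and the sample page are assumptions made for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per table, in page order
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

def import_table(html, index):
    """Return the table at a 1-based index, counting tables in page order."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]

# Hypothetical page with two tables; the values are invented.
page = """
<table><tr><th>Year</th><th>Value</th></tr><tr><td>2001</td><td>10</td></tr></table>
<p>Some text between the tables.</p>
<table><tr><th>State</th></tr><tr><td>Kerala</td></tr></table>
"""

first_table = import_table(page, 1)
second_table = import_table(page, 2)
print(first_table)
print(second_table)
```

Changing the index from 1 to 2 returns the second table on the page, which is the same effect as changing the last argument of the formula.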
Microsoft Excel
You can import data from web pages using the From Web option:
Step 2: Click Data in the main menu and select the From Web option.
Step 5: Click Load to load the data from the table into your spreadsheet.
● https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/
● https://www.bls.gov/cps/tables.htm
● https://openpolicing.stanford.edu/
SELECT
count(*) as num_of_bikestrips -- or: count(duration) as num_of_bikestrips
FROM
`bigquery-public-data.london_bicycles.cycle_hire`
WHERE
duration >= 1200;
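The difference between the two count forms shows up only when a column contains missing values. The sketch below uses SQLite through Python's sqlite3 module with a tiny invented table, not the real BigQuery public dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cycle_hire (rental_id INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO cycle_hire VALUES (?, ?)",
    [(1, 600), (2, 1200), (3, 1500), (4, None), (5, 3000)],  # invented rows
)

# COUNT(*) counts rows; COUNT(duration) skips rows where duration is NULL.
count_rows, count_duration = conn.execute(
    "SELECT COUNT(*), COUNT(duration) FROM cycle_hire"
).fetchone()

# With the filter, NULL durations never pass the WHERE clause,
# so both forms would give the same answer here.
long_trips = conn.execute(
    "SELECT COUNT(*) AS num_of_bikestrips FROM cycle_hire WHERE duration >= 1200"
).fetchone()[0]

print(count_rows, count_duration, long_trips)
```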
SELECT
name,
count
FROM
`babynames.names_2014`
WHERE
gender = 'M'
ORDER BY
count DESC -- sort by count in descending order
LIMIT
5
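To try the same filter, sort, and limit pattern without a BigQuery project, a small SQLite table can stand in for it. The names and counts below are invented, and the count column is quoted only to avoid any clash with the COUNT function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE names_2014 (name TEXT, gender TEXT, "count" INTEGER)')
conn.executemany(
    "INSERT INTO names_2014 VALUES (?, ?, ?)",
    [
        ("Alex", "M", 50), ("Sam", "M", 70), ("Riya", "F", 90), ("Noah", "M", 190),
        ("Liam", "M", 180), ("Omar", "M", 10), ("Ravi", "M", 120), ("Ben", "M", 60),
    ],  # invented sample data
)

top_five = conn.execute("""
    SELECT name, "count"
    FROM names_2014
    WHERE gender = 'M'
    ORDER BY "count" DESC  -- largest counts first
    LIMIT 5
""").fetchall()
print(top_five)
```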
In-depth BigQuery
Dialects
Vendors of SQL databases may use slightly different variations of SQL.
These variations are called SQL dialects.
● With their default settings, MySQL and SQL Server compare strings without
regard to case. This means if you searched for country_code = 'us', it will
return all entries that have 'us', 'uS', 'Us', and 'US'.
● BigQuery is case sensitive, so that same search would only return
entries where the country_code is exactly 'us'. PostgreSQL string
comparisons are also case sensitive by default.
● In BigQuery, strings can be written with single or double quotation
marks: ' ' or " ".
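Because behavior depends on the dialect and its settings, it helps to test a comparison before relying on it. The sketch below uses SQLite (chosen only because it ships with Python), which compares text case-sensitively by default; the table and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (country_code TEXT)")
conn.executemany(
    "INSERT INTO stations VALUES (?)",
    [("us",), ("uS",), ("Us",), ("US",), ("in",)],
)

# Case-sensitive comparison: only the exact match is returned.
exact = conn.execute(
    "SELECT country_code FROM stations WHERE country_code = 'us'"
).fetchall()

# SQLite-specific way to ignore case in a comparison.
insensitive = conn.execute(
    "SELECT country_code FROM stations WHERE country_code = 'us' COLLATE NOCASE"
).fetchall()

# Normalizing with LOWER() gives the same result regardless of stored case.
normalized = conn.execute(
    "SELECT country_code FROM stations WHERE LOWER(country_code) = 'us'"
).fetchall()

print(len(exact), len(insensitive), len(normalized))
```

Normalizing both sides of a comparison with LOWER() is a common way to get the same result across dialects.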
● Two dashes (--) start a comment line.
Write neatly
Securing data
Data security
Data Security means protecting data from unauthorized access or
corruption by adopting safety measures.
Balancing security and analytics
Week-5
Create or enhance your online presence
Week-2
★ Checking data (good or bad, biased or unbiased)
Week-3
★ Database
Week-4
★ Organize the data
Week-5
★ Network building
Dhamodharan
09/10/2021