BigDataAnalytics _ Unit1
For Example: A flight booking service may record data like the number of tickets booked each day.
Descriptive analysis will reveal booking spikes, booking slumps, and high-performing months for this service.
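A minimal sketch of what descriptive analysis could look like in pandas, assuming a hypothetical daily bookings file with "date" and "tickets" columns (the file and column names are illustrative, not from the source):

```python
# Descriptive analysis sketch: summarize daily bookings by month and flag spikes.
import pandas as pd

bookings = pd.read_csv("daily_bookings.csv", parse_dates=["date"])  # hypothetical file
monthly = bookings.groupby(bookings["date"].dt.to_period("M"))["tickets"].sum()

print("Best month:", monthly.idxmax(), monthly.max())
print("Worst month:", monthly.idxmin(), monthly.min())
print("Months more than 20% above average (booking spikes):")
print(monthly[monthly > 1.2 * monthly.mean()])
```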
2. Diagnostic analysis -
● Detailed data examination to understand why something happened.
● It is characterized by techniques such as drill-down, data discovery, data mining, and
correlations.
● In each of these techniques, multiple data operations and transformations may be performed on a given data set to discover unique patterns.
For example: the flight service might drill down on a particularly high-performing month to better
understand the booking spike. This may lead to the discovery that many customers visit a particular city to
attend a monthly sporting event.
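A hedged drill-down sketch continuing the same hypothetical dataset; the "destination" column and the day-of-week breakdown are assumptions added for illustration:

```python
# Diagnostic drill-down sketch: break a high-performing month down by destination
# and by day of week to look for a recurring driver (e.g., a monthly event).
import pandas as pd

bookings = pd.read_csv("daily_bookings.csv", parse_dates=["date"])       # hypothetical file
spike = bookings[bookings["date"].dt.strftime("%Y-%m") == "2024-05"]     # the spike month

by_city = spike.groupby("destination")["tickets"].sum().sort_values(ascending=False)
print(by_city.head(5))

# Check whether the spike clusters around particular days of the week.
print(spike.groupby(spike["date"].dt.day_name())["tickets"].sum())
```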
3. Predictive analysis -
● It uses historical data to make accurate forecasts about patterns that may occur in the future.
● It is characterized by techniques such as machine learning, forecasting, pattern matching, and
predictive modeling.
● In each of these techniques, computers are trained to reverse-engineer causal connections in the data.
For example: the flight service team might use data science to predict flight booking patterns for the coming year at the
start of each year. The computer program or algorithm may look at past data and predict booking spikes for certain
destinations in May. Having anticipated their customer’s future travel requirements, the company could start targeted
advertising for those cities from February.
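As a rough illustration of predictive analysis, the sketch below fits a simple linear trend to monthly booking totals and projects it twelve months ahead. A real forecasting model would also capture seasonality; the data file and columns are assumed:

```python
# Predictive analysis sketch: fit a linear trend to monthly totals and extrapolate.
import numpy as np
import pandas as pd

bookings = pd.read_csv("daily_bookings.csv", parse_dates=["date"])   # hypothetical file
monthly = bookings.groupby(bookings["date"].dt.to_period("M"))["tickets"].sum()

x = np.arange(len(monthly))                       # month index 0..n-1
slope, intercept = np.polyfit(x, monthly.values, deg=1)

future_x = np.arange(len(monthly), len(monthly) + 12)
forecast = slope * future_x + intercept
print("Forecast for the next 12 months:", forecast.round())
```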
4. Prescriptive analysis -
● It not only predicts what is likely to happen but also suggests an optimum response to that outcome.
● It can analyze the potential implications of different choices and recommend the best course of action.
● It uses graph analysis, simulation, complex event processing, neural networks, and
recommendation engines from machine learning.
For Example: Prescriptive analysis could look at historical marketing campaigns to maximize the advantage of the
upcoming booking spike. A data scientist could project booking outcomes for different levels of marketing spend on
various marketing channels. These data forecasts would give the flight booking company greater confidence in their
marketing decisions.
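A toy prescriptive sketch under heavy assumptions: the per-channel response curves, spend options, and profit per booking are made up for illustration, standing in for estimates that would come from historical campaign data:

```python
# Prescriptive analysis sketch: evaluate hypothetical marketing-spend options and
# recommend the one with the best projected net return.
import math

AVG_TICKET_PROFIT = 40.0  # assumed profit per additional booking

def projected_extra_bookings(channel: str, spend: float) -> float:
    # Assumed diminishing-returns response curves per channel (illustrative only).
    lift = {"search_ads": 3.0, "social": 2.2, "email": 1.5}[channel]
    return lift * 100 * math.log1p(spend / 1000)

options = [(c, s) for c in ("search_ads", "social", "email")
           for s in (5_000, 10_000, 20_000)]

best = max(options,
           key=lambda o: projected_extra_bookings(*o) * AVG_TICKET_PROFIT - o[1])
print("Recommended action (channel, spend):", best)
```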
Data Collection and Management
● Data collection is the process of gathering information from relevant sources to answer a given statistical or research question. It is the first and foremost step in any statistical investigation, research study, or business intelligence effort.
Primary Data
● Primary data refers to information collected directly from first-hand sources specifically for a
particular research purpose.
● This type of data is gathered through various methods, including surveys, interviews,
experiments, observations, and focus groups.
● One of the main advantages of primary data is that it provides current, relevant, and specific
information tailored to the researcher’s needs, offering a high level of accuracy and control over
data quality.
Secondary Data
● Secondary data refers to information that has already been collected, processed, and
published by others.
● This type of data can be sourced from existing research papers, government reports, books,
statistical databases, and company records.
● The advantage of secondary data is that it is readily available and often free or less expensive to
obtain compared to primary data.
Primary Data are collected from:
● Surveys and Questionnaires: In a survey, a sample of the population is asked a set of predetermined questions to gather data. This approach helps acquire demographic data as well as subjective preferences and opinions. Online questionnaires, telephone interviews, and in-person interviews are all options for conducting surveys.
● Observational Studies: In observational studies, information is gathered by directly observing and documenting occurrences, actions, or events. This approach is frequently employed in disciplines like anthropology, psychology, and the social sciences. Field observations, video recordings, or existing records and documentation are all methods for gathering observational data.
● Experiments: Experiments involve manipulating variables to examine how they affect an outcome of interest. Data are gathered by comparing a control group with one or more experimental groups, which lets researchers establish cause-and-effect relationships. Experimental data may be collected in controlled lab environments or in real-world settings.
● Interviews: Individuals are interviewed one-on-one or in groups to collect data. Interviews can be structured around a series of questions or left unstructured to allow free-flowing discussion. This approach works well for obtaining in-depth knowledge, insights, and qualitative data.
● Web scraping: This technique automatically extracts data from websites. Large amounts of structured or unstructured data can be gathered from a variety of web sources. Web scraping requires programming expertise and adherence to ethical and legal standards (a minimal sketch appears after this list).
● Sensor data collection: Data from real-world objects or settings is gathered using sensors.
Examples include heart rate monitors, accelerometers, temperature sensors, and GPS trackers. In industries
including the Internet of Things, healthcare, and environmental monitoring, sensor data collection is common.
● Social Media Monitoring: As social media platforms have grown in popularity, researchers are
gathering information from sites like Twitter, Facebook, and Instagram to analyze trends,
attitudes, and general public opinion. This approach aids in the comprehension of user behavior and social
dynamics.
● Existing Databases & Records: Information may be gathered from historical archives, databases, or records that already exist. This technique is time- and cost-efficient, especially when working with huge datasets. Government records, client databases, and medical records are a few examples.
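As referenced in the web scraping bullet above, here is a minimal, hedged scraping sketch; the URL and CSS selector are placeholders, and any real use should respect a site's robots.txt and terms of service:

```python
# Web scraping sketch using requests and BeautifulSoup (placeholder URL/selector).
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/reviews", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
reviews = [tag.get_text(strip=True) for tag in soup.select("div.review-text")]
print(f"Collected {len(reviews)} reviews")
```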
● Data management is the practice of collecting, organizing, protecting, and storing an
organization’s data so it can be analyzed for business decisions.
Types of Data Management
● Data management techniques include the following:
a. Data preparation is used to clean and transform raw data into the right shape and format
for analysis, including making corrections and combining data sets.
b. Data pipelines enable the automated transfer of data from one system to another.
c. ETLs (Extract, Transform, Load) are built to take data from one system, transform it, and load it into the organization’s data warehouse (see the sketch after this list).
d. Data catalogs help manage metadata to create a complete picture of the data, providing a
summary of its changes, locations, and quality while also making the data easy to find.
e. Data warehouses are places to consolidate various data sources, contend with the many
data types businesses store, and provide a clear route for data analysis.
f. Data governance defines standards, processes, and policies to maintain data security and
integrity.
g. Data architecture provides a formal approach for creating and managing data flow.
h. Data security protects data from unauthorized access and corruption.
i. Data modeling documents the flow of data through an application or organization.
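A minimal sketch of items a–c (preparation, a simple pipeline step, and an ETL load), using pandas with SQLite standing in for a data warehouse; the file, table, and column names are assumptions:

```python
# ETL sketch: extract raw data, clean and transform it, load it into a warehouse table.
import sqlite3
import pandas as pd

# Extract: pull raw data from a source system (here, a CSV export).
raw = pd.read_csv("sales_export.csv")

# Transform: clean and reshape it for analysis.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "customer_id"]).drop_duplicates()
raw["revenue"] = raw["quantity"] * raw["unit_price"]

# Load: write the prepared table into the warehouse.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_sales", conn, if_exists="replace", index=False)
```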
Sources of Data
● A data source is a place or origin from which data is received in the context of
data management and collecting in data science.
● A database, website, API, sensor, or any other platform or system that produces
or stores data can be a data source. To obtain the information required for
analysis and decision-making processes, data scientists locate and access
pertinent data sources.
● Internal data: This refers to information gathered within a company, such as financial, customer, or sales
data.
● External data: This refers to information gathered from sources outside of an organization, such as the
government, social media, or the weather.
● Sensor data: This refers to information gathered through sensors, such as GPS, temperature, or heart rate
readings.
● Text data: This is information gathered from written materials like news stories, social media posts, and
product reviews.
● Image data: This is information gathered from visual sources like pictures, x-rays, or satellite images.
● Audio data: This is information that has been gathered from audio sources like voice, music, or noises in the
environment.
Using Multiple Data Sources
Importance of Multiple Data Sources
● Comprehensive Insights: Accessing varied perspectives allows for deeper understanding and better
decision-making.
● Enhanced Accuracy: Cross-verification of data ensures reliable outcomes.
● Rich Context: Adding external data enriches internal datasets, providing broader context.
● Improved Predictions: Aggregating diverse data improves the performance of predictive and prescriptive models.
Challenges of Using Multiple Data Sources
1. Data Integration
○ Handling different formats (structured, semi-structured, unstructured).
○ Combining siloed data from disparate sources.
2. Data Quality
○ Cleaning inconsistent or missing values.
○ Removing duplicates and irrelevant data.
3. Data Governance
○ Ensuring privacy, security, and compliance (e.g., GDPR, HIPAA).
4. Scalability
○ Managing the increasing volume, velocity, and variety of data.
5. Real-Time Processing
○ Managing latency for streaming data sources.
Steps for Using Multiple Data Sources
● Merge datasets using unique identifiers (e.g., user ID, location ID), as shown in the sketch below.
● Add external data for context (e.g., weather, market trends).
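A small sketch of the two bullets above, assuming hypothetical internal sales, CRM, and external weather files with the join keys shown:

```python
# Combining multiple data sources: join internal tables on a shared ID,
# then enrich with external context data keyed on date and location.
import pandas as pd

sales = pd.read_csv("internal_sales.csv")        # internal source
crm = pd.read_csv("crm_customers.csv")           # second internal source
weather = pd.read_csv("external_weather.csv")    # external context data

combined = sales.merge(crm, on="customer_id", how="left")
combined = combined.merge(weather, on=["date", "location_id"], how="left")
print(combined.head())
```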
1. Data Collection: Data exploration commences with collecting data from diverse sources such as databases, APIs,
or through web scraping techniques. This phase emphasizes recognizing data formats, structures, and
interrelationships. Comprehensive data profiling is conducted to grasp fundamental statistics, distributions, and
ranges of the acquired data.
2. Data Cleaning: Integral to this process is the rectification of outliers, inconsistent data points, and addressing
missing values, all of which are vital for ensuring the reliability of subsequent analyses. This step involves employing
methodologies like standardizing data formats, identifying outliers, and imputing missing values. Data organization
and transformation further streamline data for analysis and interpretation.
3. Exploratory Data Analysis (EDA): This EDA phase involves the application of various statistical tools such as
box plots, scatter plots, histograms, and distribution plots. Additionally, correlation matrices and descriptive
statistics are utilized to uncover links, patterns, and trends within the data.
4. Feature Engineering: Feature engineering focuses on enhancing prediction models by introducing or modifying
features. Techniques like data normalization, scaling, encoding, and creating new variables are applied. This step
ensures that features are relevant and consistent, ultimately improving model performance.
5. Model Building and Validation: During this stage, preliminary models are developed to test hypotheses or
predictions. Regression, classification, or clustering techniques are employed based on the problem at hand.
Cross-validation methods are used to assess model performance and generalizability.
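A compact sketch tying together steps 2–5 (cleaning, EDA, feature engineering, and model building with cross-validation); the dataset, column names, and target variable are assumptions:

```python
# Exploration-to-model sketch on a hypothetical customer dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customers.csv")                      # hypothetical dataset

# Step 2. Data cleaning: drop duplicates, impute a numeric column.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# Step 3. EDA: summary statistics and correlations.
print(df.describe())
print(df.corr(numeric_only=True))

# Steps 4-5. Feature engineering + model building with cross-validation.
X, y = df[["age", "income", "region"]], df["churned"]
prep = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```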
Fixing Data
● Data fixing is a critical step in data preprocessing where errors or inaccuracies in the dataset are identified and corrected to ensure data reliability and usability. Common types of errors include the following:
1. Incorrect Values:
○ Negative values where only positive ones are valid (e.g., negative age or income).
○ Out-of-range values (e.g., temperatures exceeding physical limits).
2. Typographical Errors:
○ Misspelled names, places, or labels.
○ Inconsistent entries (e.g., "NY" and "New York" for the same entity).
3. Data Mismatches:
○ Data discrepancies between sources (e.g., mismatched customer IDs across tables).
4. Logical Errors:
○ Start dates occurring after end dates.
○ Invalid combinations of categorical data (e.g., "Male" listed as "Pregnant").
5. Incomplete Data:
○ Missing critical fields that require imputation or manual entry.
6. Duplicate Entries:
○ Redundant rows that inflate results or create biases.
Methods for Fixing Data
1. Handling Missing Values:
○ Fill missing data with:
■ Mean/Median/Mode (for numerical data).
■ Interpolation (for time-series data).
■ Domain-Specific Defaults.
○ Drop rows or columns if the missing values are not critical.
2. Correcting Invalid Values:
○ Replace with nearest valid values (e.g., cap outliers at upper/lower bounds).
○ Use regex or string matching for correcting typos (e.g., "Nwe York" → "New York").
3. Resolving Duplicates:
○ Drop duplicate rows using drop_duplicates() in Python or similar tools.
4. Standardizing Data:
○ Convert inconsistent formats (e.g., "MM-DD-YYYY" → "YYYY-MM-DD").
○ Standardize text case, remove whitespace, and normalize units.
5. Cross-Referencing Data:
○ Match and validate entries against reference datasets or lookup tables.
6. Handling Outliers:
○ Apply statistical methods to detect and cap outliers (e.g., z-scores, IQR).
○ Decide whether to keep or remove outliers based on context.
7. Fixing Logical Errors:
○ Write conditional rules to correct invalid logic (e.g., swap dates if end < start).
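A minimal sketch combining several of the methods above (missing values, typo correction, duplicates, date standardization, IQR outlier capping, and a logical date check); the column names and the typo map are illustrative assumptions:

```python
# Data fixing sketch on a hypothetical raw orders table.
import pandas as pd

df = pd.read_csv("orders_raw.csv")

# 1. Missing values: median for a numeric column, interpolation for a time series.
df["income"] = df["income"].fillna(df["income"].median())
df["daily_usage"] = df["daily_usage"].interpolate()

# 2. Correcting typos / inconsistent labels via a lookup map.
df["city"] = df["city"].replace({"Nwe York": "New York", "NY": "New York"})

# 3. Duplicates.
df = df.drop_duplicates()

# 4. Standardizing dates to a single format.
df["start"] = pd.to_datetime(df["start"], errors="coerce")
df["end"] = pd.to_datetime(df["end"], errors="coerce")

# 6. Outliers: cap a numeric column at the IQR fences.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["amount"] = df["amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 7. Logical errors: swap start/end where end < start.
bad = df["end"] < df["start"]
df.loc[bad, ["start", "end"]] = df.loc[bad, ["end", "start"]].values
```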
Data Storage and Management
● Data storage is a key component of computing devices, as consumers and organizations have come to depend on it to save data ranging from personal information to business-critical data.
● It is used to capture and retain digital data on storage devices.
Network Attached Storage (NAS):
● A network-attached storage device allows authorized network users to store and retrieve data from a centralized location.
● These devices are flexible and scalable.
● NAS connects to a wireless router, making it simple for distributed workplaces to access files from any device connected to the network.
Cloud storage:
● Cloud storage is a storage option that utilizes remote servers and is accessible from any computer with
Internet access.
● It is maintained, operated, and managed by a cloud storage service provider on storage servers built on virtualization techniques. Examples of cloud storage providers are Google Drive, iCloud, Citrix ShareFile, ownCloud, Dropbox, Amazon Cloud Drive, MediaFire, etc.
Direct Attached Storage:
● Direct-attached storage is storage connected directly to a single computer.
● It is attached to one computer and is not accessible to other computers.
● DAS can give users better performance than networked storage because the server does not have to traverse a network to read and write data.
● A hard drive or USB flash drive is an example of direct-attached storage.
Storage Area Network:
● The storage area network is a network-based storage system.
● SAN systems connect to the network using high-speed interfaces, enabling improved performance and the ability to connect numerous servers to a centralized pool of disk storage.
● Storage area networks are highly scalable because capacity can be added as needed.
Object storage:
● Object storage is a technique for organizing data into distinct components called objects that are kept with unique identifiers and metadata.
● Each object is given a unique address, and data can be retrieved by using that address as a reference.
● It is designed for large-scale, unstructured data storage, including multimedia files, backups, and archives. Distributed storage systems and cloud storage platforms are examples of object storage.
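A minimal object-storage sketch using boto3 against an S3-compatible store, as one possible example; the bucket and key names are placeholders and credentials are assumed to be configured in the environment:

```python
# Object storage sketch: store an object with metadata, then retrieve it by its key.
import boto3

s3 = boto3.client("s3")

# Store an object under a unique key, with descriptive metadata attached.
s3.put_object(
    Bucket="analytics-archive",                       # placeholder bucket name
    Key="backups/2024/sales.parquet",                 # unique object address
    Body=open("sales.parquet", "rb"),
    Metadata={"source": "sales-db", "owner": "data-team"},
)

# Retrieve the object later by referencing the same address (bucket + key).
obj = s3.get_object(Bucket="analytics-archive", Key="backups/2024/sales.parquet")
data = obj["Body"].read()
print(len(data), "bytes retrieved; metadata:", obj["Metadata"])
```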