Unit 2 Notes: Data Analytics
Big data refers to large and complex datasets that cannot be processed or analyzed using
traditional data processing techniques. It includes structured, semi-structured, and unstructured
data that is generated from a variety of sources, such as social media, mobile devices, sensors,
and machine logs. Big data is important for several reasons:
Insights and decision making: Big data can help organizations gain insights into
customer behavior, market trends, and operational efficiencies. This information can be
used to make data-driven decisions that can improve business outcomes.
Innovation: Big data can fuel innovation by providing new opportunities to create new
products and services or improve existing ones.
Cost savings: Big data technologies can help organizations reduce costs by optimizing
processes, identifying inefficiencies, and improving resource utilization.
Competitive advantage: Organizations that leverage big data effectively can gain a
competitive advantage by making better decisions faster, identifying new revenue
streams, and improving customer experiences.
Healthcare: Big data is important in healthcare as it can help in improving patient
outcomes, identifying disease patterns, and optimizing resource allocation.
Social good: Big data can be used for social good by providing insights into social issues
such as poverty, hunger, and disease outbreaks, which can help policymakers make
informed decisions and allocate resources effectively.
In conclusion, big data has become an important aspect of modern business and society. Its
importance lies in its ability to provide insights, fuel innovation, save costs, gain a competitive
advantage, improve healthcare, and address social issues.
5 V’s of Big Data:
Volume: This refers to the vast amount of data that is generated from various sources such as
social media, sensors, and logs. Big Data is characterized by its sheer volume and the need for
tools and technologies that can handle and process large datasets.
Velocity: This refers to the speed at which data is generated, collected, and processed. With the
advent of IoT (Internet of Things) and other real-time data sources, data is being generated at a
faster rate than ever before. Big Data requires fast and real-time processing of data to extract
valuable insights.
Variety: This refers to the different types and formats of data that are generated. Big Data can
include structured, unstructured, and semi-structured data, such as text, audio, video, and social
media data. Analyzing these various data types requires advanced tools and technologies.
Veracity: This refers to the accuracy and reliability of the data. Big Data often includes
incomplete or inaccurate data that can lead to unreliable insights. It is essential to ensure that the
data being analyzed is trustworthy and of high quality.
Value: This refers to the importance and usefulness of the insights derived from the data. The
ultimate goal of analyzing Big Data is to extract valuable insights that can drive business
decisions, improve processes, and enhance customer experiences. The insights derived from Big
Data should be actionable and valuable to the organization.
Big data analytics refers to the process of analyzing and interpreting large and complex datasets
to uncover valuable insights that can drive business decisions, optimize processes, and improve
customer experiences. It involves using advanced analytical techniques, such as data mining,
machine learning, and predictive modeling, to extract insights from vast amounts of data.
The growth of Big Data has been fueled by the proliferation of new data sources, such as social
media, mobile devices, and the Internet of Things (IoT). These sources generate vast amounts of
data, which can be used to gain insights into customer behavior, market trends, and operational
efficiencies. Key uses of big data analytics include:
Predictive modeling: Big Data analytics can be used to build predictive models that can
forecast customer behavior, market trends, and operational outcomes.
Customer insights: Big Data analytics can help organizations gain insights into
customer behavior and preferences, which can be used to improve customer experiences
and drive revenue growth.
Operational efficiency: Big Data analytics can help organizations identify inefficiencies
in their processes and operations, which can lead to cost savings and improved resource
utilization.
Risk management: Big Data analytics can be used to identify potential risks and
vulnerabilities in an organization's operations, supply chain, and customer interactions.
Product development: Big Data analytics can provide insights into customer needs and
preferences, which can inform the development of new products and services.
To effectively leverage Big Data analytics, organizations need to invest in the right tools and
technologies, as well as skilled data analysts and data scientists. It is essential to have a clear
understanding of the organization's goals and objectives and to develop a robust data strategy to
guide the analytics process.
Big data analytics has a wide range of applications across various industries. Here are some
examples of how big data analytics is being used in different domains:
Healthcare: Big data analytics is being used to improve patient outcomes by identifying
disease patterns, predicting patient risks, and optimizing resource allocation. For
example, analytics can be used to predict which patients are at high risk of developing a
specific disease and intervene early to prevent the disease from progressing.
Finance: Big data analytics is being used to detect fraudulent transactions, assess credit
risk, and identify investment opportunities. For example, analytics can be used to identify
patterns of fraudulent activity in credit card transactions and alert the bank to take
immediate action.
Retail: Big data analytics is being used to optimize supply chain management,
personalize customer experiences, and forecast demand. For example, analytics can be
used to analyze customer data and provide personalized recommendations for products
and services.
Manufacturing: Big data analytics is being used to improve quality control, optimize
production processes, and reduce downtime. For example, analytics can be used to
monitor machine performance in real time and predict potential equipment failures before
they occur.
Transportation: Big data analytics is being used to optimize logistics and improve
safety. For example, analytics can be used to optimize delivery routes to reduce fuel
consumption and improve on-time delivery.
Education: Big data analytics is being used to personalize learning experiences, identify
at-risk students, and measure student performance. For example, analytics can be used to
track student progress and provide personalized recommendations for coursework.
Sports: Big data analytics is being used to analyze player performance, optimize team
strategies, and enhance the fan experience. For example, analytics can be used to track
player movements on the field and provide insights into their performance and behavior.
In conclusion, big data analytics has a wide range of applications across different industries and
domains. By leveraging advanced analytics techniques, organizations can gain valuable insights
into customer behavior, market trends, and operational efficiencies, which can drive business
decisions and improve outcomes.
Hadoop is one of the most popular big data technologies used for processing and managing large
and complex datasets. It is an open-source framework that provides a distributed file system
(HDFS) and a framework for processing large-scale data (MapReduce). The Hadoop ecosystem
includes several other tools and technologies that work together to provide a complete big data
solution.
One of the key features of Hadoop is its ability to process data in a parallel and distributed
manner, which enables it to handle large datasets with ease. Hadoop uses a distributed computing
model, where data is divided into smaller chunks and processed in parallel across multiple nodes
in a cluster.
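To make the MapReduce idea concrete, here is a toy, in-process Python sketch of the classic word-count pattern. The sample input lines are made up for illustration; real Hadoop runs the same map and reduce steps in parallel across HDFS blocks on many nodes rather than inside one script.

from collections import defaultdict

# Illustrative input records; in Hadoop these would be lines from a file stored in HDFS.
lines = ["big data needs big tools", "data drives decisions"]

# Map phase: emit a (word, 1) pair for every word in every input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (Hadoop performs this step across the cluster).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the grouped values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 2, 'needs': 1, 'tools': 1, 'drives': 1, 'decisions': 1}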
Some of the other technologies that work alongside Hadoop in the big data ecosystem include:
Spark: Apache Spark is an open-source data processing framework that is designed to work with
Hadoop. It provides an alternative to MapReduce and is known for its speed and ease of use.
Spark provides an in-memory processing model, which allows it to process data much faster than
traditional batch processing systems.
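As a small illustration of Spark's in-memory model, the PySpark sketch below counts words from a text file. It assumes the pyspark package is installed and that a local file named logs.txt exists; both the file name and the application name are illustrative.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the file, split each line into words, and count occurrences in memory.
lines = spark.read.text("logs.txt")          # DataFrame with a single "value" column
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.take(10):
    print(word, count)

spark.stop()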
NoSQL databases: NoSQL databases, such as MongoDB and Cassandra, are designed to handle
large volumes of unstructured data. These databases are highly scalable and provide high
performance for read and write operations. They are often used in conjunction with Hadoop to
store and process large amounts of data.
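The sketch below shows, at a high level, how semi-structured records might be stored and queried in a NoSQL document store from Python. It assumes a MongoDB server running on localhost and the pymongo driver; the database, collection, and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]   # database and collection chosen for the example

# Insert a semi-structured record as a JSON-like document (no fixed schema required).
events.insert_one({"user": "u42", "action": "click", "tags": ["promo", "mobile"]})

# Query documents that contain a particular tag and count them.
for doc in events.find({"tags": "promo"}):
    print(doc["user"], doc["action"])
print("promo events:", events.count_documents({"tags": "promo"}))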
Hive: Apache Hive is a data warehousing and SQL-like query language that is used to process
data stored in Hadoop. Hive provides a familiar interface for data analysts and allows them to
perform complex queries on large datasets.
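As a rough sketch, a HiveQL query can be issued from Python through the PyHive package, assuming a HiveServer2 instance is reachable; the host, port, and the sales table used here are assumptions made for the example.

from pyhive import hive

# Connect to HiveServer2 (default port 10000 in this sketch).
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL but is executed over data stored in Hadoop.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()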
Pig: Apache Pig is a platform for creating complex data processing pipelines. Pig provides a
high-level language (Pig Latin) for expressing data transformations and can be used to perform
complex data manipulations on large datasets.
In conclusion, Hadoop is a key technology in the big data world and provides a framework for
processing and managing large datasets in a parallel and distributed manner. Other technologies,
such as Spark, NoSQL databases, Hive, and Pig, work alongside Hadoop to provide a
complete big data solution. Together, these technologies enable organizations to extract valuable
insights from large and complex datasets, which can drive business decisions and improve
outcomes.
There are several open-source technologies available for big data analytics. Some of the most
popular, discussed above, are Hadoop, Spark, Hive, Pig, and NoSQL databases such as MongoDB
and Cassandra. These are just a few examples of the many open-source technologies available;
the choice of technology depends on the specific requirements and use case of the organization.
Cloud computing and big data are closely related, as the cloud provides a scalable and cost-
effective platform for storing and processing large data sets.
Here are some of the ways in which cloud computing can be used for big data:
Storage: Cloud storage services such as Amazon S3, Google Cloud Storage, and
Microsoft Azure Blob Storage provide a cost-effective way to store large amounts of
data. These services are highly scalable and can be used to store both structured and
unstructured data.
Processing: Cloud computing platforms such as Amazon Web Services (AWS), Google
Cloud Platform (GCP), and Microsoft Azure provide tools and services for processing
big data. These include services such as Amazon EMR, Google Cloud Dataproc, and
Microsoft Azure HDInsight, which provide managed Hadoop and Spark clusters for
processing large data sets.
Analytics: Cloud-based analytics platforms such as Google BigQuery, AWS Redshift,
and Microsoft Azure Synapse Analytics provide a powerful and scalable way to perform
data analysis on large data sets. These services can be used for a wide range of analytics
tasks, including data warehousing, business intelligence, and machine learning.
Real-time processing: Cloud-based stream processing services such as AWS Kinesis,
Google Cloud Dataflow, and Azure Stream Analytics provide a scalable and cost-
effective way to process real-time data streams.
Overall, cloud computing provides a powerful and flexible platform for big data processing and
analysis, allowing organizations to store and analyze large data sets without the need for costly
on-premises infrastructure.
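As one concrete illustration of cloud object storage for big data, the sketch below uploads a raw data file to Amazon S3 using the boto3 SDK and then lists what is stored under a prefix. The bucket name, file name, and configured AWS credentials are assumptions made for the example.

import boto3

s3 = boto3.client("s3")

# Upload a local file into the bucket under a "raw/" prefix.
s3.upload_file("sensor_readings.csv", "my-analytics-bucket", "raw/sensor_readings.csv")

# List the objects stored under that prefix.
response = s3.list_objects_v2(Bucket="my-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])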
Predictive Analytics
Predictive analytics is the use of statistical algorithms and machine learning techniques to
analyze historical data and make predictions about future events or trends. It involves the use of
advanced analytics techniques to identify patterns and relationships in data, which can then be
used to make predictions about future outcomes.
Predictive analytics can be applied in a wide range of industries and use cases, including:
Sales and marketing: Predictive analytics can be used to identify potential customers,
forecast sales volumes, and optimize marketing campaigns.
Financial services: Predictive analytics can be used to detect fraud, manage risk, and
optimize investment portfolios.
Healthcare: Predictive analytics can be used to identify high-risk patients, forecast
disease outbreaks, and optimize healthcare delivery.
Manufacturing: Predictive analytics can be used to forecast demand, optimize
production schedules, and prevent equipment failures.
Transportation: Predictive analytics can be used to optimize routes, predict maintenance
needs, and improve safety.
To perform predictive analytics, organizations need to gather and analyze large amounts of data,
including historical data and real-time data. This data is then used to build predictive models,
which can be used to make predictions about future events or outcomes. The accuracy of
predictive analytics depends on the quality of the data and the complexity of the predictive
models used.
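A minimal predictive-modeling sketch using scikit-learn is shown below: a classifier is trained on historical records and evaluated on held-out data to estimate how well it will predict future outcomes. The CSV file and its "churned" label column are illustrative assumptions.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Historical data: feature columns plus a known outcome column ("churned").
data = pd.read_csv("customer_history.csv")
X = data.drop(columns=["churned"])
y = data["churned"]

# Hold out 20% of the data to test how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))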
Hadoop is an open-source framework for distributed storage and processing of large data sets. It
provides a scalable and fault-tolerant platform for storing and processing big data.
Predictive analytics can be performed on data stored in Hadoop using a variety of open-source
tools and frameworks, such as Apache Mahout, Apache Spark MLlib, and H2O.ai. These tools
provide a wide range of machine learning algorithms for building predictive models, including
regression, classification, clustering, and recommendation.
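As an illustrative sketch of that workflow, the PySpark MLlib snippet below fits a logistic regression model on data read from HDFS. The HDFS path and the column names (including the label column) are assumptions made for the example.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChurnModel").getOrCreate()

# Load historical records from HDFS and assemble feature columns into one vector.
df = spark.read.csv("hdfs:///data/customers.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["age", "tenure", "monthly_spend"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Fit the model; Spark distributes the computation across the cluster.
model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
model.transform(train).select("features", "prediction").show(5)

spark.stop()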
One of the key advantages of using Hadoop for predictive analytics is its ability to process large
data sets in parallel. Hadoop uses a distributed processing model, where data is stored across
multiple nodes in a cluster and processed in parallel. This allows organizations to analyze large
amounts of data quickly and efficiently, which is essential for predictive analytics.
Another advantage of using Hadoop for predictive analytics is its ability to handle a variety of
data types, including structured and unstructured data. This is particularly useful for applications
such as text mining and sentiment analysis, where unstructured data such as social media posts
and customer reviews are analyzed to make predictions about customer behavior.
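For instance, a simple sentiment-analysis pass over unstructured text could look like the sketch below, which uses NLTK's VADER analyzer. The review strings are made up, and the vader_lexicon resource must be downloaded once before use.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

reviews = [
    "The delivery was fast and the product works great!",
    "Terrible support, I will not order again.",
]

# Each score dictionary contains neg/neu/pos components and an overall compound score.
for review in reviews:
    scores = sia.polarity_scores(review)
    print(scores["compound"], review)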
Overall, Hadoop provides a powerful platform for performing predictive analytics on big data.
By using open-source tools and frameworks, organizations can build predictive models that are
accurate and scalable, allowing them to make better decisions and gain a competitive advantage.
Mobile Business Intelligence (BI) refers to the delivery of business intelligence solutions to
mobile devices such as smartphones and tablets. It allows business users to access and analyze
data on the go, providing them with real-time insights into key business metrics.
Big Data plays a critical role in Mobile BI, as it provides the large and complex data sets that
businesses need to analyze to make informed decisions. Big Data technologies such as Hadoop
and Spark provide a scalable and cost-effective platform for storing and processing the vast
amounts of data required for Mobile BI.
Some of the benefits of Mobile BI and Big Data include:
Real-time insights: Mobile BI allows business users to access and analyze data in real
time, enabling them to make informed decisions on the go.
Increased productivity: Mobile BI eliminates the need for users to be tied to their
desktops or laptops, allowing them to access data and insights from anywhere, at any
time.
Better decision-making: Big Data provides the large and complex data sets required for
accurate analysis, enabling businesses to make better-informed decisions.
Improved customer experience: Mobile BI allows businesses to analyze customer data
in real time, enabling them to provide a more personalized customer experience.
Cost savings: Big Data technologies provide a cost-effective platform for storing and
processing large data sets, reducing the need for expensive on-premises infrastructure.
Overall, Mobile BI and Big Data provide a powerful combination for businesses looking to gain
insights into their data and make better-informed decisions. By leveraging the scalability and
flexibility of Big Data technologies, businesses can analyze vast amounts of data in real time,
providing them with a competitive advantage in today's fast-paced business environment.
Crowdsourcing analytics is a process of using a group of people to collect and analyze data,
typically through an online platform. This approach allows organizations to tap into the
collective knowledge and expertise of a diverse group of individuals to solve complex problems,
analyze large datasets, or make predictions.
Crowdsourcing analytics can be used in a variety of industries and use cases.
One of the key advantages of crowdsourcing analytics is its ability to collect and analyze large
amounts of data quickly and efficiently. By leveraging the collective knowledge and expertise of
a large group of individuals, organizations can analyze data more accurately and efficiently than
if they were relying solely on internal resources.
Overall, crowdsourcing analytics provides a powerful tool for organizations looking to collect
and analyze large amounts of data. By tapping into the collective knowledge and expertise of a
diverse group of individuals, organizations can make better-informed decisions and gain a
competitive advantage.
Inter and trans firewall analytics refer to the analysis of network traffic that passes through or
between firewalls. Firewalls are network security devices that monitor and control incoming and
outgoing network traffic based on pre-defined security rules.
Inter firewall analytics involves analyzing traffic that passes through a single firewall. This
includes analyzing incoming and outgoing traffic to identify potential threats such as malware,
viruses, or unauthorized access attempts. This analysis can be used to detect and prevent network
attacks, protect sensitive data, and ensure compliance with security policies.
Trans firewall analytics, on the other hand, involves analyzing traffic that passes through
multiple firewalls. This is typically done in large and complex networks that have multiple layers
of security. Trans firewall analytics can help organizations identify potential vulnerabilities in
their network architecture, optimize network performance, and ensure compliance with
regulatory requirements.
Both inter and trans firewall analytics require the use of advanced analytics techniques such as
machine learning, anomaly detection, and behavioral analysis. These techniques allow
organizations to identify patterns and anomalies in network traffic, detect potential threats, and
respond quickly to security incidents. Some of the benefits of inter and trans firewall analytics include:
Improved threat detection: Analytics can help identify and mitigate potential security
threats before they can cause harm to the network.
Enhanced compliance: Analytics can help ensure compliance with regulatory
requirements, such as HIPAA or PCI-DSS.
Improved network performance: Analytics can help optimize network traffic, reducing
latency and improving overall network performance.
Better visibility: Analytics can provide organizations with a better understanding of their
network traffic, including where it is coming from and where it is going.
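As a small illustration of the anomaly-detection techniques mentioned above, the sketch below flags unusual firewall traffic with scikit-learn's IsolationForest; the traffic features and values are synthetic and chosen only for the example.

import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [bytes transferred, packets per second, distinct destination ports].
rng = np.random.RandomState(0)
normal_traffic = rng.normal(loc=[500, 50, 3], scale=[100, 10, 1], size=(200, 3))
suspicious = np.array([[50000, 900, 120]])   # an unusually heavy connection

# Train on (assumed) normal traffic; the model learns what "typical" looks like.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal_traffic)

# predict() returns 1 for inliers (normal traffic) and -1 for anomalies.
print(model.predict(suspicious))         # expected: [-1]
print(model.predict(normal_traffic[:5]))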
In summary, inter and trans firewall analytics are important tools for organizations looking to
improve their network security and performance. By using advanced analytics techniques to
analyze network traffic, organizations can detect potential threats, optimize network
performance, and ensure compliance with regulatory requirements.
Information Management
Information management refers to the processes and practices an organization uses to capture,
store, retrieve, and analyze its data and information assets. Its key components include:
Data governance: This involves the establishment of policies, procedures, and standards
to ensure the quality, accuracy, and security of data.
Data architecture: This involves the design and implementation of a data infrastructure
that supports the organization's information needs, including data storage, retrieval, and
analysis.
Data modeling: This involves the creation of data models that represent the organization's
data in a standardized and structured format, allowing for easier analysis and reporting.
Data integration: This involves the process of combining data from different sources and
formats to create a unified view of the organization's information assets.
Data security: This involves the protection of sensitive data from unauthorized access,
use, disclosure, or destruction.
Data analytics: This involves the use of data to gain insights and make informed
decisions, including the use of tools such as business intelligence, data mining, and
predictive analytics.
Overall, information management plays a critical role in enabling organizations to effectively
capture, store, retrieve, and analyze their data and information assets. By establishing effective
information management processes, organizations can gain insights that drive business growth
and competitive advantage.