
Unit 1 Data Science and Big Data

Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time and cannot be stored or processed on a single machine. Big data tools are emerging to help companies collect, process, and analyze data at the speed needed to gain value from it. Organizations can use big data analytics systems and software to make data-driven decisions that improve business outcomes through more effective marketing, new revenue opportunities, and improved operational efficiency. Popular big data tools include Apache Hadoop, Tableau, the R language, and cloud computing providers that allow customers to easily process data.

Big Data and Data Science

Introduction

Data that keeps growing and cannot be stored or processed on a single machine is
called Big Data.
Big Data

• Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continue to grow exponentially over
time.
• These datasets are so large and complex in volume, velocity, and variety that
traditional data management systems cannot store, process, and analyze them.
• The amount and availability of data are growing rapidly, spurred on by digital
technology advancements such as connectivity, mobility, the Internet of Things
(IoT), and artificial intelligence (AI). As data continues to expand and proliferate,
new big data tools are emerging to help companies collect, process, and analyze
data at the speed needed to gain the most value from it.
• Big data describes large and diverse datasets that are huge in volume and also
rapidly grow in size over time. Big data is used in machine learning, predictive
modeling, and other advanced analytics to solve business problems and make
informed decisions.
Motivation

Big Data has given organizations a new way to analyze and visualize their data effectively.

For example:

Business: Customer feedback, trends, etc.

Health: Health care organizations are leveraging big data technology to capture all the information
about a patient and get a more complete view, giving insight into care coordination, health management, and
outcomes.
How Big Data Works
Making big data work requires three main actions:

• Integration: Big data systems collect terabytes, and sometimes even petabytes, of raw data from
many sources. That data must be received, processed, and transformed into the format that
business users and analysts need before they can start analyzing it.
• Management: Big data needs big storage, whether in the cloud, on-premises, or both. Data
must be stored in whatever form is required, and it must be processed and made
available in real time. Increasingly, companies are turning to cloud solutions to take
advantage of elastic compute and scalability.
• Analysis: The final step is analyzing and acting on big data; otherwise, the investment
won’t be worth it. Beyond exploring the data itself, it’s also critical to communicate and share
insights across the business in a way that everyone can understand. This includes using
tools to create data visualizations like charts, graphs, and dashboards.
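The three actions above can be illustrated with a minimal sketch. It assumes two tiny hypothetical sources (one CSV, one JSON) and uses an in-memory dictionary in place of real big data storage; the field names and sample values are invented for illustration only.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw inputs from two sources in different formats.
CSV_SOURCE = "customer,amount\nalice,20.5\nbob,10\n"
JSON_SOURCE = '[{"user": "alice", "total": "4.5"}, {"user": "carol", "total": "7"}]'


def integrate(csv_text, json_text):
    """Integration: convert both sources into one common record format."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({"customer": row["customer"], "amount": float(row["amount"])})
    for item in json.loads(json_text):
        records.append({"customer": item["user"], "amount": float(item["total"])})
    return records


def store(records):
    """Management: keep amounts in a store keyed by customer (a dict here)."""
    storage = defaultdict(list)
    for record in records:
        storage[record["customer"]].append(record["amount"])
    return dict(storage)


def analyze(storage):
    """Analysis: total spend per customer, highest first."""
    totals = {customer: sum(amounts) for customer, amounts in storage.items()}
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for customer, total in analyze(store(integrate(CSV_SOURCE, JSON_SOURCE))):
        print(f"{customer}: {total:.2f}")
```

In a real deployment each function would be replaced by a dedicated system (ingestion tools, distributed storage, analytics and dashboard software); the sketch only shows how the stages hand data to one another.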
Applications
Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon,
Walmart, and Big Bazaar), the management team has to keep data on customers’ spending habits, shopping
behavior, and most-liked products so that those products can be kept in stock.

Recommendation: By tracking customers’ spending habits and shopping behavior, big retail stores provide
recommendations to their customers. E-commerce sites like Amazon, Walmart, and Flipkart recommend
products this way.
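One simple way to turn tracked purchases into recommendations is to count which products are bought together. The sketch below is an illustrative assumption, not the method used by any of the retailers named above; the baskets and product names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, one set of products per order.
BASKETS = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "bag"},
]


def build_co_purchase_counts(baskets):
    """Count how often each ordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(product, counts, top_n=2):
    """Return the products most often bought together with `product`."""
    related = [(other, n) for (item, other), n in counts.items() if item == product]
    # Most frequent first; ties are broken alphabetically for a stable order.
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in related[:top_n]]


if __name__ == "__main__":
    counts = build_co_purchase_counts(BASKETS)
    print(recommend("laptop", counts))
```

Production recommenders use far richer signals (browsing history, ratings, learned models), but the co-purchase count shows the basic idea of deriving suggestions from tracked behavior.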

Education Sector: Organizations that run online courses use big data to find
candidates interested in a course. If someone searches for a YouTube tutorial video on a subject,
online or offline course providers for that subject can send that person online ads about their
courses.
Examples
• Tracking consumer behavior and shopping habits to deliver hyper-personalized
retail product recommendations tailored to individual customers
• Monitoring payment patterns and analyzing them against historical customer activity
to detect fraud in real time
• Combining data and information from every stage of an order’s shipment journey with
hyperlocal traffic insights to help fleet operators optimize last-mile delivery
• Using AI-powered technologies like natural language processing to analyze
unstructured medical data (such as research reports, clinical notes, and lab results)
to gain new insights for improved treatment development and enhanced patient care
• Using image data from cameras and sensors, as well as GPS data, to detect potholes
and improve road maintenance in cities
• Analyzing public datasets of satellite imagery and geospatial datasets to visualize,
monitor, measure, and predict the social and environmental impacts of supply chain
operations
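The fraud detection example in the list above compares a new payment with a customer's historical activity. A minimal sketch of that comparison follows, assuming a plain z-score rule with a hypothetical threshold of 3 standard deviations; real fraud systems combine many more signals and models.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag a payment that deviates strongly from the customer's past amounts.

    `history` is a list of past payment amounts; `threshold` is an assumed
    cutoff in standard deviations, chosen here only for illustration.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return amount != average
    return abs(amount - average) / spread > threshold


if __name__ == "__main__":
    past_payments = [20.0, 25.0, 22.0, 19.0, 24.0]  # hypothetical history
    print(is_suspicious(past_payments, 23.0))
    print(is_suspicious(past_payments, 500.0))
```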
Challenges

Lack of knowledgeable professionals

Lack of proper understanding of massive data

Data growth issues

Integrating data from a variety of sources

Securing data
Tools and techniques
Data Science
▪ Data science is a field that deals with structured, unstructured, and semi-structured
data. It involves practices like data cleansing, data preparation, data analysis, and much
more.
▪ Data science is the combination of statistics, mathematics, programming, and
problem-solving; capturing data in ingenious ways; the ability to look at things differently; and the
activity of cleansing, preparing, and aligning data. This umbrella term includes various
techniques that are used when extracting insights and information from data.
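Data cleansing and preparation, mentioned above, can be sketched in a few lines. The example assumes hypothetical records with `name` and `age` fields and applies four common steps: dropping incomplete rows, normalizing text, converting types, and removing duplicates.

```python
# Hypothetical raw records with typical quality problems.
RAW_RECORDS = [
    {"name": " alice ", "age": "30"},
    {"name": "ALICE", "age": 30},      # duplicate after normalization
    {"name": "bob", "age": ""},        # missing age
    {"name": None, "age": "41"},       # missing name
    {"name": "carol", "age": "27 "},
]


def clean_records(raw_records):
    """Drop incomplete rows, normalize text, convert types, remove duplicates."""
    seen = set()
    cleaned = []
    for record in raw_records:
        name = (record.get("name") or "").strip().title()
        age_value = record.get("age")
        age_text = "" if age_value is None else str(age_value).strip()
        if not name or not age_text.isdigit():
            continue  # incomplete or malformed row
        row = (name, int(age_text))
        if row in seen:
            continue  # duplicate of an earlier row
        seen.add(row)
        cleaned.append({"name": row[0], "age": row[1]})
    return cleaned


if __name__ == "__main__":
    for record in clean_records(RAW_RECORDS):
        print(record)
```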
Need of Data Science
▪ From business to the health industry, science to our everyday lives, marketing to research,
every field needs data to move forward.
Computer science and information technology are now part of everyday life, and they advance
with such velocity and variety that operational techniques used a
few years ago have already become obsolete.
▪ The same is true of challenges and problems. The problems and concerns of the past
for a specific theme, illness, or shortfall may not be the same today, as they have grown
in complexity.
▪ Every field of science and study, and every organization, therefore needs an updated set of
operational systems and technology to keep up with the challenges of today and tomorrow
and to derive solutions for unanswered questions.
Employ and utilize
Organizations can use big data analytics systems and software to make data-driven decisions that
can improve business-related outcomes.

The benefits may include more effective marketing, new revenue opportunities, customer
personalization and improved operational efficiency.

With an effective strategy, these benefits can provide competitive advantages over rivals.
Current trends and opportunities
Apache Hadoop
• Apache Hadoop is an open source, Java-based software platform that manages data
processing and storage for big data applications.
• The platform works by distributing Hadoop big data and analytics jobs across nodes in
a computing cluster, breaking them down into smaller workloads that can be run in
parallel. Some key benefits of Hadoop are scalability, resilience and flexibility.
• The Hadoop Distributed File System (HDFS) provides reliability and resiliency by
replicating each block of data across multiple nodes of the cluster to protect against
hardware or software failures.
• Hadoop’s flexibility allows the storage of any data format, including structured and
unstructured data.
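The idea of breaking a job into smaller workloads that run in parallel can be sketched with a word count in the map and reduce style. This is not Hadoop's API: threads in a single Python process stand in for cluster nodes, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_chunks):
    """Split the input into chunks, one per simulated node."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: count words within a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_workers=3):
    """Run the map step on each chunk in parallel, then reduce."""
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data tools", "big data", "data science"]
    print(word_count(sample).most_common())
```

On a real cluster the chunks would be data blocks stored on different machines, and the framework would also handle scheduling and recovery from node failures.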
Tableau
• Tableau is a powerful tool used for data analysis and visualization. It allows the
creation of interactive visualizations without any coding.
• Tableau is popular because it can take in data and produce the required data
visualization output in a very short time.
• In short, it turns your data into insights that can be used to drive future
action.
Tableau Features

• Tableau supports powerful data discovery and exploration that
enables users to answer important questions in seconds
• No prior programming knowledge is needed; users without relevant
experience can start immediately with creating visualizations using
Tableau
• It can connect to several data sources that other BI tools do not
support. Tableau enables users to create reports by joining and
blending different datasets
• Tableau Server supports a centralized location to manage all
published data sources within an organization
R language
• R is a language and environment for statistical computing and graphics. It is a GNU
project which is similar to the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R
can be considered as a different implementation of S.
• There are some important differences, but much code written for S runs unaltered under R.
• R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, …) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity.
• One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.
• R is available as Free Software under the terms of the Free Software Foundation’s GNU General
Public License in source code form. It compiles and runs on a wide variety of UNIX platforms
and similar systems (including FreeBSD and Linux), Windows, and macOS.
Big Data and Cloud
• Cloud Computing providers often utilize a “software as a service” model to allow customers to
easily process data. Typically, a console that can take in specialized commands and parameters
is available, but everything can also be done from the site’s user interface. Some products that
are usually part of this package include database management systems, cloud-based virtual
machines and containers, identity management systems, machine learning capabilities, and
more.
• In turn, Big Data is often generated by large, network-based systems. It can be in either a
standard or non-standard format. If the data is in a non-standard format, artificial intelligence from
the Cloud Computing provider may be used in addition to machine learning to standardize the
data.
• From there, the data can be harnessed through the Cloud Computing platform and utilized in a
variety of ways. For example, it can be searched, edited, and used for future insights.
• This cloud infrastructure allows for real-time processing of Big Data. It can take huge bursts of
data from intensive systems and interpret them in real time.
• Another common relationship between Big Data and Cloud Computing is that the power of the
cloud allows Big Data analytics to occur in a fraction of the time it used to take.
Job roles
Business Analyst
Data Analyst
Data Scientist
Data Engineer/Data Architect
Machine Learning Engineer
Big Data Engineer
Skill Set
Analytical Skills
Data Visualization Skills
Problem Solving Skills
SQL – Structured Query Language
Programming Skills – Python, Java, R
