Unit-2 R Programming
What is Data?
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner.
Table 1.1 Characteristics of Data
Characteristic   Question to ask
Accuracy         Is the information correct in every detail?
Completeness     How comprehensive is the information?
Reliability      Does the information contradict other trusted resources?
Relevance        Do you really need this information?
Timeliness       How up-to-date is the information? Can it be used for real-time reporting?
Differences between Small Data, Medium Data and Big Data
Data can be small, medium or big.
Small data is data in a volume and format that makes it accessible, informative and
actionable.
Medium data refers to data sets that are too large to fit on a single machine but don’t require
enormous clusters of thousands of machines.
Big data is extremely large data sets that may be analysed computationally to reveal patterns,
trends, and associations, especially relating to human behaviour and interactions.
Table 1.2 Small Data and Big Data Comparison Table
Each basis of comparison is listed with its Small Data and Big Data entries.

Definition
  Small Data: Data that is ‘small’ enough for human comprehension, in a volume and format that makes it accessible, informative and actionable.
  Big Data: Data sets that are so large or complex that traditional data processing applications cannot deal with them.

Data Source
  Small Data: Data from traditional enterprise systems like enterprise resource planning and customer relationship management (CRM).
  Big Data: Purchase data from point-of-sale; clickstream data from websites; GPS stream data (mobility data sent to a server); social media (Facebook, Twitter).

Volume
  Small Data: Most cases in a range of tens or hundreds of GB. Some cases a few TB (1 TB = 1000 GB).
  Big Data: More than a few terabytes (TB).

Velocity (rate at which data appears)
  Small Data: Controlled and steady data flow. Data accumulation is slow.
  Big Data: Data can arrive at very fast speeds. Enormous data can accumulate within very short periods of time.

Variety
  Small Data: Structured data in tabular format with fixed schema and semi-structured data in JSON or XML format.
  Big Data: High variety data sets which include tabular data, text files, images, video, audio, XML, JSON, logs, sensor data, etc.

Veracity (quality of data)
  Small Data: Contains less noise as data is collected in a controlled manner.
  Big Data: Usually, the quality of data is not guaranteed. Rigorous data validation is required before processing.

Value
  Small Data: Business intelligence, analysis, and reporting.
  Big Data: Complex data mining for prediction, recommendation, pattern finding, etc.

Time Variance
  Small Data: Historical data is equally valid, as the data represent solid business interactions.
  Big Data: In some cases, data gets old soon (e.g. fraud detection).

Data Location
  Small Data: Databases within an enterprise, local servers, etc.
  Big Data: Mostly in distributed storage on the cloud or in external file systems.

Infrastructure
  Small Data: Predictable resource allocation. Mostly vertically scalable hardware.
  Big Data: More agile infrastructure with a horizontally scalable architecture. Load on the system varies a lot.
Introduction to Big Data
Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it. Big Data has to
deal with large and complex datasets that can be structured, semi-structured, or unstructured
and will typically not fit into memory to be processed.
Big data is also a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.
Classification of Types of Big Data
The following classification was developed by the Task Team on Big Data, in June 2013.
Fig. 1.5 Sources of Big Data
1. Social Networks (human-sourced information): this information is the record of human
experiences, previously recorded in books and works of art, and later in photographs, audio
and video. Human-sourced information is now almost entirely digitized and stored
everywhere from personal computers to social networks. Data are loosely structured and
often ungoverned.
1100. Social Networks: Facebook, Twitter, Tumblr etc.
1200. Blogs and comments
1300. Personal documents
1400. Pictures: Instagram, Flickr, Picasa etc.
1500. Videos: YouTube etc.
1600. Internet searches
1700. Mobile data content: text messages
1800. User-generated maps
1900. E-Mail
2. Traditional Business systems (process-mediated data): these processes record and
monitor business events of interest, such as registering a customer, manufacturing a product,
taking an order, etc. The process-mediated data thus collected is highly structured and
includes transactions, reference tables and relationships, as well as the metadata that sets its
context. Traditional business data is the vast majority of what IT managed and processed, in
both operational and BI systems. Usually structured and stored in relational database systems.
(Some sources belonging to this class may fall into the category of "Administrative data").
21. Data produced by Public Agencies
2110. Medical records
22. Data produced by businesses
2210. Commercial transactions
2220. Banking/stock records
2230. E-commerce
2240. Credit cards
3. Internet of Things (machine-generated data): derived from the phenomenal growth in
the number of sensors and machines used to measure and record the events and situations in
the physical world. The output of these sensors is machine-generated data, and from simple
sensor records to complex computer logs, it is well structured. As sensors proliferate and data
volumes grow, it is becoming an increasingly important component of the information stored
and processed by many businesses. Its well-structured nature is suitable for computer
processing, but its size and speed are beyond traditional approaches.
31. Data from sensors
311. Fixed sensors
3111. Home automation
3112. Weather/pollution sensors
3113. Traffic sensors/webcam
3114. Scientific sensors
3115. Security/surveillance videos/images
312. Mobile sensors (tracking)
3121. Mobile phone location
3122. Cars
3123. Satellite images
32. Data from computer systems
3210. Logs
3220. Web logs
Analytics not only helps in understanding data more accurately, it is also helping to
generate insights from large amounts of data through visualization. Thus, it is no
wonder that Big Data has made its way into the boardroom, being an effective tool to
help companies strategize their decision making capabilities.
Big Data is one of THE biggest buzzwords around at the moment and
I believe big data will change the world. Some say it will be even
bigger than the Internet. What’s certain, big data will impact
everyone’s life. Having said that, I also think that the term ‘big data’ is
not very well defined and is, in fact, not well chosen. Let me use this
article to explain what’s behind the massive ‘big data’ buzz and
demystify some of the hype.
Basically, big data refers to our ability to collect and analyse the vast
amounts of data we are now generating in the world. The ability to
harness the ever-expanding volumes of data is completely
transforming our ability to understand the world and everything within
it. The advances in analysing big data allow us to e.g. decode human
DNA in minutes, find cures for cancer, accurately predict human
behaviour, foil terrorist attacks, pinpoint marketing efforts and prevent
diseases.
Take this business example: Wal-Mart is able to take data from your
past buying patterns, their internal stock information, your mobile
phone location data, social media as well as external weather
information and analyse all of this in seconds so it can send you a
voucher for a BBQ cleaner to your phone – but only if you own a
barbeque, the weather is nice and you are currently within a 3-mile
radius of a Wal-Mart store that has the BBQ cleaner in stock. That’s
scary stuff, but one step at a time, let’s first look at why we have so
much more data than ever before.
In my talks and training sessions on big data I talk about the
‘datafication of the world’. This datafication is caused by a number of
things including the adoption of social media, the digitalisation of
books, music and videos, the increasing use of internet-connected
devices as well as cheaper and better sensors that allow us to
measure and track everything. Just think about it for a minute:
When you were reading a book in the past, no external data was
generated. If you now use a Kindle or Nook device, they track
what you are reading, when you are reading it, how often you
read it, how quickly you read it, and so on.
When you were listening to CDs in the past, no data was
generated. Now we listen to music on an iPhone or digital
music player, and these devices record data on what we
are listening to, when and how often, in what order, etc.
Today, most of us carry smart phones and they are constantly
collecting and generating data by logging our location, tracking
our speed, monitoring what apps we are using as well as who we
are ringing or texting.
Sensors are increasingly used to monitor and capture everything
from temperature to power consumption, from ocean movements
to traffic flows, from dust bin collections to your heart rate. Your
car is full of sensors and so are smart TVs, smart watches, smart
fridges, etc. Take my scales (which I – as a gadget freak – love!),
they measure (and keep a record of) my weight, my % body fat,
my heart rate and even the air quality in our house.
Finally, combine all this now with the billions of internet searches
performed daily, the billions of status updates, wall posts,
comments and likes generated on Facebook each day, the 400+
million tweets sent on Twitter per day and the 72 hours of video
uploaded to YouTube every minute.
I am sure you are getting the point. The volume of data is growing at a
frightening rate. Google’s executive chairman Eric Schmidt sums it
up: “From the dawn of civilisation until 2003, humankind
generated five exabytes of data. Now we produce five exabytes every
two days…and the pace is accelerating.”
Not only do we have a lot of data, we also have a lot of different and
new types of data: text, video, web search logs, sensor data, financial
transactions and credit card payments etc. In the world of ‘Big Data’
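we talk about the 4 Vs that characterize big data: volume (the amount
of data), variety (the different formats), velocity (the speed at which
data is generated and moves) and veracity (the varying quality of the data).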
So, we have a lot of data, in different formats, that is often fast moving
and of varying quality – why would that change the world? The reason
the world will change is that we now have the technology to bring all of
this data together and analyse it.
Classification of analytics:
Big Data refers to massive data sets that cannot be stored, processed, or analyzed using
traditional tools.
Today, there are millions of data sources that generate data at a very rapid rate. These data
sources are present across the world. Some of the largest sources of data are social media
platforms and networks. Let’s use Facebook as an example—it generates more than 500
terabytes of data every day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured
data. For example, in a regular Excel sheet, data is classified as structured data—with a definite
format. In contrast, emails fall under semi-structured, and your pictures and videos fall under
unstructured data. All this data combined makes up Big Data.
Today, Big Data is the hottest buzzword around. With the amount of data being generated every
minute by consumers and businesses worldwide, there is significant value to be found in Big
Data analytics.
Big Data analytics is a process used to extract meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer preferences. Big Data analytics provides
various advantages—it can be used for better decision making, preventing fraudulent activities,
among other things.
In today’s world, Big Data analytics is fueling everything we do online—in every industry.
Take the music streaming platform Spotify for example. The company has nearly 96 million
users that generate a tremendous amount of data every day. Using this information, the cloud-based
platform automatically generates suggested songs through a smart recommendation
engine, based on likes, shares, search history, and more. What enables this are the techniques,
tools, and frameworks that have come out of Big Data analytics.
If you are a Spotify user, then you must have come across the top recommendation section,
which is based on your likes, past history, and other things. This works by using a recommendation engine
that leverages data filtering tools, which collect data and then filter it using algorithms. This
is what Spotify does.
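The collect-then-filter idea can be sketched with a very small user-based collaborative filter. This is only an illustration under stated assumptions, not Spotify's actual method: the users, the songs and the `recommend` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical listening data: user -> set of liked songs.
likes = {
    "ana": {"song_a", "song_b", "song_c"},
    "ben": {"song_a", "song_b", "song_d"},
    "cal": {"song_e"},
}

def recommend(user, likes, top_n=3):
    """Suggest songs liked by users who share at least one like with `user`."""
    own = likes[user]
    scores = Counter()
    for other, songs in likes.items():
        if other == user:
            continue
        overlap = len(own & songs)  # similarity = number of shared likes
        if overlap == 0:
            continue  # ignore users with nothing in common
        for song in songs - own:
            scores[song] += overlap  # weight unseen songs by similarity
    return [song for song, _ in scores.most_common(top_n)]

print(recommend("ana", likes))
```

Here "ben" shares two likes with "ana", so the one song "ben" likes that "ana" has not heard is suggested, while "cal" has nothing in common and contributes nothing.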
Uses and Examples of Big Data Analytics
There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:
Using analytics to understand customer behavior in order to optimize the customer experience
Increasing operational efficiency by understanding where bottlenecks are and how to fix them
These are just a few examples — the possibilities are really endless when it comes to Big Data
analytics. It all depends on how you want to use it in order to improve your business.
The history of Big Data analytics can be traced back to the early days of computing, when
organizations first began using computers to store and analyze large amounts of data. However,
it was not until the late 1990s and early 2000s that Big Data analytics really began to take off, as
organizations increasingly turned to computers to help them make sense of the rapidly growing
volumes of data being generated by their businesses.
Today, Big Data analytics has become an essential tool for organizations of all sizes across a
wide range of industries. By harnessing the power of Big Data, organizations are able to gain
insights into their customers, their businesses, and the world around them that were simply not
possible before.
As the field of Big Data analytics continues to evolve, we can expect to see even more amazing
and transformative applications of this technology in the years to come.
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify
fraudulent activities and discrepancies. The organization leverages it to narrow down a list of
suspects or root causes of problems.
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed
forces across the globe, uses Big Data analytics to analyze how efficient the engine designs are
and if there is any need for improvements.
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the
company leverages it to decide if a particular location would be suitable for a new outlet or not.
They will analyze several different factors, such as population, demographics, accessibility of the
location, and more.
Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. They
monitor tweets to find out their customers’ experience regarding their journeys, delays, and so
on. The airline identifies negative tweets and does what’s necessary to remedy the situation. By
publicly addressing these issues and offering solutions, the airline builds good customer
relations.
Different Types of Big Data Analytics
1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps in creating reports,
like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media
metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization
across its office and lab space. Using descriptive analytics, Dow was able to identify
underutilized space. This space consolidation helped the company save nearly US $4 million
annually.
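As a rough illustration of descriptive analytics, the sketch below summarizes past records into a per-region report. The sales figures and the `summarize` helper are hypothetical and are not taken from the Dow use case.

```python
from statistics import mean

# Hypothetical monthly sales records: (region, month, revenue).
sales = [
    ("North", "Jan", 120.0),
    ("North", "Feb", 150.0),
    ("South", "Jan", 90.0),
    ("South", "Feb", 60.0),
]

def summarize(records):
    """Return total and average revenue per region."""
    by_region = {}
    for region, _month, revenue in records:
        by_region.setdefault(region, []).append(revenue)
    return {
        region: {"total": sum(values), "average": mean(values)}
        for region, values in by_region.items()
    }

summary = summarize(sales)
for region, stats in summary.items():
    print(region, stats)
```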
2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques like drill-down,
data mining, and data discovery are all examples. Organizations use diagnostic analytics
because it provides in-depth insight into a particular problem.
Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form
didn’t load correctly, the shipping fee is too high, or there are not enough payment options
available. This is where you can use diagnostic analytics to find the reason.
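A drill-down of this kind might look like the sketch below, which counts where abandoned carts stopped in the checkout funnel. The step names, the data and the `drill_down` helper are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: the last checkout step each abandoned cart reached.
abandoned_at = ["shipping", "payment", "shipping", "form", "shipping", "payment"]

def drill_down(events):
    """Rank funnel steps by the number of abandoned carts, highest first."""
    return Counter(events).most_common()

ranking = drill_down(abandoned_at)
print(ranking)
```

In this made-up data most carts are abandoned at the shipping step, which would point the investigation at the shipping fee first.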
3. Predictive Analytics
This type of analytics looks into the historical and present data to make predictions of the future.
Predictive analytics uses data mining, AI, and machine learning to analyze current data and make
predictions about the future. It works on predicting customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they have to take to protect their clients
against fraudulent transactions. Using predictive analytics, the company uses all the historical
payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
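The general idea of predicting from historical data can be sketched with a least-squares trend line. This is a minimal example with hypothetical numbers and a hypothetical `fit_line` helper. It is not a fraud model and not PayPal's approach, which the text above does not describe in detail.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predicted sales for month 5
print(forecast)
```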
4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works
with both descriptive and predictive analytics. Most of the time, it relies on AI and machine
learning.
Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of
analytics is used to build an algorithm that will automatically adjust the flight fares based on
numerous factors, including customer demand, weather, destination, holiday seasons, and oil
prices.
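A toy version of the fare example is sketched below: given a predicted demand curve, it picks the fare with the highest expected revenue. The linear demand model, all numbers, and the `expected_revenue` and `best_fare` helpers are assumptions for illustration. Real fare optimization would account for many more factors.

```python
def expected_revenue(fare, base_demand, sensitivity, seats):
    """Revenue under an assumed linear demand model, capped by seat capacity."""
    demand = max(0.0, base_demand - sensitivity * fare)
    return fare * min(demand, seats)

def best_fare(candidates, base_demand, sensitivity, seats):
    """Prescribe the candidate fare with the highest expected revenue."""
    return max(
        candidates,
        key=lambda fare: expected_revenue(fare, base_demand, sensitivity, seats),
    )

# Hypothetical inputs: candidate fares and demand parameters.
fares = [100, 150, 200, 250]
choice = best_fare(fares, base_demand=300, sensitivity=1.0, seats=180)
print(choice)
```

The predictive part is the demand model, and the prescriptive part is the search over possible actions (fares) for the best outcome.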
Spark - used for real-time processing and analyzing large amounts of data
Companies choose modern techniques to handle these large data sets, like
compression, tiering, and deduplication. Compression is employed to reduce
the number of bits within the data, thus reducing its overall size.
Deduplication is the process of removing duplicate and unwanted data from a
data set. Data tiering allows companies to store data in several storage
tiers. It ensures that the data resides in the most appropriate storage
space. Data tiers are often public cloud, private cloud, and flash storage,
depending on the data size and importance. Companies are also opting for Big Data
tools, like Hadoop, NoSQL, and other technologies.
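Deduplication and compression can be illustrated on a small scale with Python's standard library. This is a minimal sketch on hypothetical records, and production systems typically apply these techniques at the storage layer rather than in application code.

```python
import zlib

# Hypothetical records, one of which is an exact duplicate.
records = [b"sensor=1;temp=20", b"sensor=2;temp=21", b"sensor=1;temp=20"]

def deduplicate(items):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

unique = deduplicate(records)

# Compression: a repetitive payload shrinks to far fewer bytes.
raw = b"status=OK;" * 200
compressed = zlib.compress(raw)

print(len(records), "->", len(unique), "records")
print(len(raw), "->", len(compressed), "bytes")

# Compression here is lossless: decompressing restores the original bytes.
restored = zlib.decompress(compressed)
```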
Companies often get confused while selecting the best tool for Big Data
analysis and storage. Is HBase or Cassandra the best technology for data
storage? Is Hadoop MapReduce good enough, or will Spark be a better option
for data analytics and storage? These questions bother companies, and
sometimes they cannot find the answers. They end up making poor
decisions and selecting inappropriate technology. As a result, money, time,
effort, and work hours are wasted.
You can either hire experienced professionals who know much more about these
tools, or go for Big Data consulting. Here, consultants will
recommend the best tools based on your company’s scenario.
Based on their advice, you can work out a strategy and select the best
tool.
Data in a corporation comes from various sources, like social media pages,
ERP applications, customer logs, financial reports, e-mails, presentations, and
reports created by employees. Combining all this data to prepare reports
can be a challenging task. This is an area often neglected by firms.
Data integration is crucial for analysis, reporting, and business intelligence,
so it has to be done well.
Securing Data
Most problems of rising costs can be addressed by continuously
monitoring your infrastructure. Effective DevOps and DataOps practices help
you monitor and manage the data stack and resources you use to store and
manage data, identify savings opportunities, and balance the costs of scaling.
Consider cost early when building a data processing pipeline. Do you hold duplicate
data in multiple stores that doubles your costs? Can you optimize management
costs by tiering your data according to business value? Do you have a habit
of archiving and forgetting data? The answers to these questions can help you
devise a solid strategy and save substantial costs.
Choose an affordable tool that fits your budget. Most cloud-based data stacks
are offered on a pay-as-you-go basis. In other words, your cost is directly
related to the API and data calls and the processing power you use. The range of Big
Data tools is constantly expanding, allowing you to choose and combine
different tools to fit your budget and needs.
Real-Time Insights
A dataset is a treasure trove of insights. But data is worthless
without real understanding derived from it. Now some will define real-time
as instantaneous, while others will think of it as time spent on data extraction
and analysis. However, the key idea is to establish a good understanding to
reap the benefits of activities such as
One of the challenges associated with big data is generating timely reports
and insights. To this end, companies are investing in ETL tools and
analytics with real-time capabilities in order to stay competitive in the
marketplace.
At this point in the evolution of big data, the challenges for most
companies are not related to technology. The biggest impediments to
adoption relate to cultural challenges: organizational alignment,
resistance or lack of understanding, and change management.
Here are some key technologies that enable Big Data for Businesses:
Ref — https://www.marutitech.com/big-data-analytics-will-play-important-role-businesses/
1) Predictive Analytics
2) NoSQL Databases
These databases are utilised for reliable and efficient data management
across a scalable number of storage nodes. NoSQL databases store data
in non-relational forms such as JSON docs or key-value pairings, rather
than as relational database tables.
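The key-value and JSON document models can be imitated in a few lines of Python to show how they differ. This is only an in-memory sketch with hypothetical keys and a hypothetical `find_orders` helper. It does not use any real NoSQL product or its API.

```python
import json

# Key-value model: an opaque value looked up by a unique key.
kv_store = {}
kv_store["session:42"] = "user=ana;cart=3"

# Document model: self-describing JSON documents whose fields may differ.
documents = {}
documents["order:1"] = json.dumps({"customer": "ana", "items": ["tea", "mug"]})
documents["order:2"] = json.dumps({"customer": "ben", "items": ["pen"], "gift": True})

def find_orders(customer):
    """Scan the documents and return the ids of orders placed by `customer`."""
    return sorted(
        doc_id
        for doc_id, raw in documents.items()
        if json.loads(raw)["customer"] == customer
    )

print(kv_store["session:42"])
print(find_orders("ana"))
```

Note that the second order carries a `gift` field the first one lacks, which a fixed relational schema would not allow without altering the table.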
3) Search and Knowledge Discovery
These are tools that allow businesses to mine big data (structured and
unstructured) which is stored on multiple sources. These sources can
be different file systems, APIs, DBMS or similar platforms. With search
and knowledge discovery tools, businesses can isolate and utilise the
information to their benefit.
4) Stream Analytics
6) Distributed Storage
7) Data Virtualization
9) Data Preprocessing
An important parameter for big data processing is the data quality.
Data quality software can conduct cleansing and enrichment of large
data sets by utilising parallel processing. Such software is widely
used for getting consistent and reliable outputs from big data
processing.
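Cleansing records in parallel can be sketched as below. The rows and the `clean_row` helper are hypothetical, and a thread pool is used only to keep the example small. Real data quality tools typically spread this work across processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw rows in "name,age" form with inconsistent formatting.
raw_rows = ["  Alice ,30", "BOB,  ", "carol,41 "]

def clean_row(row):
    """Trim whitespace, normalise the name, and parse the age (None if missing)."""
    name, age = (part.strip() for part in row.split(","))
    return {"name": name.title(), "age": int(age) if age else None}

# Each row is cleaned independently, so the work can be shared among workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(clean_row, raw_rows))  # map keeps input order

print(cleaned)
```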
There’s no doubt that Big Data will continue to play an important role
in many different industries around the world. It can definitely do
wonders for a business organization. In order to reap more benefits,
it’s important to train your employees about Big Data management.
With proper management of Big Data, your business will be more
productive and efficient.