u1 a clsrm
• You can take data from any source and analyze it to find answers that
enable 1) cost reductions, 2) time reductions, 3) new product
development and optimized offerings, and 4) smart decision making.
• When you combine big data with high-powered analytics, you can
accomplish business-related tasks.
Big Data Applications
• Its applications include:
• Healthcare
• Academia
• Banking
• Manufacturing
• IT
• Retail
• Transportation
• Media and Entertainment
• Today, numerous companies are using big data analytics.
Types of digital data
• Structured
• Unstructured
• Semi-structured
• Usually, data is in an unstructured format, which makes extracting
information from it difficult.
• According to Merrill Lynch, 80–90% of business data is either
unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the
whole enterprise data.
Formats of Digital Data
Here is a percent distribution of the three forms of data
• Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly
stored and accessed from a database by simple search engine algorithms. For instance,
the employee table in a company database will be structured as the employee details,
their job positions, their salaries, etc., will be present in an organized manner.
• Unstructured
This data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data.
Email is an example of unstructured data. Structured and unstructured are two
important types of big data.
• Semi-structured
This data combines both of the formats mentioned above, that is,
structured and unstructured data. To be precise, it refers to data that has
not been organized in a particular repository (database), yet contains vital
information or tags that separate individual elements within the data. This concludes
the types of data. Let's discuss the characteristics of data.
Example: data in an XML file
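The three forms of data described above can be contrasted with a minimal sketch. The employee table schema, the XML layout, the sample names, and the email text are all hypothetical and chosen only for illustration; the sketch uses Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: every row follows the same fixed columns (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, position TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (name, position, salary) VALUES (?, ?, ?)",
    [("Asha", "Analyst", 52000.0), ("Ravi", "Engineer", 61000.0)],
)
# A fixed format makes retrieval a simple query.
structured_rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > ? ORDER BY name",
    (55000,),
).fetchall()

# Semi-structured: no fixed schema, but tags separate the individual elements.
# The second record has no <position> and adds a <skills> element.
xml_text = """
<employees>
  <employee id="1"><name>Asha</name><position>Analyst</position></employee>
  <employee id="2"><name>Ravi</name><skills><skill>Python</skill></skills></employee>
</employees>
""".strip()
root = ET.fromstring(xml_text)
semi_structured = [
    {
        "id": e.get("id"),
        "name": e.findtext("name"),
        "position": e.findtext("position"),  # None when the tag is absent
    }
    for e in root.findall("employee")
]

# Unstructured: free text with no tags or columns; even a simple question
# ("does this mention salary?") needs ad hoc text searching.
email_body = "Hi team, Ravi's salary review is due next week. Thanks, Asha"
mentions_salary = "salary" in email_body.lower()

print(structured_rows)
print(semi_structured)
print(mentions_salary)
```

The structured query relies on every row having a salary column, while the XML records may omit or add elements and must be handled field by field.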
History of Big Data innovation
• 90% of the available data has been created in the last two years, and the term Big Data
has been around since 2005, when it was launched by O’Reilly Media.
• The evolution of Big Data includes a number of preliminary steps for its foundation.
• Such steps to the modern conception of Big Data involve the development of computers,
smart phones, the internet, and sensory (Internet of Things) equipment to provide data.
Credit cards also played a role, by providing increasingly large amounts of data, and
certainly social media changed the nature of data volumes in novel and still developing
ways.
The Foundations of Big Data:
• 1937 - The first major data project was created in 1937, ordered by Franklin D.
Roosevelt’s administration in the USA. After the Social Security Act became law in 1935,
the government had to keep track of contributions from 26 million Americans and more
than 3 million employers.
• 1943 - The first data-processing machine appeared in 1943 and was developed by the
British to decipher Nazi codes during World War II. This device, named Colossus, searched
for patterns in intercepted messages at a rate of 5,000 characters per second.
• 1965 - The United States Government decided to build the first data center to store over
742 million tax returns and 175 million sets of fingerprints by transferring all those records
onto magnetic computer tape that had to be stored in a single location.
• 1989 - British computer scientist Tim Berners-Lee invented the World
Wide Web. He wanted to facilitate the sharing of information via a ‘hypertext’
system.
• 1995 - The first supercomputer was built, which was able to do as much work in a
second as a calculator operated by a single person can do in 30,000 years.
• 2005 - The term Big Data was coined by Roger Mougalas in 2005. However,
the application of big data and the quest to understand the available data is
something that has been in existence for a long time. 2005 is also the year
that Hadoop was created by Yahoo, built on top of Google’s MapReduce. Its goal
was to index the entire World Wide Web, and nowadays the open-source Hadoop
is used by a lot of organizations to crunch through huge amounts of data.
• 2010 - Eric Schmidt speaks at the Techonomy conference in Lake Tahoe,
California, and states that "there were 5 exabytes of information created by
the entire world between the dawn of civilization and 2003. Now that same
amount is created every two days."
• 2011 - The McKinsey report states that by 2018 the US will face a shortfall of
between 140,000 and 190,000 professional data scientists, and states that issues
including privacy, security and intellectual property will have to be resolved
before the full value of Big Data will be realized.
• 2014 - The rise of mobile machines: for the first time, more people are
using mobile devices than office or home computers to access digital data. 88%
of business executives surveyed by GE working with Accenture report that big
data analytics is a top priority for their business.
Big Data Platforms
• A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single solution.
3) Google Cloud:
Google Cloud offers lots of big data management tools, each with its own
speciality. BigQuery warehouses petabytes of data in an easily queried
format. Cloud Dataflow analyzes ongoing data streams and batches of historical
data side by side. With Google Data Studio, clients can turn varied data into custom
graphics.
4) Talend:
Talend’s trio of big data integration platforms includes a free basic platform and
two paid subscription platforms, all rooted in open-source tools like Apache Spark.
The paid platforms, though—one designed for existing data, the other for real-time
data streams—come with more power and tech support. Both can clean and parse
data, delete duplicate data and detect fraud automatically, among other functions.
5) Tableau:
The Tableau platform—available on-premises or in the Cloud—allows users to find
correlations, trends and unexpected interdependences between data sets.
The Data Management add-on further enhances the platform, allowing for more
granular data cataloging and the tracking of data lineage.
6) MapR:
MapR’s platform, which they term "dataware," has attracted customers like
American Express and Samsung with its massive capacity (exabytes!) and robust
security measures. It is a dashboard for managing big data spread across various
platforms, clouds, servers and edge-computing devices.
9) Alibaba Cloud:
The leading public cloud provider in China, Alibaba operates in 19 regions
worldwide, including the U.S. Its popular cloud platform offers a variety of
database formats and big data tools, including data warehousing, analytics for
streaming data and speedy Elasticsearch, which can scan petabytes of data
scattered across hundreds of servers in real time.
Drivers for Big data
• Big Data is no longer just a buzzword; it is a proven phenomenon and not likely to
die away soon.
• Two factors have combined to make Big Data especially appealing
• One is that so many potentially valuable data resources have come into
existence. These sources include the telemetry generated by today's smart
devices, the digital footprints left by people who are increasingly living their lives
online, and the rich sources of information commercially available from
specialized data vendors.
• The other factor contributing to Big Data's appeal is the emergence of powerful
technologies for effectively exploiting it. IT organizations can now take advantage
of tools such as Hadoop and NoSQL databases to rationalize, analyze and visualize Big Data in
ways that enable them to quickly separate the actionable insight from the
massive chaff of raw input.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.
Most big data architectures include some or all of the following components:
• Data sources: All big data solutions start with one or more data sources. Examples
include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake. Options for implementing this storage include Azure Data
Lake Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include
running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
• Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data store. Azure Synapse
Analytics provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which
can also be used to serve data for analysis.
• Analysis and reporting: To empower users to analyze the data, the architecture
may include a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It might also support self-service
BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel. Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios, many Azure
services support analytical notebooks, such as Jupyter, enabling these users to
leverage their existing skills with Python or R.
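The flow through these components can be illustrated with a small, single-machine sketch. It is not an implementation of any of the Azure or Hadoop services named above: a temporary log file stands in for the data source, a plain Python function stands in for the batch job, and an in-memory SQLite table stands in for the analytical data store. The log line layout ("<method> <path> <status>"), the file names, and the function names run_batch_job and load_into_store are assumptions made for illustration.

```python
import sqlite3
import tempfile
from collections import Counter
from pathlib import Path


def run_batch_job(source: Path, output: Path) -> Counter:
    """Batch step: read source lines, filter server errors, aggregate per path."""
    errors_per_path = Counter()
    with source.open() as f:
        for line in f:
            parts = line.split()
            # Assumed line layout: "<method> <path> <status>"
            if len(parts) != 3:
                continue  # skip malformed lines
            _method, path, status = parts
            if status.startswith("5"):  # keep only 5xx responses
                errors_per_path[path] += 1
    # Write the prepared data to a new file, as batch jobs typically do.
    with output.open("w") as f:
        for path, count in sorted(errors_per_path.items()):
            f.write(f"{path},{count}\n")
    return errors_per_path


def load_into_store(output: Path, conn: sqlite3.Connection) -> None:
    """Serve the processed data in a structured, queryable form."""
    conn.execute("CREATE TABLE errors (path TEXT PRIMARY KEY, count INTEGER)")
    with output.open() as f:
        rows = [
            (path, int(count))
            for path, count in (line.rstrip("\n").split(",") for line in f)
        ]
    conn.executemany("INSERT INTO errors VALUES (?, ?)", rows)


with tempfile.TemporaryDirectory() as tmp:
    # Data source: a static file such as a web server log (sample data).
    source = Path(tmp) / "access.log"
    output = Path(tmp) / "errors.csv"
    source.write_text(
        "GET /home 200\nGET /cart 500\nPOST /cart 503\nGET /search 502\nbad line\n"
    )
    counts = run_batch_job(source, output)

    # Analytical data store plus a simple analysis query.
    conn = sqlite3.connect(":memory:")
    load_into_store(output, conn)
    worst = conn.execute(
        "SELECT path, count FROM errors ORDER BY count DESC LIMIT 1"
    ).fetchone()
    print(worst)
```

In a real big data architecture each stage would run on distributed storage and compute, but the division of work is the same: read source files, filter and aggregate, write new files, then query the prepared result.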