High Performance Computing (BD4071) - Unit 1 Notes: Introduction

UNIT I INTRODUCTION

The Emerging IT Trends - IOT/IOE - Apache Hadoop for big data Analytics - Big data into big insights and actions -
Emergence of BDA discipline – strategic implications of big data – BDA Challenges – HPC paradigms – Cluster computing
– Grid Computing – Cloud computing – Heterogeneous computing – Mainframes for HPC - Supercomputing for BDA –
Appliances for BDA.
-----------------------------------------------------------------------------------------------------------------------------
The first wave of IT belonged to the era of hardware engineering. A variety of electronic modules (purpose-specific
and generic) were meticulously designed and aggregated in order to fulfil various computing, networking, and
storage requirements. Miniaturization technologies have been playing an indispensable role in shaping the
hardware industry, bringing forth scores of micro- and nano-level components. We are on track towards the era of
ubiquitously deployed, disappearing, and disposable computers.
The second wave of IT heralded a shift from hardware to software. Software engineering started to improve
decisively from there, and today software is pervasive and persuasive. Software brings in the much-needed adaptivity,
modifiability, extensibility, and sustainability. Every tangible thing is being empowered to be smart through the embedding
of software.
The third and current wave of IT began a couple of years ago and is based on the utilization of data (big and fast) to
derive benefits from the advances gained in hardware and software. The capture and investigation of data lead to
actionable and timely insights that can be sagaciously used for realizing smarter applications and appliances.

Thus, data analytics is an endearing and enduring subject of study and research for coming out with viable and
venerable methods for the envisaged smarter planet. Especially considering the fast growth of disparate and distributed
data sources, there is a special liking for data virtualization, processing, mining, analysis, and visualization technologies
that invariably fulfil the goals of knowledge discovery and dissemination. Data-driven insights enable people as well as
systems to take the right decisions in time with clarity and confidence. One can find the most promising trends sweeping the
IT landscape in order to bring in enhanced care, choice, convenience, and comfort for people at large.

The Emerging IT Trends

IT Consumerization The much-discoursed and deliberated Gartner report details the diversity of mobile devices
(smartphones, tablets, wearables, etc.). Increasingly, IT is coming closer to humans. People at any point of time, in any
place, with any device, over any network, and through any media can access and use any remotely held IT resources,
business applications, and data for their personal as well as professional purposes. The massive production of slim and
sleek input/output devices empowers end users to directly connect with and benefit immensely from all the telling
improvements of the IT discipline. The IT consumerization trend has been evolving for some time now and is peaking
these days. That is, IT is steadily becoming an inescapable part of consumers' lives, directly and indirectly. With the
powerful emergence of "bring your own device (BYOD)", the need for robust and resilient mobile device management
software solutions is being felt and insisted upon across the board.
Another aspect is the emergence of next-generation mobile applications and services across a variety of business verticals.
There is a myriad of mobile applications, maps, and service development/delivery platforms, programming and mark-up
languages, architectures and frameworks, tools, containers, and operating systems in the fast-moving mobile space.
Precisely speaking, IT is moving from enterprise centricity to consumer orientation.

IT Commoditization Commoditization is another cool trend sweeping the IT industry. With the huge acceptance and
adoption of cloud computing and big data analytics, the value of commodity IT is decisively on the rise. Typically, the
embedded intelligence is being consciously abstracted out of hardware boxes and appliances in order to make hardware
modules voluminously manufactured and easily and quickly used. Infrastructure affordability is another important need
being realized with this meticulous segregation, and the perennial problem of vendor lock-in is steadily eased out. Any
product can be replaced and substituted by a similar device from another manufacturer. With the consolidation,
centralization, and commercialization of IT infrastructures heating up, the demand for commoditized hardware goes up
furiously. All kinds of IT infrastructures (server machines, storage appliances, network solutions such as routers, switches,
load balancers, firewall gateways, etc.) are being commoditized with the renewed focus on IT industrialization.
Commoditization through virtualization and containerization is pervasive and persuasive. Therefore, the next-generation IT
environment is specifically software defined to bring in programmable and policy-based hardware systems in plenty.

The Ensuing Era of Appliances The topic of hardware engineering is seeing a lot of hitherto unheard-of improvements.
Undoubtedly, appliances are the recent hit in the IT market. All the leading vendors are investing their treasure, time, and
talent in unearthing next-generation, smartly integrated systems (compute, storage, network, virtualization, and
management modules) in the form of instant-on appliances. IT appliances are fully customized and configured in the factory
itself so that they can just be turned on to get to work at the customer location in minutes and hours rather than days. In
order to incorporate as much automation as possible, there are strategic appliance-centric initiatives on producing pre-
integrated, pre-tested, and tuned converged IT stacks. For example, FlexPod and VCE are leading the race in the
converged IT solution category. Similarly, there are expertly integrated systems such as IBM PureFlex System, PureApplication
System, and PureData System. Further on, Oracle engineered systems, such as Oracle Exadata Database
Machine and Exalogic Elastic Cloud, are gaining ground in the competitive market.

Infrastructure Optimization and Elasticity The entire IT stack has been undergoing a makeover periodically. Especially
on the infrastructure front, due to the closed, inflexible, and monolithic nature of conventional infrastructure, there are
concerted efforts being undertaken by many in order to untangle it into modular, open, extensible, converged, and
programmable infrastructures. Another worrying factor is the underutilization of expensive IT infrastructures (servers,
storage, and networking solutions). With IT becoming ubiquitous for automating most of the manual tasks in different
verticals, the problem of IT sprawl is bound to go up, and these resources are mostly underutilized and sometimes even
unutilized for a long time. Having understood these prickling issues pertaining to IT infrastructures, the concerned parties
have plunged into unveiling versatile and venerable measures for enhanced utilization and for infrastructure optimization.
Infrastructure rationalization and simplification are related activities. That is, next-generation IT infrastructures are being
realized through consolidation, centralization, federation, convergence, virtualization, automation, and sharing. To bring in
more flexibility, software-defined infrastructures are being prescribed these days.
With the faster spread of big data analytical platforms and applications, commodity hardware is being insisted upon to
accomplish data- and process-intensive big data analytics quickly and cheaply. That is, we need low-priced infrastructures
with supercomputing capability and near-infinite storage. The answer is that all kinds of underutilized servers are collected
and clustered together to form a dynamic and huge pool of server machines to efficiently tackle the increasing and
intermittent needs of computation. Precisely speaking, clouds are the new-generation infrastructures that fully comply with
these expectations elegantly and economically. The cloud technology, though not a new one, represents a cool and compact
convergence of several proven technologies to create a spellbinding impact on both business and IT in realizing the dream
of virtual IT that in turn blurs the distinction between the cyber and the physical worlds. This is the reason for the exponential
growth being attained by the cloud paradigm. That is, the tried and tested technique of "divide and conquer" in software
engineering is steadily percolating to hardware engineering. Decomposition of physical machines into a collection of sizable
and manageable virtual machines and composition of these virtual machines based on the computing requirement are the
essence of cloud computing.
Finally, software-defined cloud centres will see the light of day soon with the faster maturity and stability of competent
technologies towards that goal. There are still some critical inflexibility, incompatibility, and tight-dependency issues among
various components in cloud-enabled data centres; thus, full-fledged optimization and automation are not yet possible within
the current setup. To attain the originally envisaged goals, researchers are proposing to incorporate software wherever
needed in order to bring in the desired separations so that a significantly higher utilization is possible. When the utilization
goes up, the cost is bound to come down. In short, the target of infrastructure programmability can be met with the
embedding of resilient software so that infrastructure manageability, serviceability, and sustainability tasks become
easier, more economical, and quicker.

The Growing Device Ecosystem The device ecosystem is expanding incredibly fast; thereby, there is a growing array of
fixed, portable, wireless, wearable, nomadic, implantable, and mobile devices (medical instruments, manufacturing and
controlling machines, consumer electronics, media players, kitchenware, household utensils, equipment, appliances,
personal digital assistants, smartphones, tablets, etc.). Trendy and handy, slim and sleek personal gadgets and gizmos are
readily appealing and eye-catching to people today. With the shine of miniaturization technologies such as
MEMS, nanotechnology, SoC, etc., the power and smartness of devices are on the climb. IBM has stunningly characterized
the device world with three buzzwords (instrumented, interconnected, and intelligent). Connected devices are innately
proving to be highly beneficial for their owners. Machine-to-machine (M2M) communication enables machines, instruments,
equipment, and any other devices to be self-aware as well as situation and surroundings aware. Cloud-enabled devices are
extremely versatile in their operations, outputs, outlooks, and offerings. For example, a cloud-enabled microwave device
downloads the appropriate recipe for a particular dish from the Web and accordingly performs the required steps in an
automated fashion. Similarly, multifaceted sensors and actuators are attached to ordinary devices to make them
extraordinary in their decisions and deeds.
The implications of the exorbitant rise in newer devices for different environments (smart homes, hospitals, hotels,
etc.) are clearly visible. On the data generation front, the volume of machine-generated data is higher and heavier than
man-generated data. This clearly vouches for the point that there is a direct proportion between data growth and new devices.
The explosion of connected devices is unprecedented, and these special entities are slated to be in the billions in the years
ahead. However, the number of digitized elements will easily be in the trillions. That is, all kinds of casual items in our midst
are bound to become smart enough through the methodical execution of tool-supported digitization processes. Another vital
trend is that everything is service enabled with the surging popularity of the RESTful service paradigm. Every tangible
element is being service enabled in order to share its unique capabilities and to leverage others' capabilities
programmatically. Thus, connectivity and service enablement go a long way in facilitating the production of networked,
resource-constrained, and embedded systems in greater volumes.
When common things are digitized, standalone objects are interlinked with one another via networks, and every
concrete thing gets service enabled, there will be big interactions, transactions, and collaborations resulting in big data
that in turn lead to big discovery. All these clearly portray one thing: the data volume is big and has to be transmitted and
analysed at an excruciating speed to come out with actionable insights. Having a large number of smart and sentient objects
and enabling them to be interconnected and intelligent in their requests and responses is being pursued with a vengeance
these days. Further on, all kinds of physical devices being emboldened with remotely hosted software applications and
data are bound to go a long way in preparing the devices in our everyday environments to be active, assistive, and articulate.
That is, a cornucopia of next-generation people-centric services can be produced to provide enhanced care, comfort,
choice, and convenience to human beings.
People can track the progress of their fitness routines. Taking decisions becomes an easy and timely affair with the
prevalence of connected solutions that benefit knowledge workers immensely. All the secondary and peripheral needs will
be accomplished in an unobtrusive manner, enabling people to nonchalantly focus on their primary activities. However, there
are some areas of digitization that need attention, one being energy efficiency. Green solutions and practices are being
insisted upon everywhere these days, and IT is one of the principal culprits in wasting a lot of energy due to the pervasiveness
of IT servers and connected devices. Data centres consume a lot of electricity, so green IT is a hot subject for study and
research across the globe. Another area of interest is remote monitoring, management, and enhancement of the
empowered devices. With the number of devices in our everyday environments growing at an unprecedented scale, their
real-time administration, configuration, activation, monitoring, management, and repair (if any problem arises) can be eased
considerably with effective remote correction competency.

Extreme Connectivity The connectivity capability has risen dramatically and become deeper and extreme. The kinds of
network topologies are consistently expanding and empowering their participants and constituents to be highly productive.
There are unified, ambient, and autonomic communication technologies from research organizations and labs drawing the
attention of executives and decision makers. All kinds of systems, sensors, actuators, and other devices are empowered
to form ad hoc networks for accomplishing specialized tasks in a simpler manner. There is a variety of network and
connectivity solutions in the form of load balancers, switches, routers, gateways, proxies, firewalls, etc. For providing higher
performance, network solutions are being delivered in appliance (software as well as hardware) mode.
Device middleware or Device Service Bus (DSB) is the latest buzzword, enabling seamless and spontaneous
connectivity and integration between disparate and distributed devices. That is, device-to-device (in other words, machine-
to-machine (M2M)) communication is the talk of the town. The interconnectivity-facilitated interactions among diverse
categories of devices precisely portend a litany of supple, smart, and sophisticated applications for people. Software-defined
networking (SDN) is the latest technological trend captivating professionals to have a renewed focus on this emerging
yet compelling concept. With clouds being strengthened as the core, converged, and central IT infrastructure, device-to-
cloud connections are fast materializing. This local as well as remote connectivity empowers ordinary articles to become
extraordinary objects by being distinctly communicative, collaborative, and cognitive.

The Trait of Service Enablement Every technology invariably pushes for its adoption. Internet computing forced
Web enablement, which is the essence behind the proliferation of Web-based applications. Now, with the pervasiveness
of sleek, handy, and multifaceted mobiles, every enterprise and Web application is being mobile enabled. That is, all
kinds of local and remote applications are being accessed through mobiles on the move, thus fulfilling real-time interactions
and decision-making economically. With the overwhelming approval of the service idea, every application is being service
enabled. That is, we often read, hear, and feel service-oriented systems. The majority of next-generation enterprise-scale,
mission-critical, process-centric, and multipurpose applications are being assembled out of multiple discrete and complex
services.
Not only applications but also physical devices at the ground level are being seriously service enabled in order to
uninhibitedly join in mainstream computing tasks and contribute to the intended success. That is, devices, individually
and collectively, could become service providers or publishers, brokers and boosters, and consumers. The prevailing and
pulsating idea is that any service-enabled device in a physical environment could interoperate with others in the vicinity as
well as with remote devices and applications. Services abstract and expose only specific capabilities of devices
through service interfaces, while service implementations are hidden from user agents. Such kinds of smart separations
enable any requesting device to see only the capabilities of target devices and then connect, access, and leverage those
capabilities to achieve business or people services. Service enablement eliminates these dependencies and
deficiencies so that devices can interact with one another flawlessly and flexibly.
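To make the idea of exposing only a device's capabilities through a service interface concrete, here is a minimal, purely illustrative Python sketch using the Flask micro-framework. The Thermostat class, the endpoint names, and the port are assumptions for illustration, not anything prescribed by the text.

```python
# A minimal sketch of service-enabling a device with a RESTful interface.
# The Thermostat class and endpoint names are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

class Thermostat:
    """Hypothetical device; only its capabilities are exposed, not its internals."""
    def __init__(self):
        self._setpoint = 22.0          # internal state hidden from clients

    def read_temperature(self):
        return 21.5                    # a real device would query its sensor here

    def set_target(self, value):
        self._setpoint = float(value)
        return self._setpoint

device = Thermostat()

@app.route("/capabilities", methods=["GET"])
def capabilities():
    # The service interface advertises what the device can do, not how it does it
    return jsonify({"capabilities": ["temperature", "target"]})

@app.route("/temperature", methods=["GET"])
def temperature():
    return jsonify({"celsius": device.read_temperature()})

@app.route("/target", methods=["PUT"])
def target():
    value = request.get_json().get("celsius")
    return jsonify({"celsius": device.set_target(value)})

if __name__ == "__main__":
    app.run(port=8080)
```

Any requesting device or application only sees the capability endpoints; the device's internal implementation can change without affecting its consumers, which is the separation the paragraph above describes.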
IOT/IOE : The Internet of Things (IoT)/Internet of Everything (IoE)

Originally, the Internet was the network of networked computers. Then, with the heightened ubiquity and utility of wireless
and wired devices, the scope, size, and structure of the Internet changed to what it is now, making the Internet of
Devices (IoD) concept a mainstream reality. With the service paradigm being positioned as the most optimal, rational, and
practical way of building enterprise-class applications, a gamut of services (business and IT) are being built by many,
deployed on Web and application servers, and delivered to everyone via an increasing array of input/output
devices over networks. The increased accessibility and auditability of services have propelled interested software architects,
engineers, and application developers to realize modular, scalable, and secure software applications by choosing and
composing appropriate services from service repositories quickly. Thus, the Internet of Services (IoS) idea is fast
growing. Another interesting phenomenon getting the attention of the press these days is the Internet of Energy. That is, our
personal as well as professional devices get their energy through their interconnectivity. Figure 1.1 illustrates how
different things are linked with one another in order to conceive, concretize, and deliver futuristic services for mankind
(Distributed Data Mining and Big Data, a Vision paper by Intel, 2012).

Fig. 1.1 The evolution of the data analytic world

As digitization gains more accolades and success, all sorts of everyday objects are being
connected with one another as well as with scores of remote applications in cloud environments. That is, everything is
becoming a data supplier for next-generation applications, thereby becoming an indispensable ingredient individually
as well as collectively in consciously conceptualizing and concretizing smarter applications. There are several promising
implementation technologies, standards, platforms, and tools enabling the realization of the IoT vision. The probable outputs
of the IoT field are a cornucopia of smarter environments such as smarter offices, homes, hospitals, retail, energy,
government, cities, etc. Cyber-physical systems (CPS), ambient intelligence (AmI), and ubiquitous computing (UC) are
some of the related concepts encompassing the ideals of IoT. The other related terms are the Industrial Internet of Things,
the Internet of Important Things, etc.
In the upcoming era, unobtrusive computers, communicators, and sensors will facilitate decision-making in a
smart way. Computers of different sizes, looks, capabilities, and interfaces will be fitted, glued, implanted, and inserted
everywhere to be coordinative, calculative, and coherent. The intervention and involvement of humans in operationalizing
these smart and sentient objects will be almost nil. With autonomic IT infrastructures, more intensive automation is bound to
happen. The devices will also be handling all kinds of everyday needs, with humanized robots extensively used in order to
fulfil our daily physical chores. With the emergence of specific devices for different environments, there will similarly be
hordes of services and applications becoming available for making the devices smarter, which will in turn make our lives more
productive.
During the early days, many people were using one mainframe system for their everyday compute needs. Today
everyone has his/her own compute system to assist in his/her information needs and knowledge work. Increasingly,
multiple devices in and around us assist us in fulfilling our compute, communication, content, and cognition needs. Not only
that, the future IT is destined to provide us with a variety of context-aware, insight-driven, real-time, and people-centric
physical services. IBM, in its visionary and seminal article, has articulated that in the future every system will be instrumented,
interconnected, and intelligent in its obligations and operations. The budding technology landscape is trending towards
making every common thing smart, every device smarter, and every human being the smartest.
On another note, the concept of service orientation is all set to be conspicuous. That is, every tangible thing is going
to be service centric in the form of service provider, broker, booster, consumer, auditor, etc. Services are going to be
ambient, articulate, and adaptive. With micro-services becoming all-encompassing and with containers emerging as the best-in-
class runtime environment for micro-services, the activities of service crafting, shipping, deploying, delivery, management,
orchestration, and enhancement are going to be greatly simplified. Further on, every system is to be ingrained with the right
and relevant smartness so that our environments (personal as well as professional), social engagements, learning, item
purchasing, decision-making, scientific experiments, project execution, and commercial activities are bound to exhibit
hitherto unheard-of smartness during their accomplishment and delivery. The easily embeddable smartness through a surfeit
of pioneering technologies is to result in smarter homes, hotels, hospitals, governments, retail, energy, healthcare, etc.
Lastly, as energy efficiency is being insisted upon everywhere as one of the prime requirements for the sustainability
target, the worldwide luminaries, visionaries, and stalwarts are focusing on unearthing green, lean, and clean technologies
and applying them in a methodical manner to make our digital assistants and artifacts power aware. In short, everything is
to be self-aware as well as situation and surroundings aware in order to be cognizant and cognitive in all things for which
they are artistically ordained. Thus, the service-oriented nature, the ingraining of smartness, and the guarantee of sustainability
are part and parcel of everything in our increasingly digital living.

Apache Hadoop for big data analytics

In simple terms, the data growth has been phenomenal due to a series of innovations happening together. An astronomical
amount of data is getting collected and stored in large-scale data storage appliances and networks. Here are a few sample
titbits on the worldwide data growth prospects. Very precisely, EMC and IDC have been tracking the size of the "Digital
Universe (DU)." In the year 2012, both EMC and IDC estimated that the DU would double every two years to reach 44
zettabytes (ZB) by the year 2020. The year 2013 actually generated 4.4 ZB of DU, which is broken down into 2.9 ZB generated
by consumers and 1.5 ZB generated by enterprises. EMC and IDC forecast that the IoT will grow to 32 billion connected
devices by 2020, contributing 10% of the DU. Cisco's large-scale data tracking project focuses on data center and
cloud-based IP traffic, estimating the 2013 amount at 3.1 ZB/year (1.5 ZB in "traditional" data centers, 1.6 ZB in cloud data
centers). By 2018, this is predicted to rise to 8.6 ZB, with the majority of growth happening in cloud data centers. Precisely
speaking, the deeper connectivity and service enablement of all kinds of tangible things in our everyday locations open up
the path-breaking possibility of data-driven insights and insight-driven decisions.
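As a quick back-of-the-envelope check of the "doubling every two years" projection quoted above (a small illustrative calculation, not from the source), one can start from the 4.4 ZB measured for 2013 and apply the doubling rule until 2020:

```python
# Projecting the digital universe from the 2013 figure under the stated doubling rule.
start_year, start_zb = 2013, 4.4
target_year = 2020
doublings = (target_year - start_year) / 2        # 3.5 doublings in 7 years
projected_zb = start_zb * 2 ** doublings          # ~49.8 ZB
print(f"Projected digital universe in {target_year}: {projected_zb:.1f} ZB")
```

The result lands in the same ballpark as the 44 ZB figure cited by EMC and IDC, which is the order-of-magnitude growth driving the infrastructure concerns discussed in this unit.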
With the Web-scale interactions being facilitated by Yahoo, Google, Facebook, and other companies, the amount
of data being collected routinely would have easily overwhelmed the capabilities of the traditional IT architectures of these
companies. The need for fresh, elastic, and dynamic architectures was felt, and hence the emergence of Hadoop is
being felicitated and celebrated widely.
Apache Hadoop is an open-source distributed software platform for efficiently storing and processing data. Hadoop software
runs on a cluster of industry-standard and commodity servers configured with direct-attached storage (DAS). On the storage
front, Hadoop could store petabytes of data reliably on tens of thousands of servers and enables horizontal scalability of
inexpensive server nodes dynamically and cost-effectively in order to guarantee the elasticity mandated for big and fast
data analytics.
MapReduce is the core and central module that simplifies the scalability aspect of Apache Hadoop. MapReduce helps
programmers a lot in subdividing data (static as well as streaming) into smaller and manageable parts that are processed
independently. The parallel and high-performance computing complexities are substantially lessened through the
leverage of the hugely popular MapReduce framework. It takes care of intra-cluster communication, task monitoring and
scheduling, load balancing, fault and failure handling, etc. MapReduce has been modernized in the latest versions of
Hadoop, and YARN is the new incarnation, blessed with additional modules to bring in more automation (such as cluster
management) and to improve fault handling.
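As a minimal illustration of the map and reduce split described above, the following pair of Python scripts (the file names mapper.py and reducer.py are assumptions) implement a word count. Such scripts can be run on a Hadoop cluster through the Hadoop Streaming utility, which pipes HDFS data through arbitrary executables via standard input and output.

```python
# mapper.py -- emits a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts per word; Hadoop delivers the mapper output
# sorted by key, so equal words arrive contiguously
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Each mapper works independently on its own input split, which is what lets the framework spread the load across the cluster; the framework's sort-and-shuffle step then routes all values for a key to one reducer.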

Fig. 1.2 The HDFS reference architecture


The other major module of the Apache Hadoop platform is the Hadoop Distributed File System (HDFS), which is
designed from the ground up for ensuring scalability and fault tolerance. HDFS stores large files by dividing them into
blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers for ensuring high data availability. HDFS
(as illustrated in Fig. 1.2) provides well-written APIs for MapReduce applications to read and write data in parallel. DataNodes
can be incorporated at run time to maintain the performance mandate. A separate node (the NameNode) is allocated for managing
data placement and monitoring server availability. As illustrated above through a pictorial representation (the source is
a white paper published by Fujitsu), HDFS clusters easily and reliably hold petabytes of data on thousands of nodes.
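As a small worked example of the block-and-replication scheme just described (assuming the 128 MB block size and threefold replication mentioned above), the following sketch estimates how a hypothetical 1 TB file would be laid out in HDFS:

```python
# Estimating the HDFS layout of a hypothetical 1 TB file.
import math

block_size_mb = 128
replication = 3
file_size_mb = 1 * 1024 * 1024                                 # 1 TB expressed in MB

blocks = math.ceil(file_size_mb / block_size_mb)               # 8192 logical blocks
replicas = blocks * replication                                # 24576 physical block replicas
raw_capacity_tb = file_size_mb * replication / (1024 * 1024)   # 3 TB of raw disk consumed
print(f"{blocks} blocks, {replicas} replicas, {raw_capacity_tb:.0f} TB raw capacity")
```

The threefold replication is what lets HDFS survive DataNode failures without data loss, at the cost of tripling the raw storage requirement.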
In addition to MapReduce and HDFS, Apache Hadoop includes many other vital components, as explained in the box
below.

Ambari A Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides
a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually,
along with features to diagnose their performance characteristics in a user-friendly manner.

Avro It serves for serializing structured data. Structured data is converted into bit strings and efficiently deposited in HDFS
in a compact format. The serialized data contains information about the original data schema. By means of NoSQL
databases such as HBase and Cassandra, large tables can be stored and accessed efficiently.

Chukwa A data collection system for managing large distributed systems. Chukwa monitors large Hadoop environments.
Logging data is collected, processed, and visualized.

HBase A scalable and distributed database that supports structured data storage for large tables.

Mahout A Scalable machine learning and data mining library.


Pig A high-level data-flow language and execution framework for parallel computation. It includes a language, Pig Latin,
for expressing these data flows. Pig Latin includes operators for many of the traditional data operations
(join, sort, filter, etc.) as well as the ability for users to develop their own functions for reading, processing, and writing data.
Pig runs on Hadoop and makes use of both HDFS and MapReduce.

Apache Flume It is a distributed system for collecting, aggregating, and moving large amounts of data from multiple sources
into HDFS. With data sources multiplying and diversifying, the role and responsibility of the Flume software grow. Flume is
suited in particular for importing data streams, such as Web logs or other log data, into HDFS.

Apache Sqoop It is a tool for transferring data between Hadoop and the traditional SQL databases. You can use Sqoop to
import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export
the data back into an RDBMS.

Apache Hive It offers a simpler, SQL-like way of writing MapReduce applications. HiveQL is a dialect of SQL and
supports a subset of the SQL syntax. Hive is being actively enhanced to enable low-latency queries on
Apache HBase and HDFS.

ODBC/JDBC Connectors ODBC/JDBC Connectors for HBase and Hive are the components included in Hadoop
distributions. They provide connectivity with SQL applications by translating standard SQL queries into HiveQL commands
that can be executed upon the data in HDFS or HBase.

Spark Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce
does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and
inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis
(Spark Streaming), it can be interactively used to quickly process and query big data sets.
To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark
interactively from the Scala and Python shells to rapidly query big data sets. Spark is also the engine behind
Shark, a fully Apache Hive-compatible data warehousing system that can run 100× faster than Hive.
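To illustrate the concise, functional style attributed to Spark above, here is a minimal word-count sketch in PySpark; the input path and local execution are assumptions for illustration only.

```python
# A minimal PySpark sketch (assumes a local Spark installation and a file input.txt).
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("input.txt")               # read the data set as an RDD of lines
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))    # aggregate counts per word in memory

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```

Because the intermediate results stay in memory across these chained operations, the same logic that would take several MapReduce passes on disk can run considerably faster, which is the advantage the text highlights.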

JAQL It is a functional, declarative programming language designed especially for working with large volumes of structured,
semi-structured, and unstructured data. The primary use of JAQL is to handle data stored as JSON documents, but JAQL
can work on various types of data. For example, it can support XML, comma-separated values (CSV) data, and flat files.
An “SQL within JAQL” capability lets programmers work with structured SQL data while employing a JSON data model
that’s less restrictive than its SQL counterparts. Specifically, JAQL allows you to select, join, group, and filter data that is
stored in HDFS, much like a blend of Pig and Hive. JAQL’s query language was inspired by many programming and query
languages, including Lisp, SQL, XQuery, and Pig.
Tez A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible
engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted
by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g., ETL tools), to
replace Hadoop MapReduce as the underlying execution engine.
ZooKeeper A high-performance coordination service for distributed applications.

Big Data into Big Insights and Actions

We have talked about the data explosion and the hard realization that, if data is crunched deftly, a free flow of
actionable insights can empower decision makers to take precise and perfect decisions well ahead of time, helped
by the availability of competent analytical platforms and optimized infrastructures. In this section, let us discuss why
organizations are hell-bent on adopting the technology advancements being achieved in the analytical space in order to stay
ahead. With big and fast data analytics becoming common and casual, IT systems become more reliable and resilient. With
this IT resurgence and versatility, business efficiency and adaptivity are set to rise significantly.

Customer Centricity IT has been doing a good job of bringing in highly visible and venerable business automation,
acceleration, and augmentation. However, in the recent past, technologies are becoming more people centric. Business
houses and behemoths across the globe, having captured the essence of IT well in their journey previously, are renewing
their focus and re-strategizing to have greater impacts on the elusive goal of customer delight through a variety of captivating
premium offerings with all the quality of service (QoS) attributes implicitly embedded. Traversing extra miles for maintaining
customer relationships intact, precisely capturing their preferences, and bringing forth fresh products and services to retain
their loyalty and to attract newer customers are some of the prickling challenges before enterprises to retain their businesses
ahead of immediate competitors. Personalized services, multichannel interactions, transparency, timeliness, resilience, and
accountability are some of the differentiating features for corporates across the globe to be ensured for their subscribers,
employees, partners, and end users. People data is the most sought after for understanding and fulfilling customers’ wishes
comprehensively.
There are social sites and digital communities enabling different sets of people to express their views and
feedback on various social concerns, complaints, product features, thoughts, musings, knowledge sharing, etc. The recent
technologies have the wherewithal to enable various customer analytics needs. The first and foremost aspect of big data
analytics is social data and its offshoots such as a 360° view of customers, social media and networking analytics, sentiment
analytics, etc. With businesses and customers synchronizing more closely, the varying expectations
of customers are resolutely getting fulfilled.

Operational Excellence The second aspect of big data analytics is machine data. Machines produce a huge amount
of data to be systematically captured and subjected to a series of deeper and decisive investigations to squeeze out tactical
as well as strategic insights to enable every kind of machine to contribute to its fullest capacity and capability.
Business agility, adaptivity, and affordability through the promising IT innovations are gaining momentum. IT
systems are pruned further and prepared to be extra lean and agile, extensible, and adaptive in their structural as well as
behavioural aspects. There is a continued emphasis in closing down the gaps between business and IT. Operational data
are being captured meticulously in order to extract all kinds of optimization tips to keep IT systems alive. Any kind of
slowdown and breakdown is being proactively nipped in the bud in order to continuously fulfil customers’ needs.
Thus, highly optimized and organized IT in synchronization with customer orientation does a lot for any kind of
enterprising initiative to achieve its originally envisaged success. The new-generation analytical methods come in handy in
visualizing opportunities and completing them with ease.

Fig. 1.3 The big picture

Data center executives face several major big data-induced problems and challenges. The current compute, storage, and
network infrastructures are bound to face capacity and capability problems with the more generic and specific usage of big
and fast data for knowledge discovery. Capacity planning is a key issue. Scalability and availability of IT resources are very
important in the big data world. Therefore, the cloud idea captured the imagination of people widely in order to have highly
efficient, consolidated, centralized and federated, converged, mixed (virtualized and bare metal) servers; automated,
shared, and orchestrated IT centers; and server farms. And in the recent past, software-defined data centers are becoming
a reality with the availability of software-defined compute, storage, networking, and security
solutions.
A white paper titled “Big Data Meets Big Data Analytics” by SAS indicates that there are three key technologies
that can help you get a handle on big data and even more importantly extract meaningful business value from it:
• Information Management for Big Data – Capturing, storing, and managing data as a strategic asset for business
empowerment
• High-Performance Analytics for Big Data – Leveraging data using high-performance IT for extracting real-time and real-
world insights to solve a variety of professional, social, as well as personal problems
• Flexible Deployment Options for Big Data – Choosing better options (on premise (private cloud), off premise (public cloud),
or even hybrid cloud) for big data and analytics

The Big Picture With the cloud space fast growing as the optimized, automated, policy-based, software-defined, and
shared environment comprising highly sophisticated and synchronized platforms and infrastructure for application
development, deployment, integration, management, and delivery, the integration requirement too has grown deeper and
broader, as pictorially illustrated in Fig. 1.3. Thus, there will be extreme integration among all kinds of entities and
elements in the physical world and software services in the cyber world in order to have a growing array of versatile and
vivacious applications for humanity at large.
All kinds of physical entities at the ground level will have purpose-specific interactions with services and data hosted
on the enterprise as well as cloud servers and storage appliances to enable scores of real-time and real-world applications
for the society. This extended and enhanced integration would lead to data deluges that have to be accurately and
appropriately subjected to a variety of checks to promptly derive actionable insights that in turn enable institutions,
innovators, and individuals to be smarter and speedier in their obligations and offerings.

The Emergence of the Big Data Analytics (BDA) Discipline

Lately, several data- and process-intensive workloads are emerging and evolving fast. This natural phenomenon makes
professionals and professors ponder the means of efficiently running them to extract the expected results in time.
The HPC paradigm is being mandated in order to tackle the challenges being posed by this special category of applications.
With the heightened complexity, the leverage of special infrastructures and platforms in association with adaptive processes
is being insisted upon for cost-effective BDA.
High-performance systems are being mandated here in order to efficiently crunch the large amount of data emanating from
different and distributed sources. Therefore, HPC acquires special significance with the maturity and stability of big data
computing. For the forthcoming era of big data-inspired insights, HPC is bound to play a very prominent and productive role
in fulfilling the hard requirements of capturing, indexing, storing, processing, analysing, and mining big data in an efficient
manner.

The Intricacies of the “Big Data Paradigm”


• The data volume gets bigger (in the range of tera-, peta-, and exabytes).
• The data generation, capture, and crunching frequency have gone up significantly (the velocity varies from batch
processing to real time).
• The data structure too has got diversified (poly-structured data).

That is, data structure, size, scope, and speed are on the rise. Big data and large-scale analytics are increasingly important
for optimizing business operations, enhancing outputs and rolling out newer offerings, improving customer relations, and
uncovering incredible opportunities. BDA therefore has greater benefits for businesses to explore fresh avenues for
additional revenues. In addition, BDA has long-lasting implications on IT infrastructures and platforms as enlisted below.

• The data virtualization, transformation, storage, visualization, and management


• Pre-processing and analysing big data for actionable insights
• Building insight-driven applications

The Key Drivers for Big Data There are a number of new technologies and tools constantly coming up in order to fulfill
the varying expectations of businesses as well as people. The prominent ones among them are:
• Digitization through edge technologies
• Distribution and federation
• Consumerization (mobiles and wearables)
• Centralization, commoditization, and industrialization (cloud computing)
• Ambient, autonomic, and unified communication and ad hoc networking
• Service paradigm (service enablement of everything through RESTful APIs)
• Social computing and ubiquitous sensing, vision, and perception
• Knowledge engineering
• The Internet of Things (IoT)
In short, enabling common things to be smart and empowering all kinds of physical, mechanical, and electrical systems
and electronics to be smarter through instrumentation and interconnectivity are the key differentiators for the ensuing
smarter world. The resulting sentient objects (trillions of digital objects) and smarter devices (billions of connected devices),
in sync with the core and central IT infrastructure (the cloud), are to produce a wider variety and a larger amount of data that
in turn has the wherewithal to formulate and facilitate the implementation and delivery of situation-aware and sophisticated
services for humanity ultimately. The promising and potential scenarios are as follows: sensors and actuators are to be
found everywhere, and machine-to-machine (M2M) communication is to pour out voluminous data which is to be carefully
captured and cognitively subjected to a number of decisive investigations to emit useful knowledge; this underpins the raging
concepts of the Internet of Things (IoT), cyber-physical systems (CPS), smart environments, etc.
Similarly, the cloud enablement of every tangible thing in our midst is turning out to be a real game changer for every
individual to be the smartest in his or her daily deeds, decision-making capabilities, and deals. Thus, with data being categorized
as a strategic asset for any organization and individual across the globe, purpose-specific HPC resources are being
given prime importance to turn data into information and to generate pragmatic knowledge smoothly.

The Strategic Implications of Big Data

1. The Unequivocal Result: The Data-Driven World

• Business transactions, interactions, operations, master, and analytical data


• Log files of system and application infrastructures (compute, storage, network, application, Web, and database servers,
etc.)
• Social and people data and weblogs
• Customer, product, sales, and other business-critical data
• Sensor and actuator data
• Scientific experimentation and observation data (genetics, particle physics, financial and climate modelling, drug
discovery, etc.)

2. Newer Kinds of Analytical Approaches

There are several fresh analytical methods emerging based on big data, as enlisted below. Big data analytics is the new
tipping point for generating and using real-time and actionable insights to act upon with all confidence and clarity. For
example, there will be a substantial renaissance in predictive and prescriptive analytics for any enterprise to correctly and
cogently visualize the future and accordingly set everything in place to reap all the visible and invisible benefits (business
and technical).
Generic (horizontal) analytics: real-time analytics, predictive analytics, prescriptive analytics, high-performance analytics,
diagnostic analytics, and declarative analytics.
Specific (vertical) analytics: social media analytics, operational analytics, machine analytics, retail and security analytics,
sentiment analytics, and financial analytics.
3. Next-Generation Insight-Driven Applications

With a number of proven methods and tools for precisely transitioning data into information and then into workable knowledge
on the horizon, the aspect of knowledge engineering is fast picking up. Next, timely and technically captured insights of big
data, obtained by meticulously leveraging standards-compliant analytical platforms, are to be disseminated to software applications
and services to empower them to behave distinctively. The whole idea is to build, host, and deliver people-centric services
through any handheld, portable, implantable, mobile, and wearable devices. Big data, if leveraged appropriately, is going to
be the principal enabler of the proclaimed smarter world.
The Process Steps: Big Data → Big Infrastructure → Big Insights The big data challenges are common for nearly all
businesses, irrespective of the industry: ingest all the data, analyse it as fast as possible, make good sense of it, and
ultimately drive smart decisions to positively affect the business, all as fast as possible. There are different methodologies
being disseminated for timely extraction of all kinds of usable and useful knowledge from the data heap. The most common
activities involved in bringing forth actionable insights out of big data are listed below (a minimal sketch of this pipeline follows the list):
• Aggregate and ingest all kinds of distributed, different, and decentralized data.
• Analyse the cleansed data.
• Articulate the extracted actionable intelligence.
• Act based on the insights delivered and raise the bar for futuristic analytics (real-time, predictive, prescriptive, and personal
analytics) and to accentuate business performance and productivity.
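The following minimal Python sketch simply mirrors the four activities above as composable steps; the data sources, the threshold rule standing in for analytics, and the alerting action are all hypothetical placeholders.

```python
# An illustrative aggregate -> analyse -> articulate -> act pipeline.
def aggregate(sources):
    """Ingest records from distributed, different, and decentralized sources."""
    return [record for source in sources for record in source]

def analyse(records):
    """Analyse the cleansed data; a trivial threshold rule stands in for real analytics."""
    return [r for r in records if r["value"] > 90]

def articulate(insights):
    """Articulate the extracted intelligence in a consumable form."""
    return [f"Sensor {i['sensor']} exceeded threshold: {i['value']}" for i in insights]

def act(messages):
    """Act on the insights; a real system might raise alerts or adjust operations."""
    for message in messages:
        print("ACTION:", message)

sources = [[{"sensor": "s1", "value": 95}], [{"sensor": "s2", "value": 42}]]
act(articulate(analyse(aggregate(sources))))
```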

The Big Data Analytics (BDA) Challenges


Definitely, there will be humongous IT challenges because of the enormity, variability, viscosity, and veracity of big data.
Therefore, big data mandates high-quality IT infrastructures, platforms, DBs, middleware solutions, file systems, tools, etc.

The Infrastructural Challenges

• Compute, storage, and networking elements for data capture, transmission, ingestion, cleansing, storage, pre-processing,
management, and knowledge dissemination
• Clusters, grids, clouds, mainframes, appliances, parallel and supercomputers, etc.
• Expertly integrated and specifically engineered systems for powering up big data requirements efficiently

The Platform Challenges:

We need end-to-end, easy-to-use, and fully integrated platforms for making sense out of big data. Today, there are pieces
for data virtualization, ingestion, analytics, and visualization. However, complete and comprehensive platforms are the need
of the hour in order to accelerate the process of extracting knowledge from big data.
– Analytical, distributed, scalable, and parallel databases
– Enterprise data warehouses (EDWs)
– In-memory systems and grids (SAP HANA, VoltDB, etc.)
– In-database systems (SAS, IBM Netezza, etc.)
– High-performance Hadoop implementations (Cloudera, MapR, Hortonworks, etc.)

The File Systems and Database Challenges:

Having realized the shortcomings of traditional databases for big data, product vendors have come out with a few analytical,
scalable, and parallel SQL databases to take care of the heightened complexity of big data. In addition to that, there are
NoSQL and NewSQL DBs that are more compatible and capable for handling big data. There are new types of parallel and
distributed file system solutions, such as Lustre, with many newly added capabilities for the big data world.
Figure 2.1 clearly illustrates the challenges associated with big data analytics.

Fig. 2.1 The challenging big data world
The High-Performance Computing (HPC) Paradigms

The challenges of knowledge engineering are growing steadily due to several game-changing developments on the data
front. Large data sets are being produced at different speeds and in different structures; there is a heightened complexity of data
relationships; there is a sharp rise in routine and repetitive data; the aspects of data virtualization and visualization
requirements are becoming crucial; etc. HPC is being prescribed as the best way forward, as HPC facilitates businesses to
see exponential performance gains, an increase in productivity and profitability, and the ability to streamline their analytic
processes.

Clearing the Mystery of HPC The performance of software applications varies widely due to several reasons. The
underlying infrastructure is one among them, and hence application designers and developers are always forced to come
out with code that is critically optimized for the execution container and the deployment environment. Now, with clouds
emerging as the core, central, and converged environment for applications, software engineers are tasked with bringing in
the necessary modifications so that legacy software can be seamlessly migrated, configured, and delivered via clouds. This
process is touted as cloud enablement. Applications are now also being built directly in cloud environments and deployed
there too in order to eliminate the standard frictions between development and operational environments. The applications
prepared in this way are cloud-native.
The point is that application performance varies between non-cloud and cloud environments. Applications are
mandated to work to their full potential in whatever environments they are made to run. Several techniques in the form of
automated capacity planning, proven performance engineering and enhancement mechanisms, auto-scaling, performance-
incrementing architectural patterns, dynamic load balancing and storage caching, and CPU bursting are being considered
and incorporated to substantially enhance application performance. It is a well-known and widely recognized truth that
virtualization brings down performance. In order to mitigate the performance degradation introduced by the virtualization
concept, containerization is being recommended with the faster maturity and stability of the open-source Docker technology.
Clarifying the HPC-Associated Terms The terms performance, scalability, and throughput are used in a variety of
ways in the computing field, and hence it is critical to have an unambiguous view and understanding of each of them. When
we talk about performance, it gets associated with IT systems as well as applications. Multiprocessor and multicore architectures
and systems are becoming common.
Application performance can be calculated based on a few characteristics. The user load and processing power
are the deciding factors. That is, how many users an application can handle per second or the number of business
transactions being performed by it per second. System performance is decided based on the number of floating-
point operations per second it can achieve. A high-performance system is one that functions above a teraflop, or 10^12
floating-point operations per second.
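A common back-of-the-envelope way to estimate a system's theoretical peak is to multiply node count, cores per node, clock rate, and floating-point operations per core per cycle; the figures below are illustrative assumptions, not from the text.

```python
# Theoretical peak FLOP/s = nodes x cores/node x clock (Hz) x FLOPs per core per cycle
nodes = 16
cores_per_node = 32
clock_hz = 2.5e9          # 2.5 GHz (assumed)
flops_per_cycle = 16      # e.g., wide vector units with fused multiply-add (assumed)

peak = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak / 1e12:.1f} TFLOP/s")   # ~20.5 TFLOP/s
```

As the next paragraph notes, the practically achieved figure is always well below this theoretical value because of internal and external inefficiencies.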
The performance of an application depends on the application architecture, the application infrastructure, and the
system infrastructure. As we all know, every infrastructure component has a theoretical performance limit. But due to several
internal and external reasons, the performance practically achieved is much below the theoretically quoted one. A network
with a data transfer speed of 1 Gbps does not provide the fullest speed at any point of time due to multiple dependencies
and deficiencies surrounding it. The same story holds for other hardware modules too. Ultimately, the performance of any
application that runs on them suffers ignominiously. The Joyent white paper says that limited bandwidth, disk space,
memory, CPU cycles, and network connections can collectively lead to poor performance. Sometimes the poor performance of
an application is the result of an architecture that does not properly distribute its processes across available system
resources. The prickling challenge is how to empower the system modules technologically to achieve their stated theoretical
performance so that applications can be high performing.
Throughput is the performance number achieved by systems and applications. The effective rate at which data is
transferred from point A to point B is the throughput value. That is, throughput is a measurement of raw speed. While the
speed of moving or processing data can certainly improve system performance, the system is only as fast as its slowest
element. A system that deploys 10-gigabit Ethernet but whose server storage can access data at only one gigabit is
effectively a one-gigabit system.
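A tiny sketch of this "slowest element" observation, with assumed component speeds, follows; the end-to-end throughput is bounded by the slowest stage.

```python
# Effective throughput of a pipeline is limited by its slowest stage (figures in Gbit/s, assumed).
stages = {"network": 10.0, "storage": 1.0, "application": 4.0}
effective = min(stages.values())
bottleneck = min(stages, key=stages.get)
print(f"Effective throughput: {effective} Gbit/s (bottleneck: {bottleneck})")
```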
Scalability is closely synchronized with performance. That is, even with the number of users going up suddenly and
significantly, the same performance metric has to be maintained. That is, with an unplanned appreciation in user load, the
processing performance has to be adequately enhanced to maintain the earlier response time for all the users. The only
way to restore higher effective throughput and performance is through the addition of compatible resources. Auto-scaling
at the system level as well as at the application level is being made mandatory in cloud environments. Leading cloud service
providers (CSPs) ensure auto-scaling, and cloud management platforms such as OpenStack ensure auto-scaling of
resources. For big data analytics, the Sahara project (formerly Savanna) is to ensure automated scaling out as well as scaling
down.
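As a purely illustrative sketch of what such an auto-scaling policy might look like (the thresholds, node limits, and utilization figures are assumptions; real CSP policies are richer), consider:

```python
# A simplified threshold-based auto-scaling decision.
def scaling_decision(cpu_utilization, current_nodes, high=0.80, low=0.30,
                     min_nodes=2, max_nodes=100):
    """Return the desired node count for the observed average CPU utilization."""
    if cpu_utilization > high and current_nodes < max_nodes:
        return current_nodes + 1          # scale out under sustained load
    if cpu_utilization < low and current_nodes > min_nodes:
        return current_nodes - 1          # scale in when resources are underutilized
    return current_nodes                  # otherwise hold steady

print(scaling_decision(0.92, 4))  # -> 5 (scale out)
print(scaling_decision(0.15, 4))  # -> 3 (scale in)
```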
There are several popular computing types emerging exclusively for realizing the goals of high-performance
computing. There are primarily centralized and distributed computing approaches to bring in high-performance computation.
Without an iota of doubt, parallel computing is the most widely used for ensuring high performance and throughput. There
are symmetric multiprocessing (SMP) and massively parallel processing (MPP) solutions in plenty. As the number of domains
yearning for high performance is consistently on the rise, the focus is more intense on HPC technologies these days.
Especially for the forthcoming world of knowledge, the domain of data analytics for extraction of knowledge is bound to play
a very stellar role in shaping up the data-driven world. With data becoming big data, the need for HPC significantly goes
up. In the ensuing sections, we would like to throw more light on different HPC models that are being perceived as more
pragmatic for big data analytics. The Hadoop standard has been widely embraced for its ability to economically store
and analyse large data sets. By exploiting the parallel computing techniques ingrained in MapReduce, the data-
processing framework of Hadoop, it is possible to reduce long computation times to minutes. This works well for mining
large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live and operational
data. Thus, there are concerted efforts to formulate realistic high-performance methods towards real-time and high-
performance big data analytics.

Cluster Computing

Clusters are more visible these days as they ensure scalability, availability, and sustainability in addition to the original goal of higher performance. Besides their cost-effectiveness, the simplicity of bringing up clusters is the other noteworthy point. Technologies for cluster formation, monitoring, measurement, management, and maintenance have already matured and stabilized. The massive acceptance of x86-based clusters is clearly facilitating the uptake of HPC in a big way. The unprecedented success of clusters is due to their simplified architecture, which builds on traditional commoditized servers linked together by powerful interconnects such as Gigabit Ethernet and InfiniBand.
Clusters typically follow the MPP model. The main drawback is that every cluster node has its own memory space,
which needs to be accessed by other nodes via the system interconnect. The unmistakable result is the added complexity
in software development, namely, slicing up an application’s data in such a way as to distribute it across the cluster and
then coordinating computation via internode message passing. Such an architectural approach also necessitates
maintaining an operating system (OS) and associated software stack on each node of the cluster. Clusters are often
complicated to manage at both the system and the application levels.
On the other side, SMP platforms implement a shared-memory model, where the entire address space is uniformly
accessible to all the processors in the system. This is implemented by aggregating multiple processors and associated
memory into a single node using NUMA-aware communication fabric. This means there is no need to carve up application
data and to coordinate distributed computation. Such an arrangement offers a much more natural programming model for
the development community and can implicitly take advantage of existing software packages. Such systems also need only
a single OS instance and software environment, thus simplifying system upgrades and patches. Unfortunately, the SMP model has lost out to the cluster model for several reasons: the custom NUMA fabric requires additional hardware components, and costs run high for the SMP model. Gartner defines a big data cluster as a set of loosely coupled
compute, storage, and network systems, with a few specialized nodes suitable for administration, combined to run a big
data framework. The advantages of big data clusters include:
• Lower acquisition costs and clusters can reuse existing hardware.
• Ease of scaling the cluster and ability to mix high-end hardware for master nodes and commodity hardware for worker
nodes.
• Ability to easily modify hardware configuration to suit the quirks of workloads.
On the negative side, Hadoop stack component availability and version releases vary widely among distributions. Further, depending on the vendor chosen, the support capabilities can differ sharply.

The Virtual SMP Alternative (www.scalemp.com) Clusters are currently very popular, and due to the abovementioned
reasons, SMPs could not do well in the market as expected. However, there is a renewed interest in consolidating their
advantages together to enable businesses substantially. The distinct capabilities of SMPs are being embedded in clusters,
and the reverse is also receiving serious attention. If clusters are made to function like SMPs, then business IT can easily realize the advantages of relatively cheap and scalable hardware with greatly decreased management complexity. On the other hand, if SMPs can also run MPI applications built for distributed-memory architectures without the management overhead of a traditional cluster, there will be a strong pickup for SMPs. Having realized the need for bringing the distinct advantages of clusters into SMPs, ScaleMP has come out with a converged product: its vSMP Foundation (Versatile SMP) product turns a conventional x86 cluster into a shared-memory platform. It does so in software, employing a virtual machine monitor (VMM) that aggregates multiple x86 nodes, I/O, and the system interconnect into a single (virtual) system (Fig. 2.2). The cluster's native interconnect is used in lieu of an SMP's custom network fabric to maintain memory coherency across nodes. The current-generation vSMP
Foundation product is able to aggregate as many as 128 nodes and up to 32,768 CPUs and 256 TB of memory into one
system.

Fig. 2.2 The leverage of a single virtual machine (VM)


vSMP Foundation creates a virtual shared-memory system from a distributed infrastructure providing the best of both worlds
for big data and analytics problems. It allows scaling just by adding “one more node” but still keeps the OPEX advantage
of a shared-memory system. It provides benefits for small Hadoop deployments where the OPEX costs are high and can
handle big data cases where data cannot be easily distributed by providing a shared-memory processing
environment.

The Elevation of Hadoop Clusters Hadoop is an open-source framework for running data-intensive applications in
compute clusters made out of scores of heterogeneous commodity servers. There are several business and technical use
cases for Hadoop clusters. There are advanced mechanisms being incorporated in Hadoop for scalability, availability, security, fault tolerance, etc. Automated scale-out and scale-in are being achieved through additional, externally imposed techniques. For large-scale data crunching, Hadoop clusters are turning out to be indispensable due to their simple architecture. There are better days ahead for Hadoop clusters. Hadoop is designed from the ground up to efficiently
distribute large amounts of processing across a set of machines from a few to over 2000 servers. A small-scale Hadoop
cluster can easily crunch terabytes or even petabytes of data. The steps for efficient data analytics through Hadoop clusters
are given below.

Step 1: The Data Loading and Distribution—the input data is stored in multiple files, and hence the scale of parallelism in a Hadoop job is related to the number of input files. For example, if there are ten input files, then the computation can be distributed across ten nodes. Therefore, the ability to rapidly process large data sets across compute servers is related to the number of files and the speed of the network infrastructure used to distribute the data to the compute nodes. The Hadoop scheduler assigns jobs to the nodes to process the files. As a job is completed, the scheduler assigns another job to the node with the corresponding data. The job's data may reside on local storage or on another node in the network. Nodes remain idle until they receive the data to process. Therefore, planning the data set distribution and a high-speed data center network both contribute to better performance of the Hadoop processing cluster. By design, the Hadoop distributed file system (HDFS) typically holds three or more copies of the same data set across nodes to avoid idle time.
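As a minimal sketch of how this step looks in code (the paths and replication value below are illustrative assumptions), a Hadoop job's parallelism follows from the input files registered with it, and the replication factor for data written by the job can be set through the standard Configuration API:

// InputSetupSketch.java - illustrative only; paths and values are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // keep three copies of each block (the HDFS default)

        Job job = Job.getInstance(conf, "data-loading-sketch");
        // Each file under the input directory typically yields at least one map task,
        // so ten input files allow the map phase to spread across up to ten nodes.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
    }
}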

Steps 2 and 3: Map/Reduce —the first data-processing step applies a mapping function to the data loaded during step 1.
The intermediate output of the mapping process is partitioned using some key, and all data with the same key is next moved
to the same “reducer” node. The final processing step applies a reduce function to the intermediate data; and the output of
the reduce function is stored back on disk. Between the map and reduce operations, the data is shuffled between the nodes.
All outputs of the map function with the same key are moved to the same reducer node. In between these two key steps,
there could be many other tasks such as shuffling, filtering, tunnelling, funnelling, etc.
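To make the map and reduce steps concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API; the word-count problem itself is only an illustrative choice, not an example taken from the text.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit an intermediate <word, 1> pair for every token in the input split.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate <key, value> pair
            }
        }
    }
}

// Reduce: all pairs with the same key arrive at one reducer, which sums the counts.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final <key, value> pair written to disk
    }
}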

Step 4: Consolidation —after the data has been mapped and reduced, it must be merged for output and reporting.
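A driver tying these steps together might look like the sketch below; it reuses the mapper and reducer sketched above, and the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count-sketch");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));   // one part file per reducer

        System.exit(job.waitForCompletion(true) ? 0 : 1);   // block until the job finishes
    }
}

The per-reducer part files written under the output directory are what step 4 then merges for output and reporting.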

Hadoop clusters can scale from hundreds of nodes to over ten thousand nodes to analyse some of the largest data sets in
the world. Hadoop clusters are the most affordable and adept mechanism for high- performance big data storage,
processing, and analysis. It is hence obvious that the Hadoop framework, in sync with well-designed compute clusters, can tackle scores of data-intensive applications through efficient parallelization of data across the cluster. Clusters are bound to be pervasive as enterprise IT is under immense pressure to reduce cost and take business services and solutions to market in double-quick time.

Grid Computing

For high-performance workloads across industry verticals, the current IT systems are found to be ineffective. Another interesting mandate is to reuse existing IT resources to the fullest while preparing additional resources for running high-end applications. There are several domains, such as financial services, manufacturing, life sciences, and technical computing, demanding fresh ways and means for HPC. Grid computing is being positioned as one of the exceptionally powerful HPC paradigms. Grids follow the ideals of distributed computing. The real beauty is not the distributed deployment of servers but the centralized monitoring, measurement, and management of them. Computational grids allow you to seamlessly establish a fruitful linkage among the processors, storage, and/or memory of distributed computers to raise their utilization to a new high and solve large-scale problems more quickly. The benefits of grids include cost savings, improved business agility through decreased time to deliver results, purpose-specific collaboration, and enhanced sharing of resources. Grid computing is an extensible compute environment for ensuring scalability, high availability, and faster processing in a cost-effective manner. "Divide and conquer" has been a success mantra, and here too, grids exploit it to meet the needs of HPC. All kinds of parallelizable workloads and data volumes can benefit from this computing paradigm.
There are business factors and forces that are paving the way for the systematic realization of grid computing
capabilities widely. Today everyone is being assisted by many differently enabled devices collaborating with one another
in understanding our needs and delivering them in time unobtrusively. Devices enable us to be connected with the outside
world. As machines are empowered to communicate with one another in the vicinity as well as with remote systems, the
amount of data generated through the various interactions among machines and people is tremendous. The sheer volume of data is straining existing IT systems. That is, the window of opportunity for capturing the data and turning it into information and knowledge shrinks sharply. Increasingly, industry applications are required to process large volumes of data and to perform repetitive computations that exceed existing server capabilities. In this scenario, grid computing comes as a boon to overcome the brewing data-imposed challenges. The conspicuous benefits include:
• Scalability—long-running applications can be decomposed into manageable execution units, and similarly large data sets can be precisely segmented into data subsets. Both of these can be executed together at the same time to speed up the execution process. As a sufficient number of commodity servers join in the processing, the application decomposition and data division are bound to pay off. Further, the ability to add new servers at runtime ensures fluent scalability.
• User Growth—several users can access a virtualized pool of resources in order to obtain the best possible response time by maximizing utilization of the computing resources.
• Cost Savings—leveraging unutilized and underutilized compute machines in the network is the prime mover for IT cost reduction. Resource sharing is another noteworthy factor in a grid environment.
• Business Agility—grid computing sharply enhances IT agility, which in turn leads to business agility. That is, IT is empowered to respond to business changes and challenges quickly.
• Heightened Automation—with the implementation and incorporation of powerful algorithms in grid environments, the automation of managing grid applications and platforms is elevated to a new level.
For big data analytics, the grid concepts are highly constructive. Grids provide exemplary workload management, job scheduling and prioritization, and the subdivision of analytics jobs for higher productivity. As indicated above, system availability, scalability, and sustainability are fully entrenched through software in grids. Elimination of single points of failure, embedded fault tolerance, etc. are the principal drivers for grid-based big data analysis. Grid computing can parse and partition large-scale analytics jobs into smaller tasks that can be run, in parallel, on smaller and cost-effective servers rather than on high-end and expensive symmetric multiprocessor (SMP) systems, as sketched below.
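The following is a generic sketch of that partition-and-run-in-parallel idea using plain Java concurrency; it illustrates the principle only and is not the API of any grid product mentioned in the text.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedAnalytics {
    // Split a large data set into chunks, process each chunk in parallel, and aggregate.
    public static long countMatches(List<String> records, String keyword, int workers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (records.size() + workers - 1) / workers;
        List<Future<Long>> partials = new ArrayList<>();

        for (int start = 0; start < records.size(); start += chunk) {
            final List<String> slice =
                    records.subList(start, Math.min(start + chunk, records.size()));
            partials.add(pool.submit((Callable<Long>) () ->
                    slice.stream().filter(r -> r.contains(keyword)).count()));
        }

        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();       // aggregate the partial results
        }
        pool.shutdown();
        return total;
    }
}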
In-Memory Data Grid (IMDG) [3] Although Hadoop's parallel architecture can accelerate big data analytics, when it comes to fast-changing data, Hadoop's batch processing and disk overheads are prohibitive. In this section, we explain how real-time and high-performance analytics can be realized by combining an IMDG with an integrated, stand-alone MapReduce execution engine. This new convergence delivers faster results for live data and
also accelerates the analysis of large and static data sets. IMDG provides low-access latency, scalable capacity and
throughput, and integrated high availability. IMDGs automatically store and load balance the data across an elastic cluster
of servers. IMDGs also redundantly store data on multiple servers to ensure high availability just in case a server or network
link fails. An IMDG’s cluster can easily scale up its capacity by adding servers to handle additional workloads dynamically.
IMDGs need flexible storage mechanisms to handle widely differing demands on the data that they store. IMDGs can host complex objects with rich semantics to support features such as property-oriented query, dependencies, timeouts, pessimistic locking, and synchronized access from remote IMDGs. Typically, MapReduce applications are used for crunching large populations of simple objects. Other applications envisage the storage and analytics of huge numbers of very small objects such as sensor data or tweets. To handle these divergent storage requirements and to use memory and network resources efficiently, IMDGs need multiple storage APIs, such as the Named Cache and Named Map APIs. Through these APIs, applications can create, read, update, and delete objects to manage live data. This gives application developers a handle to store and analyze both heavyweight objects with rich metadata and lightweight objects with highly optimized storage with the same ease.
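As a rough illustration of what such a named-cache style of API looks like to an application, consider the sketch below; the NamedCache interface and SensorReading class are hypothetical and are not the actual API of any IMDG product named here.

// A hypothetical, simplified named-cache interface for CRUD access to live objects.
interface NamedCache<K, V> {
    void put(K key, V value);     // create or update
    V get(K key);                 // read
    void remove(K key);           // delete
}

// A lightweight object of the kind an IMDG might hold in huge numbers (illustrative only).
class SensorReading {
    final String sensorId;
    final double value;
    SensorReading(String sensorId, double value) {
        this.sensorId = sensorId;
        this.value = value;
    }
}

class CacheUsageSketch {
    static void example(NamedCache<String, SensorReading> readings) {
        readings.put("sensor-42", new SensorReading("sensor-42", 21.7));   // create
        SensorReading latest = readings.get("sensor-42");                  // read
        readings.put("sensor-42", new SensorReading("sensor-42", 22.1));   // update
        readings.remove("sensor-42");                                      // delete
    }
}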
Operational systems generally process live data, and if IMDGs are integrated with operational systems, then access to in-memory data gets significantly accelerated for special-purpose analysis, providing real-time insights to optimize IT operations and helping to identify any exceptional and risky conditions in time. Integrating a MapReduce engine into an IMDG minimizes the analysis and response times remarkably because it avoids data movement during processing.
Advanced IMDGs demonstrate parallel computing capabilities to overcome many of the MapReduce-introduced limitations
and also enable the semantics of MapReduce to be emulated and optimized. The result is the same with the advantage of
faster delivery. If a MapReduce application is programmed to analyze both live data and historical data, then the same code
can be utilized in the IMDG-based real-time environment as well as in the Hadoop batch environment.
IMDGs for Real-Time Analytics There are a few activities to be carried out towards real-time analytics. The first step is to eliminate the batch-scheduling overhead introduced by Hadoop's standard batch scheduler. Instead, IMDGs can pre-stage a Java-based execution environment on all grid servers and reuse it for multiple analyses. This execution environment consists of a set of Java Virtual Machines (JVMs), one on every server within the cluster alongside each grid service process. These JVMs form the IMDG's MapReduce engine. Also, the IMDG can automatically deploy all necessary executable programs and libraries for the execution of MapReduce across the JVMs, reducing start-up time down to milliseconds.
The next step [3] in reducing the MapReduce analysis time is to eliminate data motion as much as possible. Because an IMDG hosts fast-changing data in memory, MapReduce applications can get data directly from the grid and put results back to the grid. This accelerates data analysis by avoiding delays in accessing and retrieving data from secondary storage. As the execution engine is integrated with the IMDG, key/value pairs hosted within the IMDG can be efficiently read into the execution engine to minimize access time. A special record reader (grid record reader) can be used to automatically pipeline the transfer of key/value pairs from the IMDG's in-memory storage into the mappers. Its input format automatically creates appropriate splits of the specified input key/value collection to avoid any network overhead when retrieving key/value pairs from all grid servers. Likewise, a grid record writer pipelines results from Hadoop's reducers back to the IMDG storage. Thus, IMDGs are turning out to be an excellent tool for accomplishing data analytics and extracting actionable intelligence in time.
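A rough sketch of how such a grid-backed record reader could be shaped on top of Hadoop's RecordReader abstraction is shown below; the in-memory iterator standing in for the IMDG's partitioned storage is purely hypothetical, and a real grid record reader would be considerably more involved.

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of a record reader that streams key/value pairs from an in-memory partition
// instead of reading them from HDFS. The data source is a stand-in, not a real grid API.
public class GridRecordReaderSketch extends RecordReader<Text, Text> {
    private final Iterator<Map.Entry<String, String>> source;   // hypothetical in-memory partition
    private final Text currentKey = new Text();
    private final Text currentValue = new Text();

    public GridRecordReaderSketch(Iterator<Map.Entry<String, String>> source) {
        this.source = source;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // A real implementation would locate the grid partition described by the split here.
    }

    @Override
    public boolean nextKeyValue() {
        if (!source.hasNext()) {
            return false;
        }
        Map.Entry<String, String> entry = source.next();
        currentKey.set(entry.getKey());
        currentValue.set(entry.getValue());
        return true;
    }

    @Override
    public Text getCurrentKey() { return currentKey; }

    @Override
    public Text getCurrentValue() { return currentValue; }

    @Override
    public float getProgress() { return 0.0f; }   // unknown for a streaming in-memory source

    @Override
    public void close() throws IOException { }
}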
In-memory data grids are popular because they address two related challenges:
• Access to big data for real-time use
• Application performance and scale
In-memory data grids present a smart solution to these two challenges by:
• Ensuring that the right data is available in memory with simple access. In-memory data grids promote extremely fast, scalable read-write performance.
• Automatically persisting unused data to a file system or maintaining redundant in-memory nodes to ensure high reliability and fault tolerance.
• Elastically spinning up and down distributed nodes.
• Automatically distributing information across the entire cluster so the grid can grow as your scale and performance needs change.
GridGain is a JVM-based application middleware that enables companies to easily build highly scalable real-time
compute and data-intensive distributed applications that work on any managed infrastructure from a small local cluster to
large hybrid clouds .
To achieve this capability, GridGain provides a middleware solution that integrates two fundamental technologies
into one cohesive product:

• Computational grid
• In-memory data grid

These two technologies are essential for any real-time distributed application as they provide the means for co-located parallelization of processing and data access, and they are the cornerstone capability for enabling scalability under extremely high loads.
Compute Grid Computational grid technology provides the means to distribute processing logic. That is, it enables the parallelization of computations across more than one computer. More specifically, computational grids or MapReduce-type processing define the method of splitting the original computational task into multiple subtasks, executing these subtasks in parallel on any managed infrastructure, and aggregating (reducing) the results back into one final result. GridGain provides the most comprehensive computational grid and MapReduce capabilities.
In-Memory Data Grid This provides the capability to parallelize data storage by storing partitioned data in memory closer to the application. IMDGs allow grids and clouds to be treated as a single virtualized memory bank that smartly partitions data among the participating computers and provides various caching and access strategies. The goal of IMDGs is to provide extremely high availability of data by keeping it in memory and in a highly distributed (i.e., parallelized) fashion.
In summary, it is clear that using an IMDG with an integrated MapReduce engine opens the door to real-time analytics on live and operational data. The IMDG's integrated MapReduce engine also eliminates the need to install, configure, and manage a full Hadoop distribution. Developers can write and run standard MapReduce applications in Java, and these applications can be executed standalone by the execution engine. In a nutshell, in-memory data grids in association with Hadoop engines are capable of producing results in time and efficiently so that informed decisions can be taken with confidence and clarity. The productivity of IT infrastructures goes up considerably with the smart leverage of grids. There are specific use cases emerging and evolving across industry verticals, and they are bound to benefit immensely by utilizing the subtle and sublime concepts of grid computing.

Cloud Computing
We have discussed the roles of clusters and grids in fulfilling the high-performance requirements of big data analytics. In this section, we explain how the cloud paradigm contributes to the exclusive high-performance needs of BDA. As we all know, the cloud idea is very popular due to its immense potential in seamlessly bringing up
infrastructure optimization. Typically, the utilization rate of IT infrastructures stands at 15 %, and hence there have been a
string of endeavors at different levels and layers to sharply enhance the utilization rates of capital-intensive and expensive-
to-operate IT resources. The cloud paradigm is a growing collection of pragmatic techniques and tips such as consolidation,
centralization, virtualization , automation, sharing, etc. of various IT resources (compute machines, storage appliances, and
network solutions) towards well-organized and optimized IT environments. There are competent solutions even for
clustering virtualized machines towards high-performance and high-throughput systems for specific uses.
A variety of enhancements are being worked out to make IT environments cloud-ready. As is prevalent in software engineering, API-backed hardware programmability is being brought in these days in order to automate the activation of hardware elements over any network. That means the remote discoverability, accessibility, maneuverability, manageability, and maintainability of hardware elements are being facilitated towards elevating their usability and utilization levels. Another noteworthy point here is that the intelligence concentrated in hardware components is being segregated and presented as a software layer so that the long-standing goals of IT commoditization and industrialization are fulfilled. The introduction of a software layer greatly simplifies the operation of hardware modules, which means that the configurability, policy-based replaceability, substitutability, etc. of hardware modules through software are to see the light soon. This is the reason we often hear and read about buzzwords such as software-defined infrastructure, networking, and storage these days. In short, a number of game-changing advancements are being rolled out to make IT programmable, converged, and adaptive.
Businesses need automated ways to expand and contract their IT capability as per changing needs. Production-ready appliances and cloud-based software delivery provide flexible answers to this ongoing challenge. However, extending these approaches to address the extreme demands of mission-critical, enterprise-scale applications requires innovation on many levels. Cloud enables ubiquitous and on-demand access to a dynamic and shared pool of highly configurable computing resources such as servers, storage, networking, applications, and services. These resources can be rapidly deployed and redeployed with minimal human intervention to meet changing resource requirements. That is, the IT agility, adaptability, affordability, and autonomy realized through the leverage of cloud concepts have a positive impact on business efficiency.
Scalability in Cloud Environments When cloud as a technology began transforming the IT industry, the buzz was
all about IT infrastructure. Spinning up virtual machines (VMs) on demand was like a social utility (gas, electricity, and
water). The cloud landscape thereafter continued to evolve, and today it is all about data and applications, as the goal of infrastructure on demand is rapidly being realized. Now the ability of cloud service providers to orient their infrastructures to incoming workloads is the new race. As this paradigm continues to expand in different dimensions and directions, today's business buyers are learning that "time to value" is more important than the commodity servers in the cloud.
For the enigmatic cloud idea, there have arisen a number of use, business, and technical cases that are being overwhelmingly illustrated and articulated through a string of white papers, data sheets, case studies, research publications, magazine articles, and keynote talks at various international conferences and confluences. But the scalability attribute definitely stands tall among all of them. When adding more resources in the cloud to restore or improve application performance, administrators can scale either horizontally (out) or vertically (up). Vertical scaling (up) entails adding more resources to the same computing node (e.g., adding more RAM, disk, or virtual CPU to handle an increased application load). On the other hand, horizontal scaling (out) requires the addition of more machines or devices to the computing platform to handle the increased demand.
Big Data Analytics in Clouds With data becoming big data, insights are bound to become big, and hence futuristic applications are definitely going to be big-insight driven. It is hence easy to understand that, in any enterprise, the data volumes being captured and stocked will ultimately impact it tactically and strategically. That is, no industry sector or business domain is to be left out of this data-inspired disruption and transformation. As this trend picks up, we increasingly read about and experience big data applications in the areas of capital markets, risk management, energy, retail, brand and marketing optimization, social media analysis, and customer sentiment analysis. Considering the hugeness of processing and data storage, businesses are yearning for parallel processing capabilities for analysis and for scalable infrastructures capable of adapting quickly to an increment or decrement in computing or storage needs. Therefore, many big data applications are being readied to be cloud enabled and deployed in cloud environments so as to innately avail all the originally envisaged cloud characteristics such as agility, adaptability, and affordability.
High-Performance Cloud Environments There is a common perception that virtualized environments are not suitable for high-performance applications. However, cloud infrastructures are increasingly a mix of both virtualized and bare-metal servers. Therefore, for meeting the unique requirements of HPC applications, bare-metal systems are being utilized. Even VMware has conducted a series of tests in order to assess the suitability of virtualized environments for big data analytics, and the results are very encouraging as per the report published on the VMware website. The real beauty of cloud environments is auto-scaling. Besides scaling up and down, scaling out and in is the key differentiator of clouds. Automatically adding new resources and taking away allocated resources as per changing needs puts the cloud on top for cost-effective HPC. Any parallelizable workload is tackled efficiently in clouds.
There are parallel file systems; scale-out storage; SQL, NoSQL, and NewSQL databases; etc. to position cloud infrastructures as the next-generation and affordable HPC solution. For example, multiple computer-aided engineering (CAE) loads can be processed faster in an environment that is able to scale to meet demand, which makes the cloud efficient, flexible, and collaborative. By applying the proven cloud principles to established HPC and analytics infrastructures, silos simply vanish, and shared resources can be leveraged to maximize the operational efficiency of existing clusters. The meticulous transition to the promising cloud paradigm can help in many ways. Due to extreme and deeper automation in cloud environments, there is optimized usage of resources for achieving more powerful and purposeful computations.
Cloud Platforms for HPC Not only software-defined infrastructures but also the hosting of pioneering HPC platforms in cloud environments enables clouds to be positioned as viable and venerable HPC environments. In the recent past, a number of parallelizing platforms for real-time computing have emerged. The prominent platforms among them are IBM Netezza, SAP HANA, and SAS High-Performance Analytics. An instance of IBM Netezza was deployed in a public cloud environment (IBM SoftLayer) and tested to gain an insight into how it functions in a cloud environment. The data-processing speed was found to be excellent, and it has been deduced that a seamless synchronization of HPC platforms with cloud infrastructures ensures the required high-performance goals.
Similarly, SAP and Intel came together to verify how their products work together in a cloud environment. Their engineering teams have deployed SAP HANA in a new Petabyte Cloud Lab that provides 8000 threads, 4000 cores, and 100 TB of RAM in a server farm consisting of 100 four-socket Intel Xeon processor E7 family-based servers. The cluster currently hosts a single instance of SAP HANA, and engineers continue to see near-linear scalability operating across a
petabyte of data.
SAS is using Amazon Web Services (AWS) to help organizations improve business functions by creating flexible, scalable, and analytics-driven cloud applications. This marks a critical step forward in helping organizations execute their big data and Hadoop initiatives by applying advanced analytic capabilities in the cloud with its products. This not only reduces costs considerably but also benefits customers immensely and immediately, because the cloud migration enables critical decisions to be taken quickly by analyzing data from anywhere at any time in an efficient fashion. It is an irrefutable truth that the prime factors driving businesses towards the cloud paradigm include faster access to new functionality, reduced capital IT costs, and improved use of existing resources.
Without an iota of doubt, the cloud is the happening place for all kinds of technical advancements as far as IT optimization, rationalization, simplification, standardization, and automation tasks are concerned. With the convergence of several technologies, clouds are being positioned as the next-generation affordable supercomputer for comprehensively solving the storage, processing, and analytical challenges thrown up by big data. The availability of massively parallel, hybrid, and application-specific computing resources accessed through a loosely coupled, distributed, and cloud-
based infrastructure brings forth a bevy of fresh opportunities for a range of complex applications dealing with large data
sets.

Heterogeneous Computing

Heterogeneity is becoming a common affair in IT these days due to the multiplicity of competing technologies and tools. Therefore, it makes sense to have a new model of heterogeneous computing to run heterogeneity-filled heavy workloads. Recently, heterogeneous computing has been gaining traction in a number of domains. Heterogeneous computing, a viable mechanism to realize the goals of accelerated computing, refers to systems that use a variety of different computational units such as general-purpose processors and special-purpose processors (digital signal processors (DSPs), graphics processing units (GPUs), and application-specific circuits often implemented on Field-Programmable Gate Arrays (FPGAs)). The GPU is a many-core machine with multiple SIMD multiprocessors (SMs) that can run thousands of concurrent threads. The application-specific integrated circuit (ASIC) is another application-specific circuit; for example, a chip designed to run in a digital voice recorder or a high-efficiency bitcoin miner is an ASIC. Accelerators are the paramount solution approach these days to substantially speed up specific computations under some circumstances. Some applications have specific algorithms which can be offloaded from general-purpose CPUs to specialized hardware which can accelerate those parts of the application. Mainly, GPUs are the leading force in sustaining the ideals of heterogeneous computing.
Why GPU Clusters? This is the era of multicore computing, as the performance of single-core CPUs has stagnated. Therefore, with greater awareness, the adoption of GPUs has skyrocketed. Due to a series of advancements in micro- and nanoscale electronics, the value and power of GPUs have increased dramatically, with a broad variety of applications demonstrating order-of-magnitude gains in both performance and price-performance. GPUs particularly excel at the throughput-oriented workloads that characterize data- and compute-intensive applications.
However, programmers and scientists focus most of their efforts on single-GPU development. The leverage of GPU clusters to tackle problems of true scale is very minimal, as programming multi-GPU clusters is not easy due to the lack of powerful tools and APIs. As with MapReduce, parallel processing of data is handled well by GPUs. But existing GPU MapReduce work targets solo GPUs and only in-core algorithms. Implementing MapReduce on a cluster of GPUs definitely poses a few challenges. Firstly, multi-GPU communication is difficult as GPUs cannot source or sink network I/O, and hence supporting dynamic and efficient communication across many GPUs is hard. Secondly, GPUs do not have inherent out-of-core support and virtual memory. Thirdly, a naive GPU MapReduce implementation abstracts away the computational resources of the GPU and possible optimizations. Finally, the MapReduce model does not explicitly handle the system architecture inherent to GPUs. Armed with the knowledge of these critical restrictions, the authors [4] have come out with a well-defined library, "GPU MapReduce (GPMR)," that specifically overcomes them.
Heterogeneous Computing for Big Data Analytics In this section, we discuss how this newly introduced computing model paves the way for high-performance BDA. Many researchers across the globe have done considerable work to improve the performance of MapReduce by efficiently utilizing CPU cores, GPU cores, and multiple GPUs. However, these novel MapReduce frameworks do not aim to efficiently utilize heterogeneous processors, such as a mix of CPUs and GPUs.
Moim: A Multi-GPU MapReduce Framework [5] As we all know, MapReduce is a renowned parallel-programming model that considerably decreases the developmental complexity of next-generation big data applications. The extra simplicity originates from the fact that developers need to write only two different functions (map and reduce). The map function specifies how to convert input <key, value> pairs into intermediate <key, value> pairs, and the reduce function receives the intermediate pairs from the map function as its input and reduces them into final <key, value> pairs. The MapReduce runtime innately takes care of data partitioning, scheduling, and fault tolerance in a transparent manner.
However, MapReduce has several limitations. Although it is carefully designed to leverage internode parallelism in a cluster
of commodity servers, it is not designed to exploit the intra-node parallelism provided by heterogeneous parallel processors
such as multicore CPUs and GPUs. There are other concerns too. In MapReduce, the intermediate pairs of a job are shuffled to one or more reducers via a hashing mechanism based on the keys. Unfortunately, this approach may result in serious load imbalance among the reducers of a job when the distribution of keys is highly skewed. As the speed of a parallel job consisting of smaller tasks is determined by the slowest task in the chain, a larger degree of load imbalance may translate into a longer delay.
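To see where the skew comes from, consider how Hadoop's default hash partitioner assigns an intermediate key to a reducer: every occurrence of the same key lands on the same reducer, so a single very frequent key overloads one reducer while others sit idle. The sketch below mirrors that default logic; the key values are illustrative only.

// HashPartitionSketch.java - mirrors the logic of Hadoop's default HashPartitioner.
public class HashPartitionSketch {
    // Partition index = non-negative hash of the key, modulo the number of reducers.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"alpha", "beta", "gamma", "alpha", "alpha"};   // skewed toward "alpha"
        int reducers = 3;
        for (String k : keys) {
            // Every occurrence of "alpha" goes to the same reducer, however many there are.
            System.out.println(k + " -> reducer " + partitionFor(k, reducers));
        }
    }
}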
To tackle these challenges, the authors [5] have designed a new MapReduce framework, called Moim, which, while overcoming the abovementioned drawbacks, provides a number of new features in order to increase the data-processing efficiency of MapReduce:
• Moim efficiently utilizes the parallelism provided by both multicore CPUs and GPUs.
• It overlaps CPU and GPU computations as much as possible to decrease the end-to-end delay.
• It supports efficient load balancing among the reducers as well as the mappers of a MapReduce job.
• The overall system is designed to process not only fixed-size but also variable-size data.
Heterogeneous Computing in Cloud We have discussed the impending role of clouds as the futuristic and flexible HPC environment. Now, as chipsets and other accelerator solutions that fully comply with the specifications of heterogeneous computing are hitting the market, the days of full-fledged heterogeneous computing are not far off. Another path-breaking development is that the combination of heterogeneous computing and cloud computing is emerging as a powerful new paradigm to meet the requirements for HPC and higher data-processing throughput. Cloud-based heterogeneous computing represents a significant step in fulfilling the advancing HPC needs. "The Intel Xeon Phi coprocessor represents a breakthrough in heterogeneous computing by delivering exceptional throughput and energy efficiency without the high costs, inflexibility, and programming challenges that have plagued many previous approaches to heterogeneous computing."
Cloud-Based Heterogeneous Computing via Nimbix Cloud Nimbix launched the world's first Accelerated Compute Cloud infrastructure (a non-virtualized cloud that featured the latest co-processors for advanced processing), and the focus has always been on time to value. Thereafter, they introduced JARVICE (Just Applications Running Vigorously in a Cloud Environment), the central platform technology developed in order to run applications faster at the lowest total cost. One of the benefits of running a big data application in the Nimbix Cloud is that it automatically leverages powerful supercomputing-class GPUs underneath. This can comfortably speed up rendering by tens or even hundreds of times, depending on the model, versus even very powerful desktops and laptops. The NVIDIA Tesla GPUs support computation, not just visualization, and are much more powerful than most graphics processors available on PCs. Thus, there are providers working on delivering high-performance heterogeneous computing from cloud environments.
Several companies are attracted towards heterogeneous computing. The OpenPOWER Foundation consortium was initiated by IBM and is supported by many other organizations to make the Power architecture pervasive and persuasive. This is for bringing extreme optimization into next-generation compute systems to comfortably succeed on compute-intensive workloads and for lessening the workload of application developers. It was announced that future versions of IBM Power Systems will feature NVIDIA NVLink technology, eliminating the need to transfer data between the CPU and GPUs over the PCI Express interface. This will enable NVIDIA GPUs to access IBM POWER CPU memory at its full bandwidth, improving performance for numerous enterprise applications. Power Systems are at the forefront of delivering solutions to gain faster insights from analysing both structured information and unstructured big data, such as video, images, and content from sensors, and data from social networks and mobile devices. To draw insights and make better decisions, businesses need solutions that consist of proprietary and open system software to solve their specific pain points.
To drive those solutions, secure and flexible Power Systems servers are designed to keep data moving by running multiple concurrent queries that take advantage of industry-leading memory and I/O bandwidths. All of this leads to high utilization rates.

Mainframes for High-Performance Computing

The essential architecture of a mainframe system, with its ready-made network of specialized devices, centrally managed and organized, delivers the performance and scalability required for BDA workloads. The reliability of the mainframe, at a level that distributed systems still cannot match and a result of decades of development and refinement, makes it the ideal platform for mission-critical workloads. The mainframe system virtualizes its physical resources as part of its native operation.
Physically, the mainframe is not a single computer but a network of computing components including a central
processor with main memory, with channels that manage networks of storage and peripheral devices. The operating system
uses symbolic names to enable users to dynamically deploy and redeploy virtual machines, disk volumes, and other
resources, making the shared use of common physical resources among many projects a straightforward proposition.
Multiple such systems may be blended together.
Mainframe computers still rule the IT division of many organizations across the world. The mainframe is the
undisputed king of transactional data, and anywhere from 60 to 80 % of all the world’s transactional data is said to reside
on the mainframe. Relational data represents only a fraction of all the data held on the mainframe. Other prominent data assets are stored in record-oriented file management systems such as VSAM that predate the advent of RDBMS. XML data is another kind of data being produced, captured, and stored in large quantities. And in the recent past, a huge and
largely untapped source of data comes in the form of unstructured or semi-structured data from multiple internal as well as
external sources. This non-mainframe data is growing at exponential rates. Consider social media data.
Twitter alone generates 12 terabytes of tweets every day. It is not desirable or practical to move such volumes of
non-mainframe data into the mainframe for analysis. But there are tactical and strategic use cases for multi-structured data.
For example, it would be very useful to comb through social media for information that can augment the traditional analytics
that you already do against your relational mainframe data.
IBM’s InfoSphere BigInsights, the commercial-grade implementation of the Hadoop standard, running on a Power
or IBM System x server is certainly well suited to ingest and process this sort of poly-structured data. Connectors available
with DB2 for z/OS enable a DB2 process to initiate a BigInsights analytics job against the remote Hadoop cluster and then
ingest the result set back into a relational database or traditional data warehouse to augment the mainframe data.
Mainframe data never leaves the mainframe and is simply augmented and enhanced by data ingested from other sources.
Veristorm [6] has delivered a commercial distribution of Hadoop for the mainframe. The Hadoop distribution
coupled with state-of-the-art data connector technology makes z/OS data available for processing using the Hadoop
paradigm without that data ever having to leave the mainframe. Because the entire solution runs in Linux on System z, it
can be deployed to low-cost and dedicated mainframe Linux processors. Further on, by taking advantage of the mainframe’s
ability to activate additional capacity on demand as needed, vStorm Enterprise can be used to build out highly scalable
private clouds for BDA. vStorm Enterprise includes zDoop, a fully supported implementation of the open-source Apache
Hadoop. zDoop delivers Hive for developers who prefer to draw on their SQL background and Pig for a more procedural
approach to building applications.
In the mainframe environment, users can integrate data held in Hadoop with various NoSQL, DB2, and IMS databases in a common environment and analyze that data using analytic mainframe software such as IBM Cognos and SPSS, ILOG, and IBM InfoSphere Warehouse. Mainframe users can take advantage of such factory-integrated capabilities as:
• The IBM DB2 Analytics Accelerator for z/OS, which is based on IBM Netezza technology, substantially accelerates queries by transparently offloading certain queries to the massively parallel architecture of the Accelerator appliance. The DB2 for z/OS code recognizes that the Accelerator is installed and automatically routes queries that would benefit from this architecture to the appliance. No application changes are required.
• IBM PureData System for Hadoop is a purpose-built, standards-based system that architecturally integrates IBM
InfoSphere BigInsights Hadoop-based software, server, and storage into a single system.
• IBM zEnterprise Analytics System (ISAS) 9700/9710 is a mainframe-based, high-performance, integrated software and hardware platform with broad business analytic capabilities to support data warehousing, query, reporting, multidimensional analysis, and data and text mining.
• The ability to integrate real-time transactional analytics scoring in DB2 for z/OS allows for efficient scoring of predictive models within the milliseconds of a transaction by integrating IBM SPSS Modeler Scoring within IBM DB2 for z/OS.
In summary, IBM Big Data Analytics on zEnterprise provides a truly modern and cost-competitive analytics infrastructure with an extensive and integrated set of offerings that are primed for delivering on today's business-critical analytics and big data initiatives across all of your data sources.

Supercomputing for Big Data Analytics

Cray is bringing in an integrated open-source Hadoop big data analytics software to its supercomputing platforms.
Cray cluster supercomputers for Hadoop will pair Cray CS300 systems with the Intel distribution for Apache Hadoop
software. The Hadoop system will include a Linux operating system, workload management software, the Cray Advanced
Cluster Engine (ACE) management software, and the Intel distribution. Thus, BDA has successfully penetrated the supercomputing domain too. Other corporations are not lagging behind in synchronizing big data with their powerful infrastructures. Fujitsu brings out the high-performance processors in the Fujitsu M10 enterprise server family to help organizations meet their everyday challenges. What once was reserved for data-intensive scientific computing is now made available for meeting the distinct challenges of mission-critical business computing, especially BDA. From high-throughput connectivity to data sources to high-speed data movement and high-performance processing units, Fujitsu Japan has been at the forefront of serving data-driven insights towards realizing and deploying mission-critical intelligent systems.
IBM wants to unlock big data secrets for businesses with a new free Watson Analytics tool. This offering leverages
IBM’s Watson technology and allows businesses to upload data to IBM’s Watson Analytics cloud service and then query
and explore results to spot trends and patterns and conduct predictive analysis. Watson Analytics parses data, cleans it up,
preps it for analysis, identifies important trends, and makes it easily searchable via natural language queries. The tool
could help companies better understand customer behavior or connect the dots between sales, weather, time of day, and
customer demographic data.
Thus, big data analytics has become an important trend drawing the attention of many. Product innovators and vendors focusing on supercomputing, cognitive, and parallel computing models are consciously redrawing their strategies and tweaking their solutions and services in order to bring forth quantifiable value for the challenging proposition of BDA.

Appliances for Big Data Analytics

Undoubtedly, appliances represent next-generation IT delivery. They are purpose-built, pre-integrated devices that club together right-sized hardware modules and the relevant software libraries to run specific workloads quickly, easily, and effectively. This is turning out to be a cheaper and viable option for small and medium enterprises. Certain applications produce better results in the appliance mode, and appliances rapidly and reliably deliver better returns on the investment made, whereas the total cost of ownership (TCO) remains on the lower side. Due to the smart bundling, getting appliances up and running is very fast thanks to the automated configuration facility, and the functioning is smooth. This is a kind of accelerated computing, and human intervention, instruction, and involvement are very minimal. The powerful emergence of appliances is a significant part of the series of innovations in the IT field in order to streamline and sharpen IT delivery. IT productivity is bound to go up significantly with the incorporation of appliances in enterprise IT environments and cloud centers.
Bundling has been a cool concept in IT, and in the recent past, containerization, which is a kind of bundling of all the relevant components together to automate and accelerate IT, has received a solid boost with the simplified engine introduced by the Docker initiative. Appliances are hence bound to score well in this competitive environment and acquire greater space in the world market in the days ahead. Appliances are much more varied and vast in the sense that multiple varieties are emerging as per the changing requirements. Appliances can be virtual appliances too. That is, virtual or software appliances can easily be installed on specified hardware. This segregation enhances appliance flexibility, and different hardware vendors can easily move into the world of appliances. The result is the pervasive and persuasive nature of appliances. In the ensuing sections, we discuss the enhanced role of appliances for different purposes as listed below:
• Data warehouse appliances for large-scale data analytics
• In-memory data analytics
• In-database data analytics
• Hadoop-based data analytics

Data Warehouse Appliances for Large-Scale Data Analytics

The growing number of data sources and the resulting rise in data volumes simply strain and stress traditional IT
platforms and infrastructures. The legacy methods for data management and analytics are turning out to be inadequate for
tackling the fresh challenges posed by big data . The data storage, management, processing, and mining are the real pains
for conventional IT environments. The existing systems, if leveraged to crunch big data, require huge outlays of technical
resources in a losing battle to keep pace with the demands for timely and actionable insights. Many product vendors have
come out with appliances to accelerate the process of data analytics in an easy manner.
IBM PureData System for Analytics Revolution Analytics® and IBM have teamed together to enable
organizations to incorporate R as a vital part of their big data analytics strategy. With multiple deployment options,
organizations can simply and cost-effectively optimize key steps in the analytic process (data distillation, model development, and deployment), maximize performance at scale, and gain efficiencies.
PureData System for Analytics, powered by Netezza technology, architecturally integrates database, server, and
storage into a single, purpose-built, easy-to-manage system that minimizes data movement, thereby accelerating the
processing of analytical data, analytics modeling, and data scoring. It delivers exceptional performance on large-scale data
(multiple petabytes), while leveraging the latest innovations in analytics. Revolution R Enterprise “plugs-in” to IBM Netezza
Analytics, a built-in analytical infrastructure. With IBM Netezza Analytics, all analytic activity is consolidated into a single
appliance. PureData System for Analytics delivers integrated components to provide exceptional performance, with no
indexing or tuning required. As an appliance, the hardware, software (including IBM Netezza Analytics), and storage are
completely and exceptionally integrated, leading to shorter deployment cycles and faster time to value for business
analytics.
EMC Greenplum Appliance We know that business intelligence (BI) and analytical workloads are fundamentally
different from online transaction processing (OLTP) workloads, and therefore we require a profoundly different architecture
for enabling online analytical processing (OLAP). Generally, OLTP workloads require quick access and updates to a small
set of records, and this work is typically performed in a localized area on disk with one or a small number of parallel units.
Shared-everything architectures, wherein processors share a single large disk and memory, are well suited for
OLTP workloads. However, shared-everything and shared-disk architectures are quickly overwhelmed by the full-table
scans, multiple complex table joins, sorting, and aggregation operations against vast volumes of data that represent the
lion’s share of BI and analytical workloads.
EMC Greenplum is being presented as the next generation of data warehousing and large-scale analytic
processing. EMC offers a new and disruptive economic model for large-scale analytics that allows customers to build
warehouses that harness low-cost commodity servers, storage, and networking to economically scale to petabytes of data.
Greenplum makes it easy to expand and leverage the parallelism of hundreds or thousands of cores across an ever-growing
pool of machines. The Greenplum’s massively parallel and shared-nothing architecture fully utilizes every single core with
linear scalability and unmatched processing performance. Supporting SQL and MapReduce parallel processing, the
Greenplum database offers industry-leading performance at a low cost for companies managing terabytes to petabytes of
data.
Hitachi Unified Compute Platform (UCP) The ability to store data in memory is crucial to move businesses from traditional business intelligence to business advantage through big data. SAP High-Performance Analytic Appliance (HANA) is a prominent platform for in-memory and real-time analytics. It lets you analyze business operations in real time based on large volumes of structured data. The platform can be deployed as an appliance or delivered as a service via the cloud. The convergence of Hitachi UCP and SAP HANA has resulted in a high-performance big data appliance that helps you accelerate adoption and achieve faster time to value.
Oracle SuperCluster is an integrated server, storage, networking, and software system that provides maximum end-to-end database and application performance with minimal initial and ongoing support and maintenance effort and complexity at the lowest total cost of ownership. Oracle SuperCluster incorporates high-speed on-chip encryption engines for data security; low-latency QDR InfiniBand or 10 GbE networking for connection to the application infrastructure; integrated compute server, network, and storage virtualization through Oracle Solaris Zones; and the mission-critical Oracle Solaris operating system. Oracle SuperCluster provides unique database, data warehouse, and OLTP performance and storage efficiency enhancements and unique middleware and application performance enhancements, is engineered and pre-integrated for easy deployment, and minimizes overhead, backed by the most aggressive support SLAs in the industry. The large 4-terabyte memory footprint of SuperCluster T5-8 allows many applications to run entirely in memory. SuperCluster M6-32 allows even greater memory scaling with up to 32 terabytes in a single configuration. Running Oracle in-memory applications on Oracle SuperCluster provides significant application performance benefits.
SAS High-Performance Analytics (HPA) SAS HPA is a momentous step forward in the area of high-speed analytic processing in a scalable clustered environment. Teradata's innovative Unified Data Architecture (UDA) represents a significant improvement for meeting all kinds of recent challenges posed by big data.
The UDA provides three distinct, purpose-built data management platforms, each integrated with the others and designed for specialized needs:
• Enterprise Data Warehouse: the Teradata database is the market-leading platform for delivering strategic and operational analytics throughout your organization, so users from across the company can access a single source of consistent, centralized, and integrated data.
• Teradata Discovery Platform: Aster SQL-MapReduce delivers data discovery through iterative analytics against both structured and complex multi-structured data, to the broad majority of your business users. Prepackaged analytics allow businesses to quickly start a data-driven discovery model that can provide analytic lift to the SAS Analytics Platform.
• Data Capture and Staging Platform: Teradata uses Hortonworks Hadoop, an open-source Hadoop distribution, to support highly flexible data capture and staging. Teradata has integrated Hortonworks with robust tools for system management, data access, and one-stop support for all Teradata products. Hadoop provides low-cost storage and preprocessing of large volumes of data, both structured and file based.
The SAS HPA software platform, working in conjunction with the Teradata UDA infrastructure, provides business users, analysts, and data scientists with the required capabilities and competencies to fulfill their analytic needs. The SAS in-memory architecture has been a significant step forward in the area of high-speed analytic processing for big data. One of the major themes of new SAS offerings over the last couple of years has been high-performance analytics, largely using in-memory clustered technology to provide very fast analytic services for very large data sets. SAS Visual Analytics has also made significant strides towards affordable analytic processing using largely the same in-memory architecture.
The key to SAS HPA and SAS Visual Analytics (VA) is clustered, massively parallel processing (MPP), and this proven and promising model enables SAS deployments to scale cluster size to support larger data, higher user concurrency, and greater parallel processing. However, the most significant benefit for SAS users is the blazing speed: both environments use high-speed, memory-centric techniques to achieve extremely fast analytic processing. For example, for one Teradata customer, SAS HPA reduced the processing time of an analytical requirement from 16 hours to 83 seconds, a dramatic improvement in speed. The biggest impact for that customer was that it enabled their users to "experiment more," try more advanced model development techniques, and embrace the mantra of "failing faster." That is, if it only takes a couple of minutes to fail, then people will attempt more trials and errors towards producing better models.
For SAS VA users, it is the same story: one-billion-row analytical data sets can be pulled up and visualized in a couple of seconds. Users can apply different analytic techniques, slice and filter their data, and update different visualization techniques, all with near-instantaneous response time. The SAS in-memory architecture comes in two data management flavors. The first utilizes an MPP database management platform for storage of all of the data, and the second utilizes a Hadoop file system cluster for persistence. Both work fast and can scale, especially when working with advanced clustered MPP-style database models. However, for many complex organizations undertaking large-scale big data projects, neither is a "one-size-fits-all" solution.
The most economical model utilizes Hadoop, running HPA directly against data scattered across the nodes of a distributed file system. With this model, relatively low-cost servers can be used to support the Hadoop workers, connected together by high-speed networking, with processing managed via MapReduce. This distributed process gives many of the same benefits as the MPP-model databases, but with fewer bells and whistles, at a lower cost. The Teradata appliance for SAS HPA extends the analytic capabilities of the Teradata environment, bringing the SAS in-memory architecture directly onto the Teradata environment. With the HPA Appliance, Teradata extends any UDA data platform with new HPA capabilities.
The Aster Big Analytics Appliance The Aster Big Analytics Appliance is a powerful and ready-to-run platform that is pre-configured and optimized specifically for big data storage and analysis. As a purpose-built, integrated hardware and software solution for analytics at big data scale, this appliance runs the Aster SQL-MapReduce and SQL-H technology on a time-tested and fully supported Teradata hardware platform. Depending on workload needs, it can be configured exclusively with Aster nodes, Hadoop nodes, or a mixture of both Aster and Hadoop nodes. Additionally, integrated backup nodes for Aster nodes are available for data protection. By minimizing the number of moving parts required for deployment, the appliance offers easy and integrated management of an enterprise-ready information discovery solution with the benefits of optimized performance, continuous availability, and linear scalability. The appliance comes with Aster Database, which features more than 80 prepackaged SQL-MapReduce functions to enable faster insights. The SQL-MapReduce framework allows developers to write powerful and highly expressive SQL-MapReduce functions in various programming languages such as Java, C#, Python, C++, and R and push them into the discovery platform for advanced in-database analytics. Business analysts can then invoke SQL-MapReduce functions using standard SQL through Aster Database, and the discovery platform allows applications to be fully embedded within the database engine to enable ultrafast, deep analysis of massive data sets.
In-Memory Big Data Analytics
What is needed is a highly productive environment that enables the data analyst to carry out analysis swiftly and to implement the discovered knowledge quickly, so that it reaches the individual or software application that can use it just in time or sooner. Even so, there are three different latencies that an organization will experience with the new BI:
• The Time to Discovery: the time it takes for a data scientist to explore a collection of data and discover useful knowledge in it
• The Time to Deployment: the time it takes to implement the discovered knowledge within the business processes that it can enrich
• Knowledge Delivery Time: the time it takes for the BI application to deliver its knowledge in real time
Organizations are constantly seeking better ways to make informed decisions based on trustworthy insights extracted by deeper analytics on huge volumes of data. However, the worrying point is that the amount of data we have to analyze is expanding exponentially. Social and sentiment data (Facebook), weblogs, people profiles (LinkedIn), opinionated tweets (Twitter), actuator and sensor data, business transactional data, lab data, biological information, etc. are pumping out tremendous amounts of data to be systematically captured and subjected to a variety of analytical drills. At the same time, the pressure to make better, fact-based decisions faster has never been greater.
Salient MPP ( http://www.salient.com/ ) is a highly scalable, in-memory, multidimensional analytical data platform that overcomes traditional limitations of speed, granularity, simplicity, and flexibility of use. When combined with Salient's discovery visualization user interface, it provides an overall analytical solution preferred by executives, analysts, and basic users for performing simple through complex analytics much faster than previously possible.
The GridGain In-Memory Data Fabric is a proven software solution which delivers unprecedented speed and unlimited scale to accelerate the process of extracting timely insights. It enables high-performance transactions and real-time processing through a unified API that spans all key types of applications (Java, .NET, C++) and connects them with multiple data stores containing structured, semi-structured, and unstructured data (SQL, NoSQL, Hadoop). Figure 2.3 depicts the GridGain In-Memory Data Fabric.

Fig. 2.3 The GridGain in-memory data fabric
Keeping data in random access memory (RAM) allows a system to process data hundreds of times faster than by electromechanical input/output (processor-to-disk) operations. Through advanced data compression techniques, MPP can handle very large volumes and, at the same time, take advantage of the speed of in-memory processing. This speed advantage is enhanced by Salient's proprietary n-dimensional GRID indexing scheme, which enables a processor to go through only that portion of data most relevant to the specific query. MPP also takes full advantage of both multi-threading platforms and multiprocessor machines to accommodate very large numbers of concurrent user queries without performance degradation. Increasing the number of processors scales the number of concurrent users in near-linear fashion.
What Is In-Memory Streaming? Stream processing fits a large family of applications for which traditional processing methods and disk-based storage such as databases or file systems fall short. Such applications are pushing the limits of traditional data-processing infrastructures. Processing of market feeds, electronic trading by many financial companies on Wall Street, security and fraud detection, and military data analysis all produce large amounts of data at very fast rates and require an infrastructure capable of processing data in real time without any bottlenecks. Apache Storm is mainly focused on providing event workflow and routing functionality without focusing on sliding windows or data querying capabilities. Products in the CEP family, on the other hand, are mostly focused on providing extensive capabilities for querying and aggregating streaming events, while generally neglecting event workflow functionality. Customers looking for a real-time streaming solution usually require rich event workflow combined with CEP-style data querying.
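The window-based querying that CEP products provide can be illustrated with a small sketch. The code below computes a 60-second sliding-window count over a simulated event stream in plain Python; the event layout and the window length are assumptions chosen for illustration, not the API of Storm or of any particular CEP engine.

```python
# Minimal sliding-window aggregation over a stream of (timestamp, symbol, price) events.
# A real CEP engine would express this as a continuous query; here a deque holds the live window.
from collections import deque

WINDOW_SECONDS = 60  # illustrative window length

def sliding_window_counts(events):
    """Yield (timestamp, number_of_events_in_the_last_window) for each arriving event."""
    window = deque()
    for ts, symbol, price in events:
        window.append((ts, symbol, price))
        # Evict events that have fallen out of the window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            window.popleft()
        yield ts, len(window)

if __name__ == "__main__":
    feed = [(t, "ACME", 100 + (t % 7)) for t in range(0, 300, 10)]  # simulated market feed
    for ts, count in sliding_window_counts(feed):
        print(f"t={ts:3d}s events_in_window={count}")
```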
It is true that keeping data in memory removes the bottleneck of reading data for large queries from disk. But the structure of the data in memory is equally important: to perform and scale well, the data structure needs to be designed for analytics. The Birst in-memory database ( http://www.birst.com/ ) uses columnar data storage. Every column is fully indexed for rapid lookup and aggregation. A highly parallel architecture means that performance scales with the addition of more processing cores. Birst dynamically indexes based on the context (join, filter, and sort), thereby maximizing performance and minimizing memory footprint. Birst uses hash maps when dealing with arbitrary, sparse data. Bitmaps are used for more structured and dense data. Rowset lists come into play when sorting performance and efficient memory usage are paramount.
In-Database Processing of Big Data
This is the ability to perform analytical computations in the database management system where the data resides
rather than in an application server or desktop program. It accelerates enterprise analytics' performance, data
manageability, and scalability. In-database processing is ideal for big data analytics, where the sheer volume of the data
involved makes it impractical to repetitively copy it over the network.
By leveraging in-database processing, analytics users harness the power of the database platform, designed specifically for highly efficient data access methods, even with enormous data sets consisting of millions or billions of rows. SAS enables in-database processing for a set of core statistical and analytical functions and model scoring capabilities within Teradata, leveraging the MPP architecture for scalability and performance of analytic computations. These capabilities allow analytic computations to be run in parallel across potentially hundreds or thousands of processors. Parallel execution greatly accelerates the processing time for analytics calculations, providing very significant performance gains for faster results.
The IBM Netezza appliance is aimed predominantly at simplifying and accelerating the analytics field: an embedded, purpose-built, advanced analytics platform delivered with every IBM Netezza appliance empowers enterprises to meet and exceed their business demands. This analytical solution fuses data warehousing and in-database analytics into a scalable, high-performance, massively parallel platform to crunch through petascale data volumes furiously fast. The platform is specifically designed to quickly and effectively provide better and faster answers to the most sophisticated business questions, and it allows the integration of its robust set of built-in analytic capabilities with leading analytic tools from vendors such as Revolution Analytics, SAS, IBM SPSS, Fuzzy Logix, and Zementis.
Hadoop-Based Big Data Appliances
Hadoop is an open-source software framework for big data crunching and is composed of several modules. The key modules are MapReduce, the large-scale data-processing framework, and the Hadoop distributed file system (HDFS), the data storage framework. HDFS supports the Hadoop distributed architecture that puts compute engines into the same physical nodes that store the data. This arrangement brings computation to the data nodes rather than the other way around. As data sizes are typically huge, it is prudent to move the processing logic to the data. The data heaps are therefore segregated into a number of smaller and more manageable data sets to be processed in parallel by each of the Hadoop data nodes. The results are then smartly aggregated into the answer for the original problem.
It is anticipated that the Hadoop framework will be positioned as the universal preprocessing engine for the ensuing big data era. The coarse-grained searching, indexing, and cleansing tasks are allocated to the Hadoop module, whereas the fine-grained analytics are accomplished via fully matured and stabilized data management solutions. Apart from preprocessing, Hadoop also comes in handy for eliminating all kinds of redundant, repetitive, and routine data to arrive at the data that are really rewarding in the end. The second major task is to transform all multi-structured data into structured data so that traditional data warehouses and databases can work on the refurbished data to emit pragmatic information to users. There are both open-source (Cloudera, Hortonworks, Apache Hadoop, MapR, etc.) and commercial-grade (IBM BigInsights, etc.) implementations and distributions of the Hadoop standard. Datameer is an end-to-end platform for data ingestion, processing, analysis, and visualization. Figure 2.4 clearly shows how data are split, mapped, merged, and reduced by the MapReduce framework, while HDFS is the data storage mechanism.

Fig. 2.4 The macro-level Hadoop architecture
Why Hadoop Scores Well The plummeting cost of physical storage has created a bevy of tantalizing opportunities to do more with the continuous rise of data volumes, such as extracting and delivering insights. However, there is still the issue of processor cost. A 1 TB massively parallel processing (MPP) appliance could run $100,000 to $200,000, and the total cost of an implementation could reach a few million dollars. In contrast, a terabyte of processing capacity on a cluster of commodity servers can be had for $2,000 to $5,000. This is the compelling factor for IT consultants to incline towards the Hadoop framework. This enormous gain in computational cost achieved by Hadoop clusters is due to the proven distributed architecture of commodity clusters for low-cost data processing and knowledge dissemination; appliances, by contrast, impose a hefty amount on IT budgets. Hadoop operates in a shared-nothing mode. All Hadoop data is stored locally on each node rather than on networked storage. Processing is distributed across an array of commodity servers with independent CPUs and memory. The system is intelligent in the sense that the MapReduce scheduler optimizes for the processing to happen on the same node storing the associated data or located on the same leaf Ethernet switch.
Hadoop is natively fault tolerant. Hardware failure is expected and is mitigated by data replication and speculative processing. If capacity is available, Hadoop will start multiple copies of the same task for the same block of data. Results are accepted from the task that finishes first, while the other task is canceled and its results discarded. Speculative processing enables Hadoop to work around slow nodes and eliminates the need to restart a process if a failure occurs during a long-running calculation. Hadoop's fault tolerance rests on the fact that the cluster can be configured to store data on multiple worker nodes. A key benefit of Hadoop is the ability to upload unstructured files without having to "schematize" them first. You can dump any type of data into Hadoop and allow the consuming programs to determine and apply structure when necessary.
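Schema-on-read can be shown in a few lines of code: raw text lands in storage as-is, and the consuming program imposes structure only at read time. The log layout, field names, and error rule below are hypothetical, chosen only to illustrate the idea.

```python
# Schema-on-read: raw, unmodelled lines are stored as-is; structure is applied
# only when a consuming program reads them. The log layout below is hypothetical.
import csv
import io

raw_dump = """\
2015-03-01T10:00:02,u1,/home,200
2015-03-01T10:00:05,u2,/checkout,500
garbled line that an upfront schema would have rejected
2015-03-01T10:00:09,u1,/search,200
"""

def read_with_schema(text):
    """Parse each line into (timestamp, user, path, status), skipping rows that do not fit."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 4 and row[3].isdigit():
            yield row[0], row[1], row[2], int(row[3])

errors = sum(1 for _, _, _, status in read_with_schema(raw_dump) if status >= 400)
print(f"rows with HTTP errors: {errors}")
```

Because the structure lives in the reading program rather than in the storage layer, a malformed or unexpected record is simply skipped or handled differently instead of blocking the load.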
As noted above, Hadoop is being positioned as a vehicle to load formalized and formatted data into a traditional data warehouse for activities such as data mining, OLAP, and reporting, or for loading data into a BI system for advanced analytics. Organizations can also dump large amounts of data into a Hadoop cluster, use a compatible visual analytics tool to quickly make sense of the data, aggregate it, and export it into an analytics solution. In addition, the distributed processing capability of Hadoop can be used to facilitate the extract-transform-load (ETL) processes for getting data from disparate and distributed sources into the warehouse.
Gartner defines a big data appliance as an integrated system that delivers a combination of server, storage, and network devices in a pre-integrated stack, together with a big data distributed processing framework such as Hadoop.
The well-known advantages include:
• Standard configuration with maintenance and support provided by the vendor
• Converged hardware and software that can potentially lower setup time
• Unified monitoring and management tools that can simplify administration
The negative points include:
• Expensive acquisition and incremental expansion costs.
• Rigid configuration with minimal ability to tune the infrastructure.
• When operating at scale, vendor lock-in poses a significant barrier to a safe exit.
Oracle Big Data Appliance Oracle Big Data Appliance is a high-performance and secure platform for running Hadoop and NoSQL workloads. With Oracle Big Data SQL, it extends Oracle's industry-leading implementation of SQL to Hadoop and NoSQL systems. By combining the newest technologies from the Hadoop ecosystem and powerful Oracle SQL capabilities together on a single pre-configured platform, Oracle Big Data Appliance is uniquely able to support rapid development of big data applications and tight integration with existing relational data. It is pre-configured for secure environments, leveraging Apache Sentry, Kerberos, both network encryption and encryption at rest, as well as Oracle Audit Vault and Database Firewall.
Oracle Big Data SQL is a new architecture for SQL on Hadoop, seamlessly integrating data in Hadoop and NoSQL with data in Oracle Database. Oracle Big Data SQL radically simplifies integrating and operating in the big data domain through two powerful features: newly expanded External Tables and Smart Scan functionality on Hadoop. Oracle Big Data Appliance integrates tightly with Oracle Exadata and Oracle Database using Big Data SQL and Oracle Big Data Connectors, seamlessly enabling analysis of all data in the enterprise.
Dell In-Memory Appliance for Cloudera Enterprise Hadoop platforms are increasingly being embedded into powerful hardware to derive powerful appliances. It is absolutely clear that, across all industries and markets, data is the new currency and competitive differentiator. But in the recent past, data has become big data, and to realize the promised competencies of big data, organizations need competent solutions in place to facilitate faster and easier data ingestion, storage, analysis, and insight building. That is the idea behind the Dell In-Memory Appliance for Cloudera Enterprise. Building on a deep engineering partnership among Dell, Intel, and Cloudera, this next-generation analytic solution addresses the big data challenge with a purpose-built, turnkey, in-memory advanced analytics data platform.
To enable fast analytics and stream processing, the Dell In-Memory Appliance for Cloudera Enterprise is bundled with Cloudera Enterprise, which includes Apache Spark. Cloudera Enterprise allows your business to implement powerful end-to-end analytic workflows, comprising batch data processing, interactive query, navigated search, deep data mining, and stream processing, all from a single common platform. With a single common platform, there is no need to maintain separate systems with separate data, metadata, security, and management that drive up complexity and cost.
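A short PySpark sketch shows how batch aggregation and interactive-style follow-up queries can share one in-memory data set on such a platform. The input path and the comma-separated log layout are assumptions, and the snippet presumes a working Spark installation, such as the one shipped with Cloudera Enterprise, and is launched with spark-submit.

```python
# Minimal PySpark job: cache a data set in cluster memory once, then run both a batch
# aggregation and an interactive-style follow-up query against the cached RDD.
# The HDFS path and the assumed field positions are illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="clickstream-sketch")
events = sc.textFile("hdfs:///data/clickstream/*.log")  # hypothetical location
events.cache()  # keep the working set in memory for repeated queries

# Batch-style aggregation: hits per page (assumes the 3rd comma-separated field is the page).
page_hits = (events.map(lambda line: line.split(",")[2])
                   .map(lambda page: (page, 1))
                   .reduceByKey(lambda a, b: a + b))
print(page_hits.takeOrdered(10, key=lambda kv: -kv[1]))

# Interactive-style follow-up on the same cached data: how many distinct users?
print(events.map(lambda line: line.split(",")[1]).distinct().count())

sc.stop()
```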
High-Performance Big Data Networking The unprecedented speed with which the prime IoT technologies are being adopted powerfully sets the stage for an explosion of poly-structured data. Scores of breakthrough edge technologies and their meticulous usage have collectively facilitated the realization of a growing array of digitized entities. Pioneering connectivity technologies and tools are enabling every smart and sentient object in our personal and professional environments to seamlessly find and connect with one another in the neighborhood and with remotely held cyber applications and data. All types of purposeful interactions among IT-enabled physical, mechanical, electrical, and electronic systems result in data being produced in plenty and accumulated in massive storage systems.
Hadoop deployments can have very large infrastructure requirements, and therefore hardware and software choices made at design time have a significant impact on the performance and return on investment (ROI) of Hadoop clusters. Specifically, Hadoop cluster performance and ROI are highly dependent on network architecture and technology choices. While Gigabit Ethernet is the most commonly deployed network, it provides less than ideal bandwidth for many Hadoop workloads and for the whole range of I/O-bound operations comprising a Hadoop job.
Typical Hadoop scale-out cluster servers utilize TCP/IP networking over one or more Gigabit Ethernet network interface cards (NICs) connected to a Gigabit Ethernet (GbE) network. The latest generation of commodity servers offers multisocket and multicore CPU technology, which outstrips the network capacity offered by GbE networks. With further advances in processor technology, this mismatch between server and network performance is predicted to grow. Solid-state disks (SSDs) are evolving to offer capacity per dollar approaching that of hard disk drives (HDDs), and they are also being rapidly adopted for caching and for use with medium-sized data sets. The advances in storage I/O performance offered by SSDs exceed the performance offered by GbE networking. All of this makes network I/O increasingly the most common impediment to improved Hadoop cluster performance.
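Back-of-the-envelope arithmetic makes the imbalance concrete. The figures used below are illustrative assumptions (a single SSD at roughly 500 MB/s sequential read, four drives per node, GbE at its theoretical peak), not measurements of any particular cluster.

```python
# Rough comparison of aggregate per-node storage bandwidth versus network bandwidth.
# All figures are illustrative assumptions, not benchmark results.
GBE_BYTES_PER_SEC = 125e6          # 1 Gigabit Ethernet ~ 125 MB/s theoretical peak
TEN_GBE_BYTES_PER_SEC = 1.25e9     # 10 GbE ~ 1.25 GB/s theoretical peak
SSD_BYTES_PER_SEC = 500e6          # one SATA SSD, sequential read (assumed)
DRIVES_PER_NODE = 4

storage_bw = DRIVES_PER_NODE * SSD_BYTES_PER_SEC
for name, net_bw in [("1GbE", GBE_BYTES_PER_SEC), ("10GbE", TEN_GBE_BYTES_PER_SEC)]:
    print(f"{name}: storage {storage_bw / 1e9:.1f} GB/s vs network {net_bw / 1e9:.2f} GB/s "
          f"-> storage outpaces the network by {storage_bw / net_bw:.1f}x")
```

Under these assumptions a node can read data many times faster than a 1GbE link can move it, which is exactly the shuffle-phase bottleneck the text describes; 10GbE narrows the gap considerably.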
There is a concern expressed in different quarters about the viability of transferring huge quantities of data from on-premise systems to cloud-based big data analytics. The transmission of tens of terabytes of data over the open and shared public Internet, which is the prominent communication infrastructure for cloud-based data processing and service delivery, evokes some inconvenient questions.
IBM Aspera Solution Built on top of the patented FASP transport technology, IBM Aspera's suite of On Demand Transfer products solves both the technical problems of the WAN and the cloud I/O bottleneck, delivering unrivaled performance for the transfer of large files or large collections of files into and out of the cloud. Aspera's FASP transport protocol eliminates the WAN bottleneck associated with conventional file transfer technologies such as FTP and HTTP. With FASP, transfers of any size into and out of the cloud achieve perfect throughput efficiency, independent of network delays, and are robust to extreme packet loss.
Aspera has developed a high-speed software bridge, Direct-to-Cloud, which transfers data at line speed from the source directly into cloud storage. Using parallel HTTP streams between the Aspera on-demand transfer server running on a cloud virtual machine and the cloud storage, the intra-cloud data movement no longer constrains the overall transfer rate. The files are written directly to cloud storage, without a stop-off on the cloud compute server.
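The general effect of moving an object over several concurrent streams can be sketched with ordinary Python threads. The part size, stream count, and the upload_part function below are placeholders standing in for a real multipart or parallel upload call; this is not Aspera's API or any cloud provider's SDK.

```python
# Concept sketch: split an object into parts and push the parts over several
# concurrent streams. upload_part is a placeholder for a real network call.
import time
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 64 * 1024 * 1024        # 64 MB parts (illustrative)
PARALLEL_STREAMS = 8                # number of concurrent streams (illustrative)

def upload_part(part_number: int, data: bytes) -> int:
    time.sleep(0.01)                # stands in for network I/O
    return len(data)

def parallel_upload(blob: bytes) -> int:
    parts = [blob[i:i + PART_SIZE] for i in range(0, len(blob), PART_SIZE)]
    with ThreadPoolExecutor(max_workers=PARALLEL_STREAMS) as pool:
        sizes = pool.map(upload_part, range(len(parts)), parts)
    return sum(sizes)

if __name__ == "__main__":
    payload = b"\0" * (256 * 1024 * 1024)   # a 256 MB dummy object
    print(f"uploaded {parallel_upload(payload)} bytes over {PARALLEL_STREAMS} streams")
```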
In summary, as big data gets even bigger, the demand for more processing power, storage capacity, storage performance, and network bandwidth grows even greater. To meet this demand, more racks and data nodes may be added to Hadoop clusters. But this solution is neither economical nor efficient: it requires more space, increases power consumption, and adds to the management and maintenance overheads, while ignoring the fact that slow 1GbE interconnects continue to handicap Hadoop's node-to-node and overall input/output speed and cluster performance. 10 Gigabit Ethernet has the potential to bring Hadoop cluster networking into balance with the recent improvements in performance brought by server CPUs and advances in storage technology. However, to achieve optimum balance, the network I/O gains delivered by 10GbE must come with optimal efficiency so that the impact of high-speed network I/O on the server CPU is minimized.
High-Performance Big Data Storage Appliances
With the amount of data being stored doubling every two years and the energy required to store that data exceeding 40% of data center (DC) power consumption, nearly every organization has a critical need for massively scalable and intelligent storage solutions that are highly efficient, easy to manage, and cost-effective. As organizations of all sizes face the daunting tasks of the big data era, many struggle to maintain their productivity and competitiveness as the overpowering waves of big data simply bog them down or negatively impact their performance. The unprecedented growth rate of data, particularly unstructured data, has brought a series of challenges. As per the IDC report, by the end of 2012 the total amount of digital information captured and stored in the world had reached 2.7 zettabytes (ZB), and 90% of it is unstructured data such as multimedia files (still and moving images, sound files, machine data, experimental results, etc.). Searching for useful information in those data mountains poses a real challenge to the enterprise IT teams of business organizations across the globe with their existing IT infrastructures. But along with this unprecedented explosion of data comes an incredible opportunity for monetizing better products and services. Data-driven insights ultimately lead to fresh possibilities and opportunities, and companies are now equipped to seek out hitherto unexplored avenues for fresh revenue.
One critical requirement in BDA is a file system that can provide the requisite performance and maintain that performance while it scales. HDFS is a highly scalable and distributed file system that provides a single, global namespace for the entire Hadoop cluster. HDFS comprises DataNodes backed by direct-attached storage (DAS), which store data in large 64 MB or 128 MB chunks to take advantage of the sequential I/O capabilities of disks and to minimize the latencies resulting from random seeks. The HDFS NameNode is at the center of HDFS. It manages the file system by maintaining a file metadata image that includes file name, location, and replication state. The NameNode can detect failed DataNodes and replicate the data blocks on those nodes to surviving DataNodes.
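A small calculation shows how a single large file maps onto HDFS blocks and what replication costs in raw capacity; the file size, block size, and replication factor below are illustrative values rather than settings of any specific cluster.

```python
# How a large file maps onto HDFS blocks, and what replication costs in raw capacity.
# File size, block size, and replication factor are illustrative assumptions.
import math

FILE_SIZE = 1 * 1024**4          # a 1 TB file
BLOCK_SIZE = 128 * 1024**2       # 128 MB HDFS block
REPLICATION = 3                  # a common HDFS replication factor

blocks = math.ceil(FILE_SIZE / BLOCK_SIZE)
raw_capacity = FILE_SIZE * REPLICATION
print(f"blocks: {blocks}")                                            # 8192 blocks
print(f"raw capacity with replication: {raw_capacity / 1024**4:.1f} TB")
print(f"block replicas the NameNode must track: {blocks * REPLICATION}")
```

The last line hints at why the single NameNode becomes a scaling concern: every block replica in the cluster is an entry in one server's in-memory metadata image.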
HDFS has some limitations in scale and performance as it grows, owing to its single namespace server. A file system is the standard mechanism for storing, organizing, retrieving, and updating data. File system solutions have placed an additional burden on most commercial organizations, requiring technical resources and time as well as financial commitments. And, to complicate matters further, the myriad of file system choices has only created more confusion for organizations seeking a big data solution. Network-attached storage (NAS) appliances have been a popular choice for workgroup and departmental settings where simplicity and utilization of existing Ethernet networks were critical requirements. However, the various NAS solutions do not scale adequately to meet the needs of managing big data and delivering high throughput to data-intensive applications. Thus, there is a clarion call for fresh solutions that take on big data with confidence and clarity. Some file system solutions do not allow the parallelization of multiple paths of data flow to and from the application and the processing cores. The end result with this approach is a large file repository that cannot be utilized effectively by multiple compute-intensive applications.
Considering these constrictions, parallel file systems (PFS) such as Lustre have become popular these days, especially for the stringent requirements imposed by big data. As a more versatile solution, PFS represents the hope for the big data era. A PFS enables a node or file server to service multiple clients simultaneously. Using a technique called file striping, a PFS significantly increases I/O performance by enabling reading and writing from multiple clients concurrently, thereby increasing the available I/O bandwidth. In many environments, a PFS implementation results in applications experiencing a five- to tenfold increase in performance. However, a major hurdle to wider commercial adoption of PFS is the lack of technical expertise required to install, configure, and manage a PFS environment.
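File striping itself is simple to express in code: a byte offset maps to a stripe, and stripes are laid out round-robin across the storage servers, so a client reading a large region pulls from many servers at once. The stripe size and server count below are illustrative assumptions, not Lustre's configuration interface.

```python
# Round-robin file striping: which storage server holds a given byte offset?
# Stripe size and server count are illustrative assumptions.
STRIPE_SIZE = 1 * 1024**2     # 1 MB stripes
NUM_SERVERS = 4               # number of storage targets the file is striped across

def server_for_offset(offset: int) -> int:
    """Return the index of the storage server responsible for this byte offset."""
    stripe_index = offset // STRIPE_SIZE
    return stripe_index % NUM_SERVERS

# A client reading a 16 MB region touches all four servers, stripe by stripe.
offsets = range(0, 16 * 1024**2, STRIPE_SIZE)
print([server_for_offset(o) for o in offsets])   # [0, 1, 2, 3, 0, 1, 2, 3, ...]
```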
Having realized this huge and untapped opportunity, the company Terascala has come out with an innovative storage appliance that addresses the big and fast data challenges. It is the first company to introduce a scalable, high-performance storage appliance that stands up quickly; is easy to manage; augments existing systems available through Dell, EMC, and NetApp; and enables a level of investment protection previously unavailable with parallel file systems.
ActiveStor ActiveStor is a next-generation storage appliance that delivers the high-performance benefits of flash technology with the cost-effectiveness of enterprise SATA disk drives. It scales linearly with capacity without the manageability and reliability compromises that often occur at scale. The principal benefits include:
• High parallel performance
• Enterprise-grade reliability and resiliency
• Easy management
For many big data applications, the limiting factor in performance is often the transportation of large amounts of data from hard disks to where they can be processed (DRAM). In the paper [8], the authors have explained and analyzed an architectural pattern for a scalable distributed flash store which aims to overcome this limitation in two ways. First, the architecture provides high-performance, high-capacity, and scalable random-access storage. It achieves high throughput by sharing large numbers of flash chips across a low-latency, chip-to-chip backplane network managed by the flash controllers. The additional latency for remote data access via this network is negligible compared to the flash access time. Second, it permits some computation near the data via an FPGA-based programmable flash controller. The controller is located in the data path between the storage and the host and provides hardware acceleration for applications without any additional latency. The authors constructed a small-scale prototype whose network bandwidth scales directly with the number of nodes and in which the average latency for user software to access the flash store is less than 70 microseconds, including 3.5 microseconds of network overhead.
It is clear that state-of-the-art file systems and storage solutions are becoming indispensable for the world of big data. High-performance storage appliances are hitting the market these days to facilitate big data analytics efficiently and effectively.
Conclusions
As the cloud era takes hold and the world creates 2.5 exabytes of data every day, traditional approaches and techniques for data ingestion, processing, and analysis are found to be limited, because some lack parallelism and most lack fault tolerance capabilities. Path-breaking technologies in the form of converged and resilient infrastructures, versatile platforms, and adaptive applications are being prescribed as the correct way of working with massive data volumes.
Insight-driven strategizing, planning, and execution are vital for businesses to survive in a knowledge-driven and market-centric environment. Decision-making has to be proactive and preemptive in this competitive environment in order to retain the edge businesses have earned. Innovation has to be fully ingrained in everything businesses do in their long and arduous journey.
Data-driven insights are the most validated and venerable artifacts for bringing forth next-generation customer-centric offerings. These days, worldwide corporations are saturated with large amounts of decision-enabling and value-adding data, are dealing with more complicated business issues, and are experiencing increased globalization challenges. Thus, it is essential to transform data assets into innovation and to maximize the productivity of resources to drive sustainable growth.
In this chapter, we have explained the importance of high-performance IT infrastructures and platforms, besides end-to-end big data frameworks, in order to substantially speed up data crunching and derive actionable insights in time to empower people to be the smartest in their personal as well as professional lives. Today, it is simple and straightforward: to out-compute is to out-compete.