
Big Data Analytics: Turning Big Data into Big Money

By Frank Ohlhorst
Copyright 2013 by John Wiley & Sons, Inc.

CHAPTER 10

Bringing It All Together
The promises offered by data-driven decision making have been widely recognized. Businesses have been using business intelligence (BI) and business analytics for years now, realizing the value offered by smaller data sets and offline advanced processing. However, businesses are just starting to realize the value of Big Data analytics, especially when paired with real-time processing.

That has led to a growing enthusiasm for the notion of Big Data, with businesses of all sizes starting to throw resources behind the quest to leverage the value out of large data stores composed of structured, semistructured, and unstructured data. Although the promises wrapped around Big Data are very real, there is still a wide gap between its potential and its realization.

That wide gap is highlighted by those who have successfully used the concepts of Big Data at the outset. For example, it is estimated that Google alone contributed $54 billion to the U.S. economy in 2009, a significant economic effect, mostly attributed to the ability to handle large data sets in an efficient manner.

That alone is probably reason enough for the majority of businesses to start evaluating how Big Data analytics can affect the bottom line, and those businesses should probably start evaluating Big Data promises sooner rather than later.


c10 22 October 2012; 18:1:22



Delving into the value of Big Data analytics reveals that elements such as heterogeneity, scale, timeliness, complexity, and privacy problems can impede progress at all phases of the process that creates value from data. The primary problem begins at the point of data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep, what to discard, and how to reliably store what we keep with the right metadata.

Adding to the confusion is that most data today are not natively stored in a structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge.

Nevertheless, the value of data explodes when they can be linked with other data; thus data integration is a major creator of value. Since most data are directly generated in digital format today, businesses have the opportunity and the challenge to influence the creation of data in ways that facilitate later linkage and to automatically link previously created data.

Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications because of the lack of scalability of the underlying algorithms as well as the complexity of the data that need to be analyzed. Finally, presentation of the results and their interpretation by nontechnical domain experts is crucial for extracting actionable knowledge.

THE PATH TO BIG DATA

During the last three to four decades, primary data management principles, including physical and logical independence, declarative querying, and cost-based optimization, have created a multibillion-dollar industry that has delivered added value to collected data. The evolution of these technical advantages has led to the creation of BI platforms, which have become one of the primary tenets of value extraction and corporate decision making.

The foundation laid by BI applications and platforms has created the ideal environment for moving into Big Data analytics. After all, many of the concepts remain the same; it is just the data sources and the quantity that primarily change, as well as the algorithms used to expose the value.

That creates an opportunity in which investment in Big Data and its associated technical elements becomes a must for many businesses. That investment will spur further evolution of the analytical platforms in use and will strive to create collaborative analytical solutions that look beyond the confines of traditional analytics. In other words, appropriate investment in Big Data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of Big Data management and analysis platforms, products, and systems.

The time is now. Using Big Data to solve business problems and promote research initiatives will most likely create huge economic value in the U.S. economy for years to come, making Big Data analytics the norm for larger organizations. However, the path to success is not easy and may require that data scientists rethink data analysis systems in fundamental ways.

A major investment in Big Data, properly directed, not only can result in major scientific advances but also can lay the foundation for the next generation of advances in science, medicine, and business. So business leaders must ask themselves the following: Do they want to be part of the next big thing in IT?

THE REALITIES OF THINKING BIG DATA

Today, organizations and individuals are awash in a flood of data. Applications and computer-based tools are collecting information on an unprecedented scale. The downside is that the data have to be managed, which is an expensive, cumbersome process. Yet the cost of that management can be offset by the intrinsic value offered by the data, at least when looked at properly.

The value is derived from the data themselves. Decisions that were previously based on guesswork or on painstakingly constructed models of reality can now be made based on the data themselves. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

Certain market segments have had early success with Big Data analytics. For example, scientific research has been revolutionized by Big Data, a prime case being the Sloan Digital Sky Survey, which has become a central resource for astronomers the world over.

Big Data has transformed astronomy from a field in which taking pictures of the sky was a large part of the job to one in which the pictures are all in a database already and the astronomer's task is to find interesting objects and phenomena in the database.

Transformation is taking place in the biological arena as well. There is now a well-established tradition of depositing scientific data into a public repository and of creating public databases for use by other scientists. In fact, there is an entire discipline of bioinformatics that is largely devoted to the maintenance and analysis of such data. As technology advances, particularly with the advent of next-generation sequencing, the size and number of available experimental data sets are increasing exponentially.

Big Data has the potential to revolutionize more than just research; the analytics process has started to transform education as well. A recent detailed quantitative comparison of different approaches taken by 35 charter schools in New York City has found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction.

This example is only the tip of the iceberg; as access to data and analytics improves and evolves, much more value can be derived. The potential here leads to a world where authorized individuals have access to a huge database in which every detailed measure of every student's academic performance is stored. Those data could be used to design the most effective approaches for education, ranging from the basics, such as reading, writing, and math, to advanced college-level courses.

A final example is the health care industry, in which everything from insurance costs to treatment methods to drug testing can be improved with Big Data analytics. Ultimately, Big Data in the health care industry will lead to reduced costs and improved quality of care, which may be attributed to making care more preventive and personalized and basing it on more extensive (home-based) continuous monitoring.

More examples are readily available to prove that data can deliver value well beyond one's expectations. The key issues are the analysis performed and the goal sought. The previous examples only scratch the surface of what Big Data means to the masses. The essential point here is to understand the intrinsic value of Big Data analytics and extrapolate the value as it can be applied to other circumstances.

HANDS-ON BIG DATA

The analysis of Big Data involves multiple distinct phases, each of which introduces challenges. These phases include acquisition, extraction, aggregation, modeling, and interpretation. However, most people focus just on the modeling (analysis) phase.

Although that phase is crucial, it is of little use without the other phases of the data analysis process; neglecting them can create problems like false outcomes and uninterpretable results. The analysis is only as good as the data provided. The problem stems from the fact that there are poorly understood complexities in the context of multitenanted data clusters, especially when several analyses are being run concurrently.

Many significant challenges extend beyond and underneath the modeling phase. For example, Big Data has to be managed for context, which may include spurious information and can be heterogeneous in nature; this is further complicated by the lack of an upfront model. It means that data provenance must be accounted for, as well as methods created to handle uncertainty and error.

Perhaps the problems can be attributed to ignorance or, at the very least, a lack of consideration for primary topics that define the Big Data process yet are often afterthoughts. This means that questions and analytical processes must be planned and thought out in the context of the data provided. One has to determine what is wanted from the data and then ask the appropriate questions to get that information.

Accomplishing that will require smarter systems as well as better support for those making the queries, perhaps by empowering those users with natural language tools (rather than complex mathematical algorithms) to query the data. The key issue is the level of achievable artificial intelligence and how much that can be relied on. Currently, IBM's Watson is a major step toward integrating artificial intelligence with the Big Data analytics space, yet the sheer size and complexity of the system precludes its use for most analysts.

This means that other methodologies to empower users and analysts will have to be created, and they must remain affordable and be simple to use. After all, the current bottleneck with processing Big Data really has become the number of users who are empowered to ask questions of the data and analyze them.

THE BIG DATA PIPELINE IN DEPTH

Big Data does not arise from a vacuum (except, of course, when studying deep space). Basically, data are recorded from a data-generating source. Gathering data is akin to sensing and observing the world around us, from the heart rate of a hospital patient to the contents of an air sample to the number of Web page queries to scientific experiments that can easily produce petabytes of data.

However, much of the data collected is of little interest and can be filtered and compressed by many orders of magnitude, which creates a bigger challenge: the definition of filters that do not discard useful information. For example, suppose one data sensor reading differs substantially from the rest. Can that be attributed to a faulty sensor, or are the data real and worth inclusion?
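
One cautious way to approach the sensor question is to flag unusual readings for review instead of discarding them. The sketch below is only an illustration under stated assumptions: the function name, the sample readings, and the 3.5 cutoff (a common convention for the modified z-score) are not from the text.

```python
from statistics import median


def flag_outliers(readings, threshold=3.5):
    """Pair each reading with a flag saying whether it looks suspect.

    Uses the modified z-score, which is built on the median and the
    median absolute deviation (MAD), so a single extreme reading
    cannot distort the baseline it is compared against.
    """
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        # At least half the readings equal the median; flag anything that differs.
        return [(r, r != med) for r in readings]
    return [(r, abs(0.6745 * (r - med) / mad) > threshold) for r in readings]


# Hypothetical temperature readings; one differs substantially from the rest.
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9]
flagged = flag_outliers(readings)
suspects = [value for value, is_suspect in flagged if is_suspect]
print(suspects)
```

Nothing is deleted here. The flagged value stays in the data set, so an analyst can still decide whether it reflects a faulty sensor or a real event.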
Further complicating the filtering process is how the sensors gather data. Are they based on time, transactions, or other variables? Are the sensors affected by environment or other activities? Are the sensors tied to spatial and temporal events such as traffic movement or rainfall?

Before the data are filtered, these considerations and others must be addressed. That may require new techniques and methodologies to process the raw data intelligently and deliver a data set in manageable chunks without throwing away the needle in the haystack. Further filtering complications come with real-time processing, in which the data are in motion and streaming on the fly, and one does not have the luxury of being able to store the data first and process them later for reduction.
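
When data cannot be stored first, a filter has to work in a single pass with a small, fixed amount of state. The following sketch assumes one simple approach, a running mean and variance kept with Welford's method; the class name, the warm-up length, and the three-standard-deviation rule are illustrative choices, not a method prescribed here.

```python
import math


class RunningStats:
    """One-pass mean and variance (Welford's method); stores no history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def stream_filter(stream, k=3.0, warmup=5):
    """Yield (value, is_suspect) for each item as it arrives."""
    stats = RunningStats()
    for value in stream:
        if stats.count >= warmup and abs(value - stats.mean) > k * stats.stddev:
            # Flag the value and keep it out of the running baseline.
            yield value, True
        else:
            stats.update(value)
            yield value, False


# Hypothetical stream of readings arriving one at a time.
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 42.0, 9.8]
results = list(stream_filter(stream))
print([value for value, is_suspect in results if is_suspect])
```

The trade-off is that the filter only knows what it has already seen, so its early decisions rest on very little evidence. That is the reason for the warm-up period.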
Another challenge comes in the form of automatically generating the right metadata to describe what data are recorded and how they are recorded and measured. For example, in scientific experiments, considerable detail on specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with observational data.

When implemented properly, automated metadata acquisition systems can minimize the need for manual processing, greatly reducing the human burden of recording metadata. Those who are gathering data also have to be concerned with the data provenance. Recording information about the data at their time of creation becomes important as the data move through the data analysis process. Accurate provenance can prevent processing errors from rendering the subsequent analysis useless. With suitable provenance, the subsequent processing steps can be quickly identified. Proving the accuracy of the data is accomplished by generating suitable metadata that also carry the provenance of the data through the data analysis process.
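
A minimal sketch of metadata that travels with the data might look like the following. The field names, the sensor identifier, and the conversion step are hypothetical; the point is only that each processing step appends a record of what was done, so later stages can retrace the path.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceStep:
    operation: str    # name of the processing step
    parameters: dict  # arguments the step was run with
    applied_at: str   # UTC timestamp of when the step ran


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str
    source: str       # hypothetical instrument identifier
    recorded_at: str  # when the raw value was captured
    provenance: tuple = ()

    def apply(self, operation, func, new_unit=None, **parameters):
        """Return a new Measurement with one more provenance step recorded."""
        step = ProvenanceStep(
            operation, dict(parameters), datetime.now(timezone.utc).isoformat()
        )
        return Measurement(
            value=func(self.value, **parameters),
            unit=new_unit or self.unit,
            source=self.source,
            recorded_at=self.recorded_at,
            provenance=self.provenance + (step,),
        )


raw = Measurement(98.6, "degF", "thermometer-07", "2024-01-01T00:00:00Z")
celsius = raw.apply(
    "fahrenheit_to_celsius", lambda v: (v - 32) * 5 / 9, new_unit="degC"
)
rounded = celsius.apply("round", lambda v, digits: round(v, digits), digits=1)
print(rounded.value, rounded.unit, [s.operation for s in rounded.provenance])
```

Because each step returns a new object, the raw measurement is never overwritten, and the chain of operations can be inspected or replayed.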
Another step in the process consists of extracting and cleaning the data. The information collected will frequently not be in a format ready for analysis. For example, consider electronic health records in a medical facility that consist of transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated anomalous data), and image data such as scans. Data in this form cannot be effectively analyzed. What is needed is an information extraction process that draws out the required information from the underlying sources and expresses it in a structured form suitable for analysis.

Accomplishing that correctly is an ongoing technical challenge, especially when the data include images (and, in the future, video). Such extraction is highly application dependent; the information in an MRI, for instance, is very different from what you would draw out of a surveillance photo. The ubiquity of surveillance cameras and the popularity of GPS-enabled mobile phones, cameras, and other portable devices means that rich and high-fidelity location and trajectory (i.e., movement in space) data can also be extracted.

Another issue is the honesty of the data. For the most part, data are expected to be accurate, if not truthful. However, in some cases, those who are reporting the data may choose to hide or falsify information. For example, patients may choose to hide risky behavior, or potential borrowers filling out loan applications may inflate income or hide expenses. The list of ways in which data could be misinterpreted or misreported is endless. The act of cleaning data before analysis should include well-recognized constraints on valid data or well-understood error models, which may be lacking in Big Data platforms.
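
Constraints on valid data can be written down explicitly and checked before analysis. The sketch below uses the loan application example; the field names, the age range, and the 5 percent expense rule are all hypothetical and would need to be replaced with rules that domain experts actually agree on.

```python
def validate_application(record):
    """Return a list of constraint violations for one loan application."""
    problems = []
    income = record.get("annual_income")
    expenses = record.get("annual_expenses")
    age = record.get("age")

    if income is None or income < 0:
        problems.append("annual_income missing or negative")
    if expenses is None or expenses < 0:
        problems.append("annual_expenses missing or negative")
    if age is None or not 18 <= age <= 120:
        problems.append("age outside 18-120")
    # Hypothetical plausibility rule: near-zero expenses next to a
    # positive income are flagged for review, not rejected outright.
    if income and income > 0 and expenses is not None and 0 <= expenses < 0.05 * income:
        problems.append("expenses implausibly low relative to income")
    return problems


applications = [
    {"id": 1, "annual_income": 52000, "annual_expenses": 30000, "age": 41},
    {"id": 2, "annual_income": 250000, "annual_expenses": 1000, "age": 29},
    {"id": 3, "annual_income": -5, "annual_expenses": 12000, "age": 17},
]
report = {a["id"]: validate_application(a) for a in applications}
for app_id, problems in report.items():
    print(app_id, problems or "ok")
```

Hard constraints (a negative income) and soft plausibility checks (unusually low expenses) are kept in the same report here, but a real pipeline would likely route them differently.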
Moving data through the process requires concentration on integration, aggregation, and representation of the data, all of which are process-oriented steps that address the heterogeneity of the flood of data. Here the challenge is to record the data and then place them into some type of repository.

Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are machine readable and then computer resolvable. It may take a significant amount of work to achieve automated error-free difference resolution.

The data preparation challenge even extends to analysis that uses only a single data set. Here there is still the issue of suitable database design, further complicated by the many alternative ways in which to store the information. Particular database designs may have certain advantages over others for analytical purposes. A case in point is the variety in the structure of bioinformatics databases, in which information on substantially similar entities, such as genes, is inherently different but is represented with the same data elements.

Examples like these clearly indicate that database design is an artistic endeavor that has to be carefully executed in the enterprise context by professionals. When creating effective database designs, professionals such as data scientists must have the tools to assist them in the design process, and more important, they must develop techniques so that databases can be used effectively in the absence of intelligent database design.

As the data move through the process, the next step is querying the data and then modeling them for analysis. Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis. Big Data is often noisy, dynamic, heterogeneous, interrelated, and untrustworthy, a very different informational source from small data sets used for traditional statistical analysis.

Even so, noisy Big Data can be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. In addition, interconnected Big Data creates large heterogeneous information networks with which information redundancy can be explored to compensate for missing data, cross-check conflicting cases, and validate trustworthy relationships. Interconnected Big Data resources can disclose inherent clusters and uncover hidden relationships and models.

Mining the data therefore requires integrated, cleaned, trustworthy, and efficiently accessible data, backed by declarative query and mining interfaces that feature scalable mining algorithms. All of this relies on Big Data computing environments that are able to handle the load. Furthermore, data mining can be used concurrently to improve the quality and trustworthiness of the data, expose the semantics behind the data, and provide intelligent querying functions.

Striking examples of introduced data errors can be readily found in the health care industry. As noted previously, it is not uncommon for real-world medical records to have errors. Further complicating the situation is the fact that medical records are heterogeneous and are usually distributed in multiple systems. The result is a complex analytics environment that lacks any type of standard nomenclature to define its respective elements.

The value of Big Data analysis can be realized only if it can be applied robustly under those challenging conditions. However, the knowledge developed from that data can be used to correct errors and remove ambiguity. An example of the use of that corrective analysis is when a physician writes "DVT" as the diagnosis for a patient. This abbreviation is commonly used for both deep vein thrombosis and diverticulitis, two very different medical conditions. A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant.
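
The disambiguation idea can be sketched with a toy knowledge base. The associations below are a tiny, hand-written illustration and not clinical guidance; a real knowledge base would be learned from large volumes of related records, and the simple overlap score is an assumption made for brevity.

```python
# Toy knowledge base: terms that tend to appear alongside each condition.
KNOWLEDGE_BASE = {
    "deep vein thrombosis": {"leg swelling", "calf pain", "warfarin", "heparin"},
    "diverticulitis": {"abdominal pain", "fever", "ciprofloxacin", "metronidazole"},
}


def expand_abbreviation(candidates, context_terms):
    """Pick the candidate whose known associations best match the record.

    Returns (best_candidate_or_None, scores). None means the context
    gives no evidence or ties, so a person should review the record.
    """
    context = {term.lower() for term in context_terms}
    scores = {name: len(KNOWLEDGE_BASE[name] & context) for name in candidates}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None, scores
    return max(scores, key=scores.get), scores


dvt_candidates = ["deep vein thrombosis", "diverticulitis"]
meaning, scores = expand_abbreviation(dvt_candidates, ["Calf pain", "heparin"])
print(meaning, scores)
```

Returning `None` for ties or empty context is deliberate: an automated correction that guesses is worse than one that asks.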
It is easy to see how Big Data can enable the next generation of interactive data analysis, which by using automation can deliver real-time answers. This means that machine intelligence can be used in the future to direct automatically generated queries toward Big Data, a key capability that will extend the value of data for automatic content creation for web sites, populate hot lists or recommendations, and provide an ad hoc analysis of the value of a data set to decide whether to store or discard it.

Achieving that goal will require scaling complex query-processing techniques to terabytes while enabling interactive response times, and currently this is a major challenge and an open research problem. Nevertheless, advances are made on a regular basis, and what is a problem today will undoubtedly be solved in the near future as processing power increases and data become more coherent.

Solving that problem will require a technique that eliminates the lack of coordination between the database systems that host the data and provide SQL querying and the analytics packages that perform various forms of non-SQL processing such as data mining and statistical analyses. Today's analysts are impeded by a tedious process of exporting data from a database, performing a non-SQL process, and bringing the data back. This is a major obstacle to providing the interactive automation that was provided by the first generation of SQL-based OLAP systems. What is needed is a tight coupling between declarative query languages and the functions of Big Data analytics packages that will benefit both the expressiveness and the performance of the analysis.
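
On a very small scale, the coupling described above can be illustrated by registering a statistical function inside the database engine so that it is callable from SQL, rather than exporting rows and computing elsewhere. The sketch uses SQLite through Python's standard `sqlite3` module purely as a stand-in; the table, the data, and the `stddev` aggregate are assumptions for the example.

```python
import math
import sqlite3


class StdDev:
    """Sample standard deviation as a user-defined SQL aggregate."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def step(self, value):
        # Called once per row; NULLs are skipped like built-in aggregates do.
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev", 1, StdDev)
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 14.0), ("east", 12.0), ("west", 5.0), ("west", 5.0)],
)

# The statistic is computed where the data live, inside a declarative query.
rows = conn.execute(
    "SELECT region, AVG(amount), stddev(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```

The analyst writes one query and gets grouped statistics back, with no export and re-import step in between.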
One of the most important steps in processing Big Data is the interpretation of the data analyzed. That is where business decisions can be formed based on the contents of the data as they relate to a business process. The ability to analyze Big Data is of limited value if the users cannot understand the analysis. Ultimately, a decision maker, provided with the result of an analysis, has to interpret these results. Data interpretation cannot happen in a vacuum. For most scenarios, interpretation requires examining all of the assumptions and retracing the analysis process.

An important element of interpretation comes from the understanding that there are many possible sources of error, ranging from processing bugs to improper analysis assumptions to results based on erroneous data, a situation that logically prevents users from fully ceding authority to a fully automated process run solely by the computer system. Proper interpretation requires that the user understands and verifies the results produced by the computer. Nevertheless, the analytics platform should make that easy to do, which currently remains a challenge with Big Data because of its inherent complexity.

In most cases, crucial assumptions behind the data are recorded that can taint the overall analysis. Those analyzing the data need to be aware of these situations. Since the analytical process involves multiple steps, assumptions can creep in at any point, making documentation and explanation of the process especially important to those interpreting the data. Ultimately that will lead to improved results and will introduce self-correction into the data process as those interpreting the data inform those writing the algorithms of their needs.

It is rarely enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived and what inputs it was based on. Such supplementary information is called the provenance of the data. By studying how best to acquire, store, and query provenance, in conjunction with using techniques to accumulate adequate metadata, we can create an infrastructure that provides users with the ability to interpret the analytical results and to repeat the analysis with different assumptions, parameters, or data sets.

BIG DATA VISUALIZATION

Systems that offer a rich palette of visualizations are important in conveying to the users the results of the queries, using a representation that best illustrates how data are interpreted in a particular situation. In the past, BI systems users were normally offered tabular content consisting of numbers and had to visualize the data relationships themselves. However, the complexity of Big Data makes that difficult, and graphical representations of analyzed data sets are more informative and easier to understand.

It is usually easier for a multitude of users to collaborate on the analytical results when they are presented in a graphical form, simply because interpretation is removed from the formula and the users are shown the results. Today's analysts need to present results in powerful visualizations that assist interpretation and support user collaboration.

These visualizations should be based on interactive sources that allow the users to click and redefine the presented elements, creating a constructive environment where theories can be played out and other hidden elements can be brought forward. Ideally, the interface will allow visualizations to be affected by what-if scenarios or filtered by other related information, such as date ranges, geographical locations, or statistical queries.

Furthermore, with a few clicks the user should be able to go deeper into each piece of data and understand its provenance, which is a key feature to understanding the data. Users need to be able to not only see the results but also understand why they are seeing those results.

Raw provenance, particularly regarding the phases in the analytics process, is likely to be too technical for many users to grasp completely. One alternative is to enable the users to play with the steps in the analysis (make small changes to the process, for example, or modify values for some parameters). The users can then view the results of these incremental changes. By these means, the users can develop an intuitive feeling for the analysis and also verify that it performs as expected in corner cases, those that occur outside normal circumstances. Accomplishing this requires the system to provide convenient facilities for the user to specify analyses.

BIG DATA PRIVACY

Data privacy is another huge concern, which increases as one equates such privacy with the power of Big Data. For electronic health records, there are strict laws governing what can and cannot be done. For other data, regulations, particularly in the United States, are less forceful. However, there is great public fear about the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, and it must be addressed jointly from both perspectives to realize the promise of Big Data.

Take, for example, the data gleaned from location-based services. A situation in which new architectures require a user to share his or her location with the service provider results in obvious privacy concerns. Hiding the user's identity alone without hiding the location would not properly address these privacy concerns.

An attacker or a (potentially malicious) location-based server can infer the identity of the query source from its location information. For example, a user's location information can be tracked through several stationary connection points (e.g., cell towers). After a while, the user leaves a metaphorical trail of bread crumbs that lead to a certain residence or office location and can thereby be used to determine the user's identity.
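
How little it takes to follow such a trail can be shown with a few lines of code. The sketch below assumes an anonymous trace of (hour of day, cell tower) pairs and a crude heuristic, that the tower seen most often at night is probably near the user's residence; the tower names and the night window are hypothetical.

```python
from collections import Counter


def likely_home_tower(pings, night_start=22, night_end=6):
    """Return the tower seen most often during night hours, or None.

    `pings` is an iterable of (hour_of_day, tower_id) pairs that carry
    no name or account identifier at all.
    """
    night = Counter(
        tower for hour, tower in pings if hour >= night_start or hour < night_end
    )
    return night.most_common(1)[0][0] if night else None


# Hypothetical anonymous trace collected over several days.
trace = [
    (23, "tower-A"), (2, "tower-A"), (9, "tower-C"), (14, "tower-C"),
    (22, "tower-A"), (3, "tower-B"), (1, "tower-A"),
]
print(likely_home_tower(trace))
```

Once a probable residence is known, linking it to public records can reveal who the user is, which is why removing names from location data is not enough on its own.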
Several other types of private information, such as health issues (e.g., presence in a cancer treatment center) or religious preferences (e.g., presence in a church), can also be revealed by just observing anonymous users' movement and usage patterns over time.

Furthermore, with the current platforms in use, it is more difficult to hide a user's location than to hide his or her identity. This is a result of how location-based services interact with the user. The location of the user is needed for successful data access or data collection, but the identity of the user is not necessary.

There are many additional challenging research problems, such as defining the ability to share private data while limiting disclosure and ensuring sufficient data utility in the shared data. The existing methodology of differential privacy is an important step in the right direction, but it unfortunately cripples the data payload too severely to be useful in most practical cases.
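
The tension between disclosure and utility can be made concrete with the Laplace mechanism, one standard way of achieving differential privacy for a counting query. The sketch below is an illustration under stated assumptions: the true count of 120, the two epsilon values, and the helper names are hypothetical, and a count is assumed to have sensitivity 1 (one person changes it by at most one).

```python
import random


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


def mean_abs_error(true_count, epsilon, trials=5000, seed=0):
    """Average distance between the released value and the true count."""
    rng = random.Random(seed)
    total = sum(
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


# A smaller epsilon means stronger privacy and a noisier answer.
error_strict = mean_abs_error(true_count=120, epsilon=0.1)
error_loose = mean_abs_error(true_count=120, epsilon=1.0)
print(round(error_strict, 2), round(error_loose, 2))
```

With the stricter setting the released count is typically off by about ten, against about one with the looser setting. For a small count, noise of that size can swamp the signal, which is the utility problem described above.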
Real-world data are not static in nature, but they get larger and change over time, rendering the prevailing techniques almost useless, since useful content is not revealed in any measurable amount for future analytics. This requires a rethinking of how security for information sharing is defined for Big Data use cases. Many online services today require us to share private information (think of Facebook applications), but beyond record-level access control we do not understand what it means to share data, how the shared data can be linked, and how to give users fine-grained control over this sharing. Those issues will have to be worked out to preserve user security while still providing the most robust data set for Big Data analytics.
