BCE Report
Big Data
By
59 Steve Correia
60 Latesh Billava
62 Nathen Vaz
63 Nathen Carneiro
64 Justin Madhri
September 2022
Abstract
Big data is a new driver of world economic and societal change. The world’s data
collection is reaching a tipping point for major technological changes that can bring new ways
of making decisions and of managing our health, cities, finances and education. While data
complexities are increasing, including data’s volume, variety, velocity and veracity, the real
impact hinges on our ability to uncover the ‘value’ in the data through Big Data Analytics
technologies. Big Data Analytics poses a grand challenge for the design of highly scalable
algorithms and systems that integrate the data and uncover large hidden values from datasets
that are diverse, complex, and of massive scale. Potential breakthroughs include new
algorithms, methodologies, systems and applications in Big Data Analytics that discover useful
and hidden knowledge from Big Data efficiently and effectively. Big Data Analytics is
relevant to Hong Kong as it moves towards a digital economy and society, and Hong Kong is
already among the best in the world in Big Data Analytics. Big data analytics must also be a
team effort cutting across academic institutions, government, society and industry, driven by
researchers from multiple disciplines including computer science and engineering, health,
data science, and social and policy areas.
Acknowledgement
A project is always a coordinated, guided and scheduled team effort aimed at realizing a common
goal. We are grateful to all those people who have helped and guided us through this project and
made this experience worthwhile. We wish to sincerely thank our Director, Brother Shantilal
Kujur, our Principal, Dr. Sincy George, and our HOD of Information Technology, Dr. Prachi
Raut, for giving us this opportunity to prepare a project in the Third Year of Information
Technology. We are highly indebted to our institute, St. Francis Institute of Technology, and the
Department of Information Technology for providing us with this learning opportunity and the
resources required to accomplish our task so far. We are truly grateful to our mentor, Ms. Eden
Fernandes, who persistently guided us in the betterment of our project, report and presentation.
This work would not have been possible without her insights and intellectual suggestions, which
have helped us achieve so much. We also take this opportunity to thank all the teaching and
non-teaching staff for their endearing support and cooperation.
Table of Contents
1 Introduction
2 Problem Definition
3 Architecture
4 Components of Big Data
5 Technology
6 Applications
7 Conclusion
8 References
Chapter 1: Introduction
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture, data curation,
search, sharing, storage, transfer, visualization, and information privacy. The term often refers
simply to the use of predictive analytics or certain other advanced methods to extract
value from data, and seldom to a particular size of data set. Accuracy in big data may lead to
more confident decision making, and better decisions can mean greater operational efficiency,
cost reductions and reduced risk.
Data sets grow in size in part because they are increasingly being gathered by cheap
and numerous information-sensing mobile devices, aerial sensors (remote sensing), software logs,
cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor
networks. The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of
data were created every day. The challenge for large enterprises is determining who should
own big data initiatives that straddle the entire organization.
Work with big data is necessarily uncommon; most analysis is of "PC size" data, on a
desktop PC or notebook that can handle the available data set.
Relational database management systems and desktop statistics and visualization
packages often have difficulty handling big data. The work instead requires "massively parallel
software running on tens, hundreds, or even thousands of servers". What is considered
"big data" varies depending on the capabilities of the users and their tools, and expanding
capabilities make Big Data a moving target.
Chapter 2: Problem Definition
Problem definition is probably one of the most complex and heavily neglected stages
in the big data analytics pipeline. Experience is essential in order to define the problem a
data product would solve, and most aspiring data scientists have little or no experience in
this stage. Most big data problems can be categorized in the following ways:
• Supervised Regression
In a regression problem, the response y ∈ ℝ is real valued. For example, we can develop a
model to predict the hourly salary of individuals given the corpus of their CVs.
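A supervised regression model of this kind can be sketched with a minimal example. The sketch below fits a one-feature linear model by ordinary least squares; the feature (years of experience), the salary figures, and the closed-form fit are all illustrative assumptions, not the method a real CV-based model would use.

```python
# Minimal sketch of supervised regression: predict hourly salary from a
# single numeric feature, fit with ordinary least squares (closed form).
# The training data below is invented purely for illustration.

def fit_linear(xs, ys):
    """Return (slope, intercept) minimising the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: (years of experience, hourly salary)
experience = [1, 3, 5, 7, 9]
salary = [12.0, 16.0, 20.0, 24.0, 28.0]

slope, intercept = fit_linear(experience, salary)

def predict(x):
    return slope * x + intercept
```

In practice the single feature would be replaced by many features extracted from the CV text, and the closed-form fit by a library model, but the supervised structure — real-valued response, labelled examples — is the same.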
• Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide this insight in
order for the marketing department to develop products for different segments. A good
approach for developing a segmentation model, rather than thinking of algorithms, is to select
features that are relevant to the segmentation that is desired.
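One common way to build such a segmentation model is clustering. The sketch below runs a one-dimensional k-means over a hypothetical "monthly spend" feature; the data, the single feature, and the choice of k are all assumptions for illustration — a real segmentation would use several carefully selected features, as the text suggests.

```python
# Minimal sketch of unsupervised segmentation: 1-D k-means grouping
# customers by a single invented feature (monthly spend).
import random

def kmeans_1d(values, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(values, k)       # initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98, 50, 52]
centroids, clusters = kmeans_1d(spend, k=3)
```

Each resulting cluster is a candidate customer segment that the marketing department could target separately.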
• Learning to Rank
This problem can be considered a regression problem, but it has particular characteristics
and deserves separate treatment. Given a collection of documents and a query, we seek the
most relevant ordering of the documents for that query. In order to develop a supervised
learning algorithm, we need labels indicating how relevant an ordering is, given a query.
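The ranking setup above can be sketched as: score each document against the query, then order documents by descending score. The scorer here is simple term overlap and the documents are invented; in a learned system, a trained model would replace the hand-written scoring function.

```python
# Minimal sketch of ranking: score each document for a query, then sort
# documents by descending score. Term-overlap scoring stands in for a
# learned relevance model.

def score(query, document):
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def rank(query, documents):
    return sorted(documents, key=lambda d: score(query, d), reverse=True)

docs = [
    "cooking recipes for pasta",
    "big data analytics with hadoop",
    "hadoop cluster setup guide",
]
ordering = rank("hadoop big data", docs)
```

Supervision enters when the relevance labels for (query, document) pairs are used to train the scoring function instead of writing it by hand.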
Chapter 3: Architecture
• Data storage: Data for batch processing operations is typically stored in a distributed file
store that can hold high volumes of large files in various formats. This kind of store is
often called a data lake. Options for implementing this storage include Azure Data Lake
Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files, processing
them, and writing the output to new files.
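The read-process-write shape of a batch job can be sketched in miniature. The sketch below aggregates event counts from source files and writes the result to a new output file; the file names, the one-record-per-line "key,value" format, and the counting aggregation are invented for illustration — real jobs run the same pattern in parallel across many machines.

```python
# Minimal sketch of a batch job: read source files, aggregate event
# counts per key, and write the result to a new output file.
import os
import tempfile
from collections import Counter

def run_batch(source_paths, output_path):
    counts = Counter()
    for path in source_paths:            # read source files
        with open(path) as f:
            for line in f:               # one "key,value" record per line
                key, _, _ = line.partition(",")
                counts[key.strip()] += 1
    with open(output_path, "w") as out:  # write aggregated output
        for key, n in sorted(counts.items()):
            out.write(f"{key},{n}\n")

# Invented example data in a temporary directory.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "events-0001.csv")
with open(src, "w") as f:
    f.write("click,u1\nclick,u2\nview,u1\n")
run_batch([src], os.path.join(tmp, "daily-counts.csv"))
```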
• Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for stream
processing. This might be a simple data store, where incoming messages are dropped into
a folder for processing.
• Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink.
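The filter-aggregate-emit cycle of stream processing can be sketched with a fixed-window count. Everything in the sketch — the timestamped messages, the ten-second window, the list standing in for an output sink — is an assumption for illustration; production systems use dedicated stream processors for the same pattern.

```python
# Minimal sketch of stream processing: consume time-ordered messages,
# count events per key over fixed time windows, and emit each completed
# window to an output sink (here, a plain list).
from collections import defaultdict

def process_stream(messages, window_seconds, sink):
    current_window = None
    counts = defaultdict(int)
    for ts, key in messages:                 # messages arrive in time order
        window = ts - ts % window_seconds    # start of this message's window
        if current_window is not None and window != current_window:
            sink.append((current_window, dict(counts)))  # emit closed window
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if current_window is not None:           # flush the final open window
        sink.append((current_window, dict(counts)))

sink = []
events = [(0, "click"), (5, "view"), (12, "click"), (14, "click")]
process_stream(events, window_seconds=10, sink=sink)
```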
• Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical tools.
The analytical data store used to serve these queries can be a Kimball-style relational
data warehouse.
• Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modeling layer, such as a multidimensional OLAP
cube or tabular data model in Azure Analysis Services. It might also support self-
service BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel.
• Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple
sources and sinks, load the processed data into an analytical data store, or push the results
straight to a report or dashboard.
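The orchestration idea — a workflow of repeated processing steps run in a fixed order — can be sketched as an ordered list of named steps that each transform a running pipeline state. The step names and the state shape are invented; real orchestrators add scheduling, retries, and dependency graphs on top of this core loop.

```python
# Minimal sketch of orchestration: a workflow as an ordered list of named
# steps, each a function that transforms the pipeline state and is logged.

def run_pipeline(steps, state):
    for name, step in steps:               # execute steps in declared order
        state = step(state)
        state.setdefault("log", []).append(name)
    return state

pipeline = [
    ("ingest",    lambda s: {**s, "raw": [3, 1, 2]}),
    ("transform", lambda s: {**s, "clean": sorted(s["raw"])}),
    ("load",      lambda s: {**s, "stored": True}),
]
result = run_pipeline(pipeline, {})
```

The "log" entry shows the workflow's execution order, which is what an orchestrator would surface in its run history.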
Chapter 4: Components of Big Data
Big data projects have a number of different layers of abstraction, from abstraction of the
data through to running analytics against the abstracted data. The following figure shows the
basic elements of an analytical big data stack and their interrelationships. The higher-level
components help make big data projects easier and more dynamic. Hadoop is often at the
center of big data projects, but it is not a precondition.
Chapter 5: Technology
Chapter 6: Applications
Chapter 7: Conclusion
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history of data
analysis. The convergence of these trends means that, for the first time in history, we have
the capabilities required to analyze astonishing data sets quickly and cost-effectively.
As more and more data is generated and collected, data analysis requires scalable, flexible,
and high performing tools to provide insights in a timely fashion. However, organizations are
facing a growing big data ecosystem where new tools emerge and “die” very quickly.
Therefore, it can be very difficult to keep pace and choose the right tools.
The Age of Big Data is here, and these are truly revolutionary times if both business and
technology professionals continue to work together and deliver on the promise.
Chapter 8: References
[1] https://www.oracle.com/in/big-data/what-is-big-data/
[2] https://learn.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data
[3] https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781784391409/1/ch01lvl1sec12/components-of-the-big-data-ecosystem
[4] https://www.techtarget.com/searchdatamanagement/definition/big-data
[5] https://www.javatpoint.com/what-is-big-data
[6] https://www.sap.com/india/insights/what-is-big-data.html