

Big Data Analytics


Unit-1 (Chapter - 2: Big Data Storage)
• Clusters,
• File Systems and Distributed File Systems,
• NoSQL,
• Sharding,
• Replication,
• Combining Sharding and Replication,
• On Disk Storage Devices,
• In-memory Storage Devices.

Clusters:
Clusters in big data storage are groups of interconnected servers (nodes) that work together to store,
manage, and process large volumes of data efficiently. These clusters enable distributed storage,
parallel processing, and high availability of big data.

These servers usually have the same hardware specifications and are connected via a network
to work as a single unit, as represented in the figure below. Each node in the cluster has its own
dedicated resources, such as memory, a processor, and a hard drive. A cluster can execute a task by
splitting it into small pieces and distributing them for execution across the computers that belong to
the cluster.

Advantages of Clustered Big Data Storage


❖ Scalability: Easily add more nodes for increased capacity.
❖ Fault Tolerance: Data replication ensures high availability.
❖ High Performance: Parallel processing speeds up data retrieval.
❖ Cost-Effectiveness: Uses commodity hardware, reducing storage costs.
❖ Efficient Load Balancing: Workload is distributed across multiple nodes.


Main Components of a Big Data Storage Cluster:


The main components of a big data storage cluster are as follows:
1. HDFS (Hadoop Distributed File System),
2. NameNode,
3. DataNode,
4. YARN (Yet Another Resource Negotiator).

1. HDFS (Hadoop Distributed File System):


The primary storage system for large datasets, distributing data across multiple DataNodes for
fault tolerance and scalability.
2. NameNode:
A central server responsible for managing the file system namespace, tracking file locations, and
coordinating data access across the cluster.
3. DataNode:
Individual nodes in the cluster that store data blocks from HDFS, replicating data across multiple
nodes for redundancy.
4. YARN (Yet Another Resource Negotiator):
A resource management system that allocates computing resources (CPU, memory) to running
applications within the cluster, allowing for efficient job scheduling and resource utilization.

File Systems and Distributed File Systems:


File System:
A file system is the method an operating system uses to store, organize, retrieve, and manage data on
storage devices such as hard drives, SSDs, and USB drives. A file system provides a logical view of the
data stored on the storage device and presents it as a tree structure of directories and files.

Types of File System:


Following are the types of File system:
1. FAT (File Allocation Table): An older file system used by early versions of Windows and other
operating systems.
2. NTFS (New Technology File System): A modern file system used by Windows. It provides features
such as file and folder permissions, encryption, and file compression.
3. Ext (Extended File System): A family of file systems commonly used on Linux-based operating systems.
4. exFAT (Extended File Allocation Table): A file system designed for flash drives such as USB memory
sticks and SD cards.


Distributed File Systems (DFS):


A distributed file system is a file system that can store large files spread across the nodes of a cluster.
A Distributed File System (DFS) falls under the category of network file systems because it enables
access to files over a network across multiple computers while appearing as a single storage system to users.

Examples of Distributed File System (DFS):


Examples of Distributed File Systems are as follows:
❖ Google File System (GFS),
❖ Hadoop Distributed File System (HDFS).

Components of a Distributed File System (DFS):


A Distributed File System (DFS) consists of several key components that work together to provide file
storage, access and management across multiple networked machines.
Following are the components:
1. Client
• Users or applications that request files from the DFS.
• Sends read/write requests to the DFS.
• Uses APIs or system calls to interact with the file system.
2. Server
• Manages file storage and retrieval.
• Processes client requests for files.
• Can be distributed across multiple machines.
3. Name Node
• Stores information about file locations, permissions, and structure.
• Does not store actual file data, only metadata.
• Example: HDFS NameNode in Hadoop.
4. Data Node
• Stores actual file data in chunks or blocks.
• Spread across multiple machines for redundancy.
• Example: HDFS DataNodes.
5. Replication Manager
• Ensures file copies exist across different nodes for fault tolerance.
• If one node fails, another node provides the data.
• Example: Google File System (GFS) maintains multiple replicas of data blocks.


Key Features of Distributed File System (DFS):


Following are the key features of Distributed File System:
1. File sharing across multiple nodes (computers in a network).
2. Fault tolerance (redundancy in case of failure).
3. Scalability (supports growing data needs).
4. Replication & caching (to enhance performance and availability).

Comparison between the Distributed File System and the Traditional File System
The following table compares a distributed file system with a traditional file system.


Apache Hadoop
Hadoop has two core components:
a) HDFS
b) MapReduce

a) HDFS
HDFS is a distributed file system developed by the Apache Software Foundation to store very large
files across multiple machines in a cluster. Each file is split into blocks, and each block is replicated
across multiple nodes to ensure reliability and availability in the event of hardware failure.

The two components of HDFS are NameNode and DataNode.


NameNode:
❖ The NameNode is a master server (node) that manages the file system namespace and
regulates access to files by clients.
❖ It stores the metadata of all the files in the file system and tracks where each file is kept
across the cluster.
❖ It controls access to the files.
❖ Two or more blocks can be read in parallel, which makes reads efficient.
❖ It provides authorization and authentication, ensuring secure access to files within the
file system.
❖ The NameNode is a single point of failure: if it fails, the entire HDFS cluster becomes
inaccessible (the DataNodes cannot be reached).

DataNode:
❖ DataNodes are the worker nodes that store the actual data in the HDFS cluster.
❖ DataNodes store the actual data blocks of files in HDFS.
❖ DataNodes perform read and write operations.
❖ DataNodes send heartbeat signals to the NameNode to confirm that they are alive and functioning.
❖ DataNodes store and retrieve blocks when requested by clients or the NameNode.
❖ DataNodes send block reports to the NameNode at regular intervals to describe the blocks
they are storing.


Block Creation:
❖ The HDFS Client divides a file into fixed-size blocks before storing it in HDFS. The file is broken
into large blocks and each block is stored independently to ensure data availability and fault
tolerance.
❖ The default block size is 128 MB (earlier Hadoop versions used 64 MB).
❖ The NameNode assigns blocks to different DataNodes for storage.
❖ The DataNodes store these blocks and manage replication for fault tolerance.

Block Report:
❖ Block report is a periodic report sent by a DataNode to the NameNode.
❖ This report contains detailed information about all the blocks that the DataNode is currently
storing.

Block report contains


❖ Block ID: Unique identifier for each block stored on the DataNode.
❖ Block size: The size of each block.
❖ Block replicas: Replica information for each block stored on the DataNode.

Role of Block reports in HDFS:


1. The NameNode uses block reports to update its metadata.
2. If a DataNode fails, then the NameNode can initiate re-replication from existing replicas on the
other DataNodes.
3. It helps to monitor the health of the cluster.
4. Data integrity and Data availability can be achieved using block report.

Block report process:


1. The DataNode generates a block report that lists all the blocks it is storing.
2. The DataNode sends the report to the NameNode at regular intervals.
3. The NameNode processes the block report, updating its metadata to reflect the current state
of block storage.
4. Based on the block reports, the NameNode can trigger various actions, such as initiating
replication for under-replicated blocks or removing metadata for blocks that no longer exist.

Replication factor:
❖ The number of copies of each block of data stored across different DataNodes in the HDFS
cluster.
❖ The default replication factor in HDFS is 3.
❖ If an HDFS cluster has fewer than eight DataNodes, a replication factor of 2 is used.
❖ If the cluster has eight or more DataNodes, the replication factor is 3.
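As a quick worked example (illustrative arithmetic only), the sketch below computes how many blocks a 500 MB file occupies with the default 128 MB block size, and how much raw cluster storage it consumes at a replication factor of 3:

```python
# Illustrative arithmetic: blocks and raw storage for a 500 MB file in HDFS
# with a 128 MB block size and a replication factor of 3.
import math

file_size_mb = 500
block_size_mb = 128
replication_factor = 3

# 4 blocks; the last block is only 116 MB, and HDFS does not pad it.
num_blocks = math.ceil(file_size_mb / block_size_mb)
# 1500 MB of raw storage consumed across the cluster.
raw_storage_mb = file_size_mb * replication_factor

print(num_blocks, "blocks,", raw_storage_mb, "MB of raw storage")
```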


Anatomy of File READ:


In HDFS (Hadoop Distributed File System), reading a file works differently from a local file system
because the file is stored as blocks across multiple nodes in a distributed system.

Representation for Anatomy of File Read

Step-by-Step Anatomy of File Read in HDFS:


1. Client Requests to Read a File
❖ The client (user/application) asks the NameNode (the master node) for the file’s
metadata.
❖ The NameNode does not store the file data but knows which DataNodes (worker
nodes) hold the file’s blocks.
2. NameNode Responds with Block Locations
❖ The NameNode returns a list of DataNodes where each block of the file is stored.
❖ The client gets this information and directly contacts the nearest DataNode for each
block.
3. Client Reads from DataNodes
❖ The client reads each block in parallel from the DataNodes.
❖ If one DataNode is busy or down, the client can fetch the block from a replica stored
on another node.
4. Data Is Assembled and Processed
❖ The client reassembles the blocks to form the complete file.
❖ The application processes the data as needed.
5. File Read Complete
❖ The client finishes reading and closes the connection.
The HDFS read operation is designed for fast, parallel, and fault-tolerant data retrieval. The
NameNode provides metadata, and the client directly fetches data from DataNodes, ensuring
minimal bottlenecks. The system automatically handles failures by using replicated blocks, allowing
seamless reading even if some nodes are down. This makes HDFS ideal for big data processing
where large files need to be read efficiently.
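As a minimal sketch of the client side of this read path, the snippet below uses PyArrow's HDFS binding; the NameNode host name and file path are hypothetical, and it assumes a reachable cluster with the Hadoop client libraries (libhdfs) installed:

```python
from pyarrow import fs

# Contact the NameNode for metadata; block data is then streamed from DataNodes.
hdfs = fs.HadoopFileSystem("namenode-host", port=8020)  # hypothetical host

with hdfs.open_input_stream("/data/logs/events.txt") as stream:  # hypothetical path
    data = stream.read()  # blocks are fetched and reassembled transparently

print(len(data), "bytes read")
```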


Anatomy of File WRITE:


When writing a file to HDFS, the process is slightly more complex than reading because it ensures
fault tolerance and replication.

Representation for Anatomy of File Write

Step-by-Step Anatomy of File Write in HDFS:


1. Client Requests to Write a File
❖ The client contacts the NameNode to create a new file in HDFS.
❖ The NameNode checks if the file already exists and if the user has permission to write.
2. File is Split into Blocks
❖ The file is broken into blocks (default 128 MB; the block size is configurable, e.g., 256 MB).
❖ The NameNode assigns DataNodes to store these blocks based on availability and
replication policies.
3. Data Streaming Begins
❖ The client writes the first block to the first DataNode in a pipeline fashion.
❖ The first DataNode immediately forwards it to the second DataNode, which forwards
it to the third DataNode (for replication).
4. Acknowledgment Flow
❖ Once all replicas are written, the third DataNode sends an acknowledgment to the
second,
❖ The second sends it to the first, and then
❖ The first sends it back to the client.
❖ This process repeats for each block of the file.
5. Final Commit & Metadata Update
❖ When all blocks are written successfully, the client closes the file.
❖ The NameNode updates the metadata to confirm the file is successfully written.
The HDFS write operation ensures fault tolerance and scalability through replication and a pipeline
write mechanism. The NameNode manages metadata, while data is written sequentially to multiple
DataNodes, ensuring redundancy. Acknowledgments confirm successful writes, making the system
reliable. This structured approach allows high-throughput data ingestion, making HDFS suitable for
distributed storage in large-scale applications.
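A corresponding client-side write, again sketched with PyArrow under the same assumptions (hypothetical host and path, reachable cluster), looks like this; the block splitting, pipelining, and acknowledgments described above happen behind the single write call:

```python
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", port=8020)  # hypothetical host

# HDFS splits the stream into blocks and pipelines each block through the
# assigned DataNodes for replication; the client just writes bytes.
with hdfs.open_output_stream("/data/ingest/new_file.txt") as stream:
    stream.write(b"record-1\nrecord-2\n")
```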


NoSQL
A Not-only SQL (NoSQL) database is a non-relational database that is highly scalable, fault-tolerant
and specifically designed to house semi-structured and unstructured data. A NoSQL database often
provides an API-based query interface that can be called from within an application. NoSQL
databases also support query languages other than Structured Query Language (SQL) because SQL
was designed to query structured data stored within a relational database. As an example, a NoSQL
database that is optimized to store XML files will often use XQuery as the query language. Likewise,
a NoSQL database designed to store RDF data will use SPARQL to query the relationships it contains.
NoSQL databases play a crucial role in Big Data Analytics by offering scalability,
flexibility, and high-speed processing. Traditional relational databases (RDBMS) struggle with the
volume, variety, velocity, and veracity of big data, whereas NoSQL databases efficiently handle these
challenges.

Why NoSQL is important for Big Data?


NoSQL is important for Big Data for the following reasons:
❖ Scalability: NoSQL databases can scale horizontally by adding more servers, making them ideal
for large-scale data storage and retrieval.
❖ Schema Flexibility: Unlike rigid schemas in SQL, NoSQL databases allow dynamic schema
changes, accommodating different data formats.
❖ High-Speed Processing: Optimized for read and write performance, NoSQL enables real-time
analytics.
❖ Handling Unstructured Data: Supports various data types such as text, images, videos & logs.
❖ Distributed Computing: Works seamlessly with distributed frameworks like Hadoop and
Apache Spark for big data processing.

Types of NoSQL Databases Used in Big Data Analytics


1. Document Stores (MongoDB, CouchDB)
❖ Best for semi-structured data like JSON or XML.
❖ Used in content management systems, real-time big data analytics.
2. Column-Family Stores (Apache Cassandra, HBase)
❖ Optimized for large-scale, read-intensive applications.
❖ Used in log processing, recommendation systems, and IoT analytics.
3. Key-Value Stores (Redis, DynamoDB)
❖ Best for caching and high-speed lookups.
❖ Used in session management, real-time analytics, and financial transactions.
4. Graph Databases (Neo4j, ArangoDB)
❖ Ideal for relationship-based data analysis.
❖ Used in social networks, fraud detection, and recommendation engines.


Use Cases of NoSQL


❖ E-commerce – Recommendation engines using NoSQL (e.g., MongoDB, Cassandra).
❖ Finance & Banking – Fraud detection with NoSQL graph databases.
❖ IoT & Sensor Data – Handling time-series data efficiently.
❖ Healthcare – Storing and analyzing patient records.

Sharding
Sharding is the process of horizontally partitioning a large dataset into a collection of smaller, more
manageable datasets called shards. The shards are distributed across multiple nodes, where a node is
a server or a machine. Each shard is stored on a separate node and each node is responsible for only
the data stored on it. Each shard shares the same schema, and all shards collectively represent the
complete dataset.
Sharding applies to traditional databases as well as blockchains, and it improves scalability and
processing speed.

Sharding is often transparent to the client, but this is not a requirement. Sharding allows the
distribution of processing loads across multiple nodes to achieve horizontal scalability. Horizontal
scaling is a method for increasing a system’s capacity by adding similar or higher capacity resources
alongside existing resources. Since each node is responsible for only a part of the whole dataset,
read/write times are greatly improved.

How does database sharding work?


Now, let's see how sharding works. A typical unsharded database stores data in multiple columns and
rows. Database sharding splits a single dataset into partitions, or shards. Each shard contains unique
rows of data, which are stored separately across multiple computers, called nodes. All the shards run
on separate nodes but share the same schema or design. The partitioned data chunks are called logical
shards, and a machine that stores a logical shard is called a physical shard or database node. Keep
in mind that a physical shard can contain multiple logical shards that belong to the same table.
Software developers use a shard key to decide how to partition the dataset: a column in the
dataset, such as the customer's name, determines which rows of data group together to form a shard.
Developers can choose a shard key from an existing column or create a new one. Database sharding
operates on a shared-nothing architecture; each physical shard operates independently and is unaware
of the other shards. Only the physical shards containing the data you request will process that data, in
parallel.


Above the shards sits another software abstraction layer that coordinates data storage
and access across them. Some databases have automatic sharding features built in; in others,
the software developer must write sharding code to store or retrieve data from the correct shard.
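A minimal sketch of such application-level sharding logic is shown below, using a stable hash of the shard key to pick a node; the node names are hypothetical, and a production system would add rebalancing and replication on top:

```python
import hashlib

SHARD_NODES = ["shard-node-0", "shard-node-1", "shard-node-2"]  # hypothetical nodes

def shard_for(shard_key: str) -> str:
    """Map a shard key to a physical shard using a stable hash."""
    digest = int(hashlib.md5(shard_key.encode()).hexdigest(), 16)
    return SHARD_NODES[digest % len(SHARD_NODES)]

print(shard_for("customer:alice"))  # the same key always routes to the same node
print(shard_for("customer:bob"))
```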

Benefits of Sharding:
❖ Horizontal Scalability – Instead of storing all data on a single server, data is spread across
multiple machines, enabling efficient storage and retrieval.
❖ Load Balancing – Reduces the burden on a single database server, improving performance.
❖ High Availability – If one shard fails, the rest of the system can continue functioning.
❖ Faster Queries – Queries are executed only on relevant shards, reducing processing time.
❖ Increased read/write throughput — By distributing the dataset across multiple shards, both
read and write operation capacity is increased as long as read and write operations are
confined to a single shard.
❖ Increased storage capacity — Similarly, by increasing the number of shards, you can also
increase overall total storage capacity, allowing near-infinite scalability.


Replication
Replication stores multiple copies of a dataset, known as replicas, on multiple nodes (see the figure below).
Replication provides scalability and availability because the same data is replicated on
various nodes. Fault tolerance is also achieved, since data redundancy ensures that data is not lost
when an individual node fails.
There are two different methods that are used to implement replication:
a. master-slave
b. peer-to-peer

Above figure shows an example of replication where a dataset is replicated to Node A and Node B, resulting in
Replica A and Replica B.


Master-Slave
During master-slave replication, nodes are arranged in a master-slave configuration, and all
data is written to a master node. Once saved, the data is replicated to multiple slave
nodes. All external write requests, including inserts, updates and deletes, occur on the master
node, whereas read requests can be fulfilled by any slave node. In the figure below, writes are
managed by the master node, and data can be read from either Slave A or Slave B.

Above figure showing an example of master-slave replication where Master A is the single point of contact for
all writes, and data can be read from Slave A and Slave B.

Master-slave replication is ideal for read-intensive rather than write-intensive loads, since growing
read demands can be met by scaling horizontally to add more slave nodes. Writes are consistent,
as all writes are coordinated by the master node; the implication is that write performance will suffer
as the volume of writes increases. If the master node fails, reads are still possible via any of the slave
nodes. A slave node can be configured as a backup for the master node. In the event that the
master node fails, writes are not supported until a master node is re-established, either by restoring
the master from a backup or by choosing a new master from among the slave nodes.
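The routing rule described above (all writes to the master, reads spread across the slaves) can be sketched in a few lines; the node names are hypothetical, and a real system would add failover handling:

```python
import itertools

class ReplicaRouter:
    """Route writes to the master and spread reads across slaves (round-robin)."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def node_for(self, operation: str):
        return self.master if operation == "write" else next(self._slaves)

router = ReplicaRouter("master-node", ["slave-a", "slave-b"])
print(router.node_for("write"))  # master-node
print(router.node_for("read"))   # slave-a
print(router.node_for("read"))   # slave-b
```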
One concern with master-slave replication is read inconsistency. In master-slave replication, data
updates first happen on the master and are then copied to the slave nodes; if a user reads data from
a slave before it has been updated, they might see old or inconsistent data.


To solve this, a voting system can be used. Here’s how it works:


❖ When a user requests data, multiple slave nodes are checked.
❖ If most slaves (the majority) have the same version of the data, it is considered consistent
and given to the user.
❖ This requires fast and reliable communication between slaves to quickly compare data
versions.
This method helps ensure users always get the most accurate data, reducing inconsistencies.

Below figure illustrates a scenario where read inconsistency occurs.


1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read the data from Slave
B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is updated by the Master.

Above figure shows an example of master-slave replication where read inconsistency occurs.


Peer-to-Peer
With peer-to-peer replication, all nodes operate at the same level; in other words, there is
no master-slave relationship between the nodes. Each node, known as a peer, is equally
capable of handling reads and writes. Each write is copied to all peers, as illustrated in
the figure below.

Above figure shows how writes are copied to Peers A, B and C simultaneously. Data is read from Peer A, but it
can also be read from Peers B or C.

Peer-to-peer replication is prone to write inconsistencies, which occur when multiple peers update
the same data simultaneously (conflicting updates). This can be addressed by implementing either a
pessimistic or an optimistic concurrency strategy.
There are two ways to handle this:
1. Pessimistic Concurrency (Prevention)
❖ Uses locking to prevent conflicts.
❖ Only one update is allowed at a time.
❖ Ensures consistency but reduces availability (others must wait).
2. Optimistic Concurrency (Resolution)
❖ No locking, allowing temporary inconsistencies.
❖ Peers update independently and later sync up.
❖ Ensures availability but might cause temporary inconsistencies.
To ensure read consistency, a voting system can be implemented where a read is declared
consistent if the majority of the peers contain the same version of the record. As
previously indicated, implementation of such a voting system requires a reliable and fast
communication mechanism between the peers.
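A minimal sketch of such a majority-vote read is given below; it simply checks whether a strict majority of the replica responses agree on a version:

```python
from collections import Counter

def quorum_read(replica_values):
    """Return the value if a strict majority of replicas agree, else None."""
    value, votes = Counter(replica_values).most_common(1)[0]
    return value if votes > len(replica_values) // 2 else None

print(quorum_read(["v2", "v2", "v1"]))  # 'v2' -- majority agrees, read is consistent
print(quorum_read(["v1", "v2", "v3"]))  # None -- no majority, retry or resolve
```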


Below figure demonstrates a scenario where an inconsistent read occurs.


1. User A updates data.
2. a. The data is copied over to Peer A.
b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B tries to read the data from Peer C, resulting in an
inconsistent read.
4. The data will eventually be updated on Peer C, and the database will once again become
consistent.

Above Figure shows an example of peer-to-peer replication where an inconsistent read occurs.

To improve on the limited fault tolerance offered by sharding, while additionally benefiting from the
increased availability and scalability of replication, sharding and replication can be combined, as
shown in the figure below. This can take the form of sharding combined with master-slave
replication, or sharding combined with peer-to-peer replication.


The above figure shows a comparison of sharding and replication that shows how a dataset is
distributed between two nodes with the different approaches.


Combining Sharding and Master-Slave Replication


When sharding is combined with master-slave replication, multiple shards become slaves of a single
master, and the master itself is a shard. Although this results in multiple masters, a single slave-shard
can only be managed by a single master-shard. Write consistency is maintained by the master-shard.
However, if the master-shard becomes non-operational or a network outage occurs, fault tolerance
with regard to write operations is impacted. Replicas of shards are kept on multiple slave nodes to
provide scalability and fault tolerance for read operations.
In below figure:
❖ Each node acts both as a master and a slave for different shards.
❖ Writes (id = 2) to Shard A are regulated by Node A, as it is the master for Shard A.
❖ Node A replicates data (id = 2) to Node B, which is a slave for Shard A.
❖ Reads (id = 4) can be served directly by either Node B or Node C, as they each contain Shard B.

The above figure shows the combination of sharding and master-slave replication

Working:
❖ Each shard has its own master node and multiple slave nodes.
❖ Write operations occur on the master node of the respective shard.
❖ Read operations can be distributed across slave nodes to optimize performance.
❖ If a master fails, a failover mechanism promotes one of the slaves to be the new master.

Benefits of Combining Sharding and Master-Slave Replication:


❖ Scalability: Sharding distributes data across multiple servers, reducing the load on any single
database instance.
❖ High Availability: Master-slave replication ensures that even if one database instance fails,
read operations can continue from replicas.
❖ Load Balancing: Read queries can be directed to slaves, while writes go to the master,
reducing contention.
❖ Fault Tolerance: If a master node fails, a slave can be promoted to master to maintain uptime.


Common Challenges of Combining Sharding and Master-Slave Replication:


❖ Complexity: Managing both sharding and replication increases system complexity, requiring
careful design.
❖ Increased Latency: Cross-shard queries may involve multiple database instances, increasing
response time.

Combining Sharding and Peer-to-Peer Replication


When combining sharding with peer-to-peer replication, each shard is replicated to multiple peers,
and each peer is only responsible for a subset of the overall dataset. Collectively, this helps achieve
increased scalability and fault tolerance. As there is no master involved, there is no single point of
failure and fault-tolerance for both read and write operations is supported.
In below figure:
❖ Each node contains replicas of two different shards.
❖ Writes (id = 3) are replicated to both Node A and Node C (peers), as they are responsible for Shard C.
❖ Reads (id = 6) can be served by either Node B or Node C as they each contain Shard B.

The above figure shows the combination of sharding and peer-to-peer replication.

Working of Combining Sharding and Peer-to-Peer Replication:


1. Sharding is applied first – The dataset is divided into multiple shards. Each shard contains a
subset of the total data.
2. Each shard is then replicated using a peer-to-peer replication model – All nodes within a
shard maintain copies of the shard’s data.
3. Reads and writes are distributed:
❖ Writes go to any peer in the shard and are propagated to other peers.
❖ Reads can be performed from any peer, improving response times.
4. Global consistency mechanisms (like conflict resolution strategies) are needed to handle
data synchronization issues.


Common Challenges in Combining Sharding and Peer-to-Peer Replication:


❖ Conflict Resolution: Since multiple peers can accept writes, conflicts may arise that
need resolution strategies (e.g., timestamps, vector clocks).
❖ Increased Latency: Synchronization across peers introduces replication lag, affecting
real-time updates.

*****************************************************************

On-Disk Storage Devices


On-disk storage means saving data on physical storage devices like hard disk drives (HDDs) or solid-state
drives (SSDs) for long-term use. Since HDDs are cheaper and can store large amounts of data, they are
commonly used for this purpose.
There are two main ways to store data on disks:
1. Using a Distributed File System (DFS): A Distributed File System (DFS) stores data across multiple
computers instead of just one. It splits large files into smaller pieces and saves them on different disks.
This method makes data easily accessible, scalable and safe from loss.
Examples: Hadoop Distributed File System (HDFS), Google File System (GFS).
2. Using a Database: A database organizes data into tables, rows, and columns so it can be quickly
searched and retrieved. Some databases store data across multiple disks, ensuring faster access and
data protection.
Examples: MySQL (an RDBMS), Apache Cassandra (a NoSQL database).

Above figure shows how On-disk storage can be implemented with a distributed file system or a database.


Distributed File Systems:


A distributed file system is a storage system that spreads data across multiple computers instead of
keeping it in just one place. It can store huge amounts of data and ensures the data is always available
by keeping copies in different locations. It is good at storing and processing large files quickly, but it
struggles with many small files. It provides automatic backup, high availability and fast performance,
making it ideal for Big Data applications like video storage, log analysis and large-scale analytics.

Following points explains about DFS:


1. DFS Can Store Any Type of Data (Schema-less Storage): DFS doesn’t care what kind of data
is being stored—it can be structured or unstructured (like videos, images, or logs). It does
not need a fixed format like databases do.
2. Automatic Backup & High Availability: DFS automatically makes copies of data and stores
them in multiple locations (this is called replication). If one disk or computer fails, the data is
still safe and accessible from another copy.
3. Great for Storing Large, Non-Relational Data: DFS is best for storing huge files, such as
videos, logs, or backup files. It is not designed for structured data like databases.
4. Fast Reading & Writing, But Limited Control: DFS writes and reads data quickly, but it does
not have advanced control like databases. It uses simple file locking to manage multiple
users accessing data at the same time.
5. Not Good for Many Small Files: If there are too many small files, the system slows down
because it has to keep searching for different files on the disk (disk-seek activity). Processing
many small files also takes more time because each file needs a separate process.
6. Best for Large Files Read in Order: DFS works best with fewer but larger files that are read
one after another (sequentially). To improve performance, small files are often combined
into bigger ones to speed up storage and processing.
7. Optimized for Streaming Data: DFS is very efficient for streaming large files (like videos or
sensor data) where data is read in sequence.

The figure above shows a distributed file system accessing data in streaming mode, with no random reads or writes.

A Distributed File System (DFS) is ideal for storing large raw datasets or archiving data for a long time.
It is cost-effective because you can keep adding more disks instead of moving data to offline storage
like tapes. However, DFS does not have built-in search functionality to look inside files—you need extra
tools to search for file contents.


RDBMS Databases:
RDBMS is great for quick, small transactions, but not ideal for handling very large, growing datasets
because scaling is expensive and failure risks are high.

Following points explains about RDBMS Databases:


1. Good for Small, Fast Transactions: A Relational Database (RDBMS) is best for handling small
amounts of data that need quick and random access, like bank transactions or user records.
2. Follows Strict Rules (ACID Compliance): It ensures data is always correct and safe by
following strict rules (the ACID properties). To maintain these rules, it usually runs on just one
computer (a single node).
3. No Built-in Backup or Fault Tolerance: If the system fails, data may be lost because RDBMS
doesn’t automatically make copies.
4. Scaling is Expensive and Hard: To store huge amounts of fast-arriving data, RDBMS needs to
be scaled up (vertical scaling) by adding more powerful hardware (CPU, RAM, storage). This
is very costly and harder to upgrade compared to systems that spread data across multiple
computers (horizontal scaling).
5. Some RDBMS Can Use Multiple Computers, But Have Risks: Some databases like IBM DB2,
Oracle RAC, and Microsoft PDW can work across multiple computers. However, they still rely
on shared storage, which means if that storage fails, the whole system fails.

The figure above shows how a clustered relational database uses a shared storage architecture, which is a potential
single point of failure that affects the availability of the database.


Sharding is a method of splitting a large database into smaller pieces (shards) to improve
performance. In relational databases (RDBMS), however, sharding is not automatic: the application
itself must decide where to store and find data, which makes things more complicated as data grows.
The challenges of manual sharding are:
1. The application must keep track of which shard holds which data.
2. A query that needs data from multiple shards becomes more complicated and slower.
3. Data movement or rebalancing (when shards grow too big) is difficult to manage.

The following steps are shown in the below figure:


1. A user writes a record (id = 2).
2. The application logic determines which shard it should be written to.
3. It is sent to the shard determined by the application logic.
4. The user reads a record (id = 4), and the application logic determines which shard contains the data.
5. The data is read and returned to the application.
6. The application then returns the record to the user.

Above figure shows how a relational database is manually sharded using application logic.


The following steps are shown in the below figure:


1. A user requests multiple records (id = 1, 3) and the application logic is used to determine which
shards need to be read.
2. It is determined by the application logic that both Shard A and Shard B need to be read.
3. The data is read and joined by the application.
4. Finally, the data is returned to the user.

Above figure shows an example of the use of the application logic to join data retrieved from multiple shards.

Relational databases require structured data (following a fixed schema). They do not support semi-
structured or unstructured data easily.
An RDBMS is not ideal for Big Data for the following reasons:
❖ Schema restriction: Data must match a predefined format before being stored.
❖ Slow writes: Checking data against rules before inserting or updating adds delay (latency).
❖ Not suited for fast data: They struggle with high-speed data that needs quick storage.
Because of these limitations, relational databases are not the best choice for Big Data, where large,
fast, and diverse data needs to be handled efficiently.


NoSQL databases
NoSQL databases are modern, flexible, and scalable databases designed for Big Data and high-speed
applications. They don't require a fixed structure like relational databases and can store different
types of data, including text, images, and videos.

Characteristics of NoSQL databases are as follows:

❖ Schema-less: Stores data without a fixed structure.

❖ Scales Out: Adds more servers instead of upgrading a single one.

❖ Highly Available: Works even if some servers fail.

❖ Low Cost: Many NoSQL databases are free and work on cheap hardware.

❖ Eventual Consistency: Data updates may take time but will sync across servers.

❖ BASE, Not ACID: Prioritizes availability over strict consistency.

❖ API-based Access: Uses APIs like REST for data queries.

❖ Auto Sharding & Replication: Automatically splits and copies data across servers.

❖ Built-in Caching: No need for external caching systems.

❖ Distributed Queries: Supports searching data across multiple servers.

❖ Polyglot Persistence: Works alongside SQL databases for flexibility.

❖ Aggregate-Focused: Stores data in a way that reduces the need for joins.

Types of NoSQL storage devices: NoSQL storage devices can be divided into four main types based
on the way they store data:
1. Key-Value,
2. Document,
3. Column-Family,
4. Graph.

Key-Value:
Key-value storage devices store data as key-value pairs, like a dictionary or hash table. The key is
unique, and the value can be anything (text, images, videos, etc.).
The working is as follows:
❖ Data is retrieved only using keys.
❖ No complex queries, just insert, delete, or fetch by key.
❖ Very fast because it doesn’t use indexes.
Key-value storage is useful when we:
❖ Need fast reads/writes.
❖ Have independent data (no relations).
❖ Need a simple structure, such as caching or session storage.
Key-value storage is not useful when we:
❖ Need searching/filtering inside values.
❖ Need relations between data.
❖ Need schema consistency.
Examples:
❖ Redis, Riak, Amazon DynamoDB.
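A short sketch of the key-only access pattern, using the redis-py client (assumes a Redis server running on localhost:6379; the key and value are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local server

r.set("session:42", "user=alice;cart=3")  # write a value under a unique key
print(r.get("session:42"))                # fetch by key only -- no querying inside the value
r.delete("session:42")                    # the full operation set: insert, fetch, delete
```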


Document Storage
Document storage devices store data as key-value pairs, but the value is a document (JSON, XML,
BSON). Documents can have nested structures.
The working is as follows:
❖ Unlike key-value storage, databases can search inside documents.
❖ Supports partial updates and indexing for faster queries.
Document storage is useful when:
❖ Storing semi-structured data (e.g., user profiles, invoices).
❖ A flexible schema is needed (easy to add fields).
❖ Fast searching inside documents is required.
Document storage is not useful when:
❖ Complex transactions involving multiple documents are needed.
❖ Strong schema enforcement is required.
Examples: MongoDB, CouchDB, Firebase Firestore.

A depiction of JSON data stored in a document storage device.

The main differences between document storage devices and key-value storage devices are as
follows:
❖ document storage devices are value-aware,
❖ the stored value is self-describing; the schema can be inferred from the structure of the value
or a reference to the schema for the document is included in the value,
❖ a select operation can reference a field inside the aggregate value,
❖ a select operation can retrieve a part of the aggregate value,
❖ partial updates are supported; therefore, a subset of the aggregate can be updated.
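These differences show up directly in client code. Below is a brief sketch with PyMongo (assumes a local MongoDB server; the database, collection, and field names are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
users = client["shop"]["users"]                    # illustrative database/collection

users.insert_one({"name": "Alice", "city": "Belagavi", "orders": [101, 102]})

# Value-aware: a query can reference a field inside the stored document.
print(users.find_one({"city": "Belagavi"}))

# Partial update: modify a subset of the document without rewriting it.
users.update_one({"name": "Alice"}, {"$push": {"orders": 103}})
```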

Column-Family:
Column-family storage stores data in rows and columns, like a relational database, but groups related
columns together (column families). Each row can have a different set of columns (flexible schema),
and it is used for fast reads and writes on large datasets.
Column-family storage is useful when dealing with:
❖ Large-scale, high-speed applications (e.g., analytics, logging).
❖ Massive datasets with a flexible structure.
❖ Distributed databases with automatic partitioning.
Column-family storage is not useful when:
❖ You need complex queries with joins.
❖ You require strict schema enforcement.
❖ Transactions across multiple rows are necessary.
Examples: Apache Cassandra, HBase, ScyllaDB.


The highlighted columns depict the flexible schema feature supported by the column-family databases, where each row can
have a different set of columns.
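A brief sketch of column-family access using the DataStax Python driver for Cassandra (assumes a local node plus a pre-created keyspace `demo` and table `events`; all names are illustrative):

```python
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])    # assumed local Cassandra node
session = cluster.connect("demo")   # keyspace assumed to exist

# Insert one row into a wide, partition-keyed table.
session.execute(
    "INSERT INTO events (device_id, ts, reading) VALUES (%s, %s, %s)",
    ("sensor-7", datetime.now(timezone.utc), 21.5),
)

# Reads are fast when restricted to a single partition (device_id).
for row in session.execute("SELECT * FROM events WHERE device_id = %s", ["sensor-7"]):
    print(row.ts, row.reading)
```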

Graph:
Graph storage devices are used to persist inter-connected entities. Unlike other NoSQL
storage devices, where the emphasis is on the structure of the entities, graph storage
devices place the emphasis on storing the linkages between entities.
Graph Databases
❖ Stores relationships between data instead of just data itself.
❖ Best for interconnected data, like social networks or recommendation systems.
❖ Uses nodes (entities) and edges (connections) to represent data.
❖ Optimized for queries involving relationships, like "Find mutual friends."
Example in real life:
❖ Social networks (friend connections).
❖ Recommendation systems (suggesting products).
❖ Fraud detection (tracking suspicious transactions).
Examples of Graph Databases: Neo4j, ArangoDB, Amazon Neptune.
Useful when:
❖ You need to analyze relationships between data (e.g., social networks, fraud detection).
❖ Queries involve complex connections (e.g., "Find shortest path between 2 users").
❖ You require real-time insights on interconnected data.
❖ Your data has many relationships that need fast traversal.
Not Useful when:
❖ Your data is simple and doesn’t require relationships (a key-value or document database is
better).
❖ You need fast, high-volume transactions without complex queries (a column-family database
is better).
❖ Your application relies heavily on ACID transactions (a relational database may be better).
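A short sketch of a relationship-centric query ("find mutual friends") using the official Neo4j Python driver; the connection details, labels, and data are illustrative:

```python
from neo4j import GraphDatabase

# Assumed local server and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Traverse FRIEND edges from both users and return the shared neighbours.
    result = session.run(
        "MATCH (a:User {name: $a})-[:FRIEND]-(m:User)-[:FRIEND]-(b:User {name: $b}) "
        "RETURN m.name AS mutual",
        a="Alice", b="Bob",
    )
    for record in result:
        print(record["mutual"])

driver.close()
```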


The figure above shows how graph storage devices store entities and their relationships.

NewSQL databases:
NewSQL databases combine the best of both RDBMS and NoSQL. They provide ACID compliance like
traditional relational databases while offering scalability and fault tolerance like NoSQL. Unlike NoSQL,
which follows the BASE model and provides eventual consistency, NewSQL ensures immediate
consistency, making it suitable for large-scale transactional systems. These databases support SQL
syntax, making it easier for organizations to transition from traditional databases while handling high-
volume OLTP transactions and real-time analytics. They are ideal for banking systems and operational
analytics. Examples of NewSQL databases include VoltDB, NuoDB and InnoDB.


In-Memory Storage Devices


In-memory storage devices use RAM (Random Access Memory) instead of hard drives to store data.
This makes them super-fast because they don’t have to wait for a disk to read or write data.

Why are they fast?


❖ No Disk Delay – Regular hard drives take time to spin and find data, but RAM can access it
instantly.
❖ No Transfer Time – Usually, data moves between RAM and the hard drive, but here, everything
stays in RAM, making it much quicker.
How big can in-memory storage be?
❖ You can combine multiple RAMs from different computers (called a cluster) to store huge
amounts of data.
❖ This is great for Big Data analytics, which needs to process large amounts of information in
real-time.
How much faster is it? (refer to the diagram)
❖ Reading 1 MB of data from in-memory storage takes 0.25 milliseconds (ms).
❖ The same amount of data from a hard drive takes 20 ms.
❖ This means in-memory storage is about 80 times faster than regular hard drives!

In-memory storage devices are about 80 times faster at transferring data than on-disk storage devices.

Because of this speed, in-memory storage is used in real-time applications, like banking, gaming, and
big data processing.
An in-memory storage device allows fast data analysis because it keeps data in RAM instead
of on a hard drive. This is called in-memory analytics—it helps process and analyze data instantly,
making quick business decisions possible.

Why is In-Memory Analytics Useful?


❖ Faster Insights – It allows companies to analyze data in real-time, like detecting fraud in
banking or predicting customer trends.
❖ Better for Big Data – It helps process huge amounts of fast-moving data quickly, which is
important in Big Data environments.


Working
❖ Data is stored across multiple computers (a cluster), so even if one system fails, the data is
still available.
❖ More memory or computers (nodes) can be added to increase storage capacity (horizontal
scaling).
Limitations of In-Memory Storage
1. Expensive – RAM is costly compared to hard drives.
2. Limited Space – Even though a 64-bit computer can handle huge memory (16 exabytes),
physical space for RAM is limited.
3. Not Permanent – Unlike hard drives, RAM does not store data permanently. Only the most
important and fresh data is kept, and older data is removed.
Because of its speed, in-memory storage is used in AI, finance, and Big Data processing, where real-
time data insights are essential.

When to Use and Not Use In-Memory Storage

In-memory storage device will be useful when:


❖ Fast Data Processing is Needed – If data is coming in quickly and needs real-time analytics
(e.g., fraud detection in banking).
❖ Always-On Analytics – If the system must constantly analyze data (e.g., monitoring stock
prices).
❖ Instant Querying & Visualization – When interactive reports or real-time dashboards are
needed.
❖ Shared Data Access – If multiple programs need access to the same dataset at the same
time.
❖ Exploratory Data Analysis (EDA) – When testing different algorithms without reloading data
from disk.
❖ Graph-Based Algorithms – Used in social networks and recommendation systems, where the
same data is accessed repeatedly.
❖ Low-Latency Big Data Processing – When building fast Big Data solutions that support ACID
transactions (Atomicity, Consistency, Isolation, Durability).

When is In-Memory Storage NOT Useful?


Avoid in-memory storage when:
❖ Batch Processing is Enough – If data can be processed in groups instead of real-time.
❖ Too Much Data Needs to be Stored – If data must be kept for a long time (in-memory storage
is expensive).
❖ Deep Data Analysis is Needed – If strategic analytics requires accessing massive amounts of
old data.
❖ Data is Too Large for RAM – If the dataset is too big to fit into available memory.
❖ Transitioning from Traditional Storage – If a company is shifting from traditional storage,
setting up in-memory storage requires extra skills and complex setup.
❖ Limited Budget – If upgrading RAM or adding more nodes is too expensive.


Types of In-Memory Storage


1. In-Memory Data Grid (IMDG) – Stores data across multiple computers (nodes) for high
performance and availability.
2. In-Memory Database (IMDB) – Works like a regular database but stores data in memory
instead of on disk for faster queries.

Both use RAM but in different ways.

In-Memory Data Grids


An In-Memory Data Grid (IMDG) is a distributed computing technology that stores data in RAM across
multiple nodes to improve performance, scalability, and reliability. It is commonly used for high-
performance computing applications, caching, and real-time analytics.

Definition:
In-Memory Data Grids (IMDGs) are designed to provide high-speed data access by storing data in
RAM. However, to ensure durability and consistency with traditional storage (like relational
databases or disk-based systems), IMDGs use various techniques to synchronize data between
memory and persistent storage.

An IMDG storage device.

Characteristics of IMDGs
❖ Distributed Architecture – Data is stored across multiple servers/nodes in a cluster.
❖ High-Speed Access – Since data resides in RAM, access times are significantly reduced.
❖ Data Replication & Partitioning – Ensures fault tolerance and scalability by replicating or
partitioning data.
❖ Concurrency Control – Supports multiple users accessing data simultaneously.
❖ Event Processing – Supports real-time event-driven processing.
❖ Persistence Options – Can integrate with databases for durability (e.g., write-through, write-
behind caching).
❖ Support for Transactions – Provides ACID or eventual consistency models.


Example:
Suppose an e-commerce website uses an IMDG for order processing:
1. A user adds items to a cart → Data is stored in the IMDG for quick access.
2. At checkout, the order is validated → IMDG retrieves stock availability in real time.
3. Once the order is confirmed, data is persisted to a traditional database.
4. If an application server crashes, another node with replicated session data handles the
request.

Detailed IMDG Working:


1. Data Distribution & Partitioning
❖ The dataset is split into partitions and distributed across multiple nodes in a cluster.
❖ Each node manages a portion of the data, reducing database load and improving
scalability.
2. Replication for Fault Tolerance
❖ Data can be replicated across multiple nodes to ensure redundancy and fault tolerance.
❖ If a node fails, another node containing a replica takes over to maintain availability.
3. Near-Real-Time Data Processing
❖ IMDGs support parallel processing by distributing computation tasks across nodes.
❖ Ideal for real-time analytics, complex event processing, and machine learning workloads.
4. Query & Transaction Management
❖ Supports SQL-like querying or key-value access for retrieving data quickly.
❖ Some IMDGs provide ACID-compliant transactions or eventual consistency models.
5. Eviction & Persistence Policies
❖ When RAM fills up, the grid can evict old data based on policies (e.g., LRU - Least
Recently Used).
❖ Can be configured with persistence mechanisms like writing to a database or disk.
6. Scalability & Load Balancing
❖ New nodes can be added dynamically to scale horizontally.
❖ Load balancers distribute requests to optimize resource usage.

IMDGs use different techniques to combine with traditional storage:


1. Read-Through – Reads data from disk when needed.
2. Write-Through – Writes to disk immediately after updating RAM.
3. Write-Behind – Writes to disk after a delay to improve speed.
4. Refresh-Ahead – Predicts which data will be needed and pre-loads it into RAM.

Explanation of IMDG Storage Integration Techniques


1. Read-Through
❖ Definition: This technique automatically fetches data from the traditional storage
(disk/database) when it is not found in the in-memory data grid.
❖ Working: The IMDG checks if the requested data is in memory. If the data is not available
in RAM, it reads from the underlying database. The fetched data is then stored in the
IMDG for faster future access.
❖ Example: A web application retrieving user profile data from the database only when it is
not found in the cache.


2. Write-Through
❖ Definition: In this approach, every write operation to the IMDG is immediately written to
the database as well, ensuring consistency.
❖ How It Works: When data is updated or inserted into the IMDG, the same operation is
instantly performed on the database. This ensures that the database always has the
latest data.
❖ Example: A banking application where balance updates must be immediately reflected in
both memory and the database to maintain accuracy.

3. Write-Behind
❖ Definition: This technique improves performance by allowing updates to be stored
in RAM first and written to the database asynchronously after a short delay.
❖ How It Works: The application writes data to the IMDG. Instead of immediately
writing to the database, the IMDG batches multiple updates and writes them to disk
after a defined interval.
❖ Example: An e-commerce system that processes thousands of orders per second but
writes them to the database in bulk to reduce load.

4. Refresh-Ahead
❖ Definition: This technique anticipates which data will be needed soon and preloads
it into RAM from the database before it is requested.
❖ Working: The IMDG analyses historical data access patterns or follows predefined
rules. It proactively loads frequently accessed data into memory to reduce access
latency.
❖ Example: A stock market application preloading the latest stock prices before traders
request them, ensuring real-time updates.
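The read-through and write-through techniques can be sketched with a plain dictionary standing in for both tiers (a real IMDG distributes the in-memory tier across nodes; this is only an illustration of the control flow):

```python
class ReadWriteThroughCache:
    """Minimal read-through / write-through sketch over a backing store."""

    def __init__(self, backing_store):
        self.memory = {}          # the in-memory tier (the "grid")
        self.db = backing_store   # stands in for the persistent database

    def get(self, key):
        # Read-through: on a miss, load from the backing store and cache it.
        if key not in self.memory:
            self.memory[key] = self.db[key]
        return self.memory[key]

    def put(self, key, value):
        # Write-through: update memory and persist to the store immediately.
        self.memory[key] = value
        self.db[key] = value

db = {"user:1": "Alice"}
cache = ReadWriteThroughCache(db)
print(cache.get("user:1"))   # miss -> loaded from the store, then cached
cache.put("user:2", "Bob")   # written to both tiers
```

A write-behind variant would buffer `put` calls and flush them to the store in batches after a delay, while refresh-ahead would preload keys it predicts will be requested soon.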

In-Memory Databases (IMDBs)


Definition:
An In-Memory Database (IMDB) is a type of database that stores and processes data entirely in RAM
(Random Access Memory) rather than traditional disk-based storage. By leveraging high-speed
memory, IMDBs eliminate the latency issues associated with disk I/O operations, resulting in faster
data access and query execution.
Examples include Aerospike, MemSQL, Altibase HDB, eXtreme DB and Pivotal GemFire.

An IMDB storage device is appropriate when:


1. Relational Data with ACID Support - If the data needs to be stored in a structured format
(tables with rows and columns) and must follow ACID rules (Atomicity, Consistency,
Isolation, Durability), IMDB is the right choice.
2. Adding Real-Time Support to an Existing Big Data System - If a company already uses on-
disk storage but needs faster, real-time processing, IMDB can help without replacing
everything.
3. Replacing On-Disk Storage with In-Memory Storage - If the existing database is too slow,
IMDB can completely replace it for ultra-fast data access.
4. Minimizing Changes to SQL-Based Applications - If an application already uses SQL,
switching to an IMDB requires minimal code changes, making the transition easier.
5. When Relational Storage is More Important than Scalability - If structured data (tables,
relationships, constraints, etc.) is more important than expanding storage, then IMDB is a
great choice.


In the below figure:


1. A relational dataset is stored into an IMDB.
2. A client requests a customer record (id = 2) via SQL.
3. The relevant customer record is then returned by the IMDB, which is directly manipulated by
the client without the need for any deserialization.
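SQLite's in-memory mode offers a convenient, minimal stand-in for this behaviour (real IMDBs add clustering and durability on top); the table and data below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the whole database lives in RAM

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# The SQL query is served straight from memory -- no disk I/O involved.
row = conn.execute("SELECT * FROM customers WHERE id = ?", (2,)).fetchone()
print(row)  # (2, 'Bob')
```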

Prof. Prasad Patil,
Department of Computer Applications,
KLE Tech University, Belagavi.