Unit 1 (Chapter 2) - Big Data Storage
Clusters:
In big data storage, clusters are groups of interconnected servers (nodes) that work together to store, manage, and process large volumes of data efficiently. Clusters enable distributed storage, parallel processing, and high availability of big data.
These servers usually have the same hardware specifications and are connected via a network so that they work as a single unit, as represented in the figure below. Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive. A cluster can execute a task by splitting it into small pieces and distributing them across the computers that belong to the cluster.
Comparison between the Distributed File System and the Traditional File System
The following table summarizes the differences between the Distributed File System and the Traditional File System.
Apache Hadoop
Hadoop has two core components:
a) HDFS
b) MapReduce
a) HDFS
HDFS (Hadoop Distributed File System) is a distributed file system developed by the Apache Software Foundation to store very large files across multiple machines in a cluster. Each file is split into blocks, and each block is replicated across multiple nodes to ensure reliability and availability in the event of hardware failure.
DataNode:
❖ DataNodes are the worker nodes of an HDFS cluster; they store the actual data blocks of files in HDFS.
❖ DataNodes perform read and write operations.
❖ DataNodes send heartbeat signals to the NameNode to confirm that they are alive and functioning.
❖ DataNodes store and retrieve blocks when requested by clients or the NameNode.
❖ DataNodes send block reports to the NameNode at regular intervals to inform it about the blocks they are storing.
Block Creation:
❖ The HDFS client divides a file into fixed-size blocks before storing it in HDFS. The file is broken into large blocks, and each block is stored independently to ensure data availability and fault tolerance.
❖ The default block size is 128 MB (older Hadoop versions used a 64 MB default).
❖ The NameNode assigns blocks to different DataNodes for storage.
❖ The DataNodes store these blocks and manage replication for fault tolerance.
Block Report:
❖ A block report is a periodic report sent by a DataNode to the NameNode.
❖ The report contains detailed information about all the blocks that the DataNode is currently storing.
Replication factor:
❖ The replication factor is the number of copies of each data block stored across different DataNodes in the HDFS cluster.
❖ The default replication factor in HDFS is 3.
❖ If the number of DataNodes in an HDFS cluster is less than 8, the replication factor is typically set to 2.
❖ If the number of DataNodes is 8 or more, the replication factor is 3.
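The block-splitting and replica-placement steps above can be sketched in Python. This is a minimal illustration, not HDFS's actual implementation: the round-robin placement and the DataNode names are assumptions (real HDFS uses a rack-aware placement policy).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of the given size."""
    blocks, offset, index = [], 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((index, length))
        offset += length
        index += 1
    return blocks

def place_replicas(block_index, datanodes, replication_factor=3):
    """Pick `replication_factor` distinct DataNodes for a block (round-robin sketch)."""
    start = block_index % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)]
            for i in range(replication_factor)]

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNodes
placements = {i: place_replicas(i, nodes) for i, _ in blocks}
```

Each of the three blocks ends up on three distinct DataNodes, matching the default replication factor of 3.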
NoSQL
A Not-only SQL (NoSQL) database is a non-relational database that is highly scalable, fault-tolerant
and specifically designed to house semi-structured and unstructured data. A NoSQL database often
provides an API-based query interface that can be called from within an application. NoSQL
databases also support query languages other than Structured Query Language (SQL) because SQL
was designed to query structured data stored within a relational database. As an example, a NoSQL
database that is optimized to store XML files will often use XQuery as the query language. Likewise,
a NoSQL database designed to store RDF data will use SPARQL to query the relationships it contains.
NoSQL databases play a crucial role in Big Data Analytics by offering scalability,
flexibility, and high-speed processing. Traditional relational databases (RDBMS) struggle with the
volume, variety, velocity, and veracity of big data, whereas NoSQL databases efficiently handle these
challenges.
Sharding
Sharding is the process of horizontally partitioning a large dataset into a collection of smaller, more
manageable datasets called shards. The shards are distributed across multiple nodes, where a node is
a server or a machine. Each shard is stored on a separate node and each node is responsible for only
the data stored on it. Each shard shares the same schema, and all shards collectively represent the
complete dataset.
Sharding is a database technique that splits data into smaller partitions, or shards. It can be used for
blockchains and traditional databases. Sharding helps with scalability and processing speed.
Sharding is often transparent to the client, but this is not a requirement. Sharding allows the
distribution of processing loads across multiple nodes to achieve horizontal scalability. Horizontal
scaling is a method for increasing a system’s capacity by adding similar or higher capacity resources
alongside existing resources. Since each node is responsible for only a part of the whole dataset,
read/write times are greatly improved.
Above the shards sits another software abstraction layer that coordinates data storage and access across them. Some databases have automatic sharding built in; in others, the application developer must write sharding code to store or retrieve data from the correct shard.
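The "sharding code" mentioned above is most often a hash of the key. As a minimal sketch (plain Python dicts stand in for separate nodes; the class and method names are illustrative):

```python
import hashlib

class ShardedStore:
    """Minimal hash-sharded key-value store: each shard is a plain dict
    standing in for a separate node in the cluster."""
    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=4)
store.put("user:1", {"name": "Asha"})
store.put("user:2", {"name": "Ravi"})
assert store.get("user:1") == {"name": "Asha"}
```

Because routing happens inside `put`/`get`, the sharding is transparent to the caller, mirroring how databases with built-in sharding behave.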
Benefits of Sharding:
❖ Horizontal Scalability – Instead of storing all data on a single server, data is spread across
multiple machines, enabling efficient storage and retrieval.
❖ Load Balancing – Reduces the burden on a single database server, improving performance.
❖ High Availability – If one shard fails, the rest of the system can continue functioning.
❖ Faster Queries – Queries are executed only on relevant shards, reducing processing time.
❖ Increased read/write throughput — By distributing the dataset across multiple shards, both
read and write operation capacity is increased as long as read and write operations are
confined to a single shard.
❖ Increased storage capacity — Similarly, by increasing the number of shards, you can also
increase overall total storage capacity, allowing near-infinite scalability.
Replication
Replication stores multiple copies of a dataset, known as replicas, on multiple nodes (Figure 5.7).
Replication provides scalability and availability due to the fact that the same data is replicated on
various nodes. Fault tolerance is also achieved since data redundancy ensures that data is not lost
when an individual node fails.
There are two different methods that are used to implement replication:
a. master-slave
b. peer-to-peer
Above figure shows an example of replication where a dataset is replicated to Node A and Node B, resulting in
Replica A and Replica B.
Master-Slave
During master-slave replication, nodes are arranged in a master-slave configuration, and all
data is written to a master node. Once saved, the data is replicated over to multiple slave
nodes. All external write requests, including insert, update and delete, occur on the master
node, whereas read requests can be fulfilled by any slave node. In the below figure, writes are
managed by the master node and data can be read from either Slave A or Slave B.
Above figure showing an example of master-slave replication where Master A is the single point of contact for
all writes, and data can be read from Slave A and Slave B.
Master-slave replication is ideal for read-intensive rather than write-intensive loads, since growing read demands can be managed by horizontal scaling to add more slave nodes. Writes are consistent,
as all writes are coordinated by the master node. The implication is that write performance will suffer
as the amount of writes increases. If the master node fails, reads are still possible via any of the slave
nodes. A slave node can be configured as a backup node for the master node. In the event that the
master node fails, writes are not supported until a master node is reestablished. The master node is
either resurrected from a backup of the master node, or a new master node is chosen from the slave
nodes.
One concern with master-slave replication is read inconsistency. Because data updates happen first on the master and are then copied to the slave nodes, a client that reads from a slave before the update has propagated may see old or inconsistent data.
Above figure shows an example of master-slave replication where read inconsistency occurs.
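The write path, read path, and the stale-read window can be sketched in a few lines of Python (dicts stand in for nodes, and an explicit `replicate()` call stands in for asynchronous replication; all names are illustrative):

```python
class MasterSlave:
    """Sketch of master-slave replication: writes go to the master and are
    copied to slaves asynchronously, so a slave read can be stale."""
    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []  # writes not yet copied to the slaves

    def write(self, key, value):
        self.master[key] = value           # all writes hit the master
        self.pending.append((key, value))  # replication happens later

    def replicate(self):
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

    def read(self, key, slave_index=0):
        return self.slaves[slave_index].get(key)  # reads served by slaves

db = MasterSlave()
db.write("balance", 100)
stale = db.read("balance")   # None: replication has not run yet
db.replicate()
fresh = db.read("balance")   # 100 once replication catches up
```

The gap between `stale` and `fresh` is exactly the read-inconsistency window described above.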
Peer-to-Peer
With peer-to-peer replication, all nodes operate at the same level. In other words, there is
not a master-slave relationship between the nodes. Each node, known as a peer, is equally
capable of handling reads and writes. Each write is copied to all peers, as illustrated in
Figure below
Above figure shows how writes are copied to Peers A, B and C simultaneously. Data is read from Peer A, but it
can also be read from Peers B or C.
Above Figure shows an example of peer-to-peer replication where an inconsistent read occurs.
To improve on the limited fault tolerance offered by sharding, while additionally benefiting from the increased availability and scalability of replication, sharding and replication can be combined, as shown in the figure below. This covers both the combination of sharding with master-slave replication and the combination of sharding with peer-to-peer replication.
The above figure shows a comparison of sharding and replication that shows how a dataset is
distributed between two nodes with the different approaches.
The above figure shows the combination of sharding and master-slave replication
Working:
❖ Each shard has its own master node and multiple slave nodes.
❖ Write operations occur on the master node of the respective shard.
❖ Read operations can be distributed across slave nodes to optimize performance.
❖ If a master fails, a failover mechanism promotes one of the slaves to be the new master.
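The four steps above can be sketched by composing the two earlier ideas: a hash routes each key to a shard, and each shard has its own master and slaves. This is a toy model (synchronous replication, dicts as nodes, illustrative names), not a production design:

```python
import zlib

class Shard:
    """One shard: a master dict plus replicated slave copies."""
    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        self.master[key] = value          # writes go to the shard's master
        for slave in self.slaves:         # then replicate to its slaves
            slave[key] = value

    def read(self, key):
        return self.slaves[0].get(key)    # reads served by a slave

    def promote_slave(self):
        # Failover: a slave's copy becomes the new master state.
        self.master = dict(self.slaves[0])

class ShardedReplicatedStore:
    """Sharding + master-slave replication: each key hashes to one shard."""
    def __init__(self, num_shards=2):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def write(self, key, value):
        self._shard(key).write(key, value)

    def read(self, key):
        return self._shard(key).read(key)

store = ShardedReplicatedStore()
store.write("order:42", "shipped")
assert store.read("order:42") == "shipped"

# Simulate master failure on the shard holding "order:42", then fail over.
shard = store._shard("order:42")
shard.master = {}
shard.promote_slave()
assert shard.master["order:42"] == "shipped"
```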
The above figure shows the combination of sharding and peer-to-peer replication.
*****************************************************************
Above figure shows how On-disk storage can be implemented with a distributed file system or a database.
Above figure shows how a distributed file system accesses data in streaming mode with no random reads and writes.
A Distributed File System (DFS) is ideal for storing large raw datasets or archiving data for a long time.
It is cost-effective because you can keep adding more disks instead of moving data to offline storage
like tapes. However, DFS does not have built-in search functionality to look inside files—you need extra
tools to search for file contents.
RDBMS Databases:
RDBMS is great for quick, small transactions, but not ideal for handling very large, growing datasets
because scaling is expensive and failure risks are high.
Above figure shows how a clustered relational database uses a shared storage architecture, which is a potential single point of failure that affects the availability of the database.
Relational databases require manual sharding: the application must decide where to store and find data, which becomes more complicated as data grows. Sharding is a method of splitting a large database into smaller pieces (shards) to improve performance, but in relational databases (RDBMS) it is not automatic.
Challenges of manual sharding:
❖ The application must keep track of which shard holds which data.
❖ A query that needs data from multiple shards becomes more complicated and slower.
❖ Data movement or rebalancing (when shards grow too big) is difficult to manage.
Above figure shows how a relational database is manually sharded using application logic.
Above figure shows an example of the use of the application logic to join data retrieved from multiple shards.
Relational databases require structured data (following a fixed schema). They do not support semi-
structured or unstructured data easily.
RDBMS is Not Ideal for Big Data due to following reasons
❖ Schema restriction: Data must match a predefined format before being stored.
❖ Slow writes: Checking data against rules before inserting or updating adds delay (latency).
❖ Not suited for fast data: They struggle with high-speed data that needs quick storage.
Because of these limitations, relational databases are not the best choice for Big Data, where large,
fast, and diverse data needs to be handled efficiently.
NoSQL databases
NoSQL databases are modern, flexible, and scalable databases designed for Big Data and high-speed applications. They don’t require a fixed structure like relational databases and can store different types of data, including text, images, and videos.
❖ Low Cost: Many NoSQL databases are free and work on cheap hardware.
❖ Eventual Consistency: Data updates may take time but will sync across servers.
❖ Auto Sharding & Replication: Automatically splits and copies data across servers.
❖ Aggregate-Focused: Stores data in a way that reduces the need for joins
Types of NoSQL storage devices: NoSQL storage devices can mainly be divided into four types based on the way they store data:
1. Key-Value,
2. Document,
3. Column-Family,
4. Graph.
Key-Value:
Key-Value Storage devices store data as key-value pairs, like a dictionary or hash table. The key is
unique, and the value can be anything (text, images, videos, etc.).
The working is as follows:
❖ Data is retrieved only by using keys.
❖ No complex queries; just insert, delete, or fetch by key.
❖ Very fast because lookups go directly to the key, with no secondary indexes or query planning.
Key-Value is useful when we
❖ Need fast read/write.
❖ Data is independent (no relations).
❖ Simple structure like caching, session storage.
Key-Value is not useful when we
❖ Need searching/filtering inside values.
❖ Need relations between data.
❖ Need schema consistency.
Examples:
❖ Redis, Riak, Amazon DynamoDB.
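The key-value model above can be sketched with a thin wrapper around a Python dict; the class and method names are illustrative, mirroring how Redis or DynamoDB is used at its simplest:

```python
class KeyValueStore:
    """Toy key-value store: opaque values, key-only access."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value      # value is opaque: any object or bytes

    def get(self, key):
        return self._data.get(key)   # lookup only by exact key

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("session:abc123", {"user_id": 7, "cart": ["book"]})
assert kv.get("session:abc123")["user_id"] == 7
kv.delete("session:abc123")
assert kv.get("session:abc123") is None
```

Note what is absent: there is no way to ask "which sessions contain a book in the cart" — the store never looks inside the value, which is exactly the limitation the next section's document stores address.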
Document Storage
Document storage devices store data as key-value pairs, but the value is a document (JSON, XML,
BSON). Documents can have nested structures.
The working is as follows:
❖ Unlike key-value storage, databases can search inside documents.
❖ Supports partial updates and indexing for faster queries.
Document storage is useful when we
❖ Storing semi-structured data (e.g., user profiles, invoices).
❖ Need flexible schema (easy to add fields).
❖ Need fast searching inside documents.
Document storage is not useful when we
❖ Need complex transactions involving multiple documents.
❖ Need strong schema enforcement.
Examples: MongoDB, CouchDB, Firebase Firestore.
The main differences between document storage devices and key-value storage devices are as
follows:
❖ Document storage devices are value-aware.
❖ The stored value is self-describing; the schema can be inferred from the structure of the value, or a reference to the schema for the document is included in the value.
❖ A select operation can reference a field inside the aggregate value.
❖ A select operation can retrieve a part of the aggregate value.
❖ Partial updates are supported; therefore, a subset of the aggregate can be updated.
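The "value-aware" differences above can be made concrete with a toy document store where values are JSON-like dicts (the class and method names are assumptions; MongoDB's real API differs):

```python
class DocumentStore:
    """Toy document store: values are JSON-like dicts, and unlike a
    key-value store the database can query and update fields inside them."""
    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, doc):
        self._docs[doc_id] = doc

    def find(self, field, value):
        # Select by a field *inside* the document, not just by key.
        return [d for d in self._docs.values() if d.get(field) == value]

    def update_field(self, doc_id, field, value):
        # Partial update: only one field of the aggregate changes.
        self._docs[doc_id][field] = value

db = DocumentStore()
db.insert("u1", {"name": "Asha", "city": "Pune"})
db.insert("u2", {"name": "Ravi", "city": "Delhi"})
assert db.find("city", "Pune")[0]["name"] == "Asha"
db.update_field("u1", "city", "Mumbai")
assert db.find("city", "Pune") == []
```

The `find` and `update_field` operations are precisely what a key-value store cannot offer, since it treats values as opaque.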
Column-Family:
Column-family storage stores data in rows and columns, like relational databases, but groups related columns together into column families. Each row can have a different set of columns (flexible schema), and the model is designed for fast reads and writes on large datasets.
Column-family storage is useful when
❖ You are building large-scale, high-speed applications (e.g., analytics, logging).
❖ You are handling massive datasets with a flexible structure.
❖ You need a distributed database with automatic partitioning.
Column-family storage is not useful when
❖ You need complex queries with joins.
❖ You require strict schema enforcement.
❖ Transactions across multiple rows are necessary.
Examples: Apache Cassandra, HBase, ScyllaDB.
The highlighted columns depict the flexible schema feature supported by the column-family databases, where each row can
have a different set of columns.
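The row/column-family layout can be sketched as nested dicts; this is a simplified model (Cassandra and HBase add timestamps, sorted columns, and partitioning), and all names are illustrative:

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy column-family store: each row key maps to named column
    families, and each row may carry a different set of columns."""
    def __init__(self, families):
        # row_key -> family -> {column: value}
        self._rows = defaultdict(lambda: {f: {} for f in families})

    def put(self, row_key, family, column, value):
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Reads fetch one family, not the whole row.
        return self._rows[row_key][family]

store = ColumnFamilyStore(families=["profile", "activity"])
store.put("user1", "profile", "name", "Asha")
store.put("user1", "activity", "last_login", "2024-01-05")
store.put("user2", "profile", "name", "Ravi")  # no activity columns: flexible schema
assert store.get_family("user1", "profile") == {"name": "Asha"}
```

Grouping columns into families means a read touches only the family it needs, which is what makes the model fast for wide, sparse datasets.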
Graph:
Graph storage devices are used to persist interconnected entities. Unlike other NoSQL storage devices, where the emphasis is on the structure of the entities, graph storage devices place the emphasis on storing the linkages between entities.
Graph Databases
❖ Stores relationships between data instead of just data itself.
❖ Best for interconnected data, like social networks or recommendation systems.
❖ Uses nodes (entities) and edges (connections) to represent data.
❖ Optimized for queries involving relationships, like "Find mutual friends."
Example in real life:
❖ Social networks (friend connections).
❖ Recommendation systems (suggesting products).
❖ Fraud detection (tracking suspicious transactions).
Examples of Graph Databases: Neo4j, ArangoDB, Amazon Neptune.
Useful when:
❖ You need to analyze relationships between data (e.g., social networks, fraud detection).
❖ Queries involve complex connections (e.g., "Find shortest path between 2 users").
❖ You require real-time insights on interconnected data.
❖ Your data has many relationships that need fast traversal.
Not Useful when:
❖ Your data is simple and doesn’t require relationships (a key-value or document database is
better).
❖ You need fast, high-volume transactions without complex queries (a column-family database
is better).
❖ Your application relies heavily on ACID transactions (a relational database may be better).
The above figure shows graph storage devices store entities and their relationships.
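The two relationship queries called out above ("find mutual friends", "find shortest path between 2 users") can be sketched over an adjacency-set graph; a real graph database such as Neo4j would express these in a query language like Cypher, so the Python class here is purely illustrative:

```python
from collections import deque

class SocialGraph:
    """Toy graph store: nodes are users, edges are friendships."""
    def __init__(self):
        self.edges = {}

    def add_friendship(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def mutual_friends(self, a, b):
        return self.edges.get(a, set()) & self.edges.get(b, set())

    def shortest_path(self, start, goal):
        # Breadth-first search over friendship edges.
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.edges.get(path[-1], set()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

g = SocialGraph()
g.add_friendship("asha", "ravi")
g.add_friendship("ravi", "meena")
g.add_friendship("asha", "meena")
assert g.mutual_friends("asha", "ravi") == {"meena"}
assert g.shortest_path("asha", "meena") == ["asha", "meena"]
```

The point is that both queries traverse edges, not entities, which is why graph stores index relationships directly instead of joining tables.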
NewSQL databases:
NewSQL databases combine the best of both RDBMS and NoSQL. They provide ACID compliance like
traditional relational databases while offering scalability and fault tolerance like NoSQL. Unlike NoSQL,
which follows the BASE model and provides eventual consistency, NewSQL ensures immediate
consistency, making it suitable for large-scale transactional systems. These databases support SQL
syntax, making it easier for organizations to transition from traditional databases while handling high-
volume OLTP transactions and real-time analytics. They are ideal for banking systems and operational
analytics. Examples of NewSQL databases include VoltDB, NuoDB, and CockroachDB.
In-Memory Storage Devices:
In-memory storage devices transfer data roughly 80 times faster than on-disk storage devices. Because of this speed, in-memory storage is used in real-time applications such as banking, gaming, and big data processing.
An in-memory storage device allows fast data analysis because it keeps data in RAM instead
of on a hard drive. This is called in-memory analytics—it helps process and analyze data instantly,
making quick business decisions possible.
Working
❖ Data is stored across multiple computers (a cluster), so even if one system fails, the data is
still available.
❖ More memory or computers (nodes) can be added to increase storage capacity (horizontal
scaling).
Limitations of In-Memory Storage
1. Expensive – RAM is costly compared to hard drives.
2. Limited Space – Even though a 64-bit computer can handle huge memory (16 exabytes),
physical space for RAM is limited.
3. Not Permanent – Unlike hard drives, RAM does not store data permanently. Only the most
important and fresh data is kept, and older data is removed.
Because of its speed, in-memory storage is used in AI, finance, and Big Data processing, where real-
time data insights are essential.
In-Memory Data Grids (IMDGs):
In-Memory Data Grids (IMDGs) are designed to provide high-speed data access by storing data in RAM. However, to ensure durability and consistency with traditional storage (such as relational databases or disk-based systems), IMDGs use various techniques to synchronize data between memory and persistent storage.
Characteristics of IMDGs
❖ Distributed Architecture – Data is stored across multiple servers/nodes in a cluster.
❖ High-Speed Access – Since data resides in RAM, access times are significantly reduced.
❖ Data Replication & Partitioning – Ensures fault tolerance and scalability by replicating or
partitioning data.
❖ Concurrency Control – Supports multiple users accessing data simultaneously.
❖ Event Processing – Supports real-time event-driven processing.
❖ Persistence Options – Can integrate with databases for durability (e.g., write-through, write-
behind caching).
❖ Support for Transactions – Provides ACID or eventual consistency models.
Example:
Suppose an e-commerce website uses an IMDG for order processing:
1. A user adds items to a cart → Data is stored in the IMDG for quick access.
2. At checkout, the order is validated → IMDG retrieves stock availability in real time.
3. Once the order is confirmed, data is persisted to a traditional database.
4. If an application server crashes, another node with replicated session data handles the
request.
1. Read-Through
❖ Definition: In this approach, when requested data is not found in the IMDG, the grid itself reads it from the database, stores it in memory, and returns it to the application.
❖ How It Works: The application always asks the IMDG for data. On a cache miss, the IMDG loads the value from the backing database and keeps a copy to serve subsequent requests.
❖ Example: A product catalogue where the first request for an item loads it from the database, and later requests are served from memory.
2. Write-Through
❖ Definition: In this approach, every write operation to the IMDG is immediately written to
the database as well, ensuring consistency.
❖ How It Works: When data is updated or inserted into the IMDG, the same operation is
instantly performed on the database. This ensures that the database always has the
latest data.
❖ Example: A banking application where balance updates must be immediately reflected in
both memory and the database to maintain accuracy.
3. Write-Behind
❖ Definition: This technique improves performance by allowing updates to be stored
in RAM first and written to the database asynchronously after a short delay.
❖ How It Works: The application writes data to the IMDG. Instead of immediately
writing to the database, the IMDG batches multiple updates and writes them to disk
after a defined interval.
❖ Example: An e-commerce system that processes thousands of orders per second but
writes them to the database in bulk to reduce load.
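The write-through and write-behind behaviours above can be contrasted in a few lines, with a plain dict standing in for the backing database (all class names are illustrative; a real IMDG flushes write-behind batches on a timer rather than an explicit call):

```python
class WriteThroughCache:
    """Write-through: every cache write is immediately applied to the
    backing store too, so the two never diverge."""
    def __init__(self, backing):
        self.cache = {}
        self.backing = backing

    def put(self, key, value):
        self.cache[key] = value
        self.backing[key] = value   # synchronous write to the database

class WriteBehindCache:
    """Write-behind: writes land in memory first and are flushed to the
    backing store later, in a batch."""
    def __init__(self, backing):
        self.cache = {}
        self.backing = backing
        self.dirty = set()

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)         # defer the database write

    def flush(self):
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()

db1, db2 = {}, {}
wt = WriteThroughCache(db1)
wb = WriteBehindCache(db2)
wt.put("balance", 500)
wb.put("balance", 500)
assert db1 == {"balance": 500}  # write-through: database already current
assert db2 == {}                # write-behind: database lags behind memory
wb.flush()
assert db2 == {"balance": 500}
```

The trade-off is visible in the assertions: write-through pays a database write on every `put`, while write-behind batches them at the cost of a consistency window before `flush`.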
4. Refresh-Ahead
❖ Definition: This technique anticipates which data will be needed soon and preloads
it into RAM from the database before it is requested.
❖ Working: The IMDG analyses historical data access patterns or follows predefined
rules. It proactively loads frequently accessed data into memory to reduce access
latency.
❖ Example: A stock market application preloading the latest stock prices before traders
request them, ensuring real-time updates.
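The refresh-ahead idea can be sketched with a time-to-live cache that reloads an entry when a read lands close to its expiry. The TTL values, refresh margin, and the stock-price backing dict are all assumptions for illustration (a real IMDG would also use access-pattern analysis and background threads):

```python
class RefreshAheadCache:
    """Refresh-ahead sketch: entries carry a time-to-live, and a read that
    finds an entry close to expiry reloads it from the backing store
    before the entry actually goes stale."""
    def __init__(self, backing, ttl=10, refresh_margin=3):
        self.backing = backing        # e.g. a database of stock prices
        self.ttl = ttl
        self.refresh_margin = refresh_margin
        self.cache = {}               # key -> (value, expiry_time)

    def get(self, key, now):
        entry = self.cache.get(key)
        if entry is None:
            return self._load(key, now)       # cold miss: load from backing
        value, expiry = entry
        if now >= expiry - self.refresh_margin:
            return self._load(key, now)       # near expiry: refresh ahead
        return value

    def _load(self, key, now):
        value = self.backing[key]
        self.cache[key] = (value, now + self.ttl)
        return value

prices = {"ACME": 101.5}
cache = RefreshAheadCache(prices)
assert cache.get("ACME", now=0) == 101.5   # loaded at t=0, expires at t=10
prices["ACME"] = 102.0                     # backing store changes
assert cache.get("ACME", now=5) == 101.5   # still fresh: served from memory
assert cache.get("ACME", now=8) == 102.0   # within margin: refreshed ahead
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would read the clock and refresh in the background so the caller never waits.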