Cassandra_Complete_Notes

Apache Cassandra is a distributed NoSQL database designed for high performance and scalability, suitable for handling large volumes of various data types. It features a decentralized architecture with no single point of failure, ensuring high availability and fault tolerance, and uses a flexible, schema-less data model. While it excels in write-heavy applications and real-time data processing, it may not be ideal for complex queries and has a steeper learning curve for data modeling.

Uploaded by

Vasanth Msd
Copyright
© © All Rights Reserved

Apache Cassandra - Complete Notes

1 What is Cassandra?
Apache Cassandra is a high-performance, distributed NoSQL database system designed to
handle large volumes of structured, semi-structured, and unstructured data across many servers
with no single point of failure.

1.1 Key Characteristics


• Developed at Facebook, released as open source in 2008, and later adopted as an Apache
Software Foundation project.
• Combines features of Amazon's Dynamo (distribution and replication) and Google's Bigtable
(column-family data model).
• Uses a peer-to-peer architecture (no master-slave).
• Best for applications needing massive scalability, fast write performance, and high
availability.

1.2 Use Cases


• Real-time big data applications.
• Messaging and social media.
• IoT and time-series data.
• Logging and analytics at scale.

2 Features of Cassandra
• Decentralized: No master-slave; all nodes are equal. Any node can service any request.
• Scalable: Add more nodes easily without downtime. Linear horizontal scalability.
• High Availability: Data replication keeps data available when nodes fail. No single point
of failure.
• Fault Tolerance: Handles node failures gracefully. Commit log ensures durability of
writes.
• Tunable Consistency: Configure consistency level per request. Balance between
availability and consistency.
• High Performance: Optimized for fast write operations. Reads can be served by any
replica, which spreads read load across nodes.
• Flexible Schema: Columns can be added to a table without downtime, and rows need not
populate every column. Easy to evolve data structure.
• Support for CQL (Cassandra Query Language): SQL-like language. Simplifies
querying and schema design.

• Efficient Storage: Uses SSTables and compaction strategies. Handles write-heavy
workloads efficiently.
• Support for Time-Series Data: Natural fit for IoT, sensor, event data. Fast inserts and
retrieval of timestamped values.
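
The tunable-consistency trade-off in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the usual definition of a quorum (a majority of replicas) and the common rule of thumb that a read is guaranteed to overlap the latest write when read replicas plus write replicas exceed the replication factor. The function names are illustrative and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


rf = 3
q = quorum(rf)  # 2 of 3 replicas
print(q, reads_see_latest_write(rf, q, q))  # QUORUM writes + QUORUM reads overlap
print(reads_see_latest_write(rf, 1, 1))     # ONE + ONE may miss the newest write
```

Lower acknowledgement counts favor availability and latency; higher counts favor consistency.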

3 Cassandra Architecture
Cassandra uses a peer-to-peer distributed architecture. All nodes are equal, and data is parti-
tioned across them using consistent hashing.

3.1 Important Architectural Concepts


• Ring-based architecture: Data is distributed across the ring.
• Gossip Protocol: Nodes exchange state information about themselves and others.
• Snitch: Helps in determining the location of nodes (rack, data center).
• Partitioner: Hashes each partition key to a token, which decides where on the ring the data lives.
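
The ring and partitioner ideas above can be sketched in a few lines. This is a simplified illustration, not Cassandra's implementation: it assumes one token per node (real clusters usually use many virtual nodes), uses MD5 as a stand-in hash, and places extra replicas on the next nodes clockwise, which mirrors the simplest replication strategy. `Ring`, `token`, and the node names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 here; a stand-in for the real partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # One token per node, kept sorted so lookups can binary-search the ring.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str, replication_factor: int = 1):
        """Owner = first node at or after the key's token; extra replicas follow clockwise."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        count = min(replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user-123", replication_factor=3))
```

Because only the key's hash decides placement, any node can compute where a row lives without asking a master.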

3.2 No Master Node


• No single point of failure.
• Uniform distribution of load and responsibility.
• Higher scalability and fault tolerance.

4 Key Components of Cassandra


• Node: Basic unit where data is stored. Each node has equal responsibilities. Handles
read/write requests.
• Data Center: A collection of related nodes. Can be physical or logical. Used for
geographic replication and fault isolation.
• Cluster: A collection of one or more data centers. Manages the entire Cassandra setup.
Consists of multiple nodes working together.
• Commit Log: Every write is recorded here first. Used for durability and recovery. Helps
restore data after node failure.
• Memtable: In-memory store of data. Holds recently written data before flushing. Flushed
to SSTables when full.
• SSTable (Sorted String Table): Immutable disk-based storage. Stores flushed data from
Memtable. Compacted periodically for optimization.
• Bloom Filter: A probabilistic data structure. Speeds up reads by identifying SSTables that
cannot contain the requested data, so they can be skipped. Reduces unnecessary disk lookups.

• Cassandra Query Language (CQL): SQL-like syntax to interact with Cassandra.
Supports DDL and DML operations. Easy to use and familiar for developers.
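
Of the components above, the Bloom filter is the least intuitive, so a small sketch may help. This is a generic Bloom filter written under assumed parameters (1024 bits, 3 hash functions derived from SHA-256); it is not Cassandra's implementation, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user-123")
print(sstable_filter.might_contain("user-123"))  # True: added keys are never reported absent
```

A "possibly present" answer can be a false positive, which costs one wasted lookup; a "definitely absent" answer is always correct, which is what lets a read skip an SSTable safely.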

5 Cassandra Data Model


Cassandra follows a wide-column (column-family) data model: data lives in tables, but rows
are sparse and need not populate every column. It is designed to handle large-scale data
with schemas that evolve over time.

5.1 Key Concepts


• Keyspaces: Highest level of logical grouping, similar to a database in RDBMS. Defines
replication factor and strategy (e.g., SimpleStrategy, NetworkTopologyStrategy).
• Column Families (Tables): Collection of rows, analogous to tables in RDBMS. Each
row is identified by a unique primary key.
• Rows: Each row is uniquely identified by a primary key. Rows are sparse: a row stores
only the columns that have been given a value.
• Columns: Basic data unit, consisting of a name, value, and timestamp. New columns can
be added to a table later, and each row populates only the ones it needs.
• Super Column (Deprecated): A column that contains sub-columns, used for nested
data. A legacy construct that is not used in CQL-based data models.

5.2 Data Model Explanation


The Cassandra data model is designed for scalability and flexibility, differing significantly from
traditional RDBMS. It uses a wide-column store, where each row may populate a different
subset of the table's columns, which is why it is often described as schema-less in practice.

5.3 Keyspace Example


A keyspace named social_media might contain tables like users, posts, and comments.
The keyspace defines how data is replicated across nodes (e.g., replication factor of 3 means
three copies of data).

5.4 Table Example


Consider a users table:
    CREATE TABLE social_media.users (
        user_id uuid,
        username text,
        email text,
        created_at timestamp,
        PRIMARY KEY (user_id)
    );

Here, user_id is the partition key, determining which node stores the row. Columns such as
phone_number can be added later with ALTER TABLE; rows that never set the new column
simply store nothing for it.

5.5 How It Works
• Partitioning: The partition key (e.g., user_id) is hashed to determine the node(s)
where the data resides.
• Clustering: Clustering columns (if any) sort data within a partition.
• Dynamic Columns: A row for user_id=123 might hold values only for username and email,
while another for user_id=456 might also hold a value for phone.
• Timestamps: Each column's timestamp ensures the latest write wins in case of conflicts.
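
The last-write-wins rule from the list above is applied per column, not per row. The sketch below illustrates that idea under simplifying assumptions: each row version is a plain dictionary of column name to (value, timestamp), and tie-breaking for equal timestamps is left out. `merge_rows` and the sample data are hypothetical.

```python
def merge_rows(*versions):
    """Merge row versions column by column, keeping the value with the highest timestamp.

    Each version maps column name -> (value, write_timestamp).
    """
    merged = {}
    for version in versions:
        for column, (value, ts) in version.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged


replica_a = {"username": ("alice", 100), "email": ("alice@old.example", 100)}
replica_b = {"email": ("alice@new.example", 250), "phone": ("555-0100", 200)}

print(merge_rows(replica_a, replica_b))
```

The merged row takes username from the first version and email and phone from the second, so a later update to one column never discards an unrelated column.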

5.6 Why It Matters


The data model is optimized for specific query patterns. Developers must design tables based
on query requirements, often one table per query with duplicated data, unlike RDBMS where
normalization is key.

6 Cassandra Primary Keys


A Primary Key consists of a Partition Key and optional Clustering Column(s):
• Partition Key: Determines which node stores the data.
• Clustering Columns: Define sorting within the partition.
Example:
    PRIMARY KEY ((user_id), message_id)

user_id: Partition Key, message_id: Clustering Column.
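
To illustrate how the two parts of this key behave, here is a small in-memory sketch. It assumes a single process, with a dictionary standing in for partitions and a sorted list standing in for clustering order; `insert` and `messages_for` are hypothetical helpers, not driver calls.

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering value, payload), kept sorted by clustering value
partitions = defaultdict(list)


def insert(user_id: str, message_id: int, body: str) -> None:
    """Place the row in its partition at the position given by the clustering column."""
    bisect.insort(partitions[user_id], (message_id, body))


def messages_for(user_id: str, min_message_id: int = 0):
    """Single-partition range scan: rows come back already ordered by message_id."""
    rows = partitions.get(user_id, [])
    start = bisect.bisect_left(rows, (min_message_id,))
    return rows[start:]


insert("u1", 3, "third")
insert("u1", 1, "first")
insert("u2", 7, "other user")
insert("u1", 2, "second")

print(messages_for("u1"))
print(messages_for("u1", min_message_id=2))
```

Queries that name the partition key and filter on the clustering column touch one partition and read rows in stored order, which is why this key layout suits "messages for one user" lookups.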

7 Read & Write Operations


7.1 Write Path
• Data is written to Commit Log for durability.
• Written to Memtable in memory.
• When full, Memtable is flushed to disk as an SSTable.
• Periodic compaction merges SSTables.
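
The four write-path steps above can be sketched as a toy single-node model. It makes several simplifying assumptions: the commit log is an in-memory list rather than a file, the memtable is "full" after a fixed number of keys, and compaction merges every SSTable at once. The `Node` class is hypothetical.

```python
class Node:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []      # append-only record of every write (on disk in reality)
        self.memtable = {}        # recent writes, held in memory
        self.sstables = []        # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # 1. durability first
        self.memtable[key] = (value, timestamp)          # 2. then the in-memory table
        if len(self.memtable) >= self.memtable_limit:    # 3. flush when full
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # 4. merge all SSTables into one, keeping the newest version of each key.
        newest = {}
        for table in self.sstables:
            for key, (value, ts) in table:
                if key not in newest or ts > newest[key][1]:
                    newest[key] = (value, ts)
        self.sstables = [tuple(sorted(newest.items()))]


node = Node()
for ts, (k, v) in enumerate([("a", 1), ("b", 2), ("a", 3), ("c", 4)]):
    node.write(k, v, ts)
node.compact()
print(node.sstables)
```

Writes never modify data in place: they append to the log and the memtable, which is the main reason the write path is fast.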

7.2 Read Path


• Check Memtable for the latest data.
• Use each SSTable's Bloom Filter to skip SSTables that cannot contain the data.
• Read from the remaining SSTable(s) and merge the results by timestamp.
• Use Read Repair if replicas return inconsistent data.
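
The single-node part of the read path above can be sketched as follows. This is a simplified model: a plain Python set stands in for each SSTable's Bloom filter (so it never gives false positives), SSTables are dictionaries, and read repair across replicas is not modeled. The `read` function and the sample data are hypothetical.

```python
def read(key, memtable, sstables):
    """Return the newest (value, timestamp) for key, or None if it was never written.

    memtable: dict key -> (value, timestamp)
    sstables: list of dicts, each {"filter": set_of_keys, "rows": dict}
    """
    candidates = []
    if key in memtable:                      # 1. recent writes still in memory
        candidates.append(memtable[key])
    for table in sstables:
        if key not in table["filter"]:       # 2. filter says "definitely absent": skip
            continue
        row = table["rows"].get(key)         # 3. filter said "maybe": look on disk
        if row is not None:
            candidates.append(row)
    # Several sources may hold the key; the highest timestamp is the current value.
    return max(candidates, key=lambda row: row[1], default=None)


memtable = {"a": ("new", 30)}
sstables = [
    {"filter": {"a", "b"}, "rows": {"a": ("old", 10), "b": ("bee", 20)}},
    {"filter": {"c"}, "rows": {"c": ("sea", 25)}},
]
print(read("a", memtable, sstables))  # ('new', 30)
print(read("z", memtable, sstables))  # None
```

The more SSTables a key is spread across, the more lookups a read needs, which is one reason compaction matters for read performance.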

8 Real-world Example Use Cases
• Messaging Applications: Workloads like those of WhatsApp or Messenger, which handle
millions of messages per second.
• IoT & Sensor Data: Collect time-series data from millions of devices.
• E-commerce: Real-time product catalog updates. Customer behavior tracking.

9 Advantages of Cassandra
• No single point of failure.
• Handles massive amounts of writes.
• High availability and fault tolerance.
• Linear scalability.
• Flexible data model.
• SQL-like query language (CQL).

10 Disadvantages of Cassandra
• Not suitable for complex joins and aggregations.
• Eventual consistency may not fit all use cases.
• Higher operational complexity.
• Schema changes need caution.
• Learning curve for data modeling.
