
Cassandra data model

Cassandra is a distributed database system organized into clusters, with key components including keyspaces, column families, and nodes, designed to handle big data workloads without a single point of failure. Data replication ensures reliability, while the Gossip Protocol facilitates communication between nodes. Users interact with Cassandra through the Cassandra Query Language (CQL), enabling efficient read and write operations across the distributed architecture.


Cassandra data model

Cluster
A Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. For failure handling, every node contains a replica; in case of a failure, the replica takes charge. Cassandra arranges the nodes of a cluster in a ring format and assigns data to them.
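The ring assignment described above can be sketched with a toy partitioner. The node names, token positions, and the MD5 stand-in hash are all illustrative; Cassandra itself uses the Murmur3 partitioner over a much larger token space:

```python
import hashlib
from bisect import bisect_left

# Hypothetical 4-node ring; tokens span a simplified 32-bit space.
NODES = {0: "node-A", 2**30: "node-B", 2**31: "node-C", 3 * 2**30: "node-D"}
TOKENS = sorted(NODES)

def token_for(partition_key: str) -> int:
    """Hash the partition key onto the ring (MD5 stands in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def replicas(partition_key: str, rf: int = 3) -> list[str]:
    """The first rf nodes clockwise from the key's token hold its replicas."""
    start = bisect_left(TOKENS, token_for(partition_key)) % len(TOKENS)
    return [NODES[TOKENS[(start + k) % len(TOKENS)]] for k in range(rf)]

def owner(partition_key: str) -> str:
    """The first replica is the node primarily responsible for the key."""
    return replicas(partition_key, rf=1)[0]
```

With a replication factor of 3, each key maps to three consecutive nodes on the ring, which is what lets a replica take over when the primary node fails.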
Keyspace: Keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace in Cassandra are −
• Replication factor − the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy − the strategy used to place replicas in the ring. The available strategies are SimpleStrategy (rack-unaware), OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).
• Column families − a keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows, and each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
Creating a keyspace:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns. The following table lists the points that differentiate a column family from a table in a relational database.
A Cassandra column family has the following attributes:
• keys_cached − It represents the number of locations to keep cached
per SSTable.
• rows_cached − It represents the number of rows whose entire
contents will be cached in memory.
• preload_row_cache − It specifies whether you want to pre-populate
the row cache.
SuperColumn
A super column is a special column; therefore, it is also a key-value pair. But where a regular column stores a single value, a super column stores a map of sub-columns.
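As a sketch, a super column can be modeled as a nested map whose value is itself a map of sub-columns. All the keys and values below are made up for illustration:

```python
# A row keyed by a row key; each top-level entry is a super column
# whose value is a map of sub-column names to sub-column values.
row = {
    "user:1001": {                    # row key
        "address": {                  # super column
            "street": "12 Main St",   # sub-columns
            "city": "Austin",
        },
        "contact": {                  # another super column
            "email": "a@example.com",
        },
    }
}
```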
Cassandra data model
• Cassandra deals with unstructured data.
• Cassandra has a flexible schema.
• In Cassandra, a table is a list of “nested key-value pairs”. (ROW x
COLUMN key x COLUMN value)
• Keyspace is the outermost container that contains data corresponding
to an application
• Tables or column families are the entities of a keyspace.
• Row is a unit of replication in Cassandra.
• Column is a unit of storage in Cassandra.
• Relationships are represented using collections.
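The "nested key-value pairs" view listed above can be sketched as plain dictionaries: row key, then column key, then column value. Note how the flexible schema lets each row carry a different set of columns (all names and values are illustrative):

```python
# ROW key -> COLUMN key -> COLUMN value
table = {
    "employee:1": {"name": "Ravi", "dept": "sales"},
    # Flexible schema: this row has an extra column the first row lacks.
    "employee:2": {"name": "Meera", "dept": "hr", "phone": "555-0100"},
}
```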
Cassandra architecture
• The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed architecture, and data is distributed among all the nodes in a cluster.
• All the nodes in a cluster play the same role. Each node is
independent and at the same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless
of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from
other nodes in the network.
Data Replication in Cassandra

• In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.
• The following figure shows a schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure no single point of failure.
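The read-repair behavior described above can be sketched as follows, assuming each replica returns its value together with a write timestamp (node names and timestamps here are invented):

```python
def read_with_repair(replica_data: dict) -> str:
    """Return the newest value among replicas and repair stale copies.

    replica_data maps node name -> (value, write_timestamp).
    """
    newest_value, newest_ts = max(replica_data.values(), key=lambda vt: vt[1])
    # Background read repair: push the newest value to any stale replica.
    for node, (value, ts) in list(replica_data.items()):
        if ts < newest_ts:
            replica_data[node] = (newest_value, newest_ts)
    return newest_value
```

The client always sees the most recent value; the stale replica is brought up to date as a side effect.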
Cassandra uses the Gossip Protocol in the background
to allow the nodes to communicate with each other and
detect any faulty nodes in the cluster.
Key components of cassandra
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After commit log, the data
will be written to the mem-table. Sometimes, for a single-column family, there will be
multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its contents
reach a threshold value.
• Bloom filter − A Bloom filter is a quick, probabilistic data structure for testing whether an element is a member of a set; it may report false positives but never false negatives. It acts as a special kind of cache: Bloom filters are consulted on every read to skip SSTables that cannot contain the requested key.
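As a sketch of the Bloom filter idea: a negative answer is definitive, which is why Cassandra can skip an SSTable entirely when its filter says "no". The size and the MD5-based hashing below are illustrative choices, not Cassandra's actual implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = 0  # integer used as a bitset

    def _positions(self, item: str):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits >> p & 1 for p in self._positions(item))
```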
Cassandra Query Language

• Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, an interactive prompt, to work with CQL, or separate application-language drivers.
• Clients approach any of the nodes for their read/write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
Write Operations

Every write is first captured in the commit log on the node. The data is then written to the mem-table. Whenever the mem-table is full, its contents are flushed to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables through compaction, discarding unnecessary data.
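The write path above (commit log, then mem-table, then flush to an SSTable) can be sketched like this, with the flush threshold reduced to a toy value:

```python
class WritePath:
    """Sketch of Cassandra's write path on a single node."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # crash-recovery record of every write
        self.memtable = {}     # in-memory, holds the most recent values
        self.sstables = []     # immutable "on-disk" files (simulated)
        self.limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))    # 1. log the write durably
        self.memtable[key] = value              # 2. update the mem-table
        if len(self.memtable) >= self.limit:    # 3. flush when full
            self.sstables.append(dict(self.memtable))
            self.memtable = {}
```

Because SSTables are immutable, updates accumulate across files; periodic compaction (not modeled here) merges them and drops superseded values.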

Read Operations

During read operations, Cassandra first looks for values in the mem-table, then checks each SSTable's Bloom filter to find the appropriate SSTable that holds the required data.
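The read path can be sketched in the same toy model: consult the mem-table first, then scan SSTables newest-first, using each SSTable's Bloom filter (simplified here to a plain set of keys) to skip files that cannot hold the key:

```python
def read(key, memtable, sstables):
    """Sketch of the read path on one node.

    sstables is a list of (bloom, data) pairs, oldest first; bloom is
    a set standing in for the SSTable's Bloom filter.
    """
    if key in memtable:            # 1. freshest data lives in the mem-table
        return memtable[key]
    for bloom, data in reversed(sstables):   # 2. newest SSTable wins
        if key in bloom and key in data:     # Bloom check avoids a disk read
            return data[key]
    return None
```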
Cassandra and Hadoop
Support for integrating Hadoop with Cassandra includes MapReduce.
• To disable virtual nodes, set num_tokens to 1 in the cassandra.yaml file.
• Uncomment the initial_token property and set it to 1, or to the value of a generated token for a multi-node cluster.
• Start the cluster for the first time. You cannot convert single-token nodes to vnodes. See "Enabling virtual nodes on an existing production cluster" for another option.
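The cassandra.yaml settings named in the steps above would look roughly like this (a sketch; the token values are illustrative):

```yaml
# cassandra.yaml: disabling virtual nodes
num_tokens: 1

# Uncomment and set to 1, or to a generated token on a multi-node cluster.
initial_token: 1
```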
• Setup and configuration, described in the Apache docs, involves overlaying a Hadoop cluster on Cassandra nodes, configuring a separate server for the Hadoop NameNode/JobTracker, and installing a Hadoop TaskTracker and Data Node on each Cassandra node. The nodes in the Cassandra datacenter can draw from data in the HDFS Data Node as well as from Cassandra. The Job Tracker/Resource Manager (JT/RM) receives MapReduce input from the client application and sends the MapReduce job request to the Task Trackers/Node Managers (TT/NM). The data is written to Cassandra and the results are sent back to the client.
Input and Output Formats
• Hadoop jobs can receive data from CQL tables and indexes and can
write their output to Cassandra tables as well as to the Hadoop
FileSystem. Cassandra 3.0 supports the following formats for these
tasks:
• CqlInputFormat class: for importing job input into the Hadoop
filesystem from CQL tables
• CqlOutputFormat class: for writing job output from the Hadoop
filesystem to CQL tables
• CqlBulkOutputFormat class: generates Cassandra SSTables from the output of Hadoop jobs, then loads them into the cluster using the SSTableLoader.
• Reduce tasks can store keys (and corresponding bound variable values) when writing their output to Cassandra tables.
