Cassandra Architecture PDF
Cassandra
Overview
Relational Databases
• Transactions with ACID properties - Atomicity, Consistency, Isolation &
Durability
• Adherence to Strong Schema of data being written/read
• Real-time query management (when data size is under 10 terabytes)
• Execution of complex queries involving join & group by clauses
• Support for ad-hoc queries
• Not designed to work in modern distributed architecture
• Data is organized in third normal form (3NF) - storage and memory efficient
Third normal form does not scale
• Queries are unpredictable
• Users are impatient
• Data must be denormalized
• If data > memory, performance degrades
• Disk seeks are the worst
Problems with sharding
• Data is all over the place
• No more joins
• No more aggregations
• Everything must be denormalized
• Querying secondary index requires hitting every shard
• Adding shards requires manual movement of data
• Schema changes are hard to manage
Challenges with high availability
• Master failover is messy
• Multi DC failover is harder
• Downtime is frequent
• Changes to database settings
• Drive, power supply failures
• OS updates
NoSQL Databases
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Cassandra transactions are atomic, isolated and durable
• Atomicity
• In Cassandra, a write operation is atomic at the partition level, meaning the
insertions or updates of two or more rows in the same partition are treated as one
write operation. A delete operation is also atomic at the partition level.
• Isolation
• Cassandra write and delete operations are performed with full row-level isolation.
This means that a write to a row within a single partition on a single node is only
visible to the client performing the operation – the operation is restricted to this
scope until it is complete.
• Durability
• All writes to a replica node are recorded both in memory and in a commit log on
disk before they are acknowledged as a success.
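As an illustration of partition-level atomicity, here is a minimal CQL sketch (the users_by_state table, its columns and values are hypothetical, not from the slides): both inserts target the partition state = 'CA', so they are applied as one write.
-- assumes a hypothetical table:
-- users_by_state (state text, city text, id uuid, name text, PRIMARY KEY ((state), city, id))
BEGIN BATCH
  INSERT INTO users_by_state (state, city, id, name) VALUES ('CA', 'Oakland', uuid(), 'Alice');
  INSERT INTO users_by_state (state, city, id, name) VALUES ('CA', 'San Jose', uuid(), 'Bob');
APPLY BATCH;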
DB Ranking
https://db-engines.com/en/ranking
Cassandra Use Cases
• Over 1000 companies use Cassandra today
• Accenture uses Cassandra for messaging
- 2 billion messages / day
• Netflix uses Cassandra for its catalog and playlists
- 10 million transactions / sec
• eBay - personalization, messaging and fraud detection
- Connects 100 million customers with 400 million items and stores 250 TB of data
Cassandra Performance
University of Toronto Study
Nodes (transactions/sec)
Load Type                               | 1         | 2         | 4         | 8          | 16         | 32
Load Process                            | 18,686.43 | 31,144.21 | 53,067.62 | 86,924.94  | 173,001.20 | 326,427.07
Read Mostly Workload                    | 11,173.75 | 18,137.80 | 39,481.33 | 65,963.30  | 116,363.93 | 221,073.15
Balanced Read/Write                     | 18,925.59 | 35,539.69 | 64,911.39 | 117,237.91 | 210,237.90 | 384,682.44
Read Modify Write                       | 10,555.73 | 19,919.52 | 37,418.16 | 69,221.07  | 141,715.80 | 256,165.66
Mixed Operational/Analytical Workload   | 4,455.63  | 9,343.11  | 19,737.82 | 36,177.48  | 73,600.53  | 120,755.00
Insert Mostly                           | 24,163.62 | 47,974.09 | 85,324.69 | 159,945.39 | NA         | NA
https://academy.datastax.com/planet-cassandra/nosql-performance-benchmarks
Cassandra Limitations
• Does not support query-time joins
• Cassandra recommends "join on write"
• Data duplication can scale, but joins cannot
• Does not support group by queries
• Supports only a limited set of aggregate functions such as sum and avg
• Does not support order by on ad-hoc columns
• Does not support where clauses on ad-hoc columns
• Cassandra recommends duplicating data based on the query requirements
• Secondary indexes are supported but are an anti-pattern
• Recommends using materialized views instead
• 3NF data modeling is an anti-pattern in Cassandra
• Recommends a query-first data model (see the sketch below)
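A minimal sketch of the query-first approach (table and column names are hypothetical): the same user data is duplicated into one table per query, and the application performs the "join on write".
-- serves: look up a user by id
CREATE TABLE users_by_id (
  id uuid PRIMARY KEY,
  name text,
  state text,
  city text
);
-- serves: list users in a given state, optionally narrowed by city
CREATE TABLE users_by_state (
  state text,
  city text,
  id uuid,
  name text,
  PRIMARY KEY ((state), city, id)
);
SELECT * FROM users_by_id WHERE id = 5132b130-ae79-11e4-ab27-0800200c9a66;
SELECT * FROM users_by_state WHERE state = 'CA';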
Node
• A node is a single server running the Cassandra JVM process
• A node can typically handle 3k-5k transactions / sec / core
• A node can manage about 1-3 TB of data
• Nodes generally have 8-32 GB of RAM
• Nodes are connected to the other nodes through 10 Gbps links
• Supports both SSDs and rotational disks in a JBOD architecture
Data Centers and Racks
Hash Ring
• Cassandra hash ring
• No config servers or ZooKeeper
• Data is partitioned around the ring
• Data is replicated to RF = N servers
• All nodes hold data and can handle write and read queries
• The location of the data is determined by the partition key (see the token example below)
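To see this in practice (keyspace, table and key are hypothetical), the CQL token() function shows the hash the partitioner computes for a partition key, and nodetool getendpoints lists the replica nodes that own it.
SELECT token(state), state FROM users_by_state WHERE state = 'CA' LIMIT 1;
$ bin/nodetool getendpoints my_keyspace users_by_state CA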
Replication
• Data is replicated automatically
• Specified by replication-factor
• If a machine is down, missing data is replayed
Multi DC
• Typical usage: client writes to local DC and
replicates async to other DC
• Replication factor per key space per data
center
• Datacenters can be physical or logical
(Example node properties: dc=DC1, rack=RAC1)
Dynamic Snitch
• Layered automatically with other snitches
• Maintains a pulse on each node’s performance
• Determines which node to query replicas from depending on node health
• Turned on by default for all snitches
Replication Strategy
• Cassandra supports two strategies out of the box:
• SimpleStrategy and NetworkTopologyStrategy
• SimpleStrategy places replicas at consecutive nodes around the ring, starting with the node indicated by the partitioner.
• The NetworkTopologyStrategy allows you to specify a different replication
factor for each data center. Within a data center, it allocates replicas to
different racks in order to maximize availability.
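For example (keyspace names, data center names and replication factors are placeholders), the strategy is chosen when the keyspace is created:
CREATE KEYSPACE demo_simple
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE KEYSPACE demo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};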
Consistency Model
• Per-query consistency
• ALL, QUORUM, ONE
• How many replicas must respond for the query to be considered successful
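In cqlsh the consistency level is set per session before issuing queries; drivers expose it per statement. A small sketch (table name hypothetical):
CONSISTENCY QUORUM;
SELECT * FROM users_by_state WHERE state = 'CA';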
Consistency Levels
Partition Index
Partition Index Sample
SSTable
Reading SSTable using Index Cache
int ● Signed ● 32 bits
PRIMARY KEY ((state), id)
PRIMARY KEY ((state), city, id)
PRIMARY KEY ((state), city, name, id)
Querying Clustering Columns
• Must provide partition key
• Clustering column can follow thereafter
• You can perform either equality or range on clustering columns
• All equality comparisons must come before inequality comparisons
• Since data is sorted on disk, range searches are a binary search followed
by a sequential read
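A sketch against the hypothetical users_by_state table (PRIMARY KEY ((state), city, id)) used earlier:
-- partition key plus equality on a clustering column
SELECT * FROM users_by_state WHERE state = 'CA' AND city = 'Oakland';
-- partition key plus a range on the first clustering column
SELECT * FROM users_by_state WHERE state = 'CA' AND city >= 'A' AND city < 'N';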
Allow Filtering
• ALLOW FILTERING relaxes the constraint that queries must restrict the partition key
• SELECT * FROM users
• You can then query on just clustering columns
• Causes Cassandra to scan all partitions in the table
• Do not use it
• Unless you really have to
• Best on small data sets
• Consider use of Spark!
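For instance (hypothetical table), restricting only a clustering column forces a scan of every partition and requires the explicit opt-in:
SELECT * FROM users_by_state WHERE city = 'Oakland' ALLOW FILTERING;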
Operations
Topics
• Adding Nodes
• Bootstrapping a Node into a cluster
• Removing a Node
• Replace a Downed Node
• Run a Repair Operation
• Dividing SSTables with sstablesplit
• Create snapshot
• Implement Multiple Data Centers
• Best Practices for Cluster Sizing
• Using Cassandra-stress Tool
Adding Node
• Data capacity problem
• Your data has outgrown the node's hardware capacity
• Reached traffic capacity
• Your application needs more rapid response with less latency
• Need more operational headroom
• For node repair, compaction, and other resource-intensive operations
Bootstrapping
• Simple process that brings a node up to speed
• The amount of time it takes to bootstrap depends on the amount of data
• Can be a long running process
• The node announces itself to the ring using a seed node
The Bootstrapping Process
• Calculate the range(s) of the new node and notify the ring of these pending ranges
• Calculate the node(s) that currently own these ranges and will no longer own them once the bootstrap completes
• Stream the data from these nodes to the bootstrapping node (monitor with nodetool netstats)
• Join the new node to the ring so it can serve traffic
Nodetool Cleanup
• Makes sure no data is left behind in SSTables that are outside the allocated token ranges
• It essentially copies the SSTables to new SSTables, keeping only valid data
• The compaction process will eventually clean this up anyway
• Not essential - it is more of a deterministic process
$ bin/nodetool [options] cleanup -- <keyspace> (<table>)
Use flags to specify:
-h host or IP address
-p port
-pw password
-u username
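A concrete invocation might look like this (host, credentials and keyspace are placeholders):
$ bin/nodetool -h 10.0.1.12 -u cassandra -pw cassandra cleanup my_keyspace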
Replace a Downed Node
• First, find the IP address of the downed node using the nodetool status command
• On the replacement node, open cassandra-env.sh
• Swap in the IP address of the dead node as the replace_address value in the JVM options, so the new node takes over the dead node's token ranges instead of bootstrapping new ones
• Use nodetool removenode to remove the dead node
• If necessary, force removal with nodetool assassinate
• You can monitor the process using the nodetool netstats command
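The JVM option is typically added to cassandra-env.sh on the replacement node before it is started (the IP address is a placeholder):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.1.17"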
What if the node was also a seed node?
• Need to add to list of seeds in cassandra.yaml
• Cassandra will not allow seed node to auto-bootstrap
• You will have to run repair on new seed node to do so
• Add a new node making the necessary changes to the cassandra.yaml
file
• Specify new seed node in cassandra.yaml
• Start Cassandra on new seed node
• Run nodetool repair on the new seed node to manually
bootstrap
• Remove the old seed node using nodetool removenode with the Host ID of the downed node
• Run nodetool cleanup on previously existing nodes
Repair Operation
• Repair is a deliberate process to cope with cluster entropy … a
consistency check
• Entropy can arise from nodes that were down longer than the hint window, dropped mutations, or other causes
• A repair operates on all of the nodes in the replica set by default
• Ensures that all replicas have identical copies of a given partition
• Consists of two phases
• Build Merkle tree of the data per partition
• Replicas then compare the differences between their trees and stream the
differences to each other as needed
A Merkle Tree Exchange
• Starts with the root of the tree (a list of one hash value)
• The origin sends the list of hashes at the current level
• The destination diffs the list of hashes against its own, then requests
subtrees that are different
• If there are no differences, the request can terminate
• Repeat steps 2 and 3 until leaf nodes are reached
• The origin sends the values of the keys in the resulting set
Why is Repair Necessary?
• A node’s data can get inconsistent over time (Repair is just a
maintenance action in this case)
• If a node goes down for some time, it misses writes and will need to
catch up
• Sometimes it is best to repair a node:
• If the node has been down longer than the length specified in
max_hint_window_in_ms, the node is out of sync
• Depending on the amount of data, might be faster to repair
• Sometimes it is better to bring the node back as a new node and
bootstrap
• If there is a significant amount of data, might be faster just to bring in a new
node and stream data just to that node
What are Incremental Repairs
• To avoid the need for constant tree
construction, incremental repairs have been
introduced
• The idea is to persist already-repaired data and only calculate Merkle trees for SSTables that have been created since
• This allows the repair process to stay
performant and lightweight
What are incremental repairs?
• Incremental repairs begin with the repair leader sending out a prepare message to the peers
• Each node builds a Merkle tree from the unrepaired sstables
• These can be distinguished by the repairedAt field in each sstable's metadata
• Once the leader receives a Merkle tree from each node, it compares the trees and issues streaming requests
• This is just as in the classic repair case
• Finally the leader issues an anti-compaction command
• Anti-compaction is the process of segregating repaired and unrepaired ranges
into separate sstables
• Repaired sstables are written with a new repairedAt field denoting the time of
repair
Best Practices for Repair
• Run repair weekly
• Run repair on a small number of nodes at a time
• Schedule for low usage hours
• Run repair on a partition or subrange of a partition
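Typical invocations following these practices (keyspace name and token values are placeholders):
Repair only the primary ranges owned by this node:
$ bin/nodetool repair -pr my_keyspace
Repair a specific token subrange:
$ bin/nodetool repair -st -9223372036854775808 -et -3074457345618258603 my_keyspace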
Repair Type - Primary and Secondary Ranges
• Primary Range (inside ring) - first node that
data is stored on based on partition key
• Secondary Range (outside rings) - additional
replicas of the same data
• What are the implications of repair?
SSTableSplit - Why?
• You ran a major compaction with nodetool compact
• Maybe you used SizeTieredCompactionStrategy for a major compaction
• This can result in an excessively large SSTable
• Good idea to split the table
• Using the size tiered compaction, we may have gotten some really large
files over time
• May find yourself with a 200GB file you need to split up
• It is an anti-compaction in a way
Usage: sstablesplit
• Stop Cassandra Node
$ sudo service cassandra stop
$ sstablesplit [options] filename [filename]*
Options:
--debug Displays stack traces
--no-snapshot Do not snapshot the SSTables before splitting
--size Maximum size in MB for output SSTables (default: 50)
--verbose Verbose output
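For example (the data file path and target size are placeholders), splitting a large SSTable into 100 MB pieces while the node is stopped:
$ sstablesplit --size 100 /var/lib/cassandra/data/my_keyspace/users_by_state-*/*-Data.db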
Snapshot
Why Backup? Why Snapshots?
• We do not back up like traditional databases, where we copy out all the data
• It is a distributed system; every server or node has a portion of the data
• SSTables are immutable, which is great! Makes them easy to back up.
• A snapshot represents the state of the data files at a given point in time
• Snapshots create hard links on the file system, as opposed to copying data
• Therefore very fast!
• Can consist of a single table, a single keyspace, or multiple keyspaces
What is a Snapshot
• Represents the state of the data files at a particular point in time
• Snapshot directory is created (this has pointers)
• Then you can either leave it there or copy it offline to an NFS mount or to S3
Incremental Backup
• Incremental backups create a hard link to every SSTable upon flush
• User must manually delete them after creating a new snapshot
• Incremental backups are disabled by default
• Configured in the cassandra.yaml file by setting incremental_backups to true
• Need a snapshot before taking incremental backups
• Snapshot information is stored in a snapshots directory under each table
directory
• A snapshot need only be stored offsite once
Location of Snapshot and Incremental Backups
• Snapshots and incremental backups are stored locally on each Cassandra node
• Vulnerable to hardware failures
• Commonly, the files are copied to an off-node location
• Open source programs like tablesnap are useful for backing up to S3
• Scripts can be used to automate copying backup files to another machine: cron + bash script, rsync, etc.
Auto Snapshot
• Critical safety factor
• If enabled, before a table is truncated or tables/keyspace is dropped, a
snapshot is taken
• A configuration in cassandra.yaml, default is true
• Recommended to keep it true
How to take Snapshot?
$ bin/nodetool [options] snapshot (-cf <table> | -t <tag>) -- <keyspace>
We can specify to take a snapshot of
• One or more keyspaces
• A specific table to back up
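For example (tag, keyspace and table names are placeholders):
Snapshot a single keyspace with a tag:
$ bin/nodetool snapshot -t before_upgrade -- my_keyspace
Snapshot a single table:
$ bin/nodetool snapshot -cf users_by_state -t before_upgrade -- my_keyspace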
Clean up Snapshots
• The $ bin/nodetool clearsnapshot command removes snapshots
• Takes the same options as other nodetool commands
• Specify the snapshot name and keyspace
• Not specifying a snapshot name removes all snapshots
• Remember to remove old snapshots before taking new ones - previous
snapshots are not automatically deleted
• To clear snapshots on all nodes at once, use parallel ssh utility - pssh,
clusterssh
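For example (tag and keyspace are placeholders):
Remove one named snapshot:
$ bin/nodetool clearsnapshot -t before_upgrade -- my_keyspace
Remove all snapshots on the node:
$ bin/nodetool clearsnapshot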
How to restore?
• Most common method is to delete current data files and copy the
snapshot and incremental files to the appropriate data directories
• If using incremental backups, copy the contents of the backups directory to
each table directory
• The table schema must already be present in order to restore this way
• Restart and repair the node after the file copying is done
• Another method is to use sstableloader
• Great if you are loading it into a different size cluster
• Must be careful about its use as it can add significant load to cluster while
loading
Cluster Wide Backup And Restore
• OpsCenter
• SSH Programs - pssh, clusterssh
• Honorable mention - tablesnap and tablestore
• For Cassandra backup to AWS S3
• Recovery
Multi Data Center Implementation
• Node - the virtual or physical host of a single Cassandra instance
• Rack - a logical grouping of physically related nodes
• Data Center - a logical grouping of a set of racks
• Enables geographically aware read and write request routing
• Each node belongs to one rack in one data center
• The identity of each node’s rack and data center may be configured in its
conf/cassandra-rackdc.properties
Adding a Secondary Data Center
• Ensures continuous availability of your data and application
• Live backup
• Improved performance
• Lower latency by serving requests from a nearby geographic location
• Analytics
• One DC is dedicated for transactional load and one for analytics
How clusters operate between data centers
• A data center is a grouping of nodes configured together for replication
purposes
• Data replicates across data centers automatically and transparently
• Consistency level can be specified at LOCAL level for read/write
operations
• Consistency level can also be specified as EACH
• EACH meaning - each data center has its own quorum
Implementing Multi Data Center Cluster
• Use NetworkTopologyStrategy rather than SimpleStrategy
• Use LOCAL_* consistency levels for read/write operations to limit latency
• If possible, define one rack for the entire cluster
• Specify the snitch
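Putting these together (keyspace, data center names and replication factors are placeholders):
CREATE KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
-- in cqlsh, keep reads and writes local to the coordinator's data center
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM my_keyspace.users_by_state WHERE state = 'CA';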
Cluster Sizing
Factors
• Application data model and use case
• Volume of data over time
• Velocity of data
• Replication Factor (RF)
1. Data Model and Use Cases
• What is the read/write bias?
• What are the SLAs?
• Do I need to tier out data? Hot / cold
2. Volume of Data
• Estimate the volume based on your application
3. Velocity
• How many writes per second
Example sizing assumptions:
• Replication Factor of 3
• Multiple data centers
• 25 ms read latency at the 95th percentile
• At max packet size, 50k writes / sec per node
• At max packet size, 40k reads / sec per node
• 4 TB per node to allow for compaction overhead
Cluster Sizing - Based on Volume
• (100G / day) x (30 days) x (6 months) = 18 TB
• (18 TB data) x (RF 3) = 54 TB
• 54 TB total data / (4 TB max per node) = 13.5, rounded up to 14 nodes
• (14 nodes) x (2 Data Center) = 28 nodes
Cluster Sizing - Sizing for Velocity
• (150 K writes/ sec load) / (50k writes/sec per node) = 3 nodes
• The node count required for volume already covers the write capacity
Cluster Sizing - Future Capacity
• Validate your assumptions often
• Monitor for changes over time
• Plan for increasing cluster size before you need it
• Be ready to draw down if needed
Cassandra Stress Tool
• Java-based utility built into Cassandra
• Used for basic benchmarking and load testing a Cassandra cluster
• Quickly determine how a schema performs
• Understand how your database scales
• Optimize your data model and settings
• Determine production capacity
Cassandra Stress
The YAML file is the configuration file for the stress tool. It is split into a few sections:
• DDL - for defining your schema
• Column distributions - for defining the shape and size of each column
globally and within each partition
• Insert Distributions - for defining how the data is written during the
stress test
• DML - for defining how the data is queried during the stress test
Cassandra Stress
DDL
• Define the keyspace and table information
• If the schema is not already defined, the stress tool will define it when it runs
• If the schema is already defined, then the stress tool needs to know only the keyspace and table names
Cassandra Stress YAML - Example
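A minimal sketch of the DDL portion of a stress profile (keyspace, table and column names are hypothetical):
keyspace: stress_demo
keyspace_definition: |
  CREATE KEYSPACE stress_demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
table: users_by_state
table_definition: |
  CREATE TABLE users_by_state (
    state text,
    city text,
    id uuid,
    name text,
    PRIMARY KEY ((state), city, id)
  );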
Stress Test: Column Distribution
• The columnspec section describes the different distributions to use for each column. These distributions model the size of the data in the column, the number of unique values, and the clustering of them within a given partition
• These distributions are used to auto generate data that looks like what
you would see in reality
Stress Test: Column Distribution
EXP(min..max) — An exponential distribution over the range [min..max]
EXTREME(min..max,shape) — An extreme value (Weibull) distribution over
the range [min..max]
GAUSSIAN(min..max,stdvrng) — A gaussian/normal distribution, where
mean=(min+max)/2, and stdev is (mean-min)/stdvrng
GAUSSIAN(min..max,mean,stdev) — A gaussian/normal distribution, with
explicitly defined mean and stdev
UNIFORM(min..max) — A uniform distribution over the range [min, max]
FIXED(val) — A fixed distribution, always returning the same value
Stress test: other column specs
Size distribution — Defines the distribution of sizes for text, blob, set and list
types (default of UNIFORM(4..8))
Population distribution — Defines the distribution of unique values for the
column values (default of UNIFORM(1..100B))
Cluster distribution — Defines the distribution for the number of clustering
prefixes within a given partition (default of FIXED(1))
Stress Test: column spec - example
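A sketch of a columnspec section for the hypothetical users_by_state table above (distributions chosen arbitrarily):
columnspec:
  - name: state
    size: fixed(2)                # two-letter state codes
    population: uniform(1..50)    # roughly 50 unique partition keys
  - name: city
    cluster: gaussian(1..20)      # cities per state within a partition
  - name: name
    size: gaussian(5..20)         # name length in characters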
Stress Test: insert distribution
• The insert section lets you specify how data is inserted during stress.
• For each insert operation you can specify the following
distributions/ratios:
• Partition distribution — The number of partitions to update per batch (default
FIXED(1))
• Select distribution ratio — The ratio of rows to insert into each partition, as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns); default FIXED(1)/1
• Batch type — The type of CQL batch to use. Either LOGGED/UNLOGGED (default
LOGGED)
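A sketch of the corresponding insert section (values are illustrative):
insert:
  partitions: fixed(1)     # update one partition per batch
  select: fixed(1)/1       # insert every possible row of the partition
  batchtype: UNLOGGED      # type of CQL batch to use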
Stress Tool: DML
• You can specify any CQL query on the table by naming them under the
'queries' section.
• The 'fields' field specifies whether the bind variables should be picked from the same row or from across all rows in the partition
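A sketch of a queries section for the same hypothetical table; 'samerow' draws bind values from a single generated row, 'multirow' from any rows in the partition:
queries:
  users_in_city:
    cql: SELECT * FROM users_by_state WHERE state = ? AND city = ?
    fields: samerow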