Cassandra Design Patterns - Sample Chapter

Chapter No. 1: Co-existence Patterns
For more information: http://bit.ly/1Pm0s5B

Cassandra Design Patterns
Second Edition

Build industry-strength data storage solutions with time-tested design methodologies using Cassandra

Rajanarayanan Thottuvaikkatumana

Cassandra is one of the most widely used NoSQL data stores and is frequently used by a huge number of heavy-duty Internet-scale applications. High performance and elastic scalability make it one of the most in-demand databases today.

This book starts with strategies to integrate Cassandra with other legacy data stores and progresses to ways in which migration from RDBMS to Cassandra can be accomplished. The journey continues with ideas for migrating data from cache solutions to Cassandra. With this, the stage is set, and the book moves on to some of the most commonly seen problems in applications when dealing with consistency, availability, and partition tolerance guarantees. The time-series pattern and log pattern, which deal with temporal data, are covered next. Keeping analytical needs in mind, you'll walk through different and interesting design patterns. The book concludes with a set of use cases where you can apply the patterns you've learned.

What you will learn from this book

Enable Cassandra to coexist with RDBMS and other legacy data stores
Migrate from RDBMS-based data stores and caching solutions to Cassandra
Explore various design patterns for building effective and robust storage solutions
See how Cassandra can be used in analytical use cases
Understand the behavior of Cassandra when trying to balance the needs of consistency, availability, and partition tolerance
Deal with timestamps related to data effectively
Apply the design patterns covered in this book in real-world use cases

Who this book is written for

This book is intended for big data developers who are familiar with the basics of Cassandra and wish to understand and utilize Cassandra design patterns to develop real-world big data solutions. Prior knowledge of RDBMS solutions is assumed.

Visit www.PacktPub.com for books, eBooks, code, downloads, and PacktLib.

In this package, you will find:

The author biography
A preview chapter from the book, Chapter 1 'Co-existence Patterns'
A synopsis of the book's content
More information on Cassandra Design Patterns Second Edition

About the Author


Rajanarayanan Thottuvaikkatumana, "Raj", is a seasoned technologist with more than 23 years of software development experience at various multinational companies.
He has lived and worked in India, Singapore, and the USA, and is presently based out
of the UK. His experience includes architecting, designing, and developing software
applications. He has worked on various technologies including major databases,
application development platforms, web technologies, and big data technologies.
Since 2000, he has been working mainly with Java-based technologies, doing heavy-duty server-side programming in Java and Scala. He has worked
on very highly concurrent, highly distributed, and high-transaction-volume systems
with NoSQL data stores such as Cassandra and Riak and caching technologies such
as Redis, Ehcache, and Chronicle Map. Raj has a lot of experience in integrating
Cassandra with Spark and has shared the Scala code repository on GitHub.
Raj holds one master's degree in Mathematics and one master's degree in Computer
Information Systems, and has many certifications in ITIL and Cloud Computing to
his credit.
Apart from all this, Raj is a prolific corporate trainer on various technical subjects
and has contributed to the Apache Cassandra project.
When not working with computers, Raj watches a lot of tennis and he is an avid
listener of classical music.
Even though Raj has worked on many white papers and training materials, this is
his first publication in the form of a book.

Preface
Apache Cassandra is one of the most popular NoSQL data stores based on the
research papers Dynamo: Amazon's Highly Available Key-value Store and Bigtable: A
Distributed Storage System for Structured Data. Cassandra is implemented with the best
features from both of these research papers. In general, NoSQL data stores can be
classified into the following groups.

Key-value data store

Column-family data store

Document data store

Graph data store

Cassandra belongs to the column-family data store group. Cassandra's peer-to-peer architecture avoids single points of failure in the cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centers. This makes Cassandra a linearly scalable data store; in other words, the greater your processing need, the more Cassandra nodes you can add to your cluster. Cassandra's multi-data-center support makes it a perfect choice for replicating data stores across data centers for disaster recovery, high availability, and the separation of transaction processing and analytical environments, building resiliency into the data store infrastructure.

The basic data abstraction in Cassandra starts with a column consisting of a name, value, timestamp, and optional time-to-live attributes. A row comes with a row key and a collection of sorted columns. A column family, or table, is a collection of rows. A keyspace is a collection of column families.

Cassandra 2.1 comes with a lot of new features, making it an even more powerful data store than ever before. The new CQL keyword IF NOT EXISTS lets you check the existence of an object before Cassandra creates a new one. Lightweight transactions and the batching of CQL commands give the user the ability to perform multistep atomic operations. Marking some columns in a column family as STATIC gives the user the ability to share data across all the rows of a given partition. User-defined data types give you the power to model your data store very close to the real-world objects used in applications written in object-oriented programming languages. Collection indexes may be used to index and query collection data types in Cassandra. Row cache improvements, changes to reads and writes, off-heap memory tables, incremental node repair, and the new counter implementation all make Cassandra perform much better than its previous releases.
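
As a quick illustration, here is a minimal CQL sketch of a few of these features; the keyspace, type, and table names are hypothetical and not taken from this book's examples:

CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
-- A user-defined type modeling a real-world object
CREATE TYPE IF NOT EXISTS address (street text, city text, zip text);
CREATE TABLE IF NOT EXISTS users (
    userid bigint,
    deviceid bigint,
    name text static,        -- STATIC: shared by all rows of a partition
    home frozen<address>,    -- user-defined types are frozen in 2.1
    PRIMARY KEY (userid, deviceid)
);
-- A lightweight transaction: the insert succeeds only if the row is new
INSERT INTO users (userid, deviceid, name) VALUES (1, 1, 'Jane') IF NOT EXISTS;
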
All the code samples used in this book are written for Cassandra 2.1.5, and all the examples conform to the CQL specification 3.x. The pre-CQL, Thrift API-based Cassandra CLI is used to list the physical layout of the column families. An insight into the physical layout is very important because a wrong choice of partition key or primary key will result in insidious performance problems. As a best practice, it is a good idea to create the column family, insert a couple of records, and use the list command in the Cassandra CLI with the column-family name. This will show the physical layout.
The term "design patterns" is a highly misinterpreted term in the software
development community. In a very general sense, it is a set of solutions for some
known problems in a very specific context. The way it is being used in this book is
to describe a pattern of using certain features of Cassandra to solve some real-world
problems. To refer to them and to identify them later, a name is also given to
each of such design patterns. These pattern names may not be related at all to any
similar sounding design pattern name used in other contexts and in other software
development paradigms.
Users love Cassandra because of its SQL-like interface, CQL, whose features are very closely related to those of an RDBMS even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, which let them write applications in their preferred programming languages. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And, Cassandra works!

What this book covers


Chapter 1, Co-existence Patterns, discusses how Cassandra may be used in a legacy
environment coexisting with RDBMSs.
Chapter 2, RDBMS Migration Patterns, discusses how some of the unique Cassandra
features may be used to provide value and hence migrate traditional RDBMS data to
Cassandra. It is a natural progression from coexistence with other legacy RDBMSs.
Chapter 3, Cache Migration Patterns, deals with some of the pitfalls of using caching
solutions and how Cassandra may be used to overcome them.
Chapter 4, CAP Patterns, talks about data integrity considerations, consistency,
availability, and partition tolerance and how some of the fine-tuning possibilities
in Cassandra may be used to design powerful data stores.
Chapter 5, Temporal Patterns, discusses temporal data and how some of the features
in Cassandra may be used to design powerful temporal data stores.
Chapter 6, Analytics Patterns, talks about the need for data analytics and how
Cassandra in conjunction with Spark may be used to serve the data analysis
use cases.
Chapter 7, Designing Applications, discusses designing a complete application that
makes use of all the design patterns discussed in this book.

Co-existence Patterns
"It's coexistence or no existence"
Bertrand Russell
Relational Database Management Systems (RDBMS) have been pervasive since the '70s. It is very difficult to find an organization without any RDBMS in its solution stack. Huge efforts have gone into the standardization of RDBMS, and because of that, if you are familiar with one RDBMS, switching over to another will not be a big problem. You will remain in the same paradigm without any major shifts. Pretty much all the RDBMS vendors offer a core set of features with standard interfaces and then include their own value-added features on top of it. There is a standardized language to interact with RDBMS called Structured Query Language (SQL). The same queries written against one RDBMS will work without significant changes in another RDBMS. From a skill set perspective, this is a big advantage because you need not learn and relearn new dialects of these query languages as and when the products evolve. This also makes migration from one RDBMS to another a relatively painless task. Many application designers design their applications in an RDBMS-agnostic way; in other words, the applications work with multiple RDBMS. Just change a few configuration properties of the application, and it will start working with a different but supported RDBMS. Many software products are designed to support multiple RDBMS through their configuration file settings, to suit customers' preferred choice of RDBMS.
In most RDBMS, a database schema organizes objects such as tables, views, indexes, stored procedures, sequences, and so on, into a logical group. Structured and related data is stored in tables as rows and columns. The primary key in a table uniquely identifies a row. There is a very strong theoretical background to the way data is stored in a table.


A table consists of rows and columns. Columns contain the fields, and rows contain
the values of data. Rows are also called records or tuples. Tuple calculus, which
was introduced by Edgar F. Codd as part of the relational model, serves as the basis for the structured query language, or SQL, for this type of data model. Redundancy is
avoided as much as possible. Wikipedia defines database normalization as follows:
"Database normalization is the process of organizing the attributes and tables of a
relational database to minimize data redundancy."
Since the emphasis is on avoiding redundancy, related data is spread across multiple
tables, and they are joined together with SQL to present data in various application
contexts. Multiple indexes that may be defined on various columns in a table can
help data retrieval, sorting needs, and maintaining data integrity.
In recent years, the amount of data generated by various applications has grown to be really huge, and traditional RDBMS have started showing their age. Most RDBMS are not able to ingest various types of data into their schema. When data starts flowing in quick succession, traditional RDBMS often become bottlenecks. When data is written into RDBMS data stores at such speed, within a very short period of time it becomes necessary to add more nodes to the RDBMS cluster, and SQL performance degrades on distributed RDBMS. In other words, as we enter the era of big data, RDBMS cannot handle the three Vs of data: volume, variety, and velocity.
Many RDBMS vendors came up with solutions for handling the three Vs of data, but these came at a huge cost. The costs involved in software licensing, the sophisticated hardware required, and the related ecosystem of building a fault-tolerant solution stack started affecting the bottom line in a big way. New-generation Internet companies started thinking of different solutions to solve this problem, and very specialized data stores started coming out of these organizations and open source communities, based on some popular research papers. These data stores are generally termed NoSQL data stores, and they address very specific data storage and retrieval needs. Cassandra is one of the most successful NoSQL data stores, and it has a good deal of similarity with traditional RDBMS. This similarity comes in handy when Cassandra is adopted by an enterprise: because the abstractions of a typical RDBMS and those of Cassandra have a few things in common, new users can relate Cassandra concepts to RDBMS concepts. From a logical perspective, Cassandra tables look similar to RDBMS tables from the users' point of view, even though the underlying structures of these tables are totally different. Because of this, Cassandra is a very good fit for deployment alongside traditional RDBMS, to solve some of the problems that RDBMS are not able to handle.


The caveat here is that, because RDBMS tables and Cassandra column families (also known as Cassandra tables) look similar to end users, many users and data modelers try to use Cassandra in exactly the same way they would model and use an RDBMS schema, and they run into serious deployment issues. How do you prevent such pitfalls? At the outset, Cassandra may look like a traditional RDBMS data store, but the fact is that it is not the same. The key here is to understand the differences from a theoretical perspective as well as from a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.
In Cassandra, the terms "column family" and "table" are
synonymous. The Cassandra Query Language (CQL) command
syntax uses the term "table."

Why can Cassandra be used along with other RDBMS? The answer lies in the limitations of RDBMS. Some of the obvious motivations are cost savings, the need to scale out, handling high-volume traffic, complex queries slowing down response times, increasingly complex data types, and so on. The most important reason for Cassandra to coexist with legacy RDBMS is that you need to preserve the investments already made and make sure that the current applications keep working without any problems. So, you should protect your investments, make your future investments in a smart NoSQL store such as Cassandra, and follow a one-step-at-a-time approach.

A brief overview of Cassandra

Where do you start with Cassandra? The best place is to look at new application development requirements and take it from there. Look at cases where there is a need to denormalize the RDBMS tables and keep all the data items together, which would have been distributed across tables if you were to design the same solution in an RDBMS. If an application writes a set of data items together into a data store, why do you want to separate them out? There is no need to worry about redundancy. This is the new NoSQL philosophy and the new way to look at data modeling in NoSQL data stores. Cassandra supports fast writes and reads. Initial versions of Cassandra had some performance problems, but a huge number of optimizations have gone into making the latest version of Cassandra perform much better for reads as well as writes. There is no problem with consuming space, because secondary storage is getting cheaper and cheaper. A word of caution here: it is fine to write data into Cassandra at whatever level of redundancy, but the data access use cases have to be thought through carefully before settling on the Cassandra data model. The data is stored on disk to be read at a later date, and these reads have to be efficient and give the required data in the desired sort order.

In a nutshell, you should decide how you want to store the data and make sure that it gives you the data in the desired sort order. There is no hard and fast rule for this; it is purely up to the application requirements. That is the other shift in the thought process: instead of thinking from a pure data model perspective, start thinking from the application's perspective. How is the data generated by the application? What are the read requirements? What are the write requirements? What response time is expected for some of the use cases? Depending on these aspects, design the data model. In the big data world, the application becomes the first-class citizen, and the data model gives up the driving seat in the application design. Design the data model to serve the needs of the applications.
In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge: you may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, when a report is generated, the query plan is going to be very complex most of the time. Consumption of processing power is another factor to consider when generating such reports on the fly. Because of these complexities, it is common to keep separate tables containing data exported from the transactional tables for reporting requirements. Martin Fowler emphasizes the need for separating reporting data from the operational data in his article, Reporting Database. He states:
"Most Enterprise Applications store persistent data with a database. This database
supports operational updates of the application's state, and also various reports
used for decision support and analysis. The operational needs and the reporting
needs are, however, often quite different - with different requirements from a
schema and different data access patterns. When this happens it's often a wise idea
to separate the reporting needs into a reporting database, which takes a copy of the
essential operational data but represents it in a different schema".
This is a great opportunity to start with NoSQL stores such as Cassandra as a
reporting data store.


Data aggregation and summarization are common requirements in any organization. This helps to control data growth by storing only the summary
statistics and moving the transactional data into archives. Often, these aggregated
and summarized data are used for statistical analysis. In many websites, you can
see the summary of your data instantaneously when you log in to the site or when
you perform transactions. Some of the examples include the available credit limit
of credit cards, the available number of text messages, remaining international call
minutes in a mobile phone account, and so on. Making the summary accurate and
easily accessible is a big challenge. Most of the time, data aggregation and reporting
go hand in hand. The aggregated data is used heavily in reports. The aggregation
process speeds up the queries to a great extent. In RDBMS, it is always a challenge
to aggregate data, and you can find new requirements coming all the time. This is
another place you can start with NoSQL stores such as Cassandra.
Now, we are going to discuss some aspects of the denormalization, reporting, and
aggregation of data using Cassandra as the preferred NoSQL data store.

Denormalization pattern
Denormalize the data and store it as column families in Cassandra. This is a very common practice in NoSQL data stores. There are many reasons why you might do this in Cassandra. The most important aspect is that Cassandra doesn't support joins between column families. Redundancy is acceptable in Cassandra as storage is cheap, and this is even more relevant for Cassandra because it runs on commodity hardware, while many RDBMS need much better hardware specifications for optimal performance when deployed in production environments. Moreover, the read and write operations are highly efficient even if the column families are huge in terms of the number of columns or rows. In a traditional RDBMS, you can create multiple indexes on a single table on various columns, but in Cassandra, secondary indexes are very costly and they affect the performance of reads and writes.

Motivations/solutions
In many situations, whenever a new requirement comes in, thinking in the traditional RDBMS way will lead to many problems, such as poor read/write performance, long-running processes, overly complex queries, and so on. In such situations, one of the best approaches is to apply denormalization principles and design column families in Cassandra.


In the traditional RDBMS, the operational tables contain the data related to the
current state of the entities and objects involved. So, maintaining lookups for
preserving the integrity of the data is perfectly sensible. But when you have to
maintain history, the concept of lookups will not work. For example, when you
are generating a monthly bank account statement, the current statement should
reflect the current address of the account holder. After the statement is generated,
if the address of the account holder changes during the next reporting period, then
the previous statements must not reflect the new address. They must have the old
address, which was correct on the date that the statement was generated. In such
situations, it does not make sense to keep a set of normalized tables for the historical data. The best thing to do is to denormalize the data and maintain it in separate column families in Cassandra.
Complex queries are part of any system. In the RDBMS world, views or stored procedures are typically designed to shield the complexities of queries from the end users who design data entry forms and generate reports. They are useful to run ad hoc queries, retrieve special sets of records, and so on. Even though this solves the complexity problem from an end user perspective, the real problem remains unsolved: when you run those queries or stored procedures, the complex joins of the tables still happen under the hood and on the fly. Because of this, the running time is long and the processing requirements are huge. In the NoSQL world, it is better to denormalize the data and maintain it as big column families in Cassandra.
Immutable transactions are good candidates for denormalization. Because they capture the present state, the references they make to other table records can be carried with the record of the transaction, even if those references change in the future. The only use those transactions will have in the future is for read use cases. An immutable transaction means that once a transaction record is written to the system, nothing is going to change in the future. There are many examples in real life that conform to this type, such as banking transaction records, weather station reading records, utility monitoring reading records, system monitoring records, and service monitoring records; you can find countless examples in your own real-life applications. Event records originating from event management systems are possible candidates for denormalization, but caution has to be exercised if the event status changes and the same record is being updated. If the event management system generates multiple event records for state changes of the same event, denormalization will be a good fit. Capture these denormalized records in Cassandra.


Performance-boosting requirements are good situations where denormalization may be applied. Many applications perform poorly when data is being written to the RDBMS. There is a strong possibility that this happens because a single transaction writes data into multiple tables, and many indexes are being used in those tables. Careful analysis and proper characterization of the performance problems often identifies data spread across multiple tables as the root cause. In such cases, denormalization of the data is the obvious option, and Cassandra comes in as a good fit there.

Data modeling experts in the RDBMS world are typically not comfortable with denormalization, because there is a general belief that data integrity is maintained by the RDBMS table design itself, along with other RDBMS features such as triggers, stored procedures, and so on. Data modeling in Cassandra is different. Here, along with the data model, all the application use cases that involve data manipulation are also taken into consideration, so data integrity maintenance is the responsibility of the applications as well. Here, denormalization is the norm, and the applications using the column families are supposed to handle data integrity to make sure that the data is good.

Best practices
Denormalization must be done with utmost care. Normalization avoids redundancy and promotes principles to maintain data integrity. When you denormalize, the only rule that is relaxed is redundancy. Data integrity must be maintained for a successful application even if data is denormalized. With normalized tables in an RDBMS, the primary key constraints, foreign key constraints, unique key constraints, and so on serve as watchdogs that maintain data integrity even if the applications don't care about them. Verification and validation happen even at the RDBMS level. When moving to a NoSQL store such as Cassandra, many such goodies of the RDBMS are lost. So, it is the responsibility of the application designers to prevent insert anomalies, update anomalies, and delete anomalies. Even though Cassandra comes with lightweight transactions, most of the data integrity control measures have to be taken on the application's side. Cassandra security has to be used heavily to make sure that only the proper applications with the right credentials are writing data to the column families.
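
As a sketch of that last point, assuming password authentication and authorization are enabled in cassandra.yaml, an application-specific account can be restricted to the column families it owns; the user name here is hypothetical:

-- Create a non-superuser account for the order-processing application
CREATE USER order_app WITH PASSWORD 'secret' NOSUPERUSER;
-- Allow it to write (INSERT, UPDATE, DELETE, TRUNCATE) to one column family only
GRANT MODIFY ON TABLE PacktCDP1.CustomerOrder TO order_app;
-- Allow it to read the same column family
GRANT SELECT ON TABLE PacktCDP1.CustomerOrder TO order_app;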


Example
Let's take the case of a very simple normalized relation from the RDBMS world, as shown in the following screenshot. There are two tables in the relationship: one stores the customer details, and the other stores the order details of the customers. This is a one-to-many relation, where every customer record in the Customer table may have zero or more order records in the Order table. These two tables are joined by the CustomerId primary key.

Figure 1

In the Order table, CustomerId is a foreign key referring to the CustomerId of the Customer table. When you denormalize this to a Cassandra column family, it will look like the one given in the following figure. In Cassandra, the primary key is a combination of CustomerId and OrderId in the CustomerOrder column family. The CustomerId column becomes the partition key for this column family.

Figure 2

The denormalized Cassandra column family has all the fields of its normalized counterpart. The following script can be used to create the column family in Cassandra. This is the sequence of activities given in the script:
1. Create the keyspace.
2. Create the column family.
3. Insert one record into the column family.
The reason one record is inserted into the column family is to demonstrate the difference between the physical layout of the rows stored in Cassandra and how the queries return the same records.

It is very important to make sure that the physical layout of the column family is as expected and to see how the columns are getting stored. To view the physical layout of the records, the old Cassandra CLI [cassandra-cli] must be used. The Cassandra CLI is used throughout this book wherever there is a need to view the physical layout of the data in the column families.

Whenever a new column family is defined in Cassandra, it is very important to have an understanding of its physical and logical views; this helps in characterizing the column family's growth and other behaviors. The following script is to be executed in cqlsh:
CREATE KEYSPACE PacktCDP1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
USE PacktCDP1;

CREATE TABLE CustomerOrder (
    CustomerId bigint,
    OrderId bigint,
    CustomerName text static,
    Email text static,
    OrderDate timestamp,
    OrderTotal float,
    PRIMARY KEY (CustomerId, OrderId)
)
WITH CLUSTERING ORDER BY (OrderId DESC);

INSERT INTO CustomerOrder (CustomerId, OrderId, CustomerName, Email, OrderDate, OrderTotal) VALUES (1, 1, 'Mark Thomas', 'mt@example.com', 1433970556, 112.50);

A great deal of attention needs to be given to choosing the primary key for a given column family. Conceptually, this is totally different from the RDBMS world. It is true that a primary key uniquely identifies a row; it may be an individual column or a combination of multiple columns. The difference in Cassandra is the way in which a row is stored in the physical nodes of Cassandra. The first column of the primary key combination is known as the partition key. All the rows in a given column family with the same partition key get stored in the same physical Cassandra node. The commands in the following script are to be executed in the Cassandra CLI interface:
USE PacktCDP1;
list CustomerOrder;
Using default limit of 100
Using default cell limit of 100
RowKey: 1
=> (name=1:, value=, timestamp=1433970886092681)
=> (name=1:customername, value=4d61726b2054686f6d6173, timestamp=1433970886092681)
=> (name=1:email, value=6d74406578616d706c652e636f6d, timestamp=1433970886092681)
=> (name=1:orderdate, value=000000005578a77c, timestamp=1433970886092681)
=> (name=1:ordertotal, value=42e10000, timestamp=1433970886092681)

1 Row Returned.

In the output, take a look at the row key. In the preceding example, as per the primary key used, the CustomerId field will be the row key. In other words, for every CustomerId, there will be one wide row. It is termed wide because all the records of a given CustomerId field, which is the partition key, will be stored in one row, so one physical row of the Cassandra column family stores many records. For the use cases the column family is designed for, it is important to consider whether a growing row is going to run out of the prescribed number of columns. If so, the design has to be looked at again and a more appropriate partition key has to be identified. At the same time, it is not economical to have column families with only very few rows and very few columns.

In a typical RDBMS, the customer details would be in one or more tables, and the transaction records, such as order details, would be in another table. But here, denormalization is applied, and the data coming from those two logical entities is captured in one Cassandra column family.
The following CQL SELECT command gives the output in a human readable format:
SELECT * FROM CustomerOrder;

The output can be seen in this screenshot:

Figure 3


Reporting pattern
Design separate column families in Cassandra for the reporting needs. Keep the operational data in the RDBMS and the reporting data in Cassandra column families. The most important reason why this separation of concerns is good practice is the tunable consistency feature of Cassandra. Depending on the use cases, and for performance reasons, various consistency parameters may be used in the Cassandra column families and in the applications using Cassandra for read and write operations. When data ingestion happens in Cassandra, the consistency parameters have to be tuned to get fast writes.

The consistency parameters used for fast writes may not be suitable for fast reads, so it is better to design separate column families for reporting purposes. If various types of reports have to be created from the same operational data, it may be wise to create separate column families for these different reporting requirements. It is also a common practice to preprocess the operational data to generate fast reports. Historical reporting, data archival, statistical analysis, providing data feeds, and inputs for machine learning algorithms, such as recommendation systems, all benefit a lot from accessing data in Cassandra column families specifically designed for reporting.
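
As a minimal sketch of what this tuning looks like in practice, cqlsh lets you set the consistency level per session; an application driver would typically set it per query. The values chosen here are only illustrative:

-- Fast writes into an ingestion column family: ONE is enough
CONSISTENCY ONE;
INSERT INTO PacktCDP1.CustomerOrder (CustomerId, OrderId, OrderTotal) VALUES (2, 1, 45.0);

-- Stronger reads from a reporting column family
CONSISTENCY QUORUM;
SELECT * FROM PacktCDP1.CustomerOrder WHERE CustomerId = 2;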

Motivations/solutions
Coming from the RDBMS world, people rarely think about reporting needs at the very beginning. The main reason is that there is good flexibility with SQL queries, and you can get pretty much any kind of report from the RDBMS tables because you can join RDBMS tables. So, the application designers and data modelers focus on the data and the business logic first and think about the reports later. Even though this strategy works, it introduces lots of application performance problems, either toward the end of the application development or when the data volume grows beyond a certain limit. The best thing to do in these kinds of situations is to design separate Cassandra column families for the reporting needs.


In social media and real media applications commonly used by millions of users at a given time, reporting needs are huge, and the performance of those reports is even more paramount. For example, in a movie streaming website, users post videos, users follow other users, and followers like the videos posted by the users whom they are following. Now, take two important views in the website: the first gives the list of videos liked by a given user, and the second gives the list of users liking a given video. In the RDBMS world, it is fine to use one table to store the data items needed to generate these two reports. In Cassandra, it is better to define two column families to generate these two reports, as sketched after this paragraph. You may be wondering why this can't be achieved by reading from a single column family. The reason is that the sorting order matters in Cassandra, and the records are stored in sorted order.
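
A minimal sketch of those two column families follows; the names and columns are hypothetical, chosen only to illustrate the two query paths:

-- Query path 1: all videos liked by a given user; UserId is the partition key
CREATE TABLE VideosLikedByUser (
    UserId bigint,
    VideoId bigint,
    LikedAt timestamp,
    PRIMARY KEY (UserId, VideoId)
);

-- Query path 2: all users liking a given video; VideoId is the partition key
CREATE TABLE UsersLikingVideo (
    VideoId bigint,
    UserId bigint,
    LikedAt timestamp,
    PRIMARY KEY (VideoId, UserId)
);
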
In Cassandra, it may be necessary to create different column families to produce different reports. As mentioned earlier, it is fine to write the same piece of data into multiple Cassandra column families. There is no need to panic, as the latest versions of Cassandra come with batching capability for data manipulation operations, and in this way data integrity may be ensured. It may not be as flexible and powerful as in many RDBMS, but there are ways to do this in Cassandra. For example, take the case of a hypothetical Twitter-like application.

Users tweet, and the tweets have to be shown differently in the default view, differently in the listing using hashtags, differently in the user timeline, and so on. Assuming that Cassandra is being used for storing the tweets, you may design different Cassandra column families for materializing these different views. When a new tweet comes in, that record will be inserted into all these different column families. To maintain data integrity, all these INSERT statements may be designed as an atomic unit by enclosing them between the BEGIN BATCH and APPLY BATCH statements of CQL, as batches are atomic by default.
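
A sketch of such an atomic batch is given here; the column family names and columns are hypothetical:

BEGIN BATCH
    -- The same tweet goes to every column family that materializes a view
    INSERT INTO TweetsByUser (UserId, TweetId, Body) VALUES (1, 100, 'Hello');
    INSERT INTO TweetsByHashtag (Hashtag, TweetId, Body) VALUES ('cassandra', 100, 'Hello');
    INSERT INTO UserTimeline (FollowerId, TweetId, Body) VALUES (42, 100, 'Hello');
APPLY BATCH;
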
When it comes to reporting, RDBMS fail miserably in many use cases. This is seen when the report data is produced by many table joins and the number of records in these tables is huge. This is a common situation when there is complex business logic to be applied to the operational data stored in the RDBMS before producing the reports. In such situations, it is always better to create separate column families in Cassandra for the reporting needs. This may be done in two ways. The first method is the online way, in which the operational data is transformed into analytical or reporting data and stored in Cassandra column families as it arrives. The second method is the batching way: at regular intervals, a batch process with business logic processors transforms the operational data into analytical or reporting data and stores it in Cassandra column families.


Predictive analytics, or predictive modeling, is very common these days in commercial and scientific applications. A huge amount of operational data is processed, sliced, and diced by data scientists using various machine learning algorithms to produce outputs for solving classification, regression, and clustering problems. These are highly calculation-intensive operations that deal with huge amounts of operational data. It is practically impossible to do these calculations on the fly for instantaneous requests from the users of the system. In this situation, the best course of action is to continually process the data and store the outputs in Cassandra column families.
There is a huge difference between reporting data and analytical data. The former deals with producing data from the data store as per the user's selection criteria. The latter deals with data to be processed to give a result to the user as an answer to some of their questions, such as "Why is there a budget overrun this year?" or "When did the forecast and the actuals start deviating?" Whenever such questions are asked, the analytical data is processed and a result is given.

Graphical representation of vital statistics is an important use case in many applications. These days, many applications provide a huge amount of customization for users to generate user-defined graphs and charts. To make this happen, sophisticated graphing and charting software packages are available in the market. Many of these software packages expect the data in a certain format to produce the graphical images. Most of the time, the operational data may not be available in the specific format suitable for these specialized software packages. In these situations, the best choice for an application designer is to transform the operational data to suit the graphing and charting requirements. This is a good opportunity to use separate Cassandra column families to store the data specifically for graphing and charting.
Operational and historical reporting are two different needs in many applications. Operational data is used to report the present, and historical data is used to report the past. Even in operational reporting use cases, there are very good reasons to separate the reporting data into different Cassandra column families. In historical reporting use cases, it is even more important because the data grows over a period of time. If the velocity of the operational data is very high, then historical reporting becomes even more cumbersome. Bank account statements, credit card statements, payslips, telephone bills, and so on are very good examples of historical reporting use cases.


In the past, organizations used to keep historical data in the system only as long as it was needed for compliance and governance requirements. Things have changed: many organizations have started keeping data eternally to provide value-added services, as storage is becoming cheaper and cheaper these days. Cassandra is a linearly scalable NoSQL data store; the more storage requirements you have, the more nodes you can add to its cluster, as and when you need them, without any reconfiguration or downtime, and the cluster will start making use of the newly added nodes. Read operations are really fast, so reporting is a heavily used use case supported by Cassandra column families with the clever use of tunable consistency.
In old-generation applications, operational data is archived for posterity and auditing purposes. Typically, after its decided lifetime, operational data is taken offline and archived so that the data growth does not affect the day-to-day operations of the system. The main reason why this archival is needed is the constraints in the data storage solutions and the RDBMS used. Clustering and scaling out an RDBMS-based data store is very difficult and extremely expensive. The new-generation NoSQL data stores, such as Cassandra, are designed to scale out and run on commodity hardware, so the need to take the data offline doesn't exist at all. Design Cassandra column families to hold the data marked for archival, and it can be kept online forever. Watch the data growth, and keep adding more Cassandra nodes to the Cassandra cluster as and when required.
The emergence of the cloud as the platform of choice and the proliferation of Software as a Service (SaaS) applications introduced one more complexity into application design: multitenancy. Multitenancy promotes the use of one instance of an application catering to the needs of multiple customers. Most of these SaaS applications give their customers a good amount of customization in terms of features and reports. The service providers who host these SaaS applications have a new challenge of maintaining customer-specific data and reports. This is a good use case for separate Cassandra column families to maintain the customer-specific data needed for tailor-made reports, as sketched below.
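
One possible shape for such a column family, with purely hypothetical names, keys each report's data by tenant so that one tenant's rows live together in their own partitions:

-- Each (TenantId, ReportName) pair gets its own partition of report rows
CREATE TABLE TenantReportData (
    TenantId bigint,
    ReportName text,
    RowId timeuuid,
    Payload text,
    PRIMARY KEY ((TenantId, ReportName), RowId)
);
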
Financial exchanges, trading systems, mobile phone services, weather forecasting systems, airline reservation systems, and the like produce high-volume data and process it with subsecond responses to their end users. Obviously, the reporting needs are also huge in those applications, in terms of the number of records to be processed and the complexity of the data processing required. In all these systems, separating operational data and reporting data is a very important requirement. Cassandra is a good fit in all these reporting use cases.


Data transformation is an important step in producing many reports. In enterprise application integration use cases, one application will often have to provide data to another application in a certain format; XML and JSON are two important data exchange formats. In applications with a service-oriented architecture, whether they consume or produce services, data is required in specific formats. Whatever the technology used to perform the transformation, because of the volume of the data, it is practically impossible to process this data on demand on a real-time basis. Preprocessing is required in many situations to produce the data in specific formats. Even though RDBMS support data types such as BLOB and CLOB to store huge chunks of data, the limitations of the RDBMS will often take effect. NoSQL data stores such as Cassandra are designed to handle very sophisticated data types built using user-defined data types, and it is easy to use them for storing preprocessed data for future reporting purposes, as in the sketch that follows.
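
For instance, a preprocessed feed entry could be modeled with a user-defined type; the type and column family here are hypothetical (note that in Cassandra 2.1, user-defined types must be frozen):

CREATE TYPE FeedEntry (
    title text,
    link text,
    published timestamp
);

CREATE TABLE PreprocessedFeeds (
    FeedId bigint,
    EntryId timeuuid,
    Entry frozen<FeedEntry>,
    PRIMARY KEY (FeedId, EntryId)
);
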
Providing data feeds to external systems is a very common requirement these days.
This is a very effective mechanism for disseminating data asynchronously to the
subscribers of the data through popular mechanisms such as RSS feeds. The data
designated for the data feeds must be derived from the operational data. Cassandra
column families may be designed to serve such requirements.

Best practices
Separation of operational and reporting data stores is a good idea in many cases, but care must be taken to check whether it violates any of the data integrity or business logic invariants in the system. Immutable data items are really good candidates for this separation because they are not going to be changed anytime in the future. It is better to keep the frequently changing data items in the operational data stores themselves.

In Cassandra, if column families are used only for reporting purposes, care must be taken over how the data is loaded into the column families. Typically, these column families will be optimized for fast reads. If too many writes into those column families are going to take place, the consistency parameters will have to be tuned very well so that they do not defeat the original purpose of creating those column families, which is to read data from them. It is very difficult to tune Cassandra column families to suit the needs of both very fast writes and very fast reads; one of these will need to be compromised. Since the original purpose of these column families is to provide fast reads, the speed of writes must be controlled. A more detailed treatment of tuning for fast reads and fast writes in Cassandra column families is given in the coming chapters of this book.


Example
Let's take the case of a normalized set of tables from an application using an RDBMS, as shown in Figure 4. There are three tables in the relation: one stores the customer details, another stores the order details, and the third stores the order line items. Assume that these are operational tables and the data size is huge. There is a one-to-many relation between the Customer table and the Order table: for every customer record in the Customer table, there may be zero or more order records in the Order table. There is also a one-to-many relation between the Order table and the OrderItems table: for every order record in the Order table, there may be zero or more order item records in the OrderItems table.

Figure 4

Assume that the requirement is to create a Cassandra column family to generate a monthly customer order summary report. The report should contain one record for each customer, containing the order total for that month. The Cassandra column family will look like the one given in Figure 5.

In the Cassandra column family MonthlyCustomerOrder, a combination of the CustomerId, OrderYear, and OrderMonth columns forms the primary key. The CustomerId column will be the partition key. In other words, all the records for a given customer will be stored in one wide row of the column family.

Figure 5


Assuming that the keyspace has been created using the script given earlier, the following script will create only the required column family and then insert one record. Filling in the data in this column family may be done on a real-time basis or as a batch process. Since the data required for filling the Cassandra column family is not readily available from the RDBMS tables, preprocessing needs to be done to prepare the data that goes into this Cassandra column family:

USE PacktCDP1;

CREATE TABLE MonthlyCustomerOrder (
    CustomerId bigint,
    OrderYear int,
    OrderMonth int,
    CustomerName text static,
    OrderTotal float,
    PRIMARY KEY (CustomerId, OrderYear, OrderMonth)
)
WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);

INSERT INTO MonthlyCustomerOrder (CustomerId, OrderYear, OrderMonth, CustomerName, OrderTotal) VALUES (1, 2015, 6, 'George Thomas', 255.5);

The following script gives the details of how the row in the Cassandra column family is physically stored. The commands given here are to be executed in the Cassandra CLI interface:

USE PacktCDP1;
list MonthlyCustomerOrder;
Using default limit of 100
Using default cell limit of 100
RowKey: 1
=> (name=2015:6:, value=, timestamp=1434231618305061)
=> (name=2015:6:customername, value=476f657267652054686f6d6173, timestamp=1434231618305061)
=> (name=2015:6:ordertotal, value=437f8000, timestamp=1434231618305061)

1 Row Returned.
Elapsed time: 42 msec(s).


The following CQL SELECT command gives the output, shown in Figure 6, in a human-readable format:

SELECT * FROM MonthlyCustomerOrder;

Figure 6

Aggregation pattern
Design separate Cassandra column families to store the aggregated and summarized operational data. Aggregated data is used for various reporting and analytical purposes. Cassandra does not inherently support any joins between column families, nor does it support the commonly seen SQL aggregation constructs such as GROUP BY, HAVING, and so on. Because of these constraints, it is better to preprocess the operational data to do the aggregation and summarization, and store the processed data in Cassandra column families. The inability to do real-time aggregation using CQL can be converted into an advantage of using Cassandra: serving fast reads of already-aggregated data and exploiting its highly scalable architecture, as the sketch that follows illustrates.
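
As one illustration, a batch or streaming job could maintain a pre-aggregated column family using Cassandra's counter type; the table and column names here are hypothetical:

-- One partition per day; one row of aggregates per store
CREATE TABLE DailyStoreSales (
    SaleDate text,
    StoreId bigint,
    OrderCount counter,
    PRIMARY KEY (SaleDate, StoreId)
);

-- The aggregating process increments the counter for every order it sees
UPDATE DailyStoreSales SET OrderCount = OrderCount + 1
WHERE SaleDate = '2015-06-15' AND StoreId = 42;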

Motivations/solutions
SQL on an RDBMS provides a great deal of flexibility to store and retrieve data, apply computations, and perform aggregations and summarizations effortlessly. All of this works fine as long as the data volume is manageable. The moment the data volume goes above the threshold and there is a need to scale out to a distributed model, everything comes to a screeching halt. When the data is distributed across multiple RDBMS hosts, the queries and computations on top of it crawl even more. Because of these limitations, the separation of aggregated and operational data into separate tables became common practice. In this era of the Internet of Things (IoT), even aggregated and summarized data starts overflowing in no time. In such situations, it is a good idea to move this already processed, aggregated, and summarized data into Cassandra column families. Cassandra can handle loads of such data and provide ultra-fast reads, even when the nodes are highly distributed across multiple racks and data centers.


Over the past couple of decades, there has been a clear change in how data is aggregated. Originally, RDBMS table data was processed and aggregated through batch application processes. Then, the traditional batch processes gave way to divide-and-conquer methodologies such as MapReduce and Hadoop to aggregate the data, but even then, the aggregated data remained in separate RDBMS instances or in distributed filesystems such as the Hadoop Distributed File System (HDFS). HDFS-based storage was good in some use cases, but data access and reporting became difficult as the big data market matured in terms of the available tools and applications. Now, NoSQL data stores such as Cassandra offer good interoperability with other applications, and they can be used as a highly scalable data storage solution.
The drill-down capability has been a very common feature of many applications for a long time. In the user interfaces, very high-level aggregated and summarized data is presented in the form of tables or graphs. When the user clicks on a link, a button, or a section of a graph, the application presents the associated data that was used to create the aggregation or summarization. Typically, there will be multiple levels of these drill-downs, and to provide that, the data must be aggregated at different levels. All of these operations are computationally intensive as well as expensive. Cassandra is a good fit to store this preprocessed data coming from the RDBMS tables. There are many data processing applications that make use of the multicore architecture of modern computers and do these tasks asynchronously. Even though RDBMS perform well when scaled up by making use of the multiple processing cores and huge memory seen in modern hardware, as mentioned earlier, they don't perform well when scaled out, especially where there are multiple table joins. Proper use of these data processing tools in conjunction with Cassandra will provide great value in storing the aggregated and summarized data, as the sketch below illustrates.
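
A simple way to hold multiple drill-down levels, with purely hypothetical names, is to make the aggregation level part of the partition key, so that each level is served by a single-partition read:

-- Level is 'year', 'month', or 'day'; Period holds '2015', '2015-06', '2015-06-15'
CREATE TABLE SalesRollup (
    Region text,
    Level text,
    Period text,
    Total float,
    PRIMARY KEY ((Region, Level), Period)
);

-- Drilling down from years to the months of 2015 for one region
SELECT Period, Total FROM SalesRollup
WHERE Region = 'EMEA' AND Level = 'month' AND Period >= '2015-01' AND Period <= '2015-12';
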
Many organizations sell the data generated by their applications. Depending on the sensitivity of the data and the potential dangers of violating data protection and privacy laws, data aggregation becomes a mandatory requirement. Often, this aggregated data-for-sale needs to be completely separated from the organization's live data, with totally different access controls, even at the level of the hosting location. Cassandra is a good fit for this use case.


Marketing analytics use lots of aggregation. For example, in the case of retail transactions happening in a store, a third-party marketing analytics organization will not be given the individual transaction records. Instead, the transaction data is aggregated, and it is ensured that all personally identifiable data is masked or removed before being handed over for any analytical purposes. Consolidation, aggregation, and summarization are common needs here. Many organizations gather data from their own various marketing channels to generate a common view of the marketing efforts. Many organizations find new avenues for creating new applications and value-added services based on this aggregated data. When new initiatives such as these come up, separation of concerns plays an important role, and often business incubation teams or research and development units take these initiatives to the next level. These are the times when teams really think outside the box and start using new technologies, moving completely away from legacy technologies to exploit economies of scale. Exploration of new technologies happens when the legacy technologies have pain points and when there is a need to reduce the costs incurred by specialized hardware requirements along with software license costs. It also happens when there is a totally new requirement that cannot be served by the ecosystem in use. Cassandra is a good fit in these use cases because many Internet-scale applications use Cassandra heavily for heavy-duty data storage requirements running on commodity hardware, thus providing value for money.
Data visualization products use a lot of aggregation. Many such products are plagued by using too much data: clutter drives users away, and the information is lost in the abundance of data. Seeing these problems, many other products use aggregated data for visualization and provide drill-down or other similar techniques in the visualization. Cassandra can be used to store multilevel aggregated data in its column families.
Data warehousing solutions are fast moving away from RDBMS to NoSQL stores such as Cassandra. Data warehousing projects deal with huge amounts of data and do lots of aggregation and summarization. When it comes to huge amounts of data, scaling out beyond a single server is a mandatory requirement, and Cassandra fits very well there. Data warehousing solutions also need to support various data processing tools. There are many drivers available in the market to connect to Cassandra, and many data processing and analytics tools, such as Apache Spark, work very well with Cassandra.


Online shopping sites generate lots of sales records. Many of them are still using
RDBMS as their preferred data stores. It is practically impossible to generate a report
including all these individual sales records, so even in basic reporting, aggregation
plays a big role. This aggregated data is used for sales forecasting, trending, and
further processing. NoSQL data stores such as Cassandra have become the
preferred choice of many to store this aggregated data.
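
For instance, a minimal sketch of such an aggregate store, assuming a hypothetical
DailySalesSummary column family keyed by a site or store identifier, may look like
the following; the column names are illustrative only:

CREATE TABLE DailySalesSummary
(
StoreId int,
SaleDate timestamp,
TotalOrders int,
TotalRevenue float,
PRIMARY KEY (StoreId, SaleDate)
)
WITH CLUSTERING ORDER BY (SaleDate DESC);

With the clustering order set to descending, the most recent daily summaries of a
store are read first, which suits forecasting and trending queries.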
The proliferation of data products has mandated the need to process data in a totally
new way, with lots of transformations from one format to another. Aggregation and
summarization have become part of all these processes. Here, even the traditional
SQL-based RDBMS fail because the processing needs are beyond SQL's limited
capabilities. The RDBMS fails on two counts: the first is the inability to
process the data, and the second is the inability to store the processed data that
comes in totally different formats. Even Cassandra fails on the first count, but it
scores better on the second because it can store very sophisticated data types and can
scale out to the roof. Detailed coverage of the Cassandra data types comes in
the upcoming chapters of this book.
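
As a small taster of those data types, the following hedged sketch assumes a
hypothetical ProductCatalog column family and shows the CQL collection types
set, list, and map:

CREATE TABLE ProductCatalog
(
ProductId int,
ProductName text,
Tags set<text>,             -- an unordered set of tags
PriceHistory list<float>,   -- an ordered list of past prices
Attributes map<text, text>, -- arbitrary key-value attributes
PRIMARY KEY (ProductId)
);

INSERT INTO ProductCatalog (ProductId, ProductName, Tags, PriceHistory, Attributes)
VALUES (1, 'Widget', {'new', 'sale'}, [9.99, 8.99], {'colour' : 'red'});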

Best practices
When doing aggregation and storing the aggregated data in Cassandra, care must be
taken with the drill-down use cases. The drill-down use cases use both the operational
and the aggregated data, as Cassandra is coexisting with the existing RDBMS. When
the operational data comes from traditional RDBMS tables and the aggregated data
comes from the Cassandra data store, there are good chances of tight coupling
of the application components. If the design is not done properly and thoughtfully
crafted, application maintenance will be a nightmare.
The word aggregation is used in a totally different context in the NoSQL
parlance. It refers to consolidating many related data items into one single
unit that is stored in, and retrieved from, the NoSQL data store as a single
unit. Martin Fowler used this term in his article titled Aggregate Oriented
Database, in which he describes aggregation in this way:
"Aggregates make natural units for distribution strategies such as
sharding, since you have a large clump of data that you expect to
be accessed together. An aggregate also makes a lot of sense to an
application programmer. If you're capturing a screenful of information
and storing it in a relational database, you have to decompose that
information into rows before storing it away. An aggregate makes for a
much simpler mapping - which is why many early adopters of NoSQL
databases report that it's an easier programming model." When this type
of aggregation is used in Cassandra, care must be taken not to store a
big load of data items as a single blob.

The application logic must be carefully thought through to make sure that there is
a proper sync-up between the operational and the aggregated data. If they are out of
sync, it will become very obvious in the drill-down use cases, because the
aggregate record will show one value and the details will show a different one.
It is always a good practice to store data in Cassandra with a
proper structure, even if the number of data items is large.
It is comparatively easier to manage structured data than
unstructured data.
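
To illustrate this practice, the following sketch contrasts the two approaches; the
CustomerProfileBlob and CustomerProfile column families and their columns are
hypothetical:

-- Discouraged: a clump of related data items packed into one opaque blob.
CREATE TABLE CustomerProfileBlob
(
CustomerId int,
ProfileData blob, -- individual items cannot be queried or updated
PRIMARY KEY (CustomerId)
);

-- Preferred: the same data items kept as properly typed columns.
CREATE TABLE CustomerProfile
(
CustomerId int,
CustomerName text,
City text,
Email text,
PRIMARY KEY (CustomerId)
);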

Example
Let's take the case of a normalized set of tables from an application using RDBMS,
as shown in Figure 7:

Figure 7

There are three tables in the relational schema. The first stores the customer details,
the second one stores the order details, and the third one stores the order line items.
Assume that these are operational tables and the data size is huge. There is a
one-to-many relation between the Customer and Order tables: for every customer
record in the Customer table, there may be zero or more order records in the Order
table. There is a one-to-many relation between the Order and OrderItems tables: for
every order record in the Order table, there may be zero or more order item records
in the OrderItems table.


Assume that the requirement is to create an aggregation of the orders to produce a
monthly city order summary report. The Cassandra column family will look like
the following screenshot:

Figure 8

In the CityWiseOrderSummary column family, a combination of City, OrderYear,
and OrderMonth forms the primary key. The City column will be the partition key.
In other words, all the order aggregate data related to a given city will be stored in
a single wide row.
Assuming that the key space is created using the scripts given in the example of the
de-normalization pattern, the scripts given here will create only the required column
family and then insert a couple of records. The filling of the data in this column family
may be done on a real-time basis or as a batch process. Since the data required for
filling the Cassandra column family is not readily available from the RDBMS tables,
some preprocessing needs to be done to prepare the data that goes into this Cassandra
column family:
USE PacktCDP1;
CREATE TABLE CityWiseOrderSummary
(
City text,
OrderYear int,
OrderMonth int,
OrderTotal float,
PRIMARY KEY (City, OrderYear, OrderMonth)
)
WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);
INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal)
VALUES ('Leeds',2015,6,8500);
INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal)
VALUES ('London',2015,6,8500);


The following script gives the details of how the rows in the Cassandra column family
are physically stored. The commands given here are to be executed in the Cassandra
CLI interface:
USE PacktCDP1;
list CityWiseOrderSummary;
Using default limit of 100
Using default cell limit of 100
RowKey: London
=> (name=2015:6:, value=, timestamp=1434313711537934)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313711537934)

RowKey: Leeds
=> (name=2015:6:, value=, timestamp=1434313683491588)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313683491588)

2 Rows Returned.
Elapsed time: 5.17 msec(s).

The following SELECT command gives the output in a human-readable format, as shown in Figure 9:
SELECT * FROM CityWiseOrderSummary;

Figure 9

Just like the way the city-wise order summary was created, if a state-wise order
summary also needs to be created from the RDBMS tables given in Figure 7, a
separate Cassandra column family will have to be created, and the appropriate
application processing will have to be done to fill the data into that Cassandra
column family; a sketch of such a column family follows.
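
A minimal sketch of such a column family, mirroring the structure of
CityWiseOrderSummary and assuming that the state can be derived from the customer
address data during preprocessing, may look like this:

USE PacktCDP1;
CREATE TABLE StateWiseOrderSummary
(
State text,
OrderYear int,
OrderMonth int,
OrderTotal float,
PRIMARY KEY (State, OrderYear, OrderMonth)
)
WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);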


References
The following links can be referred to for further information:

https://en.wikipedia.org/wiki/Database_normalization

http://martinfowler.com/bliki/ReportingDatabase.html

http://martinfowler.com/bliki/AggregateOrientedDatabase.html

Summary
The coexistence of RDBMS and NoSQL data stores such as Cassandra is very much
possible, feasible, and sensible; and this is the only way to get started with the
NoSQL movement, unless you are embarking on a totally new product development
from scratch. Only a few patterns have been discussed here, but there are many
other ways in which Cassandra can harmoniously coexist with other genres of data
stores. All the examples given in this chapter were based on hypothetical
requirements, just to demonstrate the design patterns. Those who implement the
appropriate design patterns in their Cassandra data models may use them as
guidance for implementing real-world solutions.
The coexistence of two or more technologies gives great confidence to the system
architects as well as the end users, and it opens the door to a great opportunity for
the new technology to expand and flourish. The next step that anybody would
take is to migrate from the old technology to the new technology wherever it is
applicable. There are some great use cases where the migration from RDBMS to
Cassandra makes perfect sense, and we will discuss them further in the next chapter.
