NoSQL databases provide an alternative to traditional SQL databases that can scale to meet the unprecedented demands of large internet companies. Key-value stores and document databases relax the ACID properties of SQL to achieve higher scalability and availability. This allows data to be partitioned across many servers and replicated for fault tolerance without complex transactions. NoSQL databases have gained popularity due to their ability to handle massive datasets and workloads.

NoSQL Databases

P.Krishna Reddy
IIIT Hyderabad
Introduction
• The term “database” had become synonymous with SQL.
• Recently there has been a shift in the database landscape. When considering options for data storage, there is a new game in town: NoSQL databases.

• In this talk, I will cover
– NoSQL
– Application of NoSQL with DBMS
How much data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
Evolution
• NoSQL databases first started out as in-house solutions to real problems in
companies such as
– Amazon Dynamo, Google BigTable, LinkedIn Voldemort, Twitter FlockDB, Facebook Cassandra, Yahoo! PNUTS, and others.
• These companies didn’t start off by rejecting SQL and relational
technologies; they tried them and found that they didn’t meet their
requirements.
• In particular, these companies faced three primary issues:
– unprecedented transaction volumes,
– expectations of low-latency access to massive datasets, and
– nearly perfect service availability while operating in an unreliable
environment
• Initially, companies tried the traditional approach: they added more hardware or upgraded to faster machines.
• When that didn’t work, they tried to scale existing relational
solutions by
– simplifying their database schema, de-normalizing the schema,
– relaxing durability and referential integrity,
– introducing various query caching layers,
– separating read-only from write-dedicated replicas, and, finally,
– data partitioning in an attempt to address these new requirements.
• Although each of these techniques extended the functionality of existing relational technologies, none fundamentally addressed the core limitations, and they all introduced additional overhead and technical tradeoffs. In other words, these were good band-aids but not cures.
• Existing SQL systems assumed:
– A single machine will serve the requests.
– Replication is not required.
– A master-slave approach is adequate.

• It was the common architecture and design assumptions underlying most relational databases that failed to address the scalability, latency, and availability requirements of many of the largest sites during the massive growth of the Internet.
• A growing number of companies were still hitting the scalability and performance wall even when using the best practices and the most advanced technologies of the time.
• Database architects had sacrificed many of the most central aspects of a relational database, such as joins and fully consistent data, while introducing many complex and fragile pieces into the operations puzzle.
• Schemas devolved from many interrelated, fully expressed tables to something much more like a simple key/value look-up.
• Deployments of expensive servers were not able to keep up with demand.
• At this point these companies had taken relational databases so far outside their intended use cases that it was no wonder that they were unable to meet performance requirements.
• It quickly became clear to them that they could do much better by building something in-house that was tailored to their particular workloads.
• These in-house custom solutions are the inspiration behind the many NoSQL products we now see on the market.
NoSQL Foundations
• Companies needed a solution that would scale, be resilient,
and be operationally efficient
• They had been able to scale the Web (HTTP) and dynamic
content generation and business logic layers (Application
Servers), but the database continued to be the system’s
bottleneck.
• Engineers wanted databases to scale like Web servers—simply
add more commodity systems and expect things to speed up
at a nearly linear rate—but to do that they would have to
make a number of tradeoffs
• Luckily, due to the large number of compromises made when
attempting to scale their existing relational databases, these
tradeoffs were not so foreign or distasteful as they might have
been
Why Cloud Data Stores

◼ Explosion of social media sites (Facebook, Twitter) with large data needs
◼ Explosion of storage needs in large web sites such as Google, Yahoo
❑ Much of the data is not files
◼ Rise of cloud-based solutions such as Amazon S3
(simple storage solution)
◼ Shift to dynamically-typed data with frequent
schema changes
Parallel Databases and Data Stores
◼ Web-based applications have huge demands on data storage
volume and transaction rate
◼ Scalability of application servers is easy, but what about the
database?
◼ Approach 1: memcache or other caching mechanisms to
reduce database access
❑ Limited in scalability
◼ Approach 2: Use existing parallel databases
❑ Expensive, and most parallel databases were designed for decision support, not OLTP
◼ Approach 3: Build parallel stores with databases underneath
Scaling RDBMS - Partitioning
◼ “Sharding”
❑ Divide data amongst many cheap databases (MySQL/PostgreSQL)
❑ Manage parallel access in the application
❑ Scales well for both reads and writes
❑ Not transparent; the application needs to be partition-aware
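The application-side routing that sharding requires can be sketched as follows. This is a minimal hash-based example; the `shards` pool and the key format are hypothetical, and plain dicts stand in for real per-shard database connections.

```python
import hashlib

# Hypothetical pool of shard "connections" (e.g. one per MySQL/PostgreSQL
# instance). Plain dicts stand in for real database connections here.
shards = [dict() for _ in range(4)]

def shard_for(key: str) -> dict:
    """Route a key to a shard. The application, not the database,
    owns this mapping -- it must be partition-aware."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

def put(key: str, value) -> None:
    shard_for(key)[key] = value

def get(key: str):
    return shard_for(key).get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # the same key always routes to the same shard
```

Note one design consequence: with modulo hashing, adding or removing a shard remaps most keys, which is why real systems often use consistent hashing instead.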


Parallel Key-Value Data Stores
◼ Distributed key-value data storage systems allow key-
value pairs to be stored (and retrieved on key) in a
massively parallel system
❑ E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo, ..
◼ Partitioning, high availability, etc. are completely transparent to the application
◼ Sharding systems and key-value stores don’t support
many relational features
❑ No join operations (except within partition)
❑ No referential integrity constraints across partitions
❑ etc.
What is NoSQL?
◼ Stands for “No SQL” or “Not Only SQL” (the expansion is debated)
◼ Class of non-relational data storage systems
❑ E.g. BigTable, Dynamo, PNUTS/Sherpa, ..
◼ Usually do not require a fixed table schema
nor do they use the concept of joins
◼ Distributed data storage systems
◼ All NoSQL offerings relax one or more of the
ACID properties (will talk about the CAP
theorem)
ACID Properties
To preserve integrity of data, the database system must ensure:

• Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
• Consistency. Execution of a transaction in isolation preserves
the consistency of the database.
• Isolation. Although multiple transactions may execute
concurrently, each transaction must be unaware of other
concurrently executing transactions. Intermediate transaction
results must be hidden from other concurrently executed
transactions.
– That is, for every pair of transactions Ti and Tj, it appears to Ti that
either Tj, finished execution before Ti started, or Tj started execution
after Ti finished.
• Durability. After a transaction completes successfully, the
changes it has made to the database persist, even if there are
system failures.
Example of Fund Transfer
• Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A – 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
• Consistency requirement – the sum of A and B is
unchanged by the execution of the transaction.
• Atomicity requirement — if the transaction fails after
step 3 and before step 6, the system should ensure that
its updates are not reflected in the database, else an
inconsistency will result.
Example of Fund Transfer (Cont.)
• Durability requirement — once the user has been
notified that the transaction has completed (i.e., the
transfer of the $50 has taken place), the updates to
the database by the transaction must persist despite
failures.
• Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).
– Isolation can be ensured trivially by running transactions serially, that is, one after the other. However, executing multiple transactions concurrently has significant benefits, as we will see.
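The transfer above can be tried out against any transactional engine. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout and the simulated failure between the debit (step 3) and the credit (step 6) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 100)])
conn.commit()

def transfer(frm, to, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, frm))
            # Simulate a crash between write(A) and write(B):
            if to == "CRASH":
                raise RuntimeError("failure mid-transaction")
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, to))
    except RuntimeError:
        pass  # atomicity: the partial debit of A was rolled back

transfer("A", "B", 50)       # succeeds
transfer("A", "CRASH", 50)   # fails mid-way: balances unchanged
total = conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]
print(total)  # consistency: A + B is still 200
```

Using the connection as a context manager is what provides the atomicity here: an exception anywhere inside the `with` block rolls the whole transaction back.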
Relaxing ACID
• Anyone familiar with databases will know the acronym ACID, which
outlines the fundamental elements of transactions: atomicity, consistency,
isolation, and durability Together, these qualities define the basics of any
transaction
• As NoSQL solutions developed it became clear that in order to deliver
scalability it might be necessary to relax or redefine some of these
qualities, in particular consistency and durability
• Complete consistency in a distributed environment requires a great deal of
communication involving locks, which force systems to wait on each other
before proceeding to mutate shared data
• Even in cases where multiple systems are generally not operating on the
same piece of data, there is a great deal of overhead that prevents
systems from scaling
Data Representation
• There are three main data representation
camps within NoSQL: document, key/value,
and graph
– Key/value
– JavaScript Object Notation (JSON)
• https://en.wikipedia.org/wiki/JSON
– Binary JSON (BSON)
– Big Table
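As a small illustration of the document/JSON camp: a record is a self-describing object whose field names travel with the data, so no external schema is needed. The field names below are illustrative.

```python
import json

# A self-describing document: the "schema" travels with the data.
doc = json.loads("""
{
  "name": "Rocket-Powered Roller Skates",
  "toon": "Ready, Set, Zoom",
  "inventoryQty": 5,
  "brakes": false
}
""")
print(doc["inventoryQty"])  # 5
```

BSON stores the same document model in a binary encoding, which is what MongoDB uses on the wire and on disk.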
Typical NoSQL API
◼ Basic API access:
❑ get(key) -- Extract the value given a key
❑ put(key, value) -- Create or update the value given its key
❑ delete(key) -- Remove the key and its associated value
❑ execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
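A single-node toy implementation can make this interface concrete. A real NoSQL store distributes and replicates the data across many machines; this sketch only shows the access methods.

```python
class KVStore:
    """Minimal single-node sketch of the typical NoSQL key-value API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Extract the value given a key."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value given its key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its associated value."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke an operation on the value, which is a data
        structure such as a List, Set, or Map."""
        return getattr(self._data[key], operation)(*parameters)

store = KVStore()
store.put("cart:7", [])
store.execute("cart:7", "append", "rocket-skates")
print(store.get("cart:7"))  # ['rocket-skates']
```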
Flexible Data Model
ColumnFamily: Rockets

Key 1:
    name: Rocket-Powered Roller Skates
    toon: Ready, Set, Zoom
    inventoryQty: 5
    brakes: false

Key 2:
    name: Little Giant Do-It-Yourself Rocket-Sled Kit
    toon: Beep Prepared
    inventoryQty: 4
    brakes: false

Key 3:
    name: Acme Jet Propelled Unicycle
    toon: Hot Rod and Reel
    inventoryQty: 1
    wheels: 1

Note that row 3 has a wheels column where rows 1 and 2 have brakes: rows in a column family need not share the same columns.
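In code, a column family like this is essentially a map from row keys to per-row column maps; a minimal sketch:

```python
# ColumnFamily "Rockets": each row key maps to its own set of columns.
rockets = {
    1: {"name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom",
        "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared",
        "inventoryQty": 4, "brakes": False},
    3: {"name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel",
        "inventoryQty": 1, "wheels": 1},  # different columns: no fixed schema
}
print(sorted(rockets[3].keys()))  # ['inventoryQty', 'name', 'toon', 'wheels']
```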
NoSQL Data Storage: Classification
◼ Uninterpreted key/value or ‘the big hash table’.
❑ Amazon S3 (Dynamo)
◼ Flexible schema
❑ BigTable, Cassandra, HBase (ordered keys, semi-
structured data),
❑ Sherpa/PNUTS (unordered keys, JSON)
❑ MongoDB (based on JSON)
❑ CouchDB (name/value in text)
PNUTS Data Storage Architecture
[Figure: PNUTS data storage architecture (diagram not reproduced)]
CAP Theorem
◼ Three properties of a system
❑ Consistency (all copies have same value)
❑ Availability (system can run even if parts have failed)
❑ Via replication
❑ Partitions (network can break into two or more parts, each
with active systems that can’t talk to other parts)
◼ Brewer’s CAP “Theorem”: You can have at most two
of these three properties for any system
◼ Very large systems will partition at some point
❑ ➔ Choose one of consistency or availability
❑ Traditional databases choose consistency
❑ Most Web applications choose availability
◼ Except for specific parts such as order processing
About Eventual Consistency
• To address this, most NoSQL solutions choose to relax the notion of complete consistency to something called “eventual consistency.”
• This allows each system to make updates to data and learn of other updates made by other systems within a short period of time, without being totally consistent at all times.
• As changes are made, tools such as vector clocks are used to provide enough information to reason about the ordering of those changes based on an understanding of the causality of the updates.
• For the majority of systems, knowing that the latest consistent information will eventually arrive at all nodes is likely to be enough to satisfy design requirements.
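A minimal sketch of the vector clocks mentioned above: each replica keeps a per-node counter, and comparing two clocks tells whether one update causally precedes another or the two are concurrent (and must be reconciled). The replica names X, Y, Z are hypothetical.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def happened_before(a, b):
    """True if the update stamped a causally precedes the update stamped b."""
    return all(a.get(n, 0) <= b.get(n, 0) for n in a) and a != b

# Replica X writes; replica Y sees X's write and updates on top of it;
# replica Z updates independently from X's version, concurrently with Y.
v1 = increment({}, "X")         # {'X': 1}
v2 = increment(v1, "Y")         # {'X': 1, 'Y': 1} -- causally after v1
v3 = increment(v1, "Z")         # {'X': 1, 'Z': 1} -- concurrent with v2
print(happened_before(v1, v2))  # True
print(happened_before(v2, v3))  # False
print(happened_before(v3, v2))  # False  (neither precedes: a conflict)
```

When neither clock dominates the other, as with v2 and v3, the store has detected concurrent updates and must reconcile them (e.g. by application logic, as Dynamo does).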
Availability

◼ Traditionally, thought of as the server/process


available five 9’s (99.999 %).
◼ However, for large node system, at almost any
point in time there’s a good chance that a node
is either down or there is a network disruption
among the nodes.
❑ Want a system that is resilient in the face of

network disruption
Eventual Consistency
◼ When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
◼ For a given accepted update and a given node, eventually either
the update reaches the node or the node is removed from
service
◼ Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
❑ Soft state: copies of a data item may be inconsistent
❑ Eventually Consistent – copies becomes consistent at some later
time if there are no more updates to that data item
Common Advantages of NoSQL Systems
◼ Cheap, easy to implement (open source)
◼ Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be
partitioned
❑ When data is written, the latest version is on at least
one node and then replicated to other nodes
❑ No single point of failure
◼ Easy to distribute
◼ Don't require a schema
What does NoSQL Not Provide?

◼ Joins
◼ Group by
❑ But PNUTS provides interesting materialized
view approach to joins/aggregation.
◼ ACID transactions
◼ SQL
◼ Integration with applications that are based on
SQL
Should I be using NoSQL Databases?
◼ NoSQL data storage systems make sense for applications that need to deal with very large volumes of semi-structured data
❑ Log analysis
❑ Social networking feeds
◼ Most of us work on organizational databases, which are not that large and have low update/query rates
❑ Regular relational databases are the correct solution for such applications
Further Reading
◼ Lots of material on the Web
❑ E.g. a nice presentation on NoSQL by Perry Hoekstra
◼ Some material in this talk is from the above presentation
❑ Use a search engine to find information on data storage systems such as
◼ BigTable (Google), Dynamo (Amazon), Cassandra (Facebook/Apache), PNUTS/Sherpa (Yahoo), CouchDB, MongoDB, …
◼ Several of the above are open source
