BDA Unit2 Complete

Uploaded by sirishaksnlp

UNIT II NOSQL DATA MANAGEMENT

Introduction to NoSQL – aggregate data models – aggregates – key-


value and document data models – relationships – graph databases
– schemaless databases – materialized views – distribution models
– sharding – master- slave replication – peer-peer replication –
sharding and replication – consistency – relaxing consistency –
version stamps – map-reduce – partitioning and combining –
composing map-reduce calculations

NOSQL DATA MANAGEMENT


Introduction to NoSQL

MongoDB is a NoSQL database, so it is necessary to understand NoSQL databases in order to understand MongoDB thoroughly.
NoSQL is a type of database management system (DBMS) that is designed to
handle and store large volumes of unstructured and semi-structured data. Unlike
traditional relational databases that use tables with pre-defined schemas to store
data, NoSQL databases use flexible data models that can adapt to changes in
data structures and are capable of scaling horizontally to handle growing
amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases,
but the term has since evolved to mean “not only SQL,” as NoSQL databases
have expanded to include a wide range of different database architectures and
data models

 NoSQL or ‘Not Only SQL’ is a data model that starkly differs from traditional
SQL expectations.
 The primary difference is that NoSQL does not use a relational data
modelling technique and it emphasizes flexible design. The lack of a
required schema makes design a much simpler and cheaper process.
That isn't to say that you can't use a schema at all, but rather that
schema design is very flexible.
 Another useful feature of NoSQL data models is that they are built for high
efficiency and speed in terms of creating up to millions of queries a second.

BDA M Tech UNIT2 1 Sem 1 MLWEC


This is achieved through having all the data contained within one table,
so JOINs and cross-referencing are not as performance-heavy.
 NoSQL is also unique in that it is horizontally scalable, compared to SQL
which is only vertically scalable. With NoSQL you can simply use another
shard, which is cheap, rather than buying more hardware, which is not.

What is a NoSQL Database?

 Databases can be divided into 3 types:
 1. RDBMS (Relational Database Management System)
 2. OLAP (Online Analytical Processing)
 3. NoSQL (recently developed database)
What is NoSQL?
A NoSQL database, also called Not Only SQL, is an approach to
data management and database design that's useful for very large sets
of distributed data. NoSQL is a whole new way of thinking about a
database. NoSQL is not a relational database. The reality is that a
relational database model may not be the best solution for all situations.
The easiest way to think of NoSQL is as a database which does
not adhere to the traditional relational database management system
(RDBMS) structure. Sometimes you will also see it referred to as 'not only
SQL'. The most popular NoSQL database is Apache Cassandra.
Cassandra, which was once Facebook's proprietary database, was
released as open source in 2008. Other NoSQL implementations
include SimpleDB, Google BigTable, Apache Hadoop, MapReduce,
MemcacheDB, and Voldemort. Companies that use NoSQL include
Netflix, LinkedIn and Twitter.

Why Are NoSQL Databases Interesting? / Why Should We Use NoSQL? / When to Use NoSQL?

There are several reasons why people consider using a NoSQL database.

Application development productivity. A lot of application development


effort is spent on mapping data between in-memory data structures and a
relational database. A NoSQL database may provide a data model that better
fits the application’s needs, thus simplifying that interaction and resulting in
less code to write, debug, and evolve.
Large data. Organizations are finding it valuable to capture more data and
process it more quickly. They are finding it expensive, if even possible, to do
so with relational databases. The primary reason is that a relational database
is designed to run on a single machine, but it is usually more economical to
run large data and computing loads on clusters of many smaller and cheaper
machines. Many NoSQL databases are designed explicitly to run on clusters,
so they are a better fit for big data scenarios.
Analytics. One reason to consider adding a NoSQL database to your
corporate infrastructure is that many NoSQL databases are well suited to
performing analytical queries.

Scalability. NoSQL databases are designed to scale; it’s one of the primary
reasons that people choose a NoSQL database. Typically, with a relational
database like SQL Server or Oracle, you scale by purchasing larger and faster
servers and storage or by employing specialists to provide additional tuning.
Unlike relational databases, NoSQL databases are designed to easily scale
out as they grow. Data is partitioned and balanced across multiple nodes in a
cluster, and aggregate queries are distributed by default.
Massive write performance. This is probably the canonical usage, based
on Google's influence. High volume: Facebook needs to store 135 billion
messages a month, and Twitter has the problem of storing 7 TB of data
per day, with the prospect of this requirement doubling multiple times per year.
This is the "data too big to fit on one node" problem. At 80 MB/s it takes about a
day to store 7 TB, so writes need to be distributed over a cluster, which implies
key-value access, MapReduce, replication, fault tolerance, consistency
issues, and all the rest. For faster writes, in-memory systems can be used.
Fast key-value access. This is probably the second most cited virtue of
NoSQL in the general mind set. When latency is important it's hard to beat
hashing on a key and reading the value directly from memory or in as little
as one disk seek. Not every NoSQL product is about fast access; some are
more about reliability, for example. But what people have wanted for a long
time is a better memcached, and many NoSQL systems offer that.

Flexible data model and flexible datatypes. NoSQL products support a


whole range of new data types, and this is a major area of innovation in
NoSQL. We have: column-oriented, graph, advanced data structures,
document-oriented, and key-value. Complex objects can be easily stored
without a lot of mapping. Developers love avoiding complex schemas and
ORM frameworks. Lack of structure allows for much more flexibility. We also
have program- and programmer-friendly datatypes like JSON.
Schema migration. Schemalessness makes it easier to deal with schema
migrations without so much worrying. Schemas are in a sense dynamic,
because they are imposed by the application at run-time, so different parts
of an application can have a different view of the schema.



Write availability. Do your writes need to succeed no matter what? Then we
can get into partitioning, CAP, eventual consistency and all that jazz.

Easier maintainability, administration and operations. This is very


product specific, but many NoSQL vendors are trying to gain adoption by
making it easy for developers to adopt them. They are spending a lot of
effort on ease of use, minimal administration, and automated operations.
This can lead to lower operations costs as special code doesn't have to be
written to scale a system that was never intended to be used that way.

No single point of failure. Not every product is delivering on this, but we are
seeing a definite convergence on relatively easy to configure and manage
high availability with automatic load balancing and cluster sizing. A perfect
cloud partner.

Generally available parallel computing. We are seeing MapReduce baked


into products, which makes parallel computing something that will be a normal
part of development in the future.

Programmer ease of use. Accessing your data should be easy. While the
relational model is intuitive for end users, like accountants, it's not very
intuitive for developers. Programmers grok keys, values, JSON, Javascript
stored procedures, HTTP, and so on. NoSQL is for programmers. This is a
developer-led coup. The response to a database problem can't always be to
hire a really knowledgeable DBA, get your schema right, and denormalize a
little; programmers would prefer a system that they can make work for
themselves. It shouldn't be so hard to make a product perform. Money is part
of the issue. If it costs a lot to scale a product then won't you go with the
cheaper product, that you control, that's easier to use, and that's easier to
scale?

Use the right data model for the right problem. Different data models are
used to solve different problems. Much effort has been put into, for example,
wedging graph operations into a relational model, but it doesn't work. Isn't it
better to solve a graph problem in a graph database? We are now seeing a
general strategy of trying to find the best fit between a problem and a solution.

Distributed systems and cloud computing support. Not everyone is
worried about scale or performance over and above what can be
achieved by non-NoSQL systems. What they need is a distributed system that
can span datacenters while handling failure scenarios without a hiccup.
NoSQL systems, because they have focused on scale, tend to exploit
partitions, tend not to use heavy strict consistency protocols, and so are well
positioned to operate in distributed scenarios.
Difference between SQL and NoSQL
 SQL databases are primarily called Relational Databases (RDBMS),
whereas NoSQL databases are primarily called non-relational or distributed
databases.
 SQL databases are table-based databases, whereas NoSQL databases are
document-based, key-value pairs, graph databases or wide-column stores.
This means that SQL databases represent data in the form of tables
consisting of n number of rows of data, whereas NoSQL databases are
collections of key-value pairs, documents, graphs or wide-column
stores which do not have standard schema definitions that they need to
adhere to.
 SQL databases have a predefined schema, whereas NoSQL databases have
a dynamic schema for unstructured data.
 SQL databases are vertically scalable, whereas NoSQL databases are
horizontally scalable. SQL databases are scaled by increasing the
horsepower of the hardware; NoSQL databases are scaled by increasing the
number of database servers in the pool of resources to reduce the load.
 SQL databases use SQL (Structured Query Language) for defining and
manipulating data, which is very powerful. In a NoSQL database, queries
are focused on collections of documents. Sometimes this is also called
UnQL (Unstructured Query Language). The syntax of UnQL varies
from database to database.
 SQL database examples: MySQL, Oracle, SQLite, Postgres and MS-SQL.
 NoSQL database examples: MongoDB, BigTable, Redis, RavenDB, Cassandra,
HBase, Neo4j and CouchDB.
 For complex queries: SQL databases are a good fit for the complex-query-
intensive environment, whereas NoSQL databases are not a good fit for
complex queries. On a high level, NoSQL databases don't have standard
interfaces to perform complex queries, and the queries themselves in NoSQL
are not as powerful as the SQL query language.
 For the type of data to be stored: SQL databases are not the best fit for
hierarchical data storage. NoSQL databases fit better for hierarchical
data storage, as they follow the key-value-pair way of storing data similar to
JSON data. NoSQL databases are highly preferred for large data sets (i.e., for big
data). HBase is an example for this purpose.
Applications of NoSQL Databases
1. Data Mining
When it comes to data mining, NoSQL databases are useful in retrieving information
for data mining uses. Particularly when it’s about large amounts of data, NoSQL
databases store data points in both structured and unstructured formats leading to
efficient storage of big data.
When a user wishes to mine a particular dataset from large amounts of data,
NoSQL databases are a good place to begin. Data is the building block of the
technology that has led mankind to such great heights. Therefore, one of the most
essential fields where NoSQL databases can be put to use is data mining and data
storage.

2. Social Media Networking Sites


Social media is full of data, both structured and unstructured. A field that is loaded with
tons of data to be discovered, social media is one of the most effective applications of
NoSQL databases. From comments to posts, user-related information to advertising,
social media marketing requires NoSQL databases to be implemented in certain ways
to retrieve useful information that can be helpful in certain ways. Social media sites
like Facebook and Instagram often approach open-source NoSQL databases to
extract data that helps them keep track of their users and the activities going on around
their platforms.

3. Software Development
The third application that we will be looking at is software development. Software
development requires extensive research on users and the needs of the masses that
are met through software development. However, a developer must be able to scan
through the data that is available. NoSQL databases are useful in helping
software developers keep a tab on their users, their details, and other user-related data
that is important to note. That said, NoSQL databases are surely helpful in
software development.
We know that NoSQL databases store data in a format other than relational
tables. NoSQL appears in nearly every industry nowadays. For the people who
interact with data in databases, the aggregate data model helps in that interaction.

Features of NoSQL
Non-relational
 NoSQL databases never follow the relational model
 Never provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs
 Don't require object-relational mapping and data normalization
 No complex features like query languages, query planners, referential-integrity joins, or ACID
 Schema Agnostic: NoSQL databases do not require any specific schema or
storage structure, unlike a traditional RDBMS.
 Scalability: NoSQL databases scale horizontally. As data grows rapidly,
commodity hardware can be added while scalability features are preserved.
 Performance: To increase the performance of a NoSQL system, one can add
another commodity server, giving reliable and fast database access with
minimum overhead.
 Horizontal Scaling: NoSQL Databases can scale horizontally by adding nodes
to share loads. As the data grows the hardware can be added and scalability
features could be preserved for NoSQL.
 Performance: Users can increase the performance of the NoSQL Database by
adding a different server.
 Flexible Schema: NoSQL Databases do not require the same schema as
compared to SQL Databases. The document in the same collection does not
need to have the same set of fields and data type.
 High Availability: Unlike relational databases, which use primary and secondary
nodes for fetching data, NoSQL databases use a masterless architecture.

Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain

BDA M Tech UNIT2 1 Sem 7 MLWEC


Simple API
 Offers easy-to-use interfaces for storing and querying data
 APIs allow low-level data manipulation and selection methods
 Text-based protocols, mostly HTTP REST with JSON
 Mostly no standards-based NoSQL query language
 Web-enabled databases running as internet-facing services

Distributed
 Multiple NoSQL databases can be executed in a distributed fashion
 Offers auto-scaling and fail-over capabilities
 Often the ACID concept can be sacrificed for scalability and throughput
 Mostly no synchronous replication between distributed nodes; asynchronous multi-master replication, peer-to-peer, HDFS replication
 Only providing eventual consistency
 Shared-nothing architecture, enabling less coordination and higher distribution
Advantages of NoSQL
 Can be used as a primary or analytic data source
 Big data capability
 No single point of failure
 Easy replication
 No need for a separate caching layer
 Provides fast performance and horizontal scalability
 Can handle structured, semi-structured, and unstructured data with equal effect
 Object-oriented programming which is easy to use and flexible
 NoSQL databases don't need a dedicated high-performance server
 Support key developer languages and platforms; simpler to implement than an RDBMS
 Can serve as the primary data source for online applications
 Handles big data, which manages data velocity, variety, volume, and complexity
 Excels at distributed database and multi-data-center operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without downtime or service disruption



Disadvantages of NoSQL
 No standardization rules
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 It does not offer any traditional database capabilities, like consistency when
multiple transactions are performed simultaneously.
 When the volume of data increases, it becomes difficult to maintain unique keys
 Doesn't work as well with relational data
 The learning curve is steep for new developers
 Mostly open-source options, so not as popular with enterprises
 For scalability: In most typical situations, SQL databases are vertically
scalable. You can manage increasing load by increasing the CPU, RAM, SSD,
etc., on a single server. On the other hand, NoSQL databases are
horizontally scalable. You can just add a few more servers to your
NoSQL database infrastructure to handle the large traffic.
 For high-transactional applications: SQL databases are the best fit for
heavy-duty transactional applications, as they are more stable and
promise atomicity as well as integrity of the data. While you can use
NoSQL for transaction purposes, it is still not comparable or stable enough
under high load and for complex transactional applications.
 For support: Excellent support is available for all SQL databases from their
vendors.
 There are also a lot of independent consultants who can help you with SQL
databases for very large-scale deployments.
 For some NoSQL databases you still have to rely on community support, and only
limited outside experts are available to help you set up and deploy your large-scale
NoSQL deployments.
 For properties: SQL databases emphasize ACID properties (Atomicity,
Consistency, Isolation and Durability), whereas NoSQL databases follow
Brewer's CAP theorem (Consistency, Availability and Partition tolerance).
 For DB types:
On a high level, we can classify SQL databases as either open-source or closed-
source from commercial vendors. NoSQL databases can be classified, on the basis
of the way they store data, as graph databases, key-value store databases,
document store databases, column store databases and XML databases.

Types of NoSQL Databases: There are four general types of NoSQL
databases, each with their own specific attributes:



1. Key-Value storage

This is the first category of NoSQL database. Key-value stores have a
simple data model, which allows clients to store and retrieve values in a
map/dictionary by key. In key-value storage, each key has to be
unique to provide non-ambiguous identification of values.
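As a sketch of that idea, a plain Python dictionary can stand in for a key-value store. The class and method names here are illustrative, not any particular product's API:

```python
# A toy in-memory key-value store: the database sees only opaque values.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Each key is unique; putting the same key again overwrites the old value.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1", '{"name": "Martin", "city": "Chicago"}')
print(store.get("user:1"))
store.delete("user:1")
print(store.get("user:1"))  # None
```

Note the value is just an opaque blob to the store; interpreting the JSON string is the application's job.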

2. Document databases

Document databases store documents in JSON format. JSON documents
with completely different sets of attributes can be stored together.
They store highly unstructured data as named value pairs and suit
applications that look at user behavior, actions, and logs in real time.
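As an illustration (the `collection` list and `find` helper are hypothetical, not a real driver's API), documents with different attribute sets can sit in the same collection, and queries can reach inside them because their structure is visible to the database:

```python
# A toy "collection": JSON-style documents with different attribute sets.
collection = [
    {"_id": 1, "name": "Ann", "email": "ann@example.com"},
    {"_id": 2, "name": "Bob", "clicks": 42, "last_login": "2024-01-01"},
]

def find(collection, **criteria):
    # Match documents whose fields equal the given criteria;
    # documents lacking a queried field simply don't match.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob"))
print(find(collection, clicks=42))
```

Contrast this with a key-value store, where the database could not look at `name` or `clicks` at all.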



3. Columns storage

Columnar databases are almost like tabular databases, but keys in a
wide-column store can have many dimensions, resulting in a structure
similar to a multi-dimensional associative array. Data in a wide-column
system is addressed using a two-dimensional key (row key plus column
name).
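A minimal sketch of that two-dimensional key, using a nested Python dict (the row key selects a record, then a family-qualified column name selects a value; all names are invented for illustration):

```python
# Wide-column layout: (row key, column name) -> value.
wide_column = {
    "user:1234": {                  # row key
        "profile:name": "Ann",      # column family "profile"
        "profile:city": "Chicago",
        "orders:99": "2024-03-02",  # column family "orders"
    }
}

# Two lookups = the two parts of the key.
row_key, column = "user:1234", "profile:city"
print(wide_column[row_key][column])  # Chicago
```

Each row can carry a different set of columns, which is what makes the structure "wide" rather than tabular.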

4. Graph storage

Graph databases are best suited for representing data with a high yet
flexible number of interconnections, especially when information about
those interconnections is at least as important as the represented data.
In a graph database, data is stored in graph-like structures so that it
can be made easily accessible. Graph databases are commonly used on
social networking sites.
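A toy sketch of the idea in Python: an adjacency-set "follows" relation, where the edges carry as much meaning as the nodes. Real graph databases such as Neo4j add indexes and a query language on top of this basic shape:

```python
# A tiny social graph stored as adjacency sets: node -> set of followed nodes.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}

def followed_by_both(graph, a, b):
    # Relationship queries like "who do a and b both follow?"
    # are the kind of traversal graph databases optimize for.
    return graph[a] & graph[b]

print(followed_by_both(follows, "alice", "bob"))  # {'carol'}
```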



Example databases

Pros and Cons of Relational Databases


• Advantages
• Data persistence
• Concurrency – ACID, transactions, etc.
• Integration across multiple applications
• Standard Model – tables and SQL
• Disadvantages
• Impedance mismatch
• Integration databases vs. application databases
• Not designed for clustering

Database Impedance mismatch:


Impedance Mismatch means the difference between data model and in
memory data structures.
Impedance is a measure of the amount by which some object resists (or
obstructs) the flow of another.
Imagine you have a low current flashlight that normally uses AAA batteries.
Suppose you could attach your car battery to the flashlight. The low current
flashlight will pitifully output a fraction of the light energy that the high current
battery is capable of producing. However, match the AAA batteries to the
flashlight and they will run with maximum efficiency.
The data representation in an RDBMS does not match the data structures used
in memory. In memory, data structures are lists, dictionaries, and nested,
hierarchical structures, whereas a relational database stores only
atomic values; there are no lists or nested records. Translating between
these representations can be costly and confusing, and it limits application
development productivity.
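A small Python sketch of the mismatch, assuming a hypothetical order object: the nested in-memory structure must be flattened into atomic-valued rows before a relational store can accept it.

```python
# One in-memory order object: nested lists and dictionaries.
order = {
    "id": 99,
    "customer": "Martin",
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 41.00},
    ],
}

# ...must be flattened into atomic-valued rows for two relational tables:
order_rows = [(order["id"], order["customer"])]
item_rows = [(order["id"], i["product"], i["price"]) for i in order["items"]]

print(order_rows)  # [(99, 'Martin')]
print(item_rows)
```

Reading the order back means joining the two tables and rebuilding the nested object; that round trip is the impedance mismatch.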

Some common characteristics of nosql include:


• Does not use the relational model (mostly)
• Generally open source projects (currently)
• Driven by the need to run on clusters
• Built for the need to run 21st century web properties
• Schema-less
• Polyglot persistence: The point of view of using different data
stores in different circumstances is known as Polyglot Persistence.
Today, most large companies are using a variety of different data storage
technologies for different kinds of data. Many companies still use relational
databases to store some data, but the persistence needs of applications are
evolving from predominantly relational to a mixture of data sources. Polyglot
persistence is commonly used to define this hybrid approach. The definition
of polyglot is “someone who speaks or writes several languages.” The term



polyglot is redefined for big data as a set of applications that use several
core database technologies.

• Auto-sharding: NoSQL databases usually support auto-sharding,
meaning that they natively and automatically spread data across
an arbitrary number of servers, without requiring the application to even be
aware of the composition of the server pool.
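A minimal sketch of hash-based sharding (the server names and the hash choice are assumptions for illustration): the key alone determines the node, so application code never routes data by hand.

```python
import hashlib

# Toy hash-based sharding: hash the key, pick a node by modulo.
SERVERS = ["node-a", "node-b", "node-c"]   # hypothetical server pool

def shard_for(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The same key always routes to the same node. Note that adding a server
# changes the modulus; real systems use consistent hashing to soften this.
for key in ("user:1", "user:2", "order:99"):
    print(key, "->", shard_for(key))
```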
NoSQL data model
Relational and NoSQL data models are very different. The relational model
takes data and separates it into many interrelated tables that contain rows
and columns. Tables reference each other through foreign keys that are
stored in columns as well. When looking up data, the desired information
needs to be collected from many tables (often hundreds in today's enterprise
applications) and combined before it can be provided to the
application. Similarly, when writing data, the write needs to be coordinated
and performed across many tables.



NoSQL databases have a very different model. For example, a document-
oriented NoSQL database takes the data you want to store and aggregates it
into documents using the JSON format. Each JSON document can be thought
of as an object to be used by your application. A JSON document might, for
example, take all the data stored in a row that spans 20 tables of a relational
database and aggregate it into a single document/object. Aggregating this
information may lead to duplication of information, but since storage is no
longer cost prohibitive, the resulting data model flexibility, ease of efficiently
distributing the resulting documents and read and write performance
improvements make it an easy trade-off for web-based applications.

Another major difference is that relational technologies have rigid schemas


while NoSQL models are schemaless. Relational technology requires strict
definition of a schema prior to storing any data into a database. Changing the
schema once data is inserted is a big deal, extremely disruptive and frequently
avoided – the exact opposite of the behavior desired in the Big Data era,
where application developers need to constantly – and rapidly – incorporate
new types of data to enrich their apps.
Aggregate data model in NoSQL

Data Model: A data model is the model through which we perceive and
manipulate our data. For people using a database, the data model describes
how we interact with the data in the database.
Relational Data Model: The relational model takes the information that we
want to store and divides it into tuples.
A tuple is a limited data structure: it captures a set of values, and one
tuple cannot be nested inside another. This limitation shapes what the
relational model can represent directly.
Aggregate Model: Aggregate is a term that comes from Domain-Driven
Design, an aggregate is a collection of related objects that we wish to treat
as a unit, it is a unit for data manipulation and management of consistency.
• Atomic property holds within an aggregate
• Communication with data storage happens in unit of aggregate
• Dealing with aggregate is much more efficient in clusters
• It is often easier for application programmers to work with aggregates

Example of Relations and Aggregates


 Let’s assume we have to build an e-commerce website; we are going
to be selling items directly to customers over the web, and we will
have to store information about users, our product catalog, orders,
shipping addresses, billing addresses, and payment data.
 We can use this scenario to model the data using a relation data store
as well as NoSQL data stores and talk about their pros and cons.
 For a relational database, we might start with a data model shown in
the following figure.



The following figure presents some sample data for this model.

In relational, everything is properly normalized, so that no data is repeated in


multiple tables. We also have referential integrity. A realistic order system
would naturally be more involved than this. Now let’s see how this model might
look when we think in more aggregate-oriented terms.

Again, we have some sample data, which we’ll show in JSON format as that’s
a common representation for data in NoSQL.



// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

In this model, we have two main aggregates: customer and order. We’ve used
the black-diamond composition marker in UML to show how data fits into the
aggregation structure. The customer contains a list of billing addresses; the
order contains a list of order items, a shipping address, and payments. The
payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but
instead of using IDs it’s treated as a value and copied each time. This fits the
domain, where we would not want the shipping address, nor the payment’s
billing address, to change. In a relational database, we would ensure that the
address rows aren’t updated for this case, making a new row instead. With
aggregates, we can copy the whole address structure into the aggregate as
we need to.

Aggregate-Oriented Databases: Aggregate-oriented databases work best
when most data interaction is done with the same aggregate; aggregate-ignorant
databases are better when interactions use data organized in many
different formations.
• Key-value databases
• Stores data that is opaque to the database
• The database cannot see the structure of records
• Application needs to deal with this
• Allows flexibility regarding what is stored (i.e., text or binary data)
• Document databases
 Stores data whose structure is visible to the database
 Imposes limitations on what can be stored
 Allows more flexible access to data (i.e., partial records) via querying.
 Both key-value and document databases consist of aggregate
records accessed by ID values
• Column-family databases
✓ Two levels of access to aggregates (and hence, two parts to the
“key” to access an aggregate’s data)
✓ ID is used to look up the aggregate record
✓ Column name – either a label for a value (name) or a key to
a list entry (order id)
✓ Columns are grouped into column families



Schemaless Databases
 A common theme across all the forms of NoSQL databases is that they
are schemaless.
 When you want to store data in a relational database, you first have to
define a schema—a defined structure for the database which says what
tables exist, which columns exist, and what data types each column
can hold.
 Before you store some data, you have to have the schema defined for
it.
 With NoSQL databases, storing data is much more casual. A key-value
store allows you to store any data you like under a key.
 A document database effectively does the same thing, since it makes
no restrictions on the structure of the documents you store.
 Column-family databases allow you to store any data under any column
you like.
 Graph databases allow you to freely add new edges and freely add
properties to nodes and edges as you wish.
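The points above can be sketched in a few lines of Python: records stored under keys need not share fields, types, or even structure (the keys and fields here are invented for illustration):

```python
# Schemaless storage: nothing forces records to share a shape.
store = {}
store["user:1"] = {"name": "Ann", "age": 30}
store["user:2"] = {"name": "Bob", "newsletter": True}  # different fields
store["log:17"] = "GET /index.html 200"                # not even a dict

# No ALTER TABLE was needed to introduce 'newsletter' or the log entry;
# the "schema" is whatever each record happens to contain.
for key, value in store.items():
    print(key, value)
```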
Why schemaless?
 A schemaless store also makes it easier to deal with nonuniform data
 When starting a new development project you don't need to spend
the same amount of time on up-front design of the schema.
 No need to learn SQL or database specific stuff and tools.
 The rigid schema of a relational database (RDBMS) means you have
to absolutely follow the schema. It can be harder to push data into the
DB as it has to perfectly fit the schema. Being able to add data directly
without having to tweak it to match the schema can save you time
 With a rigid schema, minor changes to the model mean you have to change
both your code and the schema in the DBMS. With no schema, you don't
have to make changes in two places, which is less time consuming
 With a NoSql DB you have fewer ways to pull the data out
 Less overhead for DB engine
 Less overhead for developers related to scalability
 Eliminates the need for Database administrators or database experts
 Fewer people involved and less waiting on experts
 Save time writing complex SQL joins -> more rapid development
Pros and cons of schemaless data
Pros:
• More freedom and flexibility
• you can easily change your data organization
• you can deal with nonuniform data
Cons:
 A program that accesses data:
• almost always relies on some form of implicit schema
• it assumes that certain fields are present
• carry data with a certain meaning
 The implicit schema is shifted into the application code that accesses data
 To understand what data is present you have to look at the application code
 The schema cannot be used to:
o decide how to store and retrieve data efficiently
o ensure data consistency
 Problems if multiple applications, developed by different people, access the
same database.
 Relational schemas can be changed at any time with standard SQL
commands
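To make the implicit-schema point concrete, here is a small sketch: the function below assumes certain fields exist in each document, so the "schema" lives in the application code rather than in the database. The field names are hypothetical, chosen only for illustration.

```python
# With a schemaless store, the "schema" lives implicitly in application
# code: this function assumes certain fields are present and carry a
# certain meaning (field names here are hypothetical).
def shipping_label(order_doc: dict) -> str:
    # Implicit schema: every order document is assumed to contain
    # 'customer' and 'address' keys with exactly these names.
    return f"{order_doc['customer']} / {order_doc['address']}"

print(shipping_label({"customer": "Ann", "address": "Boston"}))
# A document missing 'address' raises KeyError at run time - nothing in
# the database itself enforces the structure.
```

Note that a second application reading the same data must repeat these assumptions, which is exactly the multi-application problem listed above.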

Key-value databases
 A key-value store is a simple hash table, primarily used when all access to
the database is via primary key.
 Key-value stores are the simplest NoSQL data stores to use from an API
perspective. The client can either get the value for the key, put a value for
a key, or delete a key from the data store.
 The value is a BLOB(Binary Large Object) that the data store just stores,
without caring or knowing what’s inside; it’s the responsibility of the
application to understand what was stored.
 Since key-value stores always use primary-key access, they generally have
great performance and can be easily scaled.
 It is an associative container such as map, dictionary, and in query
processing an index. It is an abstract data type composed of a collection of
unique keys and a collection of values, where each key is associated with
one value (or set of values).
 The operation of finding the value associated with a key is called a lookup
or indexing the relationship between a key and its value is sometimes called
a mapping or binding.
 Some of the popular key-value databases are Riak, Redis, Memcached DB,
Berkeley DB, HamsterDB, Amazon DynamoDB.
 A Key-Value model is great for lookups of simple or even complex values.



 When the values are themselves interconnected, you’ve got a graph as
shown in following figure.
 This lets you traverse quickly among all the connected values.

In Key-Value database,

• Data is stored sorted by key.

• Callers can provide a custom comparison function to override the sort


order.

• The basic operations are Put(key,value), Get(key), Delete(key).

• Multiple changes can be made in one atomic batch.

• Users can create a transient snapshot to get a consistent view of data.

• Forward and backward iteration is supported over the data.
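The operations listed above can be illustrated with a minimal in-memory sketch. This is illustrative only and not the API of any particular key-value store; real stores persist the data and treat the value as an opaque blob.

```python
# Minimal in-memory key-value store illustrating the basic operations:
# Put(key, value), Get(key), Delete(key), plus forward iteration with
# the data kept sorted by key.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # value treated as an opaque blob

    def get(self, key):
        return self._data.get(key)       # None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self):
        # forward iteration in key order
        for key in sorted(self._data):
            yield key, self._data[key]

store = KVStore()
store.put("user:42", b'{"name": "Ann"}')
store.put("user:17", b'{"name": "Bob"}')
print([k for k, _ in store.scan()])      # keys come back in sorted order
```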


In key-value databases, a single object stores all the data and is put into
a single bucket. Buckets are used to define a virtual keyspace and provide
the ability to define isolated non-default configuration. Buckets might be
compared to tables or folders in relational databases or file systems,
respectively.
As their name suggests, they store key/value pairs. For example, for search
engines, a store may associate with each keyword (the key) a list of documents
containing it (the corresponding value).
One approach to implementing a key-value store is to use a file decomposed into
blocks. As the following figure shows, each block is associated with a number
(ranging from 1 to n). Each block manages a set of key-value pairs: the
beginning of the block contains, after some header information, an index of keys and
the positions of the corresponding values. These values are stored starting
from the end of the block (like a memory heap). The free space available is
delimited by the end of the index and the end of the values.



In this implementation, the size of a block is important since it defines the
largest value that can be stored (for example the longest list of document
identifiers containing a given keyword). Moreover, it supposes that a block
number is associated with each key. These block numbers can be assigned in
two different ways:

1. The block number is obtained directly from the key, typically by using a
hash function. The size of the file is then defined by the largest block
number computed by every possible key.

2. The block number is assigned incrementally. When a new pair must be
stored, the first block that can hold it is chosen. In practice, a given
amount of space is reserved in each block to manage updates of existing
pairs (a new value can replace an older, smaller one). This limits the
size of the file to the amount of values stored.
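The two assignment schemes might be sketched as follows. This is a toy model with hypothetical names; a real store manages on-disk blocks, not Python lists.

```python
# Scheme 1: derive the block number directly from the key via a hash
# function; the file size is bounded by the largest computed block number.
N_BLOCKS = 8

def block_by_hash(key: str) -> int:
    return hash(key) % N_BLOCKS + 1      # block numbers range from 1 to n

# Scheme 2: assign block numbers incrementally, picking the first block
# with enough free space; a fixed per-block capacity models the space
# reserved for updates of existing pairs.
CAPACITY = 3
blocks = []                              # each block is a list of pairs

def block_by_first_fit(key, value) -> int:
    for i, blk in enumerate(blocks):
        if len(blk) < CAPACITY:
            blk.append((key, value))
            return i + 1
    blocks.append([(key, value)])        # open a new block when all are full
    return len(blocks)
```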
Use Cases:

 These Aggregate Data Models in NoSQL Database are used for storing the user
session data.
 Key Value-based Data Models are used for maintaining schema-less user profiles.
 It is used for storing user preferences and shopping cart data.

Advantages:
 Can handle large amounts of data and heavy load,
 Easy retrieval of data by keys

Limitations:

 Complex queries that involve multiple key-value pairs can degrade
performance.
 Data involving many-to-many relationships is hard to model and can lead to
conflicts.



Examples:
 DynamoDB
 Berkeley DB

Document Databases

 In a relational database system you must define a schema before


adding records to a database.
 The schema is the structure described in a formal language supported
by the database and provides a blueprint for the tables in a database
and the relationships between tables of data.
 Within a table, you need to define constraints in terms of rows and
named columns as well as the type of data that can be stored in each
column.
 In contrast, a document-oriented database contains documents, which
are records that describe the data in the document, as well as the actual
data.
 Documents can be as complex as you choose; you can use nested data
to provide additional sub-categories of information about your object.
 You can also use one or more documents to represent a real-world
object. The following compares a conventional table with document-
based objects:



In this example we have a table that represents beers and their respective
attributes: id, beer name, brewer, bottles available and so forth. As we see in
this illustration, the relational model conforms to a schema with a specified
number of fields which represent a specific purpose and data type. The
equivalent document-based model has an individual document per beer; each
document contains the same types of information for a specific beer.
In a document-oriented model, data objects are stored as documents; each
document stores your data and enables you to update the data or delete it.
Instead of columns with names and data types, we describe the data in the
document, and provide the value for that description. If we wanted to add
attributes to a beer in a relational mode, we would need to modify the database
schema to include the additional columns and their data types. In the case of
document-based data, we would add additional key-value pairs into our
documents to represent the new fields.
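A small sketch of that difference, using a hypothetical beer document: adding a new attribute is a plain dictionary update on the affected documents rather than a schema migration.

```python
# In the document model a record is just a dictionary of key-value pairs
# (field names here are illustrative).
beer = {"name": "Pale Ale", "brewer": "Amtel", "bottles_available": 12}

# New requirement: track alcohol content. In the relational model this
# would be an ALTER TABLE on the schema; here we simply extend the
# documents that need the new field.
beer["abv"] = 5.2

print(sorted(beer))   # the document now carries the extra field
```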
The other characteristic of relational database is data normalization ; this
means you decompose data into smaller, related tables. The figure below
illustrates this:



In the relational model, data is shared across multiple tables. The advantage
to this model is that there is less duplicated data in the database. If we did not
separate beers and brewers into different tables and had one beer table
instead, we would have repeated information about breweries for each beer
produced by that brewer.
The problem with this approach is that when you change information across
tables, you need to lock those tables simultaneously to ensure information
changes across the table consistently. Because you also spread information
across a rigid structure, it makes it more difficult to change the structure during
production, and it is also difficult to distribute the data across multiple servers.
In the document-oriented database, we could choose to have two different
document structures: one for beers, and one for breweries. Instead of splitting
your application objects into tables and rows, you would turn them into
documents. By providing a reference in the beer document to a brewery
document, you create a relationship between the two entities:

In this example we have two different beers from the Amtel brewery. We
represent each beer as a separate document and reference the brewery in
the brewer field. The document-oriented approach provides several upsides
compared to the traditional RDBMS model. First, because information is
stored in documents, updating a schema is a matter of updating the
documents for that type of object. This can be done with no system downtime.
Secondly, we can distribute the information across multiple servers with
greater ease. Since records are contained within entire documents, it makes
it easier to move, or replicate an entire object to another server.

Using JSON Documents



JavaScript Object Notation (JSON) is a lightweight data-interchange format
which is easy to read and change. JSON is language-independent although it
uses similar constructs to JavaScript. The following are basic data types
supported in JSON:
 Numbers, including integer and floating point,
 Strings, including all Unicode characters and backslash escape
characters,
 Boolean: true or false,
 Arrays, enclosed in square brackets: [“one”, “two”, “three”]
 Objects, consisting of key-value pairs, also known as
an associative array or hash. The key must be a string and the value
can be any supported JSON data type.
For instance, if you are creating a beer application, you might want a
particular document structure to represent a beer:
{
"name":
"description":
"category":
"updated":
}
For each of the keys in this JSON document you would provide unique
values to represent individual beers. If you want to provide more detailed
information in your beer application about the actual breweries, you could
create a JSON structure to represent a brewery:
{
"name":
"address":
"city":
"state":
"website":
"description":
}
Performing data modeling for a document-based application is no different
than the work you would need to do for a relational database. For the most
part it can be much more flexible, it can provide a more realistic representation
of your application data, and it also enables you to change your mind later
about data structure. For more complex items in your application, one option
is to use nested pairs to represent the information:
{
"name":
"address":



"city":
"state":
"website":
"description":
"geo":
{
"location": ["-105.07", "40.59"],
"accuracy": "RANGE_INTERPOLATED"
}
"beers": [ _id4058, _id7628]
}
In this case we added a nested attribute for the geo-location of the brewery
and for beers. Within the location, we provide an exact longitude and latitude,
as well as level of accuracy for plotting it on a map. The level of nesting you
provide is your decision; as long as a document is under the maximum storage
size for Server, you can provide any level of nesting that you can handle in
your application.
In traditional relational database modeling, you would create tables that contain
a subset of information for an item. For instance a brewery may contain types
of beers which are stored in a separate table and referenced by the beer id.
In the case of JSON documents, you use key-value pairs, or even nested
key-value pairs.

Column-Family Stores
The name conjures up a tabular structure realized with sparse columns and no
schema. The column-family model is a two-level aggregate structure. As with
key-value stores, the first key is often described as a row
identifier, picking up the aggregate of interest. The difference with column
family structures is that this row aggregate is itself formed of a map of more
detailed values. These second-level values are referred to as columns. As
well as accessing the row as a whole, operations also allow picking out a
particular column, so to get a particular customer’s name from following figure
you could do something like
get('1234', 'name').



Column-family databases organize their columns into column families. Each
column has to be part of a single column family, and the column family acts as a
unit for access, with the assumption that data for a particular column family will
usually be accessed together.
The data is structured into:
• Row-oriented: Each row is an aggregate (for example, customer with the
ID of 1234) with column families representing useful chunks of data (profile,
order history) within that aggregate.
• Column-oriented: Each column family defines a record type (e.g.,
customer profiles) with rows for each of the records. You then think of a row
as the join of records in all column families.
Even though a document database declares some structure to the database,
each document is still seen as a single unit. Column families give a two-
dimensional quality to column-family databases.
Cassandra uses the terms “wide” and “skinny.” Skinny rows have few
columns with the same columns used across the many different rows. In this
case, the column family defines a record type, each row is a record, and each
column is a field. A wide row has many columns (perhaps thousands), with
rows having very different columns. A wide column family models a list, with
each column being one element in that list.
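The two-level aggregate structure can be modeled as a map of maps. The sketch below, with illustrative data, mimics the get('1234', 'name') access shown above; real column-family stores add timestamps and on-disk layout that this toy omits.

```python
# A column-family store modeled as a two-level map:
# row key -> column family -> column name -> value.
rows = {
    "1234": {
        "profile": {"name": "Martin", "billing_address": "London"},
        "orders": {"od1001": "shipped", "od1002": "pending"},
    }
}

def get(row_key, column, family="profile"):
    # pick a single column out of the row aggregate
    return rows[row_key][family][column]

print(get("1234", "name"))                     # -> Martin
print(get("1234", "od1002", family="orders"))  # -> pending
```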



Advantages:
 Data is readily available
 Queries like SUM, AVERAGE, COUNT can be easily performed on columns
Examples:
 HBase
 Bigtable by Google
 Cassandra

EX1:

EX2:



Column family is an Aggregate Data Models in NoSQL Database usually with big-table
style Data Models that are referred to as column stores. It is also called a two-level map
as it offers a two-level aggregate structure. In this Aggregate Data Models in NoSQL,
the first level of the Column family contains the keys that act as a row identifier that is
used to select the aggregate data. Whereas the second level values are referred to as
columns.

Use Cases:

 Column Family Data Models are used in systems that maintain counters.
 These Aggregate Data Models in NoSQL are used for services that have expiring
usage.
 It is used in systems that have heavy write requests.

Relationships: Atomic Aggregates


Aggregates allow one to store a single business entity as one document, row,
or key-value pair and update it atomically.

Graph Databases:
Graph databases are one style of NoSQL databases that uses a distribution
model similar to relational databases but offers a different data model that
makes it better at handling data with complex relationships.

• Entities are also known as nodes, which have properties


• Nodes are organized by relationships which allows to find interesting
patterns between the nodes
• The organization of the graph lets the data to be stored once and then
interpreted in different ways based on relationships
Let’s follow along some graphs, using them to express themselves. We’ll
“read” a graph by following arrows around the diagram to form sentences.
A Graph contains Nodes and Relationships

A Graph –[:RECORDS_DATA_IN]–> Nodes –[:WHICH_HAVE]–> Properties.


The simplest possible graph is a single Node, a record that has named values
referred to as Properties. A Node could start with a single Property and grow
to a few million, though that can get a little awkward. At some point it makes
sense to distribute the data into multiple nodes, organized with explicit
Relationships.

Query a Graph with a Traversal

A Traversal –navigates–> a Graph; it –identifies–> Paths –which order–>


Nodes.
A Traversal is how you query a Graph, navigating from starting Nodes to
related Nodes according to an algorithm, finding answers to questions like
“what music do my friends like that I don’t yet own,” or “if this power supply
goes down, what web services are affected?”



Example

In this context, a graph refers to a graph data structure of nodes connected


by edges. In the above figure we have a web of information whose nodes are
very small (nothing more than a name) but there is a rich structure of
interconnections between them. With this structure, we can ask questions
such as “find the books in the Databases category that are written
by someone whom a friend of mine likes.”

Graph databases specialize in capturing this sort of information—but on a


much larger scale than a readable diagram could capture. This is ideal for
capturing any data consisting of complex relationships such as social
networks, product preferences, or eligibility rules.
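A toy traversal over such a web of nodes and edges might look like this. The data is illustrative; real graph databases use query languages and index-free adjacency rather than list scans, but the shape of the question ("what do my friends like that I don't?") is the same.

```python
# A graph as nodes plus labeled edges, and a traversal that starts from
# one node and navigates relationships to answer a question.
edges = [
    ("Ann", "FRIEND", "Bob"),
    ("Ann", "LIKES", "NoSQL Distilled"),
    ("Bob", "LIKES", "NoSQL Distilled"),
    ("Bob", "LIKES", "Refactoring"),
]

def outgoing(node, label):
    # follow all edges with the given label leaving this node
    return {dst for src, lbl, dst in edges if src == node and lbl == label}

def friends_likes_i_dont(person):
    mine = outgoing(person, "LIKES")
    suggestions = set()
    for friend in outgoing(person, "FRIEND"):
        suggestions |= outgoing(friend, "LIKES") - mine
    return suggestions

print(friends_likes_i_dont("Ann"))  # -> {'Refactoring'}
```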
Materialized Views
In computing, a materialized view is a database object that contains the
results of a query. For example, it may be a local copy of data located
remotely, or may be a subset of the rows and/or columns of a table or join
result, or may be a summary based on aggregations of a table's data.
Materialized views can be used within the same aggregate.
Materialized views, which store data based on remote tables, are also known
as snapshots. A snapshot can be redefined as a materialized view.
Materialized view is computed in advance and cached on disk.
Strategies to building a materialized view:
Eager approach: the materialized view is updated at the same time of the
base data. It is good when you have more frequent reads than writes.
Detached approach: batch jobs update the materialized views at regular
intervals. It is good when you don’t want to pay an overhead on each update.
NoSQL databases generally do not have views; instead they offer precomputed
and cached queries, usually called “materialized views.”
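A minimal sketch of the eager approach, assuming a hypothetical revenue-by-product view: the cached summary is updated in the same step as the base-data write, so reads never have to re-scan the base data.

```python
# Eager materialized view: the precomputed summary is kept in step with
# the base data at write time (names are hypothetical).
orders = []                      # base data
revenue_by_product = {}          # the materialized view, cached

def add_order(product, amount):
    orders.append((product, amount))
    # eager approach: update the view on every base write,
    # trading write overhead for cheap reads
    revenue_by_product[product] = revenue_by_product.get(product, 0) + amount

add_order("beer", 10)
add_order("beer", 5)
add_order("ale", 7)
print(revenue_by_product)   # reads consult the cached view directly
```

A detached approach would instead rebuild revenue_by_product from orders in a periodic batch job, accepting staleness between runs.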

Distribution Models
Multiple servers: In NoSQL systems, data is distributed over large clusters.
Single server – simplest model, everything on one machine. Run the
database on a single machine that handles all the reads and writes to the data
store. We prefer this option because it eliminates all the complexities. It’s easy
for operations people to manage and easy for application developers to
reason about.
Although a lot of NoSQL databases are designed around the idea of running
on a cluster, it can make sense to use NoSQL with a single-server distribution
model if the data model of the NoSQL store is more suited to the application.
Graph databases are the obvious category here—these work best in a single-
server configuration.
If your data usage is mostly about processing aggregates, then a single-
server document or key-value store may well be worthwhile because it’s
easier on application developers.
Orthogonal aspects of data distribution models:
Sharding: DB Sharding is nothing but horizontal partitioning of data. Different
people are accessing different parts of the dataset. In these circumstances
we can support horizontal scalability by putting different parts of the data onto
different servers—a technique that’s called sharding.
A table with billions of rows can be partitioned using “range partitioning,”
for example on the customer transaction date. With that scheme, irrespective
of which instance in a Real Application Cluster accesses the data, the data is
not horizontally partitioned, although Global Enqueue Resources own certain
blocks in each instance and that ownership can move around. In a sharded
environment, by contrast, the data is horizontally partitioned. For example,
United States customers can live in one shard, European Union customers in
another, and customers from other countries in a third, but from an access
perspective there is no need to know where the data lives: the shard router
goes to the appropriate shard to pick up the data.



□ Different parts of the data are placed onto different servers
• Horizontal scalability
• Ideal case: different users all talking to different server nodes
• Data accessed together is kept on the same node – the aggregate is the unit!
□ Pros: it can improve both reads and writes
□ Cons: clusters use less reliable machines – resilience decreases
Many NoSQL databases offer auto-sharding
□ the database takes on the responsibility of sharding
Improving performance:
Main rules of sharding:
1. Place the data close to where it’s accessed
□ Orders for Boston: data in your eastern US data center
2. Try to keep the load even
□ All nodes should get equal amounts of the load
3. Put together aggregates that may be read in sequence
□ Same order, same node
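Rule 1 above can be sketched as a tiny shard router that maps a customer's region to a shard. The shard names and the region field are hypothetical; real systems often route by hashing or ranges of a shard key instead.

```python
# Toy shard router: place data close to where it is accessed by mapping
# a customer's region to a shard (names are hypothetical).
SHARD_BY_REGION = {
    "US": "shard-us-east",   # e.g. orders for Boston live in the US data center
    "EU": "shard-eu-west",
}
DEFAULT_SHARD = "shard-other"

def shard_for(customer: dict) -> str:
    # the application never needs to know where the data lives;
    # the router picks the appropriate shard from the shard key
    return SHARD_BY_REGION.get(customer["region"], DEFAULT_SHARD)

print(shard_for({"id": 1, "region": "EU"}))   # -> shard-eu-west
```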

Master-Slave Replication
□ Master
□ is the authoritative source for the data
□ is responsible for processing any updates to that data
□ can be appointed manually or automatically



□ Slaves
□ A replication process synchronizes the slaves with the
master
□ After a failure of the master, a slave can be appointed as new
master very quickly

Pros and cons of Master-Slave Replication


□ Pros
 More read requests:
 Add more slave nodes
 Ensure that all read requests are routed to the slaves
 Should the master fail, the slaves can still handle
read requests
 Good for datasets with a read-intensive dataset
□ Cons
 The master is a bottleneck
 Limited by its ability to process updates and to pass
those updates on
 Its failure does eliminate the ability to handle writes until:
 the master is restored or
 a new master is appointed
 Inconsistency due to slow propagation of changes to the slaves
 Bad for datasets with heavy write traffic

Peer-to-Peer Replication

• All the replicas have equal weight, they can all accept writes
• The loss of any of them doesn’t prevent access to the data store.
Pros and cons of peer-to-peer replication
□ Pros:
□ you can ride over node failures without losing access to data
□ you can easily add nodes to improve your performance
□ Cons:
□ Inconsistency
□ Slow propagation of changes to copies on different nodes
□ Inconsistencies on read lead to problems but are relatively transient
□ Two people can update different copies of the same record stored
on different nodes at the same time - a write-write conflict.
□ Inconsistent writes are forever.
Sharding and Replication on Master-Slave
Replication and sharding are strategies that can be combined. If we use both
master slave replication and sharding, this means that we have multiple
masters, but each data item only has a single master. Depending on your



configuration, you may choose a node to be a master for some data and
slaves for others, or you may dedicate nodes for master or slave duties.

□ We have multiple masters, but each data item only has a single master.
□ Two schemes:
□ A node can be a master for some data and slaves for others
□ Nodes are dedicated for master or slave duties

Sharding and Replication on P2P


Using peer-to-peer replication and sharding is a common strategy for column
family databases. In a scenario like this you might have tens or hundreds of
nodes in a cluster with data sharded over them. A good starting point for peer-
to-peer replication is to have a replication factor of 3, so each shard is present
on three nodes. Should a node fail, then the shards on that node will be built
on the other nodes. (See following figure)



□ Usually each shard is present on three nodes
□ A common strategy for column-family databases
Key Points
• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each
server acts as the single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data
can be found in multiple places.
A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that
handles writes while slaves synchronize with the master and may handle
reads.
• Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.
Master-slave replication reduces the chance of update conflicts but peer to-
peer replication avoids loading all writes onto a single point of failure.
Consistency
The consistency property ensures that any transaction will bring the database
from one valid state to another. Any data written to the database



must be valid according to all defined rules, including constraints, cascades,
triggers, and any combination thereof.
It is one of the biggest changes when moving from a centralized relational
database to a cluster-oriented NoSQL system.
Relational databases have strong consistency, whereas NoSQL systems mostly
have eventual consistency.
ACID: A DBMS is expected to support “ACID transactions,” processes that are:
• Atomicity: either the whole process is done or none is
• Consistency: only valid data are written
• Isolation: one operation at a time
• Durability: once committed, it stays that way
Various forms of consistency
1. Update Consistency (or write-write conflict):
Martin and Pramod are looking at the company website and notice that the
phone number is out of date. Incredibly, they both have update access, so
they both go in at the same time to update the number. To make the example
interesting, we’ll assume they update it slightly differently, because each uses
a slightly different format. This issue is called a write-write conflict: two
people updating the same data item at the same time.
When the writes reach the server, the server will serialize them—decide to
apply one, then the other. Let’s assume it uses alphabetical order and picks
Martin’s update first, then Pramod’s. Without any concurrency control,
Martin’s update would be applied and immediately overwritten by Pramod’s.
In this case Martin’s is a lost update. We see this as a failure of consistency
because Pramod’s update was based on the state before Martin’s update, yet
was applied after it.
□ Solutions:
 Pessimistic approach
 Prevent conflicts from occurring
 Usually implemented with write locks managed by the system
 Optimistic approach
 Lets conflicts occur, but detects them and takes action to sort them out
□ Approaches:
 conditional updates: test the value just before updating
 save both updates: record that they are in conflict and
then merge them



o These approaches do not work if more than one server accepts writes (peer-to-peer replication)
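The optimistic, conditional-update approach can be sketched on a single server like this: the write tests the value just before updating and is rejected if another write slipped in between (the phone-number scenario above). In a real store the check-and-write would be one atomic server-side operation.

```python
# Optimistic concurrency via a conditional update: the write succeeds
# only if the current value still matches what the writer last read.
record = {"phone": "555-0000"}

def conditional_update(expected, new_value) -> bool:
    if record["phone"] != expected:
        return False              # conflict detected - caller must re-read
    record["phone"] = new_value
    return True

# Martin and Pramod both read "555-0000", then both try to update:
assert conditional_update("555-0000", "555-1111") is True   # first write lands
assert conditional_update("555-0000", "555-2222") is False  # second is rejected
print(record["phone"])            # no update was silently lost
```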
2. Read Consistency (or read-write conflict)
Alice and Bob are using Ticketmaster website to book tickets for a specific
show. Only one ticket is left for the specific show. Alice signs on to
Ticketmaster first and finds one left, and finds it expensive. Alice takes time
to decide. Bob signs on and finds one ticket left, orders it instantly. Bob
purchases and logs off. Alice decides to buy a ticket, to find there are no
tickets. This is a typical Read-Write Conflict situation.
Another example where Pramod has done a read in the middle of Martin’s
write as shown in below.

We refer to this type of consistency as logical consistency. To avoid a
logically inconsistent read, Martin wraps his two writes in a transaction; the
system then guarantees that Pramod will either read both data items before the
update or both after the update. The length of time an inconsistency is present
is called the inconsistency window.
Replication consistency
Let’s imagine there’s one last hotel room for a desirable event. The hotel
reservation system runs on many nodes. Martin and Cindy are a couple
considering this room, but they are discussing this on the phone because
Martin is in London and Cindy is in Boston. Meanwhile Pramod, who is in
Mumbai, goes and books that last room. That updates the replicated room
availability, but the update gets to Boston quicker than it gets to London.
When Martin and Cindy fire up their browsers to see if the room is
available, Cindy sees it booked and Martin sees it free. This is another
inconsistent read—but it’s a breach of a different form of consistency we
call replication consistency: ensuring that the same data item has the
same value when read from different replicas.

Eventual consistency: At any time, nodes may have replication


inconsistencies but, if there are no further updates, eventually all nodes will
be updated to the same value. In other words, Eventual consistency is a
consistency model used in nosql database to achieve high availability that
informally guarantees that, if no new updates are made to a given data
item, eventually all accesses to that item will return the last updated value.
Eventually consistent services are often classified as providing BASE
(Basically Available, Soft state, Eventual consistency) semantics, in
contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability)
guarantees.
Basic Availability. The NoSQL database approach focuses on availability
of data even in the presence of multiple failures. It achieves this by using
a highly distributed approach to database management. Instead of
maintaining a single large data store and focusing on the fault tolerance
of that store, NoSQL databases spread data across many storage systems
with a high degree of replication. In the unlikely event that a failure disrupts
access to a segment of data, this does not necessarily result in a complete
database outage.



Soft state. BASE databases abandon the consistency requirements of the
ACID model pretty much completely. One of the basic concepts behind
BASE is that data consistency is the developer's problem and should not
be handled by the database.
Eventual Consistency. The only requirement that NoSQL databases
have regarding consistency is to require that at some point in the future,
data will converge to a consistent state. No guarantees are made,
however, about when this will occur. That is a complete departure from
the immediate consistency requirement of ACID that prohibits a transaction
from executing until the prior transaction has completed and the database
has converged to a consistent state.

Version stamp:
A field that changes every time the underlying data in the record
changes. When you read the data you keep a note of the version stamp,
so that when you write data you can check to see if the version has
changed.
You may have come across this technique with updating resources with
HTTP. One way of doing this is to use etags. Whenever you get a resource,
the server responds with an etag in the header. This etag is an opaque
string that indicates the version of the resource. If you then update that
resource, you can use a conditional update by supplying the etag that you
got from your last GET method. If the resource has changed on the server,
the etags won’t match and the server will refuse the update, returning a
412 (Precondition Failed) error response. In short,
□ It helps you detect concurrency conflicts.
□ When you read data, then update it, you can check the version stamp to
ensure nobody updated the data between your read and write
□ Version stamps can be implemented using counters, GUIDs (a large
random number that’s guaranteed to be unique), content hashes,
timestamps, or a combination of these.
□ With distributed systems, a vector of version stamps (a set of counters,
one for each node) allows you to detect when different nodes have
conflicting updates.
Sometimes this is called a compare-and-set (CAS) operation.
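The etag-based conditional update described above can be sketched as a compare-and-set in Python. This is a minimal illustration of the server-side check only, not a real HTTP server; the resource path and body format are made up:

```python
# Hypothetical sketch of the server-side etag check: a content hash acts as
# the version stamp, and a stale If-Match etag yields 412 Precondition Failed.
import hashlib

resources = {"/rooms/101": b'{"booked_by": null}'}   # made-up resource

def etag_of(body):
    # Content hash used as an opaque version stamp
    return hashlib.sha256(body).hexdigest()

def conditional_put(path, if_match, new_body):
    current = resources[path]
    if if_match != etag_of(current):
        return 412        # Precondition Failed: resource changed since the GET
    resources[path] = new_body
    return 200

tag = etag_of(resources["/rooms/101"])                    # from a prior GET
print(conditional_put("/rooms/101", tag, b'{"booked_by": "ann"}'))   # → 200
print(conditional_put("/rooms/101", tag, b'{"booked_by": "bob"}'))   # → 412
```

The second write fails because the first write changed the content, and with it the etag; the client must re-read and retry, which is exactly the compare-and-set discipline.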

Relaxing consistency
The CAP Theorem: The basic statement of the CAP theorem is that, given the
three properties of
Consistency, Availability, and Partition tolerance, you can only get two.
BDA M Tech UNIT2 1 Sem 44 MLWEC
□ Consistency: all clients see the same data at the same time
□ Availability: if you can talk to a node in the cluster, it can read and write
data
□ Partition tolerance: the cluster can survive communication breakages that
separate the cluster into partitions unable to communicate with each other

Network partition: The CAP theorem states that if you get a network
partition, you have to trade off availability of data versus consistency.
Very large systems will "partition" at some point:

• That leaves either C or A to choose from (a traditional DBMS prefers
C over A and P)

• In almost all cases, you would choose A over C (except in specific
applications such as order processing)

CA systems
□ A single-server system is the obvious example of a CA system
□ CA cluster: if a partition occurs, all the nodes would go down
□ A failed, unresponsive node doesn't imply a lack of CAP availability
□ A system that suffers partitions must trade off consistency versus availability
□ Give up some consistency to get some availability

An example
□ Ann is trying to book a room at the Ace Hotel in New York on a node
of a booking system located in London
□ Pathin is trying to do the same on a node located in Mumbai
□ The booking system uses peer-to-peer distribution
□ There is only one room available
□ The network link between the two nodes breaks

Possible solutions
□ CP: Neither user can book any hotel room, sacrificing availability
□ caP: Designate the Mumbai node as the master for the Ace Hotel
  □ Pathin can make the reservation
  □ Ann can see the inconsistent room information
  □ Ann cannot book the room
□ AP: Both nodes accept the hotel reservation
  □ Overbooking!
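The AP outcome produces exactly the kind of conflicting update that a vector of version stamps (one counter per node, as mentioned earlier) can detect. A minimal sketch, with node names and counter values assumed for illustration:

```python
# Hypothetical sketch: detecting the London/Mumbai conflict with a vector of
# version stamps (one counter per node).
def compare(a, b):
    """Return 'descends', 'precedes', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "descends"
    if b_ge:
        return "precedes"
    return "conflict"   # concurrent updates on different nodes

# Both nodes start from version {london: 1, mumbai: 1} and book the room
ann    = {"london": 2, "mumbai": 1}   # Ann's write on the London node
pathin = {"london": 1, "mumbai": 2}   # Pathin's write on the Mumbai node
print(compare(ann, pathin))           # → conflict (the overbooking is detected)
```

Neither stamp descends from the other, so the system knows the two bookings were concurrent and can flag the overbooking for resolution rather than silently losing one.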

Map-Reduce

It is a way to take a big task and divide it into discrete tasks that can be done
in parallel. A common use case for Map/Reduce is in document databases.

A MapReduce program is composed of a Map() procedure that performs
filtering and sorting (such as sorting students by first name into queues, one
queue for each name) and a Reduce() procedure that performs a summary
operation (such as counting the number of students in each queue, yielding
name frequencies).

"Map" step: The master node takes the input, divides it into smaller sub-
problems, and distributes them to worker nodes. A worker node may do this
again in turn, leading to a multi-level tree structure. The worker node
processes the smaller problem, and passes the answer back to its master
node.
"Reduce" step: The master node then collects the answers to all the sub-
problems and combines them in some way to form the output – the answer
to the problem it was originally trying to solve.
Logical view
The Map function is applied in parallel to every (k1, v1) pair in the input
dataset. This produces a list of (k2, v2) pairs for each call. After that, the
MapReduce framework collects all pairs with the same key from all lists and
groups them together, creating one group for each key.
Map(k1, v1) → list(k2, v2)

The Reduce function is then applied in parallel to each group, which in
turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return.

Example 1: Counting and Summing

Problem Statement: There are a number of documents where each document
is a set of words. It is required to calculate the total number of occurrences of
each word across all documents. (A related problem: there is a log file where
each record contains a response time, and it is required to calculate the
average response time.)
Solution:
Let us start with something simple. The pseudocode below shows a Mapper
that simply emits "1" for each word it processes and a Reducer that goes
through the lists of ones and sums them up:
class Mapper
method Map(docid id, doc d)
for all word t in doc d do
Emit(word t, count 1)
class Reducer
method Reduce(word t, counts [c1, c2,...])
sum = 0
for all count c in [c1, c2,...] do
sum = sum + c
Emit(word t, sum)

Here, each document is split into words, and each word is counted by the
map function, using the word as the result key. The framework puts together
all the pairs with the same key and feeds them to the same call to reduce.
Thus, this function just needs to sum all of its input values to find the total
appearances of that word.
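The same computation can be sketched as runnable Python, simulating the map, shuffle, and reduce phases in a single process (function names and sample documents here are illustrative, not from any framework):

```python
# A single-process simulation of the word-count map-reduce above.
from collections import defaultdict

def map_fn(doc_id, doc):
    # Emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    # Sum the list of ones emitted for this word
    return (word, sum(counts))

docs = {1: "nosql map reduce", 2: "map reduce map"}

# Shuffle: group all emitted pairs by key
groups = defaultdict(list)
for doc_id, doc in docs.items():
    for word, count in map_fn(doc_id, doc):
        groups[word].append(count)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result)   # → {'nosql': 1, 'map': 3, 'reduce': 2}
```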
Example 2: (worked example shown as a figure, not reproduced here)


Multistage map-reduce calculations

Let us say that we have a set of documents with attributes of the
following form:

{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "excellent blog..."
    }
  ]
}
And we want to answer a question over more than a single document. That
sort of operation requires us to use aggregation and, over a large amount of
data, that is best done using Map/Reduce to split the work.

Map/Reduce is just a pair of functions operating over a list of data. Let us
say that we want to be able to get a count of comments per blog. We can
do that using the following Map/Reduce queries:

from post in docs.posts
select new {
    post.blog_id,
    comments_length = post.comments.Length
};

from agg in results
group agg by agg.blog_id into g
select new {
    blog_id = g.Key,
    comments_length = g.Sum(x => x.comments_length)
};

There are a couple of things to note here:

• The first query is the map query; it maps the input document into the
final format.
• The second query is the reduce query; it operates over a set of results
and produces an answer.
• The first value in the result is the key, which is what we are
aggregating on (think of the group by clause in SQL).

Let us see how this works. We start by applying the map query to the set of
documents that we have, producing this output:



The next step is to start reducing the results. In real Map/Reduce algorithms,
we partition the original input and work toward the final result. In this case,
imagine that the output of the first step was divided into groups of 3 (so 4
groups overall), and then the reduce query was applied to each group, giving us:



You can see why it is called reduce: for every batch, we apply a sum by
blog_id to get a new Total Comments value. We started with 11 rows, and we
ended up with just 10. That is where it gets interesting, because we are still
not done; we can still reduce the data further.

This is what we do in the third step, reducing the data further still. That is
why the input and output formats of the reduce query must match: we will feed
the output of several of the reduce queries as the input of a new one. You can
also see that we have now moved from having 10 rows to having just 7.



And the final step is:



And now we are done; we can't reduce the data any further because all the
keys are unique.
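The multistage reduction above can be sketched in Python. The blog_id values and comment counts below are made up; the point is that the very same reduce function is applied first to batches of the map output and then to the combined partial results:

```python
# Sketch of a multistage reduce: apply the reduce function to batches of
# map output, then re-apply it to the combined partial results.
from collections import defaultdict

def reduce_batch(rows):
    # rows: (blog_id, comments_length) pairs; sum per blog_id
    totals = defaultdict(int)
    for blog_id, n in rows:
        totals[blog_id] += n
    return list(totals.items())

# Map output, one (blog_id, comment count) row per post (made-up values)
mapped = [(1342, 2), (1342, 3), (107, 1), (1342, 1), (107, 4), (99, 2)]

# Stage 1: reduce each batch of three rows independently
batches = [mapped[i:i + 3] for i in range(0, len(mapped), 3)]
partial = [row for batch in batches for row in reduce_batch(batch)]

# Stage 2: feed the partial results back into the same reduce
final = reduce_batch(partial)
print(sorted(final))   # → [(99, 2), (107, 5), (1342, 6)]
```

This only works because the reduce step's input and output formats match, which is the same property the Partitioning and Combining section below relies on.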

RDBMS compared to MapReduce


MapReduce is a good fit for problems that need to analyze the whole dataset,
in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for
point queries or updates, where the dataset has been indexed to deliver low-
latency retrieval and update times of a relatively small amount of data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually
updated.

Partitioning and Combining

In the simplest form, we think of a map-reduce job as having a single reduce
function. The outputs from all the map tasks running on the various nodes are
concatenated together and sent into the reduce. While this will work, there are
things we can do to increase the parallelism and to reduce the data
transfer (see figure).



Reduce Partitioning Example

The first thing we can do is increase parallelism by partitioning the output of
the mappers. Each reduce function operates on the results of a single key.
This is a limitation—it means you can’t do anything in the reduce that
operates across keys—but it’s also a benefit in that it allows you to run
multiple reducers in parallel. To take advantage of this, the results of the
mapper are divided up based on the key on each processing node. Typically,
multiple keys are grouped together into partitions. The framework then takes
the data from all the nodes for one partition, combines it into a single group
for that partition, and sends it off to a reducer. Multiple reducers can then
operate on the partitions in parallel, with the final results merged together.
(This step is also called “shuffling,” and the partitions are sometimes
referred to as “buckets” or “regions.”)
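A minimal sketch of this partitioning step in Python (the hash partitioner and the two-reducer setup are assumptions, not tied to any particular framework):

```python
# Minimal sketch of partitioning map output by key across two reducers.
# zlib.crc32 gives a stable hash, so a key always lands in the same partition.
import zlib
from collections import defaultdict

def partition(key, num_reducers=2):
    return zlib.crc32(key.encode()) % num_reducers

# Map output gathered from several nodes: (word, 1) pairs
mapped = [("map", 1), ("reduce", 1), ("map", 1), ("nosql", 1)]

buckets = defaultdict(list)              # one "bucket" per reducer
for key, value in mapped:
    buckets[partition(key)].append((key, value))

# Every pair for a given key is now in exactly one bucket, so each
# reducer can process its partition in parallel with the others.
for reducer_id in sorted(buckets):
    print(reducer_id, buckets[reducer_id])
```

Because the partitioner is deterministic, both ("map", 1) pairs land in the same bucket, preserving the rule that a reduce sees every value for its key.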
The next problem we can deal with is the amount of data being moved from
node to node between the map and reduce stages. Much of this data is
repetitive, consisting of multiple key-value pairs for the same key. A combiner
function cuts this data down by combining all the data for the same key
into a single value (see figure).



Combinable Reducer Example

A combiner function is, in essence, a reducer function—indeed, in many
cases the same function can be used for combining as the final reduction. The
reduce function needs a special shape for this to work: its output must match
its input. We call such a function a combinable reducer.
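As a concrete sketch, the word-count reducer from Example 1 is combinable: it consumes (word, count) pairs and emits (word, count) pairs, so the same function can run as a per-node combiner and again as the final reduce. The node contents below are hypothetical:

```python
# Sketch of a combinable reducer for word counting: the output shape matches
# the input shape, so one function serves as both combiner and final reducer.
from collections import defaultdict

def reduce_fn(pairs):
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return list(totals.items())          # (word, count) in, (word, count) out

# Map output held on two nodes (hypothetical) before any data moves
node_a = [("map", 1), ("reduce", 1), ("map", 1)]
node_b = [("map", 1), ("nosql", 1)]

# Combine locally on each node, cutting the data sent over the network
combined = reduce_fn(node_a) + reduce_fn(node_b)

# Final reduce after the shuffle, using the very same function
print(sorted(reduce_fn(combined)))   # → [('map', 3), ('nosql', 1), ('reduce', 1)]
```

Node A ships two pairs instead of three; on larger inputs the saving grows with the amount of key repetition, which is the whole point of the combiner.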

