BDA Unit2 Complete
NoSQL, or ‘Not Only SQL’, refers to a family of data models that differ starkly
from traditional SQL expectations.
The primary difference is that NoSQL does not use a relational data
modelling technique, and it emphasizes flexible design. Because no schema
is required, designing a database becomes a much simpler and cheaper
process. That isn’t to say that you can’t use a schema at all, but rather
that schema design is very flexible.
Another useful feature of NoSQL data models is that they are built for high
efficiency and speed, handling up to millions of queries a second.
There are several reasons why people consider using a NoSQL database.
Scalability. NoSQL databases are designed to scale; it’s one of the primary
reasons that people choose a NoSQL database. Typically, with a relational
database like SQL Server or Oracle, you scale by purchasing larger and faster
servers and storage or by employing specialists to provide additional tuning.
Unlike relational databases, NoSQL databases are designed to easily scale
out as they grow. Data is partitioned and balanced across multiple nodes in a
cluster, and aggregate queries are distributed by default.
Massive write performance. This is probably the canonical usage, based
on Google's influence. High volume: Facebook needs to store 135 billion
messages a month, and Twitter has the problem of storing 7 TB of data
per day, with the prospect of this requirement doubling multiple times per year.
This is the “data is too big to fit on one node” problem. At 80 MB/s it takes
about a day to store 7 TB, so writes need to be distributed over a cluster, which
implies key-value access, MapReduce, replication, fault tolerance, consistency
issues, and all the rest. For faster writes, in-memory systems can be used.
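As a rough check of that arithmetic: 7 TB ≈ 7,000,000 MB, and
7,000,000 MB ÷ 80 MB/s = 87,500 s ≈ 24.3 hours, i.e. roughly one day of
continuous writing on a single node.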
Fast key-value access. This is probably the second most cited virtue of
NoSQL in the general mindset. When latency is important, it's hard to beat
hashing on a key and reading the value directly from memory or in as little
as one disk seek. Not every NoSQL product is about fast access; some are
more about reliability, for example. But what people have wanted for a long
time was a better memcached, and many NoSQL systems offer that.
No single point of failure. Not every product is delivering on this, but we are
seeing a definite convergence on relatively easy to configure and manage
high availability with automatic load balancing and cluster sizing. A perfect
cloud partner.
Programmer ease of use. Accessing your data should be easy. While the
relational model is intuitive for end users, like accountants, it's not very
intuitive for developers. Programmers grok keys, values, JSON, JavaScript
stored procedures, HTTP, and so on. NoSQL is for programmers; this is a
developer-led coup. The response to a database problem can't always be to
hire a really knowledgeable DBA, get your schema right, denormalize a little,
and so on; programmers would prefer a system that they can make work for
themselves. It shouldn't be so hard to make a product perform. Money is part
of the issue: if it costs a lot to scale a product, won't you go with the
cheaper product, that you control, that's easier to use, and that's easier to
scale?
Use the right data model for the right problem. Different data models are
used to solve different problems. Much effort has been put into, for example,
wedging graph operations into a relational model, but it doesn't work. Isn't it
better to solve a graph problem in a graph database? We are now seeing a
general strategy of trying to find the best fit between a problem and a solution.
3. Software Development
The third application that we will look at is software development. Software
development requires extensive research on users, whose needs are then met
through the software that is developed. To do this, a developer must be able
to scan through the data that is available. NoSQL databases are useful in
helping software developers keep tabs on their users, their details, and other
user-related data that is important to note. For these reasons, NoSQL
databases are surely helpful in software development.
As we know, NoSQL databases store data in a format other than relational
tables, and NoSQL is now used in nearly every industry. For the people who
interact with data in such databases, the aggregate data model helps with that
interaction.
Features of NoSQL
Non-relational
∙ NoSQL databases never follow the relational model
Schema-free
∙ NoSQL databases are either schema-free or have relaxed schemas
∙ They do not require any definition of the schema of the data
∙ They offer heterogeneous structures of data in the same domain
Distributed
∙ Multiple NoSQL databases can be executed in a distributed fashion
∙ They offer auto-scaling and fail-over capabilities
∙ The ACID guarantees can often be sacrificed for scalability and throughput
∙ There is mostly no synchronous replication between distributed nodes;
instead, asynchronous multi-master replication, peer-to-peer replication, or
HDFS-style replication is used
∙ They provide only eventual consistency
∙ Replication is easy
4. Graph storage
Graph databases are best suited for representing data with a high yet
flexible number of interconnections, especially when information about
those interconnections is at least as important as the represented data.
In a graph database, data is stored in graph-like structures so that it can
be made easily accessible. Graph databases are commonly used on social
networking sites.
Data Model: A data model is the model through which we perceive and
manipulate our data. For people using a database, the data model describes
how we interact with the data in the database.
Relational Data Model: The relational model takes the information that we
want to store and divides it into tuples. A tuple is a limited data structure:
it captures a set of values, and tuples cannot be nested inside one another.
This limitation is what gives the aggregate model, described next, its room
for development.
Aggregate Model: Aggregate is a term that comes from Domain-Driven
Design. An aggregate is a collection of related objects that we wish to treat
as a unit; it is the unit for data manipulation and for management of
consistency.
• Atomicity holds within an aggregate
• Communication with the data store happens in units of aggregates
• Dealing with aggregates is much more efficient on clusters
• It is often easier for application programmers to work with aggregates
Again, we have some sample data, which we’ll show in JSON format as that’s
a common representation for data in NoSQL.
In this model, we have two main aggregates: customer and order. We’ve used
the black-diamond composition marker in UML to show how data fits into the
aggregation structure. The customer contains a list of billing addresses; the
order contains a list of order items, a shipping address, and payments. The
payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but
instead of using IDs it’s treated as a value and copied each time. This fits the
domain where we would not want the shipping address, nor the payment’s
billing address, to change. In a relational database, we would ensure that the
address rows aren’t updated for this case, making a new row instead. With
aggregates, we can copy the whole address structure into the aggregate as
we need to.
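The original figure's sample data is not reproduced here; as a minimal sketch
(field names are illustrative, not taken from the original figure), the two
aggregates might look like this in JSON, with the same address value copied
into each of the three places it is used:

A customer aggregate:

{
  "id": 1,
  "name": "Martin",
  "billingAddresses": [ { "city": "Chicago" } ]
}

An order aggregate:

{
  "id": 99,
  "customerId": 1,
  "orderItems": [ { "productId": 27, "price": 32.45 } ],
  "shippingAddress": { "city": "Chicago" },
  "orderPayments": [
    { "ccinfo": "1000-1000-1000-1000",
      "billingAddress": { "city": "Chicago" } }
  ]
}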
Key-value databases
A key-value store is a simple hash table, primarily used when all access to
the database is via primary key.
Key-value stores are the simplest NoSQL data stores to use from an API
perspective. The client can either get the value for the key, put a value for
a key, or delete a key from the data store.
The value is a BLOB (Binary Large Object) that the data store just stores,
without caring or knowing what’s inside; it’s the responsibility of the
application to understand what was stored.
Since key-value stores always use primary-key access, they generally have
great performance and can be easily scaled.
It is an associative container, such as a map or dictionary (or, in query
processing, an index). It is an abstract data type composed of a collection of
unique keys and a collection of values, where each key is associated with
one value (or set of values).
The operation of finding the value associated with a key is called a lookup
or indexing; the relationship between a key and its value is sometimes called
a mapping or binding.
Some of the popular key-value databases are Riak, Redis, Memcached DB,
Berkeley DB, HamsterDB, Amazon DynamoDB.
A key-value model is great for lookups of simple or even complex values.
In a key-value database, the block number is obtained directly from the key,
typically by using a hash function; the size of the file is then defined by the
largest block number computed over every possible key.
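A toy Python sketch of that API (names are illustrative; a real store would
persist the values as opaque blobs rather than keep them in a dictionary):

# A toy key-value store: a hash table with the three basic operations.
class KeyValueStore:
    def __init__(self, num_blocks=16):
        self.data = {}
        self.num_blocks = num_blocks

    def put(self, key, value):
        # The store treats the value as an opaque blob.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)

    def block_number(self, key):
        # The block number is derived directly from the key via a hash function.
        return hash(key) % self.num_blocks

store = KeyValueStore()
store.put("session:1234", b"serialized session data")
print(store.get("session:1234"))
print(store.block_number("session:1234"))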
Use Cases:
Key-value data models are used for storing user session data.
They are used for maintaining schema-less user profiles.
They are used for storing user preferences and shopping cart data.
Advantages:
Key-value stores can handle large amounts of data and heavy load.
Retrieval of data by key is easy.
Limitations:
Complex queries may involve multiple key-value pairs, which can delay
performance.
Data involving many-to-many relationships is awkward to model.
Document Databases
In the following example we have two different beers from the Amtel brewery.
We represent each beer as a separate document and reference the brewery in
the brewer field (a sketch of the documents follows this paragraph). The
document-oriented approach provides several upsides compared to the
traditional RDBMS model. First, because information is stored in documents,
updating a schema is a matter of updating the documents for that type of
object; this can be done with no system downtime. Second, we can distribute
the information across multiple servers with greater ease: since records are
contained within entire documents, it is easier to move or replicate an entire
object to another server.
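The original document examples are not reproduced above; a plausible sketch
(field names assumed for illustration) of the two beer documents might be:

{
  "name": "Amtel Lager",
  "abv": 5.0,
  "type": "beer",
  "brewer": "amtel_brewery"
}

{
  "name": "Amtel Dark Ale",
  "abv": 6.2,
  "type": "beer",
  "brewer": "amtel_brewery"
}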
Column-Family Stores
Its name conjures up a tabular structure, which it realizes with sparse columns
and no schema. The column-family model is a two-level aggregate
structure. As with key-value stores, the first key is often described as a row
identifier, picking out the aggregate of interest. The difference with column-
family structures is that this row aggregate is itself formed of a map of more
detailed values. These second-level values are referred to as columns. As
well as accessing the row as a whole, operations also allow picking out a
particular column, so to get a particular customer’s name you could do
something like
get('1234', 'name').
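A minimal Python sketch of that two-level map structure (the row data is
invented for illustration):

# Row key -> map of column names to values (a two-level aggregate).
rows = {
    "1234": {"name": "Martin", "billingAddress": "Chicago"},
    "5678": {"name": "Pramod", "billingAddress": "Bangalore"},
}

def get(row_key, column=None):
    # Fetch the whole row aggregate, or pick out a single column.
    row = rows[row_key]
    return row if column is None else row[column]

print(get("1234"))          # the whole row aggregate
print(get("1234", "name"))  # a single column: "Martin"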
Use Cases:
Column-family data models are used in systems that maintain counters.
They are used for services that have expiring usage data.
They are used in systems that have heavy write requests.
Graph Databases:
Graph databases are one style of NoSQL databases that uses a distribution
model similar to relational databases but offers a different data model that
makes it better at handling data with complex relationships.
Distribution Models
Multiple servers: in NoSQL systems, data is distributed over large clusters.
Single server: the simplest model; everything runs on one machine. Run the
database on a single machine that handles all the reads and writes to the data
store. We prefer this option where possible because it eliminates all the
complexities of distribution; it's easy for operations people to manage and
easy for application developers to reason about.
Although a lot of NoSQL databases are designed around the idea of running
on a cluster, it can make sense to use NoSQL with a single-server distribution
model if the data model of the NoSQL store is more suited to the application.
Graph databases are the obvious category here—these work best in a single-
server configuration.
If your data usage is mostly about processing aggregates, then a single-
server document or key-value store may well be worthwhile because it’s
easier on application developers.
Orthogonal aspects of data distribution models:
Sharding: DB Sharding is nothing but horizontal partitioning of data. Different
people are accessing different parts of the dataset. In these circumstances
we can support horizontal scalability by putting different parts of the data onto
different servers—a technique that’s called sharding.
A table with billions of rows can be partitioned using range partitioning: if
partitioning is based on, for example, the customer transaction date, the data
is divided by date ranges. Irrespective of which instance in a Real Application
Cluster accesses the data, such a table is not horizontally partitioned across
servers; Global Enqueue Resources own certain blocks in each instance, but
that ownership can move around. In a db-shard environment, by contrast, the
data is horizontally partitioned. For example, United States customers can
live in one shard, European Union customers in another shard, and customers
from other countries in a third; from an access perspective there is no need to
know where the data lives, because the db-shard layer goes to the appropriate
shard to pick up the data.
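A toy Python sketch of that routing idea (the shard names and the
region-to-shard mapping are illustrative):

# Route each customer to a shard by region, so readers and writers
# never need to know where the data physically lives.
SHARDS = {
    "US": "shard-1",
    "EU": "shard-2",
    "OTHER": "shard-3",
}

def shard_for(customer):
    # Fall back to the catch-all shard for unmapped regions.
    region = customer.get("region", "OTHER")
    return SHARDS.get(region, SHARDS["OTHER"])

print(shard_for({"id": 1, "region": "US"}))  # shard-1
print(shard_for({"id": 2, "region": "EU"}))  # shard-2
print(shard_for({"id": 3, "region": "IN"}))  # shard-3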
Master-Slave Replication
□ Master
□ is the authoritative source for the data
□ is responsible for processing any updates to that data
□ can be appointed manually or automatically
Peer-to-Peer Replication
• All the replicas have equal weight; they can all accept writes
• The loss of any of them doesn’t prevent access to the data store
Pros and cons of peer-to-peer replication
□ Pros:
□ you can ride over node failures without losing access to data
□ you can easily add nodes to improve your performance
□ Cons:
□ Inconsistency
□ Slow propagation of changes to copies on different nodes
□ Inconsistencies on read lead to problems but are relatively transient
□ Two people can update different copies of the same record stored
on different nodes at the same time - a write-write conflict.
□ Inconsistent writes are forever.
Sharding and Replication on Master-Slave
Replication and sharding are strategies that can be combined. If we use both
master-slave replication and sharding, we have multiple masters, but each
data item has only a single master. Two schemes are possible:
□ A node can be a master for some data and a slave for other data
□ Nodes can be dedicated to master or slave duties
Version stamp:
A field that changes every time the underlying data in the record
changes. When you read the data you keep a note of the version stamp,
so that when you write data you can check to see if the version has
changed.
You may have come across this technique with updating resources with
HTTP. One way of doing this is to use etags. Whenever you get a resource,
the server responds with an etag in the header. This etag is an opaque
string that indicates the version of the resource. If you then update that
resource, you can use a conditional update by supplying the etag that you
got from your last GET method. If the resource has changed on the server,
the etags won’t match and the server will refuse the update, returning a
412 (Precondition Failed) error response. In short,
□ It helps you detect concurrency conflicts.
□ When you read data, then update it, you can check the version stamp to
ensure nobody updated the data between your read and write
□ Version stamps can be implemented using counters, GUIDs (a large
random number that’s guaranteed to be unique), content hashes,
timestamps, or a combination of these.
□ With distributed systems, a vector of version stamps (a set of counters,
one for each node) allows you to detect when different nodes have
conflicting updates.
Sometimes this is called a compare-and-set (CAS) operation.
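A minimal Python sketch of a counter-based version stamp with
compare-and-set (the store and the record names are assumptions for
illustration; vector stamps would keep one counter per node instead):

# Each record carries a version stamp; a write succeeds only if the
# version read earlier is still current (compare-and-set).
class VersionedStore:
    def __init__(self):
        self.records = {}  # key -> (version, value)

    def read(self, key):
        return self.records.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.records.get(key, (0, None))
        if current_version != expected_version:
            return False  # someone updated the data between our read and write
        self.records[key] = (current_version + 1, value)
        return True

store = VersionedStore()
version, _ = store.read("customer:1234")
print(store.write("customer:1234", {"name": "Martin"}, version))  # True
print(store.write("customer:1234", {"name": "Pramod"}, version))  # False: stale stamp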
Relaxing consistency
The CAP Theorem: The basic statement of the CAP theorem is that, given the
three properties of
Consistency, Availability, and Partition tolerance, you can only get two.
□ Consistency: all people see the same data at the same time
□ Availability: if you can talk to a node in the cluster, it can read and write
data
□ Partition tolerance: the cluster can survive communication breakages that
separate the cluster into partitions unable to communicate with each other
Network partition: The CAP theorem states that if you get a network
partition, you have to trade off availability of data versus consistency.
Very large systems will “partition” at some point.
CA systems
□ A single-server system is the obvious example of a CA system
□ CA cluster: if a partition occurs, all the nodes would go down
□ A failed, unresponsive node doesn’t imply a lack of CAP availability
□ A system that suffers partitions must trade off consistency versus availability
□ Give up some consistency to get some availability
An example
□ Ann is trying to book a room at the Ace Hotel in New York on a node
located in London of a booking system
□ Pathin is trying to do the same on a node located in Mumbai
□ The booking system uses peer-to-peer distribution
□ There is only one room available
□ The network link breaks
Possible solutions
□ CP: Neither user can book any hotel room, sacrificing availability
□ caP: Designate the Mumbai node as the master for the Ace Hotel
□ Pathin can make the reservation
□ Ann can see the inconsistent room information
□ Ann cannot book the room
□ AP: both nodes accept the hotel reservation
□ Overbooking!
Map-Reduce
It is a way to take a big task and divide it into discrete tasks that can be done
in parallel. A common use case for Map/Reduce is in document databases.
"Map" step: The master node takes the input, divides it into smaller sub-
problems, and distributes them to worker nodes. A worker node may do this
again in turn, leading to a multi-level tree structure. The worker node
processes the smaller problem, and passes the answer back to its master
node.
"Reduce" step: The master node then collects the answers to all the sub-
problems and combines them in some way to form the output – the answer
to the problem it was originally trying to solve.
Logical view
The Map function is applied in parallel to every (k1, v1) pair in the input
dataset. This produces a list of (k2, v2) pairs for each call. After that, the
MapReduce framework collects all pairs with the same key k2 from all lists
and groups them together, creating one group for each key.
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return.
Here, each document is split into words, and each word is counted by the
map function, using the word as the result key. The framework puts together
all the pairs with the same key and feeds them to the same call to reduce.
Thus, the reduce function just needs to sum all of its input values to find the
total appearances of that word.
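A runnable Python sketch of that word count, with the framework's shuffle
phase simulated by a plain dictionary (the documents are invented for
illustration):

# Word count with explicit map and reduce functions.
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum all input values to get the total appearances of the word.
    return word, sum(counts)

documents = {1: "hello world", 2: "hello nosql world"}

# Shuffle phase: group all emitted pairs by key.
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'hello': 2, 'world': 2, 'nosql': 1}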
Example 2:
Let us say that we have a set of documents, each with attributes of the
following form:
"type": "post",
"name": "Raven's Map/Reduce functionality",
"blog_id": 1342,
"post_id": 29293921,
"tags": ["raven", "nosql"],
"post_content": "<p>...</p>",
"comments": [
{
"source_ip": '124.2.21.2',
"author": "martin",
"text": "excellent blog..."
}]
}
And we want to answer a question over more than a single document. That
sort of operation requires aggregation, and over a large amount of data it is
best done using Map/Reduce, to split the work.
Map/Reduce is just a pair of functions operating over a list of data. Let us
say that we want to get a count of comments per blog. We can do that using
the following pair of Map/Reduce queries (a sketch follows the list below):
• The first query is the map query; it maps the input document into the
final format.
• The second query is the reduce query; it operates over a set of results
and produces an answer.
• The first value in the result is the key, which is what we are
aggregating on (think of the GROUP BY clause in SQL).
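The actual RavenDB queries are not reproduced above; as a hedged Python
sketch of the same map/reduce pair (function names invented, document
shape as in the example above):

# Counting comments per blog with a map/reduce pair whose reduce
# input and output formats match, so reduce can be re-applied.
from collections import defaultdict

def map_query(post):
    # One row per post: (blog_id, number of comments on this post).
    return (post["blog_id"], len(post["comments"]))

def reduce_query(rows):
    # Group by blog_id and sum the counts; the output has the same
    # shape as the input, so it can feed another reduce step.
    totals = defaultdict(int)
    for blog_id, count in rows:
        totals[blog_id] += count
    return list(totals.items())

posts = [
    {"blog_id": 1342, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "comments": []},
    {"blog_id": 8167, "comments": [{"author": "ann"}, {"author": "pathin"}]},
]

mapped = [map_query(p) for p in posts]
print(reduce_query(mapped))                # [(1342, 1), (8167, 2)]
print(reduce_query(reduce_query(mapped)))  # same shape: reduce is re-applicable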
Let us see how this works. We start by applying the map query to the set of
documents that we have, producing one output row per document. We then
apply the reduce query over batches of those rows, and apply it again over
the partial results, reducing the data further still. That is why the input and
output format of the reduce query must match: we feed the output of several
reduce queries as the input of a new one. In the original example, this final
reduction moves us from having 10 rows to having just 7.