NoSQL Databases

Introduction to MongoDB

DANIELE APILETTI

POLITECNICO DI TORINO
Introduction
•A leading document-based NoSQL database
•Full of features, beyond NoSQL:
o High performance
o High availability
o Native scalability
o High flexibility
o Open source

DATA MANAGEMENT AND VISUALIZATION 2


Terminology – Approximate mapping

Relational database MongoDB

Table Collection

Record Document

Column Field



Document Data Design
•High-level, business-ready representation of the data
o Records are stored as BSON documents
▪ BSON is a binary representation of JSON documents
▪ field-value pairs
▪ may be nested



Document Data Design
•High-level, business-ready representation of the data
•Mapping into developer-language objects
o date, timestamp, array, sub-documents, etc.
•Field names
o The field name _id is reserved for use as a primary key; its value must be unique in the
collection, is immutable, possibly autogenerated, and may be of any type other than an array.
o Field names cannot contain the null character.
o The server permits storage of field names that contain dots (.) and dollar signs ($)
o BSON documents may have more than one field with the same name. Most MongoDB
interfaces, however, represent MongoDB with a structure (e.g., a hash table) that does not
support duplicate field names.
o The maximum BSON document size is 16 megabytes. To store documents larger than the
maximum size, MongoDB provides GridFS.
o Unlike JavaScript objects, the fields in a BSON document are ordered.



MongoDB

Databases and collections.


Create and delete operations
Databases and Collections
•Each instance of MongoDB can manage multiple databases
•Each database is composed of a set of collections
•Each collection contains a set of documents
o The documents of each collection represent similar “objects”
o However, remember that MongoDB is schema-less
o You are not required to define the schema of the documents a priori, and documents of the same
collection can have different fields
o Starting in MongoDB 3.2, you can enforce document validation rules for a collection during
update and insert operations.



Databases and Collections
•Show the list of available databases
show databases

•Select the database you are interested in


use <database-name>

•E.g.
o use deliverydb



Databases and Collections
•Create a database and a collection inside the database
o Select the database by using the command “use <database name>”
o Then, create a collection
▪ MongoDB creates a collection implicitly when the collection is first referenced in a command

•Delete/Drop a database
o Select the database by using “use <database name>”
o Execute the command
db.dropDatabase()
E.g.,
use deliverydb;
db.dropDatabase();



Databases and Collections
•A collection stores documents, uniquely identified by a document “_id”
•Create collections
db.createCollection(<collection name>, <options>);

o The collection is associated with the current database. Always select the database
before creating a collection.
o Options related to the collection size and indexing, e.g., to create a capped
collection, or to create a new collection that uses document validation
•E.g.,
o db.createCollection("authors", {capped: true, size: 100000}); // a capped collection also requires a maximum size in bytes



Databases and Collections
•Show collections
show collections

•Drop collections
db.<collection_name>.drop()

•E.g.
o db.authors.drop()



C.R.U.D. Operations

•Create

•Read

•Update

•Delete



Create: insert one document
•Insert a single document in a collection
db.<collection name>.insertOne( {<set of the field:value pairs of the new document>} );

•E.g.,
db.people.insertOne( {
user_id: "abc123",
age: 55,
status: "A"
} );



•In each field:value pair of the inserted document
o the part before the colon is the field name (e.g., age)
o the part after the colon is the field value (e.g., 55)


Create: insert one document
•Insert a single document in a collection
db.<collection name>.insertOne( {<set of the field:value pairs of the new document>} );

Now people contains a new document representing a user with:


user_id: "abc123",
age: 55
status: "A"



Create: insert one document
•E.g.,
db.people.insertOne( {
  user_id: "abc124",
  age: 45,
  favorite_colors: ["blue", "green"]   // favorite_colors is an array
} );

Now people contains a new document representing a user with
user_id: "abc124", age: 45, and an array favorite_colors containing
the values "blue" and "green"


Create: insert one document
•E.g.,
db.people.insertOne( {
  user_id: "abc124",
  age: 45,
  address: {                 // nested document
    street: "my street",
    city: "my city"
  }
} );

Example of a document containing a nested document


Create: insert many documents
•Insert multiple documents in a single statement:
db.<collection name>.insertMany([ <comma separated list of documents> ]);

db.people.insertMany( [
{ user_id: "abc123", age: 30, status: "A"},
{ user_id: "abc456", age: 40, status: "A"},
{ user_id: "abc789", age: 50, status: "B"}
] );



Create: insert many documents
•Insert many documents with one single command
db.<collection name>.insertMany([ <comma separated list of documents> ]);

•E.g.,
db.people.insertMany([
{user_id: "abc123", age: 55, status: "A"},
{user_id: "abc124", age: 45, favorite_colors: ["blue", "green"]}
] );



Delete
•Deleting data in MongoDB corresponds to deleting the associated documents.
o Conditional delete
o Multiple delete

MySQL clause          MongoDB operator
DELETE FROM           deleteMany()

•Delete the documents that satisfy a condition
MySQL:
DELETE FROM people
WHERE status = "D"
MongoDB:
db.people.deleteMany( { status: "D" } )

•Delete all the documents of a collection
MySQL:
DELETE FROM people
MongoDB:
db.people.deleteMany({})


MongoDB

Databases and collections.


Querying data (find operations)
Query language
•Most of the operations available in the SQL language can be expressed in the
MongoDB query language

MySQL clause          MongoDB operator
SELECT                find()

MySQL:
SELECT *
FROM people
MongoDB:
db.people.find()


Read data from documents
•Select documents
db.<collection name>.find( {<conditions>}, {<fields of interest>} );

•Selects the documents satisfying the specified conditions and returns
only the fields specified in fields of interest
o <conditions> are optional (filter conditions)
▪ conditions take a document with the form:
{field1 : <value>, field2 : <value> ... }
▪ Conditions may specify a value or a regular expression
o <fields of interest> are optional (projected fields)
▪ projections take a document with the form:
{field1 : <value>, field2 : <value> ... }
▪ 1/true to include the field, 0/false to exclude the field
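The inclusion rule for projected fields can be illustrated with a minimal plain-JavaScript sketch. It only models the behaviour on an in-memory object and is not how MongoDB implements projections; the helper name `project` and the sample document are assumptions made for this illustration.

```javascript
// Hypothetical helper: models an inclusion projection on one in-memory document.
function project(doc, spec) {
  const out = {};
  // _id is returned by default unless the projection sets _id: 0 (or false).
  if ("_id" in doc && spec._id !== 0 && spec._id !== false) {
    out._id = doc._id;
  }
  // Copy every other field flagged with 1/true that exists in the document.
  for (const [field, flag] of Object.entries(spec)) {
    if (field !== "_id" && flag && field in doc) {
      out[field] = doc[field];
    }
  }
  return out;
}

// Sample document (made up for the example).
const person = { _id: 7, user_id: "abc123", age: 55, status: "A" };

const withId = project(person, { user_id: 1, status: 1 });
const withoutId = project(person, { user_id: 1, status: 1, _id: 0 });

console.log(withId);    // _id, user_id and status
console.log(withoutId); // user_id and status only
```

In the shell, the corresponding calls would be find({ }, { user_id: 1, status: 1 }) and find({ }, { user_id: 1, status: 1, _id: 0 }).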


find() operator
MySQL clause          MongoDB operator
SELECT                find()
WHERE                 find({<WHERE CONDITIONS>})

•Select fields: the first argument is the (here empty) where condition, the second lists the selected fields
MySQL:
SELECT id, user_id, status
FROM people
MongoDB:
db.people.find(
  { },
  { user_id: 1, status: 1 }
)

•Where condition
MySQL:
SELECT *
FROM people
WHERE status = "A"
MongoDB:
db.people.find(
  { status: "A" }
)

•Where condition and selection fields
MySQL:
SELECT user_id, status
FROM people
WHERE status = "A"
MongoDB:
db.people.find(
  { status: "A" },
  { user_id: 1, status: 1, _id: 0 }
)
By default, the _id field is always returned.
To remove it, you must explicitly indicate _id: 0

•Where condition on a field of a nested document
db.people.find(
  { "address.city": "Rome" }
)
matches, e.g., the document
{ _id: "A",
  address: {                 // nested document
    street: "Via Torino",
    number: "123/B",
    city: "Rome",
    code: "00184"
  }
}


Read data from one document
•Select a single document
db.<collection name>.findOne( {<conditions>}, {<fields of interest>} );

•Selects one document that satisfies the specified query criteria.
o If multiple documents satisfy the query, it returns the first one according
to the natural order, which reflects the order of documents on the disk.


(No) joins
•No join operator exists in find() queries (the exception is the $lookup aggregation stage)
o You must write a program that
▪ Selects the documents of the first collection you are interested in
▪ Iterates over the documents returned by the first step, by using the loop statement provided by
the programming language you are using
▪ Executes one query for each of them to retrieve the corresponding document(s) in the other
collection

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup
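
The three steps of an application-side join can be sketched in plain JavaScript. The two arrays stand in for collections, and all names (`orders`, `customers`, `customer_id`, `findCustomerById`) are hypothetical; with a real driver, the inner lookup would be a query such as findOne on the second collection.

```javascript
// In-memory stand-ins for two collections (hypothetical sample data).
const orders = [
  { _id: 1, item: "book", customer_id: "c1" },
  { _id: 2, item: "pen", customer_id: "c2" },
  { _id: 3, item: "lamp", customer_id: "c1" }
];
const customers = [
  { _id: "c1", name: "Mario" },
  { _id: "c2", name: "John" }
];

// Stand-in for one query against the second collection, e.g. findOne({ _id: id }).
function findCustomerById(id) {
  return customers.find((c) => c._id === id) ?? null;
}

// Step 1: select the documents of the first collection.
// Step 2: iterate over them with the loop construct of the host language.
// Step 3: run one lookup per document to fetch the related document.
const joined = orders.map((order) => ({
  ...order,
  customer: findCustomerById(order.customer_id)
}));

console.log(joined);
```

This pattern issues one query per document of the first collection, which is why it becomes expensive on large result sets.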



(No) joins
•(no) joins
o Relations among documents/records are provided by
▪ Manual references: the _id of the related document is stored in a field; a second query is required to retrieve it
▪ DBRef, a standard approach across collections and databases (check the driver compatibility)
{ "$ref" : <value>, "$id" : <value>, "$db" : <value> }

https://docs.mongodb.com/manual/reference/database-references/



Comparison query operators
Name Description
$eq or : Matches values that are equal to a specified value
$gt Matches values that are greater than a specified value
$gte Matches values that are greater than or equal to a specified
value
$in Matches any of the values specified in an array
$lt Matches values that are less than a specified value
$lte Matches values that are less than or equal to a specified value

$ne Matches all values that are not equal to a specified value,
including documents that do not contain the field.
$nin Matches none of the values specified in an array
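
The semantics in the table above can be approximated with a small plain-JavaScript matcher. This is a sketch for numeric and string values only, not the server's comparison logic (it ignores BSON type ordering and arrays); `comparators` and `matches` are hypothetical names. It shows in particular that $ne also selects documents that do not contain the field.

```javascript
// Hypothetical model of the comparison operators on plain values.
const comparators = {
  $eq: (v, x) => v === x,
  $ne: (v, x) => v !== x, // also true when the field is missing (v is undefined)
  $gt: (v, x) => v > x,
  $gte: (v, x) => v >= x,
  $lt: (v, x) => v < x,
  $lte: (v, x) => v <= x,
  $in: (v, xs) => xs.includes(v),
  $nin: (v, xs) => !xs.includes(v)
};

// Hypothetical helper: every operator listed in `condition` must hold for doc[field].
function matches(doc, field, condition) {
  return Object.entries(condition).every(([op, arg]) => comparators[op](doc[field], arg));
}

// Sample documents (made up); the last one has no age field.
const people = [
  { name: "Ann", age: 25 },
  { name: "Bob", age: 40 },
  { name: "Eve" }
];

const olderThan25 = people.filter((p) => matches(p, "age", { $gt: 25 })).map((p) => p.name);
const notAge25 = people.filter((p) => matches(p, "age", { $ne: 25 })).map((p) => p.name);
const inRange = people.filter((p) => matches(p, "age", { $gt: 25, $lte: 50 })).map((p) => p.name);

console.log(olderThan25, notAge25, inRange);
```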



Comparison operators
MySQL   MongoDB   Description
>       $gt       greater than
>=      $gte      greater than or equal to
<       $lt       less than
<=      $lte      less than or equal to
=       $eq       equal to
!=      $ne       not equal to

The $eq expression is equivalent to { field: <value> }.

•Greater than (>)
MySQL: SELECT * FROM people WHERE age > 25
MongoDB: db.people.find( { age: { $gt: 25 } } )

•Greater than or equal to (>=)
MySQL: SELECT * FROM people WHERE age >= 25
MongoDB: db.people.find( { age: { $gte: 25 } } )

•Less than (<)
MySQL: SELECT * FROM people WHERE age < 25
MongoDB: db.people.find( { age: { $lt: 25 } } )

•Less than or equal to (<=)
MySQL: SELECT * FROM people WHERE age <= 25
MongoDB: db.people.find( { age: { $lte: 25 } } )

•Equal to (=)
MySQL: SELECT * FROM people WHERE age = 25
MongoDB: db.people.find( { age: { $eq: 25 } } )

•Not equal to (!=)
MySQL: SELECT * FROM people WHERE age != 25
MongoDB: db.people.find( { age: { $ne: 25 } } )


Conditional operators
•To specify multiple conditions, conditional operators are used
•MongoDB offers the same functionality as MySQL with a different
syntax.

MySQL   MongoDB     Description
AND     , (comma)   Both verified
OR      $or         At least one verified

•AND
MySQL: SELECT * FROM people WHERE status = "A" AND age = 50
MongoDB: db.people.find( { status: "A", age: 50 } )

•OR
MySQL: SELECT * FROM people WHERE status = "A" OR age = 50
MongoDB: db.people.find( { $or: [ { status: "A" }, { age: 50 } ] } )


Type of read operations (1)
• Count
db.people.count({ age: 32 })

• Comparison
db.people.find({ age: { $gt: 32 } }) // or similarly with $gte, $lt, $lte

db.people.find({ age: { $in: [32, 40] } }) // returns all documents having age either 32 or 40

db.people.find({ age: { $gt: 25, $lte: 50 } }) // returns all documents having age > 25 and age <= 50

• Logical
db.people.find({ name: { $not: { $eq: "Max" } } })

db.people.find({ $or: [ { age: 32 }, { age: 33 } ] })


Type of read operations (2)
db.items.find({
  $and: [
    { $or: [ { qty: { $lt: 15 } }, { qty: { $gt: 50 } } ] },
    { $or: [ { sale: true }, { price: { $lt: 5 } } ] }
  ]
})

This query returns documents (items) that satisfy both these conditions:
1. Quantity sold (field “qty”) either less than 15 or greater than 50
2. Either the item is on sale (field “sale”: true) or its price is less than 5


Type of read operations (3)
• Element
db.inventory.find( { item: null } ) // equality filter

db.inventory.find( { item : { $exists: false } } ) // existence filter

db.inventory.find( { item : { $type: 10 } } ) // type filter


Note:
o Item: null → matches documents that either
▪ contain the item field whose value is null or
▪ that do not contain the item field
o Item: {$exists: false} → matches documents that do not contain the item field

• Aggregation → Slides on “Data aggregation”
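
The difference between the three element filters can be made concrete with a plain-JavaScript sketch over an in-memory array. It only models the matching rules stated in the note; the sample `inventory` documents are made up for this illustration.

```javascript
// Sample documents: item is null, item is absent, item has a value.
const inventory = [
  { _id: 1, item: null },
  { _id: 2 },
  { _id: 3, item: "pen" }
];

// Models { item: null }: the field is null OR the field is absent.
const equalsNull = inventory.filter((d) => !("item" in d) || d.item === null);

// Models { item: { $exists: false } }: the field is absent.
const missing = inventory.filter((d) => !("item" in d));

// Models { item: { $type: 10 } }: the field exists and holds a BSON Null (type number 10).
const isNullType = inventory.filter((d) => "item" in d && d.item === null);

console.log(equalsNull.map((d) => d._id)); // ids matched by the equality filter
console.log(missing.map((d) => d._id));    // ids matched by the existence filter
console.log(isNullType.map((d) => d._id)); // ids matched by the type filter
```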



Type of read operations (4)
• Embedded Documents
db.inventory.find( { size: { h: 14, w: 21, uom: "cm" } } )
Select all documents where the field size equals the exact document { h: 14, w: 21, uom: "cm" }

db.inventory.find( { "size.uom": "in" } )

To specify a query condition on fields in an embedded/nested document, use dot notation

db.inventory.find( { "size.h": { $lt: 15 } } )

Dot notation and comparison operator



Cursor
•db.collection.find() returns a cursor. It can be used to iterate over the
results or as input for subsequent operations.
•E.g.,
o cursor.sort()
o cursor.count()
o cursor.forEach() //shell method
o cursor.limit()
o cursor.max()
o cursor.min()
o cursor.pretty()



Cursor: sorting data
•Sort is a cursor method
•Sort documents
o sort( {<list of field:value pairs>} );
o field specifies which field is used to sort the returned documents
o value = -1 descending order
o value = 1 ascending order

•Multiple field:value pairs can be specified
o Documents are sorted based on the first field
o In case of ties, the second specified field is considered
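The tie-breaking rule for multiple sort fields can be sketched as a plain-JavaScript comparator. `sortBy` is a hypothetical helper that mimics sort({ field: 1 | -1, ... }) on an in-memory array of simple values; it is not the cursor method itself, and the sample data is made up.

```javascript
// Hypothetical helper: sorts a copy of `docs` according to a sort specification.
function sortBy(docs, spec) {
  const keys = Object.entries(spec);
  return [...docs].sort((a, b) => {
    // Compare on the first field; move to the next field only in case of a tie.
    for (const [field, direction] of keys) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
    }
    return 0;
  });
}

const people = [
  { name: "Ann", age: 30 },
  { name: "Bob", age: 25 },
  { name: "Eve", age: 30 }
];

// Ascending age; ties on age are broken by descending name.
const sorted = sortBy(people, { age: 1, name: -1 });
console.log(sorted.map((p) => p.name));
```

In the shell, the corresponding call would be db.people.find().sort( { age: 1, name: -1 } ).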


Cursor: sorting data
•Sorting data with respect to a given field with the sort() method
MySQL clause          MongoDB operator
ORDER BY              sort()

•Returns all documents having status = "A". The result is sorted in ascending age order
MySQL: SELECT * FROM people WHERE status = "A" ORDER BY age ASC
MongoDB: db.people.find( { status: "A" } ).sort( { age: 1 } )

•Returns all documents having status = "A". The result is sorted in descending age order
MySQL: SELECT * FROM people WHERE status = "A" ORDER BY age DESC
MongoDB: db.people.find( { status: "A" } ).sort( { age: -1 } )


Cursor: counting
MySQL clause          MongoDB operator
COUNT                 count() or find().count()

•Count all the documents
MySQL: SELECT COUNT(*) FROM people
MongoDB: db.people.count()
or db.people.find().count()

•Count the documents with status = "A"
MySQL: SELECT COUNT(*) FROM people WHERE status = "A"
MongoDB: db.people.count( { status: "A" } )
or db.people.find( { status: "A" } ).count()

•Count the documents with age > 30
MySQL: SELECT COUNT(*) FROM people WHERE age > 30
MongoDB: db.people.count( { age: { $gt: 30 } } )

Similar to the find() operator, count() can embed conditional statements.


Cursor: forEach()
•forEach() applies a JavaScript function to each document returned by the cursor.

db.people.find({ status: "A" }).forEach(
  function(myDoc) {
    print( "user: " + myDoc.name );
  })

•Selects the documents with status = "A" and prints the name field of each document.


MongoDB

Databases and collections.


Update operations
Document update
•Back at the C.R.U.D. operations, we can now see how documents
can be updated using:
db.collection.updateOne(<filter>, <update>, <options>)

db.collection.updateMany(<filter>, <update>, <options>)

o <filter> = filter condition. It specifies which documents must be updated

o <update> = specifies which fields must be updated and their new values

o <options> = specific update options



Document update
•E.g.,
db.inventory.updateMany(
{ "qty": { $lt: 50 } },
{
$set: { "size.uom": "in", status: "P" },
$currentDate: { lastModified: true }
}
)
oThis operation updates all documents with qty<50
oIt sets the value of the size.uom field to "in", the value of the status field to
"P", and the value of the lastModified field to the current date.



Updating data
•The tuples to be updated are selected with the WHERE clause in MySQL
and with the filter document in MongoDB

MySQL:
UPDATE <table>
SET <statement>
WHERE <condition>
MongoDB:
db.<table>.updateMany(
  { <condition> },
  { $set: {<statement>} }
)

•E.g., set the status of the people older than 25
MySQL: UPDATE people SET status = "C" WHERE age > 25
MongoDB: db.people.updateMany( { age: { $gt: 25 } }, { $set: { status: "C" } } )

•E.g., increase by 3 the age of the people with status "A"
MySQL: UPDATE people SET age = age + 3 WHERE status = "A"
MongoDB: db.people.updateMany( { status: "A" }, { $inc: { age: 3 } } )
The $inc operator increments a field by a specified value

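The effect of $set and $inc can be modelled in plain JavaScript on in-memory documents. This is a sketch limited to top-level fields, not the server's update engine; `applyUpdate` and `updateMany` are hypothetical helper names, and the filter is passed as a JavaScript predicate rather than a query document.

```javascript
// Hypothetical helper: applies $set and $inc to a copy of one document.
function applyUpdate(doc, update) {
  const result = { ...doc };
  for (const [field, value] of Object.entries(update.$set ?? {})) {
    result[field] = value; // $set overwrites or creates the field
  }
  for (const [field, amount] of Object.entries(update.$inc ?? {})) {
    result[field] = (result[field] ?? 0) + amount; // a missing field is set to the amount
  }
  return result;
}

// Hypothetical helper: updates every document selected by the predicate.
function updateMany(docs, predicate, update) {
  return docs.map((d) => (predicate(d) ? applyUpdate(d, update) : d));
}

const people = [
  { user_id: "abc123", age: 30, status: "A" },
  { user_id: "abc789", age: 50, status: "B" }
];

// Mirrors: UPDATE people SET age = age + 3 WHERE status = "A"
const older = updateMany(people, (d) => d.status === "A", { $inc: { age: 3 } });
// Mirrors: UPDATE people SET status = "C" WHERE age > 25
const relabeled = updateMany(people, (d) => d.age > 25, { $set: { status: "C" } });

console.log(older, relabeled);
```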
MongoDB

Data aggregation pipeline


General concepts
•Documents enter a multi-stage pipeline that transforms the documents of a
collection into an aggregated result
•Pipeline stages can appear multiple times in the pipeline
o Exceptions: the $out, $merge, and $geoNear stages

•Pipeline expressions can only operate on the current document in the pipeline and
cannot refer to data from other documents: expression operations provide in-memory
transformation of documents (at most 100 MB of RAM per stage).
•Generally, expressions are stateless and are only evaluated when seen by the
aggregation process with one exception: accumulator expressions used in the $group
stage (e.g. totals, maximums, minimums, and related data).
•The aggregation pipeline provides an alternative to map-reduce and may be the
preferred solution for aggregation tasks since MongoDB introduced the $accumulator
and $function aggregation operators starting in version 4.4



Aggregation Framework
SQL MongoDB
WHERE $match
GROUP BY $group
HAVING $match
SELECT $project
ORDER BY $sort
LIMIT $limit
SUM $sum
COUNT $sum



Aggregation pipeline
•Aggregate functions can be applied to collections to group documents

db.collection.aggregate( [ <stage 1>, <stage 2>, ... ] )

o Common stages: $match, $group, ...
o The aggregate function allows applying aggregating functions (e.g., sum, average, ...)
o It can be combined with an initial definition of groups based on the grouping fields


Aggregation example (1)
db.people.aggregate( [
{ $group: { _id: null,
mytotal: { $sum: "$age" },
mycount: { $sum: 1 }
}
}
] )
•Considers all documents of people and
o sums the values of their age
o sums a set of ones (one for each document), i.e., counts the documents

•The returned values are associated with a field called “mytotal” and a field called “mycount”

Aggregation example (2)
db.people.aggregate( [
{ $group: { _id: null,
myaverage: { $avg: "$age" },
mytotal: { $sum: "$age" }
}
}
] )
o Considers all documents of people and computes
▪ sum of age
▪ average of age



Aggregation example (3)
db.people.aggregate( [
  { $match: { status: "A" } },     // where conditions
  { $group: { _id: null,
              count: { $sum: 1 }
            }
  }
] )
o Counts the number of documents in people with status equal to “A”


Aggregation in “Group By”
MySQL clause          MongoDB operator
GROUP BY              aggregate($group)

•Average age for each status
MySQL:
SELECT status,
       AVG(age) AS total
FROM people
GROUP BY status
MongoDB:
db.people.aggregate( [
  {
    $group: {
      _id: "$status",               // group field
      total: { $avg: "$age" }       // aggregation function
    }
  }
] )

•Sum of the ages for each status
MySQL:
SELECT status,
       SUM(age) AS total
FROM people
GROUP BY status
MongoDB:
db.people.aggregate( [
  {
    $group: {
      _id: "$status",               // group field
      total: { $sum: "$age" }       // aggregation function
    }
  }
] )

Aggregation in “Group By + Having”
MySQL clause          MongoDB operator
HAVING                aggregate($group, $match)

MySQL:
SELECT status,
       SUM(age) AS total
FROM people
GROUP BY status
HAVING total > 1000
MongoDB:
db.people.aggregate( [
  {                                 // group stage: specify the aggregation field
    $group: {                       // and the aggregation function
      _id: "$status",
      total: { $sum: "$age" }
    }
  },
  { $match: { total: { $gt: 1000 } } }   // match stage: specify the condition as in HAVING
] )
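The two-stage "group, then match" logic can be traced in plain JavaScript on a small in-memory array. This is only a sketch of the semantics, not the aggregation engine; the sample data and the threshold of 100 (instead of 1000) are assumptions chosen so that the tiny example produces a visible result.

```javascript
// Sample documents (made up for the example).
const people = [
  { status: "A", age: 55 },
  { status: "A", age: 50 },
  { status: "B", age: 40 }
];

// Group stage: one accumulator per distinct status, models
// { $group: { _id: "$status", total: { $sum: "$age" } } }.
const totals = new Map();
for (const p of people) {
  totals.set(p.status, (totals.get(p.status) ?? 0) + p.age);
}
const grouped = [...totals].map(([status, total]) => ({ _id: status, total }));

// Match stage placed after the group stage: plays the role of HAVING,
// models { $match: { total: { $gt: 100 } } }.
const having = grouped.filter((g) => g.total > 100);

console.log(grouped);
console.log(having);
```

Because the filter runs on the output of the group stage, it can refer to the computed field total, which a $match placed before $group could not do.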


Aggregation at a glance



Pipeline stages (1)
Stage Description
$addFields Adds new fields to documents. Reshapes each document by adding new fields to
output documents that will contain both the existing fields from the input documents
and the newly added fields.
$bucket Categorizes incoming documents into groups, called buckets, based on a specified
expression and bucket boundaries. On the contrary, $group creates a “bucket” for
each value of the group field.
$bucketAuto Categorizes incoming documents into a specific number of groups, called buckets,
based on a specified expression. Bucket boundaries are automatically determined in
an attempt to evenly distribute the documents into the specified number of buckets.
$collStats Returns statistics regarding a collection or view (it must be the first stage)
$count Passes a document to the next stage that contains a count of the input number of
documents to the stage (same as $group+$project)



Pipeline stages (2)

Stage Description
$facet Processes multiple aggregation pipelines within a single stage on the same set of
input documents. Enables the creation of multi-faceted aggregations capable of
characterizing data across multiple dimensions. Input documents are passed to the
$facet stage only once, without needing multiple retrieval.
$geoNear Returns an ordered stream of documents based on the proximity to a geospatial
point. The output documents include an additional distance field. It must be the first
stage of the pipeline.
$graphLookup Performs a recursive search on a collection. To each output document, adds a new
array field that contains the traversal results of the recursive search for that
document.



Example

db.employees.aggregate( [
  {
    $graphLookup: {
      from: "employees",
      startWith: "$reportsTo",
      connectFromField: "reportsTo",
      connectToField: "name",
      as: "reportingHierarchy"
    }
  }
] )

•The $graphLookup operation recursively matches on the reportsTo and name fields in the
employees collection, returning the reporting hierarchy for each person.
•Returns a list of documents such as
{
  "_id" : 5,                        // _id, name, reportsTo: original document
  "name" : "Asya",
  "reportsTo" : "Ron",
  "reportingHierarchy" : [
    { "_id" : 1, "name" : "Dev" },
    { "_id" : 2, "name" : "Eliot", "reportsTo" : "Dev" },
    { "_id" : 3, "name" : "Ron", "reportsTo" : "Eliot" }
  ]
}
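The recursive matching performed by $graphLookup can be approximated in plain JavaScript. The sketch below follows reportsTo to name links on an in-memory array until no new document is reached; `reportingHierarchy` is a hypothetical helper, the traversal order is an assumption, and options such as a depth limit are not modelled.

```javascript
// In-memory stand-in for the employees collection of the example.
const employees = [
  { _id: 1, name: "Dev" },
  { _id: 2, name: "Eliot", reportsTo: "Dev" },
  { _id: 3, name: "Ron", reportsTo: "Eliot" },
  { _id: 5, name: "Asya", reportsTo: "Ron" }
];

// Hypothetical helper: start from employee.reportsTo (startWith), match it against
// name (connectToField), then continue from the reportsTo of each match (connectFromField).
function reportingHierarchy(employee) {
  const found = [];
  const seen = new Set(); // guards against cycles and duplicates
  let frontier = [employee.reportsTo];
  while (frontier.length > 0) {
    const next = [];
    for (const e of employees) {
      if (frontier.includes(e.name) && !seen.has(e._id)) {
        seen.add(e._id);
        found.push(e);
        if (e.reportsTo !== undefined) next.push(e.reportsTo);
      }
    }
    frontier = next;
  }
  return found;
}

const asya = employees.find((e) => e.name === "Asya");
const asyaResult = { ...asya, reportingHierarchy: reportingHierarchy(asya) };
const dev = employees.find((e) => e.name === "Dev");
const devResult = { ...dev, reportingHierarchy: reportingHierarchy(dev) };

console.log(asyaResult.reportingHierarchy.map((e) => e.name));
console.log(devResult.reportingHierarchy.length);
```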


Pipeline stages (3)
Stage Description
$group Groups input documents by a specified identifier expression and applies the
accumulator expression(s), if specified, to each group. Consumes all input documents
and outputs one document per distinct group. The output documents only
contain the identifier field and, if specified, accumulated fields.
$indexStats Returns statistics regarding the use of each index for the collection.
$limit Passes the first n documents unmodified to the pipeline where n is the specified limit.
For each input document, outputs either one document (for the first n documents) or
zero documents (after the first n documents).
$lookup Performs a join to another collection in the same database to filter in documents from
the “joined” collection for processing. To each input document, the $lookup stage
adds a new array field whose elements are the matching documents from the “joined”
collection. The $lookup stage passes these reshaped documents to the next stage.



Pipeline stages (4)
Stage Description
$match Filters the document stream to allow only matching documents to pass
unmodified into the next pipeline stage. $match uses standard MongoDB queries.
For each input document, outputs either one document (a match) or zero
documents (no match).
$merge Writes the resulting documents of the aggregation pipeline to a collection. The
stage can incorporate (insert new documents, merge documents, replace
documents, keep existing documents, fail the operation, process documents with
a custom update pipeline) the results into an output collection. To use
the $merge stage, it must be the last stage in the pipeline.
$out Writes the resulting documents of the aggregation pipeline to a collection. To use
the $out stage, it must be the last stage in the pipeline.
$project Reshapes each document in the stream, such as by adding new fields or removing
existing fields. For each input document, outputs one document.



Pipeline stages (5)
Stage Description
$sample Randomly selects the specified number of documents from its input.
$set Adds new fields to documents. Similar to $project, $set reshapes each document in
the stream; specifically, by adding new fields to output documents that contain both
the existing fields from the input documents and the newly added fields. $set is an
alias for $addFields stage. If the name of the new field is the same as an existing field
name (including _id), $set overwrites the existing value of that field with the value of
the specified expression.
$skip Skips the first n documents where n is the specified skip number and passes the
remaining documents unmodified to the pipeline. For each input document, outputs
either zero documents (for the first n documents) or one document (if after the
first n documents).
$sort Reorders the document stream by a specified sort key. Only the order changes; the
documents remain unmodified. For each input document, outputs one document.



Pipeline stages (6)

Stage Description
$sortByCount Groups incoming documents based on the value of a specified expression, then computes the
count of documents in each distinct group.
$unset Removes/excludes fields from documents.
$unwind Deconstructs an array field from the input documents to output a document for each element.
Each output document replaces the array with an element value. For each input document,
outputs n documents where n is the number of array elements and can be zero for an empty
array.



MongoDB

Data aggregation examples


Data Model
•Given the following collection of books
o n: number of pages
o price.v: price value
o price.c: price currency
{
"title":"MongoDb Guide2",
"tag":["mongodb","guide","database"],
"n":200,
"review_score": 2.2,
"price":[ {"v": 22.22, "c": "€", "country": "IT"},
{"v": 22.00, "c": "£", "country": "UK"}
],
"author": {
"_id": 1,
"name":"Mario",
"surname": "Rossi"}
}
{_id:ObjectId("5fb29b175b99900c3fa24293"),
title:"Developing with Python",
tag:["python","guide","programming"],
n:352,
review_score:4.6,
price:[{v: 24.99, c: "€", country: "IT"},
{v: 19.49, c: "£", country: "UK"} ],
author: {_id: 2,
name:"John",
surname: "Black"}
}, …



Example 1
•For each country, select the average price and the average review_score.
•The review score should be rounded down.
•Show the first 20 results with a total number of books higher than 50.



$unwind

db.book.aggregate( [
// Build a document for each entry of the price array
{ $unwind: "$price" }
])
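The effect of $unwind can be sketched in plain JavaScript as a flatMap over the array field. This is an in-memory analogue under the assumption that the field is an array; the actual stage has additional rules for missing and non-array values. The function name unwind and the "No prices" document are hypothetical.

```javascript
// A two-price book shaped like the documents of the data model.
const book = {
  title: "MongoDb guide",
  n: 100,
  price: [
    { v: 19.99, c: "€", country: "IT" },
    { v: 18, c: "£", country: "UK" },
  ],
};

// One output document per array element, with the array replaced by that
// element. An empty array produces no output documents.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (Array.isArray(doc[field]) ? doc[field] : []).map((element) => ({
      ...doc,
      [field]: element,
    }))
  );
}

const unwound = unwind([book, { title: "No prices", price: [] }], "price");
console.log(unwound);
```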



Result - $unwind
{ "_id" : ObjectId("5fb29ae15b99900c3fa24292"), "title" : "MongoDb guide", "tag" : [ "mongodb", "guide",
"database" ], "n" : 100, "review_score" : 4.3, "price" : { "v" : 19.99, "c" : "€", "country" : "IT" }, "author" : { "_id" : 1,
"name" : "Mario", "surname" : "Rossi" } }

{ "_id" : ObjectId("5fb29ae15b99900c3fa24292"), "title" : "MongoDb guide", "tag" : [ "mongodb", "guide",
"database" ], "n" : 100, "review_score" : 4.3, "price" : { "v" : 18, "c" : "£", "country" : "UK" }, "author" : { "_id" : 1,
"name" : "Mario", "surname" : "Rossi" } }

{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : "Developing with Python", "tag" : [ "python", "guide",
"programming" ], "n" : 352, "review_score" : 4.6, "price" : { "v" : 24.99, "c" : "€", "country" : "IT" }, "author" : {
"_id" : 2, "name" : "John", "surname" : "Black" } }

{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : "Developing with Python", "tag" : [ "python", "guide",
"programming" ], "n" : 352, "review_score" : 4.6, "price" : { "v" : 19.49, "c" : "£", "country" : "UK" }, "author" : {
"_id" : 2, "name" : "John", "surname" : "Black" } }



$group

db.book.aggregate( [
{ $unwind: "$price" } ,
// dot notation to access the value of the embedded document fields
{ $group: { _id: "$price.country",
avg_price: { $avg: "$price.v" },
// count the number of books (number of documents)
bookcount: {$sum:1},
review: {$avg: "$review_score"}
}
}
])



Result - $group

{ "_id" : "UK", "avg_price" : 18.75, "bookcount": 150, "review": 4.3}


{ "_id" : "IT", "avg_price" : 22.49, "bookcount": 132, "review": 3.9}
{ "_id" : "US", "avg_price" : 22.49, "bookcount": 49, "review": 4.2}
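As a rough in-memory analogue of the $group stage, the sketch below groups the four unwound documents shown in the $unwind result by country. With only these four documents the counts are 2 per country; the figures on the slide above come from the complete collection. The function name groupByCountry is hypothetical.

```javascript
// The four unwound documents, reduced to the fields that $group reads.
const unwound = [
  { review_score: 4.3, price: { v: 19.99, country: "IT" } },
  { review_score: 4.3, price: { v: 18, country: "UK" } },
  { review_score: 4.6, price: { v: 24.99, country: "IT" } },
  { review_score: 4.6, price: { v: 19.49, country: "UK" } },
];

// _id is the country, $sum: 1 counts documents, $avg is sum divided by count.
function groupByCountry(docs) {
  const acc = new Map();
  for (const d of docs) {
    const g = acc.get(d.price.country) || { priceSum: 0, reviewSum: 0, count: 0 };
    g.priceSum += d.price.v;
    g.reviewSum += d.review_score;
    g.count += 1;
    acc.set(d.price.country, g);
  }
  return [...acc].map(([country, g]) => ({
    _id: country,
    avg_price: g.priceSum / g.count,
    bookcount: g.count,
    review: g.reviewSum / g.count,
  }));
}

const groups = groupByCountry(unwound);
console.log(groups);
```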



$match
db.book.aggregate( [
{ $unwind: '$price' } ,
{ $group: { _id: '$price.country',
avg_price: { $avg: '$price.v' },
bookcount: {$sum:1},
review: {$avg: '$review_score'}
}
},
// Filter the documents where bookcount is greater than 50
{$match: { bookcount: { $gt: 50 } } }
])



Result - $match

{ "_id" : "UK", "avg_price" : 18.75, "bookcount": 150, "review": 4.3}


{ "_id" : "IT", "avg_price" : 22.49, "bookcount": 132, "review": 3.9}



$project

db.book.aggregate( [
{ $unwind: '$price' } ,
{ $group: { _id: '$price.country',
avg_price: { $avg: '$price.v' },
bookcount: {$sum:1},
review: {$avg: '$review_score'}
}
},
{$match: { bookcount: { $gt: 50 } } },
// round down the review score
{$project: {avg_price: 1, review: { $floor: '$review' }}}
])



Result - $project

{ "_id" : "UK", "avg_price" : 18.75, "review": 4}


{ "_id" : "IT", "avg_price" : 22.49, "review" : 3}



$limit
db.book.aggregate( [
{ $unwind: '$price' } ,
{ $group: { _id: '$price.country',
avg_price: { $avg: '$price.v' },
bookcount: {$sum:1},
review: {$avg: '$review_score'}
}
},
{$match: { bookcount: { $gt: 50 } } },
{$project: {avg_price: 1, review: { $floor: '$review' }}},
// Limit the results to the first 20 documents
{$limit:20}
])
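The last three stages can be checked by hand against the $group result shown earlier. The sketch below applies in-memory equivalents of $match, $project, and $limit to those three grouped documents; it is an illustration in plain JavaScript, not how the server evaluates the pipeline.

```javascript
// Output of the $group stage as shown on the "Result - $group" slide.
const grouped = [
  { _id: "UK", avg_price: 18.75, bookcount: 150, review: 4.3 },
  { _id: "IT", avg_price: 22.49, bookcount: 132, review: 3.9 },
  { _id: "US", avg_price: 22.49, bookcount: 49, review: 4.2 },
];

const finalDocs = grouped
  // $match: keep the groups with more than 50 books, as the exercise asks
  .filter((g) => g.bookcount > 50)
  // $project: keep _id and avg_price, round the review score down
  .map((g) => ({ _id: g._id, avg_price: g.avg_price, review: Math.floor(g.review) }))
  // $limit: at most the first 20 documents
  .slice(0, 20);

console.log(finalDocs);
```

The US group is dropped by the filter, and bookcount disappears because the projection does not include it.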



Example 2
•Compute the 95th percentile of the number of pages,
•only for the books that contain the tag “guide”.



$match

db.book.aggregate( [
// select documents containing "guide" in the tag array;
// compare with tag: ["guide"], which matches only arrays equal to ["guide"]
{$match: { tag : "guide"} }
])
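The difference between the two conditions can be sketched on an in-memory list, assuming flat arrays of strings. The sample books and the helper sameArray are hypothetical.

```javascript
// Hypothetical books with different tag arrays.
const books = [
  { title: "A", tag: ["python", "guide", "programming"] },
  { title: "B", tag: ["guide"] },
  { title: "C", tag: ["database"] },
];

// { tag: "guide" }: the array contains the element "guide".
const containsGuide = books.filter((b) => b.tag.includes("guide"));

// { tag: ["guide"] }: the array has exactly these elements, in this order.
const sameArray = (a, b) => a.length === b.length && a.every((x, i) => x === b[i]);
const exactlyGuide = books.filter((b) => sameArray(b.tag, ["guide"]));

console.log(containsGuide.map((b) => b.title), exactlyGuide.map((b) => b.title));
```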



Result - $match

{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : "Developing with Python", "tag" : [ "python",
"guide", "programming" ], "n" : 352, "review_score" : 4.6, "price" : [ { "v" : 24.99, "c" : "€", "country" : "IT" },
{ "v" : 19.49, "c" : "£", "country" : "UK" } ], "author" : { "_id" : 2, "name" : "John", "surname" : "Black" } }
{ "_id" : ObjectId("5fb29ae15b99900c3fa24292"), "title" : "MongoDb guide", "tag" : [ "mongodb", "guide",
"database" ], "n" : 100, "review_score" : 4.3, "price" : [ { "v" : 19.99, "c" : "€", "country" : "IT" }, { "v" : 18, "c" :
"£", "country" : "UK" } ], "author" : { "_id" : 1, "name" : "Mario", "surname" : "Rossi" } }



$sort

db.book.aggregate( [
{$match: { tag : "guide"} },
// sort the documents in ascending order according to the value of the n field,
// which stores the number of pages of each book
{$sort : { n: 1} }
])



Result - $sort

{ "_id" : ObjectId("5fb29ae15b99900c3fa24292"), "title" : "MongoDb guide", "tag" : [ "mongodb", "guide",
"database" ], "n" : 100, "review_score" : 4.3, "price" : [ { "v" : 19.99, "c" : "€", "country" : "IT" }, { "v" : 18, "c" :
"£", "country" : "UK" } ], "author" : { "_id" : 1, "name" : "Mario", "surname" : "Rossi" } }
{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : "Developing with Python", "tag" : [ "python",
"guide", "programming" ], "n" : 352, "review_score" : 4.6, "price" : [ { "v" : 24.99, "c" : "€", "country" : "IT" },
{ "v" : 19.49, "c" : "£", "country" : "UK" } ], "author" : { "_id" : 2, "name" : "John", "surname" : "Black" } }



$group + $push

db.book.aggregate( [
{$match: { tag : "guide"} },
{$sort : { n: 1} },
// group all the records together inside a single document (_id: null),
// which contains an array with all the values of n of all the records
{$group: {_id:null, value: {$push: "$n"}}}
])



Result - $group + $push

{ "_id": null, "value": [100, 352, …]}



$project + $arrayElemAt

db.book.aggregate( [
{$match: { tag : "guide"} },
{$sort : { n: 1} },
{$group: {_id:null, value: {$push: "$n"}}},
{$project:
// get the value of the array at a given index
// with { $arrayElemAt: [ <array>, <idx> ] }
{"n95p": {$arrayElemAt:
["$value",
// compute the index at 95% of the array length
{$floor: {$multiply: [0.95, {$size: "$value"}]}}
]
}}
}
])



Result - $project + $arrayElemAt

{ "_id" : null, "n95p" : 420 }
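The index arithmetic of the $project stage can be sketched in plain JavaScript. The page counts below are hypothetical values chosen for illustration (the slides only show the first two elements of the array), and percentileByIndex is a hypothetical helper name.

```javascript
// Hypothetical page counts, already sorted ascending as after { $sort: { n: 1 } }.
const value = [100, 120, 150, 180, 200, 220, 250, 300, 352, 420];

// Same arithmetic as the $project stage: index = floor(p * size).
function percentileByIndex(sorted, p) {
  const idx = Math.floor(p * sorted.length);
  return sorted[idx];
}

const n95p = percentileByIndex(value, 0.95);
console.log(n95p);
```

With 10 elements the index is floor(9.5) = 9, i.e. the last element. Since p is lower than 1, the computed index never exceeds the last position of a non-empty array.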



Example 3
•Compute the median of the review_score,
•only for the books having at least one price whose value is higher than 20.0.



Solution
db.book.aggregate( [
{$match: {'price.v' : { $gt: 20 }} },
{$sort : {review_score: 1} },
{$group: {_id:null, rsList: {$push: '$review_score'}}},
{$project:
{'median': {$arrayElemAt:
['$rsList',
{$floor: {$multiply: [0.5, {$size: '$rsList'}]}}
]
}}
}
])
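Note that the index rule floor(0.5 * size) returns the upper of the two middle values when the list has an even number of elements. The sketch below compares it with a common alternative, not used in the slides, that averages the two middle values; the scores and the function name median are hypothetical.

```javascript
// Hypothetical review scores, sorted ascending, with an even count.
const rsList = [2.2, 3.9, 4.3, 4.6];

// Index rule used in the solution: floor(0.5 * size).
const upperMiddle = rsList[Math.floor(0.5 * rsList.length)];

// Alternative definition: average the two middle values for even sizes.
function median(sorted) {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(upperMiddle, median(rsList));
```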



MongoDB

Indexing
Indexes
•Without indexes, MongoDB must perform a collection scan, i.e. scan
every document in a collection, to select those documents that match the
query statement.
•Indexes are data structures that store a small portion of the collection’s
data set in a form easy to traverse.
•They store ordered values of a specific field, or set of fields, in order to
efficiently support
o equality matches,
o range-based queries and
o sorting operations.
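The difference between a collection scan and an index lookup can be sketched with a sorted list of (key, position) entries searched by binary search. This is a toy stand-in for the server's index structure; the orders data and the names scan, lowerBound, and indexRange are assumptions made for illustration.

```javascript
// Hypothetical collection; the array position stands in for a record location.
const orders = [
  { _id: "a", total: 40 },
  { _id: "b", total: 15 },
  { _id: "c", total: 75 },
  { _id: "d", total: 15 },
  { _id: "e", total: 60 },
];

// Collection scan: every document is examined.
const scan = (min, max) => orders.filter((o) => o.total >= min && o.total <= max);

// Toy index on "total": entries kept sorted by key.
const index = orders
  .map((o, position) => ({ key: o.total, position }))
  .sort((x, y) => x.key - y.key);

// Binary search for the first entry whose key is >= target.
function lowerBound(entries, target) {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (entries[mid].key < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Range query: jump to the first match, then walk the entries in key order.
function indexRange(min, max) {
  const out = [];
  for (let i = lowerBound(index, min); i < index.length && index[i].key <= max; i++) {
    out.push(orders[index[i].position]);
  }
  return out;
}

console.log(indexRange(15, 40).map((o) => o._id));
```

The index version touches only the matching entries and returns them already sorted by key, which is why the same structure also supports sorting operations.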





Indexes
•MongoDB creates a unique index on the _id field during the creation of a
collection.
•The _id index prevents clients from inserting two documents with the
same value for the _id field.
•You cannot drop this index on the _id field.



Create new indexes
•Creating an index

db.collection.createIndex(<index keys>, <options>)

o Before v. 3.0 use db.collection.ensureIndex()

•Options include:
o name - a mnemonic name given by the user; an index cannot be renamed once
created: to change its name, drop the index and re-create it with the new name
o unique - whether to reject the insertion of documents with duplicate keys
o background, dropDups, …



Indexes
•MongoDB provides different types of indexes
o Single field indexes
o Compound field indexes
o Multikey indexes (to index the content stored in arrays, MongoDB creates separate
index entries for every element of the array)
o Geospatial indexes (2d indexes with planar and 2dsphere with spherical geometry)
o Text indexes (searching for string content in a collection; they do not store
language-specific stop words, e.g., "the", "a", "or", and stem the words in a collection
to only store root words)
o Hashed indexes (indexes the hash of the value of a field, they have a more random
distribution of values along their range, but only support equality matches and
cannot support range-based queries)



Indexes
•Single field indexes
o Support user-defined ascending/descending indexes on a single field of a document

•E.g.,
o db.orders.createIndex( {orderDate: 1} )

•Compound field indexes


o Support user-defined indexes on a set of fields

•E.g.,
o db.orders.createIndex( {orderDate: 1, zipcode: -1} )
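The order of the entries in a compound index follows the key pattern: first field first, each in its own direction. A minimal sketch of the ordering implied by { orderDate: 1, zipcode: -1 }, with hypothetical orders and a hypothetical helper compareByKeyPattern:

```javascript
// Hypothetical orders; dates are ISO strings so they compare correctly as text.
const orders = [
  { orderDate: "2024-01-02", zipcode: "10121" },
  { orderDate: "2024-01-01", zipcode: "10121" },
  { orderDate: "2024-01-01", zipcode: "10129" },
];

// Build a comparator from a key pattern such as { orderDate: 1, zipcode: -1 }.
function compareByKeyPattern(pattern) {
  const fields = Object.entries(pattern);
  return (a, b) => {
    for (const [field, direction] of fields) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
    }
    return 0; // equal on every indexed field
  };
}

const entries = [...orders].sort(compareByKeyPattern({ orderDate: 1, zipcode: -1 }));
console.log(entries);
```

Entries are ascending by orderDate and, within the same date, descending by zipcode.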



Indexes
•MongoDB supports efficient queries of geospatial data
•Geospatial data are stored as:
o GeoJSON objects: embedded document { <type>, <coordinate> }

▪ E.g., location: {type: "Point", coordinates: [-73.856, 40.848]}


o Legacy coordinate pairs: array or embedded document

▪ point: [-73.856, 40.848]

•Fields with 2dsphere indexes must hold geometry data in the form of
coordinate pairs or GeoJSON data.
o If you attempt to insert a document with non-geometry data in a 2dsphere indexed field, or build a 2dsphere
index on a collection where the indexed field has non-geometry data, the operation will fail.



Indexes
•Geospatial indexes
o Two types of geospatial indexes are provided: 2d and 2dsphere

•A 2dsphere index supports queries that calculate geometries on an earth-like sphere
•Use a 2d index for data stored as points on a two-dimensional plane.
•E.g.,
o db.places.createIndex( {location: "2dsphere"} )

•Geospatial query operators


o $geoIntersects, $geoWithin, $near, $nearSphere



Indexes
•$near syntax:
{
<location field>: {
$near: {
$geometry: {
type: "Point" ,
coordinates: [ <longitude> , <latitude> ]
},
$maxDistance: <distance in meters>,
$minDistance: <distance in meters>
}
}
}



Indexes
•E.g.,
o db.places.createIndex( {location: "2dsphere"} )

•Geospatial query operators
o $geoIntersects, $geoWithin, $near, $nearSphere

•Geospatial aggregation stage
o $geoNear



Indexes
•E.g.,
o db.places.find({location:
{$near:
{$geometry: {
type: "Point",
coordinates: [ -73.96, 40.78 ] },
$maxDistance: 5000}
}})
o Find all the places within 5000 meters of the specified GeoJSON point, sorted from nearest
to furthest
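What such a query computes can be approximated in plain JavaScript with the haversine great-circle distance: filter by maximum distance, then sort from nearest to furthest. This is a sketch of the semantics only, with no index involved; the places, the mean Earth radius constant, and the names haversine and near are assumptions of the example.

```javascript
const EARTH_RADIUS_M = 6371008.8; // mean Earth radius; an assumption of this sketch

// Great-circle distance in meters between two [longitude, latitude] pairs.
function haversine([lon1, lat1], [lon2, lat2]) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Keep the places within maxDistance of the center, nearest first.
function near(places, center, maxDistance) {
  return places
    .map((p) => ({ ...p, distance: haversine(center, p.location.coordinates) }))
    .filter((p) => p.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance);
}

// Hypothetical places stored as GeoJSON points.
const places = [
  { name: "far", location: { type: "Point", coordinates: [-74.2, 40.9] } },
  { name: "mid", location: { type: "Point", coordinates: [-73.99, 40.75] } },
  { name: "near", location: { type: "Point", coordinates: [-73.965, 40.782] } },
];

const nearby = near(places, [-73.96, 40.78], 5000);
console.log(nearby.map((p) => p.name));
```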



Indexes
•Text indexes
o Support efficient searching for string content in a collection
o Text indexes store only root words: language-specific stop words are dropped and the remaining words are stemmed

•E.g.,
db.reviews.createIndex( {comment: "text"} )
o Wildcard ($**) allows MongoDB to index every field that contains string data
o E.g.,
db.reviews.createIndex( {"$**": "text"} )
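The idea of storing only root words can be sketched with a toy inverted index. The stop-word list, the suffix-stripping stem function, and the reviews below are all assumptions made for illustration; they are far cruder than the language-aware processing a real text index applies.

```javascript
// Tiny hypothetical stop-word list.
const STOP_WORDS = new Set(["the", "a", "an", "or", "and", "is", "this"]);

// Toy stemmer: strips a few English suffixes.
const stem = (word) => word.replace(/(ing|ed|s)$/, "");

// Lowercase, split on non-letters, drop stop words, reduce to roots.
const toRoots = (text) =>
  text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w && !STOP_WORDS.has(w))
    .map(stem);

// Inverted index: root word -> set of document ids containing it.
function buildTextIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const root of toRoots(doc[field])) {
      if (!index.has(root)) index.set(root, new Set());
      index.get(root).add(doc._id);
    }
  }
  return index;
}

// Hypothetical reviews.
const reviews = [
  { _id: 1, comment: "The guide explains aggregation pipelines" },
  { _id: 2, comment: "Explaining a pipeline or an index" },
];

const textIndex = buildTextIndex(reviews, "comment");
const search = (term) => [...(textIndex.get(stem(term.toLowerCase())) || [])];

console.log(search("explained"), search("the"));
```

Searching for "explained" finds both reviews because all three word forms reduce to the same root, while "the" finds nothing because stop words are never stored.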



VIEWS
•A queryable object whose contents are defined by an aggregation pipeline on other collections or views.
•MongoDB does not persist the view contents to disk. A view’s content is computed on-demand.
•Starting in version 4.2, MongoDB adds the $merge stage for the aggregation pipeline to create on-demand
materialized views, where the content of the output collection can be updated each time the pipeline is run.
•Views are read-only and are created from existing collections or other views. E.g., a view that:
o excludes private or confidential data from a collection of employee data
o adds computed fields from a collection of metrics
o joins data from two different related collections

db.runCommand( {
create: <view>, viewOn: <source>, pipeline: <pipeline>, collation: <collation> } )

•Restrictions
o the name of a view is immutable: a view cannot be renamed
o you can modify a view either by dropping and recreating the view or by using the collMod command


MongoDB Compass

GUI for MongoDB


MongoDB Compass

•Visually explore data.


•Available on Linux, Mac, or Windows.
•MongoDB Compass analyzes documents and displays rich
structures within collections.
•Visualize, understand, and work with your geospatial data.



MongoDB Compass

•Connect to local or remote instances of MongoDB.


MongoDB Compass

•Get an overview of the data in list or table format.


MongoDB Compass

•Analyze the documents and their fields.


•Native support for geospatial coordinates.
MongoDB Compass

•Visually build the query conditioning on analyzed fields.



MongoDB Compass

•Autocomplete enabled by default.

•Construct the query step by step.



MongoDB Compass

•Analyze query performance and get hints to speed it up.



MongoDB Compass

•Specify constraints to validate data.
•Find inconsistent documents.
MongoDB Compass: Aggregation

•Build a pipeline consisting of multiple aggregation stages.
•Define the filter and aggregation attributes for each operator.





MongoDB Compass: Aggregation stages

The _id corresponds to the GROUP BY parameter in SQL.

Other fields contain the attributes required for each group.

One group for each "vendor".



MongoDB Compass: Pipelines

1st stage: grouping by vendor

2nd stage: condition over fields created in the previous stage (avg_fuel, total).

