MongoDB-for-Data-Science-seminar
MongoDB-for-Data-Science-seminar
Introduction to MongoDB
DANIELE APILETTI
POLITECNICO DI TORINO
Introduction
•The leader in the NoSQL Document-based databases
•Full of features, beyond NoSQL:
o High performance
o High availability
o Native scalability
o High flexibility
o Open source
Table Collection
Record Document
Column Field
•E.g.
ouse deliverydb
•Delete/Drop a database
o Select the database by using “use <database name>”
o Execute the command
db.dropDatabase()
E.g.,
use deliverydb;
db.dropDatabase();
o The collection is associated with the current database. Always select the database
before creating a collection.
o Options related to the collection size and indexing, e.g., to create a capped
collection, or to create a new collection that uses document validation
•E.g.,
o db.createCollection(“authors”, {capped: true});
•Drop collections
db.<collection_name>.drop()
•E.g.
o db.authors.drop()
•Create
•Read
•Update
•Delete
•E.g.,
db.people.insertOne( {
user_id: "abc123",
age: 55,
status: "A"
} );
•E.g.,
db.people.insertOne( {
user_id: "abc123",
Field age: 55,
name status: "A"
} );
•E.g.
db.people.insertOne( {
user_id: "abc123",
age: 55, Field value
status: "A"
} );
db.products.insertMany( [
{ user_id: "abc123", age: 30, status: "A"},
{ user_id: "abc456", age: 40, status: "A"},
{ user_id: "abc789", age: 50, status: "B"}
] );
•E.g.,
db.people.insertMany([
{user_id: "abc123", age: 55, status: “A”},
{user_id: "abc124", age: 45, favorite_colors: ["blue", "green"]}
] );
SELECT * db.people.find()
FROM people
Where Condition
SELECT id, db.people.find(
user_id, { },
status { user_id: 1,
FROM people status: 1
}
)
Select fields
SELECT * db.people.find(
FROM people { status: "A" }
WHERE status = "A" )
Where Condition
Where Condition
SELECT user_id, status db.people.find(
FROM people { status: "A" },
WHERE status = "A" { user_id: 1,
status: 1,
_id: 0
}
)
Selection fields
By default, the _id field is always returned.
To remove it, you must explicitly indicate _id: 0
db.people.find(
{"address.city":“Rome" }
)
{ _id: "A",
address: {
street: “Via Torino”,
nested document
number: “123/B”,
city: “Rome”,
code: “00184”
}
}
https://docs.mongodb.com/manual/reference/operator/aggregation/lookup
https://docs.mongodb.com/manual/reference/database-references/
$ne Matches all values that are not equal to a specified value,
including documents that do not contain the field.
$nin Matches none of the values specified in an array
SELECT * db.people.find(
FROM people { age: { $gt: 25 } }
WHERE age > 25 )
SELECT * db.people.find(
FROM people { age: { $gte: 25 } }
WHERE age >= 25 )
SELECT * db.people.find(
FROM people { age: { $lt: 25 } }
WHERE age < 25 )
SELECT * db.people.find(
FROM people { age: { $lte: 25 } }
WHERE age <= 25 )
SELECT * db.people.find(
FROM people { age: { $eq: 25 } }
WHERE age = 25 )
SELECT * db.people.find(
FROM people { age: { $ne: 25 } }
WHERE age != 25 )
SELECT * db.people.find(
FROM people { status: "A",
WHERE status = "A" age: 50 }
AND age = 50 )
SELECT * db.people.find(
FROM people { $or:
WHERE status = "A" [ { status: "A" } ,
OR age = 50 { age: 50 }
]
}
)
• Comparison
db.people. find({ age: {$gt: 32 }) // or equivalently with $gte, $lt, $lte,
db.people.find({ age: {$in: [32, 40] }) // returns all documents having age either 32 or 40
db.people.find({ age: { $gt: 25, $lte: 50 } }) //returns all documents having age > 25 and age <= 50
•Logical
db.people.find({ name: {$not: {$eq: ‘‘Max’’ } } })
This query returns documents (items) that satisfy both these conditions:
1. Quantity sold either less than 15 or greater than 50
2. Either the item is on sale (field “sale”: true) or its price is less than 5
SELECT * db.people.find(
FROM people { status: "A" }
WHERE status = "A" ).sort( { age: 1 } )
ORDER BY age ASC
•Returns all documents having status=“A”. The result is sorted in ascending age order
SELECT * db.people.find(
FROM people { status: "A" }
WHERE status = "A" ).sort( { age: 1 } )
ORDER BY age ASC
SELECT * db.people.find(
FROM people { status: "A" }
WHERE status = "A" ).sort( { age: -1 } )
ORDER BY age DESC
•Returns all documents having status=“A”. The result is sorted in ascending age order
•Returns all documents having status = “A”. The result is sorted in descending age order
db.people.find({status: "A“}).forEach(
function(myDoc){
print( "user:”+myDoc.name );
})
o <update> = specifies which fields must be updated and their new values
64AND VISUALIZATION
DATA MANAGEMENT
Updating data
MySQL clause MongoDB operator
UPDATE <table> db.<table>.updateMany(
SET <statement> { <condition> },
WHERE <condition> { $set: {<statement>}}
)
•Pipeline expressions can only operate on the current document in the pipeline and
cannot refer to data from other documents: expression operations provide in-memory
transformation of documents (max 100 Mb of RAM per stage).
•Generally, expressions are stateless and are only evaluated when seen by the
aggregation process with one exception: accumulator expressions used in the $group
stage (e.g. totals, maximums, minimums, and related data).
•The aggregation pipeline provides an alternative to map-reduce and may be the
preferred solution for aggregation tasks since MongoDB introduced the $accumulator
and $function aggregation operators starting in version 4.4
•The returned value is associated with a field called “mytotal” and a field “mycount”
DATA MANAGEMENT AND VISUALIZATION 70
Aggregation example (2)
db.people.aggregate( [
{ $group: { _id: null,
myaverage: { $avg: "$age" },
mytotal: { $sum: "$age" }
}
}
] )
o Considers all documents of people and computes
▪ sum of age
▪ average of age
SELECT status,
AVG(age) AS total
FROM people
GROUP BY status
db.orders.aggregate( [
{
$group: {
_id: "$status",
total: { $avg: "$age" }
}
}
] )
SELECT status,
SUM(age) AS total
FROM people
GROUP BY status
db.orders.aggregate( [
{
$group: {
_id: "$status", Group field
total: { $sum: "$age" }
}
}
] )
SELECT status,
SUM(age) AS total
FROM people
GROUP BY status
db.orders.aggregate( [
{
$group: {
_id: "$status", Group field
total: { $sum: "$age" }
}
} Aggregation function
] )
SELECT status,
SUM(age) AS total
FROM people
GROUP BY status
HAVING total > 1000
db.orders.aggregate( [
{
$group: {
_id: "$status",
total: { $sum: "$age" }
}
},
{ $match: { total: { $gt: 1000 } } }
] )
SELECT status,
SUM(age) AS total
FROM people
GROUP BY status
HAVING total > 1000
db.orders.aggregate( [
{ Group stage: Specify
$group: {
the aggregation field
_id: "$status",
total: { $sum: "$age" } and the aggregation
} function
},
{ $match: { total: { $gt: 1000 } } }
] )
SELECT status,
SUM(age) AS total
FROM people
GROUP BY status
HAVING total > 1000
db.orders.aggregate( [
{ Group stage: Specify
$group: {
the aggregation field
_id: "$status",
total: { $sum: "$age" } and the aggregation
} function
},
{ $match: { total: { $gt: 1000 } } } Match Stage: specify
] ) the condition as in
HAVING
Stage Description
$facet Processes multiple aggregation pipelines within a single stage on the same set of
input documents. Enables the creation of multi-faceted aggregations capable of
characterizing data across multiple dimensions. Input documents are passed to the
$facet stage only once, without needing multiple retrieval.
$geoNear Returns an ordered stream of documents based on the proximity to a geospatial
point. The output documents include an additional distance field. It must in the first
stage only.
$graphLookup Performs a recursive search on a collection. To each output document, adds a new
array field that contains the traversal results of the recursive search for that
document.
Stage Description
$sortByCount Groups incoming documents based on the value of a specified expression, then computes the
count of documents in each distinct group.
$unset Removes/excludes fields from documents.
$unwind Deconstructs an array field from the input documents to output a document for each element.
Each output document replaces the array with an element value. For each input document,
outputs n documents where n is the number of array elements and can be zero for an empty
array.
{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : " Developing with Python ", "tag" : [ "python", "guide",
"programming" ], "n" : 352, "review_score" : 4.6, "price" : { "v" : 24.99, "c" : " € ", "country" : "IT" }, "author" : {
"_id" : 2, "name" : "John", "surname" : "Black" } }
{ "_id" : ObjectId("5fb29b175b99900c3fa24293"), "title" : " Developing with Python ", "tag" : [ "python", "guide",
"programming" ], "n" : 352, "review_score" : 4.6, "price" : { "v" : 19.49, "c" : "£", "country" : "UK" }, "author" : {
"_id" : 2, "name" : "John", "surname" : "Black" } }
…
db.book.aggregate( [
{ $unwind: ”$price” } ,
dot notation to access the
{ $group: { _id: ”$price.country”}, value of the embedded
avg_price: { $avg: ” $price.v” , document fields
bookcount: {$sum:1},
count the number
review: {$avg: ” $review_score”}
of books (number
} of documents)
}
])
db.book.aggregate( [
{ $unwind: '$price' } ,
{ $group: { _id: '$price.country',
avg_price: { $avg: '$price.v' },
bookcount: {$sum:1},
review: {$avg: '$review_score'}
}
},
{$match: { bookcount: { $gte: 50 } } },
round down the
{$project: {avg_price: 1, review: { $floor: '$review' }}},
review score
])
db.book.aggregate( [
{$match: { tag : "guide"} }, sort the documents in ascending order
{$sort : { n: 1} } according to the value of the n field, which
]) stores the number of pages of each book
db.book.aggregate( [
{$match: { tag : "guide"} },
{$sort : { n: 1} }, group all the records
{$group: {_id:null, value: {$push: "$n"}}} together inside a single
]) document (_id:null),
which contains an array
with all the values of n
of all the records
db.book.aggregate( [
{$match: { tag : "guide"} },
{$sort : { n: 1} },
{$group: {_id:null, value: {$push: "$n"}}},
{$project:
get the value of the array at a given index
{"n95p": {$arrayElemAt: with { $arrayElemAt: [ <array>, <idx> ] }
["$value",
{$floor: {$multiply: [0.95, {$size: "$value"}]}}
]
}} compute the index at 95% of the array length
}
])
Indexing
Indexes
•Without indexes, MongoDB must perform a collection scan, i.e. scan
every document in a collection, to select those documents that match the
query statement.
•Indexes are data structures that store a small portion of the collection’s
data set in a form easy to traverse.
•They store ordered values of a specific field, or set of fields, in order to
efficiently support
o equality matches,
o range-based queries and
o sorting operations.
•Options include:
o name - a mnemonic name given by the user, you cannot rename an index once
created, instead, you must drop and re-create the index with a new name
o unique - whether to accept or not insertion of documents with duplicate keys,
o background, dropDups, …
•E.g.,
o db.orders.createIndex( {orderDate: 1} )
•E.g.,
o db.orders.createIndex( {orderDate: 1, zipcode: -1} )
•Fields with 2dsphere indexes must hold geometry data in the form of
coordinate pairs or GeoJSON data.
o If you attempt to insert a document with non-geometry data in a 2dsphere indexed field, or build a 2dsphere
index on a collection where the indexed field has non-geometry data, the operation will fail.
{$near:
{$geometry: {
type: "Point",
coordinates: [ -73.96, 40.78 ] },
$maxDistance: 5000}
}})
o Find all the places within 5000 meters from the specified GeoJSON point, sorted in order from nearest
to furthest
•E.g.,
db.reviews.createIndex( {comment: “text”} )
o Wildcard ($**) allows MongoDB to index every field that contains string data
o E.g.,
db.reviews.createIndex( {“$**”: “text”} )
db.runCommand( {
create: <view>, viewOn: <source>, pipeline: <pipeline>, collation: <collation> } )
•Restrictions
o immutable Name
o you can modify a view either by dropping and recreating the view or using the collMod comman