CSC421 - Database Management II
Course Objective:
By the end of this course, students should be able to:
1. Design and implement advanced SQL queries and procedures.
2. Optimize database performance and manage transactions efficiently.
3. Understand and implement distributed database systems.
4. Explore and utilize NoSQL databases.
5. Analyze current trends and future directions in database technology.
• Foreign Key:
- A column that creates a relationship between two tables by referencing the
primary key of another table.
- Example: `CourseID` in the `Enrollments` table referencing `CourseID` in the
`Courses` table.
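A minimal sketch of how this relationship is declared in SQL (the surrounding columns and the existing `Courses(CourseID)` primary key are assumptions for illustration):
```sql
CREATE TABLE Enrollments (
    EnrollmentID INT PRIMARY KEY,
    StudentID    INT,
    CourseID     INT,
    -- CourseID must match an existing row in Courses
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);
```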
1.3 Normalization:
• Purpose:
- To eliminate redundancy and ensure data integrity.
- Involves decomposing tables to minimize duplication.
• Normal Forms:
o 1NF (First Normal Form): Eliminate repeating groups; ensure each column
contains atomic values.
o 2NF (Second Normal Form): Ensure all non-key attributes are fully
functionally dependent on the primary key.
o 3NF (Third Normal Form): Ensure no non-key attribute depends transitively on
the primary key; non-key attributes must depend on the key alone.
Overview:
Complex queries involve multiple layers of operations and often require the use of nested
subqueries to break down the problem into manageable parts. These queries can perform
calculations, filter data, and join tables in sophisticated ways.
Complex Queries:
• Definition:
Queries that use multiple operations (joins, unions, subqueries) to retrieve the
desired data.
• Key Components:
- Joins (INNER, LEFT, RIGHT, FULL)
- Subqueries (nested queries)
- Aggregate functions (SUM, AVG, COUNT, etc.)
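A short sketch that combines these components; the `departments` table and its columns are assumptions for illustration:
```sql
-- Departments whose average salary exceeds the company-wide average
SELECT d.department_name,
       AVG(e.salary) AS avg_dept_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name
HAVING AVG(e.salary) > (SELECT AVG(salary) FROM employees);
```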
Nested Subqueries:
• Definition:
A query embedded within another SQL query.
• Types of Subqueries:
o Single-row subqueries: Return one row. Example:
SELECT name FROM employees WHERE salary = (SELECT MAX(salary)
FROM employees);
• Overview:
Common Table Expressions (CTEs) provide a way to define temporary result sets
that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
• CTEs:
• Definition:
A CTE is a named temporary result set that you can reference within a SELECT,
INSERT, UPDATE, or DELETE statement.
• Syntax:
WITH cte_name AS (
SELECT column1, column2
FROM table
WHERE condition
)
SELECT column1, column2
FROM cte_name
WHERE condition;
Example:
Calculate the average salary by department and list the employees who earn
more than their department's average:
```sql
WITH AvgSalaries AS (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.name, e.salary, a.avg_salary
FROM employees e
JOIN AvgSalaries a ON e.department_id = a.department_id
WHERE e.salary > a.avg_salary;
```
• Recursive CTEs:
• Definition:
CTEs that reference themselves.
• Syntax:
WITH RECURSIVE cte_name AS (
    SELECT initial_query
    UNION ALL
    SELECT recursive_query
    FROM cte_name
    WHERE condition
)
SELECT * FROM cte_name;
Example:
Calculate a hierarchical structure such as an employee-manager
relationship:
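A minimal sketch, assuming an `employees` table with `employee_id`, `name`, and `manager_id` columns:
```sql
WITH RECURSIVE OrgChart AS (
    -- Anchor member: employees with no manager (top of the hierarchy)
    SELECT employee_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: employees whose manager is already in OrgChart
    SELECT e.employee_id, e.name, e.manager_id, o.level + 1
    FROM employees e
    JOIN OrgChart o ON e.manager_id = o.employee_id
)
SELECT employee_id, name, level
FROM OrgChart
ORDER BY level, name;
```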
• Overview:
Window functions perform calculations across a set of table rows related to the
current row, providing capabilities such as running totals, moving averages, and
ranking.
• Window Functions:
• Definition:
Functions that operate over a window (set of rows) defined by the OVER() clause.
• Types:
o ROW_NUMBER(): Assigns a unique sequential integer to rows within a
partition.
o RANK(): Assigns a rank to rows within a partition, with gaps for ties.
o DENSE_RANK(): Similar to RANK() but without gaps.
o NTILE(n): Divides rows into a specified number of roughly equal parts.
o LEAD() and LAG(): Access data from subsequent and preceding rows.
• Syntax:
SELECT column,
    ROW_NUMBER() OVER (PARTITION BY partition_column ORDER BY order_column) AS row_num,
    RANK() OVER (PARTITION BY partition_column ORDER BY order_column) AS rank,
    SUM(column) OVER (PARTITION BY partition_column ORDER BY order_column
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM table;
Example:
Rank employees within each department by their salary:
```sql
SELECT name, department_id, salary,
    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dept_rank
FROM employees;
```
• User-Defined Functions (UDFs):
• Definition:
A user-defined function is a routine that accepts parameters, performs an
operation, and returns either a single value or a table.
• Types:
o Scalar functions: Return a single value.
o Table-valued functions: Return a table.
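A minimal sketch of a scalar function, assuming PostgreSQL-style `CREATE FUNCTION` syntax (the exact syntax differs between DBMSs):
```sql
-- Scalar UDF: converts a monthly salary to an annual figure
CREATE FUNCTION annual_salary(monthly NUMERIC)
RETURNS NUMERIC AS $$
    SELECT monthly * 12;
$$ LANGUAGE SQL;

-- Usage
SELECT name, annual_salary(salary) FROM employees;
```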
• Stored Procedures:
• Definition:
A stored procedure is a set of SQL statements that can be executed as a program.
• Benefits:
- Encapsulation of business logic
- Improved performance through precompiled execution
- Enhanced security
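A minimal sketch, assuming MySQL syntax and a hypothetical `give_raise` procedure over the `employees` table:
```sql
DELIMITER //
CREATE PROCEDURE give_raise(IN emp_id INT, IN pct DECIMAL(5,2))
BEGIN
    -- Business logic is encapsulated and runs inside the database
    UPDATE employees
    SET salary = salary * (1 + pct / 100)
    WHERE employee_id = emp_id;
END //
DELIMITER ;

CALL give_raise(101, 5.00);
```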
Conclusion
These advanced SQL techniques allow for the creation of more efficient, readable, and
maintainable queries. By mastering complex queries, CTEs, Window Functions, UDFs, and
Stored Procedures, students can handle sophisticated database tasks and optimize
performance.
Lecture 3: Database Design and Normalization
1. Introduction to Normalization
Normalization is the process of organizing data in a database to reduce redundancy and
improve data integrity. The primary goal is to divide a database into two or more tables and
define relationships between the tables. The normal forms (1NF, 2NF, 3NF) ensure that the
database is free of certain types of redundancy and anomalies.
Example:
Consider a table `StudentCoursesBooks`:
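The table's contents are not reproduced in these notes; a hypothetical layout (all columns are assumptions) that shows the kind of redundancy normalization removes might be:
```sql
-- One row per (student, course, book) combination: student and course
-- details repeat on many rows, inviting update and deletion anomalies
CREATE TABLE StudentCoursesBooks (
    StudentID   INT,
    StudentName VARCHAR(100),
    CourseID    INT,
    CourseTitle VARCHAR(100),
    BookISBN    VARCHAR(20),
    BookTitle   VARCHAR(200)
);
```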
Example:
Consider a table `ProjectAssignments`:
Denormalization Strategies
1. Introduction to Denormalization
Denormalization is the process of combining tables to improve read performance at the
expense of write performance and data integrity. It is often used in OLAP systems where
read performance is critical.
2. Benefits of Denormalization
- Improved query performance by reducing the number of joins.
- Simplified query writing for complex reports.
- Faster read times for data retrieval.
3. Risks of Denormalization
- Increased redundancy leading to data anomalies.
- Higher storage requirements.
- More complex update operations, leading to potential data inconsistency.
Example:
Original normalized schema:
Denormalized schema:
- `Orders(OrderID, CustomerID, OrderDate, TotalAmount)`
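The normalized half of this example is not reproduced above; a minimal sketch of the idea, assuming an `OrderDetails` table like the one used later in these notes:
```sql
-- Normalized: the order total is derived at query time with a join
SELECT o.OrderID, o.OrderDate,
       SUM(od.Quantity * od.Price) AS TotalAmount
FROM Orders o
JOIN OrderDetails od ON o.OrderID = od.OrderID
GROUP BY o.OrderID, o.OrderDate;

-- Denormalized: TotalAmount is stored redundantly on Orders, so reads skip
-- the join but every change to an order must keep the total in sync
SELECT OrderID, OrderDate, TotalAmount
FROM Orders;
```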
Schema:
o `Customers(CustomerID, Name, Email)`
o `Orders(OrderID, CustomerID, OrderDate)`
o `Products(ProductID, Name, Price)`
o `OrderDetails(OrderDetailID, OrderID, ProductID, Quantity, Price)`
• Problem Statement:
Design a database schema for an online retail store that handles customers,
orders, products, and inventory.
• Requirements:
o Customers can place multiple orders.
o Orders contain multiple products.
o Inventory should track product stock levels.
• Solution:
1. Entities:
- Customers
- Orders
- Products
- OrderDetails
- Inventory
2. Relationships:
- One-to-Many between Customers and Orders.
- Many-to-Many between Orders and Products through OrderDetails.
- One-to-One between Products and Inventory.
3. Schema Design:
- `Customers(CustomerID, Name, Email, Address)`
- `Orders(OrderID, CustomerID, OrderDate, TotalAmount)`
- `Products(ProductID, Name, Description, Price)`
- `OrderDetails(OrderDetailID, OrderID, ProductID, Quantity, Price)`
- `Inventory(ProductID, StockLevel)`
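A partial sketch of this design as SQL DDL (data types are assumptions; `Products` and `Inventory` are omitted for brevity):
```sql
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL,
    Email      VARCHAR(100),
    Address    VARCHAR(255)
);

CREATE TABLE Orders (
    OrderID     INT PRIMARY KEY,
    CustomerID  INT NOT NULL,
    OrderDate   DATE,
    TotalAmount DECIMAL(10,2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID) -- one-to-many
);

CREATE TABLE OrderDetails (
    OrderDetailID INT PRIMARY KEY,
    OrderID       INT NOT NULL,
    ProductID     INT NOT NULL,
    Quantity      INT,
    Price         DECIMAL(10,2),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID) -- resolves the many-to-many
);
```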
• Problem Statement:
Design a database schema for a university that manages students, courses, and
enrollments.
• Requirements:
o Students can enroll in multiple courses.
o Courses can have multiple students.
o Each enrollment should record the grade received.
• Solution:
1. Entities:
- Students
- Courses
- Enrollments
2. Relationships:
- Many-to-Many between Students and Courses through Enrollments.
3. Schema Design:
o `Students(StudentID, Name, Major, Email)`
o `Courses(CourseID, Title, Department, Credits)`
o `Enrollments(EnrollmentID, StudentID, CourseID, Grade)`
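An illustrative query over this schema, resolving the many-to-many relationship through `Enrollments`:
```sql
-- Each student's courses and grades
SELECT s.Name, c.Title, e.Grade
FROM Students s
JOIN Enrollments e ON s.StudentID = e.StudentID
JOIN Courses c ON e.CourseID = c.CourseID
ORDER BY s.Name, c.Title;
```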
Conclusion
By applying advanced normalization techniques, appropriate denormalization strategies,
and following best practices in database schema design, one can create efficient,
scalable, and maintainable database systems. Real-world case studies illustrate the
practical application of these principles, ensuring that the database meets the functional
and performance requirements of the application.
1. Introduction to Indexing
• Definition:
An index is a database object that improves the speed of data retrieval operations
on a table.
• Analogy:
Think of the index in a book, which helps you quickly locate information.
2. Types of Indexes
o Primary Index: Automatically created with the primary key.
o Secondary Index: Created on non-primary key columns.
o Clustered Index: Alters the physical storage order of the table; one per table.
o Non-clustered Index: Does not alter the physical storage; multiple allowed per
table.
o Composite Index: An index on multiple columns.
3. Creating and Managing Indexes
• Syntax:
CREATE INDEX index_name ON table_name (column1, column2, ...);
• Guidelines:
o Index columns frequently used in WHERE clauses.
o Avoid indexing columns with a lot of NULL values.
o Limit the number of indexes on write-heavy tables.
5. Practical Example
Creating a composite index to optimize a query:
CREATE INDEX idx_customer_name ON Customers (LastName, FirstName);
• Joins:
- Use INNER JOIN for filtering and better performance.
- Ensure appropriate indexes on join columns.
• WHERE Clauses:
Filter early and reduce the result set as soon as possible.
• Example (Oracle-style optimizer hint):
SELECT /*+ INDEX(employees idx_emp_department) */ *
FROM employees WHERE department_id = 10;
Avoid applying functions to indexed columns in WHERE clauses, since this
prevents the index from being used.
Bad:
WHERE UPPER(first_name) = 'JOHN'
Good:
WHERE first_name = 'John'
5. Practical Example
• Optimizing a query with appropriate indexes and filtering:
• MySQL:
EXPLAIN SELECT * FROM employees WHERE department_id = 10;
Example:
EXPLAIN SELECT first_name, last_name FROM employees
WHERE department_id = 10;
5. Practical Example
Analyzing an execution plan for a complex query:
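A sketch of what such an analysis might look like (MySQL-style EXPLAIN; the `departments` table and its columns are assumptions):
```sql
EXPLAIN
SELECT d.department_name, COUNT(*) AS headcount
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.salary > 50000
GROUP BY d.department_name;
```
The plan output lists, for each table, the access type (index lookup versus full scan) and the estimated row count, which highlights missing indexes or unnecessary scans.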
1. Atomicity
• Definition:
Ensures that all operations within a transaction are completed; if any part of the
transaction fails, the entire transaction fails and the database remains unchanged.
• Example:
In a banking system, transferring money from Account A to Account B involves
debiting Account A and crediting Account B. If either operation fails, neither should
complete.
• Implementation:
Achieved using transaction logs to track operations, enabling rollback if necessary.
2. Consistency
• Definition:
Ensures that a transaction brings the database from one valid state to another,
maintaining database invariants.
• Example:
In an e-commerce system, an order should only be processed if there is sufficient
stock. If the stock is insufficient, the transaction fails, ensuring no negative stock
levels.
• Implementation:
Enforced through integrity constraints, triggers, and predefined rules.
3. Isolation
• Definition:
Ensures that concurrently executing transactions do not affect each other’s
execution. Each transaction is isolated from others until it completes.
• Example:
Two transactions simultaneously booking a flight seat should not interfere with
each other and should reflect consistent state after completion.
• Implementation:
Controlled by concurrency control mechanisms like locking and isolation levels.
4. Durability
• Definition:
Ensures that once a transaction has been committed, it remains so, even in the
event of a system failure.
• Example:
Once a bank transfer transaction is complete, the updated balances should be
permanently stored even if the system crashes immediately afterward.
• Implementation:
Achieved through transaction logs and stable storage to ensure data can be
recovered after a crash.
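The bank-transfer example above, written as a single transaction (table and column names are assumptions; MySQL uses `START TRANSACTION`, other systems use `BEGIN`):
```sql
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';

-- If either UPDATE fails, issue ROLLBACK instead: the transaction log lets
-- the DBMS undo both changes (atomicity). After COMMIT the new balances
-- survive a crash (durability).
COMMIT;
```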
Isolation Levels and Their Implications
Isolation levels control the visibility of changes made by one transaction to other
concurrently executing transactions. They balance between consistency and concurrency.
1. Read Uncommitted
• Definition:
Transactions can see uncommitted changes made by other transactions.
• Implications:
o Advantages: High concurrency, low latency.
o Disadvantages: Can result in dirty reads (reading uncommitted data), leading
to inconsistencies.
• Use Case:
Scenarios where high performance is crucial and minor inconsistencies are
acceptable.
2. Read Committed
• Definition:
Transactions can only see committed changes made by other transactions.
• Implications:
o Advantages: Prevents dirty reads.
o Disadvantages: Can still encounter non-repeatable reads (data can change
if read again in the same transaction).
3. Repeatable Read
• Definition: Ensures that if a transaction reads a value, subsequent reads within the
same transaction will return the same value.
• Implications:
o Advantages: Prevents dirty reads and non-repeatable reads.
o Disadvantages: Cannot prevent phantom reads (new data added by others
can be seen).
• Use Case: Financial applications where consistency is critical but phantom reads
are acceptable.
4. Serializable
• Definition:
The highest isolation level, ensuring transactions are executed in a completely
isolated manner as if they were serially executed.
• Implications:
o Advantages: Prevents dirty reads, non-repeatable reads, and phantom
reads.
o Disadvantages: Low concurrency, high overhead.
• Use Case: Scenarios requiring strict consistency, such as bank transfers and
critical data integrity operations.
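A short sketch of choosing an isolation level for the next transaction (MySQL-style syntax; supported levels and defaults vary by DBMS):
```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;

SELECT balance FROM accounts WHERE account_id = 'A';
-- ... further reads and writes that must behave as if run serially ...

COMMIT;
```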
Locking mechanisms are essential for maintaining data integrity and consistency in
concurrent transaction environments. However, improper use can lead to deadlocks,
where transactions wait indefinitely for resources.
1. Lock Types
• Shared Locks (S-lock): Allows multiple transactions to read a resource but not
modify it.
• Exclusive Locks (X-lock): Allows only one transaction to modify a resource,
preventing others from reading or writing.
2. Lock Granularity
• Row-level Locks: Fine-grained, allowing high concurrency but with higher overhead
due to managing many locks.
• Table-level Locks: Coarse-grained, easier to manage but lower concurrency.
• Page-level Locks: Intermediate granularity.
3. Deadlock Resolution
• Detection: Regularly check for cycles in the wait-for graph to detect deadlocks.
• Wait-for Graph: Nodes represent transactions, and edges represent waiting
relationships.
• Prevention: Use strategies to prevent deadlocks before they occur.
• Timeout: Transactions wait for a specific period before being aborted.
• Wait-Die and Wound-Wait Schemes: Based on transaction timestamps to decide
which transaction should wait and which should be aborted.
• Avoidance: Allocate resources in a way that avoids deadlocks, such as acquiring all
locks at once.
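A sketch of an explicit row-level exclusive lock using `SELECT ... FOR UPDATE` (MySQL/PostgreSQL-style; table and column names are assumptions):
```sql
START TRANSACTION;

-- Takes an exclusive (X) lock on just this row; concurrent writers must
-- wait until the transaction commits or rolls back
SELECT balance FROM accounts WHERE account_id = 'A' FOR UPDATE;

UPDATE accounts SET balance = balance - 50 WHERE account_id = 'A';

COMMIT;
```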
MVCC is a concurrency control method that allows multiple versions of a data item to exist
simultaneously, improving performance and consistency in read-heavy environments.
1. How MVCC Works
• Versioning: Each data item has multiple versions, each with a unique timestamp.
• Read Operations: Always access the version of the data item that was valid at the
transaction’s start time, ensuring consistency without locking.
• Write Operations: Create a new version of the data item with a new timestamp,
leaving old versions intact for concurrent readers.
2. Advantages of MVCC
• Non-blocking Reads: Readers do not block writers and vice versa, enhancing
concurrency.
• Consistency: Ensures that transactions see a consistent snapshot of the database.
• Performance: Reduces contention among transactions, improving throughput.
3. Disadvantages of MVCC
• Storage Overhead: Requires additional storage for multiple versions of data items.
• Complexity: Increased complexity in managing and cleaning up old versions.
4. Use Cases
• High-concurrency Systems: Web applications, content management systems
where read operations are frequent.
• Database Systems: PostgreSQL, MySQL InnoDB, and Oracle implement MVCC for
enhanced performance and consistency.
Conclusion
Understanding and implementing transaction management and concurrency control
mechanisms like ACID properties, isolation levels, locking mechanisms, deadlock
resolution, and MVCC is critical for maintaining data integrity, consistency, and
performance in database systems. These principles help ensure that database
transactions are processed reliably in a multi-user environment.
ACID properties are critical in ensuring the reliability of transactions in a database. They
guarantee that database transactions are processed reliably and ensure data integrity,
even in the event of errors, power failures, or other problems.
ACID Properties
1. Atomicity
• Ensures that all parts of a transaction are completed successfully. If any part of the
transaction fails, the entire transaction is rolled back.
• Example: In a banking transaction, transferring money from Account A to Account B
involves two operations: debiting Account A and crediting Account B. Atomicity
ensures that either both operations are completed or neither is.
2. Consistency
• Ensures that a transaction brings the database from one valid state to another,
maintaining database invariants.
• Example: If a transaction involves debiting money from an account, consistency
ensures that the total balance remains accurate according to business rules.
3. Isolation
• Ensures that the operations of one transaction are isolated from the operations of
other transactions. This prevents transactions from interfering with each other.
• Example: Two transactions attempting to update the same account balance will be
isolated, ensuring that the balance calculations are accurate and not corrupted by
concurrent operations.
4. Durability
• Ensures that once a transaction has been committed, it will remain so, even in the
event of a system failure.
• Example: After a bank transfer is completed, the changes to the account balances
must be permanent and survive system crashes.
Implementation in Databases
• Logging: Ensures durability and atomicity by keeping a log of all transactions.
• Locking Mechanisms: Ensures isolation by locking data that is being accessed by a
transaction.
• Concurrency Control: Techniques like two-phase locking help manage the order
and isolation of transactions.
Backups are essential for data protection and disaster recovery. They enable the
restoration of data to a previous state after data loss or corruption.
Types of Backups
1. Full Backup
o A complete copy of the entire database.
o Pros: Simplifies restore process.
o Cons: Time-consuming and requires significant storage.
2. Incremental Backup
o Only the data that has changed since the last backup (of any type) is backed up.
o Pros: Requires less storage and is faster than full backups.
o Cons: Restoring can be complex, as it requires the last full backup and all
subsequent incremental backups.
3. Differential Backup
o Backs up data that has changed since the last full backup.
o Pros: Faster than full backups and easier to restore than incremental backups.
o Cons: Can become large and slow over time until the next full backup.
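A sketch of how these backup types map onto commands, assuming SQL Server's T-SQL (database name and file paths are placeholders):
```sql
-- Full backup: complete copy of the database
BACKUP DATABASE SalesDB TO DISK = 'D:\Backups\SalesDB_full.bak';

-- Differential backup: only changes since the last full backup
BACKUP DATABASE SalesDB TO DISK = 'D:\Backups\SalesDB_diff.bak'
    WITH DIFFERENTIAL;

-- Transaction log backup: captures changes since the last log backup and
-- enables point-in-time recovery
BACKUP LOG SalesDB TO DISK = 'D:\Backups\SalesDB_log.trn';
```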
Backup Methods
1. Onsite Backup
o Backups are stored on local storage devices.
o Pros: Fast and accessible.
o Cons: Vulnerable to local disasters.
2. Offsite Backup
o Backups are stored at a different physical location.
o Pros: Protects against local disasters.
o Cons: May be slower to access.
3. Cloud Backup
o Data is backed up to cloud storage services.
o Pros: Scalable and accessible from anywhere.
o Cons: Dependent on internet connectivity and can be costly.
Best Practices
o Regularly schedule backups.
o Verify backup integrity.
o Automate backup processes.
o Use a combination of full, incremental, and differential backups.
Recovery Models and Strategies
Recovery models and strategies are essential to restore database functionality and data
after failures or corruption.
Recovery Models
1. Simple Recovery Model
o Minimal logging, no log backups.
o Suitable for smaller databases where data loss can be tolerated.
o Pros: Simple to manage.
o Cons: Increased risk of data loss.
Recovery Strategies
1. Point-in-Time Recovery
o Restore the database to a specific point in time.
o Suitable for recovering from data corruption or user errors.
o Requires full and log backups.
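A sketch of a point-in-time restore, assuming SQL Server-style syntax (paths and the timestamp are placeholders):
```sql
-- Restore the last full backup, leaving the database ready for log restores
RESTORE DATABASE SalesDB FROM DISK = 'D:\Backups\SalesDB_full.bak'
    WITH NORECOVERY;

-- Roll the log forward, stopping just before the corruption or user error
RESTORE LOG SalesDB FROM DISK = 'D:\Backups\SalesDB_log.trn'
    WITH STOPAT = '2024-05-01T10:30:00', RECOVERY;
```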
1. Risk Assessment
Identify potential threats (natural disasters, hardware failures, cyber-attacks).
1. Regular Testing
o Schedule tests at least annually.
o Include various disaster scenarios.
2. Simulated Failures
Conduct simulations of different types of failures (hardware, software, natural disasters).
Examining real-world cases of data recovery provides valuable insights into the practical
application of recovery principles and strategies.
Lessons Learned
• Regular Backups: Ensure backups are frequent and stored securely.
• Offsite Storage: Protect backups from local disasters by using offsite or cloud
storage.
• Testing: Regularly test recovery plans to ensure they are effective and up-to-date.
• Security: Implement robust security measures to protect against attacks.
Conclusion
Understanding ACID properties, backup types, recovery models, and disaster recovery
planning is crucial for maintaining data integrity and availability in database management.
Real-world case studies highlight the importance of preparation, regular testing, and
continuous improvement in disaster recovery strategies.
Distributed databases are systems where data is stored across multiple physical
locations. These locations can be within the same physical site or dispersed over a
network of interconnected computers. Distributed databases aim to improve reliability,
availability, and performance by leveraging multiple nodes.
3. Scalability:
o Distributed databases should be able to scale horizontally by adding more nodes to
the network, thus handling increased loads and larger datasets.
4. Data Distribution:
o Data can be distributed using fragmentation (dividing data into smaller parts)
and replication (creating multiple copies of data).
5. Consistency and Concurrency:
o The system should ensure data consistency across all nodes. Concurrency
control mechanisms are needed to handle simultaneous data operations without
conflicts.
Data Fragmentation
1. Horizontal Fragmentation:
o Divides a table into subsets of rows. Each fragment contains a distinct subset of
rows.
o Example: A customer table split by region, with each region's customers stored in
different fragments.
2. Vertical Fragmentation:
o Divides a table into subsets of columns. Each fragment contains a distinct subset of
columns.
o Example: Separating personal information and order history of customers into
different fragments.
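A sketch of the horizontal fragmentation described above (table, column, and region names are assumptions):
```sql
-- Each fragment holds a distinct subset of the rows, split by region
CREATE TABLE Customers_East AS
    SELECT * FROM Customers WHERE Region = 'East';

CREATE TABLE Customers_West AS
    SELECT * FROM Customers WHERE Region = 'West';

-- The original table is reconstructed as the union of its fragments
SELECT * FROM Customers_East
UNION ALL
SELECT * FROM Customers_West;
```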
Data Replication
1. Full Replication:
o Entire database is replicated across multiple nodes.
o Increases availability and fault tolerance but may lead to higher consistency
management overhead.
2. Partial Replication:
o Only certain parts (fragments) of the database are replicated across nodes.
o Balances between availability and resource usage.
Data Allocation
1. Centralized Allocation:
- All data is stored at a single central site. Simpler but can become a bottleneck and
single point of failure.
2. Partitioned Allocation:
- Data is divided and stored across multiple sites without replication. Efficient use of
resources but lower fault tolerance.
3. Replicated Allocation:
- Data (or fragments) is replicated and distributed across multiple sites. Enhances
reliability and availability but requires sophisticated consistency management.
Distributed query processing involves executing queries that access data distributed
across multiple nodes. The goal is to optimize query execution by minimizing data transfer
and ensuring efficient resource utilization.
Key Concepts
1. Query Decomposition:
- Breaking down a high-level query into subqueries that can be executed at different
nodes.
2. Data Localization:
- Identifying the locations of data fragments needed for the query and determining the
best way to access them.
4. Distributed Joins:
o Strategies for joining data from different nodes, such as:
o Ship Whole: Transferring entire tables to a single site for joining.
o Semi-Join: Reducing the size of data transfer by sending only necessary columns
and rows.
5. Transaction Management:
- Ensuring ACID properties in a distributed context. This includes managing distributed
transactions, two-phase commit protocol, and dealing with failures.
Example Workflow
1. Query Parsing: Convert SQL query into an internal representation.
2. Query Decomposition: Divide the query into subqueries for each relevant node.
3. Query Optimization: Optimize each subquery and the overall query plan.
4. Query Execution: Execute subqueries, transfer data as needed, and assemble results.
5. Result Assembly: Combine results from subqueries to form the final result.
The CAP Theorem, proposed by Eric Brewer, states that in a distributed database system, it
is impossible to simultaneously achieve Consistency, Availability, and Partition Tolerance.
A system can only guarantee two out of the three at any given time.
1. Consistency:
- Every read sees the most recent committed write, so all nodes return the same
data at the same time.
2. Availability:
- Every request receives a response, either success or failure, even if some nodes are
down.
3. Partition Tolerance:
- The system continues to operate despite network partitions that disrupt communication
between nodes.
3. Consistency Models:
o Strong Consistency: Immediate consistency across all nodes.
o Eventual Consistency: Updates propagate asynchronously, and data becomes
consistent eventually.
o Causal Consistency: Operations that are causally related are seen in the same
order by all nodes.
Conclusion
Understanding the CAP theorem is crucial for designing and managing distributed
databases. It helps in making informed trade-offs between consistency, availability, and
partition tolerance based on application requirements and network conditions.
NoSQL databases can be broadly categorized into four types, each designed to solve
specific problems and optimize different aspects of data storage and retrieval.
1. Key-Value Stores
• Definition:
These databases store data as a collection of key-value pairs, where a unique key is
used to identify a particular value.
• Characteristics:
o Simple and fast retrieval.
o Ideal for scenarios requiring high-speed read and write operations.
o Lack of complex querying capabilities.
• Examples: Redis, DynamoDB, Riak.
2. Document Stores
• Definition:
These databases store data as documents, typically in formats like JSON, BSON, or
XML.
• Characteristics:
o Flexible schema design allows for easy modifications.
o Each document can contain nested structures, arrays, and other data types.
o Supports complex queries and indexing on document fields.
• Examples: MongoDB, CouchDB.
3. Column-Family Stores
• Definition:
These databases store data in columns rather than rows, allowing for efficient
retrieval of data in a column-oriented format.
• Characteristics:
o Suitable for read-heavy and write-heavy applications.
o Efficient storage and retrieval for large volumes of data.
o Often used in data warehousing and analytical applications.
• Examples: Cassandra, HBase.
4. Graph Databases
• Definition:
These databases store data in nodes, edges, and properties, representing entities
and their relationships.
• Characteristics:
o Optimized for traversing and querying complex relationships.
o Ideal for social networks, recommendation systems, and network analysis.
o Queries are performed using graph-specific query languages like Cypher.
• Examples: Neo4j, ArangoDB.
Use Cases
o Key-Value Stores: Ideal for session management, caching, and real-time analytics.
o Document Stores: Used for content management systems, e-commerce platforms,
and applications requiring flexible schemas.
o Column-Family Stores: Suitable for time-series data, IoT applications, and large-
scale data warehousing.
o Graph Databases: Best for applications involving complex relationships, such as
social networks, fraud detection, and recommendation engines.
Trade-Offs
• Flexibility vs. Structure:
o NoSQL: Offers schema flexibility, making it easier to accommodate changing
data requirements.
o Relational: Enforces a rigid schema, which can ensure data integrity but limits
flexibility.
• Scalability vs. Consistency:
o NoSQL: Designed for horizontal scalability, distributing data across multiple
nodes. However, this can sometimes lead to eventual consistency rather than
immediate consistency.
o Relational: Typically vertically scalable, with strong ACID compliance ensuring
immediate consistency.
• Performance vs. Complexity:
o NoSQL: High performance for specific operations (e.g., key-value lookups,
document retrieval) but may lack the complex querying capabilities of SQL.
o Relational: Supports complex joins and transactions, but can become slow with
very large datasets without proper optimization.
• Development Speed vs. Maturity:
o NoSQL: Faster development cycles due to flexible schemas, but the technology
is generally newer and may lack some of the maturity and tool support of
relational databases.
o Relational: More mature with extensive tooling and community support, but
development can be slower due to schema constraints and the need for
migration scripts.
Setup:
1. Install MongoDB (refer to the official MongoDB installation guide for your operating
system).
2. Start the MongoDB server using the `mongod` command.
Basic Operations:
1. Connecting to MongoDB:
```bash
mongo
```
2. Creating a Database:
```javascript
use myDatabase
```
3. Inserting Documents:
```javascript
db.myCollection.insertOne({ name: "John Doe", age: 30, email:
"[email protected]" })
```
4. Querying Documents:
```javascript
db.myCollection.find({ age: { $gt: 25 } })
```
5. Updating Documents:
```javascript
db.myCollection.updateOne({ name: "John Doe" }, { $set: { age: 31 } })
```
6. Deleting Documents:
```javascript
db.myCollection.deleteOne({ name: "John Doe" })
```
Setup:
1. Install Cassandra (refer to the official Apache Cassandra installation guide for your
operating system).
2. Start the Cassandra server using the `cassandra` command.
- Basic Operations:
1. Connecting to Cassandra:
```bash
cqlsh
```
2. Creating a Keyspace:
```sql
CREATE KEYSPACE myKeyspace WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 1};
```
3. Creating a Table:
```sql
CREATE TABLE myKeyspace.users (
user_id UUID PRIMARY KEY,
name TEXT,
age INT,
email TEXT
);
```
4. Inserting Data:
```sql
INSERT INTO myKeyspace.users (user_id, name, age, email) VALUES (uuid(),
'Jane Doe', 28, '[email protected]');
```
5. Querying Data:
```sql
-- age is not part of the primary key, so this scan needs ALLOW FILTERING
SELECT * FROM myKeyspace.users WHERE age > 25 ALLOW FILTERING;
```
6. Updating Data:
```sql
UPDATE myKeyspace.users SET age = 29 WHERE user_id = [specific-uuid];
```
7. Deleting Data:
```sql
DELETE FROM myKeyspace.users WHERE user_id = [specific-uuid];
```
Conclusion
NoSQL databases offer significant advantages in terms of scalability, flexibility, and
performance for specific use cases. Understanding the different types of NoSQL
databases, their trade-offs compared to relational databases, and gaining hands-on
experience with tools like MongoDB and Cassandra will prepare you to make informed
decisions about the best database technology for various applications.
• Spark: An open-source unified analytics engine for big data processing, with built-in
modules for streaming, SQL, machine learning, and graph processing. Known for:
- In-memory processing capabilities.
- High-level APIs in Java, Scala, Python, and R.
Real-Time Analytics:
o Business Intelligence (BI): Tools and techniques to analyze real-time data for
decision-making.
o Examples: Tableau, Power BI.
o Anomaly Detection: Identifying unusual patterns that do not conform to expected
behavior.
o Examples: Using machine learning models with real-time data streams for fraud
detection.
Conclusion
This lecture covered foundational concepts and technologies in big data, including
integration with traditional databases, data warehousing and ETL processes, and real-time
data processing and analytics. Understanding these concepts is crucial for managing and
analyzing the vast amounts of data generated in various domains, enabling effective
decision-making and innovative solutions.
Authorization
- Determines what an authenticated user is allowed to do.
- Example: SQL GRANT and REVOKE statements.
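A minimal sketch (the user and table names are assumptions):
```sql
-- Grant an application account read and insert rights on Orders
GRANT SELECT, INSERT ON Orders TO app_user;

-- Later, withdraw the insert right
REVOKE INSERT ON Orders FROM app_user;
```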
Capabilities
- Tokens or keys that grant specific access rights.
- Example: An API key that allows access to specific endpoints.
Database Encryption
- Encrypts data within the database.
- Example: Transparent Data Encryption (TDE) in SQL Server and Oracle.
Asymmetric Encryption
- Uses a pair of keys: a public key for encryption and a private key for decryption.
- Enables secure key exchange.
- Examples: RSA, ECC (Elliptic Curve Cryptography).
Stored Procedures
- Encapsulate SQL logic in the database.
- Example: Using stored procedures in SQL Server.
Input Validation
- Validate and sanitize user inputs.
- Example: Using regex to ensure input meets expected patterns.
---
Summary
o Database security is crucial for protecting sensitive data and maintaining data
integrity.
o Implementing strong security models and access control mechanisms can help
manage and restrict access.
o Encryption ensures that data remains secure both at rest and in transit.
o Awareness of common vulnerabilities like SQL injection and how to prevent them is
essential.
o Following best practices ensures a comprehensive approach to securing database
systems.
Cloud databases are databases that run on cloud computing platforms. They can be
accessed, managed, and utilized remotely through the internet. Database as a Service
(DBaaS) is a cloud service model that provides users with access to a database without the
need for physical hardware, software installation, or maintenance.
DBaaS Examples
1. Amazon RDS (Relational Database Service):
- Supports multiple database engines such as MySQL, PostgreSQL, Oracle, and
SQL Server.
- Automates backups, patch management, and replication.
2. Google Cloud SQL:
- Fully-managed relational database service for MySQL, PostgreSQL, and SQL
Server.
- Offers automated backups, replication, and failover.
3. Microsoft Azure SQL Database:
- Managed relational database service with built-in intelligence.
- Supports automatic tuning, performance monitoring, and scaling.
Key Considerations
1. Security: Ensure data encryption at rest and in transit, and implement access
controls.
2. Compliance: Verify compliance with industry regulations such as GDPR, HIPAA,
etc.
3. Vendor Lock-in: Be aware of potential difficulties in migrating databases between
providers.
In-Memory Databases
In-memory databases (IMDBs) store data entirely in the main memory (RAM) instead of on
disk storage. This allows for significantly faster data retrieval and processing times.
Use Cases
1. High-frequency Trading: Requires rapid data processing and decision making.
2. Real-time Analytics: Applications like fraud detection and recommendation
engines.
3. Caching: Frequently accessed data can be cached in-memory to enhance
performance.
Key Considerations
1. Cost: RAM is more expensive than disk storage, potentially increasing costs.
2. Persistence: Ensure data persistence strategies (e.g., snapshots, logs) to prevent
data loss.
3. Scalability: Managing large datasets can be challenging due to RAM limitations.
NewSQL Databases
NewSQL databases aim to provide the scalability and performance of NoSQL systems
while maintaining the ACID properties and familiar SQL interface of traditional relational
databases.
The field of database technology is continuously evolving, with ongoing research and
development aimed at addressing emerging challenges and leveraging new opportunities.
Key Trends
2. Blockchain Databases:
- Use of blockchain technology for decentralized, immutable ledgers.
- Applications in finance, supply chain, and healthcare for secure and transparent
record-keeping.
3. Hybrid Databases:
- Combining the strengths of SQL and NoSQL databases to handle diverse data
types and workloads.
- Examples: Multi-model databases like ArangoDB and OrientDB.
5. Quantum Databases:
- Research into quantum computing for database processing to solve complex
problems more efficiently.
- Potential for exponential speed-ups in certain types of database queries and
transactions.
6. Serverless Databases:
- DBaaS models that automatically handle provisioning, scaling, and management
without server management by users.
- Examples: AWS Aurora Serverless, Google Cloud Firestore.
Research Challenges
3. Interoperability:
- Ensuring seamless interoperability between different database systems and
platforms.
- Development of standardized APIs and data exchange formats.
Conclusion
Staying updated with emerging trends and advancements in database technologies is
crucial for database professionals. Continuous learning and adaptation will enable
leveraging these innovations to solve complex data challenges and drive technological
progress.