SQL + PL/SQL + Database (Very Important topic)

1. What are DCL, DML, and DDL in SQL?
DCL (Data Control Language) manages permissions and access control, with statements such as GRANT and REVOKE. DML (Data Manipulation Language) manipulates data, with statements such as INSERT, UPDATE, and DELETE. DDL (Data Definition Language) defines and manages database structures, with statements such as CREATE, ALTER, and DROP. Example: DCL - GRANT SELECT ON table TO user; DML - INSERT INTO table (column1, column2) VALUES (value1, value2); DDL - CREATE TABLE table (column1 datatype, column2 datatype);

2. What is the difference between GROUP BY and HAVING?
GROUP BY groups rows that share a value in one or more columns and is typically used with aggregate functions. HAVING filters the grouped results, whereas WHERE filters individual rows before they are grouped. Example: SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000;

3. What is ORDER BY? Can we order by more than one column?
ORDER BY sorts query results. You can sort by more than one column by listing the column names in the ORDER BY clause; rows are sorted by the first column, and ties are broken by the next one. Example: SELECT name, age FROM students ORDER BY age, name;

4. What is the difference between UNION and JOIN?
UNION combines the result sets of two or more SELECT queries into a single result set and removes duplicates (UNION ALL keeps them). JOIN combines rows from two or more tables based on a related column. Example: UNION - SELECT name FROM table1 UNION SELECT name FROM table2; JOIN - SELECT customers.name, orders.order_date FROM customers JOIN orders ON customers.customer_id = orders.customer_id;

5. What are the different types of join?
The main types of join are INNER JOIN (returns only the rows that match in both tables), LEFT JOIN (returns all rows from the left table and the matching rows from the right), RIGHT JOIN (returns all rows from the right table and the matching rows from the left), and FULL OUTER JOIN (returns all rows from both tables, matched where possible and filled with NULLs where there is no match).
Example: INNER JOIN - SELECT customers.name, orders.order_date FROM customers INNER JOIN orders ON customers.customer_id = orders.customer_id;

6. What are aggregate functions?
Aggregate functions perform a calculation on a set of values and return a single result. Common aggregates include COUNT, SUM, AVG, MAX, and MIN. Example: SELECT COUNT(*) FROM orders;

7. What is the sequence of clauses in a SQL statement?
The clauses of a query are written in this order: SELECT (columns) FROM (table) WHERE (conditions) GROUP BY (columns) HAVING (conditions) ORDER BY (columns);

8. What is a view, and why do we use it?
A view is a virtual table based on the result of a SELECT query. It simplifies complex queries, provides security by exposing only selected columns and rows, and hides the underlying table structures. Example: CREATE VIEW employee_view AS SELECT name, salary FROM employees WHERE department = 'HR';

9. What is a SQL transaction?
A SQL transaction is a sequence of one or more SQL statements treated as a single unit of work. It follows the ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure data integrity. Example (the statement that starts a transaction varies by DBMS): BEGIN TRANSACTION; UPDATE account SET balance = balance - 100 WHERE account_number = '123'; COMMIT;

10. What is the difference between DELETE and TRUNCATE?
DELETE removes specific rows from a table based on a condition and can be rolled back. TRUNCATE removes all rows from a table at once and, in several DBMSs such as Oracle and MySQL, cannot be rolled back. Example: DELETE FROM employees WHERE department = 'IT'; TRUNCATE TABLE employees;

11. How can we add a column to a table?
Use the ALTER TABLE statement to add a column to an existing table. Example: ALTER TABLE employees ADD COLUMN address VARCHAR(255); (Oracle omits the COLUMN keyword.)

12. How can we insert multiple rows in a single INSERT statement?
Use the INSERT INTO statement with several value sets, each in its own parentheses. Example: INSERT INTO students (name, age) VALUES ('Alice', 25), ('Bob', 22), ('Charlie', 28);

13. What is a database, a DBMS, and an RDBMS?
A database is a structured collection of data. A DBMS (Database Management System) is software that manages databases. An RDBMS (Relational DBMS) stores data in tables with relationships between them. Example: Database - CompanyDB; DBMS - MySQL; RDBMS - PostgreSQL (MySQL is also relational).

14. What are the kinds of attributes?
Attributes represent properties of entities. They can be classified as simple (atomic), composite (composed of sub-attributes), or derived (calculated from other attributes). Example: Simple - Age; Composite - Address (Street, City); Derived - TotalPrice (Quantity * Price).

15. What is an ERD?
An ERD (Entity-Relationship Diagram) is a visual representation of database entities, their attributes, and the relationships between entities. It helps in database design.

16. What are the types of constraints?
Constraints enforce data integrity rules. Common types include PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and NOT NULL. Example: PRIMARY KEY (employee_id), FOREIGN KEY (department_id) REFERENCES departments(department_id);

17. What is the difference between a primary key and a foreign key?
A primary key uniquely identifies each row in a table. A foreign key links a row to a row in another table, ensuring referential integrity. Example: PRIMARY KEY - employee_id in the employees table; FOREIGN KEY - department_id in the employees table referencing the departments table.

18. What is the difference between DELETE and TRUNCATE?
(Repeated question; see question 10.) DELETE removes specific rows; TRUNCATE removes all rows and, in several DBMSs, cannot be rolled back.

19. What are ON DELETE SET NULL and ON DELETE CASCADE?
ON DELETE SET NULL sets the foreign key values in child rows to NULL when the referenced parent row is deleted. ON DELETE CASCADE deletes the child rows when the referenced parent row is deleted. Example: ON DELETE SET NULL - employee_id in orders is set to NULL when the employee is deleted; ON DELETE CASCADE - all of the employee's orders are deleted when the employee is deleted.

20. What is normalization, and why do we do it?
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. It prevents update anomalies and avoids storing the same fact in several places. Example: 1NF - Ensure each column has atomic values; 2NF - Remove partial dependencies; 3NF - Remove transitive dependencies.

21. What are the types of normalization?
Common normal forms include 1NF, 2NF, 3NF, BCNF, and 4NF. Each eliminates a specific type of data redundancy. Example: 1NF - Each column has atomic values; 2NF - No partial dependencies; 3NF - No transitive dependencies.

22. What are update anomalies?
Update anomalies are inconsistencies that arise from data redundancy, such as when a value is updated in one place but not in another. Example: In a denormalized table, updating an employee's salary in one row but not in another row for the same employee.

23. What is the difference between SQL and PL/SQL?
SQL is a declarative language for querying and managing data in databases. PL/SQL is Oracle's procedural extension of SQL, used for writing stored procedures, functions, and triggers. Example SQL: SELECT * FROM employees; Example PL/SQL: CREATE PROCEDURE getEmployee (emp_id NUMBER) AS BEGIN ... END;

24. What are the types of loops in PL/SQL?
PL/SQL provides the FOR LOOP, the WHILE LOOP, and the basic LOOP ... END LOOP (exited with EXIT or EXIT WHEN) for repetitive tasks. Example: FOR i IN 1..10 LOOP ... END LOOP;

25. What are cursors, and what are the cursor types?
A cursor is a pointer to the result set of a SQL statement that lets a program retrieve and process its rows. Implicit cursors are created automatically for DML statements and single-row SELECT INTO queries; explicit cursors are declared by the programmer, typically for multi-row queries. Some DBMSs further categorize cursors as static, dynamic, and scrollable. Example: Implicit cursor - SELECT name INTO employee_name FROM employees WHERE id = 123; Explicit cursor - CURSOR emp_cursor IS SELECT name FROM employees;

26. What is a procedure?
A procedure is a named collection of PL/SQL statements that is stored in the database and executed as a single unit. It can take parameters and pass values back through OUT parameters.
Example: CREATE PROCEDURE calculate_salary (employee_id NUMBER) AS BEGIN ... END;

27. What is the difference between a procedure and a function?
A function must return a value through its RETURN clause, while a procedure has no return value and passes results back only through OUT parameters. Functions can be called from SQL queries, whereas procedures cannot. Example Procedure: CREATE PROCEDURE update_employee (emp_id NUMBER) AS BEGIN ... END; Example Function: CREATE FUNCTION get_employee_name (emp_id NUMBER) RETURN VARCHAR2 AS BEGIN ... END;

28. What are triggers, and what are the trigger types?
Triggers are PL/SQL blocks executed automatically in response to specific database events. Types include BEFORE and AFTER triggers for INSERT, UPDATE, and DELETE events. Example: BEFORE INSERT trigger - Prevent inserting records with invalid data; AFTER UPDATE trigger - Log changes to a table.

29. Write SQL statements
The exact statements depend on the requirements and the tables involved. For example, to insert data: INSERT INTO employees (emp_id, emp_name) VALUES (1, 'John Doe'); To update data: UPDATE products SET price = price * 0.9 WHERE category = 'Electronics'; To delete data: DELETE FROM customers WHERE last_purchase_date < '2022-01-01';

Business Intelligence

30. What is Business Intelligence?
Business Intelligence (BI) refers to the technologies, processes, and tools used to analyze and present business data to support decision-making. It helps organizations gain insights, make informed decisions, and improve business performance. Example: Using BI to analyze sales data to identify trends and optimize product offerings.

31. What are the steps in BI?
The typical steps in BI are data collection, data integration (ETL - Extract, Transform, Load), data storage, data analysis, and data visualization. Example: 1. Collecting sales data from multiple sources. 2. Integrating and transforming the data into a unified format. 3. Storing it in a data warehouse. 4. Analyzing it to discover sales trends. 5. Creating dashboards to visualize the trends.
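The BI steps listed above can be sketched end to end in a few lines. This is a minimal illustration rather than a production pipeline: it assumes Python's built-in sqlite3 module as a stand-in for a data warehouse, and the two source formats, the sales table, and all sample values are hypothetical. Visualization is left out; the final query result is the kind of data a dashboard would chart.

```python
import sqlite3

# Step 1 - collect: hypothetical raw records from two sources with different formats.
store_sales = [("2024-01-05", "Laptop", "1200.50"), ("2024-01-06", "Phone", "650.00")]
web_sales = [{"day": "2024-01-05", "item": "laptop", "amount_cents": 99900}]


def transform(store_rows, web_rows):
    """Step 2 - integrate: bring both sources into one (date, product, amount) format."""
    unified = []
    for day, product, amount in store_rows:
        unified.append((day, product.lower(), float(amount)))
    for row in web_rows:
        unified.append((row["day"], row["item"].lower(), row["amount_cents"] / 100))
    return unified


def load(conn, rows):
    """Step 3 - store: load the unified rows into a table (the 'warehouse')."""
    conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def analyze(conn):
    """Step 4 - analyze: total revenue per product."""
    query = "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(store_sales, web_sales))
totals = analyze(conn)
print(totals)  # [('laptop', 2199.5), ('phone', 650.0)]
```

Keeping extract, transform, and load as separate functions mirrors the ETL stages, so each one can be replaced (for example, by a real source connector) without touching the others.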
32. What are the tools we use in BI (for ETL, Analysis, and Visualization)?
ETL (Extract, Transform, Load) tools include Apache NiFi, Talend, and Informatica. Analysis tools include Tableau, Power BI, and QlikView. Visualization tools include D3.js, Google Data Studio, and Looker. Example: Using Tableau for data analysis and visualization to create interactive sales reports.

Data Warehouse

33. What is a data warehouse?
A data warehouse is a centralized repository that stores, integrates, and manages data from various sources to support business reporting and analysis. It is designed for query and analysis rather than transaction processing. Example: Storing historical sales data, customer information, and product data for business intelligence purposes.

34. What are the characteristics of a data warehouse?
A data warehouse is subject-oriented (focused on specific business areas), integrated (combines data from diverse sources), time-variant (stores historical data), and non-volatile (once loaded, data is rarely changed or deleted), and it supports complex queries. Example: Analyzing sales trends over the last five years.

35. What is the difference between a database and a data warehouse?
An operational database is designed for transactional processing, while a data warehouse is designed for analytical processing. Databases support real-time data updates, while data warehouses store historical data and support complex queries for reporting and analysis. Example: A database for online order processing vs. a data warehouse for sales analysis.

36. What is the difference between a data warehouse and big data?
A data warehouse stores structured data in a highly organized manner, while big data encompasses vast volumes of structured and unstructured data. Data warehouses are well suited to structured data analysis, whereas big data technologies like Hadoop handle unstructured and semi-structured data. Example: A data warehouse for analyzing sales data vs. using big data tools to analyze social media posts.

37. What is the difference between OLTP and OLAP?
OLTP (Online Transaction Processing) systems are used for day-to-day transactional operations, supporting real-time data entry and retrieval. OLAP (Online Analytical Processing) systems are used for complex data analysis and reporting. Example: OLTP for processing bank transactions, OLAP for analyzing customer spending patterns.

38. What is Data Warehousing?
Data warehousing is the process of designing, building, and maintaining data warehouses. It involves data extraction, transformation, and loading (ETL), and provides a platform for business intelligence and reporting. Example: Setting up a data warehousing system for a retail company.

39. What are the processes that can be done in the data warehouse?
Processes include data extraction, data transformation, data loading (ETL), data storage, data retrieval, data modeling, and data analysis. Example: Extracting customer data, transforming it into a standardized format, and loading it into the data warehouse for analysis.

40. What is Data Modeling? + Types of Data Modeling?
Data modeling is the process of defining the structure and relationships of data in a database or data warehouse. Types include conceptual modeling (high-level representation), logical modeling (entity-relationship diagrams), and physical modeling (designing database tables). Example: Creating an entity-relationship diagram for a customer database.

41. Can we update a record in a data warehouse?
Data warehouses are typically designed for read-intensive operations, and updates are infrequent. Updates can be performed, but they often involve complex ETL processes so that historical data is preserved. Example: Correcting a customer's address in the data warehouse.

42. What is a data mart?
A data mart is a subset of a data warehouse that focuses on a specific business area or department. It contains a smaller, more specialized set of data for targeted analysis.
Example: Creating a sales data mart for the Sales department to analyze sales performance.

43. What is a Data Cube?
A data cube is a multi-dimensional representation of data that allows for efficient querying and analysis. It contains dimensions (attributes) and measures (facts) and is often used in OLAP systems. Example: Analyzing sales data with dimensions like time, product, and region.

44. What is ETL?
ETL (Extract, Transform, Load) is a process used to extract data from source systems, transform it into a desired format, and load it into a data warehouse or data mart. Example: Extracting sales data from a CRM system, transforming it to match the data warehouse schema, and loading it into the data warehouse.

45. What is the difference between snowflake and star schema?
In a star schema, dimension tables are directly linked to a central fact table. In a snowflake schema, dimension tables are normalized into multiple related tables. Star schemas are simpler but can be less space-efficient, while snowflake schemas save space but can be more complex. Example: Star schema for sales analysis vs. snowflake schema for complex product hierarchies.

46. What is the difference between fact and dimension tables?
Fact tables contain numerical measures and foreign keys to dimension tables. Dimension tables contain descriptive attributes about dimensions such as time, product, or location. Example: Fact table with sales revenue vs. dimension table with product details.

Big Data

47. Why is big data important?
Big data is important because it enables organizations to gain valuable insights from vast and diverse datasets that were previously too large and complex to manage and analyze effectively. It can uncover patterns, trends, and opportunities for better decision-making. Example: Analyzing customer behavior across social media, online purchases, and offline interactions to enhance marketing strategies.
48. What is big data? (V's of Big Data)
Big data is characterized by the three V's: Volume (large amounts of data), Velocity (high-speed data generation and processing), and Variety (diverse data types, structured and unstructured). Some also add Veracity (data accuracy) and Value (extracting insights). Example: Social media platforms processing massive volumes of tweets (Volume) in real time (Velocity) with text, images, and videos (Variety).

49. What are the data types?
Data types in the context of big data include structured data (e.g., numbers, dates), semi-structured data (e.g., JSON, XML), and unstructured data (e.g., text, images, videos). Example: Structured data - Sales revenue as numbers; Semi-structured data - Customer data in JSON format; Unstructured data - Text reviews from customers.

50. What is a Data Lake?
A Data Lake is a central repository that stores vast amounts of raw, unprocessed data from diverse sources. It allows for flexible and scalable data storage and analysis. Example: Storing log files, sensor data, and social media posts in a Data Lake for future analytics.

51. What is the difference between ETL & ELT?
ETL (Extract, Transform, Load) extracts data from source systems and transforms it before loading it into a data warehouse. ELT (Extract, Load, Transform) loads the data into the target system first and then performs the transformations there. ELT is often used in big data scenarios where data may not fit the traditional ETL model. Example (ETL): Extracting sales data, aggregating it, and loading it into a data warehouse. Example (ELT): Loading raw log data into a Data Lake, then transforming it into a structured format for analysis.

52. What is the difference between a Database and Big data?
Databases are designed for structured data storage and transaction processing, while big data encompasses both structured and unstructured data. Big data technologies like Hadoop and NoSQL databases are built to handle massive volumes and varieties of data. Example: A relational database for storing customer information vs. Hadoop for processing social media data.

53. What are the tools in big data?
Big data tools include Hadoop (for distributed storage and processing), Spark (for fast data processing), MapReduce (for data processing in Hadoop), Hive (for querying and data warehousing), Impala (for SQL queries on Hadoop), Kafka (for real-time data streaming), and more. Example: Using Spark for analyzing large datasets in near real time.

54. What are (Hadoop/Spark/MapReduce/Hive/Impala/Kafka/...)?
- Hadoop is a distributed storage and processing framework for big data.
- Spark is a fast and versatile data processing engine.
- MapReduce is a programming model used in Hadoop for parallel processing.
- Hive is a data warehousing and SQL querying tool for Hadoop.
- Impala is an open-source SQL query engine for Hadoop.
- Kafka is a distributed streaming platform for real-time data.
Example: Using Hadoop to store and process large log files, Spark for real-time analytics, Hive for querying structured data in Hadoop, and Kafka for ingesting streaming data.

Data Science + Machine Learning + Data Mining (Data Science Track)

55. What is data science?
Data science is an interdisciplinary field that uses scientific methods, algorithms, processes, and systems to extract knowledge and insights from structured and unstructured data. It combines aspects of statistics, computer science, and domain knowledge to solve complex problems. Example: Using data science to analyze customer behavior and recommend personalized products.

56. What is the difference between data scientists and data analysts?
Data scientists focus on designing and implementing complex algorithms to solve business problems, which often requires programming and machine learning expertise. Data analysts primarily work on data exploration, visualization, and basic statistical analysis to answer specific questions.
Example: A data scientist develops a predictive model, while a data analyst creates reports and dashboards.

57. What is data cleaning? How do we clean the data?
Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It includes tasks like handling missing values, removing duplicates, and correcting outliers using statistical methods and domain knowledge. Example: Replacing missing age values in a dataset with the median age of known values.

58. What is Data Mining?
Data mining is the process of discovering patterns, relationships, and valuable insights from large datasets. It involves techniques like clustering, classification, regression, and association rule mining. Example: Analyzing retail sales data to identify product associations for marketing strategies.

59. What are the real-life applications of data mining and machine learning?
Applications include fraud detection, recommendation systems (e.g., Netflix), medical diagnosis, sentiment analysis in social media, predictive maintenance in manufacturing, and autonomous vehicles. Example: Using machine learning to predict disease outbreaks based on historical health data.

60. What is the Process of Data Mining/Knowledge Discovery Process?
The process involves data selection, data preprocessing, data transformation, data mining, pattern evaluation, and knowledge presentation. Example: In e-commerce, selecting sales data, preprocessing it (cleaning and transforming), mining customer purchase patterns, and presenting these patterns for business decisions.

61. What are the Challenges of Data Mining?
Challenges include handling large datasets, data quality issues, selecting appropriate algorithms, overfitting, interpretability of complex models, and ensuring privacy and security of sensitive data. Example: Dealing with skewed data distribution in fraud detection, where fraudulent transactions are rare.
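The fraud-detection challenge above is easy to see with numbers. The sketch below assumes a hypothetical dataset of 1,000 transactions of which 10 are fraudulent, and a deliberately naive baseline that labels every transaction as legitimate; neither comes from a real system.

```python
# Hypothetical labels: 1 = fraudulent, 0 = legitimate (10 frauds in 1,000 transactions).
labels = [1] * 10 + [0] * 990

# Naive baseline: predict "legitimate" for every transaction.
predictions = [0] * len(labels)

# Accuracy counts every correct prediction, including the easy majority class.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall counts only how many of the actual frauds were caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The baseline looks excellent on accuracy while catching no fraud at all, which is why skewed datasets are usually judged on recall, precision, or similar class-aware metrics instead.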
62. What is Machine Learning?
Machine learning is a subset of artificial intelligence that develops algorithms enabling computers to learn patterns from data and make predictions or decisions. Example: Training a machine learning model to recognize handwritten digits in images.

63. What is deep learning?
Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (deep architectures) to automatically learn representations of data. It excels in tasks like image and speech recognition. Example: Training a deep neural network to recognize objects in images.

64. What are the data mining tasks/algorithms?
Tasks include clustering (K-Means), classification (Decision Trees), regression (Linear Regression), association rule mining (Apriori), and anomaly detection (Isolation Forest). Example: Using K-Means to group customers based on purchasing behavior.

65. What is the difference between Supervised and Unsupervised learning?
Supervised learning uses labeled data to train models (e.g., classification or regression), while unsupervised learning uses unlabeled data to find patterns or groupings (e.g., clustering). Examples: Supervised - Spam email detection; Unsupervised - Customer segmentation.

66. What is the difference between Classification and Clustering?
Classification assigns labels to data based on predefined classes, while clustering groups data into clusters based on similarity, without predefined classes. Examples: Classification - Identifying email as spam or not; Clustering - Grouping customers into market segments.

67. Examples of clustering algorithms
K-Means, Hierarchical Clustering, and DBSCAN are examples of clustering algorithms.

68. Examples of classification algorithms
Decision Trees, Logistic Regression, and Support Vector Machines (SVM) are examples of classification algorithms.

69. What is an association rule?
Association rules identify relationships between items in a dataset and are often used in market basket analysis to find items that occur together in transactions. Example: "If a customer buys bread, they are likely to buy butter."

70. How does this algorithm work (K-Means, Regression, SVM, association rule, decision tree, KNN...)?
Provide a brief explanation of how each algorithm works. Example: K-Means assigns each data point to the nearest of K cluster centers and recomputes the centers until the assignments stop changing; Decision Trees make decisions by following a tree-like structure of if-else conditions.

71. What is recall and precision, F1?
Recall measures the ability of a model to identify all relevant instances: TP / (TP + FN). Precision measures the ability of a model to return only relevant instances: TP / (TP + FP). The F1-score is the harmonic mean of precision and recall, balancing the two. Example: In a medical test, recall is the percentage of actual sick patients correctly identified by the test.

72. What is the bias-variance trade-off?
The bias-variance trade-off refers to the balance between model complexity and the model's ability to generalize. A model with high bias (underfitting) has low complexity and may not capture underlying patterns. A model with high variance (overfitting) fits the training data too closely and may not generalize well to new data. Example: In polynomial regression, increasing the polynomial degree leads to lower bias but higher variance.

73. What is the confusion matrix?
A confusion matrix is a table that summarizes the performance of a classification algorithm. It shows true positives, true negatives, false positives, and false negatives. Example: In a binary classification problem, the confusion matrix may look like this: TP: 120, TN: 80, FP: 10, FN: 5.

74. What is the ROC Curve?
The ROC (Receiver Operating Characteristic) curve is a graphical representation of a classifier's performance, showing the trade-off between true positive rate and false positive rate at various thresholds. Example: In medical diagnosis, plotting the ROC curve helps assess how well a diagnostic test separates sick from healthy patients.
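Taking the counts from the confusion-matrix example in question 73 (TP: 120, TN: 80, FP: 10, FN: 5), the metrics from questions 71 to 74 work out as follows. This is a plain-Python sketch using the standard definitions; the variable names are illustrative only. The pair (false positive rate, recall) is the single point that this one decision threshold would contribute to a ROC curve.

```python
# Counts taken from the confusion-matrix example (question 73).
tp, tn, fp, fn = 120, 80, 10, 5

precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives found (true positive rate)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
false_positive_rate = fp / (fp + tn)  # share of actual negatives wrongly flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.3f}")            # precision = 0.923
print(f"recall    = {recall:.3f}")               # recall    = 0.960
print(f"F1        = {f1:.3f}")                   # F1        = 0.941
print(f"FPR       = {false_positive_rate:.3f}")  # FPR       = 0.111
print(f"accuracy  = {accuracy:.3f}")             # accuracy  = 0.930
```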
75. Explain cross-validation?
Cross-validation is a technique used to evaluate the performance of a machine learning model by dividing the dataset into multiple subsets (folds). It trains and tests the model on different combinations of folds to assess its generalization ability. Example: Using 5-fold cross-validation to train and test a model on five subsets of the data, rotating which subset is held out for testing in each iteration.

76. What is the difference between a validation set and a test set?
A validation set is used during the model training phase to tune hyperparameters and assess performance. A test set is a separate dataset used to evaluate the final model's generalization performance. Example: Using a validation set to adjust the learning rate in gradient boosting and a test set to estimate the model's accuracy on unseen data.

77. How do you treat missing/outlier values?
Missing values can be imputed using methods like the mean, the median, or interpolation. Outliers can be identified and then removed or transformed using statistical techniques. Example: Replacing missing age values with the median age of known values; Detecting outliers using the Z-score and removing extreme values.

78. How do you prepare the data for the ML Model?
Data preparation involves data cleaning, feature selection/engineering, handling missing values, scaling/normalizing features, and splitting data into training, validation, and test sets. Example: Scaling numerical features to have a mean of 0 and a standard deviation of 1 for better model convergence.

Statistics (Data Science Track)

79. What is the difference between standard deviation and variance?
Variance measures how far individual data points deviate from the mean, as the average of the squared deviations. Standard deviation is the square root of the variance and measures the typical deviation of data points from the mean.
Example: Variance calculates the average squared difference from the mean, while standard deviation provides a more interpretable measure in the original units of the data.

80. What are Mean, Median, and Mode?
Mean is the average of a set of numbers. Median is the middle number when the numbers are ordered. Mode is the value that appears most frequently. Example: For the set of numbers {2, 3, 3, 5, 7}, Mean = 4, Median = 3, Mode = 3.

81. What is the difference between variance and standard deviation?
(Repeated question; see question 79.) Variance measures the spread or dispersion of data by calculating the average of squared differences from the mean. Standard deviation is the square root of the variance and provides a more interpretable measure in the original units of the data. Example: For the set {1, 2, 3, 4, 5}, the mean is 3, the population variance is 2, and the population standard deviation is about 1.41 (the sample variance, which divides by n - 1, is 2.5).

82. What is the Box plot?
A box plot (box-and-whisker plot) is a graphical representation of the distribution of data. It shows the median, quartiles, and potential outliers. The box represents the interquartile range (IQR), and the whiskers extend to the minimum and maximum values within a defined range, commonly 1.5 times the IQR beyond the quartiles; points outside that range are drawn as outliers. Example: A box plot showing the distribution of test scores, with the median, quartiles, and any outliers.

83. What are the types of skewed data?
Skewed data can be positively skewed (right-skewed), where the tail extends to the right, or negatively skewed (left-skewed), where the tail extends to the left. Example: Positive skew in income distribution data due to a few high earners; Negative skew in test scores with many high scores.

84. What is the Z-score?
The Z-score (standard score) measures how many standard deviations a data point is from the mean. It standardizes data, making it possible to compare values from different datasets. Example: A Z-score of -1.5 indicates a data point is 1.5 standard deviations below the mean.

85. What is the P-value?
The P-value measures the evidence against a null hypothesis in hypothesis testing. It is the probability of observing a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. Example: In a medical trial, a P-value of 0.03 means there would be a 3% chance of observing results at least this extreme if the treatment had no effect (the null hypothesis).

86. What is the Pearson correlation coefficient?
The Pearson correlation coefficient (Pearson's r) measures the linear relationship between two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation. Example: Pearson's r of 0.75 between hours studied and exam scores suggests a strong positive correlation.

87. What is A/B Testing?
A/B testing (split testing) is a controlled experiment where two versions (A and B) of a webpage, app, or product are compared to determine which performs better in terms of user engagement or conversions. Example: Testing two different website layouts to see which one results in higher click-through rates.

88. What is hypothesis testing?
Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. It involves formulating a null hypothesis (no effect) and an alternative hypothesis (an effect exists) and testing the null hypothesis using data and statistical tests. Example: Testing whether a new drug is more effective than an existing one in a clinical trial.
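To make the P-value definition and the A/B testing idea concrete, here is a small sketch of an exact permutation test. The two groups and their values are hypothetical, and the permutation test is only one simple choice among many possible tests. Under the null hypothesis the group labels are interchangeable, so the one-sided P-value is the fraction of all possible relabelings whose difference in means is at least as large as the observed one.

```python
from itertools import combinations


def exact_permutation_p_value(group_a, group_b):
    """One-sided P-value for 'mean of B is greater than mean of A'."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = sum(group_b) / n_b - sum(group_a) / n_a

    at_least_as_extreme = 0
    n_splits = 0
    # Enumerate every way of choosing which pooled values are labeled "B".
    for b_indices in combinations(range(len(pooled)), n_b):
        b_sum = sum(pooled[i] for i in b_indices)
        a_sum = total - b_sum
        diff = b_sum / n_b - a_sum / n_a
        n_splits += 1
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_splits


# Hypothetical metric (e.g., clicks per session) for layouts A and B.
layout_a = [12, 14, 15, 16]
layout_b = [18, 19, 20, 22]

p_value = exact_permutation_p_value(layout_a, layout_b)
print(f"p = {p_value:.4f}")  # p = 0.0143
```

Only 1 of the 70 possible relabelings is as extreme as the observed split, so at a 0.05 significance level the null hypothesis of no difference would be rejected. Full enumeration is only practical for small samples; larger ones are usually handled by random resampling or a parametric test.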