Mastercard Data Engineer Interview Questions (4–5 YOE, 20–25 LPA)
Harshavardhana I, Data Engineer
SQL
Query optimization is a critical skill for a Data Engineer. Here are practical strategies:

1. Analyze the Execution Plan:
   o Identify costly operations like full table scans, nested loops, or missing indexes.
2. Avoid SELECT *:
   o Fetch only the columns you actually need to reduce I/O.
3. Optimize Joins:
   o Prefer INNER JOIN over OUTER JOIN if NULLs are not needed.
   o Ensure joined fields are indexed.
4. Filter Early:
   o Apply WHERE clauses early to limit the data set before joins and aggregations.
5. Rewrite Correlated Subqueries:
   o Use JOINs or CTEs (Common Table Expressions) for better performance and readability.

-- Avoid this: correlated subquery evaluated once per row
SELECT e.name
FROM employees e
WHERE e.salary > (SELECT AVG(salary)
                  FROM employees
                  WHERE department = e.department);

-- Prefer this: compute the aggregate once in a CTE, then join
WITH dept_avg AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name
FROM employees e
JOIN dept_avg d ON e.department = d.department
WHERE e.salary > d.avg_salary;

6. Partition Large Tables:
   o Use table partitioning to divide large tables logically for faster access.
Example: find the second highest salary.

SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk -- DENSE_RANK handles duplicate salaries
    FROM employees
) ranked
WHERE rnk = 2;
WHERE vs HAVING:

Feature               WHERE                                     HAVING
Purpose               Filters rows before aggregation           Filters rows after aggregation
Usage                 Can be used with SELECT, UPDATE, DELETE   Typically used with GROUP BY
Aggregate Functions   Cannot use (SUM, AVG, etc.)               Can use
Example               WHERE salary > 50000                      HAVING COUNT(*) > 3
Example:

-- Using WHERE
SELECT name
FROM employees
WHERE salary > 50000;

-- Using HAVING
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 3;
Using CASE for conditional logic:

SELECT
    name,
    CASE
        WHEN salary > 50000 THEN 'High'
        ELSE 'Low'
    END AS salary_status
FROM employees;
MySQL's NULL-safe equality operator <=> matches NULLs directly:

SELECT * FROM table WHERE column <=> NULL; -- Only TRUE if column is NULL
Data Normalization

Normalization structures tables to:
• Eliminate data redundancy (duplicate data)
• Ensure data integrity
• Maintain consistency

Normal Form               Rule                  Example
1NF (First Normal Form)   No repeating groups   Avoid arrays or multiple values in a column

Benefits:
• Reduces data redundancy (e.g., no repeated customer info in each order row)
Star Schema vs Snowflake Schema:

Feature             Star Schema                        Snowflake Schema
Structure           Central fact table with            Central fact table with
                    denormalized dimensions            normalized dimensions
Query Performance   Faster (fewer joins)               Slightly slower (more joins)
Storage             Uses more space                    Uses less space
Simplicity          Easier to understand and query     More complex
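To make the join-count difference concrete, here is a small pandas sketch; the fact and dimension tables are hypothetical toy data:

import pandas as pd

fact_sales = pd.DataFrame({"product_id": [1, 2], "amount": [100, 200]})

# Star schema: one denormalized product dimension -> a single join
dim_product_star = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Card A", "Card B"],
    "category_name": ["Credit", "Debit"],
})
star = fact_sales.merge(dim_product_star, on="product_id")

# Snowflake schema: category split into its own table -> an extra join
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Card A", "Card B"],
                            "category_id": [10, 20]})
dim_category = pd.DataFrame({"category_id": [10, 20],
                             "category_name": ["Credit", "Debit"]})
snowflake = (fact_sales.merge(dim_product, on="product_id")
                       .merge(dim_category, on="category_id"))

# Same answer either way; the snowflake form just needed one more join
print(star.equals(snowflake[star.columns]))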
Scenario: design a data warehouse for a banking domain.

Approach:
• Identify stakeholders: finance, fraud detection, compliance, marketing
• Identify key dimensions such as customers
• Example: define a central Fact Table with supporting Dimension Tables
• Source systems: core banking systems, customer CRM, external KYC data
• Use cloud DWs like Snowflake, Amazon Redshift, Google BigQuery, or on-premise options like Teradata
• Partition large fact tables for query performance
1. Compare Hadoop and Spark in terms of architecture and use cases.

1.1 Hadoop Architecture:
• Core Components: HDFS for distributed storage, YARN for resource management, MapReduce for batch processing
• Workflow: input data is split into blocks on HDFS; map and reduce stages write intermediate results to disk between steps

1.2 Spark Architecture:
• Core Components: a Driver program coordinates Executors through a cluster manager, operating on in-memory RDDs/DataFrames
• Spark Ecosystem: Spark SQL, Spark Streaming, MLlib, GraphX
1.3 Comparison Table:

Feature           Hadoop (MapReduce)             Spark
Processing Type   Batch                          Batch + Real-time
Speed             Slower (due to disk I/O)       10–100x faster (in-memory)
Ease of Use       Verbose Java MapReduce code    Supports Scala, Python, SQL
Fault Tolerance   Yes (via HDFS replication)     Yes (via lineage of RDDs)

When to Use:
• Hadoop: Archival, cold data storage, traditional batch jobs
• Spark: Real-time analytics, machine learning pipelines, interactive querying
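As a concrete taste of Spark's in-memory API, a minimal PySpark batch aggregation; the column names (txn_date, amount) and data are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn-batch").getOrCreate()

# Toy transactions table (hypothetical schema for illustration)
df = spark.createDataFrame(
    [("2024-01-01", 120.0), ("2024-01-01", 80.0), ("2024-01-02", 50.0)],
    ["txn_date", "amount"],
)

# The whole pipeline stays in memory; MapReduce would spill every
# intermediate stage to disk
daily = df.groupBy("txn_date").agg(
    F.sum("amount").alias("total_amount"),
    F.count(F.lit(1)).alias("txn_count"),  # count rows per day
)
daily.show()
spark.stop()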
2. How does partitioning work in Spark?

A partition is a logical chunk of a distributed dataset. Spark processes each partition in parallel, leading to high performance in distributed environments.

Types of Partitioning:
1. Default Partitioning: determined by the number of cores or input splits
2. Hash Partitioning: distributes records by the hash of a key, e.g. rdd.partitionBy(4)
3. Range Partitioning: distributes records across sorted key ranges

Repartition vs Coalesce (see the sketch after this list):
• repartition(n): performs a full shuffle; can increase or decrease the partition count and optimizes parallelism
• coalesce(n): avoids a full shuffle; efficient for reducing the partition count
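A short PySpark sketch of the difference; the partition counts are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)  # toy dataset

print(df.rdd.getNumPartitions())      # default partitioning

wide = df.repartition(8)              # full shuffle: can increase partitions
print(wide.rdd.getNumPartitions())    # 8

narrow = wide.coalesce(2)             # no full shuffle: merges existing partitions
print(narrow.rdd.getNumPartitions())  # 2

spark.stop()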
3. Why might you choose Parquet over CSV for storing large
datasets?
Parquet vs CSV Comparison:
Feature       CSV                                Parquet
Compression   Poor (large file sizes)            Highly compressed
Read Speed    Slower (scans whole rows)          Faster (reads only needed columns)
Types         Strings (needs manual parsing)     Strongly typed (ints, floats, etc.)
Splittable    Hard to split when compressed      Splittable for parallel processing
Why Parquet wins for large datasets:
1. Columnar Storage: queries read only the columns they need
2. Compression: column-wise encoding shrinks files dramatically
3. Schema Enforcement: column names and types are stored with the data
4. Integration: supported natively by Spark, Hive, Presto, and pandas
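A quick pandas check of these claims; the frame is a hypothetical toy example, and to_parquet assumes pyarrow or fastparquet is installed:

import os
import pandas as pd

df = pd.DataFrame({"id": range(100_000),
                   "amount": [1.5] * 100_000})

df.to_csv("demo.csv", index=False)
df.to_parquet("demo.parquet", index=False)  # needs pyarrow or fastparquet

# Parquet file is a fraction of the CSV size
print(os.path.getsize("demo.csv"), os.path.getsize("demo.parquet"))

# Parquet preserves dtypes; CSV round-trips everything through text
print(pd.read_parquet("demo.parquet").dtypes)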
Coding

How do you read and transform a large CSV file that doesn't fit in memory?

Example Code:

import pandas as pd

chunk_size = 100000
result = []

# Process the file in manageable chunks instead of loading it all at once
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):  # input path is illustrative
    chunk = chunk.dropna()  # example transformation: drop rows with nulls
    result.append(chunk)

final_df = pd.concat(result)

# Save to new file
final_df.to_csv("transformed_file.csv", index=False)
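If even the concatenated result would be too large for memory, the same chunked pattern can fold each chunk into a running aggregate instead; the file name and the amount column are hypothetical:

import pandas as pd

total = 0.0
rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    chunk = chunk.dropna(subset=["amount"])  # hypothetical numeric column
    total += chunk["amount"].sum()           # accumulate instead of storing chunks
    rows += len(chunk)

print(f"mean amount = {total / rows:.2f} over {rows} rows")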
Best Practices:

Strategy                     Code                                  When to Use
Drop missing values          df.dropna()                           Rows/columns with many nulls
Fill with constant           df.fillna(0)                          Default value like 0 or "Unknown"
Forward fill (ffill)         df.fillna(method='ffill')             Time series data
Backward fill (bfill)        df.fillna(method='bfill')             Alternative to ffill
Fill with mean/median/mode   df['col'].fillna(df['col'].mean())    Numerical columns
Check % of missing           df.isnull().mean() * 100              Decide a strategy before filling

Example:

# Fill missing ages with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing city names with "Unknown"
df['City'] = df['City'].fillna("Unknown")
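The strategies above in one runnable sketch, using a hypothetical toy DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 32, np.nan],
                   "City": ["Pune", None, "Delhi", None]})

print(df.isnull().mean() * 100)  # % of missing values per column

df["Age"] = df["Age"].fillna(df["Age"].mean())  # numeric: fill with mean
df["City"] = df["City"].fillna("Unknown")       # categorical: constant
# For time series, df.ffill() / df.bfill() propagate neighbouring values
print(df)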
What is a Decorator?

A decorator is a function that wraps another function to add behaviour before and after it runs, without changing its code.

def my_decorator(func):
    def wrapper():
        print("Before function runs")
        func()
        print("After function runs")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()

Output:
Before function runs
Hello!
After function runs
A decorator that times a function:

import time

def timer_decorator(func):
    def wrapper():
        start = time.time()
        func()
        print(f"Execution time: {time.time() - start:.2f} seconds")
    return wrapper

@timer_decorator
def process_data():
    time.sleep(2)
    print("Data processed")

process_data()

Output:
Data processed
Execution time: 2.00 seconds
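One refinement often probed in interviews: functools.wraps preserves the wrapped function's metadata, and *args/**kwargs lets the decorator wrap any signature. A minimal sketch:

import functools
import time

def timer_decorator(func):
    @functools.wraps(func)  # keeps func.__name__, __doc__, etc.
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)  # forward any arguments
        print(f"{func.__name__} took {time.time() - start:.2f} seconds")
        return result
    return wrapper

@timer_decorator
def process_data():
    time.sleep(1)

process_data()
print(process_data.__name__)  # "process_data", not "wrapper"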