Mastercard Data Engineer Interview Questions (4–5 YOE, 20–25 LPA)
Harshavardhana I, Data Engineer
SQL
Query optimization is a critical skill for a Data Engineer. Here are practical strategies:

1. Analyze the Execution Plan:
   o Identify costly operations like full table scans, nested loops, or missing indexes.
2. Avoid SELECT *:
   o Fetch only the columns you actually need to reduce I/O.
3. Optimize Joins:
   o Prefer INNER JOIN over OUTER JOIN if NULLs are not needed.
   o Ensure joined fields are indexed.
4. Filter Early:
   o Apply WHERE clauses early to limit the data set before joins and aggregations.
5. Rewrite Correlated Subqueries:
   o Use JOINs or CTEs (Common Table Expressions) for better performance and readability.

-- Avoid this: correlated subquery evaluated once per row
SELECT e.name
FROM employees e
WHERE e.salary > (SELECT AVG(salary)
                  FROM employees
                  WHERE department = e.department);

-- Prefer this: compute the aggregate once in a CTE, then join
WITH dept_avg AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name
FROM employees e
JOIN dept_avg d ON e.department = d.department
WHERE e.salary > d.avg_salary;

6. Partition Large Tables:
   o Use table partitioning to divide large tables logically for faster access.
Example: find the second highest salary.

SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk -- DENSE_RANK handles duplicate salaries
    FROM employees
) ranked
WHERE rnk = 2;
WHERE vs HAVING:

Feature               WHERE                                     HAVING
Purpose               Filters rows before aggregation           Filters rows after aggregation
Usage                 Can be used with SELECT, UPDATE, DELETE   Typically used with GROUP BY
Aggregate Functions   Cannot use (SUM, AVG, etc.)               Can use
Example               WHERE salary > 50000                      HAVING COUNT(*) > 3
Example:

-- Using WHERE
SELECT name
FROM employees
WHERE salary > 50000;

-- Using HAVING
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 3;
Using CASE for conditional logic:

SELECT
    name,
    CASE
        WHEN salary > 50000 THEN 'High'
        ELSE 'Low'
    END AS salary_status
FROM employees;
MySQL's NULL-safe equality operator <=> matches NULLs directly:

SELECT * FROM table WHERE column <=> NULL; -- Only TRUE if column is NULL
Data Normalization

Normalization structures tables to:
• Eliminate data redundancy (duplicate data)
• Ensure data integrity
• Maintain consistency

Normal Form               Rule                  Example
1NF (First Normal Form)   No repeating groups   Avoid arrays or multiple values in a column

Benefits:
• Reduces data redundancy (e.g., no repeated customer info in each order row)
Star Schema vs Snowflake Schema:

Feature             Star Schema                        Snowflake Schema
Structure           Central fact table with            Central fact table with
                    denormalized dimensions            normalized dimensions
Query Performance   Faster (fewer joins)               Slightly slower (more joins)
Storage             Uses more space                    Uses less space
Simplicity          Easier to understand and query     More complex
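To make the join-count difference concrete, here is a small pandas sketch; the fact and dimension tables are hypothetical toy data:

import pandas as pd

fact_sales = pd.DataFrame({"product_id": [1, 2], "amount": [100, 200]})

# Star schema: one denormalized product dimension -> a single join
dim_product_star = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Card A", "Card B"],
    "category_name": ["Credit", "Debit"],
})
star = fact_sales.merge(dim_product_star, on="product_id")

# Snowflake schema: category split into its own table -> an extra join
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Card A", "Card B"],
                            "category_id": [10, 20]})
dim_category = pd.DataFrame({"category_id": [10, 20],
                             "category_name": ["Credit", "Debit"]})
snowflake = (fact_sales.merge(dim_product, on="product_id")
                       .merge(dim_category, on="category_id"))

# Same answer either way; the snowflake form just needed one more join
print(star.equals(snowflake[star.columns]))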
Scenario: design a data warehouse for a banking domain.

Approach:
• Identify stakeholders: finance, fraud detection, compliance, marketing
• Identify key dimensions such as customers
• Example: define a central Fact Table with supporting Dimension Tables
• Source systems: core banking systems, customer CRM, external KYC data
• Use cloud DWs like Snowflake, Amazon Redshift, Google BigQuery, or on-premise options like Teradata
• Partition large fact tables for query performance
1. Compare Hadoop and Spark in terms of architecture and use cases.

1.1 Hadoop Architecture:
• Core Components: HDFS for distributed storage, YARN for resource management, MapReduce for batch processing
• Workflow: input data is split into blocks on HDFS; map and reduce stages write intermediate results to disk between steps

1.2 Spark Architecture:
• Core Components: a Driver program coordinates Executors through a cluster manager, operating on in-memory RDDs/DataFrames
• Spark Ecosystem: Spark SQL, Spark Streaming, MLlib, GraphX
1.3 Comparison Table:

Feature           Hadoop (MapReduce)             Spark
Processing Type   Batch                          Batch + Real-time
Speed             Slower (due to disk I/O)       10–100x faster (in-memory)
Ease of Use       Verbose Java MapReduce code    Supports Scala, Python, SQL
Fault Tolerance   Yes (via HDFS replication)     Yes (via lineage of RDDs)

When to Use:
• Hadoop: Archival, cold data storage, traditional batch jobs
• Spark: Real-time analytics, machine learning pipelines, interactive querying
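As a concrete taste of Spark's in-memory API, a minimal PySpark batch aggregation; the column names (txn_date, amount) and data are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn-batch").getOrCreate()

# Toy transactions table (hypothetical schema for illustration)
df = spark.createDataFrame(
    [("2024-01-01", 120.0), ("2024-01-01", 80.0), ("2024-01-02", 50.0)],
    ["txn_date", "amount"],
)

# The whole pipeline stays in memory; MapReduce would spill every
# intermediate stage to disk
daily = df.groupBy("txn_date").agg(
    F.sum("amount").alias("total_amount"),
    F.count(F.lit(1)).alias("txn_count"),  # count rows per day
)
daily.show()
spark.stop()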
2. How does partitioning work in Spark?

A partition is a logical chunk of a distributed dataset. Spark processes each partition in parallel, leading to high performance in distributed environments.

Types of Partitioning:
1. Default Partitioning: determined by the number of cores or input splits
2. Hash Partitioning: distributes records by the hash of a key, e.g. rdd.partitionBy(4)
3. Range Partitioning: distributes records across sorted key ranges

Repartition vs Coalesce (see the sketch after this list):
• repartition(n): performs a full shuffle; can increase or decrease the partition count and optimizes parallelism
• coalesce(n): avoids a full shuffle; efficient for reducing the partition count
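A short PySpark sketch of the difference; the partition counts are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)  # toy dataset

print(df.rdd.getNumPartitions())      # default partitioning

wide = df.repartition(8)              # full shuffle: can increase partitions
print(wide.rdd.getNumPartitions())    # 8

narrow = wide.coalesce(2)             # no full shuffle: merges existing partitions
print(narrow.rdd.getNumPartitions())  # 2

spark.stop()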
3. Why might you choose Parquet over CSV for storing large
datasets?
Parquet vs CSV Comparison:
Feature       CSV                                Parquet
Compression   Poor (large file sizes)            Highly compressed
Read Speed    Slower (scans whole rows)          Faster (reads only needed columns)
Types         Strings (needs manual parsing)     Strongly typed (ints, floats, etc.)
Splittable    Hard to split when compressed      Splittable for parallel processing
Why Parquet wins for large datasets:
1. Columnar Storage: queries read only the columns they need
2. Compression: column-wise encoding shrinks files dramatically
3. Schema Enforcement: column names and types are stored with the data
4. Integration: supported natively by Spark, Hive, Presto, and pandas
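A quick pandas check of these claims; the frame is a hypothetical toy example, and to_parquet assumes pyarrow or fastparquet is installed:

import os
import pandas as pd

df = pd.DataFrame({"id": range(100_000),
                   "amount": [1.5] * 100_000})

df.to_csv("demo.csv", index=False)
df.to_parquet("demo.parquet", index=False)  # needs pyarrow or fastparquet

# Parquet file is a fraction of the CSV size
print(os.path.getsize("demo.csv"), os.path.getsize("demo.parquet"))

# Parquet preserves dtypes; CSV round-trips everything through text
print(pd.read_parquet("demo.parquet").dtypes)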
Coding

How do you read and transform a large CSV file that doesn't fit in memory?

Example Code:

import pandas as pd

chunk_size = 100000
result = []

# Process the file in manageable chunks instead of loading it all at once
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):  # input path is illustrative
    chunk = chunk.dropna()  # example transformation: drop rows with nulls
    result.append(chunk)

final_df = pd.concat(result)

# Save to new file
final_df.to_csv("transformed_file.csv", index=False)
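If even the concatenated result would be too large for memory, the same chunked pattern can fold each chunk into a running aggregate instead; the file name and the amount column are hypothetical:

import pandas as pd

total = 0.0
rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    chunk = chunk.dropna(subset=["amount"])  # hypothetical numeric column
    total += chunk["amount"].sum()           # accumulate instead of storing chunks
    rows += len(chunk)

print(f"mean amount = {total / rows:.2f} over {rows} rows")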
Best Practices:

Strategy                     Code                                  When to Use
Drop missing values          df.dropna()                           Rows/columns with many nulls
Fill with constant           df.fillna(0)                          Default value like 0 or "Unknown"
Forward fill (ffill)         df.fillna(method='ffill')             Time series data
Backward fill (bfill)        df.fillna(method='bfill')             Alternative to ffill
Fill with mean/median/mode   df['col'].fillna(df['col'].mean())    Numerical columns
Check % of missing           df.isnull().mean() * 100              Decide a strategy before filling

Example:

# Fill missing ages with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill missing city names with "Unknown"
df['City'] = df['City'].fillna("Unknown")
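The strategies above in one runnable sketch, using a hypothetical toy DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 32, np.nan],
                   "City": ["Pune", None, "Delhi", None]})

print(df.isnull().mean() * 100)  # % of missing values per column

df["Age"] = df["Age"].fillna(df["Age"].mean())  # numeric: fill with mean
df["City"] = df["City"].fillna("Unknown")       # categorical: constant
# For time series, df.ffill() / df.bfill() propagate neighbouring values
print(df)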
What is a Decorator?

A decorator is a function that wraps another function to add behaviour before and after it runs, without changing its code.

def my_decorator(func):
    def wrapper():
        print("Before function runs")
        func()
        print("After function runs")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()

Output:
Before function runs
Hello!
After function runs
A decorator that times a function:

import time

def timer_decorator(func):
    def wrapper():
        start = time.time()
        func()
        print(f"Execution time: {time.time() - start:.2f} seconds")
    return wrapper

@timer_decorator
def process_data():
    time.sleep(2)
    print("Data processed")

process_data()

Output:
Data processed
Execution time: 2.00 seconds
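One refinement often probed in interviews: functools.wraps preserves the wrapped function's metadata, and *args/**kwargs lets the decorator wrap any signature. A minimal sketch:

import functools
import time

def timer_decorator(func):
    @functools.wraps(func)  # keeps func.__name__, __doc__, etc.
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)  # forward any arguments
        print(f"{func.__name__} took {time.time() - start:.2f} seconds")
        return result
    return wrapper

@timer_decorator
def process_data():
    time.sleep(1)

process_data()
print(process_data.__name__)  # "process_data", not "wrapper"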