
Interview Questions & Answers.

(SQL/Spark/HIVE/Airflow/AWS/Kafka)

1. What are the key responsibilities of a Data Engineer?


Ans:-
Interviewer: Can you briefly describe your experience with Databricks?
Over the past 4-5 years, I've primarily worked as a Data Engineer
using Databricks. My main responsibilities included data ingestion,
ETL processes, data processing with Spark, performance
optimization, collaboration with data teams, and ensuring data
security and compliance.
Interviewer: Can you explain how you typically handle data
ingestion and ETL in Databricks?
I use Azure Data Factory for scheduling and orchestrating data
ingestion tasks from various sources, such as SQL Server, Oracle
databases, and cloud storage. Within Databricks, I use PySpark to
clean, transform, and aggregate the data before loading it into Delta
Lake tables for further analysis.
Interviewer: How do you optimize the performance of Spark jobs?
Performance optimization involves several strategies. I ensure
efficient joins by partitioning data correctly and using broadcast
joins when needed. Caching intermediate data helps reduce
redundant computations. I also implement data partitioning based
on key columns and tune Spark configurations like executor
memory and cores to match the workload.
Interviewer: Can you give an example of a specific project where
you implemented optimizations?
In a retail analytics project, we had to process large volumes of sales
data. We partitioned the data by date and store ID, used broadcast
joins for lookup tables, and cached frequently accessed datasets.
These optimizations significantly reduced the query runtime and
improved overall performance.
Interviewer: How do you ensure data security and compliance in
your Databricks projects?
We implement role-based access control (RBAC) using Azure Active
Directory, ensuring only authorized users can access sensitive data.
Data is encrypted both at rest and in transit. We also use data
masking techniques to protect sensitive information during
processing and enable audit logging to track data access and
changes, which is crucial for compliance with regulations like GDPR
and HIPAA.
Interviewer: Can you describe a project where data security was a
top priority?
In a financial analytics project, we handled sensitive customer data.
We implemented strict RBAC policies, encrypted all data at rest and
in transit, and used audit logging to track data access. We also
masked sensitive information to ensure that analysts only accessed
anonymized data, ensuring compliance with industry regulations.

Interviewer: How do you collaborate with data scientists and


analysts on projects?
I work closely with data scientists to understand their requirements
and provide clean datasets for their models. We follow an iterative
process where I adjust data pipelines based on their feedback.
Regular meetings and documentation ensure everyone is aligned.
For example, in a customer segmentation project, I prepared
datasets and made adjustments based on the data scientists' needs
for feature engineering.

2. Explain the difference between OLTP and OLAP systems.


Feature    | OLTP                   | OLAP
Purpose    | Transactional systems  | Analytical systems
Query Type | Simple, short queries  | Complex, multi-table queries
Operations | INSERT, UPDATE, DELETE | SELECT (aggregations)
Schema     | Normalized             | Denormalized (Star/Snowflake)
Speed      | Milliseconds           | Seconds to minutes
Users      | Operational staff      | Analysts, Managers
Data Type  | Current, up-to-date    | Historical, aggregated

3. What are ETL and ELT? When would you use one over the
other?
Feature            | ETL                                | ELT
Transform Location | In an external tool or engine      | Inside the destination (warehouse/lake)
Load Time          | Slower (post-transformation)       | Faster (raw data loaded quickly)
Flexibility        | Less (fixed schema before loading) | More (raw + transformed data coexist)
Tooling            | Informatica, Talend, SSIS          | dbt, Spark, SQL in the warehouse
Best For           | Legacy systems, strict compliance  | Modern, scalable data stacks

4. What are the common challenges in data engineering?


Challenges that data engineers commonly face include:
Data quality and data inconsistency
Data format and schema inconsistencies
Scalability issues and poor performance
Data integration and onboarding new data sources
Building and maintaining data pipelines
Data ingestion, real-time computing, and processing
Data storage and infrastructure/cost management
Data security, governance, and compliance
Data discovery, enrichment, and transformation
Difficulty getting access to data and unclear strategy
Tool selection
Collaboration with other teams

5. Explain the concept of data partitioning in a data warehouse.


🚀 Why Partition?
✅ Faster Queries (especially for time-based or range-based filters)
✅ Reduced I/O (only scan relevant partitions)
✅ Better Parallelism (each partition can be processed in parallel)
✅ Easier Data Management (delete/load/archive per partition)
Type      | Description                                              | Example
Range     | Partitions data by a range of values (like dates, IDs)  | sales partitioned by year/month
List      | Partitions by specific values of a column               | country IN ('US', 'IN', 'UK')
Hash      | Distributes rows using a hash function (load balancing) | user_id % 4
Composite | Combines two or more strategies (e.g., range + hash)    | date range + hash on user_id
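As a quick illustration, a hedged PySpark sketch of writing a table partitioned by year and month (assumes a SparkSession `spark`; the sample rows and output path are made up):

sales_df = spark.createDataFrame(
    [(2024, 3, "S01", 120.0), (2024, 4, "S02", 80.0)],
    ["year", "month", "store_id", "amount"])

(sales_df.write
    .mode("overwrite")
    .partitionBy("year", "month")          # one sub-directory per year/month
    .parquet("/tmp/warehouse/sales"))      # hypothetical output path

# Queries that filter on the partition columns only scan the matching folders.
spark.read.parquet("/tmp/warehouse/sales").where("year = 2024 AND month = 3").show()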

6. How do you optimize a slow SQL query?

Best practices for optimizing SQL queries on large datasets:
Index the columns used in joins and WHERE clauses, and review the indexing strategy regularly
Select only the columns you need instead of SELECT *
Avoid SELECT DISTINCT when duplicates can be prevented upstream
Use proper joins and optimize join operations (join on indexed keys, ideally integer keys)
Choose EXISTS over IN, and EXISTS over COUNT, for existence checks
Apply WHERE instead of HAVING so rows are filtered before aggregation
Limit the use of wildcards, especially leading wildcards in LIKE
Avoid unnecessary subqueries
Use LIMIT and OFFSET to page through large result sets
Prefer UNION ALL to UNION when duplicate removal is not needed
Partition large tables
Utilize stored procedures for repeated logic
Review the query execution plan and monitor query performance
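Two of these practices, sketched with spark.sql as in question 8 (the `orders` and `customers` views are hypothetical):

# Apply WHERE instead of HAVING: filter rows before aggregation, not after.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE order_date >= '2024-01-01'      -- filtered before the GROUP BY
    GROUP BY customer_id
""").show()

# Choose EXISTS over IN for correlated membership checks.
spark.sql("""
    SELECT c.customer_id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
""").show()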

7. What is indexing in databases? How does it improve


performance?
Indexing is used to reduce full table scans. An index is most useful on a column with high cardinality, and it is typically created on columns that appear frequently in WHERE clauses. Indexes organize and store data in a way that accelerates the data access process.
Two types:
Clustered Index - A table can have only one clustered index. By default, a clustered index is created on the PRIMARY KEY column. When a clustered index is created, the data is physically sorted on disk.
Non-Clustered Index - A table can have multiple non-clustered indexes. A key lookup operation is performed because the index stores only key-value pairs; it does not contain the actual table data.
key = non-clustered index column data
value = physical address of the row.
https://youtu.be/09E-tVAUqQw?si=S4QjkWfq9V3uoa2y

8. Explain different types of joins in SQL with examples.


spark.sql("select a.id,a.name,b.product from cust a join prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a left join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a right join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a full join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name from cust a LEFT ANTI JOIN prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name from cust a LEFT SEMI JOIN prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name from cust a CROSS JOIN prod
b").show()

9. What is denormalization, and when should you use it?


Denormalization is a database design technique that speeds up reads by storing some data redundantly. Use it when read performance matters more than strict consistency, but be aware of the extra storage and the update anomalies that redundancy can introduce.
10. How do you handle duplicate records in SQL?
In SQL, duplicate records can be handled in several ways, including using DISTINCT, UNION, GROUP BY, ROW_NUMBER(), and CTEs (Common Table Expressions) to identify and remove duplicates.
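A hedged sketch of the ROW_NUMBER()/CTE approach in Spark SQL (the `emp` view and its columns are made up):

spark.createDataFrame(
    [(1, "2024-01-01"), (1, "2024-02-01"), (2, "2024-01-15")],
    ["emp_id", "updated_at"]).createOrReplaceTempView("emp")

spark.sql("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY updated_at DESC) AS rn
        FROM emp
    )
    SELECT * FROM ranked WHERE rn = 1   -- keep the latest row per emp_id
""").show()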
11. What is Apache Spark, and how does it compare with
Hadoop MapReduce?
Feature          | Apache Spark                    | Hadoop MapReduce
Speed            | In-memory, very fast            | Disk-based, slower
Ease of Use      | High-level APIs                 | Low-level APIs
Languages        | Scala, Python, Java, R          | Mainly Java
Processing Types | Batch, Streaming, ML, Graph     | Batch only
Performance      | Fast for iterative & real-time  | Better for very large batch jobs
Fault Tolerance  | Lineage-based recovery          | Checkpointing via HDFS
Execution Engine | DAG scheduler                   | Strict Map -> Shuffle -> Reduce

12. Explain the difference between Spark RDD, DataFrame, and


Dataset.
RDDs:-
A distributed collection of data elements without a schema. RDDs
are slower than DataFrames and Datasets for simple operations.
DataFrames:-
A distributed collection organized into named columns. DataFrames
are similar to relational database tables or Python Pandas
DataFrames.
DataFrames are faster than RDDs for exploratory analysis and
creating aggregated statistics.
Datasets:-
An extension of DataFrames with additional features like type-safety
and object-oriented interface. Datasets are faster than RDDs but
slower than DataFrames. Datasets combine the performance
optimization of DataFrames and the convenience of RDDs.

13. What is the difference between cache() and persist() in


Spark?
cache() :
It is used to cache the RDD in memory. When an RDD is cached, the
data is stored in memory so that it can be quickly accessed the next
time it is needed. This can greatly improve the performance of Spark
jobs by reducing the amount of time spent reading data from disk.
cache() is lazy: the RDD is actually cached only when an action is called for the first time. That first action runs the full lineage from start to end, so it takes longer, but subsequent actions are fast because Spark reads the cached data and only runs the remaining stages.
persist(): It is similar to cache(), but it allows you to specify where
the RDD should be stored (in memory, on disk, or both). By default,
persist() caches the RDD in memory, but you can use the
StorageLevel parameter to specify a different storage level.
Different Levels of Caching and Persistence in PySpark:
1) MEMORY_ONLY: This level stores the RDD in memory as deserialized Java objects. This is the default level used by cache() and persist() on RDDs (DataFrame.cache() defaults to MEMORY_AND_DISK). It provides fast access to the data, but if the RDD does not fit entirely in memory, the missing partitions are recomputed from the original data source when needed.
2)MEMORY_ONLY_SER: This level stores the RDD or DataFrame in
memory as serialized Java objects. This can reduce memory usage
compared to MEMORY_ONLY, but accessing the data requires
deserialization, which can be slower than using deserialized objects.
3)MEMORY_AND_DISK: This level stores the RDD or DataFrame in
memory as deserialized Java objects, but if the RDD or DataFrame
does not fit entirely in memory, it spills the excess data to disk. This
provides better performance than recomputing the data, but
accessing data from disk can be slower than accessing it from
memory.
4)MEMORY_AND_DISK_SER: This level stores the RDD or
DataFrame in memory as serialized Java objects, and spills excess
data to disk if needed. This can be useful when memory usage is a
concern, but accessing the data requires deserialization, which can
be slower than using deserialized objects.
5)DISK_ONLY: This level stores the RDD or DataFrame on disk only,
and not in memory. This can be useful when memory usage is a
concern and the data does not fit entirely in memory, but accessing
the data from disk can be slower than accessing it from memory.
How do you uncache or unpersist data in PySpark?
unpersist(): This method removes an RDD or DataFrame from memory and/or disk, whatever storage level was used when it was cached or persisted. Once it has been unpersisted, its data is no longer stored, and it must be recomputed from its original source if it is needed again.
For tables cached through the SQL interface, spark.catalog.uncacheTable("table_name") (or the UNCACHE TABLE statement) removes the cached table from memory.
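A hedged sketch of cache()/persist()/unpersist() on a toy DataFrame (assumes an existing SparkSession `spark`):

from pyspark import StorageLevel

df = spark.range(1_000_000).toDF("n")

df.persist(StorageLevel.MEMORY_AND_DISK)   # choose an explicit storage level
df.count()                                 # first action materializes the cache
df.count()                                 # subsequent actions read the cached data
df.unpersist()                             # release memory and disk

# df.cache() is equivalent to persist() with the default level
# (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).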

14. Explain Spark's shuffle operation and how to optimize it.


A shuffle in Apache Spark is the process of redistributing data across
partitions, often between different nodes in the cluster. It happens
when data needs to be repartitioned or grouped in a way that the
current partitioning can't satisfy.
✳ When does shuffle happen?
groupByKey()
reduceByKey()
join()
distinct()
repartition()
coalesce() (only when it has to shuffle, e.g. coalesce(n, shuffle=True) on an RDD to increase partitions)
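Two common ways to reduce shuffle cost, as a hedged sketch (assumes a SparkSession `spark`; the sample data is made up):

from pyspark.sql.functions import broadcast

large_df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
small_df = spark.createDataFrame([(0, "US"), (1, "IN")], ["user_id", "country"])

# 1) Broadcast the small side so the large side is not shuffled across the network.
joined = large_df.join(broadcast(small_df), on="user_id", how="left")

# 2) On RDD pairs, prefer reduceByKey (map-side combine) over groupByKey.
pair_rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pair_rdd.reduceByKey(lambda a, b: a + b)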
15. What are the different types of transformations in Spark?
Narrow Transformations
Wide Transformations
Narrow
select()
filter()
withColumn()

Wide
groupBy().agg()
join()
distinct()
orderBy()

16. Write a PySpark code to remove duplicates from a


DataFrame.
df1 = df.dropDuplicates()   # dropDuplicates(subset=[...]) deduplicates on specific columns
17. Write a PySpark code to calculate the moving average of a
column.
dfavg = df.withColumn("moving_avg",
avg("value").over(windowSpec))
18. Write a Scala program to count the number of words in a
text file using Spark.

import spark.implicits._                            // for toDF (available by default in spark-shell)
import org.apache.spark.sql.functions.{lit, sum}

val data = sc.textFile("file:///f:/data/wc.txt")    // read the text file
val flatdata = data.flatMap(x => x.split(" "))      // split each line into words
val df = flatdata.toDF("value")
val addcol = df.withColumn("id", lit(1))            // add a count column of 1s
addcol.show()
val agg = addcol.groupBy("value").agg(sum("id"))    // sum the 1s per word
agg.show()

19. Given a PySpark DataFrame, write a function to fill missing


values with the column mean.
df.na.fill(0) replaces missing values with a constant in a DataFrame, for example:
==========================
val data = Seq(
  (1, "A", 10),
  (1, "B", 20),
  (2, "A", 30),
  (2, "B", 40),
  (3, "C", 50),
  (3, "D", 60)
).toDF("ID", "col1", "col2")
data.show(false)

val df = data.groupBy("ID").pivot("col1").agg(sum("col2")).na.fill(0)
df.show()

val seldf = df.select("ID", "A", "B", "C", "D")
seldf.show()
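The question as asked (fill missing values with the column mean) can be sketched in PySpark as follows; the helper name and sample data are made up:

from pyspark.sql.functions import mean

def fill_with_mean(df, numeric_col):
    mean_value = df.select(mean(numeric_col)).first()[0]   # compute the column mean
    return df.na.fill({numeric_col: mean_value})           # replace nulls with it

df = spark.createDataFrame([(1, 10.0), (2, None), (3, 30.0)], ["id", "col2"])
fill_with_mean(df, "col2").show()   # the null in col2 becomes 20.0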

20. Write a PySpark program to group data by a column and


calculate the sum of another column.
val data = Seq(
  (1, "A", 10),
  (1, "B", 20),
  (2, "A", 30),
  (2, "B", 40),
  (3, "C", 50),
  (3, "D", 60)
).toDF("ID", "col1", "col2")
data.show(false)

val df = data.groupBy("ID").pivot("col1").agg(sum("col2")).na.fill(0)
df.show()
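The same grouping in PySpark, as the question asks (a hedged sketch using the sample data above):

from pyspark.sql.functions import sum as _sum

pydf = spark.createDataFrame(
    [(1, "A", 10), (1, "B", 20), (2, "A", 30),
     (2, "B", 40), (3, "C", 50), (3, "D", 60)],
    ["ID", "col1", "col2"])

pydf.groupBy("ID").agg(_sum("col2").alias("total_col2")).show()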

21. What is the difference between partitioning and bucketing


in Hive?
Partitioning: Partitioning divides a table into smaller, more
manageable parts based on the values of one or more columns. Each
partition is stored as a subdirectory within the table's directory in
HDFS, making it easier to query specific subsets of the data. For
example, you might partition a table of sales data by year, month,
and day to facilitate queries that filter on these date components.

Bucketing: Bucketing, also known as clustering, divides data into a


fixed number of buckets based on a hash function applied to one or
more columns. Bucketing is useful for evenly distributing data
across buckets, which can improve query performance for certain
types of operations, such as joins and aggregations.
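For illustration, a hedged Spark-side sketch of the two layouts (Hive DDL achieves the same with PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS); the sample data and table names are made up, and bucketBy() requires saveAsTable():

sales_df = spark.createDataFrame(
    [(2024, 3, 101, 120.0), (2024, 4, 102, 80.0)],
    ["year", "month", "customer_id", "amount"])

# Partitioning: one sub-directory per (year, month).
sales_df.write.mode("overwrite").partitionBy("year", "month").saveAsTable("sales_partitioned")

# Bucketing: a fixed number of files, hashed on customer_id.
(sales_df.write.mode("overwrite")
    .bucketBy(8, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("sales_bucketed"))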

22. How does Hive store data internally?


Component     | Description
Data Location | HDFS (or external systems like S3)
Format        | Text, ORC, Parquet, etc.
Schema        | Stored in the Hive Metastore
Partitioning  | Data stored in a directory hierarchy
Bucketing     | Data split into a fixed number of files

23. What are different file formats used in Hive?


TextFile (default, row-based, inefficient)
ORC (Optimized Row Columnar, best for Hive)
Parquet (columnar, good for interoperability)
Avro, SequenceFile, RCFile, etc.

24. Explain dynamic partitioning in Hive with an example.


Partitioning and Types
Hive supports partitioning by dividing the data into multiple folders or sub-folders for efficient querying. Low-cardinality columns (columns with few distinct values) are chosen as partition columns.
Partitions are folders.
Types - Static and Dynamic
Static: Static partitioning involves manually specifying the partition values when inserting data into a Hive table. It's like organizing data into predefined folders.
CREATE TABLE sales_data (
  transaction_id INT,
  product_id INT,
  amount DECIMAL(10, 2)
)
PARTITIONED BY (date STRING, country STRING);

-- Insert data into the table with partition values specified
INSERT INTO sales_data PARTITION (date='2024-03-23', country='USA')
VALUES (1, 101, 100.50);

Dynamic :-Dynamic partitioning allows Hive to automatically


determine partition values based on the data being inserted. It’s like
an automatic sorting system where Hive creates partitions on-the-
fly.

-- Create a table with partition columns (dynamic partitioning)


CREATE TABLE sales_data_dynamic (
transaction_id INT,
product_id INT,
amount DECIMAL(10, 2)
)
PARTITIONED BY (date STRING, country STRING);

-- Enable dynamic partition mode


SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Insert data into the table without specifying partition values
INSERT INTO sales_data_dynamic PARTITION (date, country)
VALUES (1, 101, 100.50, '2024-03-23', 'USA');

25. How does indexing work in Hive?


Indexing: One of the Hive query optimization methods is the Hive index. A Hive index is used to speed up access to a column or set of columns, because with an index the engine does not need to read all rows in the table to find the selected data.
CREATE INDEX table01_index ON TABLE table01 (column2) AS
'COMPACT';
(Note: Hive indexes were deprecated and removed in Hive 3.0; columnar formats such as ORC, with built-in min/max statistics, and materialized views are used instead.)

26. What is Apache Airflow, and how is it used in data


pipelines?
Concept   | Description
DAG       | A Directed Acyclic Graph that defines the flow of tasks and their dependencies.
Task      | A single unit of work (can be a Bash, Python, SQL, or custom operator).
Operator  | Defines what kind of task it is (e.g., PythonOperator, BashOperator, SparkSubmitOperator).
Scheduler | Triggers tasks based on DAG definitions and schedule intervals.
Executor  | Runs tasks (e.g., Sequential, Local, Celery, Kubernetes).
Web UI    | Provides a dashboard to monitor and manage DAGs and tasks.

27. Explain how to schedule a pipeline in Airflow.


DAG & schedule interval
28. What are DAGs in Airflow?
A DAG (directed acyclic graph) is a mathematical structure
consisting of nodes and edges. In Airflow, a DAG represents a data
pipeline or workflow with a start and an end. The mathematical
properties of DAGs make them useful for building data pipelines:
Directed: There is a clear direction of flow between tasks.
Acyclic: There are no cycles, so a workflow always terminates and a task never depends on itself downstream.
Graph: Tasks are nodes and dependencies are edges.
29. How would you handle a failed task in an Airflow DAG?
1. Retries:
Airflow allows tasks to be configured with a specific number of
retries. If a task fails within a retry window, it will automatically be
retried after a configurable delay.
2. Clearing Task Status:
If a task fails and the retry mechanism is exhausted, you can
manually clear the task status in the Airflow UI. This essentially
resets the task instance, allowing it to be rerun from scratch. You
can select options to clear the task in the current DAG run or across
multiple runs.
3. Trigger Rules:
Trigger rules define the conditions under which a task should be
executed. By default, a task runs if all upstream tasks succeed
(all_success). However, you can use trigger rules like all_failed to
execute a task only when all upstream tasks have failed, or
one_failed to execute it if at least one upstream task has failed.
4. Error Handling:
Airflow provides various mechanisms for error handling, including:
Callbacks: You can define callback functions that are triggered when
a task fails or succeeds. These callbacks can be used to send
notifications, log errors, or trigger other actions.
Alerts: You can configure email alerts that are triggered when tasks
fail or are retried.
5. DAG Level Failure:
If a DAG as a whole needs to be alerted or handled upon failure, you
can define a failure_callback function at the DAG level. This callback
will be triggered if any task in the DAG fails.
6. Monitoring and Logging:
Airflow provides built-in logging and monitoring capabilities. You
can access task logs to diagnose the cause of failures and monitor
the overall status of your DAG runs.
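A hedged sketch of the retry and callback settings described above (the notification function and email address are made up):

from datetime import timedelta

def notify_on_failure(context):
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed in DAG {ti.dag_id}")

default_args = {
    "retries": 3,                          # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
    "on_failure_callback": notify_on_failure,
    "email_on_failure": True,
    "email": ["data-team@example.com"],
}
# Passed to the DAG as: DAG(..., default_args=default_args)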
30. What is backfilling in Apache Airflow?
Backfill is a very useful feature in Airflow. It allows you to re-run a DAG, or specific tasks in a DAG, for past dates (for example with the airflow dags backfill CLI command), so historical intervals are processed as if they had run on schedule.

31. What is the difference between AWS S3 and HDFS?


Feature              | S3             | HDFS                 | S3 vs HDFS
Elasticity           | Yes            | No                   | S3 is more elastic
Cost/TB/month        | $23            | $206                 | 10X
Availability         | 99.99%         | 99.9% (estimated)    | 10X
Durability           | 100.00%        | 99.9999% (estimated) | 10X+
Transactional writes | Yes, with DBIO | Yes                  | Comparable

32. How does AWS Glue work, and when would you use it?
1. Data Discovery:
AWS Glue uses crawlers to automatically discover and catalog data
from various sources, such as Amazon S3, DynamoDB, and relational
databases. These crawlers infer the schema and structure of your
data, automatically creating metadata descriptions that are stored in
the AWS Glue Data Catalog.
2. Data Preparation (ETL):
Once data is discovered and cataloged, AWS Glue provides tools to
prepare it for analysis. You can create and run ETL jobs to extract
data from various sources, transform it to the desired format, and
load it into a target destination.
3. Visual and Code-Based Workflows:
AWS Glue offers both visual and code-based interfaces for building
ETL workflows. You can use AWS Glue Studio to visually design ETL
pipelines and manage your data, while also using code-based
approaches with languages like Python or Scala for more complex
transformations.
4. Serverless Architecture:
AWS Glue operates on a serverless architecture, meaning you don't
have to manage servers or infrastructure. It automatically scales
resources as needed and handles job execution, monitoring, and
error handling.

33. Explain how AWS Lambda can be used in data processing.

1. Event-Driven Data Processing:


Lambda functions are triggered by specific events, such as a new file
being uploaded to S3, a record being added to a Kinesis stream, or a
change in a DynamoDB table.
This event-driven architecture allows you to process data as it
becomes available, enabling real-time data processing.

2. Serverless Architecture:
Lambda eliminates the need for you to provision, manage, or scale
servers.
You pay only for the compute time your code consumes, making it
cost-effective, especially for intermittent or burst processing
workloads.

3. Real-time Data Transformation:


You can use Lambda to transform data on the fly. For example,
convert images from color to grayscale as soon as they are uploaded
to S3.
Lambda functions can also process streaming data in real-time using
services like Kinesis, enabling tasks like clickstream analysis, log
filtering, or social media monitoring.

4. ETL (Extract, Transform, Load):


Lambda can be used to extract data from one source, transform it
according to your needs, and load it into another destination.
This is particularly useful for building data pipelines that move data
between different AWS services or external data stores.

5. Integration with Other AWS Services:


Lambda integrates seamlessly with other AWS services, allowing
you to build complex data processing pipelines.
For example, you can use AWS Step Functions to orchestrate a series
of Lambda functions to perform a complex ETL process.

6. Scaling and High Availability:


Lambda automatically scales to handle varying workloads, ensuring
that your data processing functions can handle peak loads without
you needing to manage infrastructure.
Lambda also provides high availability, ensuring that your data
processing functions remain operational even if one server fails.

34. What is the difference between Redshift and Athena?

Amazon Redshift is a managed data warehouse: data is loaded into Redshift's own storage and queried over dedicated (or serverless) compute, which suits heavy, repeated BI workloads. Amazon Athena is a serverless, pay-per-query engine that runs SQL directly on data in S3 with no infrastructure to manage, which suits ad hoc and infrequent queries over a data lake.

35. How do you optimize an S3 data lake for faster querying?


Best Practice Area                    | Key Takeaways
1. Data Organization and Management   | Structuring data for scalability, efficient partitioning strategies, and consistent naming conventions.
2. Security and Compliance            | Implementing IAM policies, data encryption methods, and compliance with data regulations.
3. Performance Optimization           | Choosing the right storage classes, data caching techniques, and query performance tuning.
4. Cost Management                    | Monitoring storage costs, implementing data lifecycle policies, and using cost-effective data formats.
5. Data Integration and Accessibility | Integrating with other AWS services, providing robust APIs, and supporting multiple data consumers.
6. Data Quality and Reliability       | Ensuring data accuracy, implementing validation checks, and managing data redundancy.
7. Scalability and Elasticity         | Auto-scaling strategies, managing peak loads, and planning for data growth.
8. Backup and Disaster Recovery       | Regular data backups, disaster recovery planning, and cross-region replication.
9. Monitoring and Logging             | Setting up monitoring dashboards, log management and analysis, and alerting mechanisms.
10. Future-Proofing and Innovation    | Embracing new AWS features, preparing for technological advances, and continuous learning.

36. What is Apache Kafka, and how does it work?


Apache Kafka (Kafka) is an open-source, distributed streaming
platform that enables (among other things) the development of real-
time, event-driven applications and user experiences on the web.
It enables applications to publish or subscribe to data or event
streams.
It stores records accurately (i.e., in the order in which they
occurred) in a fault-tolerant and durable way.
It processes records in real time (as they occur).
Apache Spark
Apache Spark is an analytics engine for large-scale data processing.
You can use Spark to perform analytics on streams delivered by
Apache Kafka and to produce real-time stream processing
applications, such as the aforementioned click-stream analysis.

Apache NiFi
Apache NiFi is a data flow management system with a visual, drag-
and-drop interface. Because NiFi can run as a Kafka producer and a
Kafka consumer, it’s an ideal tool for managing data flow challenges
that Kafka can’t address.

Apache Flink
Apache Flink is an engine for performing computations on event
streams at scale, with consistently high speed and low latency. Flink
can ingest streams as a Kafka consumer, perform operations based
on these streams in real time and publish the results to Kafka or to
another application.

Apache Hadoop
Apache Hadoop is a distributed software framework that lets you
store massive amounts of data in a cluster of computers for use in
big data analytics, machine learning, data mining and other data-
driven applications that process structured and unstructured data.
Kafka is often used to create a real-time streaming data pipeline to a
Hadoop cluster.

37. Explain the difference between Kafka Producer and


Consumer.
In Apache Kafka, Producers write (publish) data to topics, while
Consumers read (consume) data from those topics. Think of
producers as the publishers of messages, and consumers as the
subscribers or readers. The key difference lies in their purpose:
producers add data to the system, and consumers retrieve data.

38. How do you ensure message ordering in Kafka?


In order to guarantee ordered message delivery with the Kafka
messaging broker, messages can be produced with a key.
Messages with the same key are written to the same topic partition,
and as consumers read messages in order from each partition, their
ordering is preserved.
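A hedged sketch using the kafka-python client (the broker address, topic, and key are made up); because every message uses the same key, they all go to the same partition and are consumed in order:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for event in ("created", "updated", "shipped"):
    producer.send("orders", key=b"order-42", value=event.encode("utf-8"))
producer.flush()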

39. What are Kafka topics and partitions?


In Apache Kafka, topics are categories or streams of data, and
partitions are how topics are divided for scalability and parallel
processing. A topic is a logical channel for producers to write
messages and consumers to read them, while partitions allow a
single topic to be distributed across multiple Kafka brokers.

40. How does Kafka handle fault tolerance?


Kafka is a distributed system. The topic is divided into partitions
and kept in different brokers. If any broker fails, data should not be
lost. For fault-tolerance purposes, the partition is replicated and
stored in different brokers. If the leader broker fails, the controller elects one of the replicas as the new leader. The controller broker itself can also fail; in that case, ZooKeeper helps elect another broker as the controller.
41. What are some common Spark performance optimization
techniques?
- Use DataFrame/Dataset over RDD
- Use coalesce() over repartition() when reducing the number of partitions
- Use mapPartitions() over map()
- Use serialized data formats
- Avoid UDFs (User Defined Functions) when built-in functions exist
- Cache data in memory
- Reduce expensive shuffle operations
- Disable DEBUG & INFO logging

https://sparkbyexamples.com/spark/spark-performance-tuning/

42. How do you optimize data storage in a data lake?


- Source hardware. Whether you're using on-premises machines or Virtual Machines (VMs) in Azure, make sure to carefully select the appropriate hardware. ...
- Network connectivity to the storage account. ...
- Configure data ingestion tools for maximum parallelization. ...
- File formats. ...
- File size. ...
- Directory structure.

43. Explain the role of file formats like Parquet and ORC in
performance tuning.
Parquet:
Parquet is often preferred for write-once, read-many scenarios and
when efficient data compression is a priority. It also offers broader
compatibility, supporting multiple programming languages and
query engines.
ORC:
ORC is particularly well-suited for Hive-centric environments and
when ACID transactions are needed. It also excels in certain query
types due to its advanced indexing and optimization features.

44. What is vectorization in Pandas and Spark, and how does it


help in performance?
1. Vectorization in Pandas:
Batch Processing:
Pandas provides functions that can operate on entire Series (1D
arrays) or DataFrames (2D arrays) at once.
Performance Gains:
Vectorized operations are often implemented with optimized C
code, which significantly speeds up calculations compared to Python
loops.

Types of Pandas UDFs:


Scalar Pandas UDFs: Used for vectorizing scalar operations (e.g.,
applying a function to a single column).
Grouped Map Pandas UDFs: Designed for split-apply-combine
operations (e.g., applying a function to different groups of data).
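A hedged sketch of a scalar Pandas UDF in PySpark (requires pyarrow; the sample data, column name, and tax rate are made up). The function receives a whole pandas Series per batch, so the arithmetic is vectorized instead of running row by row:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def add_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18

df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount"])
df.withColumn("amount_with_tax", add_tax("amount")).show()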

45. What are best practices for optimizing SQL queries on large
datasets?
This is the same checklist as in question 6: index the join and filter columns, select only the columns you need, avoid SELECT * and SELECT DISTINCT, filter with WHERE before aggregating instead of HAVING, prefer EXISTS over IN and UNION ALL over UNION, limit wildcards, partition large tables, use LIMIT/OFFSET for paging, consider stored procedures for repeated logic, and review the execution plan while monitoring query performance.
