
Interview Questions & Answers.

(SQL/Spark/HIVE/Airflow/AWS/Kafka)

1. What are the key responsibilities of a Data Engineer?


Ans:-
Interviewer: Can you briefly describe your experience with Databricks?
Over the past 4-5 years, I've primarily worked as a Data Engineer
using Databricks. My main responsibilities included data ingestion,
ETL processes, data processing with Spark, performance
optimization, collaboration with data teams, and ensuring data
security and compliance.
Interviewer: Can you explain how you typically handle data
ingestion and ETL in Databricks?
I use Azure Data Factory for scheduling and orchestrating data
ingestion tasks from various sources, such as SQL Server, Oracle
databases, and cloud storage. Within Databricks, I use PySpark to
clean, transform, and aggregate the data before loading it into Delta
Lake tables for further analysis.
Interviewer: How do you optimize the performance of Spark jobs?
Performance optimization involves several strategies. I ensure
efficient joins by partitioning data correctly and using broadcast
joins when needed. Caching intermediate data helps reduce
redundant computations. I also implement data partitioning based
on key columns and tune Spark configurations like executor
memory and cores to match the workload.
Interviewer: Can you give an example of a specific project where
you implemented optimizations?
In a retail analytics project, we had to process large volumes of sales
data. We partitioned the data by date and store ID, used broadcast
joins for lookup tables, and cached frequently accessed datasets.
These optimizations significantly reduced the query runtime and
improved overall performance.
Interviewer: How do you ensure data security and compliance in
your Databricks projects?
We implement role-based access control (RBAC) using Azure Active
Directory, ensuring only authorized users can access sensitive data.
Data is encrypted both at rest and in transit. We also use data
masking techniques to protect sensitive information during
processing and enable audit logging to track data access and
changes, which is crucial for compliance with regulations like GDPR
and HIPAA.
Interviewer: Can you describe a project where data security was a
top priority?
In a financial analytics project, we handled sensitive customer data.
We implemented strict RBAC policies, encrypted all data at rest and
in transit, and used audit logging to track data access. We also
masked sensitive information to ensure that analysts only accessed
anonymized data, ensuring compliance with industry regulations.

Interviewer: How do you collaborate with data scientists and


analysts on projects?
I work closely with data scientists to understand their requirements
and provide clean datasets for their models. We follow an iterative
process where I adjust data pipelines based on their feedback.
Regular meetings and documentation ensure everyone is aligned.
For example, in a customer segmentation project, I prepared
datasets and made adjustments based on the data scientists' needs
for feature engineering.

2. Explain the difference between OLTP and OLAP systems.


Feature    | OLTP                   | OLAP
Purpose    | Transactional systems  | Analytical systems
Query Type | Simple, short queries  | Complex, multi-table queries
Operations | INSERT, UPDATE, DELETE | SELECT (aggregations)
Schema     | Normalized             | Denormalized (Star/Snowflake)
Speed      | Milliseconds           | Seconds to minutes
Users      | Operational staff      | Analysts, Managers
Data Type  | Current, up-to-date    | Historical, aggregated

3. What are ETL and ELT? When would you use one over the
other?
Feature            | ETL                                | ELT
Transform Location | In an external tool or engine      | Inside the destination (warehouse/lake)
Load Time          | Slower (post-transformation)       | Faster (raw data loaded quickly)
Flexibility        | Less (fixed schema before loading) | More (raw + transformed data coexist)
Tooling            | Informatica, Talend, SSIS          | dbt, Spark, SQL in the warehouse
Best For           | Legacy systems, strict compliance  | Modern, scalable data stacks

4. What are the common challenges in data engineering?


Challenges that data engineers commonly face include:
Data quality and data inconsistency
Data format and schema inconsistencies
Scalability issues and poor performance
Data integration and onboarding new data sources
Building and maintaining data pipelines
Data ingestion, real-time computing, and processing
Data storage and infrastructure/cost management
Data security, governance, and compliance
Data discovery, enrichment, and transformation
Difficulty getting access to data and unclear strategy
Tool selection
Collaboration with other teams

5. Explain the concept of data partitioning in a data warehouse.


🚀 Why Partition?
✅ Faster Queries (especially for time-based or range-based filters)
✅ Reduced I/O (only scan relevant partitions)
✅ Better Parallelism (each partition can be processed in parallel)
✅ Easier Data Management (delete/load/archive per partition)
Type      | Description                                              | Example
Range     | Partitions data by a range of values (like dates, IDs)  | sales partitioned by year/month
List      | Partitions by specific values of a column               | country IN ('US', 'IN', 'UK')
Hash      | Distributes rows using a hash function (load balancing) | user_id % 4
Composite | Combines two or more strategies (e.g., range + hash)    | date range + hash on user_id
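As a quick illustration, a hedged PySpark sketch of writing a table partitioned by year and month (assumes a SparkSession `spark`; the sample rows and output path are made up):

sales_df = spark.createDataFrame(
    [(2024, 3, "S01", 120.0), (2024, 4, "S02", 80.0)],
    ["year", "month", "store_id", "amount"])

(sales_df.write
    .mode("overwrite")
    .partitionBy("year", "month")          # one sub-directory per year/month
    .parquet("/tmp/warehouse/sales"))      # hypothetical output path

# Queries that filter on the partition columns only scan the matching folders.
spark.read.parquet("/tmp/warehouse/sales").where("year = 2024 AND month = 3").show()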

6. How do you optimize a slow SQL query?

Best practices for optimizing SQL queries on large datasets:
Index the columns used in joins and WHERE clauses, and review the indexing strategy regularly
Select only the columns you need instead of SELECT *
Avoid SELECT DISTINCT when duplicates can be prevented upstream
Use proper joins and optimize join operations (join on indexed keys, ideally integer keys)
Choose EXISTS over IN, and EXISTS over COUNT, for existence checks
Apply WHERE instead of HAVING so rows are filtered before aggregation
Limit the use of wildcards, especially leading wildcards in LIKE
Avoid unnecessary subqueries
Use LIMIT and OFFSET to page through large result sets
Prefer UNION ALL to UNION when duplicate removal is not needed
Partition large tables
Utilize stored procedures for repeated logic
Review the query execution plan and monitor query performance
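Two of these practices, sketched with spark.sql as in question 8 (the `orders` and `customers` views are hypothetical):

# Apply WHERE instead of HAVING: filter rows before aggregation, not after.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE order_date >= '2024-01-01'      -- filtered before the GROUP BY
    GROUP BY customer_id
""").show()

# Choose EXISTS over IN for correlated membership checks.
spark.sql("""
    SELECT c.customer_id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
""").show()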

7. What is indexing in databases? How does it improve


performance?
Indexing is used to reduce full table scans. An index is most useful on a column with high cardinality, and it is typically created on columns that appear frequently in WHERE clauses. Indexes organize and store data in a way that accelerates the data access process.
Two types:
Clustered Index - A table can have only one clustered index. By default, a clustered index is created on the PRIMARY KEY column. When a clustered index is created, the data is physically sorted on disk.
Non-Clustered Index - A table can have multiple non-clustered indexes. A key lookup operation is performed because the index stores only key-value pairs; it does not contain the actual table data.
key = non-clustered index column data
value = physical address of the row.
https://youtu.be/09E-tVAUqQw?si=S4QjkWfq9V3uoa2y

8. Explain different types of joins in SQL with examples.


spark.sql("select a.id,a.name,b.product from cust a join prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a left join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a right join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name,b.product from cust a full join prod b
on a.id=b.id").show()
spark.sql("select a.id,a.name from cust a LEFT ANTI JOIN prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name from cust a LEFT SEMI JOIN prod b on
a.id=b.id").show()
spark.sql("select a.id,a.name from cust a CROSS JOIN prod
b").show()

9. What is denormalization, and when should you use it?


Denormalization is a database design technique that speeds up reads by storing some data redundantly. Use it when read performance matters more than strict consistency, but be aware of the extra storage and the update anomalies that redundancy can introduce.
10. How do you handle duplicate records in SQL?
In SQL, duplicate records can be handled in several ways, including using DISTINCT, UNION, GROUP BY, ROW_NUMBER(), and CTEs (Common Table Expressions) to identify and remove duplicates.
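A hedged sketch of the ROW_NUMBER()/CTE approach in Spark SQL (the `emp` view and its columns are made up):

spark.createDataFrame(
    [(1, "2024-01-01"), (1, "2024-02-01"), (2, "2024-01-15")],
    ["emp_id", "updated_at"]).createOrReplaceTempView("emp")

spark.sql("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY updated_at DESC) AS rn
        FROM emp
    )
    SELECT * FROM ranked WHERE rn = 1   -- keep the latest row per emp_id
""").show()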
11. What is Apache Spark, and how does it compare with
Hadoop MapReduce?
Feature          | Apache Spark                    | Hadoop MapReduce
Speed            | In-memory, very fast            | Disk-based, slower
Ease of Use      | High-level APIs                 | Low-level APIs
Languages        | Scala, Python, Java, R          | Mainly Java
Processing Types | Batch, Streaming, ML, Graph     | Batch only
Performance      | Fast for iterative & real-time  | Better for very large batch jobs
Fault Tolerance  | Lineage-based recovery          | Checkpointing via HDFS
Execution Engine | DAG scheduler                   | Strict Map -> Shuffle -> Reduce

12. Explain the difference between Spark RDD, DataFrame, and


Dataset.
RDDs:-
A distributed collection of data elements without a schema. RDDs
are slower than DataFrames and Datasets for simple operations.
DataFrames:-
A distributed collection organized into named columns. DataFrames
are similar to relational database tables or Python Pandas
DataFrames.
DataFrames are faster than RDDs for exploratory analysis and
creating aggregated statistics.
Datasets:-
An extension of DataFrames with additional features like type-safety
and object-oriented interface. Datasets are faster than RDDs but
slower than DataFrames. Datasets combine the performance
optimization of DataFrames and the convenience of RDDs.

13. What is the difference between cache() and persist() in


Spark?
cache() :
It is used to cache the RDD in memory. When an RDD is cached, the
data is stored in memory so that it can be quickly accessed the next
time it is needed. This can greatly improve the performance of Spark
jobs by reducing the amount of time spent reading data from disk.
cache() is lazy: the RDD is actually cached only when an action is called for the first time. That first action runs the full lineage from start to end, so it takes longer, but subsequent actions are fast because Spark reads the cached data and only runs the remaining stages.
persist(): It is similar to cache(), but it allows you to specify where
the RDD should be stored (in memory, on disk, or both). By default,
persist() caches the RDD in memory, but you can use the
StorageLevel parameter to specify a different storage level.
Different Levels of Caching and Persistence in PySpark:
1) MEMORY_ONLY: This level stores the RDD in memory as deserialized Java objects. This is the default level used by cache() and persist() on RDDs (DataFrame.cache() defaults to MEMORY_AND_DISK). It provides fast access to the data, but if the RDD does not fit entirely in memory, the missing partitions are recomputed from the original data source when needed.
2)MEMORY_ONLY_SER: This level stores the RDD or DataFrame in
memory as serialized Java objects. This can reduce memory usage
compared to MEMORY_ONLY, but accessing the data requires
deserialization, which can be slower than using deserialized objects.
3)MEMORY_AND_DISK: This level stores the RDD or DataFrame in
memory as deserialized Java objects, but if the RDD or DataFrame
does not fit entirely in memory, it spills the excess data to disk. This
provides better performance than recomputing the data, but
accessing data from disk can be slower than accessing it from
memory.
4)MEMORY_AND_DISK_SER: This level stores the RDD or
DataFrame in memory as serialized Java objects, and spills excess
data to disk if needed. This can be useful when memory usage is a
concern, but accessing the data requires deserialization, which can
be slower than using deserialized objects.
5)DISK_ONLY: This level stores the RDD or DataFrame on disk only,
and not in memory. This can be useful when memory usage is a
concern and the data does not fit entirely in memory, but accessing
the data from disk can be slower than accessing it from memory.
How do you uncache or unpersist data in PySpark?
unpersist(): This method removes an RDD or DataFrame from memory and/or disk, whatever storage level was used when it was cached or persisted. Once it has been unpersisted, its data is no longer stored, and it must be recomputed from its original source if it is needed again.
For tables cached through the SQL interface, spark.catalog.uncacheTable("table_name") (or the UNCACHE TABLE statement) removes the cached table from memory.
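A hedged sketch of cache()/persist()/unpersist() on a toy DataFrame (assumes an existing SparkSession `spark`):

from pyspark import StorageLevel

df = spark.range(1_000_000).toDF("n")

df.persist(StorageLevel.MEMORY_AND_DISK)   # choose an explicit storage level
df.count()                                 # first action materializes the cache
df.count()                                 # subsequent actions read the cached data
df.unpersist()                             # release memory and disk

# df.cache() is equivalent to persist() with the default level
# (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).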

14. Explain Spark's shuffle operation and how to optimize it.


A shuffle in Apache Spark is the process of redistributing data across
partitions, often between different nodes in the cluster. It happens
when data needs to be repartitioned or grouped in a way that the
current partitioning can't satisfy.
✳ When does shuffle happen?
groupByKey()
reduceByKey()
join()
distinct()
repartition()
coalesce() (only when it has to shuffle, e.g. coalesce(n, shuffle=True) on an RDD to increase partitions)
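Two common ways to reduce shuffle cost, as a hedged sketch (assumes a SparkSession `spark`; the sample data is made up):

from pyspark.sql.functions import broadcast

large_df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
small_df = spark.createDataFrame([(0, "US"), (1, "IN")], ["user_id", "country"])

# 1) Broadcast the small side so the large side is not shuffled across the network.
joined = large_df.join(broadcast(small_df), on="user_id", how="left")

# 2) On RDD pairs, prefer reduceByKey (map-side combine) over groupByKey.
pair_rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pair_rdd.reduceByKey(lambda a, b: a + b)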
15. What are the different types of transformations in Spark?
Narrow Transformations
Wide Transformations
Narrow
select()
filter()
withColumn()

Wide
groupBy().agg()
join()
distinct()
orderBy()

16. Write a PySpark code to remove duplicates from a


DataFrame.
df1 = df.dropDuplicates()   # dropDuplicates(subset=[...]) deduplicates on specific columns
17. Write a PySpark code to calculate the moving average of a
column.
dfavg = df.withColumn("moving_avg",
avg("value").over(windowSpec))
18. Write a Scala program to count the number of words in a
text file using Spark.

import spark.implicits._                            // for toDF (available by default in spark-shell)
import org.apache.spark.sql.functions.{lit, sum}

val data = sc.textFile("file:///f:/data/wc.txt")    // read the text file
val flatdata = data.flatMap(x => x.split(" "))      // split each line into words
val df = flatdata.toDF("value")
val addcol = df.withColumn("id", lit(1))            // add a count column of 1s
addcol.show()
val agg = addcol.groupBy("value").agg(sum("id"))    // sum the 1s per word
agg.show()

19. Given a PySpark DataFrame, write a function to fill missing


values with the column mean.
df.na.fill(0) replaces missing values with a constant in a DataFrame, for example:
==========================
val data = Seq(
  (1, "A", 10),
  (1, "B", 20),
  (2, "A", 30),
  (2, "B", 40),
  (3, "C", 50),
  (3, "D", 60)
).toDF("ID", "col1", "col2")
data.show(false)

val df = data.groupBy("ID").pivot("col1").agg(sum("col2")).na.fill(0)
df.show()

val seldf = df.select("ID", "A", "B", "C", "D")
seldf.show()
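The question as asked (fill missing values with the column mean) can be sketched in PySpark as follows; the helper name and sample data are made up:

from pyspark.sql.functions import mean

def fill_with_mean(df, numeric_col):
    mean_value = df.select(mean(numeric_col)).first()[0]   # compute the column mean
    return df.na.fill({numeric_col: mean_value})           # replace nulls with it

df = spark.createDataFrame([(1, 10.0), (2, None), (3, 30.0)], ["id", "col2"])
fill_with_mean(df, "col2").show()   # the null in col2 becomes 20.0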

20. Write a PySpark program to group data by a column and


calculate the sum of another column.
val data = Seq(
  (1, "A", 10),
  (1, "B", 20),
  (2, "A", 30),
  (2, "B", 40),
  (3, "C", 50),
  (3, "D", 60)
).toDF("ID", "col1", "col2")
data.show(false)

val df = data.groupBy("ID").pivot("col1").agg(sum("col2")).na.fill(0)
df.show()
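The same grouping in PySpark, as the question asks (a hedged sketch using the sample data above):

from pyspark.sql.functions import sum as _sum

pydf = spark.createDataFrame(
    [(1, "A", 10), (1, "B", 20), (2, "A", 30),
     (2, "B", 40), (3, "C", 50), (3, "D", 60)],
    ["ID", "col1", "col2"])

pydf.groupBy("ID").agg(_sum("col2").alias("total_col2")).show()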

21. What is the difference between partitioning and bucketing


in Hive?
Partitioning: Partitioning divides a table into smaller, more
manageable parts based on the values of one or more columns. Each
partition is stored as a subdirectory within the table's directory in
HDFS, making it easier to query specific subsets of the data. For
example, you might partition a table of sales data by year, month,
and day to facilitate queries that filter on these date components.

Bucketing: Bucketing, also known as clustering, divides data into a


fixed number of buckets based on a hash function applied to one or
more columns. Bucketing is useful for evenly distributing data
across buckets, which can improve query performance for certain
types of operations, such as joins and aggregations.
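For illustration, a hedged Spark-side sketch of the two layouts (Hive DDL achieves the same with PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS); the sample data and table names are made up, and bucketBy() requires saveAsTable():

sales_df = spark.createDataFrame(
    [(2024, 3, 101, 120.0), (2024, 4, 102, 80.0)],
    ["year", "month", "customer_id", "amount"])

# Partitioning: one sub-directory per (year, month).
sales_df.write.mode("overwrite").partitionBy("year", "month").saveAsTable("sales_partitioned")

# Bucketing: a fixed number of files, hashed on customer_id.
(sales_df.write.mode("overwrite")
    .bucketBy(8, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("sales_bucketed"))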

22. How does Hive store data internally?


Component     | Description
Data Location | HDFS (or external systems like S3)
Format        | Text, ORC, Parquet, etc.
Schema        | Stored in the Hive Metastore
Partitioning  | Data stored in a directory hierarchy
Bucketing     | Data split into a fixed number of files

23. What are different file formats used in Hive?


TextFile (default, row-based, inefficient)
ORC (Optimized Row Columnar, best for Hive)
Parquet (columnar, good for interoperability)
Avro, SequenceFile, RCFile, etc.

24. Explain dynamic partitioning in Hive with an example.


Partitioning and Types
Hive supports partitioning by dividing the data into multiple folders or sub-folders for efficient querying. Low-cardinality columns (columns with few distinct values) are chosen as partition columns.
Partitions are folders.
Types - Static and Dynamic
Static: Static partitioning involves manually specifying the partition values when inserting data into a Hive table. It's like organizing data into predefined folders.
CREATE TABLE sales_data (
  transaction_id INT,
  product_id INT,
  amount DECIMAL(10, 2)
)
PARTITIONED BY (date STRING, country STRING);

-- Insert data into the table with partition values specified
INSERT INTO sales_data PARTITION (date='2024-03-23', country='USA')
VALUES (1, 101, 100.50);

Dynamic :-Dynamic partitioning allows Hive to automatically


determine partition values based on the data being inserted. It’s like
an automatic sorting system where Hive creates partitions on-the-
fly.

-- Create a table with partition columns (dynamic partitioning)


CREATE TABLE sales_data_dynamic (
transaction_id INT,
product_id INT,
amount DECIMAL(10, 2)
)
PARTITIONED BY (date STRING, country STRING);

-- Enable dynamic partition mode


SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Insert data into the table without specifying partition values
INSERT INTO sales_data_dynamic PARTITION (date, country)
VALUES (1, 101, 100.50, '2024-03-23', 'USA');

25. How does indexing work in Hive?


Indexing: One of the Hive query optimization methods is the Hive index. A Hive index is used to speed up access to a column or set of columns, because with an index the engine does not need to read all rows in the table to find the selected data.
CREATE INDEX table01_index ON TABLE table01 (column2) AS
'COMPACT';
(Note: Hive indexes were deprecated and removed in Hive 3.0; columnar formats such as ORC, with built-in min/max statistics, and materialized views are used instead.)

26. What is Apache Airflow, and how is it used in data


pipelines?
Concept   | Description
DAG       | A Directed Acyclic Graph that defines the flow of tasks and their dependencies.
Task      | A single unit of work (can be a Bash, Python, SQL, or custom operator).
Operator  | Defines what kind of task it is (e.g., PythonOperator, BashOperator, SparkSubmitOperator).
Scheduler | Triggers tasks based on DAG definitions and schedule intervals.
Executor  | Runs tasks (e.g., Sequential, Local, Celery, Kubernetes).
Web UI    | Provides a dashboard to monitor and manage DAGs and tasks.

27. Explain how to schedule a pipeline in Airflow.


DAG & schedule interval
28. What are DAGs in Airflow?
A DAG (directed acyclic graph) is a mathematical structure
consisting of nodes and edges. In Airflow, a DAG represents a data
pipeline or workflow with a start and an end. The mathematical
properties of DAGs make them useful for building data pipelines:
Directed: There is a clear direction of flow between tasks.
Acyclic: There are no cycles, so a workflow always terminates and a task never depends on itself downstream.
Graph: Tasks are nodes and dependencies are edges.
29. How would you handle a failed task in an Airflow DAG?
1. Retries:
Airflow allows tasks to be configured with a specific number of
retries. If a task fails within a retry window, it will automatically be
retried after a configurable delay.
2. Clearing Task Status:
If a task fails and the retry mechanism is exhausted, you can
manually clear the task status in the Airflow UI. This essentially
resets the task instance, allowing it to be rerun from scratch. You
can select options to clear the task in the current DAG run or across
multiple runs.
3. Trigger Rules:
Trigger rules define the conditions under which a task should be
executed. By default, a task runs if all upstream tasks succeed
(all_success). However, you can use trigger rules like all_failed to
execute a task only when all upstream tasks have failed, or
one_failed to execute it if at least one upstream task has failed.
4. Error Handling:
Airflow provides various mechanisms for error handling, including:
Callbacks: You can define callback functions that are triggered when
a task fails or succeeds. These callbacks can be used to send
notifications, log errors, or trigger other actions.
Alerts: You can configure email alerts that are triggered when tasks
fail or are retried.
5. DAG Level Failure:
If a DAG as a whole needs to be alerted or handled upon failure, you
can define a failure_callback function at the DAG level. This callback
will be triggered if any task in the DAG fails.
6. Monitoring and Logging:
Airflow provides built-in logging and monitoring capabilities. You
can access task logs to diagnose the cause of failures and monitor
the overall status of your DAG runs.
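A hedged sketch of the retry and callback settings described above (the notification function and email address are made up):

from datetime import timedelta

def notify_on_failure(context):
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed in DAG {ti.dag_id}")

default_args = {
    "retries": 3,                          # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
    "on_failure_callback": notify_on_failure,
    "email_on_failure": True,
    "email": ["data-team@example.com"],
}
# Passed to the DAG as: DAG(..., default_args=default_args)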
30. What is backfilling in Apache Airflow?
Backfill is a very useful feature in Airflow. It allows you to re-run a DAG, or specific tasks in a DAG, for past dates (for example with the airflow dags backfill CLI command), so historical intervals are processed as if they had run on schedule.

31. What is the difference between AWS S3 and HDFS?


Feature              | S3             | HDFS                 | S3 vs HDFS
Elasticity           | Yes            | No                   | S3 is more elastic
Cost/TB/month        | $23            | $206                 | 10X
Availability         | 99.99%         | 99.9% (estimated)    | 10X
Durability           | 100.00%        | 99.9999% (estimated) | 10X+
Transactional writes | Yes, with DBIO | Yes                  | Comparable

32. How does AWS Glue work, and when would you use it?
1. Data Discovery:
AWS Glue uses crawlers to automatically discover and catalog data
from various sources, such as Amazon S3, DynamoDB, and relational
databases. These crawlers infer the schema and structure of your
data, automatically creating metadata descriptions that are stored in
the AWS Glue Data Catalog.
2. Data Preparation (ETL):
Once data is discovered and cataloged, AWS Glue provides tools to
prepare it for analysis. You can create and run ETL jobs to extract
data from various sources, transform it to the desired format, and
load it into a target destination.
3. Visual and Code-Based Workflows:
AWS Glue offers both visual and code-based interfaces for building
ETL workflows. You can use AWS Glue Studio to visually design ETL
pipelines and manage your data, while also using code-based
approaches with languages like Python or Scala for more complex
transformations.
4. Serverless Architecture:
AWS Glue operates on a serverless architecture, meaning you don't
have to manage servers or infrastructure. It automatically scales
resources as needed and handles job execution, monitoring, and
error handling.

33. Explain how AWS Lambda can be used in data processing.

1. Event-Driven Data Processing:


Lambda functions are triggered by specific events, such as a new file
being uploaded to S3, a record being added to a Kinesis stream, or a
change in a DynamoDB table.
This event-driven architecture allows you to process data as it
becomes available, enabling real-time data processing.

2. Serverless Architecture:
Lambda eliminates the need for you to provision, manage, or scale
servers.
You pay only for the compute time your code consumes, making it
cost-effective, especially for intermittent or burst processing
workloads.

3. Real-time Data Transformation:


You can use Lambda to transform data on the fly. For example,
convert images from color to grayscale as soon as they are uploaded
to S3.
Lambda functions can also process streaming data in real-time using
services like Kinesis, enabling tasks like clickstream analysis, log
filtering, or social media monitoring.

4. ETL (Extract, Transform, Load):


Lambda can be used to extract data from one source, transform it
according to your needs, and load it into another destination.
This is particularly useful for building data pipelines that move data
between different AWS services or external data stores.

5. Integration with Other AWS Services:


Lambda integrates seamlessly with other AWS services, allowing
you to build complex data processing pipelines.
For example, you can use AWS Step Functions to orchestrate a series
of Lambda functions to perform a complex ETL process.

6. Scaling and High Availability:


Lambda automatically scales to handle varying workloads, ensuring
that your data processing functions can handle peak loads without
you needing to manage infrastructure.
Lambda also provides high availability, ensuring that your data
processing functions remain operational even if one server fails.

34. What is the difference between Redshift and Athena?

Amazon Redshift is a managed data warehouse: data is loaded into Redshift's own storage and queried over dedicated (or serverless) compute, which suits heavy, repeated BI workloads. Amazon Athena is a serverless, pay-per-query engine that runs SQL directly on data in S3 with no infrastructure to manage, which suits ad hoc and infrequent queries over a data lake.

35. How do you optimize an S3 data lake for faster querying?


Best Practice Area                    | Key Takeaways
1. Data Organization and Management   | Structuring data for scalability, efficient partitioning strategies, and consistent naming conventions.
2. Security and Compliance            | Implementing IAM policies, data encryption methods, and compliance with data regulations.
3. Performance Optimization           | Choosing the right storage classes, data caching techniques, and query performance tuning.
4. Cost Management                    | Monitoring storage costs, implementing data lifecycle policies, and using cost-effective data formats.
5. Data Integration and Accessibility | Integrating with other AWS services, providing robust APIs, and supporting multiple data consumers.
6. Data Quality and Reliability       | Ensuring data accuracy, implementing validation checks, and managing data redundancy.
7. Scalability and Elasticity         | Auto-scaling strategies, managing peak loads, and planning for data growth.
8. Backup and Disaster Recovery       | Regular data backups, disaster recovery planning, and cross-region replication.
9. Monitoring and Logging             | Setting up monitoring dashboards, log management and analysis, and alerting mechanisms.
10. Future-Proofing and Innovation    | Embracing new AWS features, preparing for technological advances, and continuous learning.

36. What is Apache Kafka, and how does it work?


Apache Kafka (Kafka) is an open-source, distributed streaming
platform that enables (among other things) the development of real-
time, event-driven applications and user experiences on the web.
It enables applications to publish or subscribe to data or event
streams.
It stores records accurately (i.e., in the order in which they
occurred) in a fault-tolerant and durable way.
It processes records in real time (as they occur).
Apache Spark
Apache Spark is an analytics engine for large-scale data processing.
You can use Spark to perform analytics on streams delivered by
Apache Kafka and to produce real-time stream processing
applications, such as the aforementioned click-stream analysis.

Apache NiFi
Apache NiFi is a data flow management system with a visual, drag-
and-drop interface. Because NiFi can run as a Kafka producer and a
Kafka consumer, it’s an ideal tool for managing data flow challenges
that Kafka can’t address.

Apache Flink
Apache Flink is an engine for performing computations on event
streams at scale, with consistently high speed and low latency. Flink
can ingest streams as a Kafka consumer, perform operations based
on these streams in real time and publish the results to Kafka or to
another application.

Apache Hadoop
Apache Hadoop is a distributed software framework that lets you
store massive amounts of data in a cluster of computers for use in
big data analytics, machine learning, data mining and other data-
driven applications that process structured and unstructured data.
Kafka is often used to create a real-time streaming data pipeline to a
Hadoop cluster.

37. Explain the difference between Kafka Producer and


Consumer.
In Apache Kafka, Producers write (publish) data to topics, while
Consumers read (consume) data from those topics. Think of
producers as the publishers of messages, and consumers as the
subscribers or readers. The key difference lies in their purpose:
producers add data to the system, and consumers retrieve data.

38. How do you ensure message ordering in Kafka?


In order to guarantee ordered message delivery with the Kafka
messaging broker, messages can be produced with a key.
Messages with the same key are written to the same topic partition,
and as consumers read messages in order from each partition, their
ordering is preserved.
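A hedged sketch using the kafka-python client (the broker address, topic, and key are made up); because every message uses the same key, they all go to the same partition and are consumed in order:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for event in ("created", "updated", "shipped"):
    producer.send("orders", key=b"order-42", value=event.encode("utf-8"))
producer.flush()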

39. What are Kafka topics and partitions?


In Apache Kafka, topics are categories or streams of data, and
partitions are how topics are divided for scalability and parallel
processing. A topic is a logical channel for producers to write
messages and consumers to read them, while partitions allow a
single topic to be distributed across multiple Kafka brokers.

40. How does Kafka handle fault tolerance?


Kafka is a distributed system. The topic is divided into partitions
and kept in different brokers. If any broker fails, data should not be
lost. For fault-tolerance purposes, the partition is replicated and
stored in different brokers. If the leader broker fails, the controller elects one of the replicas as the new leader. The controller broker itself can also fail; in that case, ZooKeeper helps elect another broker as the controller.
41. What are some common Spark performance optimization
techniques?
- Use DataFrame/Dataset over RDD
- Use coalesce() over repartition() when reducing the number of partitions
- Use mapPartitions() over map()
- Use serialized data formats
- Avoid UDFs (User Defined Functions) when built-in functions exist
- Cache data in memory
- Reduce expensive shuffle operations
- Disable DEBUG & INFO logging

https://sparkbyexamples.com/spark/spark-performance-tuning/

42. How do you optimize data storage in a data lake?


- Source hardware. Whether you're using on-premises machines or Virtual Machines (VMs) in Azure, make sure to carefully select the appropriate hardware. ...
- Network connectivity to the storage account. ...
- Configure data ingestion tools for maximum parallelization. ...
- File formats. ...
- File size. ...
- Directory structure.

43. Explain the role of file formats like Parquet and ORC in
performance tuning.
Parquet:
Parquet is often preferred for write-once, read-many scenarios and
when efficient data compression is a priority. It also offers broader
compatibility, supporting multiple programming languages and
query engines.
ORC:
ORC is particularly well-suited for Hive-centric environments and
when ACID transactions are needed. It also excels in certain query
types due to its advanced indexing and optimization features.

44. What is vectorization in Pandas and Spark, and how does it


help in performance?
1. Vectorization in Pandas:
Batch Processing:
Pandas provides functions that can operate on entire Series (1D
arrays) or DataFrames (2D arrays) at once.
Performance Gains:
Vectorized operations are often implemented with optimized C
code, which significantly speeds up calculations compared to Python
loops.

Types of Pandas UDFs:


Scalar Pandas UDFs: Used for vectorizing scalar operations (e.g.,
applying a function to a single column).
Grouped Map Pandas UDFs: Designed for split-apply-combine
operations (e.g., applying a function to different groups of data).
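A hedged sketch of a scalar Pandas UDF in PySpark (requires pyarrow; the sample data, column name, and tax rate are made up). The function receives a whole pandas Series per batch, so the arithmetic is vectorized instead of running row by row:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def add_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.18

df = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["id", "amount"])
df.withColumn("amount_with_tax", add_tax("amount")).show()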

45. What are best practices for optimizing SQL queries on large
datasets?
This is the same checklist as in question 6: index the join and filter columns, select only the columns you need, avoid SELECT * and SELECT DISTINCT, filter with WHERE before aggregating instead of HAVING, prefer EXISTS over IN and UNION ALL over UNION, limit wildcards, partition large tables, use LIMIT/OFFSET for paging, consider stored procedures for repeated logic, and review the execution plan while monitoring query performance.
