Interview Q&A (SQL/Spark/Hive/Airflow/AWS/Kafka)
3. What are ETL and ELT? When would you use one over the
other?
Feature            | ETL                                 | ELT
Transform Location | In an external tool or engine       | Inside the destination (warehouse/lake)
Load Time          | Slower (post-transformation)        | Faster (raw data loaded quickly)
Flexibility        | Less (fixed schema before loading)  | More (raw + transformed data coexist)
Tooling            | Informatica, Talend, SSIS           | dbt, Spark, SQL in the warehouse
Best For           | Legacy systems, strict compliance   | Modern, scalable data stacks
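As a rough PySpark sketch of the ELT pattern (the paths and table names below are made-up placeholders), the raw extract is loaded into the destination first, and the transformation runs later inside it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# E + L: land the raw extract in the destination as-is (placeholder path/table)
raw = spark.read.json("s3://my-bucket/landing/orders/")
raw.write.mode("overwrite").saveAsTable("raw_orders")

# T: transform afterwards, inside the destination, using its own engine
curated = spark.sql("""
    SELECT order_id,
           customer_id,
           CAST(amount AS DOUBLE) AS amount,
           to_date(order_date)    AS order_date
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
curated.write.mode("overwrite").saveAsTable("curated_orders")
```

In an ETL flow, by contrast, the cast/filter step would run in an external tool before anything reaches the destination.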
Wide transformations (these trigger a shuffle; see the sketch after this list):
groupBy().agg()
join()
distinct()
orderBy()
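A small PySpark sketch (the DataFrames and column names are invented for illustration) showing each of these wide transformations; all of them force a shuffle because rows with the same key must end up in the same partition:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide-transformations").getOrCreate()

sales = spark.createDataFrame(
    [("US", "book", 10.0), ("US", "pen", 2.0), ("IN", "book", 8.0)],
    ["country", "item", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")], ["country", "name"]
)

# Each of these is a wide transformation: data is repartitioned by key (a shuffle)
per_country = sales.groupBy("country").agg(F.sum("amount").alias("total"))  # groupBy().agg()
joined      = per_country.join(countries, "country")                        # join()
unique      = sales.select("country").distinct()                            # distinct()
ranked      = per_country.orderBy(F.desc("total"))                          # orderBy()

ranked.show()
```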
32. How does AWS Glue work, and when would you use it?
1. Data Discovery:
AWS Glue uses crawlers to automatically discover and catalog data
from various sources, such as Amazon S3, DynamoDB, and relational
databases. These crawlers infer the schema and structure of your
data, automatically creating metadata descriptions that are stored in
the AWS Glue Data Catalog.
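A hedged sketch of setting up such a crawler with boto3; the crawler name, IAM role, database, and S3 path are all placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# Create a crawler that scans an S3 prefix and writes the inferred table
# definitions into the Glue Data Catalog (all names/paths are placeholders)
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/landing/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")
```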
2. Data Preparation (ETL):
Once data is discovered and cataloged, AWS Glue provides tools to
prepare it for analysis. You can create and run ETL jobs to extract
data from various sources, transform it to the desired format, and
load it into a target destination.
3. Visual and Code-Based Workflows:
AWS Glue offers both visual and code-based interfaces for building
ETL workflows. You can use AWS Glue Studio to visually design ETL
pipelines and manage your data, while also using code-based
approaches with languages like Python or Scala for more complex
transformations.
4. Serverless Architecture:
AWS Glue operates on a serverless architecture, meaning you don't
have to manage servers or infrastructure. It automatically scales
resources as needed and handles job execution, monitoring, and
error handling.
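Putting the pieces together, a minimal Glue job script might look like the sketch below, assuming a crawler has already cataloged an orders table in a raw_db database (all names and paths are placeholders):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read the table the crawler registered in the Data Catalog
orders = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Transform: keep only the columns needed downstream
cleaned = orders.select_fields(["order_id", "customer_id", "amount", "order_date"])

# Load: write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```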
2. Serverless Architecture (AWS Lambda):
Lambda eliminates the need for you to provision, manage, or scale
servers.
You pay only for the compute time your code consumes, making it
cost-effective, especially for intermittent or burst processing
workloads.
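A minimal Lambda handler sketch in Python; the S3-style Records payload is just an assumed event shape, and AWS bills only for the time this function actually runs:

```python
import json

def lambda_handler(event, context):
    # Invoked on demand by the configured event source (e.g. an S3 upload
    # or API Gateway call); no servers to provision or scale.
    records = event.get("Records", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(records)}),
    }
```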
Apache NiFi
Apache NiFi is a data flow management system with a visual, drag-
and-drop interface. Because NiFi can run as a Kafka producer and a
Kafka consumer, it’s an ideal tool for managing data flow challenges
that Kafka can’t address.
Apache Flink
Apache Flink is an engine for performing computations on event
streams at scale, with consistently high speed and low latency. Flink
can ingest streams as a Kafka consumer, perform operations based
on these streams in real time and publish the results to Kafka or to
another application.
Apache Hadoop
Apache Hadoop is a distributed software framework that lets you
store massive amounts of data in a cluster of computers for use in
big data analytics, machine learning, data mining and other data-
driven applications that process structured and unstructured data.
Kafka is often used to create a real-time streaming data pipeline to a
Hadoop cluster.
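A hedged sketch of that pipeline using Spark Structured Streaming (it assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, and HDFS paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hadoop").getOrCreate()

# Consume the "events" topic (placeholder name) from Kafka
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to a string column
payload = events.selectExpr("CAST(value AS STRING) AS value", "timestamp")

# Land the stream on the Hadoop cluster as Parquet for downstream analytics
query = (
    payload.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/events")               # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```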
https://sparkbyexamples.com/spark/spark-performance-tuning/
43. Explain the role of file formats like Parquet and ORC in
performance tuning.
Both are columnar storage formats: queries read only the columns they need, and the min/max statistics stored in each file let engines skip entire row groups (Parquet) or stripes (ORC), which sharply reduces I/O compared with row-based formats such as CSV or JSON.
Parquet:
Parquet is often preferred for write-once, read-many scenarios and
when efficient data compression is a priority. It also offers broader
compatibility, supporting multiple programming languages and
query engines.
ORC:
ORC is particularly well-suited for Hive-centric environments and
when ACID transactions are needed. It also excels in certain query
types due to its advanced indexing and optimization features.
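A short PySpark sketch (paths and column names are placeholders) showing how these formats are written with compression and how a columnar read benefits from column pruning and predicate pushdown:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

# Placeholder source; df stands in for any existing DataFrame of order records
df = spark.read.csv("s3://my-bucket/raw/orders/", header=True, inferSchema=True)

# Write columnar copies with compression (snappy/zlib are common codecs)
df.write.mode("overwrite").option("compression", "snappy").parquet(
    "s3://my-bucket/curated/orders_parquet/"
)
df.write.mode("overwrite").option("compression", "zlib").orc(
    "s3://my-bucket/curated/orders_orc/"
)

# Columnar reads touch only the requested columns and can push the filter
# down to file statistics, skipping whole row groups/stripes
fast = (
    spark.read.parquet("s3://my-bucket/curated/orders_parquet/")
    .select("order_id", "amount")
    .filter("amount > 100")
)
fast.explain()  # PushedFilters in the plan confirms predicate pushdown
```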
45. What are best practices for optimizing SQL queries on large
datasets?
Indexing: index the columns used in join conditions, WHERE filters, and ORDER BY clauses.
Column selection: select only the columns you need; avoid SELECT * and unnecessary SELECT DISTINCT.
Filtering: push filters into the WHERE clause (not HAVING) so rows are eliminated before aggregation, and keep predicates simple enough for indexes to be used.
Joins: use the proper join type and join on indexed keys of the same (ideally integer) data type.
Subqueries: prefer EXISTS over IN for membership checks, and EXISTS over COUNT when you only need to know whether rows exist; rewrite correlated subqueries as joins where possible.
Wildcards: limit leading wildcards in LIKE patterns (e.g. LIKE '%term'), which prevent index use.
Set operations: prefer UNION ALL to UNION when duplicate removal is not required.
Result size: use LIMIT and OFFSET to cap the rows returned.
Partitioning: partition large tables so queries scan only the relevant partitions.
Reuse: encapsulate frequently executed logic in stored procedures.
Observability: inspect the query execution plan and monitor query performance to find bottlenecks.
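A Spark SQL sketch illustrating several of these practices together; the orders, customers, and returns tables are assumed to already be registered in the catalog:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-best-practices").getOrCreate()

# Select only the needed columns, filter early in WHERE (not HAVING),
# and use EXISTS instead of IN for the subquery check.
result = spark.sql("""
    SELECT c.customer_id,
           SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id        -- join on the key column, not SELECT *
    WHERE o.order_date >= '2024-01-01'        -- rows eliminated before aggregation
      AND EXISTS (                            -- EXISTS instead of IN
            SELECT 1 FROM returns r
            WHERE r.order_id = o.order_id
          )
    GROUP BY c.customer_id
""")
result.explain()  # review the execution plan for scans, shuffles, and pushed filters
```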