Certified Data Engineer Professional Topic 3

The document consists of multiple questions and answers related to data engineering concepts, particularly focusing on Databricks and Structured Streaming. Topics include query updates for marketing dashboards, job execution results, language interoperability, security measures, and data pipeline configurations. Each question is followed by a selected answer, indicating the preferred choice based on the scenarios presented.

Question #: 114
Topic #: 1
A data team’s Structured Streaming job is configured to calculate running aggregates
for item sales to update a downstream marketing dashboard. The marketing team has
introduced a new promotion, and they would like to add a new field to track the number
of times this promotion code is used for each item. A junior data engineer suggests
updating the existing query as follows. Note that proposed changes are in bold.

Original query:

Proposed query:

Which step must also be completed to put the proposed query into production?

A. Specify a new checkpointLocation
B. Remove .option("mergeSchema", "true") from the streaming write
C. Increase the shuffle partitions to account for additional aggregates
D. Run REFRESH TABLE delta.`/item_agg`

Selected Answer: A
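
A minimal sketch of the idea behind the selected answer, using hypothetical table and column names: once the aggregation logic of a streaming query changes, the state in the old checkpoint no longer matches the new query, so the revised write must point at a fresh checkpointLocation.

# Hypothetical names: item_sales source, promo_code column, item_agg target.
from pyspark.sql import functions as F

agg_df = (spark.readStream
    .table("item_sales")
    .groupBy("item_id")
    .agg(F.sum("quantity").alias("total_sold"),
         F.count("promo_code").alias("promo_uses")))   # counts rows where a promo code is present

(agg_df.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/item_agg_v2")   # new location for the revised query
    .toTable("item_agg"))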

Question #: 12
Topic #: 1
A junior data engineer has configured a workload that posts the following JSON to the
Databricks REST API endpoint 2.0/jobs/create.

Assuming that all configurations and referenced resources are available, which
statement describes the result of executing this workload three times?

A. Three new jobs named “Ingest new data” will be defined in the workspace, and they
will each run once daily.
B. The logic defined in the referenced notebook will be executed three times on new
clusters with the configurations of the provided cluster ID.
C. Three new jobs named “Ingest new data” will be defined in the workspace, but no
jobs will be executed.
D. One new job named “Ingest new data” will be defined in the workspace, but it will not
be executed.
E. The logic defined in the referenced notebook will be executed three times on the
referenced existing all purpose cluster.

Selected Answer: C
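
A hedged illustration of why the selected answer holds, using placeholder host, token, cluster ID, and notebook path: the 2.0/jobs/create endpoint only registers a job definition, so posting the same payload three times yields three jobs and zero runs; runs start only when a schedule fires or a run is explicitly triggered.

import requests

payload = {
    "name": "Ingest new data",
    "existing_cluster_id": "1234-567890-abcde123",                       # placeholder
    "notebook_task": {"notebook_path": "/Repos/prod/ingest_new_data"},   # placeholder
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

for _ in range(3):
    resp = requests.post(
        "https://<workspace-host>/api/2.0/jobs/create",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json=payload,
    )
    print(resp.json())   # each call returns a distinct job_id; no run is started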

Question #: 9
Topic #: 1
A junior member of the data engineering team is exploring the language interoperability
of Databricks notebooks. The intended outcome of the below code is to register a view
of all sales that occurred in countries on the continent of Africa that appear in the
geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates
the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in
order in an interactive notebook?

A. Both commands will succeed. Executing show tables will show that countries_af and
sales_af have been registered as views.
B. Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view
named countries_af: if this entity exists, Cmd 2 will succeed.
C. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable
representing a PySpark DataFrame.
D. Both commands will fail. No new variables, tables, or views will be created.
E. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable
containing a list of strings.

Selected Answer: E
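
A sketch consistent with the selected answer, using assumed column names, of what the two command cells likely do:

# Cmd 1 (Python) succeeds, but countries_af is only a local Python list of strings.
countries_af = [row["country"] for row in
                spark.table("geo_lookup").filter("continent = 'AF'").select("country").collect()]

# Cmd 2 (SQL) then fails: no table or view named countries_af exists in the catalog.
# Registering a temp view instead would make the result visible to SQL:
spark.table("geo_lookup").filter("continent = 'AF'").createOrReplaceTempView("countries_af")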

Question #: 6
Topic #: 1
The security team is exploring whether or not the Databricks secrets module can be
leveraged for connecting to an external database.
After testing the code with all Python variables defined as strings, they upload the
password to the secrets module and configure the correct permissions for the
currently active user. They then modify their code to the following (leaving all other
variables unchanged).

Which statement describes what will happen when the above code is executed?

A. The connection to the external table will fail; the string “REDACTED” will be printed.
B. An interactive input box will appear in the notebook; if the right password is provided,
the connection will succeed and the encoded password will be saved to DBFS.
C. An interactive input box will appear in the notebook; if the right password is provided,
the connection will succeed and the password will be printed in plain text.
D. The connection to the external table will succeed; the string value of password will be
printed in plain text.
E. The connection to the external table will succeed; the string “REDACTED” will be
printed.

Selected Answer: E
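
A minimal sketch, with a hypothetical secret scope, key, and JDBC endpoint, of the behavior the selected answer describes: the secret value works when passed to the connector, but any attempt to display it in notebook output is redacted.

password = dbutils.secrets.get(scope="db-creds", key="jdbc-password")   # hypothetical scope/key
print(password)   # notebook output shows [REDACTED]

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<database>")   # placeholder endpoint
      .option("dbtable", "public.customers")                       # placeholder table
      .option("user", "svc_user")
      .option("password", password)                                # the real value is still passed to the connector
      .load())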

Question #: 5
Topic #: 1
A junior developer complains that the code in their notebook isn’t producing the correct
results in the development environment. A shared screenshot reveals that while they’re
using a notebook versioned with Databricks Repos, they’re using a personal branch that
contains old logic. The desired branch named dev-2.3.9 is not available from the branch
selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?

A. Use Repos to make a pull request; use the Databricks REST API to update the
current branch to dev-2.3.9
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9
branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the
current branch
D. Merge all changes back to the main branch in the remote Git repository and clone
the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull
request to sync with the remote repository

Selected Answer: B

Question #: 3
Topic #: 1
When scheduling Structured Streaming jobs for production, which configuration
automatically recovers from query failures and keeps costs low?

A. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
B. Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
E. Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1

Selected Answer: D

Question #: 40
Topic #: 1
The view named updates represents an incremental batch of all newly ingested data to be
inserted or updated in the customers table.
The following logic is used to process these records.

Which statement describes this implementation?

A. The customers table is implemented as a Type 3 table; old values are maintained as
a new column alongside the current value.
B. The customers table is implemented as a Type 2 table; old values are maintained but
marked as no longer current and new values are inserted.
C. The customers table is implemented as a Type 0 table; all writes are append only
with no changes to existing values.
D. The customers table is implemented as a Type 1 table; old values are overwritten by
new values and no history is maintained.
E. The customers table is implemented as a Type 2 table; old values are overwritten
and new customers are appended.

Selected Answer: B
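
A hedged sketch of the Type 2 pattern the selected answer describes, with assumed column names (is_current flag, effective_date, end_date): matched rows with changed values are closed out, and unmatched rows are inserted as current. A complete implementation typically also stages the new version of each changed row so it is re-inserted as the current record.

spark.sql("""
  MERGE INTO customers c
  USING updates u
  ON c.customer_id = u.customer_id AND c.is_current = true
  WHEN MATCHED AND c.address <> u.address THEN
    UPDATE SET is_current = false, end_date = u.effective_date
  WHEN NOT MATCHED THEN
    INSERT (customer_id, address, is_current, effective_date, end_date)
    VALUES (u.customer_id, u.address, true, u.effective_date, null)
""")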

Question #: 37
Topic #: 1
A small company based in the United States has recently contracted a consulting firm in
India to implement several new data engineering pipelines to power artificial intelligence
applications. All the company’s data is stored in regional cloud storage in the United
States.
The workspace administrator at the company is uncertain about where the Databricks
workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement
accurately informs this decision?

A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines
must be deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the
decision should be made based upon what is most convenient for the workspace
administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever
possible, compute should be deployed in the same region the data is stored.
D. Databricks leverages user workstations as the driver during interactive development;
as such, users should always use a workspace deployed in a region they are physically
near.
E. Databricks notebooks send all executable code from the user’s browser to virtual
machines over the open internet; whenever possible, choosing a workspace region near
the end users is the most secure.

Selected Answer: C

Question #: 32
Topic #: 1
The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source
tables has been de-duplicated and validated, which statement describes what will occur
when this code is executed?

A. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
B. The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
C. An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
D. An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
E. No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Selected Answer: B
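
A sketch of the batch-overwrite pattern the selected answer describes, with assumed table and key names: each execution replaces the target table with a fresh join over the current valid version of the sources.

spark.sql("""
  CREATE OR REPLACE TABLE enriched_itemized_orders_by_account AS
  SELECT a.account_id, a.account_name, o.order_id, i.item_id, i.quantity, i.price
  FROM accounts a
  JOIN orders o ON a.account_id = o.account_id
  JOIN order_items i ON o.order_id = i.order_id
""")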

Question #: 29
Topic #: 1
A new data engineer notices that a critical field was omitted from an application that
writes its Kafka source to Delta Lake, even though that field was present in the Kafka
source. The field was also missing from data written to dependent, long-term storage.
The retention threshold on the Kafka service is seven days. The pipeline has been in
production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A. The Delta log and Structured Streaming checkpoints record the full history of the
Kafka producer.
B. Delta Lake schema evolution can retroactively calculate the correct value for newly
added fields, as long as the data was in the original source.
C. Delta Lake automatically checks that all fields present in the source data are included
in the ingestion layer.
D. Data can never be permanently dropped or deleted from Delta Lake, so data loss is
not possible under any circumstance.
E. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a
permanent, replayable history of the data state.

Selected Answer: E
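
A minimal sketch, with placeholder broker and topic names, of the bronze-layer pattern the selected answer describes: landing the raw Kafka payload and its metadata in a Delta table creates a replayable history, so a field dropped by a downstream parser can be recovered later.

bronze = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker>:9092")   # placeholder
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select("key", "value", "topic", "partition", "offset", "timestamp"))

(bronze.writeStream
    .option("checkpointLocation", "/checkpoints/bronze_events")
    .toTable("bronze_events"))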

Question #: 132
Topic #: 1
An hourly batch job is configured to ingest data files from a cloud object storage
container where each batch represents all records produced by the source system in a
given hour. The batch job that processes these records into the Lakehouse is sufficiently
delayed to ensure no late-arriving data is missed. The user_id field represents a unique
key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full
record of all data in the same schema as the source. The next table in the system is
named account_current and is implemented as a Type 1 table representing the most
recent value for each unique user_id.

Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job, assuming there are millions of user accounts and tens of thousands of records processed hourly?

A. Filter records in account_history using the last_updated field and the most recent
hour processed, making sure to deduplicate on username; write a merge statement to
update or insert the most recent value for each username.
B. Use Auto Loader to subscribe to new files in the account_history directory; configure
a Structured Streaming trigger available job to batch update newly detected files into the
account_current table.
C. Overwrite the account_current table with each batch using the results of a query
against the account_history table grouping by user_id and filtering for the max value of
last_updated.
D. Filter records in account_history using the last_updated field and the most recent
hour processed, as well as the max last_login by user_id; write a merge statement to
update or insert the most recent value for each user_id.

Selected Answer: D
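
A hedged sketch of the selected approach, with a hypothetical batch boundary: restrict account_history to the newly processed hour, keep one most-recent row per user_id, and MERGE into the Type 1 table.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

batch_start_ts = 1700000000   # hypothetical epoch boundary for the hour being processed

batch = (spark.table("account_history")
    .filter(F.col("last_updated") >= batch_start_ts)
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())))
    .filter("rn = 1")
    .drop("rn"))

batch.createOrReplaceTempView("batch_updates")
spark.sql("""
  MERGE INTO account_current t
  USING batch_updates s
  ON t.user_id = s.user_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")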

Question #: 130
Topic #: 1
The following table consists of items found in user carts within an e-commerce website.

The following MERGE statement is used to update this table using an updates view,
with schema evolution enabled on this table.

How would the following update be handled?

A. The update throws an error because changes to existing columns in the target
schema are not supported.
B. The new nested field is added to the target schema, and dynamically read as NULL
for existing unmatched records.
C. The update is moved to a separate “rescued” column because it is missing a column
expected in the target schema.
D. The new nested field is added to the target schema, and files underlying existing
records are updated to include NULL values for the new field.

Selected Answer: B
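
A sketch of how such an update is typically handled, assuming schema evolution is enabled for MERGE at the session level and using hypothetical table and key names: the new nested field is added to the target schema, and existing rows simply read it back as NULL without their underlying files being rewritten.

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
spark.sql("""
  MERGE INTO cart_items t
  USING updates u
  ON t.cart_id = u.cart_id AND t.item_id = u.item_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")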

Question #: 27
Topic #: 1
A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events
Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same
event_id as an existing record?

A. They are merged.
B. They are ignored.
C. They are updated.
D. They are inserted.
E. They are deleted.

Selected Answer: B
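
A sketch of the insert-only merge the question implies, consistent with the selected answer: records whose event_id already exists satisfy the ON clause, and with no WHEN MATCHED clause defined they are simply ignored.

spark.sql("""
  MERGE INTO events e
  USING new_events n
  ON e.event_id = n.event_id
  WHEN NOT MATCHED THEN INSERT *
""")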

Question #: 25
Topic #: 1
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes
that the Min, Median, and Max Durations for tasks in a particular stage show the
minimum and median time to complete a task as roughly the same, but the max
duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
A. Task queueing resulting from improper thread pool assignment.
B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the source
data
D. Skew caused by more data being assigned to a subset of spark-partitions.
E. Credential validation errors while pulling data from an external system.

Selected Answer: D

Question #: 23
Topic #: 1
Which statement characterizes the general programming model used by Spark
Structured Streaming?

A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from
Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-
second latency for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows
appended to an unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold incremental
state values for cached stages.

Selected Answer: D

Question #: 19
Topic #: 1
A junior data engineer has been asked to develop a streaming data pipeline with a
grouped aggregation using DataFrame df. The pipeline needs to calculate the average
humidity and average temperature for each non-overlapping five-minute interval. Events
are recorded once per minute per device.
Streaming DataFrame df has the following schema:
“device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT”
Code block:

Choose the response that correctly fills in the blank within the code block to complete
this task.

A. to_interval("event_time", "5 minutes").alias("time")
B. window("event_time", "5 minutes").alias("time")
C. "event_time"
D. window("event_time", "10 minutes").alias("time")
E. lag("event_time", "10 minutes").alias("time")

Selected Answer: B
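
A sketch consistent with the selected answer, using the column names from the stated schema (the watermark value is an assumption, since the question does not specify one):

from pyspark.sql import functions as F

# df is the streaming DataFrame described in the question.
agg_df = (df
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes").alias("time"), "device_id")
    .agg(F.avg("temp").alias("avg_temp"),
         F.avg("humidity").alias("avg_humidity")))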

Question #: 17
Topic #: 1
A production workload incrementally applies updates from an external Change Data
Capture feed to a Delta Lake table as an always-on Structured Stream job. When data
was initially migrated for this table, OPTIMIZE was executed and most data files were
resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the
streaming production job. Recent review of data files shows that most data files are
under 64 MB, although each partition in the table contains at least 1 GB of data and the
total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE
operations
B. Z-order indices calculated on the table are preventing file compaction
C. Bloom filter indices calculated on the table are preventing file compaction
D. Databricks has autotuned to a smaller target file size based on the overall size of
data in the table
E. Databricks has autotuned to a smaller target file size based on the amount of data in
each partition

Selected Answer: A

Question #: 13
Topic #: 1
An upstream system is emitting change data capture (CDC) logs that are being written
to a cloud object storage directory. Each record in the log indicates the change type
(insert, update, or delete) and the values for each field after the change. The source
table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all
values that have ever been valid in the source system. For analytical purposes, only the
most recent value for each record needs to be recorded. The Databricks job to ingest
these records occurs once per hour, but each individual record may have changed
multiple times over the course of an hour.
Which solution meets these requirements?

A. Create a separate history table for each pk_id; resolve the current state of the table
by running a UNION ALL, filtering the history tables for the most recent state.
B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id
into a bronze table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely on
Delta Lake’s versioning ability to create an audit log.
D. Use Delta Lake’s change data feed to automatically process CDC data from an
external system, propagating all changes to all dependent tables in the Lakehouse.
E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or
delete the most recent entry for each pk_id into a silver table to recreate the current
table state.

Selected Answer: E

Question #: 93
Topic #: 1
You are performing a join operation to combine values from a static userLookup table
with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

A. userLookup.join(streamingDF, ["userid"], how="inner")
B. streamingDF.join(userLookup, ["user_id"], how="outer")
C. streamingDF.join(userLookup, ["user_id"], how="left")
D. streamingDF.join(userLookup, ["userid"], how="inner")
E. userLookup.join(streamingDF, ["user_id"], how="right")

Selected Answer: B
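
A short sketch, with hypothetical source names, of which stream-static joins run and which do not: with the stream on the left and the static table on the right, inner and left joins are supported, while a full outer join between a stream and a static table is not and fails when the query starts.

userLookup = spark.table("user_lookup")               # hypothetical static table
streamingDF = spark.readStream.table("user_events")   # hypothetical streaming source

valid_inner = streamingDF.join(userLookup, ["user_id"], how="inner")
valid_left = streamingDF.join(userLookup, ["user_id"], how="left")

# Not supported (option B); raises an analysis error at query start:
# streamingDF.join(userLookup, ["user_id"], how="outer")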

Question #: 165
Topic #: 1
The data governance team is reviewing code used for deleting records for compliance
with GDPR. The following logic has been implemented to propagate delete requests
from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users that have requested
deletion have been removed from the user_lookup table, which statement describes
whether successfully executing the above logic guarantees that the records to be
deleted from the user_aggregates table are no longer accessible and why?

A. No; the Delta Lake DELETE command only provides ACID guarantees when
combined with the MERGE INTO command.
B. No; files containing deleted records may still be accessible with time travel until a
VACUUM command is used to remove invalidated data files.
C. No; the change data feed only tracks inserts and updates, not deleted records.
D. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command
succeeded fully and permanently purged these records.

Selected Answer: B
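
The follow-up step the selected answer points to, sketched with the default retention window for illustration: VACUUM physically removes the data files that the DELETE only logically invalidated, after which the deleted records can no longer be reached through time travel.

spark.sql("VACUUM user_aggregates RETAIN 168 HOURS")   # 7 days; shorter windows require extra configuration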

Question #: 169
Topic #: 1
A data engineer, User A, has promoted a new pipeline to production by using the REST
API to programmatically create several jobs. A DevOps engineer, User B, has
configured an external orchestration tool to trigger job runs through the REST API. Both
users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these
events?

A. Because the REST API was used for job creation and triggering runs, a Service
Principal will be automatically used to identify these events.
B. Because User A created the jobs, their identity will be associated with both the job
creation events and the job run events.
C. Because these events are managed separately, User A will have their identity
associated with the job creation events and User B will have their identity associated
with the job run events.
D. Because the REST API was used for job creation and triggering runs, user identity
will not be captured in the audit logs.

Selected Answer: B

Question #: 170
Topic #: 1
A distributed team of data analysts share computing resources on an interactive cluster
with autoscaling configured. In order to better manage costs and query throughput, the
workspace administrator is hoping to evaluate whether cluster upscaling is caused by
many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

A. Workspace audit logs
B. Driver's log file
C. Ganglia
D. Cluster Event Log

Selected Answer: B

Question #: 171
Topic #: 1
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which
indicator would signal proper utilization of the VM’s resources?
A. The Five-Minute Load Average remains consistent/flat
B. CPU Utilization is around 75%
C. Network I/O never spikes
D. Total Disk Space remains constant

Selected Answer: B

Question #: 174
Topic #: 1
What is a method of installing a Python package scoped at the notebook level to all
nodes in the currently active cluster?

A. Run source env/bin/activate in a notebook setup script
B. Install libraries from PyPI using the cluster UI
C. Use %pip install in a notebook cell
D. Use %sh pip install in a notebook cell

Selected Answer: D

Question #: 175
Topic #: 1
What is the first line of a Databricks Python notebook when viewed in a text editor?

A. %python
B. // Databricks notebook source
C. # Databricks notebook source
D. -- Databricks notebook source

Selected Answer: C

Question #: 176
Topic #: 1
Incorporating unit tests into a PySpark application requires upfront attention to the
design of your jobs, or a potentially significant refactoring of existing code.

Which benefit offsets this additional effort?

A. Improves the quality of your data
B. Validates a complete use case of your application
C. Troubleshooting is easier since all steps are isolated and tested individually
D. Ensures that all steps interact correctly to achieve the desired end result

Selected Answer: A

Question #: 177
Topic #: 1
What describes integration testing?

A. It validates an application use case.
B. It validates behavior of individual elements of an application.
C. It requires an automated testing framework.
D. It validates interactions between subsystems of your application.

Selected Answer: C

Question #: 179
Topic #: 1
A Databricks job has been configured with three tasks, each of which is a Databricks
notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with
each having a serial dependency on task A.

What will be the resulting state if tasks A and B complete successfully but task C fails
during a scheduled run?

A. All logic expressed in the notebook associated with tasks A and B will have been
successfully completed; some operations in task C may have completed successfully.
B. Unless all tasks complete successfully, no changes will be committed to the
Lakehouse; because task C failed, all commits will be rolled back automatically.
C. Because all tasks are managed as a dependency graph, no changes will be
committed to the Lakehouse until all tasks have successfully been completed.
D. All logic expressed in the notebook associated with tasks A and B will have been
successfully completed; any changes made in task C will be rolled back due to task
failure.

Selected Answer: D

Question #: 137
Topic #: 1
Which statement regarding stream-static joins and static Delta tables is correct?

A. The checkpoint directory will be used to track updates to the static Delta table.
B. Each microbatch of a stream-static join will use the most recent version of the static
Delta table as of the job’s initialization.
C. The checkpoint directory will be used to track state information for the unique keys
present in the join.
D. Stream-static joins cannot use static Delta tables because of consistency issues.

Selected Answer: B

Question #: 133
Topic #: 1
The business intelligence team has a dashboard configured to track various summary
metrics for retail stores. This includes total sales for the previous day alongside totals
and averages for a variety of time periods. The fields required to populate this
dashboard have the following schema:

For demand forecasting, the Lakehouse contains a validated table of all itemized sales
updated incrementally in near real-time. This table, named products_per_order,
includes the following fields:

Because reporting on long-term sales trends is less volatile, analysts using the new
dashboard only require data to be refreshed once daily. Because the dashboard will be
queried interactively by many users throughout a normal business day, it should return
results quickly and reduce total compute associated with each materialization.
Which solution meets the expectations of the end users while controlling and limiting
possible costs?

A. Populate the dashboard by configuring a nightly batch job to save the required values
as a table overwritten with each update.
B. Use Structured Streaming to configure a live dashboard against the
products_per_order table within a Databricks notebook.
C. Define a view against the products_per_order table and define the dashboard
against this view.
D. Use the Delta Cache to persist the products_per_order table in memory to quickly
update the dashboard with each query.

Selected Answer: A

Question #: 116
Topic #: 1
The data engineering team is configuring environments for development, testing, and
production before beginning migration on a new data pipeline. The team requires
extensive testing on both the code and data resulting from code execution, and the
team wants to develop and test against data as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the
development and testing environments, allowing pre-production code to execute against
production data. Because all users have admin privileges in the development
environment, the junior data engineer has offered to configure permissions and mount
this data for the team.

Which statement captures best practices for this situation?

A. All development, testing, and production code and data should exist in a single,
unified workspace; creating separate environments for testing and development
complicates administrative overhead.
B. In environments where interactive code will be executed, production data should only
be accessible with read permissions; creating isolated databases for each environment
further reduces risks.
C. As long as code in the development environment declares USE dev_db at the top of
each notebook, there is no possibility of inadvertently committing changes back to
production data sources.
D. Because Delta Lake versions all data and supports time travel, it is not possible for
user error or malicious actors to permanently delete production data; as such, it is
generally safe to mount production data anywhere.
E. Because access to production data will always be verified using passthrough
credentials, it is safe to mount data to any Databricks development environment.

Selected Answer: B

Question #: 115
Topic #: 1
When using CLI or REST API to get results from jobs with multiple tasks, which
statement correctly describes the response structure?

A. Each run of a job will have a unique job_id; all tasks within this job will have a unique
job_id
B. Each run of a job will have a unique job_id; all tasks within this job will have a unique
task_id
C. Each run of a job will have a unique orchestration_id; all tasks within this job will
have a unique run_id
D. Each run of a job will have a unique run_id; all tasks within this job will have a unique
task_id
E. Each run of a job will have a unique run_id; all tasks within this job will also have a
unique run_id

Selected Answer: D

Question #: 77
Topic #: 1
In order to facilitate near real-time workloads, a data engineer is creating a helper
function to leverage the schema detection and evolution functionality of Databricks Auto
Loader. The desired function will automatically detect the schema of the source directory,
incrementally process JSON files as they arrive in a source directory, and automatically
evolve the schema of the table when new fields are detected.

Which response correctly fills in the blank to meet the specified requirements?
A.
B.
C.
D.
E.

Selected Answer: E
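
A hedged sketch of such a helper, with hypothetical function name and parameters, combining Auto Loader schema inference with schema evolution on both the stream and the target table:

def ingest_json(source_dir, target_table, checkpoint_path):
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)        # where the inferred schema is tracked
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")   # evolve when new fields appear
        .load(source_dir)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")                               # let the target table schema evolve
        .toTable(target_table))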

Question #: 74
Topic #: 1
Which statement describes the correct use of pyspark.sql.functions.broadcast?

A. It marks a column as having low enough cardinality to properly map distinct values to
available partitions, allowing a broadcast join.
B. It marks a column as small enough to store in memory on all executors, allowing a
broadcast join.
C. It caches a copy of the indicated table on attached storage volumes for all active
clusters within a Databricks workspace.
D. It marks a DataFrame as small enough to store in memory on all executors, allowing
a broadcast join.
E. It caches a copy of the indicated table on all nodes in the cluster for use in all future
queries during the cluster lifetime.

Selected Answer: D
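
A minimal sketch with hypothetical table and key names: broadcast() marks the whole DataFrame as small enough to copy to every executor, so the join avoids shuffling the large side.

from pyspark.sql.functions import broadcast

fact_sales = spark.table("fact_sales")        # hypothetical large table
dim_products = spark.table("dim_products")    # hypothetical small lookup table

joined = fact_sales.join(broadcast(dim_products), "product_id")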

Question #: 66
Topic #: 1
The following code has been migrated to a Databricks notebook from a legacy
workload:
The code executes successfully and provides the logically correct results; however, it
takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to
cluster startup time.
B. Instead of cloning, the code should use %sh pip install so that the Python code can
get executed in parallel across all nodes in a cluster.
C. %sh does not distribute file moving operations; the final line of code should be
updated to use %fs instead.
D. Python will always execute slower than Scala on Databricks. The run.py script should
be refactored to Scala.
E. %sh executes shell code on the driver node. The code does not take advantage of
the worker nodes or Databricks optimized Spark.

Selected Answer: E

Question #: 46
Topic #: 1
Although the Databricks Utilities Secrets module provides tools to store sensitive
credentials and avoid accidentally displaying them in plain text, users should still be
careful about which credentials are stored here and which users have access to these
secrets.
Which statement describes a limitation of Databricks Secrets?

A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash
will display the value in plain text.
B. Account administrators can see all secrets in plain text by logging on to the
Databricks Accounts console.
C. Secrets are stored in an administrators-only table within the Hive Metastore;
database administrators have permission to query this table by default.
D. Iterating through a stored secret and printing each character will display secret
contents in plain text.
E. The Databricks REST API can be used to list secrets in plain text if the personal
access token has proper credentials.

Selected Answer: E

Question #: 45
Topic #: 1
An external object storage container has been mounted to the location
/mnt/finance_eda_bucket.

After the database was successfully created and permissions configured, a member of
the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement
describes how the tx_sales table will be created?

A. A logical table will persist the query plan to the Hive Metastore in the Databricks
control plane.
B. An external table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
C. A logical table will persist the physical plan to the Hive Metastore in the Databricks
control plane.
D. A managed table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
E. A managed table will be created in the DBFS root storage container.

Selected Answer: D

Question #: 7
Topic #: 1
The data science team has created and logged a production model using MLflow. The
following code correctly imports and applies the production model to output the
predictions as a new DataFrame named preds with the schema "customer_id LONG,
predictions DOUBLE, date DATE".
The data science team would like predictions saved to a Delta Lake table with the ability
to compare all predictions across time. Churn predictions will be made at most once per
day.
Which code block accomplishes this task while minimizing potential compute costs?
A. preds.write.mode("append").saveAsTable("churn_preds")
B. preds.write.format("delta").save("/preds/churn_preds")
C.
D.
E.

Selected Answer: A

Question #: 64
Topic #: 1
A Delta Lake table was created with the below query:

Consider the following query:

DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?

A. Nothing will occur until a COMMIT command is executed.
B. The table will be removed from the catalog but the data will remain in storage.
C. The table will be removed from the catalog and the data will be deleted.
D. An error will occur because Delta Lake prevents the deletion of production data.
E. Data will be marked as deleted but still recoverable with Time Travel.

Selected Answer: C

Question #: 73
Topic #: 1
A Structured Streaming job deployed to production has been resulting in higher than
expected cloud storage costs. At present, during normal execution, each microbatch of
data is processed in less than 3s; at least 12 times per minute, a microbatch is
processed that contains 0 records. The streaming write was configured using the default
trigger settings. The production job is currently scheduled alongside many other
Databricks jobs in a workspace with instance pools provisioned to reduce start-up time
for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less
than 10 minutes, which adjustment will meet the requirement?

A. Set the trigger interval to 3 seconds; the default trigger interval is consuming too
many records per batch, resulting in spill to disk that can increase volume costs.
B. Increase the number of shuffle partitions to maximize parallelism, since the trigger
interval cannot be modified without modifying the checkpoint directory.
C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage
account, so decreasing trigger frequency to maximum allowable threshold should
minimize this cost.
D. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger
interval ensures that the source is not queried too frequently.
E. Use the trigger once option and configure a Databricks job to execute the query
every 10 minutes; this approach minimizes costs for both compute and storage.

Selected Answer: E
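
A sketch of the selected approach with assumed source and target names: a trigger-once query processes whatever has arrived and then stops, and a Databricks job scheduled every 10 minutes restarts it, instead of an always-on stream polling cloud storage every few seconds.

(spark.readStream
    .table("raw_events")                                           # placeholder source
    .writeStream
    .trigger(once=True)                                            # availableNow=True is the newer equivalent
    .option("checkpointLocation", "/checkpoints/raw_events_batch")
    .toTable("processed_events"))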
