Databricks Data Engineer Professional Certification Dumps

Topic 1 - Exam A

Question #1

Topic 1

An upstream system has been configured to pass the date for a given batch of data to
the Databricks Jobs API as a parameter. The notebook to be scheduled will use this
parameter to load data with the following code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above
code block?

A. date = spark.conf.get("date")
B. input_dict = input()
date= input_dict["date"]
C. import sys
date = sys.argv[1]
D. date = dbutils.notebooks.getParam("date")
E. dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
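
For context, a minimal sketch (assuming the Jobs API passes a notebook parameter named "date") of how the widget-based pattern in option E is typically used to receive such a parameter:

# Sketch: receiving a job parameter through a widget; the Jobs API populates
# the widget value when the notebook runs as a job task.
dbutils.widgets.text("date", "null")           # declare the widget with a default
date = dbutils.widgets.get("date")             # read the value supplied by the job
df = spark.read.format("parquet").load(f"/mnt/source/{date}")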


Question #2

Topic 1

The Databricks workspace administrator has configured interactive clusters for each of
the data engineering groups. To control costs, clusters are set to terminate after 30
minutes of inactivity. Each user should be able to execute workloads against their
assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions,
which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

A. "Can Manage" privileges on the required cluster


B. Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges
on the required cluster
C. Cluster creation allowed, "Can Attach To" privileges on the required cluster
D. "Can Restart" privileges on the required cluster
E. Cluster creation allowed, "Can Restart" privileges on the required cluster


Question #3

Topic 1

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

A. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: Unlimited
B. Cluster: New Job Cluster;
Retries: None;
Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
D. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
E. Cluster: Existing All-Purpose Cluster;
Retries: None;
Maximum Concurrent Runs: 1
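
For reference, a rough sketch of how these settings map onto a legacy single-task Jobs 2.0 API payload (the job name, notebook path, and cluster spec are hypothetical); a new job cluster, unlimited retries, and one concurrent run are the settings being compared:

# Sketch of job settings for a production Structured Streaming workload
# (field names follow the legacy single-task Jobs 2.0 format).
job_settings = {
    "name": "streaming-ingest",                               # hypothetical
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Jobs/streaming_ingest"},
    "max_retries": -1,                                        # -1 = retry indefinitely
    "max_concurrent_runs": 1,                                 # no overlapping runs
}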


Question #4

Topic 1

The data engineering team has configured a Databricks SQL query and alert to monitor
the values in a Delta Lake table. The recent_sensor_recordings table contains an
identifying sensor_id alongside the timestamp and temperature for the most recent 5
minutes of recordings.
The below query is used to create the alert:
The query is set to refresh each minute and always completes in less than 10 seconds.
The alert is set to trigger when mean (temperature) > 120. Notifications are triggered to
be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which
statement must be true?

A. The total average temperature across all sensors exceeded 120 on three
consecutive executions of the query
B. The recent_sensor_recordings table was unresponsive for three consecutive runs
of the query
C. The source query failed to update properly for three consecutive minutes and
then restarted
D. The maximum temperature recording for at least one sensor exceeded 120 on
three consecutive executions of the query
E. The average temperature recordings for at least one sensor exceeded 120 on
three consecutive executions of the query


Question #5

Topic 1

A junior developer complains that the code in their notebook isn't producing the correct
results in the development environment. A shared screenshot reveals that while they're
using a notebook versioned with Databricks Repos, they're using a personal branch that
contains old logic. The desired branch named dev-2.3.9 is not available from the branch
selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?

A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.
B. Use Repos to pull changes from the remote Git repository and select the dev-
2.3.9 branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the
current branch
D. Merge all changes back to the main branch in the remote Git repository and clone
the repo again
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a
pull request to sync with the remote repository


Question #6

Topic 1

The security team is exploring whether or not the Databricks secrets module can be
leveraged for connecting to an external database.
After testing the code with all Python variables being defined with strings, they upload
the password to the secrets module and configure the correct permissions for the
currently active user. They then modify their code to the following (leaving all other
variables unchanged).

Which statement describes what will happen when the above code is executed?

A. The connection to the external table will fail; the string "REDACTED" will be
printed.
B. An interactive input box will appear in the notebook; if the right password is
provided, the connection will succeed and the encoded password will be saved to
DBFS.
C. An interactive input box will appear in the notebook; if the right password is
provided, the connection will succeed and the password will be printed in plain
text.
D. The connection to the external table will succeed; the string value of password
will be printed in plain text.
E. The connection to the external table will succeed; the string "REDACTED" will be
printed.
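
For context, a minimal sketch of the pattern being tested (the scope and key names are hypothetical); a value fetched through the secrets module can be used normally in code, but notebook output of that value is redacted:

# Sketch: reading a password from the secrets module.
password = dbutils.secrets.get(scope="db-credentials", key="jdbc-password")

# The real string is returned and can be used to open the connection, but
# printing it in a notebook displays a redacted placeholder instead of the value.
print(password)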


Question #7

Topic 1

The data science team has created and logged a production model using MLflow. The
following code correctly imports and applies the production model to output the
predictions as a new DataFrame named preds with the schema "customer_id LONG,
predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability
to compare all predictions across time. Churn predictions will be made at most once per
day.
Which code block accomplishes this task while minimizing potential compute costs?

A. preds.write.mode("append").saveAsTable("churn_preds")
B. preds.write.format("delta").save("/preds/churn_preds")

C.
D.

E.


Question #8

Topic 1

An upstream source writes Parquet data as hourly batches to directories named with
the current date. A nightly batch job runs the following code to ingest all data from the
previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely
identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single
order hours apart, which statement is correct?

A. Each write to the orders table will only contain unique records, and only those
records without duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written
records may have duplicates already present in the target table.
C. Each write to the orders table will only contain unique records; if existing records
with the same key are present in the target table, these records will be
overwritten.
D. Each write to the orders table will only contain unique records; if existing records
with the same key are present in the target table, the operation will fail.
E. Each write to the orders table will run deduplication over the union of new and
existing records, ensuring no duplicate records are present.
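
The ingest code itself is not reproduced in this dump; a sketch of the pattern the question implies (the dropDuplicates call, paths, and table name are assumptions) shows why in-batch deduplication does not protect against duplicates already in the target:

# Sketch: deduplicate on the composite key within the incoming batch, then append.
df = (spark.read.format("parquet")
      .load(f"/mnt/raw_orders/{date}")                          # hypothetical path
      .dropDuplicates(["customer_id", "order_id"]))             # unique within this batch only

# An append never compares against rows already in the target table, so
# duplicates that span batches can still accumulate.
df.write.format("delta").mode("append").saveAsTable("orders")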


Question #9

Topic 1

A junior member of the data engineering team is exploring the language interoperability
of Databricks notebooks. The intended outcome of the below code is to register a view
of all sales that occurred in countries on the continent of Africa that appear in the
geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates
the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in
order in an interactive notebook?

A. Both commands will succeed. Executing show tables will show that countries_af
and sales_af have been registered as views.
B. Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or
view named countries_af: if this entity exists, Cmd 2 will succeed.
C. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable
representing a PySpark DataFrame.
D. Both commands will fail. No new variables, tables, or views will be created.
E. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable
containing a list of strings.

Question #10

Topic 1

A Delta table of weather records is partitioned by date and has the below schema: date
DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below
filter: latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?

A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude
column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column


Question #11

Topic 1

The data engineering team has configured a job to process customer requests to be
forgotten (have their data deleted). All user data that needs to be deleted is stored in
Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at
1am each Sunday. The total duration of this job is less than one hour. Every Monday at
3am, a batch job executes a series of VACUUM commands on all Delta Lake tables
throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality.
They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses
this concern?

A. Because the VACUUM command permanently deletes all files containing deleted
records, deleted records may be accessible with time travel for around 24 hours.
B. Because the default data retention threshold is 24 hours, data files containing
deleted records will be retained until the VACUUM job is run the following day.
C. Because Delta Lake time travel provides full access to the entire history of a
table, deleted records can always be recreated by users with full admin
privileges.
D. Because Delta Lake's delete statements have ACID guarantees, deleted records
will be permanently purged from all storage systems as soon as a delete job
completes.
E. Because the default data retention threshold is 7 days, data files containing
deleted records will be retained until the VACUUM job is run 8 days later.
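
For reference, a quick sketch of the retention mechanics involved (the table name is hypothetical); with default table settings, VACUUM only removes data files that have exceeded the 7-day retention threshold, so files invalidated by Sunday's delete job are untouched by Monday's VACUUM and remain reachable through time travel until a later run:

# Sketch: VACUUM with the default retention threshold.
spark.sql("VACUUM user_data")                     # default threshold: 7 days
spark.sql("VACUUM user_data RETAIN 168 HOURS")    # equivalent explicit form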


Question #12

Topic 1

A junior data engineer has configured a workload that posts the following JSON to the
Databricks REST API endpoint 2.0/jobs/create.

Assuming that all configurations and referenced resources are available, which
statement describes the result of executing this workload three times?

A. Three new jobs named "Ingest new data" will be defined in the workspace, and
they will each run once daily.
B. The logic defined in the referenced notebook will be executed three times on new
clusters with the configurations of the provided cluster ID.
C. Three new jobs named "Ingest new data" will be defined in the workspace, but no
jobs will be executed.
D. One new job named "Ingest new data" will be defined in the workspace, but it will
not be executed.
E. The logic defined in the referenced notebook will be executed three times on the
referenced existing all purpose cluster.

Question #13

Topic 1

An upstream system is emitting change data capture (CDC) logs that are being written
to a cloud object storage directory. Each record in the log indicates the change type
(insert, update, or delete) and the values for each field after the change. The source
table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all
values that have ever been valid in the source system. For analytical purposes, only the
most recent value for each record needs to be recorded. The Databricks job to ingest
these records occurs once per hour, but each individual record may have changed
multiple times over the course of an hour.
Which solution meets these requirements?

A. Create a separate history table for each pk_id; resolve the current state of the table by running a union all and filtering the history tables for the most recent state.
B. Use MERGE INTO to insert, update, or delete the most recent entry for each
pk_id into a bronze table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely
on Delta Lake's versioning ability to create an audit log.
D. Use Delta Lake's change data feed to automatically process CDC data from an
external system, propagating all changes to all dependent tables in the
Lakehouse.
E. Ingest all log information into a bronze table; use MERGE INTO to insert, update,
or delete the most recent entry for each pk_id into a silver table to recreate the
current table state.
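
A minimal sketch of the bronze-plus-MERGE pattern described in option E (every table and column name other than pk_id is an assumption): the bronze table keeps the full audit history, while the MERGE applies only the latest change per key to the silver table.

# Sketch: apply the most recent CDC entry per pk_id from bronze to silver.
spark.sql("""
  MERGE INTO silver AS t
  USING (
    SELECT pk_id, change_type, value, change_time
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY pk_id ORDER BY change_time DESC) AS rn
      FROM bronze_cdc
    ) ranked
    WHERE rn = 1
  ) AS s
  ON t.pk_id = s.pk_id
  WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET value = s.value, change_time = s.change_time
  WHEN NOT MATCHED AND s.change_type != 'delete' THEN
    INSERT (pk_id, value, change_time) VALUES (s.pk_id, s.value, s.change_time)
""")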


Question #14

Topic 1

An hourly batch job is configured to ingest data files from a cloud object storage
container where each batch represent all records produced by the source system in a
given hour. The batch job to process these records into the Lakehouse is sufficiently
delayed to ensure no late-arriving data is missed. The user_id field represents a unique
key for the data, which has the following schema: user_id BIGINT, username STRING,
user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN,
last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full
record of all data in the same schema as the source. The next table in the system is
named account_current and is implemented as a Type 1 table representing the most
recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records
processed hourly, which implementation can be used to efficiently update the described
account_current table as part of each hourly batch job?

A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
B. Overwrite the account_current table with each batch using the results of a query
against the account_history table grouping by user_id and filtering for the max
value of last_updated.
C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
D. Use Delta Lake version history to get the difference between the latest version of
account_history and one version prior, then write these records to
account_current.
E. Filter records in account_history using the last_updated field and the most recent
hour processed, making sure to deduplicate on username; write a merge
statement to update or insert the most recent value for each username.


Question #15

Topic 1

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived
from a number of upstream sources. Currently, the data engineering team populates
this table nightly by overwriting the table with the current valid values derived from
upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team
is only interested in making predictions on records that have changed in the past 24
hours.
Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but
implement logic to perform an upsert into the predictions table that ignores rows
where predictions have not changed.
B. Convert the batch job to a Structured Streaming job using the complete output
mode; configure a Structured Streaming job to read from the
customer_churn_params table and incrementally predict against the churn
model.
C. Calculate the difference between the previous model predictions and the current
customer_churn_params on a key identifying unique customers before making
new predictions; only make predictions on those customers not in the previous
predictions.
D. Modify the overwrite logic to include a field populated by calling
spark.sql.functions.current_timestamp() as data are being written; use this field to
identify records written on a particular date.
E. Replace the current overwrite logic with a merge statement to modify only those
records that have changed; write logic to make predictions on the changed
records identified by the change data feed.



Question #16

Topic 1

A table is registered with the following code:


Both users and orders are Delta Lake tables. Which statement describes the results of
querying recent_orders?

A. All logic will execute at query time and return the result of joining the valid
versions of the source tables at the time the query finishes.
B. All logic will execute when the table is defined and store the result of joining
tables to the DBFS; this stored data will be returned when the table is queried.
C. Results will be computed and cached when the table is defined; these cached
results will incrementally update as new records are inserted into source tables.
D. All logic will execute at query time and return the result of joining the valid
versions of the source tables at the time the query began.
E. The versions of each source table will be stored in the table transaction log; query
results will be saved to DBFS with each query.


Question #17

Topic 1

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data
was initially migrated for this table, OPTIMIZE was executed and most data files were
resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the
streaming production job. Recent review of data files shows that most data files are
under 64 MB, although each partition in the table contains at least 1 GB of data and the
total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
B. Z-order indices calculated on the table are preventing file compaction
C. Bloom filter indices calculated on the table are preventing file compaction
D. Databricks has autotuned to a smaller target file size based on the overall size of
data in the table
E. Databricks has autotuned to a smaller target file size based on the amount of data
in each partition

Question #18

Topic 1

Which statement regarding stream-static joins and static Delta tables is correct?

A. Each microbatch of a stream-static join will use the most recent version of the
static Delta table as of each microbatch.
B. Each microbatch of a stream-static join will use the most recent version of the
static Delta table as of the job's initialization.
C. The checkpoint directory will be used to track state information for the unique
keys present in the join.
D. Stream-static joins cannot use static Delta tables because of consistency issues.
E. The checkpoint directory will be used to track updates to the static Delta table.
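
A minimal sketch of a stream-static join (table names, join key, and checkpoint path are hypothetical); the streaming side progresses incrementally while the static Delta table on the other side is read at its latest available version as microbatches are processed:

# Sketch: joining a streaming source to a static Delta dimension table.
orders_stream = spark.readStream.table("orders_bronze")
customers_dim = spark.read.table("customers_dim")

enriched = orders_stream.join(customers_dim, on="customer_id", how="left")

(enriched.writeStream
 .option("checkpointLocation", "/checkpoints/orders_enriched")
 .toTable("orders_enriched"))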


Question #19

Topic 1

A junior data engineer has been asked to develop a streaming data pipeline with a
grouped aggregation using DataFrame df. The pipeline needs to calculate the average
humidity and average temperature for each non-overlapping five-minute interval. Events
are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete
this task.

A. to_interval("event_time", "5 minutes").alias("time")


B. window("event_time", "5 minutes").alias("time")
C. "event_time"
D. window("event_time", "10 minutes").alias("time")
E. lag("event_time", "10 minutes").alias("time")
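
The referenced code block is not reproduced in this dump; a sketch of the aggregation it implies (grouping by device_id as well is an assumption) looks like the following:

from pyspark.sql import functions as F

# Sketch: averages over non-overlapping five-minute tumbling windows.
agg_df = (df
    .groupBy(F.window("event_time", "5 minutes").alias("time"), "device_id")
    .agg(F.avg("temp").alias("avg_temp"),
         F.avg("humidity").alias("avg_humidity")))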


Question #20

Topic 1

A data architect has designed a system in which two Structured Streaming jobs will
concurrently write to a single bronze Delta table. Each job is subscribing to a different
topic from an Apache Kafka source, but they will write data with the same schema. To
keep the directory structure simple, a data engineer has decided to nest a checkpoint
directory to be shared by both streams.
The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the
given scenario and why?

A. No; Delta Lake manages streaming checkpoints in the transaction log.


B. Yes; both of the streams can share a single checkpoint directory.
C. No; only one stream can write to a Delta Lake table.
D. Yes; Delta Lake supports infinite concurrent writers.
E. No; each of the streams needs to have its own checkpoint directory.

Question #21

Topic 1

A Structured Streaming job deployed to production has been experiencing delays during
peak hours of the day. At present, during normal execution, each microbatch of data is
processed in less than 3 seconds. During peak hours of the day, execution time for
each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The
streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less
than 10 seconds, which adjustment will meet the requirement?

A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
B. Increase the trigger interval to 30 seconds; setting the trigger interval near the
maximum execution time observed for each batch is always best practice to
ensure no records are dropped.
C. The trigger interval cannot be modified without modifying the checkpoint directory;
to maintain the current stream state, increase the number of shuffle partitions to
maximize parallelism.
D. Use the trigger once option and configure a Databricks job to execute the query
every 10 seconds; this ensures all backlogged records are processed with each
batch.
E. Decrease the trigger interval to 5 seconds; triggering batches more frequently
may prevent records from backing up and large batches from causing spill.


Question #22

Topic 1

Which statement describes Delta Lake Auto Compaction?

A. An asynchronous job runs after the write completes to detect if files could be
further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified
during the most recent job.
C. Optimized writes use logical partitions instead of directory partitions; because
partition boundaries are only represented in metadata, fewer small files are
written.
D. Data is queued in a messaging bus instead of committing data directly to
memory; all data is committed from the messaging bus in one batch once the job
is complete.
E. An asynchronous job runs after the write completes to detect if files could be
further compacted; if yes, an OPTIMIZE job is executed toward a default of 128
MB.
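
For context, a sketch of how optimized writes and auto compaction are typically enabled on a Delta table (the table name is hypothetical):

# Sketch: enabling the Auto Optimize features via Delta table properties.
spark.sql("""
  ALTER TABLE bronze_events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")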


Question #23

Topic 1

Which statement characterizes the general programming model used by Spark Structured Streaming?

A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from
Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-
second latency for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows
appended to an unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold
incremental state values for cached stages.


Question #24

Topic 1

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
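
A quick illustration (the path is hypothetical and the value shown is the default): spark.sql.files.maxPartitionBytes caps how many bytes of file data are packed into a single Spark partition when files are read.

# Sketch: the setting that bounds Spark partition size at file-read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")   # 128 MB is the default
df = spark.read.format("parquet").load("/mnt/source/events")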


Question #25

Topic 1

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes
that the Min, Median, and Max Durations for tasks in a particular stage show the
minimum and median time to complete a task as roughly the same, but the max
duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?

A. Task queueing resulting from improper thread pool assignment.


B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the
source data
D. Skew caused by more data being assigned to a subset of spark-partitions.
E. Credential validation errors while pulling data from an external system.


Question #26

Topic 1

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160
total cores and only one Executor per VM.
Given a job with at least one wide transformation, which of the following cluster configurations will
result in maximum performance?

A. • Total VMs: 1
• 400 GB per Executor
• 160 Cores / Executor
B. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor
C. • Total VMs: 16
• 25 GB per Executor
• 10 Cores / Executor
D. • Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor
E. • Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor



Question #27

Topic 1

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table.
The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an
existing record?

A. They are merged.


B. They are ignored.
C. They are updated.
D. They are inserted.
E. They are deleted.
Question #28

Topic 1

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a
Type 1 table representing all of the values that have ever been valid for all rows in a bronze table
created with the property delta.enableChangeDataFeed = true. They plan to execute the following
code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

A. Each time the job is executed, newly updated records will be merged into the target table,
overwriting previous values with the same primary keys.
B. Each time the job is executed, the entire available history of inserted or updated records will
be appended to the target table, resulting in many duplicate entries.
C. Each time the job is executed, the target table will be overwritten using the entire history of
inserted or updated records, giving the desired result.
D. Each time the job is executed, the differences between the original and current versions are
calculated; this may result in duplicate entries for some records.
E. Each time the job is executed, only those records that have been inserted or updated since the
last execution will be appended to the target table, giving the desired result.
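
For context, a sketch of a batch read against the Change Data Feed of a CDF-enabled table (the starting version is an assumption); if a job always reads from the same starting point instead of tracking the last version it processed, every run returns the full available change history:

# Sketch: batch read of the Change Data Feed from a fixed starting version.
changes = (spark.read
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("bronze"))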



Question #29

Topic 1

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake, even though the field was present in the Kafka source. As a result, the field is also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A. The Delta log and Structured Streaming checkpoints record the full history of the Kafka
producer.
B. Delta Lake schema evolution can retroactively calculate the correct value for newly added
fields, as long as the data was in the original source.
C. Delta Lake automatically checks that all fields present in the source data are included in the
ingestion layer.
D. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not
possible under any circumstance.
E. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent,
replayable history of the data state.



Question #30

Topic 1

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to
manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

A. return spark.readStream.table("bronze")
B. return spark.readStream.load("bronze")

C.
D. return spark.read.option("readChangeFeed", "true").table ("bronze")

E.

Question #31

Topic 1

A junior data engineer is working to implement logic for a Lakehouse table named
silver_device_recordings. The source data contains 100 unique fields in a highly nested
JSON structure.
The silver_device_recordings table will be used downstream to power several
production monitoring dashboards and a production model. At present, 45 of the 100
fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema
declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks
that may impact their decision-making process?

A. The Tungsten encoding used by Databricks is optimized for storing string data;
newly-added native support for querying JSON strings means that string types
are always most efficient.
B. Because Delta Lake uses Parquet for data storage, data types can be easily
evolved by just modifying file footer information in place.
C. Human labor in writing code is the largest cost associated with data engineering
workloads; as such, automating table declaration logic should be a priority in all
migration workloads.
D. Because Databricks will infer schema using types that allow all observed data to
be processed, setting types manually provides greater assurance of data quality
enforcement.
E. Schema inference and evolution on Databricks ensure that inferred types will
always accurately match the data types used by downstream systems.


Question #32

Topic 1
The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source
tables has been de-duplicated and validated, which statement describes what will occur
when this code is executed?

A. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
B. The enriched_itemized_orders_by_account table will be overwritten using the
current valid version of data in each of the three tables referenced in the join
logic.
C. An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
D. An incremental job will detect if new rows have been written to any of the source
tables; if new rows are detected, all results will be recalculated and used to
overwrite the enriched_itemized_orders_by_account table.
E. No computation will occur until enriched_itemized_orders_by_account is queried;
upon query materialization, results will be calculated using the current valid
version of data in each of the three tables referenced in the join logic.


Question #33

Topic 1

The data engineering team is migrating an enterprise system with thousands of tables
and views into the Lakehouse. They plan to implement the target architecture using a
series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used
by production data engineering workloads, while silver tables will be used to support
both data engineering and machine learning workloads. Gold tables will largely serve
business intelligence and reporting purposes. While personal identifying information
(PII) exists in all tiers of data, pseudonymization and anonymization rules are in place
for all data at the silver and gold levels.
The organization is interested in reducing security concerns while maximizing the ability
to collaborate across diverse teams.
Which statement exemplifies best practices for implementing this system?

A. Isolating tables in separate databases based on data quality tiers allows for easy
permissions management through database ACLs and allows physical
separation of default storage locations for managed tables.
B. Because databases on Databricks are merely a logical construct, choices around
database organization do not impact security or discoverability in the Lakehouse.
C. Storing all production tables in a single database provides a unified view of all
data assets available throughout the Lakehouse, simplifying discoverability by
granting all users view privileges on this database.
D. Working in the default Databricks database provides the greatest security when
working with managed tables, as these will be created in the DBFS root.
E. Because all tables must live in the same storage containers used for the database
they're created in, organizations should be prepared to create between dozens
and thousands of databases depending on their data isolation requirements.



Question #34

Topic 1

The data architect has mandated that all tables in the Lakehouse should be configured
as external Delta Lake tables.
Which approach will ensure that this requirement is met?

A. Whenever a database is being created, make sure that the LOCATION keyword
is used
B. When configuring an external data warehouse for all table storage, leverage
Databricks for all ELT.
C. Whenever a table is being created, make sure that the LOCATION keyword is
used.
D. When tables are created, make sure that the EXTERNAL keyword is used in the
CREATE TABLE statement.
E. When the workspace is being configured, make sure that external cloud object
storage has been mounted.
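
For reference, a minimal sketch of an external Delta table created with the LOCATION keyword (the table name and path are hypothetical); supplying a location means the data files live at that path rather than in the database's default managed storage:

# Sketch: the LOCATION clause makes the table external.
spark.sql("""
  CREATE TABLE sales_external (id BIGINT, amount DOUBLE)
  USING DELTA
  LOCATION '/mnt/external_storage/sales'
""")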


Question #35

Topic 1

To reduce storage and compute costs, the data engineering team has been tasked with
curating a series of aggregate tables leveraged by business intelligence dashboards,
customer-facing applications, production machine learning models, and ad hoc
analytical queries.
The data engineering team has been made aware of new requirements from a
customer-facing application, which is the only downstream workload they manage
entirely. As a result, an aggregate table used by numerous teams across the
organization will need to have a number of fields renamed, and additional fields will also
be added.
Which of the solutions addresses the situation while minimally interrupting other teams
in the organization without increasing the number of tables that need to be managed?
A. Send all users notice that the schema for the table will be changing; include in the
communication the logic necessary to revert the new table schema to match
historic queries.
B. Configure a new table with all the requisite fields and new names and use this as
the source for the customer-facing application; create a view that maintains the
original data schema and table name by aliasing select fields from the new table.
C. Create a new table with the required schema and new fields and use Delta Lake's
deep clone functionality to sync up changes committed to one table to the
corresponding table.
D. Replace the current table definition with a logical view defined with the query logic
currently writing the aggregate table; create a new table to power the customer-
facing application.
E. Add a table comment warning all users that the table schema and field names will
be changing on a given date; overwrite the table in place to the specifications of
the customer-facing application.


Question #36

Topic 1

A Delta Lake table representing metadata about content posts from users has the
following schema: user_id LONG, post_text STRING, post_id STRING, longitude
FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter:
longitude < 20 & longitude > -20
Which statement describes how data will be filtered?

A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
B. No file skipping will occur because the optimizer does not know the relationship
between the partition column and the longitude.
C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
D. Statistics in the Delta Log will be used to identify data files that might include
records in the filtered range.
E. The Delta Engine will scan the parquet file footers to identify each row that meets
the filter criteria.


Question #37

Topic 1

A small company based in the United States has recently contracted a consulting firm in
India to implement several new data engineering pipelines to power artificial intelligence
applications. All the company's data is stored in regional cloud storage in the United
States.
The workspace administrator at the company is uncertain about where the Databricks
workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement
accurately informs this decision?

A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines
must be deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the
decision should be made based upon what is most convenient for the workspace
administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever
possible, compute should be deployed in the same region the data is stored.
D. Databricks leverages user workstations as the driver during interactive
development; as such, users should always use a workspace deployed in a
region they are physically near.
E. Databricks notebooks send all executable code from the user’s browser to virtual
machines over the open internet; whenever possible, choosing a workspace
region near the end users is the most secure.

Question #38

Topic 1

The downstream consumers of a Delta Lake table have been complaining about data
quality issues impacting performance in their applications. Specifically, they have
complained that invalid latitude and longitude values in the activity_details table have
been breaking their ability to use other geolocation processes.
A junior engineer has written the following code to add CHECK constraints to the Delta
Lake table:

A senior engineer has confirmed the above logic is correct and the valid ranges for
latitude and longitude are provided, but the code fails when executed.
Which statement explains the cause of this failure?

A. Because another team uses this table to support a frequently running application,
two-phase locking is preventing the operation from committing.
B. The activity_details table already exists; CHECK constraints can only be added
during initial table creation.
C. The activity_details table already contains records that violate the constraints; all
existing data must pass CHECK constraints in order to add them to an existing
table.
D. The activity_details table already contains records; CHECK constraints can only
be added prior to inserting values into a table.
E. The current table schema does not contain the field valid_coordinates; schema
evolution will need to be enabled before altering the table to add a constraint.
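
For reference, ALTER TABLE statements of this general shape are what the junior engineer's code would contain (the constraint names and exact bounds shown are assumptions); adding a CHECK constraint to an existing Delta table requires every existing row to satisfy it:

# Sketch: adding CHECK constraints to an existing table.
spark.sql("""
  ALTER TABLE activity_details
  ADD CONSTRAINT valid_latitude CHECK (latitude >= -90 AND latitude <= 90)
""")
spark.sql("""
  ALTER TABLE activity_details
  ADD CONSTRAINT valid_longitude CHECK (longitude >= -180 AND longitude <= 180)
""")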



Question #39

Topic 1

Which of the following is true of Delta Lake and the Lakehouse?

A. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
B. Delta Lake automatically collects statistics on the first 32 columns of each table
which are leveraged in data skipping based on query filters.
C. Views in the Lakehouse maintain a valid cache of the most recent versions of
source tables at all times.
D. Primary and foreign key constraints can be leveraged to ensure duplicate values
are never entered into a dimension table.
E. Z-order can only be applied to numeric values stored in Delta Lake tables.


Question #40

Topic 1

The view updates represents an incremental batch of all newly ingested data to be
inserted or updated in the customers table.
The following logic is used to process these records.
Which statement describes this implementation?

A. The customers table is implemented as a Type 3 table; old values are maintained
as a new column alongside the current value.
B. The customers table is implemented as a Type 2 table; old values are maintained
but marked as no longer current and new values are inserted.
C. The customers table is implemented as a Type 0 table; all writes are append only
with no changes to existing values.
D. The customers table is implemented as a Type 1 table; old values are overwritten
by new values and no history is maintained.
E. The customers table is implemented as a Type 2 table; old values are overwritten
and new customers are appended

Question #41

Topic 1

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to
the team and has requested access to one of these notebooks to review the production
logic.
What are the maximum notebook permissions that can be granted to the user without
allowing accidental changes to production code or data?
A. Can Manage
B. Can Edit
C. No permissions
D. Can Read
E. Can Run


Question #42

Topic 1

A table named user_ltv is being used to create a view that will be used by data analysts
on various teams. Users in the workspace are configured into groups, which are used
for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv


Which statement describes the results returned by this query?

A. Three columns will be returned, but one column will be named "REDACTED" and
contain only null values.
B. Only the email and ltv columns will be returned; the email column will contain all
null values.
C. The email and ltv columns will be returned with the values in user_ltv.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the
string "REDACTED" in each row.


Question #43

Topic 1

The data governance team has instituted a requirement that all tables containing
Personal Identifiable Information (PII) must be clearly annotated. This includes adding
column comments, table comments, and setting the custom table property
"contains_pii" = true.
The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been
met?

A. DESCRIBE EXTENDED dev.pii_test


B. DESCRIBE DETAIL dev.pii_test
C. SHOW TBLPROPERTIES dev.pii_test
D. DESCRIBE HISTORY dev.pii_test
E. SHOW TABLES dev
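
For reference, a sketch of a table carrying all three required annotations, followed by a command that surfaces column comments, the table comment, and table properties together (the column definitions are hypothetical):

# Sketch: column comment, table comment, and custom table property in one DDL.
spark.sql("""
  CREATE TABLE dev.pii_test (id BIGINT, name STRING COMMENT 'PII')
  COMMENT 'Contains PII'
  TBLPROPERTIES ('contains_pii' = true)
""")

# DESCRIBE EXTENDED returns the column-level comments plus the detailed table
# information section, which includes the table comment and table properties.
display(spark.sql("DESCRIBE EXTENDED dev.pii_test"))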


Question #44
Topic 1

The data governance team is reviewing code used for deleting records for compliance
with GDPR. They note the following logic is used to delete records from the Delta Lake
table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all
users that have requested deletion, which statement describes whether successfully
executing the above logic guarantees that the records to be deleted are no longer
accessible and why?

A. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command
succeeded fully and permanently purged these records.
B. No; the Delta cache may return records from previous versions of the table until
the cluster is restarted.
C. Yes; the Delta cache immediately updates to reflect the latest data files recorded
to disk.
D. No; the Delta Lake DELETE command only provides ACID guarantees when
combined with the MERGE INTO command.
E. No; files containing deleted records may still be accessible with time travel until a
VACUUM command is used to remove invalidated data files.
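
The delete logic itself is not shown in this dump; one common MERGE-based form, together with the VACUUM that eventually removes the underlying files, is sketched below (only the users and delete_requests names come from the question):

# Sketch: delete every user present in delete_requests.
spark.sql("""
  MERGE INTO users AS u
  USING delete_requests AS d
  ON u.user_id = d.user_id
  WHEN MATCHED THEN DELETE
""")

# The rows vanish from the current table version immediately, but the underlying
# data files stay reachable via time travel until VACUUM removes files older than
# the retention threshold (default 7 days / 168 hours).
spark.sql("VACUUM users RETAIN 168 HOURS")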


Question #45

Topic 1

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:
After the database was successfully created and permissions configured, a member of
the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement
describes how the tx_sales table will be created?

A. A logical table will persist the query plan to the Hive Metastore in the Databricks
control plane.
B. An external table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
C. A logical table will persist the physical plan to the Hive Metastore in the
Databricks control plane.
D. A managed table will be created in the storage container mounted to
/mnt/finance_eda_bucket.
E. A managed table will be created in the DBFS root storage container.



Question #46

Topic 1

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored here and which users have access to these secrets.
Which statement describes a limitation of Databricks Secrets?
A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this
hash will display the value in plain text.
B. Account administrators can see all secrets in plain text by logging on to the
Databricks Accounts console.
C. Secrets are stored in an administrators-only table within the Hive Metastore;
database administrators have permission to query this table by default.
D. Iterating through a stored secret and printing each character will display secret
contents in plain text.
E. The Databricks REST API can be used to list secrets in plain text if the personal
access token has proper credentials.
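
A short illustration of one documented caveat with secret redaction (the scope and key names are hypothetical): redaction works by matching the literal secret string in notebook output, so printing the value character by character displays it in plain text.

# Sketch: redaction hides the whole secret, not its individual characters.
secret = dbutils.secrets.get(scope="db-credentials", key="jdbc-password")
print(secret)              # displayed redacted
for ch in secret:
    print(ch, end=" ")     # each character appears in plain text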


Question #47

Topic 1

What statement is true regarding the retention of job run history?

A. It is retained until you export or delete job run logs


B. It is retained for 30 days, during which time you can deliver job run logs to DBFS
or S3
C. It is retained for 60 days, during which you can export notebook run results to
HTML
D. It is retained for 60 days, after which logs are archived
E. It is retained for 90 days or until the run-id is re-used through custom run
configuration


Question #48

Topic 1

A data engineer, User A, has promoted a new pipeline to production by using the REST
API to programmatically create several jobs. A DevOps engineer, User B, has
configured an external orchestration tool to trigger job runs through the REST API. Both
users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these
events?

A. Because the REST API was used for job creation and triggering runs, a Service
Principal will be automatically used to identify these events.
B. Because User B last configured the jobs, their identity will be associated with both
the job creation events and the job run events.
C. Because these events are managed separately, User A will have their identity
associated with the job creation events and User B will have their identity
associated with the job run events.
D. Because the REST API was used for job creation and triggering runs, user
identity will not be captured in the audit logs.
E. Because User A created the jobs, their identity will be associated with both the job
creation events and the job run events.


Question #49

Topic 1

A user new to Databricks is trying to troubleshoot long execution times for some
pipeline logic they are working on. Presently, the user is executing code cell-by-cell,
using display() calls to confirm code is producing the logically correct results as new
transformations are added to an operation. To get a measure of average time to
execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is
likely to perform in production?

A. Scala is the only language that can be accurately tested using interactive
notebooks; because the best performance is achieved by using Scala code
compiled to JARs, all PySpark and Spark SQL logic should be refactored.
B. The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
C. Production code development should only be done using an IDE; executing code
against a local build of open source Spark and Delta Lake will provide the most
accurate benchmarks for how code will perform in production.
D. Calling display() forces a job to trigger, while many transformations will only add
to the logical query plan; because of caching, repeated execution of the same
logic does not provide meaningful results.
E. The Jobs UI should be leveraged to occasionally run the notebook as a job and
track execution time during incremental code development because Photon can
only be enabled on clusters launched for scheduled jobs.


Question #50

Topic 1

A production cluster has 3 executor nodes and uses the same virtual machine type for
the driver and executor.
When evaluating the Ganglia Metrics for this cluster, which indicator would signal a
bottleneck caused by code executing on the driver?

A. The five-minute load average remains consistent/flat


B. Bytes Received never exceeds 80 million bytes per second
C. Total Disk Space remains constant
D. Network I/O never spikes
E. Overall cluster CPU utilization is around 25%



Question #51

Topic 1

Where in the Spark UI can one diagnose a performance problem induced by not
leveraging predicate push-down?
A. In the Executor’s log file, by grepping for "predicate push-down"
B. In the Stage’s Detail screen, in the Completed Stages table, by noting the size of
data read from the Input column
C. In the Storage Detail screen, by noting which RDDs are not stored on disk
D. In the Delta Lake transaction log, by noting the column statistics
E. In the Query Detail screen, by interpreting the Physical Plan


Question #52

Topic 1

Review the following error traceback:

Which statement describes the error being raised?


A. The code executed was PySpark but was executed in a Scala notebook.
B. There is no column in the table named heartrateheartrateheartrate
C. There is a type error because a column object cannot be multiplied.
D. There is a type error because a DataFrame object cannot be multiplied.
E. There is a syntax error because the heartrate column is not correctly identified as
a column.


Question #53

Topic 1

Which distribution does Databricks support for installing custom Python code
packages?

A. sbt
B. CRAN
C. npm
D. Wheels
E. jars


Question #54

Topic 1

Which Python variable contains a list of directories to be searched when trying to locate
required modules?

A. importlib.resource_path
B. sys.path
C. os.path
D. pypi.path
E. pylib.source

Question #55

Topic 1

Incorporating unit tests into a PySpark application requires upfront attention to the
design of your jobs, or a potentially significant refactoring of existing code.
Which statement describes a main benefit that offsets this additional effort?

A. Improves the quality of your data


B. Validates a complete use case of your application
C. Troubleshooting is easier since all steps are isolated and tested individually
D. Yields faster deployment and execution times
E. Ensures that all steps interact correctly to achieve the desired end result

