
How to pass the data modeling round in big tech data engineering interviews

Zach Wilson
Sep 22


The data modeling interview separates data engineers who can solve business
problems efficiently from those who can’t. Deconstructing business requirements
into efficient data sets that solve those problems is the key skill you want to
demonstrate throughout this entire interview.

Knowledge of the following concepts will serve you very well:

 Asking the right questions!


o Remember, the interviewer will often give you more requirements
than are stated in the problem statement. It’s your job to discover
these requirements! The data modeling interview is 50% about
technicals and 50% about communication!
 Dimensional data modeling
o Cumulative dimensions versus daily dimensions (Maxime
Beauchemin covers this in-depth on his Medium)
o Slowly-changing dimension types
 Fact data modeling
o Kimball data modeling method
o Normalized versus denormalized facts
o One big table methodology
 Aggregate data modeling
o Daily metrics vs. cumulative metrics
o Enable rapid slice-and-dice of metrics with cube aggregates
 OLAP cubes are fundamental to rapid analytics.
 Your data gets crunched down to the number of combinations of dimension values.
 You can model the dimensions by time, product, location, etc. This enables you to “slice” the aggregates you care about.


Example - modeling user growth


Oftentimes in interviews you’ll be given a vague problem like, “Give me an efficient
data model that models user connection growth on a social media site. Here are
the upstream schemas:”

 connection_events
o event_time TIMESTAMP
o sending_user_id BIGINT
o receiving_user_id BIGINT
o event_type STRING (values [“sent”, “reject”, “accept”])
o event_date DATE PARTITION
 active_user_snapshot (contains one row for every active user
on snapshot_date)
o user_id BIGINT
o country STRING
o age INTEGER
o username STRING
o snapshot_date DATE PARTITION
So how do you take these two schemas and create a data model that efficiently tracks
how many connections are sent, accepted, and rejected?

One of the first indicators here that you should lean into cumulative table design is
that active_user_snapshot does not contain every user for each snapshot_date, only
the ones who were active that day.

So to start off, you’d want to cumulate active_user_snapshot into a table
called users_cumulated.

The schema of this table might look something like:

 users_cumulated
o user_id BIGINT
o dim_is_active_today BOOLEAN
o l7 INTEGER (how many days were they active in the last 7 days)
o active_datelist_int INTEGER (an integer used as a bitmask that tracks the
monthly activity history; see this article on how to leverage
powerful data structures like this)
o dim_country STRING
o dim_age INTEGER
o partition_date DATE PARTITION
This table is populated by taking active_user_snapshot WHERE snapshot_date =
‘today’ and FULL OUTER JOINing it with users_cumulated WHERE partition_date =
‘yesterday’.

This table will have one row for every user each day, regardless of whether they are
active or not.
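A minimal SQL sketch of that cumulation step, assuming a Spark-style dialect (the hard-coded dates are placeholders, and bit_count is a Spark SQL 3.x builtin; other engines have equivalents):

-- Result set for today's partition of users_cumulated: FULL OUTER JOIN
-- yesterday's cumulated rows with today's active_user_snapshot.
WITH today AS (
    SELECT * FROM active_user_snapshot WHERE snapshot_date = DATE '2023-09-22'
),
yesterday AS (
    SELECT * FROM users_cumulated WHERE partition_date = DATE '2023-09-21'
)
SELECT
    COALESCE(t.user_id, y.user_id)      AS user_id,
    t.user_id IS NOT NULL               AS dim_is_active_today,
    -- l7 = count of set bits among the 7 most recent days of the shifted bitmask
    bit_count((COALESCE(y.active_datelist_int, 0) * 2
        + CASE WHEN t.user_id IS NOT NULL THEN 1 ELSE 0 END) % 128) AS l7,
    -- shift the activity bitmask left one day and set the low bit if active today
    -- (a real pipeline would also cap or reset this at month boundaries)
    COALESCE(y.active_datelist_int, 0) * 2
        + CASE WHEN t.user_id IS NOT NULL THEN 1 ELSE 0 END         AS active_datelist_int,
    COALESCE(t.country, y.dim_country)  AS dim_country,
    COALESCE(t.age, y.dim_age)          AS dim_age,
    DATE '2023-09-22'                   AS partition_date
FROM today t
FULL OUTER JOIN yesterday y
    ON t.user_id = y.user_id

Writing this result into the ‘2023-09-22’ partition of users_cumulated each day is what gives you the one-row-per-user-per-day property described above.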
In the interview, you’ll probably need to come up with the schema above and a diagram
that looks something like this.

Great, now we have a good user dimension table to use further downstream.
Next, we need to build two tables: one daily dimension table
called daily_user_connections and one cumulative table
called user_connections_cumulated.

We create daily_user_connections for the purpose of simpler backfills in the future
and fast daily-level analytics.

You can imagine a schema like

 daily_user_connections
o sender_user_id BIGINT
o receiver_user_id BIGINT
o sent_event_time TIMESTAMP
o response_event_time TIMESTAMP (this is NULL if they have not
accepted or rejected)
o connection_status STRING [“accepted”, “rejected”,
“unanswered”]
o partition_date DATE PARTITION
 user_connections_cumulated
o (the same schema except contains all historical connections)
The schemas above and a diagram that looks something like this would be expected in
the interview.
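A hedged sketch of how daily_user_connections could be derived from connection_events (Spark-style SQL again; the assumption that accept/reject events carry the same sending_user_id/receiving_user_id pair as the original request, and that there is at most one response per request, is mine rather than something the schema guarantees):

-- One row per connection request sent today, with its response (if any).
WITH sent AS (
    SELECT sending_user_id, receiving_user_id, event_time AS sent_event_time
    FROM connection_events
    WHERE event_date = DATE '2023-09-22' AND event_type = 'sent'
),
responses AS (
    SELECT sending_user_id, receiving_user_id, event_type,
           event_time AS response_event_time
    FROM connection_events
    WHERE event_date = DATE '2023-09-22' AND event_type IN ('accept', 'reject')
)
SELECT
    s.sending_user_id                  AS sender_user_id,
    s.receiving_user_id                AS receiver_user_id,
    s.sent_event_time,
    r.response_event_time,             -- NULL when no accept/reject yet
    CASE r.event_type
        WHEN 'accept' THEN 'accepted'
        WHEN 'reject' THEN 'rejected'
        ELSE 'unanswered'
    END                                AS connection_status,
    DATE '2023-09-22'                  AS partition_date
FROM sent s
LEFT JOIN responses r
    ON  s.sending_user_id = r.sending_user_id
    AND s.receiving_user_id = r.receiving_user_id

Responses that arrive on a later event_date would be missed by this same-day query, which is exactly why user_connections_cumulated exists: you’d FULL OUTER JOIN today’s daily partition with yesterday’s cumulated partition and keep the most recent connection_status, the same pattern used for users_cumulated.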

Now that we have daily_user_connections, user_connections_cumulated,
and users_cumulated, we are ready to JOIN these tables together to create
aggregate tables and analytical cubes.

What type of aggregates do we care about? You can use the upstream schemas as
hints. (They probably care about age and country). If you’re unsure what aggregates
matter, make sure to ask questions in the interview.

If we join user_connections_cumulated to users_cumulated on
both sender_user_id and receiver_user_id (one join per side), we get a new
schema. Let’s call it user_connections_aggregated:

 user_connections_aggregated
o dim_sender_country STRING
o dim_receiver_country STRING
o dim_sender_age_bucket STRING
o dim_receiver_age_bucket STRING
o m_num_users BIGINT
o m_num_requests BIGINT
o m_num_accepts BIGINT
o m_num_rejects BIGINT
o m_num_unanswered BIGINT
o aggregation_level STRING PARTITION KEY
o partition_date DATE PARTITION KEY
In this case we can generate this type of table by doing a GROUPING SETS query
on top of a JOIN between user_connections_cumulated and users_cumulated. We
also want to bucketize age into categories like <18, 18-30, 30-50, etc. This lowers
the cardinality and makes the dashboards more performant.

The GROUPING SETS statement would probably look something like this:

GROUP BY GROUPING SETS (
    (),
    (dim_sender_country),
    (dim_sender_country, dim_receiver_country),
    (dim_sender_country, dim_sender_age_bucket),
    (dim_receiver_country, dim_receiver_age_bucket),
    (dim_sender_country, dim_receiver_country, dim_sender_age_bucket, dim_receiver_age_bucket)
)

The aggregation_level partition is determined by which columns we are grouping on. If
it’s just dim_sender_country, then the aggregation_level would be the string literal
‘dim_sender_country’. At Facebook, they concatenated the aggregation_levels with ‘__’,
so if you grouped on dim_sender_country and dim_receiver_country, the
aggregation_level would be ‘dim_sender_country__dim_receiver_country’.
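For completeness, here is a hedged sketch of the whole cube build (Spark-style SQL; the age bucket edges, the COUNT(DISTINCT sender_user_id) definition of m_num_users, and the GROUPING()/CONCAT_WS trick for deriving aggregation_level are all assumptions of mine, not a prescribed implementation):

WITH joined AS (
    SELECT
        s.dim_country AS dim_sender_country,
        r.dim_country AS dim_receiver_country,
        -- bucketing ages keeps cardinality low (assumed bucket edges)
        CASE WHEN s.dim_age < 18 THEN '<18'
             WHEN s.dim_age < 30 THEN '18-30'
             WHEN s.dim_age < 50 THEN '30-50'
             ELSE '50+' END AS dim_sender_age_bucket,
        CASE WHEN r.dim_age < 18 THEN '<18'
             WHEN r.dim_age < 30 THEN '18-30'
             WHEN r.dim_age < 50 THEN '30-50'
             ELSE '50+' END AS dim_receiver_age_bucket,
        c.sender_user_id,
        c.connection_status
    FROM user_connections_cumulated c
    -- assumption: pick up each user's latest dims from today's cumulated partition
    JOIN users_cumulated s
        ON c.sender_user_id = s.user_id AND s.partition_date = DATE '2023-09-22'
    JOIN users_cumulated r
        ON c.receiver_user_id = r.user_id AND r.partition_date = DATE '2023-09-22'
    WHERE c.partition_date = DATE '2023-09-22'
)
SELECT
    dim_sender_country,
    dim_receiver_country,
    dim_sender_age_bucket,
    dim_receiver_age_bucket,
    COUNT(DISTINCT sender_user_id)                                    AS m_num_users,
    COUNT(1)                                                          AS m_num_requests,
    SUM(CASE WHEN connection_status = 'accepted'   THEN 1 ELSE 0 END) AS m_num_accepts,
    SUM(CASE WHEN connection_status = 'rejected'   THEN 1 ELSE 0 END) AS m_num_rejects,
    SUM(CASE WHEN connection_status = 'unanswered' THEN 1 ELSE 0 END) AS m_num_unanswered,
    -- GROUPING(col) = 0 when the column is part of the current grouping set;
    -- CONCAT_WS skips the NULLs produced by the CASEs with no ELSE branch.
    CONCAT_WS('__',
        CASE WHEN GROUPING(dim_sender_country) = 0      THEN 'dim_sender_country' END,
        CASE WHEN GROUPING(dim_receiver_country) = 0    THEN 'dim_receiver_country' END,
        CASE WHEN GROUPING(dim_sender_age_bucket) = 0   THEN 'dim_sender_age_bucket' END,
        CASE WHEN GROUPING(dim_receiver_age_bucket) = 0 THEN 'dim_receiver_age_bucket' END
    ) AS aggregation_level
FROM joined
GROUP BY GROUPING SETS (
    (),
    (dim_sender_country),
    (dim_sender_country, dim_receiver_country),
    (dim_sender_country, dim_sender_age_bucket),
    (dim_receiver_country, dim_receiver_age_bucket),
    (dim_sender_country, dim_receiver_country, dim_sender_age_bucket, dim_receiver_age_bucket)
)

The global grouping set () comes out with an empty aggregation_level here; you’d probably relabel it to something like ‘overall’ before exposing the cube to analysts.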

A lot of the details here around GROUPING SETS aren’t needed in the interview,
though. You just need to talk about the different grains and aggregates you need to
produce, not the nitty-gritty details I’m going over here.

Aside from the schema for user_connections_aggregated, you’ll need to produce a
diagram that looks something like this:

You’ll see that user_connections_aggregated is an OLAP cube that is ready to be
sliced and diced by analysts!

The last piece of the puzzle is coming up with metrics based on this aggregate table.
There are easy ones like:

 rejection_rate, defined as m_num_rejects / m_num_requests
 acceptance_rate, defined as m_num_accepts / m_num_requests
 connections_per_user, defined as m_num_accepts / m_num_users
You don’t have to come up with a ton of metrics during the interview, but a few are
enough to wrap up the modeling exercise!
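A quick sketch of how those metrics might be read off the cube (same assumed dialect; NULLIF just guards against division by zero, and pinning aggregation_level keeps the ratios at a single grain):

SELECT
    dim_sender_country,
    m_num_rejects / CAST(NULLIF(m_num_requests, 0) AS DOUBLE) AS rejection_rate,
    m_num_accepts / CAST(NULLIF(m_num_requests, 0) AS DOUBLE) AS acceptance_rate,
    m_num_accepts / CAST(NULLIF(m_num_users, 0)    AS DOUBLE) AS connections_per_user
FROM user_connections_aggregated
WHERE partition_date = DATE '2023-09-22'
  AND aggregation_level = 'dim_sender_country'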

After defining a few important business metrics, you’ll end up with a diagram that looks
something like this:

Conclusion
If you can produce diagrams and schemas like the ones above and talk intelligently
about the tradeoffs, you’ll pass this interview round with ease.
I’ve found this round to be fun and engaging since the correct answer is
often ambiguous and requires a lot of back and forth with the interviewer!
If you want to learn more about data modeling and other critical data engineering
concepts, join my six-week intensive course that covers everything from data modeling
to Kafka to Flink to Spark to Airflow and more!
