Kimball Data Modeling - Data Engineering Interviews
The data modeling interview separates data engineers who can solve business
problems efficiently from those who can’t. Deconstructing business requirements into
efficient data sets is the key skill you want to demonstrate throughout this entire interview.
Say the interviewer gives you these two source schemas to work with:
connection_events
o event_time TIMESTAMP
o sending_user_id BIGINT
o receiving_user_id BIGINT
o event_type STRING (values [“sent”, “reject”, “accept”])
o event_date DATE PARTITION
active_user_snapshot (contains one row for every active user
on snapshot_date)
o user_id BIGINT
o country STRING
o age INTEGER
o username STRING
o snapshot_date DATE PARTITION
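For concreteness, the DDL for these sources might look roughly like the following. This is a
sketch assuming a Hive/Spark-style warehouse; the exact dialect and column types are my
assumption, not something the interviewer would hand you.

  -- Assumed Hive/Spark-style DDL for the two source tables described above
  CREATE TABLE connection_events (
      event_time        TIMESTAMP,
      sending_user_id   BIGINT,
      receiving_user_id BIGINT,
      event_type        STRING  -- 'sent', 'reject', 'accept'
  )
  PARTITIONED BY (event_date DATE);

  CREATE TABLE active_user_snapshot (
      user_id  BIGINT,
      country  STRING,
      age      INT,
      username STRING
  )
  PARTITIONED BY (snapshot_date DATE);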
So how do you take these two schemas and create a data model that efficiently tracks
how many connections are sent, accepted, and rejected?
One of the first indicators here that you should lean into cumulative table design is
that active_user_snapshot does not have all the users for each snapshot_date.
users_cumulated
o user_id BIGINT
o dim_is_active_today BOOLEAN
o l7 INTEGER (how many days they were active in the last 7 days)
o active_datelist_int INTEGER (a binary integer that tracks the
monthly activity history, see this article on how to leverage
powerful data structures like this)
o dim_country STRING
o dim_age INTEGER
o partition_date DATE PARTITION
This table is populated by taking active_user_snapshot WHERE snapshot_date =
‘today’ and FULL OUTER JOINing it with users_cumulated WHERE partition_date =
‘yesterday’.
This table will have one row for every user each day, regardless of whether they are
active or not.
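Here is a sketch of what that daily build could look like, assuming a Spark-SQL-style engine
with bit_count available. The hard-coded dates are placeholders, and the bitmask arithmetic is
just one possible way to maintain active_datelist_int (bit 0 = today, bit 1 = yesterday, and so
on); the result would be written into today’s partition of users_cumulated.

  -- Sketch: FULL OUTER JOIN today's snapshot with yesterday's cumulated partition
  WITH today AS (
      SELECT * FROM active_user_snapshot WHERE snapshot_date = DATE '2023-11-02'
  ),
  yesterday AS (
      SELECT * FROM users_cumulated WHERE partition_date = DATE '2023-11-01'
  ),
  joined AS (
      SELECT
          COALESCE(t.user_id, y.user_id)      AS user_id,
          t.user_id IS NOT NULL               AS dim_is_active_today,
          -- shift yesterday's history left one day, set today's bit,
          -- and keep only the lowest 30 bits so it fits in an INTEGER
          (COALESCE(y.active_datelist_int, 0) * 2
             + CASE WHEN t.user_id IS NOT NULL THEN 1 ELSE 0 END) & 1073741823
                                              AS active_datelist_int,
          COALESCE(t.country, y.dim_country)  AS dim_country,
          COALESCE(t.age, y.dim_age)          AS dim_age
      FROM today t
      FULL OUTER JOIN yesterday y
          ON t.user_id = y.user_id
  )
  SELECT
      user_id,
      dim_is_active_today,
      bit_count(active_datelist_int & 127)    AS l7,  -- active days in the last 7
      active_datelist_int,
      dim_country,
      dim_age,
      DATE '2023-11-02'                       AS partition_date
  FROM joined;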
In the interview, you’ll probably need to come up with the schema above and a diagram
that looks something like this.
Great, now we have a good user dimension table to use further downstream.
The next thing we need to build is two tables: a daily dimension table
called daily_user_connections and a cumulative table
called user_connections_cumulated.
daily_user_connections
o sender_user_id BIGINT
o receiver_user_id BIGINT
o sent_event_time TIMESTAMP
o response_event_time TIMESTAMP (this is NULL if they have not
accepted or rejected)
o connection_status STRING [“accepted”, “rejected”,
“unanswered”]
o partition_date DATE PARTITION
user_connections_cumulated
o (the same schema except contains all historical connections)
The schemas above and a diagram that looks something like this would be expected in
the interview.
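Here is one way the daily build could look. This is a sketch under my own assumptions: that a
partition of daily_user_connections holds every request sent that day with any same-day
response (later responses get folded in when user_connections_cumulated is built
incrementally), and that accept/reject events carry the same sending_user_id and
receiving_user_id as the original request.

  -- Sketch: derive daily_user_connections from one day of connection_events
  WITH sent AS (
      SELECT sending_user_id, receiving_user_id, event_time
      FROM connection_events
      WHERE event_date = DATE '2023-11-02'
        AND event_type = 'sent'
  ),
  responses AS (
      SELECT sending_user_id, receiving_user_id, event_time, event_type
      FROM connection_events
      WHERE event_date = DATE '2023-11-02'
        AND event_type IN ('accept', 'reject')
  )
  SELECT
      s.sending_user_id   AS sender_user_id,
      s.receiving_user_id AS receiver_user_id,
      s.event_time        AS sent_event_time,
      r.event_time        AS response_event_time,  -- NULL if unanswered so far
      CASE r.event_type
          WHEN 'accept' THEN 'accepted'
          WHEN 'reject' THEN 'rejected'
          ELSE 'unanswered'
      END                 AS connection_status,
      DATE '2023-11-02'   AS partition_date
  FROM sent s
  LEFT JOIN responses r
      ON  s.sending_user_id   = r.sending_user_id
      AND s.receiving_user_id = r.receiving_user_id;
  -- (deduplicating repeated events is ignored here for brevity)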
What type of aggregates do we care about? You can use the upstream schemas as
hints. (They probably care about age and country). If you’re unsure what aggregates
matter, make sure to ask questions in the interview.
user_connections_aggregated
o dim_sender_country STRING
o dim_receiver_country STRING
o dim_sender_age_bucket STRING
o dim_receiver_age_bucket STRING
o m_num_users BIGINT
o m_num_requests BIGINT
o m_num_accepts BIGINT
o m_num_rejects BIGINT
o m_num_unanswered BIGINT
o aggregation_level STRING PARTITION KEY
o partition_date DATE PARTITION KEY
In this case we can generate this type of table by running a GROUPING SETS query
on top of a JOIN between user_connections_cumulated and users_cumulated. We
also want to bucketize age into categories like <18, 18-30, 30-50, etc. This lowers
the cardinality and makes the dashboards more performant.
The GROUPING SETS statement would probably look something like this:
(),
(dim_sender_country),
(dim_sender_country, dim_receiver_country),
(dim_sender_country, dim_sender_age_bucket),
(dim_receiver_country, dim_receiver_age_bucket)
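Put together, the aggregation query could look roughly like the sketch below. The age cutoffs,
the choice to join to the latest users_cumulated partition for sender and receiver dimensions,
and defining m_num_users as distinct senders are all my assumptions; deriving
aggregation_level from GROUPING() is left out for brevity.

  -- Sketch: join the cumulated connections to user dimensions, bucketize age,
  -- then aggregate at several grains with GROUPING SETS
  WITH enriched AS (
      SELECT
          c.sender_user_id,
          c.connection_status,
          s.dim_country AS dim_sender_country,
          r.dim_country AS dim_receiver_country,
          CASE WHEN s.dim_age < 18 THEN '<18'
               WHEN s.dim_age < 30 THEN '18-30'
               WHEN s.dim_age < 50 THEN '30-50'
               ELSE '50+' END AS dim_sender_age_bucket,
          CASE WHEN r.dim_age < 18 THEN '<18'
               WHEN r.dim_age < 30 THEN '18-30'
               WHEN r.dim_age < 50 THEN '30-50'
               ELSE '50+' END AS dim_receiver_age_bucket
      FROM user_connections_cumulated c
      JOIN users_cumulated s
        ON c.sender_user_id = s.user_id
       AND s.partition_date = DATE '2023-11-02'
      JOIN users_cumulated r
        ON c.receiver_user_id = r.user_id
       AND r.partition_date = DATE '2023-11-02'
      WHERE c.partition_date = DATE '2023-11-02'
  )
  SELECT
      dim_sender_country,
      dim_receiver_country,
      dim_sender_age_bucket,
      dim_receiver_age_bucket,
      COUNT(DISTINCT sender_user_id)                                AS m_num_users,
      COUNT(1)                                                      AS m_num_requests,
      COUNT(CASE WHEN connection_status = 'accepted'   THEN 1 END)  AS m_num_accepts,
      COUNT(CASE WHEN connection_status = 'rejected'   THEN 1 END)  AS m_num_rejects,
      COUNT(CASE WHEN connection_status = 'unanswered' THEN 1 END)  AS m_num_unanswered
  FROM enriched
  GROUP BY GROUPING SETS (
      (),
      (dim_sender_country),
      (dim_sender_country, dim_receiver_country),
      (dim_sender_country, dim_sender_age_bucket),
      (dim_receiver_country, dim_receiver_age_bucket)
  );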
A lot of the details here around GROUPING SETS aren’t needed in the interview
though. You just need to talk about the different grains and aggregates you need to
produce, not the nitty-gritty details I’m going over here.
The last piece of the puzzle is coming up with metrics based on this aggregate table.
There are easy ones like acceptance rate (m_num_accepts divided by m_num_requests)
and rejection rate (m_num_rejects divided by m_num_requests).
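For illustration, here is how a couple of these metrics could be read off the aggregate table.
The exact metric definitions and the 'sender_country' aggregation_level value are my
assumptions.

  -- Example metric query (assumed metric definitions and aggregation_level value)
  SELECT
      dim_sender_country,
      m_num_accepts * 1.0 / m_num_requests AS acceptance_rate,
      m_num_rejects * 1.0 / m_num_requests AS rejection_rate,
      m_num_requests * 1.0 / m_num_users   AS requests_per_user
  FROM user_connections_aggregated
  WHERE partition_date = DATE '2023-11-02'
    AND aggregation_level = 'sender_country';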
After defining a few important business metrics, you’ll end up with a diagram that looks
something like this:
Conclusion
If you can produce diagrams and schemas like the ones above and talk intelligently
about the tradeoffs, you’ll pass this interview round with ease.
I’ve found this interview round to be fun and engaging since the correct answer is
often ambiguous and requires a lot of back and forth with the interviewer!
If you want to learn more about data modeling and other critical data engineering
concepts, join my six week intensive course that covers everything from data modeling
to Kafka to Flink to Spark to Airflow and more!