Kimball Data Modeling - Data Engineering Interviews
The data modeling interview separates data engineers who can solve business
problems efficiently from those who can’t. Deconstructing business requirements into
efficient data sets is the key skill you want to demonstrate throughout this entire interview.
Say the interviewer gives you these two source schemas to work with:
connection_events
o event_time TIMESTAMP
o sending_user_id BIGINT
o receiving_user_id BIGINT
o event_type STRING (values [“sent”, “reject”, “accept”])
o event_date DATE PARTITION
active_user_snapshot (contains one row for every active user
on snapshot_date)
o user_id BIGINT
o country STRING
o age INTEGER
o username STRING
o snapshot_date DATE PARTITION
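For concreteness, the DDL for these sources might look roughly like the following. This is a
sketch assuming a Hive/Spark-style warehouse; the exact dialect and column types are my
assumption, not something the interviewer would hand you.

  -- Assumed Hive/Spark-style DDL for the two source tables described above
  CREATE TABLE connection_events (
      event_time        TIMESTAMP,
      sending_user_id   BIGINT,
      receiving_user_id BIGINT,
      event_type        STRING  -- 'sent', 'reject', 'accept'
  )
  PARTITIONED BY (event_date DATE);

  CREATE TABLE active_user_snapshot (
      user_id  BIGINT,
      country  STRING,
      age      INT,
      username STRING
  )
  PARTITIONED BY (snapshot_date DATE);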
So how do you take these two schemas and create a data model that efficiently tracks
how many connections are sent, accepted, and rejected?
One of the first indicators here that you should lean into cumulative table design is
that active_user_snapshot does not have all the users for each snapshot_date.
users_cumulated
o user_id BIGINT
o dim_is_active_today BOOLEAN
o l7 INTEGER (how many days they were active in the last 7 days)
o active_datelist_int INTEGER (a binary integer that tracks the
monthly activity history, see this article on how to leverage
powerful data structures like this)
o dim_country STRING
o dim_age INTEGER
o partition_date DATE PARTITION
This table is populated by taking active_user_snapshot WHERE snapshot_date =
‘today’ and FULL OUTER JOINing it with users_cumulated WHERE partition_date =
‘yesterday’.
This table will have one row for every user each day, regardless of whether they are
active or not.
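Here is a sketch of what that daily build could look like, assuming a Spark-SQL-style engine
with bit_count available. The hard-coded dates are placeholders, and the bitmask arithmetic is
just one possible way to maintain active_datelist_int (bit 0 = today, bit 1 = yesterday, and so
on); the result would be written into today’s partition of users_cumulated.

  -- Sketch: FULL OUTER JOIN today's snapshot with yesterday's cumulated partition
  WITH today AS (
      SELECT * FROM active_user_snapshot WHERE snapshot_date = DATE '2023-11-02'
  ),
  yesterday AS (
      SELECT * FROM users_cumulated WHERE partition_date = DATE '2023-11-01'
  ),
  joined AS (
      SELECT
          COALESCE(t.user_id, y.user_id)      AS user_id,
          t.user_id IS NOT NULL               AS dim_is_active_today,
          -- shift yesterday's history left one day, set today's bit,
          -- and keep only the lowest 30 bits so it fits in an INTEGER
          (COALESCE(y.active_datelist_int, 0) * 2
             + CASE WHEN t.user_id IS NOT NULL THEN 1 ELSE 0 END) & 1073741823
                                              AS active_datelist_int,
          COALESCE(t.country, y.dim_country)  AS dim_country,
          COALESCE(t.age, y.dim_age)          AS dim_age
      FROM today t
      FULL OUTER JOIN yesterday y
          ON t.user_id = y.user_id
  )
  SELECT
      user_id,
      dim_is_active_today,
      bit_count(active_datelist_int & 127)    AS l7,  -- active days in the last 7
      active_datelist_int,
      dim_country,
      dim_age,
      DATE '2023-11-02'                       AS partition_date
  FROM joined;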
In the interview, you’ll probably need to come up with the schema above and a diagram
that looks something like this.
Great, now we have a good user dimension table to use further downstream.
The next thing we need to build is two tables: a daily dimension table
called daily_user_connections and a cumulative table
called user_connections_cumulated.
daily_user_connections
o sender_user_id BIGINT
o receiver_user_id BIGINT
o sent_event_time TIMESTAMP
o response_event_time TIMESTAMP (this is NULL if they have not
accepted or rejected)
o connection_status STRING [“accepted”, “rejected”,
“unanswered”]
o partition_date DATE PARTITION
user_connections_cumulated
o (the same schema except contains all historical connections)
The schemas above and a diagram that looks something like this would be expected in
the interview.
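Here is one way the daily build could look. This is a sketch under my own assumptions: that a
partition of daily_user_connections holds every request sent that day with any same-day
response (later responses get folded in when user_connections_cumulated is built
incrementally), and that accept/reject events carry the same sending_user_id and
receiving_user_id as the original request.

  -- Sketch: derive daily_user_connections from one day of connection_events
  WITH sent AS (
      SELECT sending_user_id, receiving_user_id, event_time
      FROM connection_events
      WHERE event_date = DATE '2023-11-02'
        AND event_type = 'sent'
  ),
  responses AS (
      SELECT sending_user_id, receiving_user_id, event_time, event_type
      FROM connection_events
      WHERE event_date = DATE '2023-11-02'
        AND event_type IN ('accept', 'reject')
  )
  SELECT
      s.sending_user_id   AS sender_user_id,
      s.receiving_user_id AS receiver_user_id,
      s.event_time        AS sent_event_time,
      r.event_time        AS response_event_time,  -- NULL if unanswered so far
      CASE r.event_type
          WHEN 'accept' THEN 'accepted'
          WHEN 'reject' THEN 'rejected'
          ELSE 'unanswered'
      END                 AS connection_status,
      DATE '2023-11-02'   AS partition_date
  FROM sent s
  LEFT JOIN responses r
      ON  s.sending_user_id   = r.sending_user_id
      AND s.receiving_user_id = r.receiving_user_id;
  -- (deduplicating repeated events is ignored here for brevity)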
What type of aggregates do we care about? You can use the upstream schemas as
hints. (They probably care about age and country). If you’re unsure what aggregates
matter, make sure to ask questions in the interview.
user_connections_aggregated
o dim_sender_country STRING
o dim_receiver_country STRING
o dim_sender_age_bucket STRING
o dim_receiver_age_bucket STRING
o m_num_users BIGINT
o m_num_requests BIGINT
o m_num_accepts BIGINT
o m_num_rejects BIGINT
o m_num_unanswered BIGINT
o aggregation_level STRING PARTITION KEY
o partition_date DATE PARTITION KEY
In this case we can generate this type of table by running a GROUPING SETS query
on top of a JOIN between user_connections_cumulated and users_cumulated. We
also want to bucketize age into categories like <18, 18-30, 30-50, etc. This lowers
the cardinality and makes the dashboards more performant.
The GROUPING SETS statement would probably look something like this:
(),
(dim_sender_country),
(dim_sender_country, dim_receiver_country),
(dim_sender_country, dim_sender_age_bucket),
(dim_receiver_country, dim_receiver_age_bucket)
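Put together, the aggregation query could look roughly like the sketch below. The age cutoffs,
the choice to join to the latest users_cumulated partition for sender and receiver dimensions,
and defining m_num_users as distinct senders are all my assumptions; deriving
aggregation_level from GROUPING() is left out for brevity.

  -- Sketch: join the cumulated connections to user dimensions, bucketize age,
  -- then aggregate at several grains with GROUPING SETS
  WITH enriched AS (
      SELECT
          c.sender_user_id,
          c.connection_status,
          s.dim_country AS dim_sender_country,
          r.dim_country AS dim_receiver_country,
          CASE WHEN s.dim_age < 18 THEN '<18'
               WHEN s.dim_age < 30 THEN '18-30'
               WHEN s.dim_age < 50 THEN '30-50'
               ELSE '50+' END AS dim_sender_age_bucket,
          CASE WHEN r.dim_age < 18 THEN '<18'
               WHEN r.dim_age < 30 THEN '18-30'
               WHEN r.dim_age < 50 THEN '30-50'
               ELSE '50+' END AS dim_receiver_age_bucket
      FROM user_connections_cumulated c
      JOIN users_cumulated s
        ON c.sender_user_id = s.user_id
       AND s.partition_date = DATE '2023-11-02'
      JOIN users_cumulated r
        ON c.receiver_user_id = r.user_id
       AND r.partition_date = DATE '2023-11-02'
      WHERE c.partition_date = DATE '2023-11-02'
  )
  SELECT
      dim_sender_country,
      dim_receiver_country,
      dim_sender_age_bucket,
      dim_receiver_age_bucket,
      COUNT(DISTINCT sender_user_id)                                AS m_num_users,
      COUNT(1)                                                      AS m_num_requests,
      COUNT(CASE WHEN connection_status = 'accepted'   THEN 1 END)  AS m_num_accepts,
      COUNT(CASE WHEN connection_status = 'rejected'   THEN 1 END)  AS m_num_rejects,
      COUNT(CASE WHEN connection_status = 'unanswered' THEN 1 END)  AS m_num_unanswered
  FROM enriched
  GROUP BY GROUPING SETS (
      (),
      (dim_sender_country),
      (dim_sender_country, dim_receiver_country),
      (dim_sender_country, dim_sender_age_bucket),
      (dim_receiver_country, dim_receiver_age_bucket)
  );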
A lot of the details here around GROUPING SETS aren’t needed in the interview
though. You just need to talk about the different grains and aggregates you need to
produce, not the nitty-gritty details I’m going over here.
The last piece of the puzzle is coming up with metrics based on this aggregate table.
There are easy ones like acceptance rate (m_num_accepts divided by m_num_requests)
and rejection rate (m_num_rejects divided by m_num_requests).
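For illustration, here is how a couple of these metrics could be read off the aggregate table.
The exact metric definitions and the 'sender_country' aggregation_level value are my
assumptions.

  -- Example metric query (assumed metric definitions and aggregation_level value)
  SELECT
      dim_sender_country,
      m_num_accepts * 1.0 / m_num_requests AS acceptance_rate,
      m_num_rejects * 1.0 / m_num_requests AS rejection_rate,
      m_num_requests * 1.0 / m_num_users   AS requests_per_user
  FROM user_connections_aggregated
  WHERE partition_date = DATE '2023-11-02'
    AND aggregation_level = 'sender_country';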
After defining a few important business metrics, you’ll end up with a diagram that looks
something like this:
Conclusion
If you can produce diagrams and schemas like the ones above and talk intelligently
about the tradeoffs, you’ll pass this interview round with ease.
I’ve found this interview round to be fun and engaging since the correct answer is
often ambiguous and requires a lot of back and forth with the interviewer!
If you want to learn more about data modeling and other critical data engineering
concepts, join my six week intensive course that covers everything from data modeling
to Kafka to Flink to Spark to Airflow and more!