Random Question 1
SCD Type 1:
This is the simplest form of handling dimension changes.
In this approach, when a change occurs in an attribute, the existing record is directly updated,
effectively overwriting the old data with the new value.
This method does not maintain historical information and is suitable when historical data is not
critical or not required.
Eg: if a customer moves from London to Paris, the City value is simply overwritten with Paris and the old value is lost.
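A minimal PySpark sketch of this overwrite, assuming hypothetical customer DataFrames (the cust_id/name/city columns are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical current dimension rows and incoming changes
    dim = spark.createDataFrame([(1, "Alice", "London")],
        "cust_id INT, name STRING, city STRING")
    changes = spark.createDataFrame([(1, "Alice", "Paris")],
        "cust_id INT, name STRING, city STRING")

    # Type 1: overwrite old attribute values; no history is kept
    updated = (dim.alias("d")
        .join(changes.alias("c"), "cust_id", "left")
        .select("cust_id",
                F.coalesce("c.name", "d.name").alias("name"),
                F.coalesce("c.city", "d.city").alias("city")))
    updated.show()  # cust_id 1 now shows Paris; London is gone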
SCD Type 2:
SCD Type 2 maintains historical versions of dimension records by creating a new row for each change, effectively building a versioned history. This approach ensures that historical data remains intact and can be used for auditing or trend analysis.
Typical extra columns: start_date, end_date, and an is_current flag that marks the active version.
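A minimal PySpark sketch of expiring the old version and inserting a new one; the column names (cust_id, city, start_date, end_date, is_current) are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    dim = spark.createDataFrame(
        [(1, "London", "2020-01-01", None, True)],
        "cust_id INT, city STRING, start_date STRING, end_date STRING, is_current BOOLEAN")
    change = spark.createDataFrame([(1, "Paris")], "cust_id INT, city STRING")

    # close out the old version of every changed customer
    expired = (dim.join(change.select("cust_id"), "cust_id")
        .withColumn("end_date", F.current_date().cast("string"))
        .withColumn("is_current", F.lit(False)))

    # add a fresh current version for each change
    new_rows = (change
        .withColumn("start_date", F.current_date().cast("string"))
        .withColumn("end_date", F.lit(None).cast("string"))
        .withColumn("is_current", F.lit(True)))

    # sketch only: unchanged customers would also be unioned back in
    versioned = expired.unionByName(new_rows)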
SCD Type 3:
SCD Type 3 keeps track of limited historical information by maintaining previous values in separate columns.
It stores a limited number of changes and is commonly used when keeping only a few previous versions is sufficient.
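A minimal PySpark sketch with hypothetical current_city/previous_city columns, showing the old value shifting into the "previous" column:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    dim = spark.createDataFrame([(1, "London", None)],
        "cust_id INT, current_city STRING, previous_city STRING")
    change = spark.createDataFrame([(1, "Paris")], "cust_id INT, new_city STRING")

    # only one level of history survives in Type 3
    updated = (dim.join(change, "cust_id", "left")
        .select("cust_id",
                F.coalesce("new_city", "current_city").alias("current_city"),
                F.when(F.col("new_city").isNotNull(), F.col("current_city"))
                 .otherwise(F.col("previous_city")).alias("previous_city")))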
SCD Type 4:
In Type 4, the dimension table holds only the latest values, while the full history is maintained in a separate history table.
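A minimal PySpark sketch of the two-table layout; the table names dim_customer and dim_customer_history are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    latest = spark.createDataFrame([(1, "Paris")], "cust_id INT, city STRING")
    superseded = spark.createDataFrame([(1, "London")], "cust_id INT, city STRING")

    # the main dimension table keeps only the latest values ...
    latest.write.mode("overwrite").saveAsTable("dim_customer")
    # ... while replaced versions are appended to a separate history table
    superseded.write.mode("append").saveAsTable("dim_customer_history")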
OLTP vs OLAP
Database workloads broadly fall into one of two processing styles:
OLTP
o On-Line Transaction Processing
o Handles a huge number of small online transactions (INSERT, UPDATE, DELETE).
OLAP
o On-Line Analytical Processing
o Data retrieval for decision support
o Low volume of transactions
o Queries are frequently very complex and contain aggregations.
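For contrast, an OLAP-style query in PySpark: it scans many rows of a hypothetical sales DataFrame and returns a small aggregated result:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    sales = spark.createDataFrame(
        [("2024-01-01", "Books", 120.0), ("2024-01-01", "Toys", 80.0)],
        "sale_date STRING, category STRING, amount DOUBLE")

    # OLAP-style: aggregate many detail rows down to a summary
    (sales.groupBy("category")
          .agg(F.sum("amount").alias("total_sales"),
               F.count("*").alias("num_orders"))
          .show())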
CAP Theorem
The CAP theorem states that a distributed system with data replication cannot guarantee all three of the desirable properties at the same time: consistency, availability, and partition tolerance.
Fact and dimension modeling
Fact Data Modeling:
Think of fact tables as the place where you store measurable data about business events, like
sales or orders. Each row in a fact table records a specific event or transaction.
Example: In a sales fact table, each row could represent a sale, with columns like Sale Amount,
Quantity Sold, and links to other tables for Date, Customer ID, and Product ID.
Dimension tables provide context to the facts by storing descriptive information, which helps
you slice and dice your data. These tables describe the "who," "what," "when," and "where" of
each fact.
Example: In a Product dimension table, you might store details about each product, like Product
Name, Category, and Brand. Similarly, a Customer dimension could contain Customer Name,
Location, and Age Group.
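A PySpark sketch of these tables in action, joining a hypothetical sales fact to a product dimension and slicing by a descriptive attribute (all names and values are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    fact_sales = spark.createDataFrame([(1, 101, 2, 40.0)],
        "sale_id INT, product_id INT, quantity_sold INT, sale_amount DOUBLE")
    dim_product = spark.createDataFrame([(101, "Building Blocks", "Toys", "BrickCo")],
        "product_id INT, product_name STRING, category STRING, brand STRING")

    # join the fact to its dimension, then aggregate by a dimension attribute
    (fact_sales.join(dim_product, "product_id")
        .groupBy("category")
        .agg(F.sum("sale_amount").alias("total_sales"))
        .show())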
Star Schema:
Contains fact tables and dimension tables.
A central fact table is connected directly to denormalized dimension tables.
Fewer foreign-key joins are needed.
The layout of a fact table surrounded by its dimension tables forms a star.
Snowflake Schema:
Contains fact tables, dimension tables, and sub-dimension tables.
Dimension tables are normalized into sub-dimension tables, and the resulting layout forms a snowflake.
Partitioning vs bucketing
Partitioning and Bucketing in data processing (like in Spark or Hive) are techniques used to
optimize data storage and improve query performance, especially for large datasets.
Using partitioning and bucketing together can provide even more efficient data handling.
Partitioning helps in filtering large sections of data,
while bucketing helps in efficient processing within each partition, especially during join
operations.
Partitioning
Splits data into separate directories based on a column's value (e.g., one folder per date), so queries that filter on that column read only the matching directories.
Bucketing
Hashes a column's values into a fixed number of files (buckets), so rows with the same key always land in the same bucket.
Key Points
Partitioning helps in reducing the amount of data scanned by the query.
Bucketing helps optimize join operations by ensuring that the same keys are grouped within
each bucket.
Use partitioning when the column's cardinality is low; go for bucketing when cardinality is high.
Summary
Partitioning improves query performance by pruning the data that is scanned.
Use bucketing to avoid many small files.
Aim for roughly 128 MB per bucket file for the most optimized layout.
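A PySpark sketch of both writes, with hypothetical column names, output path, and table name (partitionBy for a low-cardinality date column, bucketBy for a high-cardinality key):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("2024-01-01", 1, 40.0)],
        "sale_date STRING, cust_id INT, amount DOUBLE")

    # partitioning: one directory per sale_date value
    df.write.mode("overwrite").partitionBy("sale_date").parquet("/tmp/sales_partitioned")

    # bucketing: hash cust_id into a fixed number of files;
    # bucketBy only works with saveAsTable, not plain file writes
    (df.write.mode("overwrite")
       .bucketBy(8, "cust_id")
       .sortBy("cust_id")
       .saveAsTable("sales_bucketed"))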
Hash Partitioning
Spark calculates the partition for each row by applying a hash function to a specified column.
Distributes rows uniformly, which is good for even load distribution, but data is not ordered within partitions.
Often used for joins on a key column to ensure data locality.
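A quick sketch: repartition applies hash partitioning on the named column (the cust_id column is a stand-in):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "cust_id")

    # equal cust_id values always hash to the same partition
    df_hashed = df.repartition(8, "cust_id")
    print(df_hashed.rdd.getNumPartitions())  # 8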
Range Partitioning
Spark sorts data based on a specified column and assigns contiguous ranges of values to different partitions.
Helpful for range scans or queries where ordered data is beneficial (e.g., filtering by a date range).
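A quick sketch using repartitionByRange, which samples and sorts the column before assigning ranges (the order_id column is a stand-in):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "order_id")

    # contiguous value ranges per partition suit range filters and sorted output
    df_ranged = df.repartitionByRange(4, "order_id")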
Spark code for read and write
1. Reading from DBMS (e.g., MySQL, PostgreSQL, SQL Server)
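A typical PySpark JDBC read looks like the sketch below; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be on Spark's classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # placeholders: substitute real host, database, table, and credentials
    jdbc_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<database>")
        .option("dbtable", "public.orders")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("driver", "org.postgresql.Driver")
        .load())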
2. Reading from Azure Data Lake Storage (ADLS)
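A typical ADLS Gen2 read authenticating with a storage account key (service-principal OAuth is the other common option); the account, container, key, and path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # storage account key auth; substitute real names and secrets
    spark.conf.set(
        "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
        "<account_key>")

    adls_df = spark.read.parquet(
        "abfss://<container>@<storage_account>.dfs.core.windows.net/path/to/data")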
Simple questions
1. Why Parquet files are commonly used.
Columnar Storage: data is stored column by column, so queries read only the columns they need.
Compression: similar values stored together compress very well.
Efficient Reads and Writes: column pruning and predicate pushdown reduce I/O.
Compatibility: an open-source format supported by virtually all big-data tools (Spark, Hive, Presto/Trino, etc.).
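A quick PySpark sketch of writing and reading Parquet, using a hypothetical /tmp path; selecting one column benefits from the columnar layout because the other columns are never scanned:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).selectExpr("id AS order_id", "id * 2.5 AS amount")

    df.write.mode("overwrite").parquet("/tmp/orders_parquet")
    # only the order_id column is read from disk
    spark.read.parquet("/tmp/orders_parquet").select("order_id").show()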
2. In bucketing, how do you decide how many buckets we should have?
Target about 128 MB per bucket file, since the default block size is 128 MB.
Number of buckets ≈ size of data / 128 MB (e.g., 10 GB / 128 MB ≈ 80 buckets).
df.write.bucketBy(num_buckets, "col").saveAsTable("tbl")  # prefer an int-type column; bucketBy requires saveAsTable