
Random Question 1

The document covers various data processing concepts including SCD types for historical data management, differences between OLTP and OLAP systems, and the CAP theorem regarding distributed systems. It also discusses data modeling techniques, specifically fact and dimension modeling, and compares star and snowflake schemas. Additionally, it addresses partitioning and bucketing strategies for optimizing data storage and query performance, along with specific partitioning methods like hash and range partitioning.

Uploaded by

thanish shekar

1. groupByKey vs reduceByKey
2. SCD types
3. OLTP vs OLAP
4. CAP theorem
5. Fact data modeling and dimensional data modeling
6. Best practices
a. Logging
b. PySpark
c. ADF
d. Databricks
e. SQL
7. Snowflake schema vs star schema
8. Partitioning vs bucketing
9. Hash partitioning vs range partitioning
SCD types
 Slowly changing dimensions (SCDs) are used to track changes in dimension attributes over time.
 They play a crucial role in capturing historical data and preserving the integrity of historical records.
 This enables analysts and decision-makers to analyze data trends accurately.

SCD Type 1:
 This is the simplest form of handling dimension changes.
 In this approach, when a change occurs in an attribute, the existing record is directly updated,
effectively overwriting the old data with the new value.
 This method does not maintain historical information and is suitable when historical data is not
critical or not required.
Example: update the price of item 101 to 850. The old price is overwritten and no history is kept.
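
A minimal sketch of the Type 1 overwrite in plain Python. The id 101 and the new price 850 come from the example above; the in-memory table, the column names, and the starting price of 800 are illustrative assumptions.

```python
# Hypothetical product dimension keyed by product id (names and old price are assumptions).
product_dim = {
    101: {"name": "Widget", "price": 800},
    102: {"name": "Gadget", "price": 500},
}


def scd_type1_update(dim, key, changes):
    """Overwrite attributes in place; the previous values are lost."""
    dim[key].update(changes)
    return dim


scd_type1_update(product_dim, 101, {"price": 850})
print(product_dim[101])
```

After the update there is still exactly one row for item 101, which is why Type 1 cannot answer questions about past values.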

SCD Type 2:

 SCD Type 2 maintains historical versions of dimension records by creating a new row for each
change, effectively creating a versioned history.
 This approach ensures that historical data remains intact and can be used for auditing or
trend analysis.

Example: John switches department from Sales to Marketing. The Sales row is kept, and a new
Marketing row is added with its own start date.
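
A minimal sketch of a Type 2 change for the John example, in plain Python. The column names (`start_date`, `end_date`, `is_current`) and the dates are assumptions; closing the old row with an end date is one common convention, not the only one.

```python
from datetime import date

# Hypothetical employee dimension; column names and dates are illustrative assumptions.
employee_dim = [
    {"emp": "John", "department": "Sales",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]


def scd_type2_update(dim, emp, new_department, change_date):
    """Close the current row for `emp` and append a new current row."""
    for row in dim:
        if row["emp"] == emp and row["is_current"]:
            row["end_date"] = change_date
            row["is_current"] = False
    dim.append({"emp": emp, "department": new_department,
                "start_date": change_date, "end_date": None, "is_current": True})
    return dim


scd_type2_update(employee_dim, "John", "Marketing", date(2024, 1, 1))
for row in employee_dim:
    print(row["department"], row["start_date"], row["end_date"], row["is_current"])
```

Both rows survive, so a query can still see that John was in Sales before the change date.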


SCD Type 3:

 SCD Type 3 keeps track of limited historical information by maintaining attributes as separate
columns.
 It allows us to store a limited number of changes and is commonly used when maintaining a
few previous versions is sufficient.
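
A minimal sketch of Type 3 with a single "previous value" column, in plain Python. The `previous_department` column name and the sample values are assumptions made for illustration.

```python
# Hypothetical row with one "previous" column; names and values are assumptions.
employee = {"emp": "John", "department": "Sales", "previous_department": None}


def scd_type3_update(row, attribute, new_value):
    """Shift the current value into the previous_* column, then overwrite it."""
    row["previous_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


scd_type3_update(employee, "department", "Marketing")
scd_type3_update(employee, "department", "Finance")
print(employee)
```

After the second change only Marketing is remembered as the previous value; Sales is gone, which is the "limited history" trade-off of Type 3.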

SCD Type 4:
 In Type 4, the dimension table has the latest value while its history is maintained in a separate
table.
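
A minimal sketch of Type 4 in plain Python: one structure holds the latest values and a separate one accumulates history. Table names, column names, and the sample values are all assumptions.

```python
# Current-values table and a separate history table (all names are assumptions).
product_current = {101: {"price": 800}}
product_history = []


def scd_type4_update(current, history, key, changes, changed_on):
    """Copy the outgoing row to the history table, then overwrite the current row."""
    history.append({"key": key, **current[key], "replaced_on": changed_on})
    current[key] = {**current[key], **changes}


scd_type4_update(product_current, product_history, 101, {"price": 850}, "2024-01-01")
print(product_current)
print(product_history)
```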

OLTP vs OLAP
Data systems generally serve one of two kinds of workload:
 OLTP
o Online transaction processing
o Handles a huge number of small online transactions (INSERT, UPDATE, and DELETE).
 OLAP
o Online analytical processing
o Data retrieval for decision support
o Small volume of transactions.
o Queries are often very complex and contain aggregations.
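
The contrast can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are assumptions; a real OLTP or OLAP system would be a dedicated database, so this only illustrates the shape of each workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small, short-lived write transactions.
for region, amount in [("north", 10.0), ("south", 25.0), ("north", 5.0)]:
    with conn:  # each block commits as its own transaction
        conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount))
with conn:
    conn.execute("UPDATE orders SET amount = 30.0 WHERE id = 2")

# OLAP-style work: a read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```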
CAP theorem
 In a distributed system with data replication, it is not possible to guarantee all three
desirable properties (consistency, availability, and partition tolerance) at the same time.


Fact and dimension modeling
Fact Data Modeling:

 Think of fact tables as the place where you store measurable data about business events, like
sales or orders. Each row in a fact table records a specific event or transaction.
 Example: In a sales fact table, each row could represent a sale, with columns like Sale Amount,
Quantity Sold, and links to other tables for Date, Customer ID, and Product ID.

Dimension Data Modeling:

 Dimension tables provide context to the facts by storing descriptive information, which helps
you slice and dice your data. These tables describe the "who," "what," "when," and "where" of
each fact.
 Example: In a Product dimension table, you might store details about each product, like Product
Name, Category, and Brand. Similarly, a Customer dimension could contain Customer Name,
Location, and Age Group.
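
A small sketch of the sales example using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_customer`) and every sample row are assumptions; the columns follow the ones listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, location TEXT, age_group TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    sale_date TEXT,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity_sold INTEGER,
    sale_amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics', 'BrandA'), (2, 'Desk', 'Furniture', 'BrandB');
INSERT INTO dim_customer VALUES (10, 'Asha', 'Chennai', '25-34'), (11, 'Ravi', 'Pune', '35-44');
INSERT INTO fact_sales VALUES
    (100, '2024-01-05', 10, 1, 1, 900.0),
    (101, '2024-01-06', 11, 1, 2, 1800.0),
    (102, '2024-01-06', 10, 2, 1, 200.0);
""")

# Slice the measures in the fact table by a descriptive attribute from a dimension.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.quantity_sold), SUM(f.sale_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

The fact table holds only keys and numeric measures; anything descriptive is reached through a join to a dimension.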

Snowflake schema vs star schema

 Star schema and snowflake schema are data warehouse design approaches.

Star schema:
 Contains fact tables and dimension tables.
 A central fact table is connected directly to the dimension tables, which gives the schema
its star shape.
 Fewer foreign-key joins are needed.
Snowflake schema:
 Contains fact tables, dimension tables, and sub-dimension tables.
 Dimension tables are split further into sub-dimension tables, which gives the schema its
snowflake shape.
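
The difference in join depth can be sketched with `sqlite3`. All table names and rows are assumptions; the point is only that the same question needs one join in the star layout and two in the snowflake layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: category is stored directly on the product dimension.
CREATE TABLE star_dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
-- Snowflake: category is moved out into its own sub-dimension table.
CREATE TABLE snow_dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE snow_dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category_id INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, sale_amount REAL);

INSERT INTO star_dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO snow_dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO snow_dim_product VALUES (1, 'Laptop', 1), (2, 'Desk', 2);
INSERT INTO fact_sales VALUES (1, 900.0), (1, 1800.0), (2, 200.0);
""")

star_query = """
    SELECT p.category, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN star_dim_product p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
"""
snowflake_query = """
    SELECT c.category, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN snow_dim_product p ON p.product_id = f.product_id
    JOIN snow_dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
"""
star_result = conn.execute(star_query).fetchall()
snowflake_result = conn.execute(snowflake_query).fetchall()
print(star_result == snowflake_result)
```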
Partitioning vs bucketing
 Partitioning and bucketing in data processing (for example in Spark or Hive) are techniques used
to optimize data storage and improve query performance, especially for large datasets.
 Using partitioning and bucketing together can provide even more efficient data handling.
 Partitioning helps filter out large sections of data, while bucketing helps process data
efficiently within each partition, especially during join operations.

Partitioning
Bucketing

Key Points
 Partitioning reduces the amount of data scanned by a query.
 Bucketing helps optimize join operations by ensuring that the same keys are grouped within
each bucket.
 When cardinality is low, use partitioning; when cardinality is high, use bucketing.
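
A conceptual sketch of how the two techniques place a row, in plain Python. The column names, the bucket count of 4, and the use of `zlib.crc32` as the hash are assumptions; Spark and Hive use their own hash functions and file naming.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count for the sketch

# Hypothetical rows: country has low cardinality, user_id has high cardinality.
rows = [
    {"country": "IN", "user_id": 1001},
    {"country": "IN", "user_id": 1002},
    {"country": "US", "user_id": 1001},
    {"country": "US", "user_id": 2050},
]


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a stable hash (crc32 stands in for the engine's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


layout = defaultdict(list)
for row in rows:
    # Partition: one directory per distinct value. Bucket: one slot per hash value.
    path = f"country={row['country']}/bucket_{bucket_for(row['user_id'])}"
    layout[path].append(row)

for path in sorted(layout):
    print(path, [r["user_id"] for r in layout[path]])
```

A filter on `country` can skip whole directories, while rows with the same `user_id` always land in the same bucket number, which is what makes bucketed joins cheaper.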

How many buckets should we have?


 Aim for about 128 MB per bucket file, because the block size is 128 MB.
 Number of buckets = size of data / 128 MB
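
The rule of thumb above as a small calculation. The function name is hypothetical, and rounding up with a minimum of one bucket is an assumption added so the result is always a usable integer.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # 128 MB target per bucket, from the rule of thumb


def estimate_bucket_count(data_size_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return ceil(data size / target size), with a minimum of one bucket."""
    return max(1, math.ceil(data_size_bytes / target_bytes))


ten_gb = 10 * 1024 ** 3
print(estimate_bucket_count(ten_gb))  # 10 GB / 128 MB = 80
```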

Summary
 Partitioning improves query performance.
 Use bucketing rather than partitioning on high-cardinality columns to avoid creating many small files.
 For best results, each bucket file should be about 128 MB.

Hash Partitioning vs Range Partitioning


Why partition?
 Query optimization and efficient storage.
Hash partition

 Spark calculates the partition for each row by applying a hash function to a
specified column.
 Distributes rows uniformly, which is good for even load distribution, but rows are not
ordered within partitions.
 Often used for joins on a key column to ensure data locality.
Range partition
 Spark sorts data based on a specified column and assigns ranges to different partitions.
 Distributes rows based on sorted ranges, which helps range queries where ordered data
is beneficial.
 Useful for range scans or queries, where operations are more efficient on sorted data
(e.g., filtering by a date range).
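
A plain-Python sketch of the two assignment rules. The function names, the sample dates, and the hard-coded range boundaries are assumptions, and `zlib.crc32` is only a stand-in for the engine's hash function.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Assign a partition from a hash of the key (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Assign a partition by locating the key among sorted range boundaries."""
    return bisect.bisect_left(upper_bounds, key)


dates = ["2024-01-15", "2024-02-03", "2024-02-20", "2024-03-09"]
# Hypothetical boundaries: partition 0 holds keys up to Jan 31, partition 1 up to Feb 29,
# and partition 2 holds everything later.
bounds = ["2024-01-31", "2024-02-29"]

print([range_partition(d, bounds) for d in dates])
print([hash_partition(d, 3) for d in dates])
```

With range partitioning, a filter on February only needs partition 1. With hash partitioning, neighbouring dates can land in any partition, so a date-range filter has to look at all of them.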
Spark code for read and write
1. Reading from DBMS (e.g., MySQL, PostgreSQL, SQL Server)
2. Reading from Azure Data Lake Storage (ADLS)
3. Reading from CSV files
4. Reading from JSON files
5. Reading from Parquet/Avro/ORC files
6. Reading from and writing to Delta tables
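
The original snippets for these six cases are not present, so the following is only a sketch of what such reads and writes commonly look like with the PySpark `DataFrameReader` and `DataFrameWriter`. The helper function names are hypothetical, every function expects an existing `SparkSession` passed in as `spark`, and it is assumed that the JDBC driver, ADLS credentials, the spark-avro package, and Delta Lake are already configured on the cluster.

```python
def read_jdbc(spark, url, table, user, password):
    """1. Read a table from a relational database over JDBC."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())


def read_adls_parquet(spark, container, account, path):
    """2. Read Parquet from ADLS Gen2 via an abfss:// URI (auth configured elsewhere)."""
    uri = f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    return spark.read.parquet(uri)


def read_csv(spark, path):
    """3. Read CSV, treating the first line as a header and inferring column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def read_json(spark, path):
    """4. Read JSON (one JSON object per line by default)."""
    return spark.read.json(path)


def read_columnar(spark, path, fmt="parquet"):
    """5. fmt may be 'parquet', 'orc', or 'avro' (Avro needs the spark-avro package)."""
    return spark.read.format(fmt).load(path)


def read_delta(spark, path):
    """6a. Read a Delta table by path (needs Delta Lake on the cluster)."""
    return spark.read.format("delta").load(path)


def write_delta(df, path, mode="overwrite"):
    """6b. Write a DataFrame as a Delta table at the given path."""
    df.write.format("delta").mode(mode).save(path)


# Usage, assuming an active SparkSession named `spark` and hypothetical paths:
# df = read_csv(spark, "/data/input.csv")
# write_delta(df, "/data/delta/output")
```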

Simple questions
1. Why are Parquet files commonly used?
 Columnar storage
 Compression
 Efficient reads and writes
 Compatibility: open-source format
2. In bucketing, how do you decide how many buckets to use?
 Aim for about 128 MB per bucket file, because the block size is 128 MB.
 Number of buckets = size of data / 128 MB
 df.write.bucketBy(number, column), preferably with an integer-type column

3. Difference between map and flatMap
 map keeps the structure of the original RDD (a list of lists in this case), while
flatMap flattens the structure, giving a single list of words.
4. Difference between external table and managed table
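
For question 3, the map versus flatMap difference can be sketched in plain Python without a Spark cluster. The sample lines are assumptions; in PySpark the equivalents would be `rdd.map(lambda line: line.split(" "))` and `rdd.flatMap(lambda line: line.split(" "))`.

```python
from itertools import chain

lines = ["hello world", "map versus flatMap"]

# map: exactly one output element per input element, so the result is a list of lists.
mapped = list(map(lambda line: line.split(" "), lines))

# flatMap equivalent: apply the same function, then flatten one level.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```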
