Random Question 1
SCD Type 1:
This is the simplest form of handling dimension changes.
In this approach, when a change occurs in an attribute, the existing record is directly updated,
effectively overwriting the old data with the new value.
This method does not maintain historical information and is suitable when historical data is not
critical or not required.
Eg: if a customer moves from London to Paris, the City value is simply overwritten with Paris and the old value is lost.
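A minimal PySpark sketch of this overwrite, assuming hypothetical customer DataFrames (the cust_id/name/city columns are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical current dimension rows and incoming changes
    dim = spark.createDataFrame([(1, "Alice", "London")],
        "cust_id INT, name STRING, city STRING")
    changes = spark.createDataFrame([(1, "Alice", "Paris")],
        "cust_id INT, name STRING, city STRING")

    # Type 1: overwrite old attribute values; no history is kept
    updated = (dim.alias("d")
        .join(changes.alias("c"), "cust_id", "left")
        .select("cust_id",
                F.coalesce("c.name", "d.name").alias("name"),
                F.coalesce("c.city", "d.city").alias("city")))
    updated.show()  # cust_id 1 now shows Paris; London is gone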
SCD Type 2:
SCD Type 2 maintains historical versions of dimension records by creating a new row for each change, effectively building a versioned history. This approach ensures that historical data remains intact and can be used for auditing or trend analysis.
Typical extra columns: start_date, end_date, and an is_current flag that marks the active version.
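A minimal PySpark sketch of expiring the old version and inserting a new one; the column names (cust_id, city, start_date, end_date, is_current) are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    dim = spark.createDataFrame(
        [(1, "London", "2020-01-01", None, True)],
        "cust_id INT, city STRING, start_date STRING, end_date STRING, is_current BOOLEAN")
    change = spark.createDataFrame([(1, "Paris")], "cust_id INT, city STRING")

    # close out the old version of every changed customer
    expired = (dim.join(change.select("cust_id"), "cust_id")
        .withColumn("end_date", F.current_date().cast("string"))
        .withColumn("is_current", F.lit(False)))

    # add a fresh current version for each change
    new_rows = (change
        .withColumn("start_date", F.current_date().cast("string"))
        .withColumn("end_date", F.lit(None).cast("string"))
        .withColumn("is_current", F.lit(True)))

    # sketch only: unchanged customers would also be unioned back in
    versioned = expired.unionByName(new_rows)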
SCD Type 3:
SCD Type 3 keeps track of limited historical information by maintaining previous values in separate columns.
It stores a limited number of changes and is commonly used when keeping only a few previous versions is sufficient.
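A minimal PySpark sketch with hypothetical current_city/previous_city columns, showing the old value shifting into the "previous" column:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    dim = spark.createDataFrame([(1, "London", None)],
        "cust_id INT, current_city STRING, previous_city STRING")
    change = spark.createDataFrame([(1, "Paris")], "cust_id INT, new_city STRING")

    # only one level of history survives in Type 3
    updated = (dim.join(change, "cust_id", "left")
        .select("cust_id",
                F.coalesce("new_city", "current_city").alias("current_city"),
                F.when(F.col("new_city").isNotNull(), F.col("current_city"))
                 .otherwise(F.col("previous_city")).alias("previous_city")))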
SCD Type 4:
In Type 4, the dimension table holds only the latest values, while the full history is maintained in a separate history table.
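A minimal PySpark sketch of the two-table layout; the table names dim_customer and dim_customer_history are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    latest = spark.createDataFrame([(1, "Paris")], "cust_id INT, city STRING")
    superseded = spark.createDataFrame([(1, "London")], "cust_id INT, city STRING")

    # the main dimension table keeps only the latest values ...
    latest.write.mode("overwrite").saveAsTable("dim_customer")
    # ... while replaced versions are appended to a separate history table
    superseded.write.mode("append").saveAsTable("dim_customer_history")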
OLTP vs OLAP
Database workloads broadly fall into one of two processing styles:
OLTP
o On-Line Transaction Processing
o Handles a huge number of small online transactions (INSERT, UPDATE, DELETE).
OLAP
o On-Line Analytical Processing
o Data retrieval for decision support
o Low volume of transactions
o Queries are frequently very complex and contain aggregations.
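For contrast, an OLAP-style query in PySpark: it scans many rows of a hypothetical sales DataFrame and returns a small aggregated result:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    sales = spark.createDataFrame(
        [("2024-01-01", "Books", 120.0), ("2024-01-01", "Toys", 80.0)],
        "sale_date STRING, category STRING, amount DOUBLE")

    # OLAP-style: aggregate many detail rows down to a summary
    (sales.groupBy("category")
          .agg(F.sum("amount").alias("total_sales"),
               F.count("*").alias("num_orders"))
          .show())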
CAP Theorem
The CAP theorem states that a distributed system with data replication cannot guarantee all three of the desirable properties at the same time: consistency, availability, and partition tolerance.
Fact and dimension modeling
Fact Data Modeling:
Think of fact tables as the place where you store measurable data about business events, like
sales or orders. Each row in a fact table records a specific event or transaction.
Example: In a sales fact table, each row could represent a sale, with columns like Sale Amount,
Quantity Sold, and links to other tables for Date, Customer ID, and Product ID.
Dimension tables provide context to the facts by storing descriptive information, which helps
you slice and dice your data. These tables describe the "who," "what," "when," and "where" of
each fact.
Example: In a Product dimension table, you might store details about each product, like Product
Name, Category, and Brand. Similarly, a Customer dimension could contain Customer Name,
Location, and Age Group.
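A PySpark sketch of these tables in action, joining a hypothetical sales fact to a product dimension and slicing by a descriptive attribute (all names and values are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    fact_sales = spark.createDataFrame([(1, 101, 2, 40.0)],
        "sale_id INT, product_id INT, quantity_sold INT, sale_amount DOUBLE")
    dim_product = spark.createDataFrame([(101, "Building Blocks", "Toys", "BrickCo")],
        "product_id INT, product_name STRING, category STRING, brand STRING")

    # join the fact to its dimension, then aggregate by a dimension attribute
    (fact_sales.join(dim_product, "product_id")
        .groupBy("category")
        .agg(F.sum("sale_amount").alias("total_sales"))
        .show())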
Star Schema:
Contains fact tables and dimension tables.
A central fact table is connected directly to denormalized dimension tables.
Fewer foreign-key joins are needed.
The layout of a fact table surrounded by its dimension tables forms a star.
Snowflake Schema:
Contains fact tables, dimension tables, and sub-dimension tables.
Dimension tables are normalized into sub-dimension tables, and the resulting layout forms a snowflake.
Partitioning vs bucketing
Partitioning and Bucketing in data processing (like in Spark or Hive) are techniques used to
optimize data storage and improve query performance, especially for large datasets.
Using partitioning and bucketing together can provide even more efficient data handling.
Partitioning helps in filtering large sections of data,
while bucketing helps in efficient processing within each partition, especially during join
operations.
Partitioning
Splits data into separate directories based on a column's value (e.g., one folder per date), so queries that filter on that column read only the matching directories.
Bucketing
Hashes a column's values into a fixed number of files (buckets), so rows with the same key always land in the same bucket.
Key Points
Partitioning helps in reducing the amount of data scanned by the query.
Bucketing helps optimize join operations by ensuring that the same keys are grouped within
each bucket.
Use partitioning when the column's cardinality is low; go for bucketing when cardinality is high.
Summary
Partitioning improves query performance by pruning the data that is scanned.
Use bucketing to avoid many small files.
Aim for roughly 128 MB per bucket file for the most optimized layout.
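A PySpark sketch of both writes, with hypothetical column names, output path, and table name (partitionBy for a low-cardinality date column, bucketBy for a high-cardinality key):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("2024-01-01", 1, 40.0)],
        "sale_date STRING, cust_id INT, amount DOUBLE")

    # partitioning: one directory per sale_date value
    df.write.mode("overwrite").partitionBy("sale_date").parquet("/tmp/sales_partitioned")

    # bucketing: hash cust_id into a fixed number of files;
    # bucketBy only works with saveAsTable, not plain file writes
    (df.write.mode("overwrite")
       .bucketBy(8, "cust_id")
       .sortBy("cust_id")
       .saveAsTable("sales_bucketed"))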
Hash Partitioning
Spark calculates the partition for each row by applying a hash function to a specified column.
Distributes rows uniformly, which is good for even load distribution, but data is not ordered within partitions.
Often used for joins on a key column to ensure data locality.
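A quick sketch: repartition applies hash partitioning on the named column (the cust_id column is a stand-in):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "cust_id")

    # equal cust_id values always hash to the same partition
    df_hashed = df.repartition(8, "cust_id")
    print(df_hashed.rdd.getNumPartitions())  # 8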
Range Partitioning
Spark sorts data based on a specified column and assigns contiguous ranges of values to different partitions.
Helpful for range scans or queries where ordered data is beneficial (e.g., filtering by a date range).
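A quick sketch using repartitionByRange, which samples and sorts the column before assigning ranges (the order_id column is a stand-in):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "order_id")

    # contiguous value ranges per partition suit range filters and sorted output
    df_ranged = df.repartitionByRange(4, "order_id")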
Spark code for read and write
1. Reading from DBMS (e.g., MySQL, PostgreSQL, SQL Server)
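A typical PySpark JDBC read looks like the sketch below; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be on Spark's classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # placeholders: substitute real host, database, table, and credentials
    jdbc_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<database>")
        .option("dbtable", "public.orders")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("driver", "org.postgresql.Driver")
        .load())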
2. Reading from Azure Data Lake Storage (ADLS)
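A typical ADLS Gen2 read authenticating with a storage account key (service-principal OAuth is the other common option); the account, container, key, and path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # storage account key auth; substitute real names and secrets
    spark.conf.set(
        "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
        "<account_key>")

    adls_df = spark.read.parquet(
        "abfss://<container>@<storage_account>.dfs.core.windows.net/path/to/data")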
Simple questions
1. Why Parquet files are commonly used.
Columnar Storage: data is stored column by column, so queries read only the columns they need.
Compression: similar values stored together compress very well.
Efficient Reads and Writes: column pruning and predicate pushdown reduce I/O.
Compatibility: an open-source format supported by virtually all big-data tools (Spark, Hive, Presto/Trino, etc.).
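A quick PySpark sketch of writing and reading Parquet, using a hypothetical /tmp path; selecting one column benefits from the columnar layout because the other columns are never scanned:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).selectExpr("id AS order_id", "id * 2.5 AS amount")

    df.write.mode("overwrite").parquet("/tmp/orders_parquet")
    # only the order_id column is read from disk
    spark.read.parquet("/tmp/orders_parquet").select("order_id").show()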
2. In bucketing, how do you decide how many buckets we should have?
Target about 128 MB per bucket file, since the default block size is 128 MB.
Number of buckets ≈ size of data / 128 MB (e.g., 10 GB / 128 MB ≈ 80 buckets).
df.write.bucketBy(num_buckets, "col").saveAsTable("tbl")  # prefer an int-type column; bucketBy requires saveAsTable