Data Warehouse and Data Mining
The star schema is one of the most popular data modeling techniques used in data
warehousing.
Its structure is relatively simple, making it easy to understand and conducive to query performance.
1. Fact Table:
● At the heart of the star schema is the fact table. This table contains the quantitative data
(often called "facts" or "measures") about specific events or transactions. Examples of facts
include sales totals, quantities sold, and discounts.
● The fact table usually has a composite primary key made up of foreign keys that link to
associated dimension tables. This composite key helps in relating facts to their descriptive
context.
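The composite key described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the table names, column names, and sample rows are all hypothetical, and the fact table is cut down to two dimensions to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);

-- Fact table: the primary key is the combination of the dimension foreign keys.
CREATE TABLE sales (
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    quantity    INTEGER NOT NULL,
    total       REAL    NOT NULL,
    PRIMARY KEY (product_id, customer_id)
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO product VALUES (1, 'Laptop')")
conn.execute("INSERT INTO customer VALUES (10, 'Asha')")
conn.execute("INSERT INTO sales VALUES (1, 10, 2, 1800.0)")


def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 10, 1, 900.0))  # same key combination as the first row
orphan_ok = try_insert((1, 99, 1, 900.0))     # customer 99 is not in the dimension table
print(duplicate_ok, orphan_ok)
```

Both extra inserts are rejected: the first repeats an existing key combination, and the second points at a customer that has no row in its dimension table.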
2. Dimension Tables:
● Surrounding the central fact table are several dimension tables. Each dimension table
holds descriptive data, referred to as "attributes." These attributes give context to the quantitative data in the fact
table.
● Examples of dimension tables could be: "Time" (with attributes like day, week, month,
quarter, year), "Product" (with attributes like product name, category, manufacturer),
"Customer" (with attributes like customer name, address, and phone number), and so forth.
● Each dimension table is linked to the fact table by a primary-to-foreign key relationship.
3. Characteristics:
● Simplicity: One of the main advantages of the star schema is its simplicity. The clear
distinction between fact and dimension tables makes it easy for end-users and developers to understand.
● Performance: Due to its denormalized nature, the star schema is optimized for query
performance. Queries often require fewer joins in a star schema than in more normalized schemas.
● Scalability: New dimensions or facts can be added without changing the existing structure.
4. Drawback:
● Redundancy: Because it's denormalized, the star schema can introduce data redundancy.
This can lead to increased storage requirements and potential data integrity issues.
5. Usage:
● The star schema is primarily used in OLAP systems, which are designed for complex
queries and aggregations, rather than OLTP systems, which are transaction-oriented.
In graphical representations, the structure resembles a star, with the fact table in the center and
dimension tables radiating outward, hence the name "star schema," as depicted in Figure 5.7.
SALES is the fact table, with the attributes Product ID, Order ID, Customer ID, Employee ID,
Total, Quantity, and Discount. Its ID attributes reference the dimension tables. The Employee
dimension table contains the attributes Emp ID, Emp Name, Title, Department, and Region. The
Product dimension table contains the attributes Product ID, Product Name, Product Category,
and Unit Price. The Customer dimension table contains the attributes Customer ID, Customer
Name, Address, City, and Zip. The Time dimension table contains the attributes Order ID, Order
Date, Year, Quarter, and Month.
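As a rough sketch, the star schema described above can be built and queried with Python's built-in sqlite3 module. The column names follow the attribute lists in the text. The table name time_dim, the column types, and every sample row are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT, title TEXT,
                       department TEXT, region TEXT);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                       product_category TEXT, unit_price REAL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                       address TEXT, city TEXT, zip TEXT);
CREATE TABLE time_dim (order_id INTEGER PRIMARY KEY, order_date TEXT,
                       year INTEGER, quarter INTEGER, month INTEGER);

-- Central fact table: foreign keys to each dimension plus the numeric measures.
CREATE TABLE sales (
    product_id  INTEGER REFERENCES product(product_id),
    order_id    INTEGER REFERENCES time_dim(order_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    emp_id      INTEGER REFERENCES employee(emp_id),
    total REAL, quantity INTEGER, discount REAL,
    PRIMARY KEY (product_id, order_id, customer_id, emp_id)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO employee VALUES (1, 'Ravi', 'Sales Rep', 'Sales', 'North')")
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "Laptop", "Electronics", 900.0),
    (2, "Desk", "Furniture", 150.0),
])
conn.execute("INSERT INTO customer VALUES (1, 'Asha', '12 Main St', 'Pune', '411001')")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
    (100, "2023-01-15", 2023, 1, 1),
    (101, "2023-04-20", 2023, 2, 4),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 100, 1, 1, 1800.0, 2, 0.0),
    (2, 100, 1, 1, 150.0, 1, 0.0),
    (1, 101, 1, 1, 900.0, 1, 0.0),
])

# Typical analytical query: each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT p.product_category, t.quarter, SUM(s.total)
    FROM sales AS s
    JOIN product  AS p ON p.product_id = s.product_id
    JOIN time_dim AS t ON t.order_id   = s.order_id
    GROUP BY p.product_category, t.quarter
    ORDER BY p.product_category, t.quarter
""").fetchall()

for row in rows:
    print(row)
```

The query groups the Total measure by two dimension attributes, and it needs only one join per dimension because each dimension table is stored flat.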
Snowflake Schema
The snowflake schema is another common data warehousing model, closely related to the star
schema. Both are used for OLAP (Online Analytical Processing), but they differ structurally;
a sample snowflake schema is shown in Figure 5.8. Here is an overview of the snowflake schema:
● In the snowflake schema, dimension tables are normalized. That means the data is
organized within the database to reduce redundancy and improve data integrity. This is
done by dividing the data into additional tables, creating a structure that looks like a snowflake.
● For instance, if you have a "Customer" dimension in a retail scenario, that dimension could
be normalized into separate "Customer," "City," and "Country" tables instead of a single table.
● Because of this normalization, the snowflake schema tends to have a more complex
structure than the star schema. Queries can become more complex and involve more table joins.
● The main advantage of the snowflake schema is the reduction in data redundancy. This can save storage space.
● However, the space saved may be minimal compared to the overall size of the data
warehouse, and this saving might not justify the additional complexity.
● The increase in normalization can improve data integrity, as the chances of inconsistent
data are reduced. Any change to a data point needs to be made in just one place.
5. Query Performance:
● Query performance can be slower compared to the star schema due to the increased number of joins.
6. Scalability Issues:
● While the snowflake schema can handle changing requirements by adding new dimensions
easily, the complexity of the schema might increase significantly as the database scales.
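The effect of normalizing a dimension, as described in the points above, can be sketched in plain Python. The customer rows, the snowflake helper, and the two-table split are hypothetical. The sketch only counts how often a city name is stored before and after the split.

```python
# Denormalized (star-style) customer dimension: city details repeat on every row.
# All rows are made-up sample data.
star_customers = [
    (1, "Asha", "Pune", "Maharashtra", "India"),
    (2, "Bilal", "Pune", "Maharashtra", "India"),
    (3, "Chen", "Pune", "Maharashtra", "India"),
    (4, "Dana", "Mumbai", "Maharashtra", "India"),
]


def snowflake(rows):
    """Split the flat dimension into a customer table and a city table."""
    city_ids = {}   # (city, state, country) -> city_id
    customers = []  # (customer_id, name, city_id)
    for customer_id, name, city, state, country in rows:
        key = (city, state, country)
        # Reuse the existing id for a known city, otherwise assign the next one.
        city_id = city_ids.setdefault(key, len(city_ids) + 1)
        customers.append((customer_id, name, city_id))
    cities = [(city_id, *key) for key, city_id in city_ids.items()]
    return customers, cities


customers, cities = snowflake(star_customers)

# How many times is the string "Pune" stored in each design?
pune_in_star = sum(row.count("Pune") for row in star_customers)
pune_in_snowflake = sum(row.count("Pune") for row in cities)
print(pune_in_star, pune_in_snowflake)
```

In the normalized form a correction to a city's details touches a single row of the city table, at the cost of an extra lookup (a join) whenever a customer's city is needed.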
The Employee dimension table now contains the attributes EmployeeID, EmployeeName,
DepartmentID, Region, and Territory. The DepartmentID attribute links the Employee table
to the Department dimension table. The Department dimension provides detail about each
department, such as its Name and Location. The Customer dimension table now contains the
attributes CustomerID, CustomerName, Address, and CityID. The CityID attribute links the
Customer dimension table to the City dimension table. The City dimension table has details
about each city, such as city name, Zipcode, State, and Country.
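The snowflaked dimensions described above can be sketched with Python's built-in sqlite3 module. The column names follow the text. The reduced sales fact table, the column types, and all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (department_id INTEGER PRIMARY KEY, name TEXT, location TEXT);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY, employee_name TEXT,
    department_id INTEGER REFERENCES department(department_id),
    region TEXT, territory TEXT
);
CREATE TABLE city (city_id INTEGER PRIMARY KEY, city_name TEXT, zipcode TEXT,
                   state TEXT, country TEXT);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY, customer_name TEXT, address TEXT,
    city_id INTEGER REFERENCES city(city_id)
);
-- Reduced, hypothetical fact table: two dimension keys and two measures.
CREATE TABLE sales (
    customer_id INTEGER REFERENCES customer(customer_id),
    employee_id INTEGER REFERENCES employee(employee_id),
    total REAL, quantity INTEGER
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO department VALUES (?, ?, ?)",
                 [(1, "Sales", "Pune"), (2, "Online", "Mumbai")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
                 [(1, "Ravi", 1, "West", "T1"), (2, "Mina", 2, "West", "T2")])
conn.executemany("INSERT INTO city VALUES (?, ?, ?, ?, ?)",
                 [(1, "Pune", "411001", "Maharashtra", "India"),
                  (2, "Mumbai", "400001", "Maharashtra", "India")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(1, "Asha", "12 Main St", 1), (2, "Dana", "5 Lake Rd", 2)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1800.0, 2), (2, 1, 150.0, 1), (2, 2, 900.0, 1)])

# Reaching city and department attributes takes two joins per dimension:
# sales -> customer -> city and sales -> employee -> department.
rows = conn.execute("""
    SELECT ci.city_name, d.name, SUM(s.total)
    FROM sales AS s
    JOIN customer   AS c  ON c.customer_id   = s.customer_id
    JOIN city       AS ci ON ci.city_id      = c.city_id
    JOIN employee   AS e  ON e.employee_id   = s.employee_id
    JOIN department AS d  ON d.department_id = e.department_id
    GROUP BY ci.city_name, d.name
    ORDER BY ci.city_name, d.name
""").fetchall()

for row in rows:
    print(row)
```

Grouping by city and department here takes four joins, whereas a star schema that kept the city and department attributes directly in the Customer and Employee dimensions would need two.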