
Amirrul Rasyid Bin Norazman

TP079469
CT051-3-M- DATA MANAGEMENT
INDIVIDUAL ASSIGNMENT PART 1
Dr Murugananthan Velayutham

Table of Contents
INTRODUCTION
    Background
INITIAL DATA EXPLORATION
    Descriptive Statistics
        Summary Statistics for Nominal Variables
        Summary Statistics for Ordinal Variables
        Summary Statistics for Ratio Variables
METADATA
DATA WAREHOUSE AS LARGE DATASETS STORAGE AND PROCESSING CENTRE
    What is a Data Warehouse?
    5 Criteria to Evaluate Data Warehouse as Large Datasets Storage and Processing Centre
        Criteria 1: Complex Query Requirements
        Criteria 2: Historical Data Analysis
        Criteria 3: Data Integration from Multiple Sources
        Criteria 4: Read-Intensive Workloads
        Criteria 5: Variant Types of Analysis (Descriptive, Diagnostic, Predictive, Prescriptive)
CONCLUSION
CITATIONS

Figure 1 Excerpt from the dataset 'Speed Date Experiment in Columbia University'
Figure 2 Attributes generated using the CONTENTS procedure on the dataset
Figure 3 Variables retrieved using the CONTENTS procedure
Figure 4 Summary statistics generated using the MEANS procedure
Figure 5 MEANS procedure for variables Gender, Goal and Dec
Figure 6 Pie chart as graphical representation of the frequency distribution for nominal variables, e.g. Gender
Figure 7 MEANS procedure for variables Attr, Sinc, Intel, Fun, Amb, Shar, Like & Prob
Figure 8 Frequency distribution and box plot for the variable Attr
Figure 9 UNIVARIATE procedure for variable Age
Figure 10 Histogram of z-scores for variable Age
Figure 11 Metadata of the chosen dataset using the CONTENTS procedure
Figure 12 Criteria for the data warehouse concept to store and process large datasets

Table 1 Attributes, their types and explanation
Table 2 Summary statistics for variables Gender, Goal, Dec & Met
Table 3 Summary statistics for ordinal variables Attr, Sinc, Intel, Fun, Amb, Shar, Like & Prob
Table 4 Summary statistics of ratio variables (age, income) in the dataset
Introduction

The ability to efficiently store, handle, and analyse big datasets has never been more important than now, in a data-driven world. This assignment focuses on exploring and evaluating various data types, storage methods and indexing & retrieval techniques, while examining the critical role of data warehouses in managing large datasets.
W. H. Inmon stated that a data warehouse serves as a repository that stores integrated data from multiple sources (W.H.Inmon, 2002). A data warehouse is not to be confused with a database; while both act as data storage and management tools, a data warehouse separates the analysis workload from the transaction workload and can consolidate data from several sources. These characteristics help a data warehouse support tasks such as analytical reporting, structured or ad hoc queries, and decision making.
However, this assignment does not focus solely on evaluating the effectiveness of data warehouses in storing and processing a dataset; it also presents findings on the use of data mining with a chosen dataset. On top of that, a metadata structure for the chosen dataset is featured in this assignment. This ensures a clear and organised framework for future data handling and analysis.

Background

The dataset used is obtained from an experiment conducted at Columbia University. From 2002 to 2004, the university organised a speed-dating experiment in which it tracked data over 21 speed-dating sessions for young adults meeting people of the opposite sex. The dataset is suitable for this assignment based on the following observations:

1. Diverse Data Types


The dataset includes both categorical and numerical data, i.e. mixed data. This makes it useful for exploring algorithms that handle mixed data types. For example, Nguyen proposed an extension to the iterative hard thresholding (IHT) algorithm to better quantify the structure of categorical features when handling a combination of categorical and numerical features in datasets (Nguyen & Obafemi-Ajayi, 2019). Furthermore, Suarez Alvarez suggested using Euclidean metrics to normalise feature vectors for mixed datasets (Suarez Alvarez, 2010). This normalisation ensures that the average contributions of all attributes to the measures are statistically equal; a normalisation sketch is given after this list.
2. Sufficient Size
The data contains 8,378 observations, which is substantial for a meaningful analysis. Moreover, with 15 columns, the dataset offers a variety of attributes to analyse and apply statistical methods to. Kaplan emphasised that larger sample sizes lead to more accurate average values, help identify outliers, and provide smaller margins of error (Kaplan, Chambers, & Glasgow, 2014). What's more, another discussion highlights how large sample sizes provide greater statistical power and more precise estimates, which improve the reliability of research findings (Subramanian, Jayadevan, & Aji, 2019).
3. Completeness and Quality (or lack thereof)
Some columns have missing values. The dataset also likely contains outliers, which warrant further exploration. While missing values and outliers are unavoidable, Kwak and Kim discuss the importance of adequately managing them for accurate analysis (Kwak & Kim, 2017). Likewise, Huang, Lin & Tsai explore the impact of outlier removal on the imputation of missing values in medical datasets and its significance for reliable data analysis (Huang, Lin, & Tsai, 2018).
4. Relevant and Realistic
The dataset is based on real-world scenarios, making further analysis more applicable. The results of analysis from this dataset are also domain-relevant, that is, relevant for data mining and machine learning, as they can be related to studies in social sciences, behaviour analysis or psychology. A research article supports these statements; it emphasises the need for domain-relevant datasets and realistic procedures to ensure fair benchmarking in time series data (Ragab, et al., 2023).
5. Supports Data Warehousing Concepts
The dataset also contains historical data and comes from different sources of data collection, therefore aligning with data warehousing concepts; a contributor from SAS Institute underlines the importance of data warehousing, highlighting how it integrates data from multiple sources to provide comprehensive insights (SAS Institute, Inc, 2024).
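As a hedged illustration of the normalisation idea mentioned in point 1, the sketch below rescales the numerical attributes to [0,1] so each contributes comparably to a Euclidean distance. It assumes the dataset has been imported into SAS Studio as WORK.SPEED_DATE (a hypothetical name used in all sketches below) and that SAS/STAT's STDIZE procedure is available:

proc stdize data=work.speed_date out=work.speed_date_norm method=range;
   /* METHOD=RANGE subtracts each variable's minimum and divides by its  */
   /* range, so every variable listed ends up on a comparable [0,1] scale */
   var age income;
run;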

Initial Data Exploration

For this assignment, SAS Studio is used to conduct data exploration and analysis on the
aforementioned dataset.

Figure 1 Excerpt from the dataset 'Speed Date Experiment in Columbia University'

Figure 2 Attributes generated using the CONTENTS procedure on the dataset
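The attribute listing in Figure 2 comes from the CONTENTS procedure; a minimal sketch of that step, using the hypothetical WORK.SPEED_DATE name introduced earlier:

proc contents data=work.speed_date;
   /* lists the dataset's variables, their basic types (numeric or */
   /* character), lengths, and general dataset attributes          */
run;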

Figures 1 & 2 show that there are 15 attributes in the dataset. There are four types of variables: two types of qualitative data (nominal and ordinal) and two types of quantitative data (interval and ratio) (Mishra, Pandey, Singh, & Gupta, 2018). Nominal data have values that function only as labels and do not carry any mathematical meaning, even when the values are numbers. Ordinal data, on the other hand, contain a ranking order, whether higher or lower, but without an equal difference between each interval, and thus do not support arithmetic operations. For quantitative data, the significant difference between interval data and ratio data is a meaningful zero-point (only ratio data has one). This means that while both interval and ratio data have constant intervals (therefore supporting addition and subtraction), only ratio data supports multiplication and division, due to having a true zero-point (Freund, Freund, & Wilson, 2003).

Figure 3 Variables retrieved using the CONTENTS procedure

The CONTENTS procedure in SAS (as shown in Figure 3) summarises the content of a dataset, including the number of variables. However, as of the time of writing, SAS Studio does not have a procedure to infer data types beyond the basic numeric and character classifications, hence manual inspection is required here. The types of each attribute are shown in Table 1 below:
Table 1 Attributes, their types and explanation

Attribute | Type | Explanation
Gender | Nominal | Gender (Female = 0, Male = 1)
Age | Ratio | Age (years)
Income | Ratio | Median annual household income (in USD) based on zip code, using the Census Bureau website. When there is no income, the participant is either from abroad or did not enter a zip code.
Goal | Nominal | Primary goal in participating in the speed dating event (Seemed like a fun night out = 1; To meet new people = 2; To get a date = 3; Looking for a serious relationship = 4; To say I did it = 5; Other = 6)
Dec | Nominal | Rater's decision on whether the date was a match (Yes = 1, No = 0)
Attr | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Attractive
Sinc | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Sincerity
Intel | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Intelligence
Fun | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Fun
Amb | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Ambitiousness
Shar | Ordinal | How the participant rated the respective date, scored from 1 to 10 (Strongly disagree = 1, Strongly agree = 10), on the characteristic: Shared Interest
Like | Ordinal | How the participant rated the respective date overall, scored from 1 to 10 (Strongly dislike = 1, Strongly like = 10)
Prob | Ordinal | The participant's rating of the probability that the interest would be reciprocated, scored from 0 to 10 (Strongly disagree = 0, Strongly agree = 10)
Met | Nominal | Whether the participant had met the date prior to the experiment (Yes = 1, No = 0)

Descriptive Statistics

The next step in an initial data exploration is computing the descriptive statistics of the dataset. Finding descriptive statistics (also known as summary statistics) serves several important purposes, including understanding the data distribution in the dataset. Morcillo explores data summary techniques such as mean, median, mode, standard deviation, range, and interquartile range, emphasising their role in organising and presenting data (Morcillo, 2023). Along with this, descriptive statistics can help identify outliers that might affect the analysis, despite outliers being genuine observations (Larson, 2006). Furthermore, understanding the data distribution can assist in deciding which data transformations are necessary to prepare the data for further analysis (Manikandan, 2010).

The MEANS procedure is used in SAS Studio to generate the descriptive statistics of the dataset, as per Figure 4.

Figure 4 Summary statistics generated using the MEANS procedure

There are several observations that can be made here, namely:
A) Column 'N' shows the number of observations for each variable. Since the working dataset contains 3,000 observations, we can conclude that there are missing data in almost all variables, with the exception of the variables "gender" and "dec".
B) Reviewing the columns 'Mean', 'Minimum', and 'Maximum' suggests the possible existence of outliers in the variables 'age' and 'income'. This inference is sound, as both variables contain ratio data.
C) Cross-referencing the columns 'Minimum' and 'Maximum' helps us understand the range of each variable.
D) Some of the measures do not hold up to contextual questioning. For example, the mean of the variable 'gender' is 0.5005968, which has no meaning, since the variable 'gender' only takes two definite values, 1 or 0. Manikandan explains that the mean cannot be calculated for nominal or ordinal data because it does not provide meaningful values for such types of data (Manikandan S., 2011).

Based on finding D, the most appropriate way to identify the descriptive statistics for each attribute is to account for its type.
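A sketch of the MEANS step behind Figure 4, again under the hypothetical WORK.SPEED_DATE assumption; the statistic keywords select the columns discussed in observations A to D:

proc means data=work.speed_date n nmiss mean std min max;
   /* with no VAR statement, PROC MEANS summarises every numeric variable */
run;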

Summary Statistics for Nominal Variables

Figure 5 MEANS procedure for variables Gender, Goal and Dec
Figure 6 Pie chart as graphical representation of the frequency distribution for nominal variables, e.g. Gender

Nominal variables in the dataset, like Gender, Goal, Dec and Met, are best described using statistics that provide counts, modes and proportions. Taylor supports this by explaining that nominal data is used to label variables without providing any quantitative value (Taylor, 2024). Figure 5 shows the mode and number of observations for the variables Gender, Goal, Dec and Met. The mode is used almost exclusively with nominal variables, as it is the only measure of central tendency applicable, given that nominal variables cannot be measured quantitatively (Trakulkasemsuk, 2014). For the same reason, there are no dispersion or position measures for nominal variables. Next, one-way frequency tables are produced to present the frequency distribution for the nominal variables, while pie charts are produced to represent those frequency distributions graphically; both are available in Appendix A. Spriestersbach explains that nominal variables can be effectively represented using frequency tables and pie charts, although other formats such as bar graphs work as well (Spriestersbach, Röhrig, Du Prel, Gerhold-Ay, & Blettner, 2009). The produced frequency distributions also reveal missing values and inconsistencies (refer to Appendix A). For example, in the variable 'met', data points such as NA, or any value besides 1 or 0, should not be present. These issues will be addressed during data pre-processing in the next assignment. The summary statistics for the nominal variables in the dataset are shown below:

Table 2 Summary statistics for variables Gender, Goal, Dec & Met

Measure | Gender | Goal | Dec | Met
Number of observations | 3000 | 2968 | 3000 | 2859
Number of missing observations | 0 | 32 | 0 | 141
Inconsistencies | none | none | none | Outliers present, e.g. 2, 5, 7
Measure of central tendency: Mode | 1 | 1 | 0 | 0
Measure of variability (dispersion) | NA | NA | NA | NA
Measure of position | NA | NA | NA | NA
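The one-way frequency tables behind Table 2 and Appendix A could be produced with a sketch like the following (hypothetical dataset name as before); the MISSING option counts missing values as their own level, which is one way the missing-observation rows above could be tallied:

proc freq data=work.speed_date;
   /* one-way frequency tables for each nominal variable */
   tables gender goal dec met / missing;
run;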

Summary Statistics for Ordinal Variables

Figure 7 MEANS procedure for variables Attr, Sinc, Intel, Fun, Amb, Shar, Like & Prob

The ordinal variables (Attr, Sinc, Intel, Fun, Amb, Shar, Like, Prob) shall be summarised by central tendency, variability and position measures. While the mode can represent central tendency in ordinal variables, the median can also be considered: since ordinal data can be ranked, the median identifies the central value within the dataset (Manikandan S., 2011). Unlike nominal variables, ordinal variables allow for measures of dispersion and position. Both the range and the interquartile range are useful variability measures for ordinal variables. However, with outliers in mind, the interquartile range (IQR) is the most appropriate variability measure for ordinal variables; the IQR provides information about the dispersion of the data while being less affected by outliers and without assuming interval data (Larson, 2006). Finding position measures is also possible with ordinal variables, using quartiles and percentiles. Quartiles and percentiles divide the dataset into equal-sized groups, which is vital for understanding a data point's location relative to any given quartile or percentile (Harding, Tremblay, & Cousineau, 2014). On top of the previously mentioned pie charts and frequency tables, a graph like a box-and-whisker plot (or boxplot) can be produced to display statistics for ordinal variables such as quartiles and outliers (Bensken, Pieracci, & Ho, 2021). This visualisation helps communicate position measures (Quartile 1 and Quartile 3 are represented by the edges of the box, Quartile 2 or the median by the line within the box) and outliers (the dots outside the lines, or whiskers). Below are the summary statistics for the ordinal variables in the dataset (refer to Appendix B for visualisations of the ordinal variables):

Figure 8 Frequency Distribution and Box Plot for the variable Attr

Table 3 Summary statistics for ordinal variables Attr, Sinc, Intel, Fun, Amb, Shar, Like & Prob

Measure | Attr | Sinc | Intel | Fun | Amb | Shar | Like | Prob
Number of observations | 2923 | 2896 | 2890 | 2868 | 2730 | 2599 | 2906 | 2889
Number of missing observations | 77 | 104 | 110 | 141 | 270 | 401 | 94 | 111
Inconsistencies | yes | yes | yes | yes | yes | yes | yes | yes
Measure of central tendency
Mode | 6 | 8 | 8 | 7 | 7 | 5 | 7 | 5
Median | 6 | 7 | 7 | 7 | 7 | 5 | 6 | 5
Measure of variability (dispersion)
Range | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
Interquartile range | 3 | 2 | 2 | 3 | 2 | 3 | 2 | 3
Measure of position
0% Min | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1% | 1 | 2 | 3 | 1 | 2 | 0 | 1 | 1
5% | 3 | 4 | 5 | 3 | 4 | 2 | 3 | 1
10% | 4 | 5 | 5 | 4 | 5 | 2 | 4 | 2
25% / Quartile 1 | 5 | 6 | 6 | 5 | 6 | 4 | 5 | 4
50% / Median / Quartile 2 | 6 | 7 | 7 | 7 | 7 | 5 | 6 | 5
75% / Quartile 3 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 7
90% | 9 | 9 | 9 | 9 | 9 | 8 | 8 | 8
95% | 9 | 10 | 10 | 9 | 10 | 9 | 9 | 9
99% | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
100% Max | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10
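A sketch of how the medians, IQRs and percentiles in Table 3 might be computed, plus a box plot like the one in Figure 8; dataset name hypothetical as before:

proc means data=work.speed_date median qrange p1 p5 p10 p25 p75 p90 p95 p99 min max;
   var attr sinc intel fun amb shar like prob;
run;

proc sgplot data=work.speed_date;
   vbox attr;   /* box-and-whisker plot: box edges at Q1/Q3, line at the median */
run;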

Summary Statistics for Ratio Variables

The final type of variable in the dataset, ratio variables, offers a higher level of measurement precision (compared to both ordinal and nominal variables), as mentioned earlier. Thus, the ratio variables (age and income in the dataset) can be summarised using a much wider range of statistical measures. Besides the mode and median, the mean can also be used as a measure of central tendency for ratio variables. According to Gravetter and Wallnau, the mean is particularly useful for ratio variables as it takes into account all data points, providing a comprehensive central value (Gravetter & Wallnau, 2013). For dispersion, two important measures are the variance and the standard deviation. The variance measures the average squared deviation of each data point from the mean, giving an indication of the overall dispersion within the dataset (Lee & Wong, 2001). The standard deviation, which is the square root of the variance, provides a measure of dispersion in the same units as the original data, making it easier to interpret than the variance (Heumann & Shalabh, 2016).

Figure 9 UNIVARIATE procedure for variable Age

The availability of the mean for ratio variables also opens up other statistical dispersion measures, such as the coefficient of variation (CV), shown in Figure 9, which provides further insight into the data distribution relative to the mean: a high CV indicates high variability relative to the mean, and vice versa (Bendel, Higgins, Teberg, & Pyke, 1989). While percentiles and quartiles apply to ratio variables just as they do to ordinal variables for measuring location, the availability of the mean also adds another positional measure, the z-score. Z-scores offer a way to determine the position of individual data points relative to the mean (Wang & Chen, 2012). A positive z-score indicates a value above the mean, and vice versa. Figure 10 shows a histogram of z-scores of the variable Age relative to the mean, with the data points outside the normal curve being the outliers. In conclusion, the availability of the mean allows for greater depth of statistical measures and visualisation (refer to Appendix C for visualisations of the summary statistics for ratio variables). Table 4 below shows the summary statistics of the ratio variables in the dataset.

Figure 10 Histogram of z-scores for variable Age

Table 4 Summary statistics of ratio variables (age, income) in the dataset

Measure | Age | Income
Number of observations | 2965 | 1545
Number of missing observations | 35 | 1455
Inconsistencies | yes | yes
Measure of central tendency
Mean | 26.41282 | 44922.53
Median | 26 | 43367.00
Mode | 27 | 55080.00
Measure of variability
Range | 37 | 100424
Interquartile range | 5 | 22424
Variance | 12.38081 | 286778515
Standard deviation | 3.51864 | 16935
Coefficient of variation | 13.3217038 | 37.6972019
Measure of position
0% Min | 18 | 8607
1% | 20 | 15863
5% | 22 | 21597
10% | 22 | 25589
25% / Quartile 1 | 24 | 31516
50% / Median / Quartile 2 | 26 | 43367
75% / Quartile 3 | 29 | 53940
90% | 30 | 69487
95% | 33 | 78704
99% | 36 | 90225
100% Max | 55 | 109031
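A sketch of how the ratio-variable statistics and the z-score histogram in Figure 10 might be produced. A z-score is z = (x - mean) / standard deviation, and PROC STANDARD with MEAN=0 and STD=1 rescales a variable to exactly that; dataset names are hypothetical:

proc univariate data=work.speed_date;
   var age income;   /* moments, CV, percentiles and extreme observations */
run;

proc standard data=work.speed_date mean=0 std=1 out=work.z_age;
   var age;          /* AGE in WORK.Z_AGE now holds z-scores */
run;

proc sgplot data=work.z_age;
   histogram age;    /* distribution of the z-scores */
   density age;      /* overlaid normal curve for comparison */
run;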

Metadata

While Figure 2 earlier was referred to for the number of variables in the chosen dataset, that CONTENTS output also carries essential information about the dataset itself. This 'data about data', or metadata, enables researchers and analysts alike to manage and understand datasets by providing essential context about their structure, content, and quality. Usually, metadata contains details such as the creation date, author, file format, and version of a dataset, as well as a description of each variable and its data type. The metadata of the chosen dataset is shown in Figure 11.

Figure 11 Metadata of the chosen dataset using the CONTENTS procedure

This additional layer of information enhances the usability and interpretability of the dataset and facilitates data discovery, organisation, and management, which further underlines the importance of generating metadata for a dataset (Xiao, Wei Jeng, & He, 2018).
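Metadata like that in Figure 11 can also be captured as a dataset for programmatic use; a minimal sketch with the hypothetical names used above (the OUT= dataset holds one row per variable):

proc contents data=work.speed_date out=work.meta noprint;
run;

proc print data=work.meta;
   var name type length;   /* variable name, type code and storage length */
run;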

Data Warehouse As Large Datasets Storage and Processing Centre

What is a Data Warehouse?

When one thinks of large-scale data storage, the term 'data warehouse' comes to mind. In fact, Gardner wrote that the term 'data warehouse' is easily confused even among executives in the industry (Gardner, 1998). The most appropriate term to describe the storage and management of large datasets would, however, be a database (Ceruti, 2021). A database is designed to support day-to-day operations, transaction processing and real-time querying. Typically, a database is organised into tables consisting of rows and columns using a relational model. Since a database is mainly designed for small, atomic transactions, it can handle online transaction processing (OLTP) tasks better than a data warehouse (Kour, 2015).

The definition of the term 'data warehouse' that scholars agree upon is a repository with the following characteristics: 'subject-oriented', 'non-volatile', 'integrated' and 'time-variant' (Maślankowski, 2013). The data in a data warehouse is focused on specific subjects, e.g. customer, product & sales, and cannot be changed once it has been loaded into the warehouse. Furthermore, the data is sourced from multiple, disparate channels into the data warehouse and is accurate with respect to time. With these characteristics, a data warehouse is designed for analytical processing such as business intelligence (BI) tasks, online analytical processing (OLAP) and complex querying for decision support (Ghosh, Haider, & Sen, 2015). Unlike a database, which requires optimisation of create, update, and delete operations, a data warehouse requires optimisation for read-heavy operations. Thus, data warehouses tend to be designed using schema designs such as the star schema and the snowflake schema, with real-life examples including Amazon Redshift, Google BigQuery and Snowflake (Başaran, 2005).
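To make the star-schema idea concrete: a typical warehouse query joins one central fact table to several dimension tables and aggregates across them. The sketch below uses PROC SQL with entirely hypothetical table and column names:

proc sql;
   /* fact table joined to two dimensions, then aggregated for analysis */
   select d.year,
          p.category,
          sum(f.sales_amount) as total_sales
   from fact_sales as f
        inner join dim_date as d
           on f.date_key = d.date_key
        inner join dim_product as p
           on f.product_key = p.product_key
   group by d.year, p.category;
quit;

This read-only, join-and-aggregate pattern is precisely what the star schema optimises and what the criteria below evaluate.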

5 Criteria to Evaluate Data Warehouse as Large Datasets Storage and Processing Centre

Figure 12 summarises the five criteria used in this evaluation: (1) Are the query requirements complex? (2) Is historical data analysis necessary? (3) Do the data come from multiple sources? (4) Are the workloads read-intensive? (5) Do the operations require variant types of analysis (descriptive, diagnostic, predictive, prescriptive)?

Figure 12 Criteria for the data warehouse concept to store and process large datasets

Data warehouses and databases serve different objectives, and under normal circumstances a database is the best option for a large-dataset storage and processing centre. However, as mentioned previously, there are a few scenarios where implementing the data warehouse concept is more efficient than using a database.

Criteria 1: Complex Query Requirements

Data warehouses are designed explicitly to optimise complex queries and analytical processes. Kimball and Ross emphasise that data warehouses are specifically optimised for complex queries through the use of specialised indexing and partitioning. This allows for sophisticated multi-dimensional analysis which operational databases are not designed to handle (Kimball & Ross, 2013). Furthermore, Inmon supports the view that separating analytical processing from the transactional database into a data warehouse ensures that complex queries do not impact the performance of day-to-day operations (Inmon, 2005). Similarly, Watson and Matyska point out that data warehouses facilitate complex queries and analyses, such as ad hoc reporting and data mining, without degrading the performance of databases in transactional systems, despite data warehouses initially drawing their data from those systems (Watson, Ariyachandra, & Matyska, 2001). Not to mention, Chaudhuri and Dayal highlight that executing complex OLAP queries against operational databases will only yield poor performance (Chaudhuri & Dayal, 1997).
However, it could be argued that implementing a data warehouse requires significant investment in both hardware and skilled personnel. The initial setup and ongoing maintenance costs can pose challenges for smaller organisations. Golfarelli and Rizzi suggest that organisations should weigh the benefits of a data warehouse, in terms of query performance and analytical capabilities, against the initial cost before investing in one (Golfarelli & Rizzi, 2009).

For organisations with elaborate analytical requirements, a data warehouse can offer a robust
solution that operational databases cannot match. Despite the high initial costs, the long-term
benefits in performance and analytical capabilities may justify the investment.

Criteria 2: Historical Data Analysis

Operational databases typically store only current transactional data and often purge historical data to maintain performance (Prasad, 2023). On the other hand, Ponniah states that data warehouses are designed to store and analyse historical data, enabling organisations to perform long-term trend analysis and forecasting, which is critical for strategic planning (Ponniah, 2011). While data redundancy may sound undesirable, Montgomery, Jennings and Kulahci argue that maintaining extensive historical data allows for comprehensive time-series analysis, a fundamental advantage of data warehouses (Montgomery, Jennings, & Kulahci, 2015). A data warehouse's ability to retain and analyse historical data allows organisations to identify patterns over time, make data-driven predictions and assess the impact of past decisions, which is essential for business intelligence (Niu, Jie, & Zhang, 2009).

However, both complexity and storage costs can increase as a result of maintaining a large historical dataset. On top of that, ensuring data integrity over long periods is challenging, as it requires consistent data governance practices. Ionescu and Diaconita note that while managing large volumes of historical data can be resource-intensive, the ability to perform extensive historical analysis provides significant value for decision-making (Ionescu & Diaconita, 2023).

If the organisation requires its data storage centre to support historical data analysis, a data warehouse could be indispensable and the correct choice. While storage and maintenance costs are a concern, the strategic benefits of historical data analysis may outweigh these challenges.

Criteria 3: Data Integration from Multiple Sources

An organisation may need to consolidate datasets from various systems like CRM, ERP, and external databases. A data warehouse excels at integrating data from multiple heterogeneous sources to provide a unified view of enterprise data (Hellerstein, Stonebraker, & Caccia, 1999). According to Vassiliadis & Simitsis, the ETL processes used in data warehouses ensure that data from different sources is cleaned, transformed and loaded in a consistent format (Vassiliadis & Simitsis, 2009). Operational databases do not typically support this level of integration; they are often siloed, making it difficult to get a comprehensive view of the data across the organisation (Patel, 2019).

In spite of that, the process of integrating data from multiple sources is complex and time-consuming (Santos & Bernardino, 2008). It requires careful planning and robust ETL tools to ensure data quality and consistency.

For organizations needing a unified view of data from multiple sources, data warehouses provide
an effective solution. The ability to consolidate and analyse diverse data sources is a significant
advantage despite the complexity and time required for integration.

Criteria 4: Read-Intensive Workloads


In a work environment where the primary need is data retrieval and analysis, a data warehouse triumphs over operational databases. This is because data warehouses are optimised for read-intensive operations, which involve frequent querying and reporting, while operational databases are primarily designed for write-intensive transactions, handling large volumes of insert, update, and delete operations efficiently (Sippu & Soisalon-Soininen, 2015). Supporting this, Raj et al. argue that the architecture of data warehouses supports high-performance read operations, making them ideal for analytical workloads (Raj, et al., 2015).

One cannot deny that the optimisation of read-intensive tasks can come at the expense of slower write operations. Athanassoulis et al. point out that while data warehouses are optimised for read performance, this comes at the expense of write efficiency, which may not be suitable for applications requiring frequent data updates (Athanassoulis, Chen, Ailamaki, Gibbons, & Stoica, 2011).

A data warehouse can offer superior performance for organisations with heavy read requirements, such as extensive reporting and analysis. However, the organisation must note the trade-off in write performance when using a data warehouse as its data store.

Criteria 5: Variant Types of Analysis (Descriptive, Diagnostic, Predictive, Prescriptive)

If the organisation's reporting requirement is merely static, one-time lists, an operational database will suffice. However, performing analytical queries is very complex with an operational database due to the number of table joins involved (Shivakumar, 2020). This is not the case with a data warehouse, as the historical and integrated data stored in data warehouses is crucial for performing different types of analysis, which helps organisations make informed decisions based on comprehensive data insights (March & Hevner, 2007). Along with this, Cardon notes that the possibilities for reporting and analysis are far more diverse with a data warehouse, ranging from simple descriptive reports to complex predictive modelling and prescriptive analytics (Cardon, 2014).

The concern about setup complexity and maintenance still applies to data warehouses. Even so, data warehouses are highly effective solutions should the organisation require a range of analytical capabilities.

Conclusion

The detailed evaluation of the five criteria demonstrates that data warehouses offer substantial benefits, especially for organisations that require advanced analytical capabilities. The robust support for complex queries, historical data analysis, data integration, read-intensive workloads, and various types of analysis makes data warehouses an invaluable asset for decision-making and strategic planning. Despite the challenges related to cost, complexity, and resource requirements, the advantages provided by data warehouses in these areas outweigh the drawbacks.

Citations

W.H.Inmon. (2002). Building the Data Warehouse (3rd ed.). New York: John Wiley & Sons, Inc.
Nguyen, T., & Obafemi-Ajayi, T. (2019). Structured Iterative Hard Thresholding For Categorical
And Mixed Data Types. 2019 IEEE Symposium Series on Computational Intelligence
(pp. 2541 - 2547). Institute of Electrical and Electronics Engineers.
Suarez Alvarez, M. D. (2010). Design and analysis of clustering algorithms for numerical,
categorical and mixed data. Cardiff University.
Kaplan, R. M., Chambers, D. A., & Glasgow, R. E. (2014). Big Data and Large Sample Size: A
Cautionary Note on the Potential for Bias. CTS Journal.
Subramanian, C., Jayadevan, S., & Aji, G. (2019). Statistical Issues in Small and Large Sample:
Need of Optimum Upper Bound for the Sample Size. University of Bahrain Scientific
Journals.
Kwak, S. K., & Kim, J. H. (2017). Statistical data preparation: management of missing values
and outliers. Korean Journal of Anesthesiology.
Huang, M. W., Lin, W. C., & Tsai, C. F. (2018). Outlier Removal in Model-Based Missing
Value Imputation for Medical Datasets. Journal of healthcare engineering.
Ragab, M., Eldele, E., Tan, W. L., Foo, C.-S., Chen, Z., Wu, M., . . . Li, X. (2023). ADATIME:
A Benchmarking Suite for Domain Adaptation on Time Series Data. ACM Trans. Knowl.
Discov. Data.
SAS Institute, Inc. (2024). Data Warehouse: What it is and why it matters. Retrieved from SAS
Insights : https://www.sas.com/en_us/insights/data-management/data-warehouse.html
Mishra, P., Pandey, C. M., Singh, U., & Gupta, A. (2018). Scales of measurement and
presentation of statistical data. Annals of cardiac anaesthesia, 419–422.
Freund, R., Freund, R., & Wilson, W. (2003). Statistical Methods. Netherlands: Elsevier
Science.
Morcillo, A. (2023). Descriptive statistics: organizing, summarizing, describing, and presenting
data.
Larson, M. G. (2006). Descriptive Statistics and Graphical Displays. Circulation.

Manikandan, S. (2010). Data transformation. Journal of pharmacology & pharmacotherapeutics.
Manikandan, S. (2011). Measures of central tendency: The mean. Journal of pharmacology &
pharmacotherapeutics, 140-142.
Taylor, S. (2024). Nominal Data. Retrieved from Corporate Finance Institute:
https://corporatefinanceinstitute.com/resources/data-science/nominal-data/
Trakulkasemsuk, W. (2014). Understanding Central Tendency. Proceedings of the International
Conference on Doing Research in Applied Linguistics (pp. 75-93). King Mongkut’s
University of Technology Thonburi.
Spriestersbach, A., Röhrig, B., Du Prel, J. B., Gerhold-Ay, A., & Blettner, M. (2009).
Descriptive statistics: the specification of statistical measures and their presentation in
tables and graphs. Deutsches Ärzteblatt International.
Harding, B., Tremblay, C., & Cousineau, D. (2014). Standard errors: A review and evaluation of
standard error estimators using Monte Carlo simulations. The Quantitative Methods for
Psychology.
Bensken, W. P., Pieracci, F. M., & Ho, V. P. (2021). Basic introduction to statistics in medicine,
part 1: Describing data. Surgical Infections, 590-596.
Gravetter, F., & Wallnau, L. (2013). Statistics for the behavioral sciences. Belmont: Wadsworth
Cengage Learning.
Lee, J., & Wong, D. W. (2001). Statistical analysis with ArcView GIS. John Wiley & Sons.
Heumann, C., & Shalabh, M. S. (2016). Introduction to statistics and data analysis. Switzerland:
Springer International Publishing.
Bendel, R. B., Higgins, S. S., Teberg, J. E., & Pyke, D. A. (1989). Comparison of skewness
coefficient, coefficient of variation, and Gini coefficient as inequality measures within
population. Oecologia, 394-400.
Wang, Y., & Chen, H.-J. (2012). Use of percentiles and z-scores in anthropometry. In Y. Wang,
& H.-J. Chen, Handbook of anthropometry: Physical measures of human form in health
and disease (pp. 29-48). New York: Springer.
Xiao, F., Wei Jeng, & He, D. (2018). Investigating metadata adoptions for open government data
portals in US cities. Proceedings of the Association for Information Science and
Technology, 573-582.
Gardner, S. R. (1998). Building the data warehouse. Communications of the ACM , 52-60.
Ceruti, M. G. (2021). A Review of Database System Terminology. In M. G. Ceruti, Handbook of
Data Management (pp. 13-31). Auerbach Publications.
Kour, A. (2015). Data Warehousing, Data Mining, OLAP and OLTP Technologies Are
Indispensable Elements to Support Decision-Making Process in Industrial World.
International Journal of Scientific and Research Publications, 1-7.
Maślankowski, J. (2013). The evolution of the data warehouse systems in recent years.
Zarządzanie i Finanse.
Ghosh, R., Haider, S., & Sen, S. (2015). An integrated approach to deploy data warehouse in
business intelligence environment. Proceedings of the 2015 Third International
Conference on Computer, Communication, Control and Information Technology (C3IT)
(pp. 1-4). IEEE.
Başaran, B. P. (2005). A comparison of data warehouse design models. Atılım Üniversitesi.
Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modeling. Wiley.
Inmon, W. H. (2005). Building the Data Warehouse, 4th Edition. Wiley.
Watson, H., Ariyachandra, T., & Matyska, R. J. (2001). Data Warehousing Stages of Growth.
Information Systems Management, 42-50.
Chaudhuri, S., & Dayal, U. (1997). An Overview of Data Warehousing and OLAP Technology.
ACM Sigmod Record.
Golfarelli, M., & Rizzi, S. (2009). Data warehouse design: Modern principles and
methodologies. McGraw-Hill, Inc.
Prasad, N. (27 November, 2023). From Databases to Data Lakes: A Comprehensive Guide to
Streamlining Data Analytics for Modern Businesses. LinkedIn Pulse.
Ponniah, P. (2011). Data warehousing fundamentals for IT professionals. John Wiley & Sons.
Montgomery, D. C., Jennings, C. L., & Kulahci, M. (2015). Introduction to time series analysis
and forecasting. John Wiley & Sons.
Niu, L., Jie, L., & Zhang, G. (2009). Cognition-driven decision support for business intelligence.
Models, Techniques, Systems and Applications. Studies in Computational Intelligence,
Springer, Berlin, 4-5.
Ionescu, S.-A., & Diaconita, V. (2023). Transforming Financial Decision-Making: The Interplay
of AI, Cloud Computing and Advanced Data Management Technologies. International
Journal of Computers Communications & Control.
Hellerstein, J. M., Stonebraker, M., & Caccia., R. (1999). Independent, open enterprise data
integration. IEEE Data Engineering Bulletin, 43-49.
Vassiliadis, P., & Simitsis, A. (2009). Extraction, Transformation, and Loading. Encyclopedia of
Database Systems.
Patel, J. (2019). Bridging data silos using big data integration. International Journal of Database
Management Systems, 01-06.
Santos, R. J., & Bernardino, J. (2008). Real-time data warehouse loading methodology.
Proceedings of the 2008 international symposium on Database engineering &
applications, (pp. 49-58).
Sippu, S., & Soisalon-Soininen, E. (2015). Transaction processing: Management of the logical
database and its underlying physical structure. Springer.
Raj, P., Raman, A., Nagaraj, D., Duggirala, S., Raj, P., Raman, A., . . . Duggirala, S. (2015).
High-performance integrated systems, databases, and warehouses for big and fast data
analytics. High-Performance Big-Data Analytics: Computing Systems and Approaches,
233-274.
Athanassoulis, M., Chen, S., Ailamaki, A., Gibbons, P. B., & Stoica, R. (2011). MaSM: efficient
online updates in data warehouses. Proceedings of the 2011 ACM SIGMOD International
Conference on Management of data. ACM SIGMOD.
Shivakumar, S. K. (2020). Modern Web Data Patterns. In S. K. Shivakumar, Modern Web Performance Optimization: Methods, Tools, and Patterns to Speed Up Digital Platforms (pp. 301-325). Springer.
March, S. T., & Hevner, A. R. (2007). Integrated decision support systems: A data warehousing
perspective. Decision support systems, 1031-1043.
Cardon, D. (May, 2014). Database vs Data Warehouse: A Comparative Review. Health Catalyst.

Appendix A
Frequency Tables and Pie Charts for nominal variables in the dataset

Appendix B
Visualisation for ordinal variables in the dataset

Appendix C
Visualisation for ratio variables in the dataset

