CCS341 - Data Warehousing Lab Manual (2021)
Name :
Register Number :
Degree / Branch :
Semester :
Subject Code :
Subject Name :
Department of
Certified that this is the Bonafide record of the work carried out
by Mr/Ms (name)
(Reg.No.) (semester)
in Laboratory
FACULTY INCHARGE
Introduction
The goal of this lab is to install Weka and become familiar with it.
Steps:
1. Download and install Weka. You can find it here:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
2. Open Weka and have a look at the interface. It is an open-source project written
in Java from the University of Waikato.
7. In this lab, we will work with the Iris dataset. To open the Iris dataset, click on 'Open file' in the 'Preprocess' tab. From your 'data' folder, select iris.arff and hit Open.
8. To know more about the Iris dataset, open iris.arff in Notepad++ or a similar tool and read the comments.
Class distribution (from the ARFF header comments):
Iris Versicolour: 50
Iris Virginica: 50
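As an aside, the same class counts can be checked from Python instead of Notepad++. The sketch below is not part of the original exercise; it assumes scipy and pandas are installed, and the "data/iris.arff" path is an assumption.

# A minimal sketch, assuming iris.arff sits in a local "data" folder.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("data/iris.arff")   # hypothetical path
df = pd.DataFrame(data)
df["class"] = df["class"].str.decode("utf-8")  # byte strings -> text labels

print(meta)                          # attribute names and types
print(df["class"].value_counts())    # instances per class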
Result:
Thus, data exploration and integration using Weka were carried out successfully.
Aim:
To convert a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool.
Objectives:
Most of the data collected from public forums is in plain text format, which the Weka tool cannot read directly. Since Weka (a data mining tool) recognizes data in ARFF format only, we have to convert the text file into an ARFF file.
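As a supplement to the procedure, a minimal Python sketch of the text-to-ARFF conversion is shown below. The file names ("input.txt", "output.arff"), the header row, and the assumption that all columns except the last (nominal class) are numeric are illustrative only, not part of the manual.

# A minimal sketch, assuming a comma-separated text file with a header row.
import csv

with open("input.txt") as f:
    rows = list(csv.reader(f))
header, body = rows[0], rows[1:]

classes = sorted({r[-1] for r in body})        # distinct class labels
with open("output.arff", "w") as out:
    out.write("@relation converted\n\n")
    for name in header[:-1]:
        out.write(f"@attribute {name} numeric\n")
    out.write(f"@attribute {header[-1]} {{{','.join(classes)}}}\n\n")
    out.write("@data\n")
    for r in body:
        out.write(",".join(r) + "\n")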
Algorithm:
Output:
Result:
Thus, conversion of a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool is implemented.
Aim:
To convert ARFF (Attribute-Relation File Format) into text file.
Objectives:
Since the data in the Weka tool is in ARFF file format, we have to convert the ARFF file to text format for further processing.
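As a supplement, here is a minimal Python sketch of the reverse conversion. It assumes an ARFF file named "input.arff" (hypothetical) and writes its @data section out as plain comma-separated text.

# A minimal sketch: strip the ARFF header and keep only the data rows.
with open("input.arff") as src, open("output.txt", "w") as dst:
    in_data = False
    for line in src:
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            dst.write(line + "\n")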
Algorithm:
Aim:
To apply the concept of Linear Regression for training the given dataset.
Algorithm:
LINEAR REGRESSION:
PROBLEM:
Consider the dataset below, where x is the number of years of working experience of a college graduate and y is the corresponding salary of the graduate. Build a regression equation and predict the salary of a college graduate whose experience is 10 years.
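Since the actual x/y values appear in the input table that follows, the sketch below uses hypothetical numbers only to show how the least-squares regression equation and the 10-year prediction are computed.

# A minimal sketch of simple linear regression y = a + b*x by least squares.
xs = [1, 2, 3, 4, 5, 6]              # years of experience (assumed values)
ys = [30, 35, 41, 46, 50, 56]        # salary in thousands (assumed values)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"y = {a:.2f} + {b:.2f} * x")
print("Predicted salary at 10 years:", a + b * 10)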
INPUT:
Result:
Thus, the concept of Linear Regression was applied to the given training dataset and implemented successfully.
Aim:
To apply Naive Bayes Classification for testing the given dataset.
Algorithm:
Example: predict whether a customer will buy a computer or not. Customers are described by two attributes: age and income. X is a 35-year-old customer with an income of 40k. H is the hypothesis that the customer will buy a computer. P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.
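A minimal Python sketch of this computation follows; the small training table is hypothetical and only illustrates how P(H|X) is scored as P(X|H)·P(H) under the naive independence assumption.

# Hypothetical training tuples: (age_group, income_group, buys_computer)
data = [
    ("youth", "high", "no"), ("youth", "medium", "no"),
    ("middle", "medium", "yes"), ("middle", "high", "yes"),
    ("senior", "low", "yes"), ("senior", "medium", "no"),
    ("middle", "low", "yes"), ("youth", "low", "yes"),
]

def posterior(age, income, label):
    subset = [t for t in data if t[2] == label]
    p_h = len(subset) / len(data)                              # prior P(H)
    p_age = sum(t[0] == age for t in subset) / len(subset)     # P(age|H)
    p_inc = sum(t[1] == income for t in subset) / len(subset)  # P(income|H)
    return p_h * p_age * p_inc         # proportional to P(H|X)

# X: a 35-year-old ("middle" age group) customer with 40k ("medium") income.
# In practice Laplace smoothing is added so zero counts do not wipe out a class.
scores = {label: posterior("middle", "medium", label) for label in ("yes", "no")}
print(scores)          # the larger score gives the predicted class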
Input Data:
Result:
Thus, Naive Bayes Classification for testing the given dataset is implemented successfully.
Aim:
Process: Replacing missing attribute values by the attribute mean. This method is used for data sets with numerical attributes. An example of such a data set is presented in fig no: 4.1.
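A minimal Python sketch of the same idea (replace each missing numeric value with its column mean, as Weka's ReplaceMissingValues filter does for numeric attributes); the two columns and their values are hypothetical stand-ins for the data set in fig no: 4.1.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "temperature": [64, 68, np.nan, 72, 75],     # assumed values
    "humidity":    [65, np.nan, 70, 90, np.nan], # assumed values
})
# Fill each missing entry with the mean of its own column.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)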
Result:
Thus, the preprocessing step of handling missing values was completed successfully.
Objectives:
The data collected from public forums has plenty of noise and missing values. Weka provides filters to replace the missing values and to remove the noisy data, so that the result will be more accurate.
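As a supplement to the Weka filter steps, here is a minimal Python sketch of the two clean-up actions described above: filling missing values with the column mean and dropping noisy rows by the interquartile-range rule. The column name and values are hypothetical.

import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [10000, 12000, np.nan, 13000, 990000, 15000]})

df["salary"] = df["salary"].fillna(df["salary"].mean())      # fill missing values
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]                                                # drop noisy outliers
print(df)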
Algorithm:
OUTPUT:
Result:
OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:
Roll-up (Drill-up)
Drill-down
Slice and dice
Pivot (rotate)
Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
The data is grouped by country rather than by city.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions are added to the data cube.
It navigates the data from less detailed data to highly detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and provides
a new sub-cube.
Dice:
The dice operation selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view
in order toprovide an alternative presentation of data.
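As a quick illustration before the Excel exercise, here is a minimal pandas sketch of roll-up, slice, dice, and pivot on a hypothetical sales table; the country/city/quarter/item columns and the unit counts are made up for the example.

import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city":    ["Chennai", "Mumbai", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "item":    ["mobile", "modem", "mobile", "modem"],
    "units":   [605, 825, 440, 680],
})

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby("country")["units"].sum()

# Slice: select a single value of one dimension (quarter = "Q1")
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "India")]

# Pivot (rotate): swap the axes of the presentation
pivot = sales.pivot_table(values="units", index="item",
                          columns="country", aggfunc="sum")
print(rollup, slice_q1, dice, pivot, sep="\n\n")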
Now, we are practically implementing all these OLAP operations using Microsoft Excel.
4. We got all the music.cub data for analyzing different OLAP operations. First, we performed the drill-down operation as shown below.
Result:
Thus the OLAP operations such as roll-up, drill-down, slice, dice and pivot are implemented successfully.
ETL (Extract-Transform-Load):
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. Currently, ETL also encompasses a cleaning step as a separate step; the sequence is then Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.
PROCESS:
EXTRACT
The Extract step covers the data extraction from the source system and makes the data accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.
There are several ways to perform the extract:
Clean:
The cleaning step is one of the most important, as it ensures the quality of the data loaded into the data warehouse.
Transform:
The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e., a conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them again only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
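To make the sequence concrete, a minimal standard-library Python sketch of the Extract-Clean-Transform-Load flow is given below; the source file name, column names, and target table are assumptions, not the lab's actual data.

import csv
import sqlite3

# Extract: read raw rows from the source file (hypothetical "sales_src.csv")
with open("sales_src.csv") as f:
    rows = list(csv.DictReader(f))

# Clean: drop rows with missing keys, normalise text fields
rows = [r for r in rows if r.get("product_id")]
for r in rows:
    r["product_name"] = r["product_name"].strip().title()

# Transform: derive a calculated value and aggregate per product
totals = {}
for r in rows:
    totals[r["product_id"]] = totals.get(r["product_id"], 0) + \
        int(r["qty"]) * float(r["unit_price"])

# Load: write the transformed rows into the target table
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS fact_sales (product_id TEXT, revenue REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", totals.items())
con.commit()
con.close()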
Managing ETL Process:
The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the extract step. We can ensure this by implementing proper staging.
Result:
Thus the ETL script is implemented successfully.
Now, we are going to design these multi-dimensional models for the Marketing enterprise.
First, we need to build the tables in a database through SQLyog as shown below.
In the above window, the left-side navigation bar shows a database named "sales_dw", in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" to build the multi-dimensional models.
Through data source views and cubes, we can see our retrieved tables as multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multidimensional models consist of dimension tables and fact tables.
A star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. It is the simplest style of data warehouse schema.
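As a small illustration of the star join described above, the sketch below builds a toy fact table and two dimension tables with sqlite3; the table and column names are modelled loosely on the sales_dw tables and are assumptions, not the exact SQLyog schema.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dimproduct (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE dimstores (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT
);
CREATE TABLE factproductsales (
    sales_key INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dimproduct(product_key),
    store_key INTEGER REFERENCES dimstores(store_key),
    quantity INTEGER,
    amount REAL
);
""")

# Star join: the fact table joins to each dimension table;
# the dimension tables never join to each other.
con.execute("""
SELECT s.store_name, p.product_name, SUM(f.quantity)
FROM factproductsales f
JOIN dimproduct p ON f.product_key = p.product_key
JOIN dimstores  s ON f.store_key  = s.store_key
GROUP BY s.store_name, p.product_name
""")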
When the Explorer is first started only the first tab is active; the others are greyed
out. This is because it is necessary to open (and potentially pre-process) a data set
before starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the Log button, and the Weka bird) stays visible regardless of which section you are in.
1. Preprocessing
Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into WEKA. Data can be in WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
2. Classification:
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box brings up a dialog for choosing a classifier and configuring its options.
3. Clustering:
4. Associating:
Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.
6. Visualizing:
Result:
Thus the tools are explored and analyzed successfully.
Fact Table:
The fact table FACT_SALES has a grain of one row per date, per store, and per product, giving the number of units sold.
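A minimal pandas sketch of what this grain means in practice follows; the key values and unit counts are hypothetical.

import pandas as pd

fact_sales = pd.DataFrame({
    "date_key":    [20240101, 20240101, 20240102],
    "store_key":   [1, 2, 1],
    "product_key": [10, 10, 20],
    "units_sold":  [5, 3, 7],
})
# Each (date_key, store_key, product_key) combination appears at most once:
# that is exactly what "grain" means for this fact table.
print(fact_sales.groupby(["date_key", "store_key", "product_key"])["units_sold"].sum())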
Description:
Procedure:
1) Open Start -> Programs -> Accessories -> Notepad
2) Type the following training data set with the help of Notepad for the Employee table.
@relation employee
@attribute eid numeric
@attribute ename {raj,ramu,anil,sunil,rajiv,sunitha,kavitha,suresh,ravi,ramana,ram,kavya,navya}
@attribute salary numeric
@attribute exp numeric
@attribute address {pdtr,kdp,nlr,gtr}
@data
101,raj,10000,4,pdtr
102,ramu,15000,5,pdtr
103,anil,12000,3,kdp
104,sunil,13000,3,kdp
105,rajiv,16000,6,kdp
106,sunitha,15000,5,nlr
window will be opened; set the path and enter .arff in the 'Look in' dialog box to save the normalized data.
12) Right-click on Arff Loader and click on the Start Loading option; then everything will be executed one by one.
13) Check whether the output is created or not by browsing to the chosen path.
14) Rename the data file as a.arff
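As a supplement, the min-max scaling that Weka's Normalize filter applies to numeric attributes can be reproduced in a few lines of Python over the salary and exp values typed in step 2.

# Min-max normalization of the numeric attributes from the employee data above.
salaries = [10000, 15000, 12000, 13000, 16000, 15000]
exps     = [4, 5, 3, 3, 6, 5]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max(salaries))   # salary scaled to [0, 1]
print(min_max(exps))       # experience scaled to [0, 1]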