Data Mining Practical 7
(VI SEMESTER)
EXPERIMENT NO: 7
THEORY:
Today, a large number of standard data mining methods are available. From a historical perspective, these
methods have different roots. One early group of methods was adopted from classical statistics: the focus
was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include
methods from Bayesian decision theory, regression theory, and principal component analysis. Another
group of methods stemmed from artificial intelligence, such as decision trees, rule-based systems, and
others. The term 'machine learning' includes methods such as support vector machines and artificial
neural networks. There are several different and sometimes overlapping categorizations; for example,
fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as
computational intelligence.[2]
The typical life cycle of a new data mining method begins with theoretical papers based on in-house
software prototypes, followed by public or on-demand software distribution of successful
algorithms as research prototypes. Then, either special commercial or open-source packages containing a
family of similar algorithms are developed, or the algorithms are integrated into existing open-source or
commercial packages. Many companies have tried to promote their own stand-alone packages, but only
a few have reached notable market shares. The life cycle of some data mining tools is remarkably short.
Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger
ones, leading to a renaming and integration of product lines.
The largest commercial success stories resulted from the step-wise integration of data mining methods
into established commercial statistical tools. Companies such as SPSS, founded in 1975 with precursors
from 1968, or SAS, founded in 1976, have been offering statistical tools for mainframe computers since
the 1970s.
Many companies offering business intelligence products have integrated data mining solutions into their
database products; one example is Oracle Data Mining (established in 2002). Many of these products
also result from the acquisition and integration of specialized data mining companies.
In 2008, the worldwide market for business intelligence (i.e., software and maintenance fees) was 7.8
billion USD, including 1.5 billion USD in so-called 'advanced analytics', which covers data mining and
statistics. This sector grew by 12.1% between 2007 and 2008, with large players including companies
such as SAS (33.2%, tool: SAS Enterprise Miner), SPSS (14.3%, since 2009 an IBM company; tool:
IBM SPSS Modeler), Microsoft (1.7%, tool: SQL Server Analysis Services), Teradata (1.5%, tool:
Teradata Database, former name TeraMiner), and TIBCO (1.4%, tool: TIBCO Spotfire). [3]
Open-source libraries have also become very popular since the 1990s. The most prominent example is the
Waikato Environment for Knowledge Analysis (WEKA), see Ref 8. WEKA started in 1994 as a C++
library, with its first public release in 1996. In 1999, it was completely rebuilt as a Java package; since
that time, it has been regularly updated.
A. User Groups
There are many different data mining tools available, which fit the needs of quite different user
groups:
1. Business applications: This group uses data mining as a tool for solving commercially relevant
business applications such as customer relationship management, fraud detection, and so on.
2. Applied research: A user group that applies data mining to research problems, for
example, in technology and life sciences. Here, users are mainly interested in tools with
well-proven methods, a graphical user interface (GUI), and interfaces to domain-related
data formats or databases.
3. Algorithm development: This group develops new data mining algorithms and requires tools to both
integrate its own methods and compare these with existing methods. The necessary tools should
contain many competing algorithms.
4. Education: For education at universities, data mining tools should be very intuitive, with a
comfortable interactive user interface, and inexpensive. In addition, they should allow the
integration of in-house methods during programming seminars.
B. Data Structures
dimensional feature tables. In this classical format, a dataset consists of a set of N examples (e.g.,
clients of an insurance company) with s features containing real values or
usually integer-coded classes or symbols (e.g., income, age, number of contracts, and the like). This
format is supported by nearly all existing tools. In some cases, the dataset can be sparse, with only a
few nonzero features, such as a list of s shopping items for N different customers. The
computational and memory effort can be reduced if a tool exploits this sparse structure.
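
The saving from a sparse structure can be illustrated with a minimal sketch. It assumes a made-up table of N = 3 customers and s = 6 shopping items, and the helper names to_sparse and sparse_dot are hypothetical, not taken from any tool named in this practical. Each example is stored as a dictionary holding only its nonzero features.

```python
# Dense vs. sparse storage of an N x s feature table (illustrative sketch).
# The shopping data below is made up for the example.

def to_sparse(dense_rows):
    """Keep only nonzero entries: one {feature_index: value} dict per example."""
    return [{j: v for j, v in enumerate(row) if v != 0} for row in dense_rows]

def sparse_dot(row_a, row_b):
    """Dot product of two sparse rows; iterates over the shorter row only."""
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a
    return sum(v * row_b.get(j, 0) for j, v in row_a.items())

# N = 3 customers, s = 6 shopping items; entries are purchase counts.
dense = [
    [0, 2, 0, 0, 0, 1],
    [0, 0, 0, 3, 0, 0],
    [1, 2, 0, 0, 0, 0],
]
sparse = to_sparse(dense)

stored_dense = sum(len(row) for row in dense)    # N * s = 18 stored values
stored_sparse = sum(len(row) for row in sparse)  # 5 nonzero values only

print(sparse)
print(stored_dense, stored_sparse)
print(sparse_dot(sparse[0], sparse[2]))  # similarity of customers 0 and 2
```

Here the dense table stores 18 values while the sparse form stores 5, and the dot product touches only the stored entries. Real tools typically use dedicated sparse matrix formats rather than dictionaries; the dictionary form is chosen here only to keep the sketch short.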
There are different types of data formats, such as feature data (e.g., age and income), texts, time-series
data, sequences, images, graphs, 3D images, videos, and 3D videos.
D. Platforms
Data mining tools can be subdivided into stand-alone and client/server solutions. Client/server
solutions dominate, especially in products designed for business users. They are available for
different platforms, including Windows, Mac OS, Linux, or special mainframe supercomputers.
There is a growing number of Java-based systems that are platform independent for users in
research and applied research.
E. Licenses
There exists a wide variety of data mining tools with commercial and open-source licenses. This is
particularly true in the business application user group, where commercial software is very
attractive due to high software stability, good coupling with other commercial tools for data
warehouses, included software maintenance, and the possibility of user training for sophisticated
topics. For all other user groups, there is a strong trend toward open-source software, but different
types of licenses exist for this.
Following the criteria from the previous section, different types of similar data mining tools can be found.
The tools and their group membership are summarized in separate tables for commercial tools
(Tables 1 and 2) and for free and open-source data mining tools (Table 3). In
these tables, very popular tools are marked in bold.
1. Data mining suites (DMS) focus largely on data mining and include numerous methods. They
support feature tables and time series, while additional tools for text mining are sometimes
available. Typical examples include IBM SPSS Modeler, SAS Enterprise Miner, Alice d'Isoft,
DataEngine, DataDetective, GhostMiner, Knowledge Studio, KXEN, NAG Data Mining
Components, Partek Discovery Suite, STATISTICA, and TIBCO Spotfire.
2. Business intelligence packages (BIs) have no special focus on data mining, but include basic data
mining functionality, especially for statistical methods in business applications. Most BI software
is commercial (IBM Cognos 8 BI, Oracle Data Mining, SAP NetWeaver Business Warehouse,
Teradata Database, DB2 Data Warehouse from IBM, and PolyVista), but a few open-source
solutions exist (Pentaho).
3. Mathematical packages (MATs) have no special focus on data mining, but provide a large and
extendable set of algorithms and visualization routines. MATs are attractive to users in algorithm
development and applied research because data mining algorithms can be rapidly implemented,
mostly in the form of extensions (EXT) and research prototypes (RES). MAT packages exist as
commercial (MATLAB and R-PLUS) or open-source tools (R, Kepler).
4. Integration packages (INTs) are extendable bundles of many different open-source algorithms,
either as stand-alone software (mostly based on Java, such as KNIME, the GUI version of WEKA,
KEEL, and TANAGRA) or as a kind of larger extension package for tools of the MAT type
(such as Gait-CAD, PRTools for MATLAB, and RWEKA for R).
5. Extensions (EXT) are smaller add-ons for other tools such as Excel, MATLAB, R, and so forth, with
limited but quite useful functionality. Data mining libraries (LIBs) implement data mining methods as a
bundle of functions. These functions can be embedded in other software tools using an
Application Programming Interface (API) for the interaction between the software tool and the
data mining functions. A graphical user interface is missing, but some functions can support the
integration of specific visualization tools. They are often written in Java or C++, and the
solutions are platform independent. Open-source examples are WEKA (Java-based), MLC++
(C++-based), JAVA Data Mining Package, and LibSVM (C++- and Java-based) for support vector
machines.
6. Specialties (SPECs) are similar to DMS tools, but implement only one special family of methods
such as artificial neural networks. Examples are CART for decision trees, Bayesia Lab for
Bayesian networks, C5.0, WizRule, and Rule Discovery System for rule-based systems, MagnumOpus
for association analysis, and JavaNNS and Neuroshell for artificial neural networks.
7. Research prototypes (RES) are usually the first (and not always stable) implementations of new and
innovative algorithms. They contain only one or a few algorithms with restricted graphical support and
without automation support. RES tools are mostly open source. Examples are GIFT for content-based
image retrieval, Himalaya for mining maximal frequent item sets, sequential pattern mining,
and scalable linear regression trees, Rseslibs for rough sets, and Pegasus for graph mining.
8. Solutions (SOLs) describe a group of tools that are customized to narrow application fields such
as text mining, image processing, etc.
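
The library (LIB) idea from the list above, data mining functions without a GUI that a host program calls through an API, can be sketched as follows. This is only an illustration under stated assumptions: the nearest-centroid classifier is a hypothetical, self-written stand-in for a library and is not the API of WEKA, LibSVM, or any other tool named here, and the client data is made up.

```python
# Hypothetical mini "library": a nearest-centroid classifier exposed as plain
# functions. It stands in for a real LIB tool; it is not the API of any tool
# named in the text.

def fit_centroids(examples, labels):
    """Return {label: centroid}, where a centroid is the per-feature mean."""
    sums, counts = {}, {}
    for x, y in zip(examples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid has the smallest squared distance to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Host application: calls the library functions and owns all input and output.
def flag_clients(training_rows, training_labels, new_rows):
    model = fit_centroids(training_rows, training_labels)
    return [predict(model, row) for row in new_rows]

# Made-up feature table: (age, number of contracts) per client.
rows = [[25, 1], [30, 1], [55, 4], [60, 5]]
labels = ["basic", "basic", "premium", "premium"]
result = flag_clients(rows, labels, [[28, 2], [58, 3]])
print(result)
```

The host function flag_clients never looks inside the model; it only passes feature rows in and receives labels back, which is the kind of separation an API between a software tool and data mining functions is meant to provide.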
References:
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag
1996, 17:37–54.
[2] Engelbrecht AP. Computational Intelligence - An Introduction. Chichester: John Wiley; 2007.
[3] Mikut R, Reischl M. Data mining tools.
EXERCISE:
1) Write down the functionality and advantages of the top 5 analytical tools.
EVALUATION:
Observation & Implementation    Timely completion    Viva    Total
4                               2                    4       10
Signature: ____________
Date: ________________