
Shree Swaminarayan Institute of Technology CE DEPT.

(VI SEMESTER)

EXPERIMENT NO: 7

TITLE: Survey of Different Data Mining Tools.


OBJECTIVE: On completion of this exercise, the student will be able to know about…

This practical attempts to support the decision-making process by discussing
the historical development and presenting a range of existing state-of-the-art
data mining and related tools. Furthermore, it presents a tool categorization based on
different user groups, data structures, data mining tasks and methods,
visualization and interaction styles, import and export options for data and
models, platforms, and license policies.

THEORY:

This introduction to data mining tools is organized in three sections.

1. The first section, Historical Development and State-of-the-Art, traces the historical
development of data mining software up to the present.
2. The second section, Criteria for Comparing Data Mining Software, explains the criteria used to
compare data mining software.
3. The last section, Categorization of Data Mining Software into Different Types, proposes a
categorization of data mining software and introduces typical software tools for the different
types.

Historical Development and State-Of-The-Art

Following the original definition given in Ref [1],

“Data mining is a step in the knowledge discovery from databases (KDD) process that consists of
applying data analysis and discovery algorithms to produce a particular enumeration of patterns
(or models) across the data.”

In that same article, KDD is defined as “the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data.”

Today, a large number of standard data mining methods are available. From a historical perspective, these
methods have different roots. One early group of methods was adopted from classical statistics: the focus
was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include
methods from Bayesian decision theory, regression theory, and principal component analysis. Another
group of methods stemmed from artificial intelligence, such as decision trees, rule-based systems, and
others. The term ‘machine learning’ includes methods such as support vector machines and artificial
neural networks. There are several different and sometimes overlapping categorizations; for example,
fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as
computational intelligence [2].

The typical life cycle of new data mining methods begins with theoretical papers based on
in-house software prototypes, followed by public or on-demand software distribution of successful
algorithms as research prototypes. Then, either special commercial or open-source packages containing a
family of similar algorithms are developed, or the algorithms are integrated into existing open-source or
commercial packages. Many companies have tried to promote their own stand-alone packages, but only
a few have reached notable market shares. The life cycle of some data mining tools is remarkably short.

Student Name (Enrollment No) Page no



Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger
ones, leading to a renaming and integration of product lines.
The largest commercial success stories resulted from the step-wise integration of data mining methods
into established commercial statistical tools. Companies such as SPSS, founded in 1975 with precursors
from 1968, or SAS, founded in 1976, have been offering statistical tools for mainframe computers since
the 1970s.

Many companies offering business intelligence products have integrated data mining solutions into their
database products; one example is Oracle Data Mining (established in 2002). Many of these products are
also a product of the acquisition and integration of specialized data mining companies.

In 2008, the worldwide market for business intelligence (i.e., software and maintenance fees) was 7.8
billion USD, including 1.5 billion USD in so-called ‘advanced analytics’, containing data mining and
statistics. This sector grew 12.1% between 2007 and 2008, with large players including companies
such as SAS (33.2%, tool: SAS Enterprise Miner), SPSS (14.3%, since 2009, an IBM company; tool:
IBM SPSS Modeler), Microsoft (1.7%, tool: SQL Server Analysis Services), Teradata (1.5%, tool:
Teradata Database, former name TeraMiner), and TIBCO (1.4%, tool: TIBCO Spotfire) [3].

Open-source libraries have also become very popular since the 1990s. The most prominent example is
the Waikato Environment for Knowledge Analysis (WEKA). WEKA started in 1994 as a C++
library, with its first public release in 1996. In 1999, it was completely rebuilt as a Java package; since
that time, it has been regularly updated.

Criteria for Comparing Data Mining Software


In the following, different criteria for comparison of data mining software are introduced. These criteria
are based on user groups, data structures, data mining tasks and methods, import and export options,
platforms, and license models.

A. User Groups

There are many different data mining tools available, which fit the needs of quite different user
groups:
1. Business applications: This group uses data mining as a tool for solving commercially relevant
business applications such as customer relationship management, fraud detection, and so on.
2. Applied research: A user group that applies data mining to research problems, for
example, in technology and life sciences. Here, users are mainly interested in tools with
well-proven methods, a graphical user interface (GUI), and interfaces to domain-related
data formats or databases.
3. Algorithm development: This group develops new data mining algorithms and requires tools to both
integrate its own methods and compare these with existing methods. The necessary tools should
contain many concurrent algorithms.
4. Education: For education at universities, data mining tools should be very intuitive, with a
comfortable interactive user interface, and inexpensive. In addition, they should allow the
integration of in-house methods during programming seminars.

B. Data Structures

An important criterion is the dimensionality of the underlying raw data in the processed
dataset. The first data mining applications were focused on handling datasets represented as
two-dimensional feature tables. In this classical format, a dataset consists of a set of N examples (e.g.,
clients of an insurance company) with s features containing real values or
usually integer-coded classes or symbols (e.g., income, age, number of contracts, and the like). This
format is supported by nearly all existing tools. In some cases, the dataset can be sparse, with only a
few nonzero features, such as a list of s shopping items for N different customers. The
computational and memory effort can be reduced if a tool exploits this sparse structure.
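The saving from a sparse structure can be illustrated with a minimal Python sketch. The shopping data, the dictionary-based storage, and the helper names to_sparse and column_sum are assumptions made up for this illustration; real tools typically use more specialized sparse formats.

```python
# Hypothetical shopping data: N = 4 customers (rows) and s = 6 items (columns).
# Each entry is a purchased quantity; most entries are zero.
dense_table = [
    [0, 2, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [3, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]


def to_sparse(table):
    """Keep only the nonzero entries as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value != 0
    }


def column_sum(sparse, column):
    """Sum one feature by visiting only the stored nonzero entries."""
    return sum(v for (_, j), v in sparse.items() if j == column)


sparse_table = to_sparse(dense_table)
n_dense = sum(len(row) for row in dense_table)  # values stored in the full table
n_sparse = len(sparse_table)                    # values stored in the sparse form

print(n_dense, n_sparse)
print(column_sum(sparse_table, 1))
```

In this toy case the sparse form stores 4 values instead of 24, and an operation such as a column sum touches only the stored entries.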

There are different types of data formats, such as feature data (e.g., age and income), texts, time-series
data, sequences, images, graphs, 3D images, videos, and 3D videos.

C. Task and Methods

The most important tasks in data mining are

1. supervised learning, with a known output variable in the dataset, including
- classification: class prediction, with the variable typically coded as an integer output;
- fuzzy classification: with gradual memberships with values between 0 and 1
applied to the different classes;
- regression: prediction of a real-valued output variable, including special cases of
predicting future values in a time series from recent or past values;
2. unsupervised learning, without a known output variable in the dataset, including
- clustering: finds and describes groups of similar examples in the data using crisp or
fuzzy clustering algorithms;
- association learning: finds typical groups of items that occur frequently together in
examples;
3. semi-supervised learning, whereby the output variable is known only for some examples.
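Three of the tasks above can be sketched in a few lines of Python. This is only an illustration on invented toy data: a 1-nearest-neighbour rule stands in for classification, a least-squares line for regression, and simple pair counting for association learning. The function names classify, fit_line, and frequent_pairs are hypothetical and do not come from any of the tools surveyed here.

```python
from collections import Counter
from itertools import combinations

# Supervised learning, classification: the output (class label) is known.
train_x = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
train_y = [0, 0, 1, 1]  # integer-coded classes


def classify(x):
    """Return the class of the closest training example (1-nearest neighbour)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[nearest]


# Supervised learning, regression: predict a real-valued output y = a*x + b.
def fit_line(xs, ys):
    """Fit slope a and intercept b by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Unsupervised learning, association learning: no output variable is given.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]


def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}


print(classify((3.5, 3.9)))
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
print(frequent_pairs(baskets, 3))
```

The classifier and the regression need the known outputs (train_y and ys) to work at all, whereas the pair counting uses only the examples themselves, which is the distinction between supervised and unsupervised learning drawn in the list.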

D. Platforms
Data mining tools can be subdivided into stand-alone and client/server solutions. Client/server
solutions dominate, especially in products designed for business users. They are available for
different platforms, including Windows, Mac OS, Linux, or special mainframe supercomputers.
There is a growing number of Java-based systems that are platform independent for users in
research and applied research.

E. Licenses
There exists a wide variety of data mining tools with commercial and open-source licenses. This is
particularly true in the business application user group, where commercial software is very
attractive due to high software stability, good coupling with other commercial tools for data
warehouses, included software maintenance, and the possibility of user training for sophisticated
topics. For all other user groups, there is a strong trend toward open-source software, but different
types of licenses exist for this.

Categorization of Data Mining Software into Different Types

Following the criteria from the previous section, different types of similar data mining tools can be found.

In addition, related tools and their group membership are summarized
in separate tables for commercial tools (Tables 1 and 2) and for free and open-source tools (Table 3). In
these tables, very popular tools are marked in bold.

The following types are proposed [3]:



1. Data mining suites (DMS) focus largely on data mining and include numerous methods. They
support feature tables and time series, while additional tools for text mining are sometimes
available. Typical examples include IBM SPSS Modeler, SAS Enterprise Miner, Alice d’Isoft,
DataEngine, DataDetective, GhostMiner, Knowledge Studio, KXEN, NAG Data Mining
Components, Partek Discovery Suite, STATISTICA, and TIBCO Spotfire.

2. Business intelligence packages (BIs) have no special focus on data mining, but include basic data
mining functionality, especially for statistical methods in business applications. Most BI packages
are commercial (IBM Cognos 8 BI, Oracle Data Mining, SAP Netweaver Business Warehouse,
Teradata Database, DB2 Data Warehouse from IBM, and PolyVista), but a few open-source
solutions exist (Pentaho).

3. Mathematical packages (MATs) have no special focus on data mining, but provide a large and
extendable set of algorithms and visualization routines. MATs are attractive to users in algorithm
development and applied research because data mining algorithms can be rapidly implemented,
mostly in the form of extensions (EXT) and research prototypes (RES). MAT packages exist as
commercial (MATLAB and R-PLUS) or open-source tools (R, Kepler).

4. Integration packages (INTs) are extendable bundles of many different open-source algorithms,
either as stand-alone software (mostly based on Java, such as KNIME, the GUI version of WEKA,
KEEL, and TANAGRA) or as a kind of larger extension package for tools from the MAT type
(such as Gait-CAD, PRTools for MATLAB, and RWEKA for R).

5. Extensions (EXT) are smaller add-ons for other tools such as Excel, MATLAB, R, and so forth,
with limited but quite useful functionality.

6. Data mining libraries (LIBs) implement data mining methods as a bundle of functions. These
functions can be embedded in other software tools using an Application Programming Interface
(API) for the interaction between the software tool and the data mining functions. A graphical
user interface is missing, but some functions can support the integration of specific
visualization tools. They are often written in Java or C++, and the solutions are platform
independent. Open-source examples are WEKA (Java based), MLC++ (C++ based), JAVA Data Mining
Package, and LibSVM (C++ and Java based) for support vector machines.

7. Specialties (SPECs) are similar to DMS tools, but implement only one special family of methods
such as artificial neural networks. Examples are CART for decision trees, Bayesia Lab for
Bayesian networks, C5.0, WizRule, and Rule Discovery System for rule-based systems, MagnumOpus
for association analysis, and JavaNNS and Neuroshell for artificial neural networks.

8. Research prototypes (RES) are usually the first (and not always stable) implementations of new
and innovative algorithms. They contain only one or a few algorithms with restricted graphical
support and without automation support. RES tools are mostly open source. Examples are GIFT for
content-based image retrieval, Himalaya for mining maximal frequent item sets, sequential pattern
mining, and scalable linear regression trees, Rseslibs for rough sets, and Pegasus for graph mining.

9. Solutions (SOLs) describe a group of tools that are customized to narrow application fields such
as text mining, image processing, etc.


Table 1 List of Commercial Tools (Part 1) [3]

TOOL                            TYPE  LINK

ADAPA (Zementis)                DMS   www.zementis.com
Alice (d’Isoft)                 DMS   www.alice-soft.com
Bayesia Lab                     SPEC  www.bayesia.com
C5.0                            SPEC  www.rulequest.com
CART                            SPEC  www.salford-systems.com
Data Applied                    DMS   data-applied.com
DataDetective                   DMS   www.sentient.nl/?dden
DataEngine                      DMS   www.dataengine.de
Datascope                       DMS   www.cygron.hu
DB2 Data Warehouse              BI    www.ibm.com/software/data/infosphere/warehouse
DeltaMaster                     BI    www.bissantz.com/deltamaster
Forecaster XL                   EXT   www.alyuda.com
GhostMiner                      DMS   www.fqs.pl/businessintelligence/products/ghostminer
IBM Cognos 8 BI                 BI    www.ibm.com/software/data/cognos/data-mining-tools.html
IBM SPSS Modeler                DMS   www.spss.com/software/modeling/modeler
IBM SPSS Statistics             MAT   www.spss.com/software/statistics
iModel                          DMS   www.biocompsystems.com/products/imodel
InfoSphere Warehouse            BI    www.ibm.com/software/data/infosphere/warehouse
JMP                             DMS   www.jmpdiscovery.com
KnowledgeMiner                  SPEC  www.knowledgeminer.net
KnowledgeStudio                 DMS   www.angoss.com
KXEN                            DMS   www.kxen.com
Magnum Opus                     SPEC  www.giwebb.com
MATLAB                          MAT   www.mathworks.com
MATLAB Neural Network Toolbox   EXT   www.mathworks.com
Model Builder                   DMS   www.fico.com
ModelMAX                        SOL   www.asacorp.com/products/mmxover.jsp

Table 2 List of Commercial Tools (Part 2) [3]

TOOL                                    TYPE  LINK

Molegro Data Modeler                    SOL   www.molegro.com
NAG Data Mining Components              LIB   www.nag.co.uk/numeric/DR/DRdescription.asp
NeuralWorks Predict                     SPEC  www.neuralware.com/products.jsp
Neurofusion                             LIB   www.alyuda.com
Neuroshell                              SPEC  www.neuroshell.com
Oracle Data Mining (ODM)                DMS   www.oracle.com/technology/products/bi/odm/index.html
Partek Discovery Suite                  DMS   www.partek.com/software
Partek Genomics Suite                   SOL   www.partek.com/software
PolyAnalyst                             DMS   www.megaputer.com/polyanalyst.php
PolyVista                               BI    www.polyvista.com
Random Forests                          SPEC  www.salford-systems.com
RapAnalyst                              SPEC  www.raptorinternational.com/rapanalyst.html
R-PLUS                                  MAT   www.experience-rplus.com
SAP Netweaver Business Warehouse (BW)   BI    www.sap.com/platform/netweaver/components/businesswarehouse
SAS Enterprise Miner                    DMS   www.sas.com/products/miner
See5                                    SPEC  www.rulequest.com
SPAD Data Mining                        DMS   eng.spadsoft.com
SQL Server Analysis Services            DMS   www.microsoft.com/sql
STATISTICA                              DMS   www.statsoft.com/products/data-mining-solutions/G259
SuperQuery                              DMS   www.azmy.com
Teradata Database                       BI    www.teradata.com
Think Enterprise Data Miner (EDM)       DMS   www.thinkanalytics.com
TIBCO Spotfire                          DMS   spotfire.tibco.com
Unica PredictiveInsight                 DMS   www.unica.com
WizRule and WizWhy                      SPEC  www.wizsoft.com
XAffinity                               SPEC  www.exclusiveore.com

Table 3 List of Free and Open-Source Tools [3]

TOOL                        TYPE      LINK

ADaM                        LIB       datamining.itsc.uah.edu/adam
CellProfilerAnalyst         SOL       www.cellprofiler.org/index.htm
D2K                         DMS       alg.ncsa.uiuc.edu
Gait-CAD                    INT       sourceforge.net/projects/gait-cad
GATE                        SOL       gate.ac.uk/download
GIFT                        RES       www.gnu.org/software/gift
Gnome Data Mine Tools       DMS       www.togaware.com/datamining/gdatamine
Himalaya                    RES       himalaya-tools.sourceforge.net
ImageJ                      SOL       rsbweb.nih.gov/ij
ITK                         SOL       www.itk.org
JAVA Data Mining Package    LIB       sourceforge.net/projects/jdmp
JavaNNS                     SPEC      www.ra.cs.uni-tuebingen.de/software/JavaNNS/welcome_e.html
KEEL                        INT       www.keel.es
Kepler                      MAT       kepler-project.org
KNIME                       INT       www.knime.org
LibSVM                      LIB       www.csie.ntu.edu.tw/~cjlin/libsvm
MEGA                        SOL       www.megasoftware.net/m_distance.html
MLC++                       LIB       www.sgi.com/tech/mlc
Orange                      LIB       www.ailab.si/orange
Pegasus                     RES       www.cs.cmu.edu/~pegasus
Pentaho                     BI        sourceforge.net/projects/pentaho
Proximity                   SPEC      kdl.cs.umass.edu/proximity/index.html
PRTools                     EXT       www.prtools.org
R                           MAT       www.r-project.org
RapidMiner                  DMS       www.rapidminer.com
Rattle                      INT       rattle.togaware.com
ROOT                        LIB       root.cern.ch/root
ROSETTA                     SPEC      www.lcb.uu.se/tools/rosetta/index.php
Rseslibs                    RES       logic.mimuw.edu.pl/~rses
Rule Discovery System∗      SPEC      www.compumine.com
RWEKA                       INT       cran.r-project.org/web/packages/RWeka/index.html
TANAGRA                     INT       eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Waffles                     LIB       waffles.sourceforge.net
WEKA                        DMS, LIB  sourceforge.net/projects/weka
XELOPES Library∗            LIB       www.prudsys.de/en/technology/xelopes
XLMiner∗                    EXT       www.resample.com/xlminer

References:
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag
1996, 17:37–54.
[2] Engelbrecht AP. Computational Intelligence - An Introduction. Chichester: John Wiley; 2007.
[3] Mikut R, Reischl M. Data mining tools.

EXERCISE:

1) Write down the functionality and advantages of the top 5 analytical tools.

EVALUATION:

Observation & Implementation   Timely completion   Viva   Total
4                              2                   4      10

Signature: ____________

Date: ________________
