A Study On Data Mining and Its Tools
Submitted by:
Kakkera Rajasa
Pgdim-20 Section-B
Roll No: 150
Data Mining
Data mining is the process of intelligently extracting hidden trends and information from corporate databases.
The information is usually buried too deep to extract using a conventional analysis tool such as OLAP. The
power of finding new information helps corporate decision makers to learn more about their customers by
performing tasks such as market segmentation, customer profiling, trend forecasting, cross-selling and fraud
detection. OLAP (On-Line Analytical Processing) and data warehousing are two tools related to data mining.
Traditional query and report tools describe what is in a database. OLAP goes further by answering why certain
things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against
the data. Data mining is different from OLAP because rather than verify a hypothesis, it is used to generate a
hypothesis.
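The difference between verifying and generating a hypothesis can be sketched in a few lines. The example below is only an illustration: the customer records, the field names and the "spread" measure are all hypothetical and are not taken from any OLAP or data mining product.

```python
from collections import defaultdict

# Hypothetical customer records: (age_band, region, churned)
records = [
    ("under_30", "north", True),
    ("under_30", "south", True),
    ("under_30", "north", False),
    ("30_plus", "north", False),
    ("30_plus", "south", False),
    ("30_plus", "north", True),
    ("30_plus", "north", False),
    ("under_30", "south", True),
]
FIELDS = {"age_band": 0, "region": 1}

def churn_rate_by(field):
    """Group the records by one field and return the churn rate per group."""
    idx = FIELDS[field]
    totals, churned = defaultdict(int), defaultdict(int)
    for row in records:
        totals[row[idx]] += 1
        churned[row[idx]] += row[2]  # True counts as 1
    return {group: churned[group] / totals[group] for group in totals}

# OLAP-style verification: the analyst already suspects that age matters
# and runs one targeted query to check it.
by_age = churn_rate_by("age_band")

def spread(field):
    """Gap between the highest and lowest churn rate within a field."""
    rates = churn_rate_by(field).values()
    return max(rates) - min(rates)

# Mining-style generation: scan every field and surface the one whose
# groups differ most, without starting from a hypothesis.
best_field = max(FIELDS, key=spread)
print(by_age, best_field)
```

In the first step the user supplies the question; in the second the program proposes which field deserves a question at all.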
Benefits:
Data mining applications help users discover correlations and connections within large data sets that might
otherwise go unnoticed.
A classic example of how data mining software can be used is with customer purchasing patterns at grocery
stores. If shoppers tend to buy items such as toilet paper, diapers and alcohol before the weekend, retailers
can place these items closer together to maximize revenue. Store owners can further capitalize on this
opportunity by running specials on these items to encourage additional purchases.
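The grocery example above is a case of market basket analysis. As a minimal sketch, assuming a handful of made-up baskets, the following computes the usual support, confidence and lift measures for every pair of items. The basket contents and the function name `pair_stats` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"diapers", "alcohol", "toilet paper"},
    {"diapers", "alcohol"},
    {"toilet paper", "bread"},
    {"diapers", "alcohol", "bread"},
    {"bread", "toilet paper"},
]

def pair_stats(baskets):
    """Return {(a, b): (support, confidence, lift)} for each co-purchased pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                      # share of baskets with both items
        confidence = count / item_counts[a]      # share of a-baskets that also hold b
        lift = count * n / (item_counts[a] * item_counts[b])  # > 1: bought together more than chance
        stats[(a, b)] = (support, confidence, lift)
    return stats

stats = pair_stats(baskets)
best_pair = max(stats, key=lambda pair: stats[pair][2])
print(best_pair, stats[best_pair])
```

A pair with lift above 1 is bought together more often than independent purchases would predict, which is the signal a retailer would use when deciding which items to shelve side by side.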
Top 5 Data Mining Tools
Angoss
Angoss predictive analytics software and solutions help businesses discover valuable insight and intelligence
from their data, uncovering opportunities to increase sales and profitability, and reduce risk. Angoss aims
to meet the evolving needs of all analytics users, regardless of their analytical environment or level of
expertise, through software products and solutions that offer ease of use, flexibility and business value.
The software suite capabilities include comprehensive modelling, best-in-class Decision Trees, strategy design
and automatic code generation. Angoss products support the analysis of massive amounts of data—both
structured and unstructured—with 64-bit addressing and in-database analytics for extremely large datasets
and numbers of predictive variables. Support for open standards includes Hadoop integration and import and
export of R tables, among others.
Customers
More than 300 leading companies worldwide rely on Angoss, including JPMorgan Chase & Co., Microsoft, Bank of
America, Groupon, The Great-West Life, T-Mobile, Sainsbury’s Finance, Ivy Funds, Blue Cross Shield, Citi,
Yorkshire Bank, Wells Fargo, Vanguard, Fidelity Investments, Aon, UnitedHealth Group, eBay, PayPal and
General Electric.
Products
• Customer Segmentation
• Customer Acquisition
• Cross-Sell / Upsell
• Next-Best Channel
• Churn / Loyalty
• Sales Productivity
2. Risk: Software and solutions allow organizations to quickly and effectively design and deploy strategies for:
• Application/Origination
• Behaviour/Account
• Fraud
• Claims
The major products are:
intelligence and data mining and predictive analytics — to complement your business intelligence workbench.
• KnowledgeREADER: KnowledgeREADER combines text analysis and visual text discovery with sentiment
analysis and predictive analytics to deliver integrated customer intelligence.
• KnowledgeSTUDIO: KnowledgeSTUDIO builds upon the market-leading data analysis and predictive
analytics capabilities included in KnowledgeSEEKER with many advanced modelling and predictive analytics
features for high-performance business users and quantitative analysts.
• StrategyBUILDER: StrategyBUILDER is the first tool of its kind for building predictive strategies. It is a
standard module in KnowledgeSEEKER, KnowledgeREADER and KnowledgeSTUDIO that allows
business analysts and data analysts to work together to build and deploy predictive strategies.
• Big Data Analytics: The use of Big Data is becoming a leading means of competition and growth for
companies. Angoss software products support the growing types and volume of data that can be analysed.
• Deployment Options: Angoss software deployment options are flexible, making predictive analytics
accessible and easy to use for all business analytics users and use cases.
Teradata
Teradata Corporation is an American computer company that sells analytic data platforms, applications and
related services. Its products are meant to consolidate data from different sources and make the data
available for analysis. Formerly a division of NCR Corporation, Teradata was incorporated in 1979 and
separated from NCR in October 2007.
The Teradata product is referred to as a "data warehouse system" and stores and manages data. The data
warehouses use a "shared nothing" architecture, which means that each server node has its own memory and
processing power. Adding more servers and nodes increases the amount of data that can be stored. The
database software sits on top of the servers and spreads the workload among them.
Teradata sells applications and software to process different types of data. In 2010, Teradata added text
analytics to track unstructured data, such as word processor documents, and semi-structured data, such as
spreadsheets.
Teradata's product can be used for business analysis. Data warehouses can track company data, such as sales,
customer preferences, product placement, etc.
Products
Teradata is a massively parallel processing system running a shared nothing architecture. Its
technology consists of hardware, software, database, and consulting. The system moves data to a data
warehouse where it can be recalled and analyzed.
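A shared nothing, massively parallel design can be illustrated with a toy sketch: rows are spread across nodes by hashing a key, each node works only on its own slice, and only small partial results are combined. This is a simplified illustration and not Teradata's implementation; the `Node` class, the use of a CRC32 hash and the sample data are all assumptions.

```python
import zlib

class Node:
    """One server node with its own private storage (nothing is shared)."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def local_total(self):
        # Each node aggregates only the rows it holds.
        return sum(amount for _, amount in self.rows)

def node_for(key, nodes):
    # A deterministic hash sends a given key to the same node every time.
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

def load(rows, nodes):
    for key, amount in rows:
        node_for(key, nodes).rows.append((key, amount))

def total_sales(nodes):
    # Only the per-node partial sums travel between nodes, not the rows.
    return sum(node.local_total() for node in nodes)

nodes = [Node(f"node{i}") for i in range(3)]
sales = [(f"customer{i}", 10 * i) for i in range(1, 9)]  # hypothetical rows
load(sales, nodes)
grand_total = total_sales(nodes)
print({node.name: len(node.rows) for node in nodes}, grand_total)
```

Adding a node to the list gives the system more storage and more processing power at once, which is the scaling property the text describes.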
The systems can be used as backup for one another during downtime, and in normal operation they balance the
workload among themselves. Marketing research company Gartner Group placed Teradata in the "leaders
quadrant" in its 2009, 2010, and 2012 reports, "Magic Quadrant for Data Warehouse Database Management
Systems".
Teradata is the most popular data warehouse DBMS in the DB-Engines database ranking.
Teradata Active Enterprise Data Warehouse is the platform that runs the Teradata Database, with added data
management tools and data mining software.
The data warehouse differentiates between “hot and cold” data – meaning that the warehouse puts data that
is not often used in a slower storage section. As of October 2010, Teradata uses Xeon 5600 processors for the
server nodes.
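The hot and cold distinction can be sketched as a simple placement rule based on how often data is accessed. The threshold, the table names and the function `assign_tiers` below are hypothetical; the source does not say how Teradata actually measures data temperature.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical cutoff: accesses per period

def assign_tiers(access_log, threshold=HOT_THRESHOLD):
    """Map each table to 'fast' or 'slow' storage by its access count."""
    counts = Counter(access_log)
    return {
        table: "fast" if hits >= threshold else "slow"
        for table, hits in counts.items()
    }

# Hypothetical access log: one entry per query that touched a table.
access_log = [
    "sales", "sales", "archive_2001", "sales",
    "customers", "customers", "customers", "sales",
]
tiers = assign_tiers(access_log)
print(tiers)
```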
Cognos
Cognos offers two lines of data mining products: Scenario and 4Thought. Both belong to the desktop (low-end)
market segment.
Scenario employs a decision tree algorithm to perform classification. It is particularly good at identifying
and ranking high-impact factors. Inputs of both categorical and continuous values can be used in Scenario.
Scenario can analyze an estimated 40,000 records in roughly 3 minutes. It is designed to run on desktops and
requires a 486 PC or higher, a minimum of 8 MB of RAM, and 20 MB of free disk space. It runs only on
Microsoft Windows 95 or Windows NT.
The user interface for Scenario has a distinctive Windows feel. 2-D graphs, tables and statistical
information are used to illustrate the analysis of the data and the results. The user can choose which inputs
(factors) to emphasize by clicking on that input in the table. Tutorials and wizards are also available
within the package. The presentation of the results is intuitive, which makes Scenario user-friendly.
Scenario is part of the COGNOSuite, which also includes OLAP and data warehousing products, although it is
not necessary to purchase the whole COGNOSuite in order to run Scenario. Full integration is available with
the rest of the COGNOSuite, so that a click of a button will start the data mining operations. Scenario
supports data from text files, Excel, Lotus 1-2-3 worksheets, and dBase tables. Priced at $695, Scenario’s
ease of use and ease of integration make it a good buy even though the method it uses is not as powerful as
those of some other products. Scenario received the PlugIn Datamation 1998 Product of the Year Award, as
well as PC Week’s Analyst’s Choice Award.
The other data mining product offered by Cognos is 4Thought. It has more capability than Scenario. 4Thought
was originally developed at Right Information Systems, which Cognos acquired in April 1997. 4Thought uses a
combination of neural nets and statistical tools and is therefore very useful for "number crunching". It is
especially good for the financial industry, where large quantities of data are handled. Because of the
nature of neural nets, 4Thought can be used to perform prediction. Statistical tools can support
optimization analysis.
4Thought offers a familiar spreadsheet interface in which to collect and prepare data for analysis. Users can
type values directly onto the spreadsheet, cut and paste, or import data directly from other sources, such
as Excel, Lotus, popular relational and non-relational databases, and text files. Both categorical and
continuous data can be input into 4Thought. A variety of line, bar, and area charts let users graphically
view data and interpret the results. Scattergrams and overlaid charts show the strength of relationships
between factors, or even the strength of the model’s predictability. 4Thought comes with a higher price tag:
$20,000. Even though it occupies the low-end segment of the market, it provides more powerful features than
most low-end products.
Cognos offers standard or customized training through regularly scheduled public classes, either in one of
Cognos’ classrooms or on-site. Since Cognos has 32 offices in 12 countries, the training programs are
relatively accessible. Support is available online, by telephone, and in person from Cognos’ six support
centers around the world.
Founded in 1969, Cognos is an international corporation with corporate headquarters in Ottawa, Canada and
U.S. sales headquarters in Burlington, Massachusetts. The company employs more than 1,400 people worldwide.
Revenue for the most recent fiscal quarter (1998) was US$70.7 million, a 21% increase from the same period
of the previous year (1997). Some of the more well-known customers include ADP, Mead Johnson, Consolidated
Edison, Vanguard, and Deutsche Bank.
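The idea of a decision tree ranking high-impact factors, as described for Scenario, can be illustrated with information gain, one common criterion for choosing tree splits. The source does not say which criterion Scenario uses, so the choice of information gain, the sample rows and the factor names here are all assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, factor, target):
    """Reduction in target entropy obtained by splitting the rows on one factor."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical customer records with two candidate factors.
rows = [
    {"region": "east", "plan": "basic", "churned": "yes"},
    {"region": "east", "plan": "premium", "churned": "no"},
    {"region": "west", "plan": "basic", "churned": "yes"},
    {"region": "west", "plan": "premium", "churned": "no"},
    {"region": "east", "plan": "basic", "churned": "yes"},
    {"region": "west", "plan": "premium", "churned": "no"},
]

# Rank the factors from highest to lowest impact on the target.
ranking = sorted(
    ["region", "plan"],
    key=lambda factor: information_gain(rows, factor, "churned"),
    reverse=True,
)
print(ranking)
```

The factor at the top of the ranking is the one a tree would split on first, which is why tree-based tools are convenient for spotting the inputs that matter most.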
DataMind
DataMind's product, DataCruncher, occupies the mid-range segment of the data mining market. It offers Agent
Network Technology, which uses a belief network. A belief network is essentially a hybrid of neural nets,
decision trees, and market basket analysis. The hybrid software compensates for the weaknesses of each
technique with the strengths of the others. Neural nets are hard to understand but very powerful. Decision
trees are very easy to understand but lack some of the capabilities of neural nets. Combining them makes the
hybrid powerful and easy to understand, even though some of the capability of neural nets is compromised,
such as the capability to perform estimation.
DataCruncher can perform classification, association, and clustering. It can typically support a system with
50 million users and hundreds of fields per user. DataCruncher supports client/server computing. The server
component runs on Unix or Windows NT systems and performs operations including mining the data, reading data
sources and building models. The client component, which runs on Windows 95 or Windows NT, is responsible for
initiating server-side data mining operations, viewing results, and building reports.
The server requires the use of Hewlett-Packard HP-UX, IBM AIX, Silicon Graphics IRIX, or Sun Microsystems
Solaris. The system requirements are 64 MB of RAM, 15 MB of free disk space, and additional working
storage space dependent on the volume and complexity of the data. The client can run on desktop PCs with
16 MB of RAM, 15 MB of free disk space, and additional working storage space dependent on the volume and
complexity of the data. A special version of DataCruncher, called PowerPak, can be used on a standalone PC
and performs all data mining operations on a single platform. In addition to the usual reporting formats
(graphs, tables, etc.), DataCruncher is capable of generating HTML-standard reports, which allow information
sharing on corporate intranets through web viewing.
DataMind specializes in data mining products and therefore does not have any data warehousing or OLAP
software. However, DataCruncher has direct connections to Oracle and Informix, delimited ASCII files and
ODBC-compatible databases. A five-user system (1 server, 5 clients) costs $80,000. The standalone product,
PowerPak, costs $25,000.
DataMind was founded in 1994 and has headquarters in San Mateo, California. Regional sales centers are
located in Atlanta, Boston, Chicago, San Mateo, and Paris, France. It employs approximately 42 people and is a
privately-held company. Some well-known DataMind customers are ADP, 360 Communications, Engage
Technologies, and Chase Manhattan Bank. DataCruncher was a winner in the fifth annual Crossroads A-List Awards.
Training programs are presented either on-site or at DataMind locations in California. An introduction to data
mining class and a business data mining class are offered. The introduction course lasts one half day and is
designed for anyone interested in the basic concepts of data mining and how to use DataMind products. The
business data mining class is a two-day course with hands-on work tailored to individuals who use
DataCruncher client-side tools. On-site consulting services are available from DataMind.
SPSS
SPSS specializes in developing statistical tools for business intelligence. The two data mining products it offers
are AnswerTree and Neural Connection. Each is built around a single technique (decision trees and neural nets,
respectively), and both are considered to occupy the low end of the market.
AnswerTree uses decision trees to perform classification. It uses four different decision tree algorithms: CHAID,
CART, Exhaustive CHAID and QUEST. QUEST is a statistical algorithm that selects variables without bias and
builds a binary tree. Unlike other algorithms, QUEST performs variable selection and split point selection in
separate stages.
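The two-stage idea behind QUEST can be sketched as follows. This is a heavily simplified illustration and not the QUEST algorithm itself: it assumes numeric predictors and exactly two classes, picks the variable with the largest one-way ANOVA F statistic, and then uses the midpoint between the two class means as the split point. The data and function names are hypothetical.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F statistic of a numeric variable across the classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand = mean(values)
    k, n = len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within

def two_stage_split(data, labels):
    """Return (variable, split_point) chosen in two separate stages."""
    # Stage 1: choose the variable that separates the classes most strongly.
    chosen = max(data, key=lambda name: f_statistic(data[name], labels))
    # Stage 2: choose a split point on that variable only
    # (simplification: midpoint between the two class means).
    by_class = {}
    for value, label in zip(data[chosen], labels):
        by_class.setdefault(label, []).append(value)
    low, high = sorted(mean(g) for g in by_class.values())
    return chosen, (low + high) / 2

# Hypothetical training data with two numeric predictors.
data = {
    "income": [20, 25, 30, 60, 65, 70],
    "age": [30, 45, 38, 33, 47, 40],
}
labels = ["no", "no", "no", "yes", "yes", "yes"]
variable, split_point = two_stage_split(data, labels)
print(variable, split_point)
```

Because the variable is chosen by a significance-style test before any split point is searched, variables that merely offer many possible cut points gain no advantage, which is the sense in which the selection is described as unbiased.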
AnswerTree runs on Windows 95 or Windows NT. It requires a 486DX processor or higher (includes math
co-processor), 40 MB of hard drive space, 12 MB of RAM (although 16 MB is strongly recommended), and a VGA
monitor (SVGA recommended).
AnswerTree is compatible with other SPSS statistical packages. The user interface for AnswerTree uses 2-D
graphs, diagrams and tables to present the results.
Neural Connection employs neural nets to perform classification, prediction, and clustering. Approximately 10
records per input variable are needed to train the neural nets. Neural Connection can handle up to 32,000
records and 750 inputs. Inputs are equivalent to variables, with the exception that every level of a
categorical variable counts as one input.
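The sizing rules above reduce to simple arithmetic. The sketch below applies them to a hypothetical data set; the figures of 10 records per input, 32,000 records and 750 inputs come from the text, while the function names and the sample variable counts are made up.

```python
RECORDS_PER_INPUT = 10   # rule of thumb quoted in the text
MAX_RECORDS = 32_000     # product limit quoted in the text
MAX_INPUTS = 750         # product limit quoted in the text

def count_inputs(n_continuous, categorical_levels):
    """Each continuous variable is one input; each categorical level is one input."""
    return n_continuous + sum(categorical_levels)

def training_records_needed(n_inputs):
    return n_inputs * RECORDS_PER_INPUT

# Hypothetical data set: 5 continuous variables plus two categorical
# variables with 4 and 12 levels.
n_inputs = count_inputs(5, [4, 12])
needed = training_records_needed(n_inputs)
within_limits = n_inputs <= MAX_INPUTS and needed <= MAX_RECORDS
print(n_inputs, needed, within_limits)
```

Seven variables therefore become 21 inputs, and roughly 210 records would be needed for training.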
System requirements for Neural Connection are Windows 3.1, Windows 95, or Windows NT on a 386 or
better PC (math co-processor strongly recommended), as well as 4 MB of memory (8 MB recommended) and 4 MB
of free hard drive space.
Neural Connection and AnswerTree can both be launched from an SPSS menu. However, they are also
compatible with other types of files such as ASCII and Excel. Neural Connection has one of the better output
formats. It uses 3-D contour plots that can be rotated in three dimensions, as well as tables and text. The
pricing for both Neural Connection and AnswerTree is based on the number of licenses.
There are two types of licenses offered by SPSS: an annual license and a perpetual license. An annual license
is a lease transaction and allows use of the product for one year. An initial fee is paid, and a yearly
renewal fee is required for continued use of the product. A perpetual license allows indefinite use of the
software. A higher initial fee is required, and a yearly service fee is optional. The user can choose not
to pay the service fee and still keep using the software. For either Neural Connection or AnswerTree,
an annual license for 1 user requires an initial fee of $375 and a renewal fee of $203 yearly. A perpetual
license for 1 user requires an initial fee of $665 and an optional maintenance fee of $35 yearly.
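The two license types can be compared with a short calculation. The fees are those quoted above; the assumptions that the initial fee covers the first year and that renewal or maintenance is paid from the second year onward are mine and are not stated in the source.

```python
def annual_license_cost(years, initial=375, renewal=203):
    """Total cost of the annual license; the initial fee is assumed to cover year one."""
    return initial + renewal * (years - 1)

def perpetual_license_cost(years, initial=665, maintenance=35, with_maintenance=True):
    """Total cost of the perpetual license; maintenance is optional from year two."""
    extra = maintenance * (years - 1) if with_maintenance else 0
    return initial + extra

def break_even_year(max_years=20):
    """First year in which the perpetual license is no more expensive."""
    for years in range(1, max_years + 1):
        if perpetual_license_cost(years) <= annual_license_cost(years):
            return years
    return None

for years in (1, 2, 3):
    print(years, annual_license_cost(years), perpetual_license_cost(years))
print("break-even year:", break_even_year())
```

Under these assumptions the perpetual license becomes the cheaper option for anyone who expects to use the product for three years or more.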
SPSS was founded in 1975. It now employs 535 people. Revenue generated in the most recent fiscal quarter
was $28.5 million, a 4% increase from the same fiscal quarter last year. SPSS has sold approximately 250,000
licenses worldwide. It was ranked No. 11 among the 200 Best Small Companies in America by Forbes for 1997,
and was ranked No. 73 in Business Week’s Top 100 Growth Companies for 1997.