Chapter 1
Introduction
1.1 Background
Data mining extracts implicit, potentially useful knowledge from large amounts of data. It
is also called knowledge mining, knowledge extraction, data/sequence/pattern analysis, data
archaeology and data dredging. In other words, data mining is the act of drilling
through huge volumes of data to discover relationships or to answer queries that are too
general for traditional query tools.
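In general, data mining tasks can be classified into two categories:
Predictive mining: This is the process of inferring sequences from data to make
predictions. Classification, regression and deviation detection are predictive mining techniques.
Descriptive mining: This is the process of characterizing the general properties of the data
in a form that people can interpret. Clustering and association rule discovery are descriptive
mining techniques.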
Data mining techniques are useful in various areas, such as market basket analysis, decision
support, fraud detection, business management and telecommunications. Data mining draws on
Database Technology, Machine Learning, Artificial Intelligence, Neural Networks,
Statistics, Pattern Recognition, Knowledge-based Systems, Knowledge Acquisition, Information
Retrieval, High-performance Computing and Data Visualization.
Many methods have been proposed to extract this information. Sequential sequence mining is
one of the most important of these techniques, because it supports decision making in various
applications. The mining problem was first proposed by Agrawal and Srikant [10]. It discovers
sequential sequences that occur frequently in a sequence database.
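To make the notion of a frequent sequential sequence concrete, the following minimal sketch counts how many customer sequences in a small database contain a candidate sequence. It is an illustration under stated assumptions, not the MySSM algorithm: the names `contains` and `support`, the toy database and the item labels are all hypothetical.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
CustomerSequence = List[Itemset]


def contains(customer_seq: Sequence[Itemset], candidate: Sequence[Itemset]) -> bool:
    """Return True if `candidate` is a subsequence of `customer_seq`.

    Each candidate itemset must be a subset of some transaction, and the
    matched transactions must appear in the same order as in the candidate.
    """
    pos = 0  # index of the next candidate itemset still to be matched
    for transaction in customer_seq:
        if pos < len(candidate) and candidate[pos] <= transaction:
            pos += 1
    return pos == len(candidate)


def support(database: Sequence[CustomerSequence], candidate: Sequence[Itemset]) -> float:
    """Fraction of customer sequences in `database` that contain `candidate`."""
    if not database:
        return 0.0
    hits = sum(1 for seq in database if contains(seq, candidate))
    return hits / len(database)


# Hypothetical sequence database: one list of transactions per customer.
database = [
    [frozenset({"bread"}), frozenset({"milk", "butter"}), frozenset({"jam"})],
    [frozenset({"bread", "milk"}), frozenset({"jam"})],
    [frozenset({"milk"}), frozenset({"bread"})],
    [frozenset({"bread"}), frozenset({"butter"}), frozenset({"milk", "jam"})],
]

candidate = [frozenset({"bread"}), frozenset({"jam"})]
print(support(database, candidate))  # prints 0.75
```

A sequence is reported as frequent when its support reaches a user-chosen minimum support threshold. With a threshold of 0.5, for example, the candidate above would qualify.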
In medicine, hospital databases record diseases, treatments, durations of hospital stay and
similar details, and time interval sequences of diseases can be found from these medical records.
However, events such as suffering from a disease, being cured of it, or showing symptoms are
interval-based. Conventional sequential sequence mining is not appropriate for the discovery
of sequences in these events. On the other hand, time interval sequences are more useful for
identifying whether a patient suffers from a certain disease. They can also help to predict the
symptoms of a patient who has a certain disease.
In investment, whether a certain stock will rise or fall is one of the most important things that
stock investors want to know. Further, business owners are concerned about the stock trend of
their own businesses. Stockholders and industry analysts also want to know about the rise or fall
of certain stocks, which is useful information that can be extracted from the time interval
sequences of stock prices. Stock prices are recorded for every transaction, and these records
serve as historical data. We may find time interval stock sequences from such a stock interval
event database.
In e-marketing, some Internet vendors provide new selling methods such as group
buying offers. In a group buying offer, a vendor sells a product at a lower price when someone
gathers a group of people to buy it. The duration from when an individual joins a group
buying session for a certain product until the session closes is considered an interval-based
event. Since many group buying customers may join buying sessions for a number of
products concurrently or later, these interval-based events form a set of sequences, which may
include some interesting time oriented sequences. Discovering time oriented sequences from
group buying records will help vendors understand the purchasing behaviors of customers and
devise effective marketing strategies.
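As a rough illustration of how interval-based events differ from point events, the sketch below represents each group buying session as an interval and labels consecutive sessions of one customer as either concurrent or later. The class `IntervalEvent`, the two relation labels, the product names and the day numbers are assumptions made for this example only. A full treatment would distinguish more interval relations than these two.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IntervalEvent:
    """A hypothetical interval-based event: a customer's stay in one buying session."""
    label: str  # product of the group buying session
    start: int  # day the customer joined the session
    end: int    # day the session closed


def relation(a: IntervalEvent, b: IntervalEvent) -> str:
    """Coarse relation of `b` to `a`, assuming a.start <= b.start."""
    if b.start >= a.end:
        return "later"       # b begins only after a has closed
    return "concurrent"      # the two sessions share some time


def to_relation_sequence(events: List[IntervalEvent]) -> List[Tuple[str, str, str]]:
    """Order events by start time and relate each event to its successor."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    return [(a.label, relation(a, b), b.label) for a, b in zip(ordered, ordered[1:])]


# Hypothetical sessions joined by one customer.
sessions = [
    IntervalEvent("laptop", 3, 8),
    IntervalEvent("phone", 1, 5),
    IntervalEvent("camera", 9, 12),
]

print(to_relation_sequence(sessions))
# prints [('phone', 'concurrent', 'laptop'), ('laptop', 'later', 'camera')]
```

A point-based sequence would keep only the order phone, laptop, camera. The interval view also records that the phone and laptop sessions ran at the same time.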
The goal of my research work is to develop and evaluate new MySSM algorithms that
efficiently produce sequential sequences from a large database, with a significant improvement
in execution time and memory usage.
Chapter 1 introduces the thesis. It also presents the organization of the thesis and the
aim of our research work.
Chapter 2 focuses on work related to our research. The first section of this chapter is
a literature survey. In the second section, we discuss various sequential sequence
mining techniques. The third section focuses on state-of-the-art techniques for sequential
sequence mining, and these techniques are compared with closely related
techniques. The results of an empirical analysis of the state-of-the-art methods are discussed in
the fourth section. This chapter helped us to strengthen our technique by considering the
various parameters of the evaluation matrix used in the area of sequential sequence mining.
Chapter 3 provides the motivation for our research work. It describes what inspired us to
do research in sequential sequence mining. The deficiencies in state-of-the-art
methods motivated us to develop a new sequential sequence mining technique.
Chapter 4 focuses on the scope of work of our algorithm MySSM.
Chapter 5 discusses the proposed algorithms, including the steps of our algorithm MySSM. We
have proposed eight algorithms named SYNTIM, MySSM, GCON, FS, GSGT, GAS, CMEM and
OUTR, all of which are discussed in that chapter.
Chapter 6 experimentally validates the claims of efficiency in terms of time and
memory. In addition, we have empirically analyzed the algorithm on a large database while
varying parameters such as the support value, the number of items per transaction, the number
of transactions per customer and the number of customers per database.
Chapter 7 summarizes the thesis and discusses the future scope of the work. This chapter is
followed by the references used in our thesis.
The fundamental aim of my thesis is to study and develop a new sequential sequence
mining technique that produces sequential sequences from a large database. It considers the
time gap between successive items purchased by the customers. It produces the sequential
sequences with a reasonable amount of time and memory.
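One common way to take the time gap between successive items into account is a maximum gap constraint: a sequence counts as occurring only if each matched item follows the previous one within a given time limit. The sketch below illustrates that idea under stated assumptions. The function `occurs_within_gap`, the `max_gap` parameter and the purchase data are hypothetical, and this is not necessarily how MySSM treats time gaps.

```python
from typing import List, Optional, Sequence, Tuple

TimedItem = Tuple[int, str]  # (purchase time, item); times are in days here


def occurs_within_gap(timed_seq: Sequence[TimedItem],
                      candidate: Sequence[str],
                      max_gap: int) -> bool:
    """Return True if `candidate` occurs in `timed_seq` in order, with at most
    `max_gap` time units between successive matched items.

    Assumes `timed_seq` is sorted by time. Backtracking is needed because the
    earliest match of an item is not always the one that satisfies the gap.
    """
    def search(cand_idx: int, start_idx: int, prev_time: Optional[int]) -> bool:
        if cand_idx == len(candidate):
            return True
        for i in range(start_idx, len(timed_seq)):
            time, item = timed_seq[i]
            if prev_time is not None and time - prev_time > max_gap:
                break  # sorted input: every later purchase is even further away
            if item == candidate[cand_idx] and search(cand_idx + 1, i + 1, time):
                return True
        return False

    return search(0, 0, None)


# Hypothetical purchase history of one customer.
purchases: List[TimedItem] = [(1, "printer"), (3, "ink"), (20, "paper"), (21, "ink")]

print(occurs_within_gap(purchases, ["printer", "ink"], max_gap=5))    # prints True
print(occurs_within_gap(purchases, ["printer", "paper"], max_gap=5))  # prints False
```

Without the gap constraint both candidates would be counted as occurring. With it, only purchases that follow each other closely in time contribute to the support of a sequence.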