Data Mining Notes
Data Mining
Data mining is the field in which large quantities of data are collected and analyzed to extract valuable,
structured information. Demand for it has grown over time. Everyone needs succinct and accurate knowledge;
obtaining it is not an easy job, but a set of processes and technologies makes it possible.
Major Sources of Abundant Data
Business – Web, E-commerce, Transactions, Stocks
Science – Remote Sensing, Bioinformatics, Scientific Simulation
Society and Everyone – News, Digital Cameras, YouTube
Industry – Ratings of individuals and people's preferences
Data Mining Motivation
The following areas, in which data mining is used extensively, illustrate the motivation for data mining:
1. Market Analysis
Data mining and market analysis are the best way to get a more holistic view of your customers. With data we
can learn more about customer preferences: look at purchase histories and collect demographics, gender,
location, other profile information, and much more. With this mining research we can then offer more
customized customer experiences, update the marketing strategy, maintain a rigorous analysis process, and
pitch products to the customers who are most likely to respond well.
For example, email marketers use data mining to provide users with more personalized content. With the aid
of a CRM or another big data collection tool, they learn things like gender, location, weather conditions,
and more. Email marketers can then use this information to segment their lists and send more specific
content.
Adidas does this by gathering gender information about its customers. It then segments its email lists and
data sets so that its new men's apparel collection goes to men and its new women's apparel collection goes
to women.
2. Fraud Detection
Fraud is the "use of one's occupation for personal enrichment through the deliberate misuse or
misapplication of the employing organization's resources or assets." In technological systems, fraudulent
activity has occurred in many areas of everyday life, such as telecommunication networks, mobile
communications, e-commerce, and online banking. Fraud detection means identifying fraud as quickly as
possible once it has been perpetrated.
Fraud detection methods are continually being developed to defend against offenders, who adapt their tactics
in response. Developing new methods is made more difficult by the severe limits on the exchange of ideas in
fraud detection. A variety of approaches have been introduced to detect fraud, drawing on data mining,
statistics, and artificial intelligence, for instance. Fraud is uncovered from anomalies in data and
patterns.
Types of Fraud - The types of fraud may include credit card fraud, telecommunication fraud, and computer
intrusion.
3. Customer Retention
Customer retention refers to the ability of a company or product to retain its customers over a given
period. High customer retention means that customers of the product or company tend to return and continue
to buy, rather than defecting to another product or company or ceasing to use it altogether.
4. Production Control
Production control is a rich source of possible applications for data mining. Collecting and cleaning the
data is reasonably simple. Organizations hold their own input records, so there are virtually no regulatory
or privacy challenges. Since companies have a long history of setting up operating procedures to optimize
production processes, cost justification and return-on-investment forecasts are simple to produce.
5. Scientific Exploration
Data exploration is an approach similar to initial data analysis, whereby a data scientist uses visual
exploration rather than conventional data management systems to understand what is in a dataset and the
characteristics of the data.
Such characteristics can include the size or quantity of data, its completeness, its consistency, and
potential relationships between data elements or data files/tables. Usually, data exploration is done using
a mixture of automated and manual activities.
Automated tasks may include data profiling, data visualization, or tabular reports, which give the analyst
an initial view of the data and an understanding of its key characteristics.
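As a rough illustration of automated data profiling, the sketch below computes a few per-column
characteristics (row count, missing values, distinct values, and a numeric range) for a small list of
records. The function name profile and the sample records are assumptions made up for this example; real
projects would usually rely on a dedicated profiling tool.

```python
def profile(rows):
    """Summarize each column of a dataset given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary = {
            "count": len(values),                    # data size
            "missing": len(values) - len(present),   # data completeness
            "distinct": len(set(present)),
        }
        # Add range statistics only when every present value is numeric.
        numeric = [v for v in present if isinstance(v, (int, float))]
        if numeric and len(numeric) == len(present):
            summary["min"] = min(numeric)
            summary["max"] = max(numeric)
            summary["mean"] = sum(numeric) / len(numeric)
        report[col] = summary
    return report


# Hypothetical sample records, invented for illustration only.
rows = [
    {"city": "Delhi", "units": 12},
    {"city": "Mumbai", "units": None},
    {"city": "Delhi", "units": 30},
]

for column, summary in profile(rows).items():
    print(column, summary)
```

The tabular report printed at the end is the kind of initial view an analyst might scan before exploring
the data manually.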
o Lossy compression
4. Numerosity reduction: In this reduction technique the actual data is replaced
with a mathematical model or a smaller representation of the data, so only the
model parameters need to be stored. Non-parametric methods such as
clustering, histograms, and sampling can also be used.
5. Discretization and Concept Hierarchy Operations
a. Discretization: Data discretization methods divide attributes of a
continuous nature into intervals. Many attribute values are replaced
with a small number of interval labels, so that mining results are
presented in a succinct and readily understood manner.
Top-down discretization
Bottom-up discretization
b. Concept hierarchies: These reduce the data size by collecting
low-level concepts and replacing them with high-level concepts
(categorical values such as middle-aged or senior).
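Two of the ideas above can be sketched briefly. The first function illustrates parametric numerosity
reduction by fitting a straight line with least squares, so that only two parameters are kept instead of
the data points. The second maps a low-level value (an age) to a higher-level concept. The function names,
the sample data, and the age thresholds of 40 and 60 are assumptions chosen for illustration; the notes
do not specify them.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def age_group(age):
    """Replace a numeric age with a higher-level concept (assumed cut-offs)."""
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


# Hypothetical data that happens to lie exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print("stored parameters:", slope, intercept)

# Concept hierarchy: several distinct ages collapse into a few labels.
print([age_group(a) for a in (23, 43, 58, 71)])
```

Storing the slope and intercept in place of the original points only pays off when the model fits the
data well; otherwise a non-parametric representation may be the better choice.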
It is possible to adopt the following techniques for numeric data:
Binning - Binning is the process of converting numerical variables into categorical
counterparts; the number of categories depends on how many bins the user has
defined.
Histogram analysis - Like binning, a histogram is used to partition the values of an
attribute X into disjoint ranges called buckets.
Cluster analysis - Cluster analysis is a common form of data discretization. A
clustering algorithm can be applied to discretize a numeric attribute A by
partitioning the values of A into clusters or groups.
Each initial cluster or partition can be further decomposed into several
subclusters, forming a lower level of the hierarchy.
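The three techniques above can be compared on one small numeric attribute. The sketch below uses
equal-width bins for binning, equal-frequency buckets for the histogram, and a tiny one-dimensional
k-means for the clustering step. These are common choices, not the only ones, and the function names and
sample values are assumptions made for this example.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (labels 0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1  # fall back to 1 when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_buckets(values, k):
    """Split the sorted values into k disjoint buckets of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]


def kmeans_1d(values, k, iterations=20):
    """Very small 1-D k-means; returns a list of k clusters."""
    ordered = sorted(values)
    # Initial centres: evenly spaced picks from the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Keep the previous centre if a cluster ends up empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters


# Hypothetical attribute values, invented for illustration.
values = [5, 7, 8, 20, 22, 25, 40, 41, 45]

print(equal_width_bins(values, 3))
print(equal_frequency_buckets(values, 3))

top_level = kmeans_1d(values, 3)
print(top_level)
# Decompose one cluster further to form a lower hierarchy level.
print(kmeans_1d(top_level[-1], 2))
```

On well-separated data like this, all three methods agree on the grouping; on skewed data they can
differ noticeably, which is why the choice of technique matters.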
Now suppose we want to view the sales data with a third dimension. For example, suppose the data are
considered according to time and item, as well as location, for the cities Chennai, Kolkata, Mumbai, and
Delhi. These 3D data are shown in the table, where they are represented as a series of 2D tables.
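One way to picture this is to store each sales figure under a (location, time, item) key and then
regroup the cells into one 2D table per location. The sketch below does this. The sales figures, the
quarter labels, and the item names are made up for illustration, and only two of the four cities are
included to keep the example short.

```python
# Made-up sales figures keyed by (location, time, item); illustrative only.
sales = {
    ("Chennai", "Q1", "phone"): 340,
    ("Chennai", "Q1", "laptop"): 120,
    ("Chennai", "Q2", "phone"): 360,
    ("Delhi", "Q1", "phone"): 410,
    ("Delhi", "Q2", "laptop"): 150,
}


def as_2d_tables(cube):
    """Regroup 3D cells into one time-by-item table per location."""
    tables = {}
    for (location, time, item), amount in cube.items():
        tables.setdefault(location, {}).setdefault(time, {})[item] = amount
    return tables


for location, table in as_2d_tables(sales).items():
    print(location)
    for time, row in table.items():
        print("  ", time, row)
```

The dictionary keyed by three dimensions plays the role of the cube, and the nested dictionaries are the
series of 2D tables; both hold exactly the same cells.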
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in
fig: