Data Mining

Data mining is the process of sorting through large data sets to identify patterns

and relationships that can help solve business problems through data analysis. Data
mining techniques and tools enable enterprises to predict future trends and make
more-informed business decisions.

Data mining is a key part of data analytics overall and one of the core disciplines
in data science, which uses advanced analytics techniques to find useful
information in data sets. At a more granular level, data mining is a step in the
knowledge discovery in databases (KDD) process, a data science methodology for
gathering, processing and analyzing data. Data mining and KDD are sometimes
referred to interchangeably, but they're more commonly seen as distinct things.

Why is data mining important?


Data mining is a crucial component of successful analytics initiatives in
organizations. The information it generates can be used in business
intelligence (BI) and advanced analytics applications that involve analysis of
historical data, as well as real-time analytics applications that examine streaming
data as it's created or collected.

Effective data mining aids in various aspects of planning business strategies and
managing operations. That includes customer-facing functions such as marketing,
advertising, sales and customer support, plus manufacturing, supply chain
management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It
also plays an important role in healthcare, government, scientific research,
mathematics, sports and more.

Data mining process: How does it work?


Data mining is typically done by data scientists and other skilled BI and analytics
professionals. But it can also be performed by data-savvy business analysts,
executives and workers who function as citizen data scientists in an organization.

Its core elements include machine learning and statistical analysis, along with data
management tasks done to prepare data for analysis. The use of machine learning
algorithms and artificial intelligence (AI) tools has automated more of the process
and made it easier to mine massive data sets, such as customer databases,
transaction records and log files from web servers, mobile apps and sensors.

The data mining process can be broken down into these four primary stages (a brief code sketch follows the list):

1. Data gathering. Relevant data for an analytics application is identified
and assembled. The data may be located in different source systems, a
data warehouse or a data lake, an increasingly common repository in big
data environments that contain a mix of structured and unstructured data.
External data sources may also be used. Wherever the data comes from,
a data scientist often moves it to a data lake for the remaining steps in the
process.
2. Data preparation. This stage includes a set of steps to get the data ready
to be mined. It starts with data exploration, profiling and pre-processing,
followed by data cleansing work to fix errors and other data
quality issues. Data transformation is also done to make data sets
consistent, unless a data scientist is looking to analyze unfiltered raw
data for a particular application.
3. Mining the data. Once the data is prepared, a data scientist chooses the
appropriate data mining technique and then implements one or more
algorithms to do the mining. In machine learning applications, the
algorithms typically must be trained on sample data sets to look for the
information being sought before they're run against the full set of data.
4. Data analysis and interpretation. The data mining results are used to
create analytical models that can help drive decision-making and other
business actions. The data scientist or another member of a data science
team also must communicate the findings to business executives and
users, often through data visualization and the use of data storytelling
techniques.
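
A minimal, hypothetical sketch of these four stages in Python, using pandas and scikit-learn. The tiny inline data set and its column names (age, monthly_spend, churned) are invented for illustration and stand in for data gathered from real source systems.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 1. Data gathering: a tiny stand-in for data assembled from source systems
df = pd.DataFrame({
    "age":           [23, 35, 52, 46, 31, 60, 28, 41],
    "monthly_spend": [120, 80, 40, None, 95, 30, 110, 60],
    "churned":       [0, 0, 1, 1, 0, 1, 0, 1],
})

# 2. Data preparation: basic cleansing (remove duplicates, fill missing values)
df = df.drop_duplicates()
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# 3. Mining the data: train a classification algorithm on a sample of the data
X, y = df[["age", "monthly_spend"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# 4. Data analysis and interpretation: evaluate the results and report them
print(classification_report(y_test, model.predict(X_test), zero_division=0))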

Types of data mining techniques


Various techniques can be used to mine data for different data science applications.
Pattern recognition is a common data mining use case that's enabled by multiple
techniques, as is anomaly detection, which aims to identify outlier values in data
sets. Popular data mining techniques include the following types (a short clustering sketch follows the list):

● Association rule mining. In data mining, association rules are if-then
statements that identify relationships between data elements. Support and
confidence criteria are used to assess the relationships: support
measures how frequently the related elements appear in a data set, while
confidence indicates how often the if-then statement proves true when its
antecedent occurs.
● Classification. This approach assigns the elements in data sets to
different categories defined as part of the data mining process. Decision
trees, Naive Bayes classifiers, k-nearest neighbor and logistic
regression are some examples of classification methods.
● Clustering. In this case, data elements that share particular
characteristics are grouped together into clusters as part of data mining
applications. Examples include k-means clustering, hierarchical
clustering and Gaussian mixture models.
● Regression. This is another way to find relationships in data sets, by
calculating predicted data values based on a set of variables. Linear
regression and multivariate regression are examples. Decision trees and
some other classification methods can be used to do regressions, too.
● Sequence and path analysis. Data can also be mined to look for
patterns in which a particular set of events or values leads to later ones.
● Neural networks. A neural network is a set of algorithms that simulates
the activity of the human brain. Neural networks are particularly useful
in complex pattern recognition applications involving deep learning, a
more advanced offshoot of machine learning.
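
To make one of these techniques concrete, here is a small, hypothetical k-means clustering sketch in Python with scikit-learn; the 2-D points (for example, customers described by age and monthly spend) are made-up values.

import numpy as np
from sklearn.cluster import KMeans

# Two loose groups of 2-D points (e.g., customers by age and monthly spend)
points = np.array([[22, 150], [25, 180], [27, 160],
                   [55, 620], [60, 700], [58, 650]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)            # cluster assigned to each point
print("Cluster centers:", kmeans.cluster_centers_)  # centroid of each cluster
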
Data mining software and tools
Data mining tools are available from a large number of vendors, typically as part of
software platforms that also include other types of data science and advanced
analytics tools. Key features provided by data mining software include data
preparation capabilities, built-in algorithms, predictive modeling support, a
GUI-based development environment, and tools for deploying models and scoring
how they perform.

Vendors that offer tools for data mining include Alteryx, AWS, Databricks,
Dataiku, DataRobot, Google, H2O.ai, IBM, Knime, Microsoft, Oracle,
RapidMiner, SAP, SAS Institute and Tibco Software, among others.

A variety of free open source technologies can also be used to mine data, including
DataMelt, Elki, Orange, Rattle, scikit-learn and Weka. Some software vendors
provide open source options, too. For example, Knime combines an open source
analytics platform with commercial software for managing data science
applications, while companies such as Dataiku and H2O.ai offer free versions of
their tools.
Benefits of data mining
In general, the business benefits of data mining come from the increased ability to
uncover hidden patterns, trends, correlations and anomalies in data sets. That
information can be used to improve business decision-making and strategic
planning through a combination of conventional data analysis and predictive
analytics.

Specific data mining benefits include the following:

● More effective marketing and sales. Data mining helps marketers
better understand customer behavior and preferences, which enables
them to create targeted marketing and advertising campaigns. Similarly,
sales teams can use data mining results to improve lead conversion rates
and sell additional products and services to existing customers.
● Better customer service. Thanks to data mining, companies can identify
potential customer service issues more promptly and give contact center
agents up-to-date information to use in calls and online chats with
customers.
● Improved supply chain management. Organizations can spot market
trends and forecast product demand more accurately, enabling them to
better manage inventories of goods and supplies. Supply chain managers
can also use information from data mining to optimize warehousing,
distribution and other logistics operations.
● Increased production uptime. Mining operational data from sensors on
manufacturing machines and other industrial equipment supports
predictive maintenance applications to identify potential problems before
they occur, helping to avoid unscheduled downtime.
● Stronger risk management. Risk managers and business executives can
better assess financial, legal, cybersecurity and other risks to a company
and develop plans for managing them.
● Lower costs. Data mining helps drive cost savings through operational
efficiencies in business processes and reduced redundancy and waste in
corporate spending.

Ultimately, data mining initiatives can lead to higher revenue and profits, as well as
competitive advantages that set companies apart from their business rivals.

Industry examples of data mining


Here's how organizations in some industries use data mining as part of analytics
applications:
● Retail. Online retailers mine customer data and internet clickstream
records to help them target marketing campaigns, ads and promotional
offers to individual shoppers. Data mining and predictive modeling also
power the recommendation engines that suggest possible purchases to
website visitors, as well as inventory and supply chain management
activities.
● Financial services. Banks and credit card companies use data mining
tools to build financial risk models, detect fraudulent transactions and vet
loan and credit applications. Data mining also plays a key role in
marketing and in identifying potential upselling opportunities with
existing customers.
● Insurance. Insurers rely on data mining to aid in pricing insurance
policies and deciding whether to approve policy applications, including
risk modeling and management for prospective customers.
● Manufacturing. Data mining applications for manufacturers include
efforts to improve uptime and operational efficiency in production
plants, supply chain performance and product safety.
● Entertainment. Streaming services do data mining to analyze what
users are watching or listening to and to make personalized
recommendations based on people's viewing and listening habits.
● Healthcare. Data mining helps doctors diagnose medical conditions,
treat patients and analyze X-rays and other medical imaging results.
Medical research also depends heavily on data mining, machine learning
and other forms of analytics.

The Data Mining Process


To be most effective, data analysts generally follow a certain flow of tasks along
the data mining process. Without this structure, an analyst may encounter an issue
in the middle of their analysis that could have easily been prevented had they
prepared for it earlier. The data mining process is usually broken into the following
steps.

Step 1: Understand the Business


Before any data is touched, extracted, cleaned, or analyzed, it is important to
understand the underlying entity and the project at hand. What are the goals the
company is trying to achieve by mining data? What is their current business
situation? What are the findings of a SWOT analysis? Before looking at any data,
the mining process starts by understanding what will define success at the end of
the process.

Step 2: Understand the Data


Once the business problem has been clearly defined, it's time to start thinking
about data. This includes what sources are available, how they will be secured and
stored, how the information will be gathered, and what the final outcome or
analysis may look like. This step also includes determining the limits of the data,
storage, security, and collection and assesses how these constraints will affect the
data mining process.

Step 3: Prepare the Data


Data is gathered, uploaded, extracted, or calculated. It is then cleaned,
standardized, scrubbed for outliers, assessed for mistakes, and checked for
reasonableness. During this stage of data mining, the data may also be checked for
size as an oversized collection of information may unnecessarily slow
computations and analysis.
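
A brief, hypothetical sketch of this kind of preparation in Python with pandas; the DataFrame, its column names, and the outlier threshold are chosen purely for illustration.

import pandas as pd

df = pd.DataFrame({
    "order_value": [120.0, 95.5, None, 88.0, 15000.0],   # one missing value, one extreme value
    "region": ["north", "North", "south", "south", "east"],
})

# Standardize inconsistent categorical values
df["region"] = df["region"].str.lower()

# Fill the missing value (here, with the column median)
df["order_value"] = df["order_value"].fillna(df["order_value"].median())

# Flag extreme values for review with a simple z-score rule
# (the threshold of 1.5 suits this tiny sample; 3 is more common in practice)
zscores = (df["order_value"] - df["order_value"].mean()) / df["order_value"].std()
df["is_outlier"] = zscores.abs() > 1.5

print(df)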

Step 4: Build the Model


With our clean data set in hand, it's time to crunch the numbers. Data scientists use
the types of data mining above to search for relationships, trends, associations, or
sequential patterns. The data may also be fed into predictive models to assess how
previous bits of information may translate into future outcomes.

Step 5: Evaluate the Results


The data-centered aspect of data mining concludes by assessing the findings of the
data model or models. The outcomes from the analysis may be aggregated,
interpreted, and presented to decision-makers who have largely been excluded from
the data mining process to this point. In this step, organizations can choose to make
decisions based on the findings.

Step 6: Implement Change and Monitor


The data mining process concludes with management taking steps in response to
the findings of the analysis. The company may decide the information was not
strong enough or the findings were not relevant, or the company may strategically
pivot based on the findings. In either case, management reviews the ultimate impact
on the business and initiates new data mining cycles by identifying new business
problems or opportunities.

Types of data that can be mined


1. Data stored in the database
A database is also called a database management system, or DBMS. Every DBMS
stores data that are related to each other in one way or another. It also includes a set of
software programs used to manage the data and provide easy access to it.
These programs serve many purposes, including defining the structure of the
database, ensuring that the stored information remains secure and consistent,
and managing different types of data access, such as shared, distributed, and
concurrent access.

A relational database is a collection of tables, each with a unique name and a set of
attributes (columns), that store rows (records) of large data sets. Every record stored
in a table is identified by a unique key. An entity-relationship (ER) model can be created
to represent a relational database in terms of its entities and the relationships that
exist between them.

2. Data warehouse
A data warehouse is a single storage location that collects data from multiple
sources and stores it under a unified schema. Before being stored in a data
warehouse, the data undergoes cleaning, integration, loading, and refreshing. Data
in a warehouse is typically organized into summarized subsets; a request for data
from 6 or 12 months ago, for example, is usually answered in summarized form.

3. Transactional data
A transactional database stores records that are captured as transactions, such as a
flight booking, a customer purchase, or a click on a website. Every transaction
record has a unique ID and lists the items that make up the transaction.

4. Other types of data


Many other types of data are also mined; they are distinguished by their structure,
semantic meaning, and versatility, and they are used in a wide range of applications.
Examples include data streams, engineering design data, sequence data,
graph data, spatial data, multimedia data, and more.

Popular tools

Teradata, Knime, Oracle Data Mining, Weka, Rattle, IBM SPSS Modeler, and Kaggle.

Challenges
1. Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies,
which may lead to inaccurate results. Moreover, the data may be incomplete, meaning
that some attributes or values are missing, making it challenging to obtain a complete
understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors,
data storage issues, data integration problems, and data transmission errors. To address
these challenges, data mining practitioners must apply data cleaning and data
preprocessing techniques to improve the quality of the data. Data cleaning involves
detecting and correcting errors, while data preprocessing involves transforming the
data to make it suitable for data mining.
2. Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such
as sensors, social media, and the internet of things (IoT). The complexity of the data
may make it challenging to process, analyze, and understand. In addition, the data
may be in different formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining. These techniques help to
identify patterns and relationships in the data, which can then be used to gain insights
and make predictions.
3. Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more
data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information that
must be protected. Moreover, data privacy regulations such as GDPR, CCPA, and
HIPAA impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization
and data encryption techniques to protect the privacy and security of the data. Data
anonymization involves removing personally identifiable information (PII) from the
data, while data encryption involves using algorithms to encode the data to make it
unreadable to unauthorized users.

4. Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the
size of the dataset increases, the time and computational resources required to perform
data mining operations also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark. These frameworks distribute the data and
processing across multiple nodes, making it possible to process large datasets quickly
and efficiently.
5. Interpretability
Data mining algorithms can produce complex models that are difficult to interpret.
This is because the algorithms use a combination of statistical and mathematical
techniques to identify patterns and relationships in the data. Moreover, the models
may not be intuitive, making it challenging to understand how the model arrived at a
particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to
represent the data and the models visually. Visualization makes it easier to understand
the patterns and relationships in the data and to identify the most important variables.
6. Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of
data. The data may be used to discriminate against certain groups, violate privacy
rights, or perpetuate existing biases. Moreover, data mining algorithms may not be
transparent, making it challenging to detect biases or discrimination.

Data Features

Data features, also known as variables or attributes, are individual characteristics or
properties of the data points or observations in a dataset. Each data feature represents a
specific aspect or measurement recorded for each item in the dataset. In the context of
machine learning and statistics, data features are used as inputs to build models, analyze
patterns, and make predictions.

For example, let's consider a dataset of houses for sale. Some common data features in this
dataset could include:

1. Price: The sale price of each house.
2. Area: The total area of the house in square feet.
3. Bedrooms: The number of bedrooms in the house.
4. Bathrooms: The number of bathrooms in the house.
5. Location: The geographical location of the house (e.g., city, neighborhood).
6. Year Built: The year the house was constructed.
7. Garage Size: The number of cars the garage can accommodate.
8. Lot Size: The size of the lot on which the house is built.
9. Heating Type: The type of heating system installed in the house.
10. Age: The age of the house (current year minus the year built).

In this example, each house is represented as a data point, and each of the features mentioned
above provides specific information about each house. In a dataset, the combination of all
these features makes up the entire set of observations that can be used for analysis and
modeling.
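
As a hypothetical illustration, a few of these features can be laid out as the columns of a table in Python with pandas; the values below are invented.

import pandas as pd
from datetime import date

houses = pd.DataFrame({
    "price":      [250000, 410000, 180000],
    "area_sqft":  [1600, 2400, 1100],
    "bedrooms":   [3, 4, 2],
    "location":   ["suburb", "city", "rural"],
    "year_built": [1995, 2010, 1978],
})

# Derived feature, as in item 10 above: current year minus year built
houses["age"] = date.today().year - houses["year_built"]

print(houses.dtypes)   # shows which features are numerical and which are categorical (object)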


When working with data, it's essential to understand the characteristics of each feature, such
as its data type (e.g., numerical, categorical), range, distribution, and potential relationships
with other features. Data preprocessing and feature engineering are crucial steps in preparing
data for machine learning models, as the quality and relevance of features can significantly
impact the model's performance and accuracy.
When we talk about data mining, we are usually talking about knowledge discovery
from data. To understand the data, it is necessary to discuss data objects, data
attributes, and the types of data attributes, because mining data means getting to
know the data and finding relations within it.

Data objects are the essential parts of a database. A data object represents an
entity and can be seen as a group of attributes describing that entity. For example,
in a sales database, data objects may represent customers, sales, or purchases.
When data objects are stored in a database, they are called data tuples.

Attribute:
It can be seen as a data field that represents the characteristics or features of
a data object. For a customer object, attributes can be customer ID, address, and so on.
A set of attributes used to describe a given object is known as an attribute vector
(or feature vector).

Types of attributes:
Distinguishing attribute types is an early step of data preprocessing: we identify the
different types of attributes and then preprocess the data accordingly. Attributes fall
into two broad groups:
1. Qualitative (Nominal, Ordinal, Binary)
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:

1. Nominal Attributes – related to names: The values of a nominal attribute
are names of things or symbols. They represent a category or state, which is
why nominal attributes are also referred to as categorical attributes, and there
is no order (rank or position) among the values.
Example: hair color, with values such as black, brown, and blond.
2. Binary Attributes: Binary data has only two values or states, for example yes
or no, affected or unaffected, true or false.
● Symmetric: Both values are equally important (e.g., gender).
● Asymmetric: The two values are not equally important (e.g., a test result,
where the positive outcome is the one of interest).

3. Ordinal Attributes: Ordinal attributes contain values that have a
meaningful sequence or ranking (order) between them, but the magnitude of
the difference between values is not known: the order shows what ranks
higher, not by how much. Example: customer satisfaction rated as low,
medium, or high.

Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative: it is a measurable quantity,
represented by integer or real values. Numeric attributes are of two types,
interval-scaled and ratio-scaled.
● An interval-scaled attribute has values whose differences are
interpretable, but it lacks a true zero point. Values on an interval scale
can be added and subtracted, but not meaningfully multiplied or divided.
Consider temperature in degrees Centigrade: if one day is 20°C and
another is 10°C, we cannot say that the first day is twice as hot as the
second.
● A ratio-scaled attribute is a numeric attribute with a fixed zero point.
If a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. The values are ordered, and we
can compute differences as well as the mean, median, mode,
quantile range, and five-number summary.

2. Discrete: A discrete attribute has a finite or countably infinite set of values,
which may be numeric or categorical. Example: the number of bedrooms in a
house.

3. Continuous: A continuous attribute has an infinite number of possible values
and is typically represented as a floating-point number; between any two
values, such as 2 and 3, there are infinitely many others. Example: height or
temperature measured to arbitrary precision.
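
A short, hypothetical sketch of how these attribute types might be represented in Python with pandas; the column names and values are invented for illustration.

import pandas as pd

data = pd.DataFrame({
    "hair_color": ["black", "brown", "blond"],   # nominal
    "smoker":     [True, False, False],          # binary
    "shirt_size": ["small", "large", "medium"],  # ordinal
    "temp_c":     [21.5, 30.0, 15.0],            # numeric, interval-scaled
    "income":     [42000, 58000, 39000],         # numeric, ratio-scaled
})

# Mark the ordinal attribute so that its ranking is preserved
data["shirt_size"] = pd.Categorical(
    data["shirt_size"], categories=["small", "medium", "large"], ordered=True
)
print(data["shirt_size"].cat.codes)   # 0, 2, 1 -> encodes the ranking
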
Basic statistical descriptions of data

Basic statistical description refers to a set of summary measures and methods used to
describe and understand the main characteristics of a dataset. These measures provide
a clear and concise overview of the data, allowing researchers and analysts to gain
insights and make informed decisions. Some of the most common basic statistical
description methods include:

Mean: The arithmetic average of all values in the dataset. It is calculated by summing
all the values and dividing by the total number of observations.

Median: The middle value of the dataset when arranged in ascending or descending
order. If the number of observations is even, the median is the average of the two
middle values.

Mode: The value that appears most frequently in the dataset.

Range: The difference between the maximum and minimum values in the dataset. It
provides a measure of the spread or variability in the data.

Variance: A measure of how spread out the data points are from the mean. It
quantifies the dispersion of the data.

Standard Deviation: The square root of the variance. It provides a more interpretable
measure of the spread of data, as it is in the original units of the data.

Quartiles: Values that divide the data into four equal parts. The first quartile (Q1) is
the 25th percentile, the second quartile (Q2) is the median, and the third quartile (Q3)
is the 75th percentile.

Interquartile Range (IQR): The difference between the third quartile (Q3) and the first
quartile (Q1). It represents the range of the middle 50% of the data.

Skewness: A measure of the asymmetry of the distribution. Positive skewness
indicates a longer tail on the right, while negative skewness indicates a longer tail on
the left.

Kurtosis: A measure of the peakedness or flatness of the distribution. High kurtosis
indicates a sharp peak, while low kurtosis indicates a flatter distribution.

These basic statistical description methods help summarize data, identify patterns,
detect outliers, and provide a foundation for more advanced statistical analyses. They
are essential in data exploration and preliminary analysis to get a sense of the dataset
before diving into more complex modeling and inference.
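
These measures are straightforward to compute; here is a brief sketch in Python using pandas and SciPy on a small made-up sample.

import pandas as pd
from scipy import stats

values = pd.Series([4, 8, 6, 5, 3, 9, 7, 7, 100])   # includes one extreme value

print("mean:", values.mean())
print("median:", values.median())
print("mode:", values.mode().tolist())
print("range:", values.max() - values.min())
print("variance:", values.var())
print("std dev:", values.std())
q1, q3 = values.quantile(0.25), values.quantile(0.75)
print("IQR:", q3 - q1)
print("skewness:", stats.skew(values))
print("kurtosis:", stats.kurtosis(values))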


Measuring data similarity and dissimilarity

Measuring similarity and dissimilarity is a fundamental task in many fields, including
data analysis, machine learning, pattern recognition, and clustering. The choice of
similarity or dissimilarity measure depends on the type of data and the specific
application. Commonly used measures include the following (a short code sketch
follows the list):

1. Euclidean Distance: This is one of the most straightforward measures for numerical
data. It calculates the straight-line distance between two points in an n-dimensional
space. For two points A = (a1, a2, ..., an) and B = (b1, b2, ..., bn), the Euclidean
distance is given by:

d(A, B) = √((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

2. Manhattan Distance (City Block Distance): Similar to the Euclidean distance, but it
measures the distance along the axes (sum of absolute differences) instead of the
straight-line distance. It is commonly used for data with a grid-like structure.

3. Cosine Similarity: This measure calculates the cosine of the angle between two
vectors representing data points. It is commonly used for text data or other
high-dimensional sparse data, where the orientation of the vectors matters more than
their magnitude.

cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)

4. Jaccard Index (Jaccard Similarity): Used to measure the similarity between sets. It
calculates the size of the intersection divided by the size of the union of two sets.

J(A, B) = |A ∩ B| / |A ∪ B|
5. Pearson Correlation Coefficient: Measures the linear correlation between two
numerical variables. It ranges from -1 to 1, where -1 indicates a perfect negative
correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear
correlation.

6. Hamming Distance: Used to compare binary data (strings of equal length). It
calculates the number of positions at which two strings differ.

7. Edit Distance (Levenshtein Distance): Measures the similarity between two strings
by calculating the minimum number of single-character edits (insertion, deletion, or
substitution) required to transform one string into the other.

8. Minkowski Distance: A generalization of both the Euclidean and Manhattan
distances, where the parameter "p" determines the type of distance (p = 2 for
Euclidean, p = 1 for Manhattan).

9. Mahalanobis Distance: Accounts for the correlations between variables in
multivariate data. It is a normalized distance measure that considers the covariance
structure of the data.

10. Earth Mover's Distance (EMD): Used to compare distributions or histograms. It
quantifies the minimum effort required to transform one distribution into another.
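
A brief sketch of a few of these measures in Python with NumPy and SciPy; the small vectors, sets, and strings are made-up examples.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print("Euclidean:", distance.euclidean(a, b))
print("Manhattan:", distance.cityblock(a, b))
print("Cosine similarity:", 1 - distance.cosine(a, b))   # SciPy returns cosine *distance*

# Jaccard index on two sets
s1, s2 = {"milk", "bread", "eggs"}, {"bread", "eggs", "butter"}
print("Jaccard:", len(s1 & s2) / len(s1 | s2))

# Hamming distance between two equal-length binary strings
x, y = "10110", "11100"
print("Hamming:", sum(c1 != c2 for c1, c2 in zip(x, y)))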

Choosing an appropriate similarity or dissimilarity measure depends on the nature of
the data and the specific task at hand. It's essential to understand the characteristics of
the data and the requirements of the analysis or model to select the most suitable
measure. Additionally, data preprocessing and normalization may be necessary to
ensure that the chosen measure is effective and meaningful for the data.
