Data Mining
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis. Data
mining techniques and tools enable enterprises to predict future trends and make
more informed business decisions.
Data mining is a key part of data analytics overall and one of the core disciplines
in data science, which uses advanced analytics techniques to find useful
information in data sets. At a more granular level, data mining is a step in the
knowledge discovery in databases (KDD) process, a data science methodology for
gathering, processing and analysing data. Data mining and KDD are sometimes
referred to interchangeably, but they're more commonly seen as distinct things.
Effective data mining aids in various aspects of planning business strategies and
managing operations. That includes customer-facing functions such as marketing,
advertising, sales and customer support, plus manufacturing, supply chain
management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It
also plays an important role in healthcare, government, scientific research,
mathematics, sports and more.
Its core elements include machine learning and statistical analysis, along with data
management tasks done to prepare data for analysis. The use of machine learning
algorithms and artificial intelligence (AI) tools has automated more of the process
and made it easier to mine massive data sets, such as customer databases,
transaction records and log files from web servers, mobile apps and sensors.
The data mining process can be broken down into four primary stages: gathering the data, preparing it for analysis, mining it for patterns, and analyzing and interpreting the results.
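As a rough illustration of these stages, the sketch below walks a toy customer data set through preparation, mining (k-means clustering) and interpretation using pandas and scikit-learn; the column names and values are invented, and a real project would substitute its own data sources.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Gather: in practice this data would come from databases, logs or APIs.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "annual_spend": [120.0, 950.0, 130.0, 880.0, None, 910.0],
    "visits_per_month": [2, 14, 3, 12, 2, 15],
})

# 2. Prepare: fill missing values and scale the numeric features.
prepared = raw.fillna(raw.mean(numeric_only=True))
features = StandardScaler().fit_transform(
    prepared[["annual_spend", "visits_per_month"]])

# 3. Mine: group customers into segments with k-means clustering.
prepared["segment"] = KMeans(n_clusters=2, n_init=10,
                             random_state=0).fit_predict(features)

# 4. Analyze and interpret: summarize each segment for decision-makers.
print(prepared.groupby("segment")[["annual_spend", "visits_per_month"]].mean())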
Vendors that offer tools for data mining include Alteryx, AWS, Databricks,
Dataiku, DataRobot, Google, H2O.ai, IBM, Knime, Microsoft, Oracle,
RapidMiner, SAP, SAS Institute and Tibco Software, among others.
A variety of free open source technologies can also be used to mine data, including
DataMelt, Elki, Orange, Rattle, scikit-learn and Weka. Some software vendors
provide open source options, too. For example, Knime combines an open source
analytics platform with commercial software for managing data science
applications, while companies such as Dataiku and H2O.ai offer free versions of
their tools.
Benefits of data mining
In general, the business benefits of data mining come from the increased ability to
uncover hidden patterns, trends, correlations and anomalies in data sets. That
information can be used to improve business decision-making and strategic
planning through a combination of conventional data analysis and predictive
analytics.
Ultimately, data mining initiatives can lead to higher revenue and profits, as well as
competitive advantages that set companies apart from their business rivals.
1. Relational database
A relational database stores data in tables, each with its own name and set of attributes, and the rows (records) hold the data itself. Every record stored in a table has a unique key. An entity-relationship (ER) model is created to provide a representation of a relational database in terms of entities and the relationships that exist between them.
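To make tables, unique keys and entity relationships concrete, here is a small sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- unique key for each record
    name TEXT)""")
con.execute("""CREATE TABLE purchases (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),  -- relationship
    amount REAL)""")
con.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
con.execute("INSERT INTO purchases VALUES (10, 1, 45.0), (11, 1, 12.5), (12, 2, 99.9)")

# Join the two entities through their relationship to total spend per customer.
for row in con.execute("""SELECT c.name, SUM(p.amount)
                          FROM customers c JOIN purchases p USING (customer_id)
                          GROUP BY c.name"""):
    print(row)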
2. Data warehouse
A data warehouse is a single data storage location that collects data from multiple sources and stores it under a unified schema. Before data is stored in a data warehouse, it undergoes cleaning, integration, loading and refreshing. Data in a warehouse is organized into subject-oriented sections, and queries about older data, say from 6 or 12 months back, are typically answered from summaries.
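As a rough sketch of this idea, the snippet below integrates records from two hypothetical sources with pandas, cleans them into a unified schema and keeps a monthly revenue summary of the kind used for older data; the column names and figures are invented.

import pandas as pd

web_orders = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-20", "2024-06-03"],
    "amount": [100.0, 250.0, 80.0],
    "channel": "web",
})
store_orders = pd.DataFrame({
    "order_date": ["2024-01-11", "2024-06-15"],
    "amount": [60.0, 40.0],
    "channel": "store",
})

# Integrate and clean: unify the schema, parse dates, drop duplicates.
warehouse = pd.concat([web_orders, store_orders], ignore_index=True).drop_duplicates()
warehouse["order_date"] = pd.to_datetime(warehouse["order_date"])

# Older data is typically queried as summaries, e.g. revenue per month.
print(warehouse.groupby(warehouse["order_date"].dt.to_period("M"))["amount"].sum())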
3. Transactional data
A transactional database stores records that are captured as transactions, such as a flight booking, a customer purchase or a click on a website. Every transaction record has a unique ID and lists the items that made up the transaction.
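A minimal sketch of what such transactional records might look like, together with a toy item co-occurrence count of the kind association-rule mining builds on; the IDs and items are invented.

from collections import Counter
from itertools import combinations

transactions = [
    {"txn_id": "T001", "items": ["milk", "bread", "butter"]},
    {"txn_id": "T002", "items": ["bread", "butter"]},
    {"txn_id": "T003", "items": ["milk", "bread", "jam"]},
]

# How often do pairs of items appear in the same transaction?
pair_counts = Counter()
for txn in transactions:
    for pair in combinations(sorted(txn["items"]), 2):
        pair_counts[pair] += 1
print(pair_counts.most_common(3))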
Popular tools
Teradata, Knime, Oracle Data Mining, Weka, Rattle, IBM SPSS Modeler and Kaggle.
Challenges
1. Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies,
which may lead to inaccurate results. Moreover, the data may be incomplete, meaning
that some attributes or values are missing, making it challenging to obtain a complete
understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors,
data storage issues, data integration problems, and data transmission errors. To address
these challenges, data mining practitioners must apply data cleaning and data
preprocessing techniques to improve the quality of the data. Data cleaning involves
detecting and correcting errors, while data preprocessing involves transforming the
data to make it suitable for data mining.
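A small illustration of these cleaning and preprocessing steps, assuming pandas is available; the columns and values are made up.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 29, 41],
    "country": ["US", "us", "us", "UK", None],
})

df = df.drop_duplicates()                         # remove duplicated records
df["age"] = df["age"].fillna(df["age"].median())  # impute missing ages
df["country"] = df["country"].str.upper()         # fix inconsistent casing
df = df.dropna(subset=["country"])                # drop rows missing a key field
print(df)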
2. Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such
as sensors, social media, and the internet of things (IoT). The complexity of the data
may make it challenging to process, analyze, and understand. In addition, the data
may be in different formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining. These techniques help to
identify patterns and relationships in the data, which can then be used to gain insights
and make predictions.
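A brief sketch of two of the techniques named above, clustering and classification, on a toy two-dimensional dataset; it assumes scikit-learn is installed and the data is invented.

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8], [1, 1.5], [8, 9]]
y = [0, 0, 1, 1, 0, 1]   # known labels, used only by the classifier

# Clustering: discover groups without using the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters)

# Classification: learn from labeled examples, then predict a new point.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted class for [2, 2]:", clf.predict([[2, 2]])[0])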
3. Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more
data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information that
must be protected. Moreover, data privacy regulations such as GDPR, CCPA, and
HIPAA impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization
and data encryption techniques to protect the privacy and security of the data. Data
anonymization involves removing personally identifiable information (PII) from the
data, while data encryption involves using algorithms to encode the data to make it
unreadable to unauthorized users.
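As a hedged sketch of both ideas, the snippet below pseudonymizes a PII field with a salted hash from the standard library and encrypts a record with Fernet from the third-party cryptography package; the field names and salt are illustrative only, and real deployments need proper key and salt management.

import hashlib
from cryptography.fernet import Fernet

def pseudonymize(value: str, salt: str = "example-salt") -> str:
    # Replace a PII value with an irreversible token.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

record = {"email": "jane@example.com", "purchase": "laptop"}
record["email"] = pseudonymize(record["email"])      # anonymization step

key = Fernet.generate_key()                          # keep this key secret
token = Fernet(key).encrypt(str(record).encode())    # encryption step
print(record)
print(token[:20], "...")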
4. Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the
size of the dataset increases, the time and computational resources required to perform
data mining operations also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark. These frameworks distribute the data and
processing across multiple nodes, making it possible to process large datasets quickly
and efficiently.
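A minimal PySpark sketch of distributing a simple aggregation; the file name events.csv and the column event_type are assumptions, and a working Spark installation is required.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mining-at-scale").getOrCreate()

# The data and the work are partitioned across the cluster's nodes.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))

# Only the small aggregated result is brought back to the driver.
counts.show()
spark.stop()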
5. Interpretability
Data mining algorithms can produce complex models that are difficult to interpret.
This is because the algorithms use a combination of statistical and mathematical
techniques to identify patterns and relationships in the data. Moreover, the models
may not be intuitive, making it challenging to understand how the model arrived at a
particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to
represent the data and the models visually. Visualization makes it easier to understand
the patterns and relationships in the data and to identify the most important variables.
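One common choice is a bar chart of a tree ensemble's feature importances, which shows at a glance which variables drive the model; the sketch assumes matplotlib and scikit-learn, and the data and feature names are invented.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

X = [[25, 40000], [47, 82000], [33, 50000], [52, 91000], [29, 43000], [45, 78000]]
y = [0, 1, 0, 1, 0, 1]
feature_names = ["age", "income"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Visualize which features the model relies on most.
plt.bar(feature_names, model.feature_importances_)
plt.ylabel("importance")
plt.title("Which features drive the model?")
plt.show()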
6. Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of
data. The data may be used to discriminate against certain groups, violate privacy
rights, or perpetuate existing biases. Moreover, data mining algorithms may not be
transparent, making it challenging to detect biases or discrimination.
Data Features
For example, consider a dataset of houses for sale. Common data features in such a dataset could include the number of bedrooms, the square footage (area), the location, the year the house was built and the asking price.
In this example, each house is represented as a data point, and each of the features mentioned
above provides specific information about each house. In a dataset, the combination of all
these features makes up the entire set of observations that can be used for analysis and
modeling.
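A toy version of such a house dataset as a pandas DataFrame, with each row one data point and each column one feature; the values are invented.

import pandas as pd

houses = pd.DataFrame({
    "bedrooms":   [3, 2, 4],
    "area_sqft":  [1400, 950, 2100],
    "location":   ["suburb", "city", "suburb"],
    "year_built": [1999, 2015, 1987],
    "price":      [250000, 310000, 420000],
})
print(houses)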
When working with data, it's essential to understand the characteristics of each feature, such
as its data type (e.g., numerical, categorical), range, distribution, and potential relationships
with other features. Data preprocessing and feature engineering are crucial steps in preparing
data for machine learning models, as the quality and relevance of features can significantly
impact the model's performance and accuracy.
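A short sketch of inspecting feature types, ranges and relationships and then one-hot encoding a categorical feature, assuming pandas; the column names and values are invented.

import pandas as pd

houses = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "area_sqft": [1400, 950, 2100],
    "location": ["suburb", "city", "suburb"],
    "price": [250000, 310000, 420000],
})

print(houses.dtypes)                                   # numerical vs. categorical features
print(houses.describe())                               # range and distribution of numeric features
print(houses.select_dtypes(include="number").corr())   # potential relationships between features

# Feature engineering: turn the categorical feature into numeric columns.
encoded = pd.get_dummies(houses, columns=["location"])
print(encoded.columns.tolist())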
When we talk about data mining, we usually discuss knowledge discovery from data. Getting to know the data requires discussing data objects, data attributes and the types of those attributes, because mining data means understanding the data and finding relationships within it.
Data objects:
Data objects are an essential part of a database. A data object represents an entity and can be thought of as a group of attributes of that entity. For example, in a sales database, data objects may represent customers, sales or purchases. When data objects are stored in a database, they are referred to as data tuples.
Attribute:
An attribute is a data field that represents a characteristic or feature of a data object. For a customer object, attributes can include customer ID, address and so on. A set of attributes used to describe a given object is known as an attribute vector or feature vector.
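For instance, a customer data object and its attribute vector might look like this in code; the attribute names are illustrative.

# One data object (a customer) described by its attributes.
customer = {
    "customer_id": 1042,
    "address": "12 Lake Road",
    "age": 34,
    "loyalty_member": True,
}

# The set of attribute values describing the object: its attribute/feature vector.
attribute_vector = list(customer.values())
print(attribute_vector)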
Types of attributes:
Identifying attribute types is the first step of data preprocessing: we distinguish between the different types of attributes and then preprocess the data accordingly. Attributes fall into two broad groups:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal: values are names of categories with no meaningful order, e.g. occupation or marital status.
2. Ordinal: values have a meaningful order or ranking, but the magnitude of the difference between values is not defined, e.g. small, medium, large.
3. Binary: a nominal attribute with only two states, such as 0/1 or yes/no.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable
quantity, represented by integer or real values. Numeric attributes are of two
types: interval and ratio.
● An interval-scaled attribute has values whose differences are
interpretable, but it has no true reference point, or zero point. Interval
data can be added and subtracted but cannot meaningfully be multiplied
or divided. Consider temperature in degrees Centigrade: if one day's
temperature is twice that of another in degrees, we cannot say that the
first day is twice as hot.
● A ratio-scaled attribute is a numeric attribute with a fixed zero point.
If a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. The values are ordered, the
differences between values can be computed, and the mean, median,
mode, quartile range and five-number summary can be given.
2. Discrete: Discrete attributes have a finite or countably infinite set of
values, which can be numerical or categorical.
3. Continuous: Continuous attributes take real-number values within a range,
such as height, weight or temperature.
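A tiny worked example of why the zero point matters: ratios are misleading on the interval-scaled Celsius scale but meaningful on the ratio-scaled Kelvin scale.

c1, c2 = 10.0, 20.0                  # two daily temperatures in degrees Celsius
k1, k2 = c1 + 273.15, c2 + 273.15    # the same temperatures in Kelvin

print(c2 / c1)   # 2.0 -- yet the second day is not "twice as hot"
print(k2 / k1)   # about 1.035 -- the physically meaningful ratio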
Basic statistical descriptions of data
Mean: The arithmetic average of all values in the dataset. It is calculated by summing
all the values and dividing by the total number of observations.
Median: The middle value of the dataset when arranged in ascending or descending
order. If the number of observations is even, the median is the average of the two
middle values.
Range: The difference between the maximum and minimum values in the dataset. It
provides a measure of the spread or variability in the data.
Variance: A measure of how spread out the data points are from the mean. It
quantifies the dispersion of the data.
Standard Deviation: The square root of the variance. It provides a more interpretable
measure of the spread of data, as it is in the original units of the data.
Quartiles: Values that divide the data into four equal parts. The first quartile (Q1) is
the 25th percentile, the second quartile (Q2) is the median, and the third quartile (Q3)
is the 75th percentile.
Interquartile Range (IQR): The difference between the third quartile (Q3) and the first
quartile (Q1). It represents the range of the middle 50% of the data.
These basic statistical description methods help summarize data, identify patterns,
detect outliers, and provide a foundation for more advanced statistical analyses. They
are essential in data exploration and preliminary analysis to get a sense of the dataset
before diving into more complex modeling and inference.
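The snippet below computes each of these descriptions for a small sample, assuming numpy is available; the numbers are arbitrary.

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

print("mean:", data.mean())
print("median:", np.median(data))
print("range:", data.max() - data.min())
print("variance:", data.var(ddof=1))        # sample variance
print("std dev:", data.std(ddof=1))         # sample standard deviation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print("quartiles:", q1, q2, q3, "IQR:", q3 - q1)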
Similarity and dissimilarity measures
1. Euclidean Distance: This is one of the most straightforward measures for numerical
data. It calculates the straight-line distance between two points in an n-dimensional
space. For two points A = (a1, a2, ..., an) and B = (b1, b2, ..., bn), the Euclidean
distance is given by:
d(A, B) = √((a1 − b1)² + (a2 − b2)² + ... + (an − bn)²)
(A code sketch after this list computes several of these measures.)
2. Manhattan Distance (City Block Distance): Similar to the Euclidean distance, but it
measures the distance along the axes as the sum of absolute differences,
d(A, B) = |a1 − b1| + |a2 − b2| + ... + |an − bn|, instead of the straight-line distance.
It is commonly used for data with a grid-like structure.
3. Cosine Similarity: This measure calculates the cosine of the angle between two
vectors representing data points, cos(A, B) = (A · B) / (‖A‖ ‖B‖). It is commonly used
for text data and other high-dimensional sparse data, where the orientation of the
vectors matters more than their magnitude.
4. Jaccard Index (Jaccard Similarity): Used to measure the similarity between sets. It
calculates the size of the intersection divided by the size of the union of two sets.
J(A, B) = |A ∩ B| / |A ∪ B|
5. Pearson Correlation Coefficient: Measures the linear correlation between two
numerical variables. It ranges from -1 to 1, where -1 indicates a perfect negative
correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear
correlation.
6. Edit Distance (Levenshtein Distance): Measures the similarity between two strings
by calculating the minimum number of single-character edits (insertion, deletion, or
substitution) required to transform one string into the other.
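The sketch below computes several of the measures above on toy inputs; it assumes numpy and scipy are available (scipy's distance module provides Euclidean, city-block and cosine distances), while Jaccard and Pearson are computed directly.

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print("Euclidean distance:", euclidean(a, b))
print("Manhattan distance:", cityblock(a, b))
print("Cosine similarity:", 1 - cosine(a, b))   # scipy returns cosine *distance*

A = {"milk", "bread", "butter"}
B = {"bread", "butter", "jam"}
print("Jaccard index:", len(A & B) / len(A | B))

print("Pearson r:", np.corrcoef(a, np.array([1.0, 1.9, 3.2]))[0, 1])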