Unit - 2
Data Preprocessing
Data Preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by a machine.
The main agenda for a model to be accurate and precise in its predictions is that the algorithm should be able to interpret the data's features easily.
The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values. Algorithms applied to such data fail to identify patterns effectively. Data Preprocessing is, therefore, important to improve the overall data quality.
Duplicate or missing values may give an incorrect view of the overall statistics of data.
Outliers and inconsistent data points often tend to disturb the model’s overall learning, leading to
false predictions.
Quality decisions must be based on quality data. Data Preprocessing is important to get this quality data, without which it would simply be a case of garbage in, garbage out.
Now, let's discuss the four main stages of data preprocessing in more depth.
Data Cleaning
Data Cleaning is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.
1. Missing values
Here are a few ways to solve this issue:
Ignore the tuple: This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
Fill in the missing values: There are many methods to achieve this, such as filling in the values manually, predicting the missing values using regression, or using numerical methods like the attribute mean.
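As an illustration, here is a minimal Python sketch of both options (assuming pandas and scikit-learn are available; the tiny DataFrame and its age column are made up for the example):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values in the 'age' attribute
df = pd.DataFrame({"age": [25, None, 47, 35, None, 52]})

# Option 1: ignore (drop) tuples that contain missing values
df_dropped = df.dropna()

# Option 2: fill the missing values with the attribute mean
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df)
```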
2. Noisy Data
Noisy data is a random error or variance in a measured variable. It can be smoothed with the help of the following techniques:
Binning
It is a technique that works on sorted data values to smooth any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values.
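A minimal sketch of smoothing by bin means with pandas; the values and the number of bins are illustrative:

```python
import pandas as pd

# Illustrative sorted attribute values (e.g., prices)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into three equal-frequency bins
bins = pd.qcut(prices, q=3)

# Smoothing by bin means: replace every value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())
```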
Regression
This data mining technique is generally used for prediction. It helps to smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.
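For example, with a single independent attribute, noisy values can be smoothed by fitting a straight line (a sketch using NumPy; the data here is synthetic):

```python
import numpy as np

# Hypothetical noisy measurements of y against a single independent attribute x
x = np.arange(10)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.5, size=x.size)  # noisy line

# Fit a degree-1 (linear) polynomial; use a higher degree for polynomial regression
slope, intercept = np.polyfit(x, y, deg=1)

# Replace the noisy observations with the fitted (smoothed) values
y_smoothed = slope * x + intercept
print(y_smoothed)
```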
Clustering
Groups/clusters are created from data having similar values. Values that do not lie in any cluster can be treated as noisy data and removed.
3. Removing outliers
Clustering techniques group together similar data points. The tuples that lie outside the cluster are
outliers/inconsistent data.
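A minimal sketch using DBSCAN from scikit-learn, where points that are not assigned to any cluster (label -1) are treated as outliers; the 2-D data is synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D data: two dense groups plus one far-away point
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8], [50, 50]])

# DBSCAN labels points that do not belong to any dense cluster as -1 (noise)
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)

inliers = X[labels != -1]
outliers = X[labels == -1]
print("Outliers:", outliers)   # expected: [[50, 50]]
```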
Data Integration
Data Integration is one of the data preprocessing steps used to merge the data present in multiple sources into a single, larger data store such as a data warehouse.
Data Integration is needed especially when we are aiming to solve a real-world scenario like detecting the presence of nodules from CT scan images. The only option is to integrate the images from multiple medical sources to form a larger database.
We might run into some issues while adopting Data Integration as one of the Data Preprocessing steps:
Schema integration and object matching: The data can be present in different formats and with differently named attributes, which might cause difficulty in data integration.
Data Transformation
Once data cleaning has been done, we need to consolidate the quality data into alternate forms by changing the value, structure, or format of the data using the Data Transformation strategies mentioned below.
Generalization
The low-level or granular data that we have is converted to high-level information by using concept hierarchies. For example, we can transform primitive address data, such as the city, into higher-level information such as the country.
Normalization
It is one of the most important and most widely used Data Transformation techniques. Numerical attributes are scaled up or down to fit within a specified range. In this approach, we constrain a data attribute to a particular range so that different data points become comparable. Normalization can be done in several ways, including:
Min-max normalization
Z-Score normalization
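A small sketch of both, using NumPy; the attribute values are illustrative:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # illustrative attribute values

# Min-max normalization: v' = (v - min) / (max - min), mapping values into [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: v' = (v - mean) / standard deviation
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score)
```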
Attribute Selection
New properties of the data are created from existing attributes to help in the data mining process. For example, the date_of_birth attribute can be transformed into another property like is_senior_citizen for each tuple, which can directly influence predictions of diseases or chances of survival, etc.
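A quick sketch of deriving such an attribute with pandas; the column name and the reference date are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1950-05-01", "1990-07-15", "1942-12-30"]})

# Derive an age in years, then the boolean is_senior_citizen attribute
reference_date = pd.Timestamp("2024-01-01")   # fixed date so the example is reproducible
age_years = (reference_date - pd.to_datetime(df["date_of_birth"])).dt.days // 365
df["is_senior_citizen"] = age_years >= 60
print(df)
```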
Aggregation
It is a method of storing and presenting data in a summary format. For example, sales data can be aggregated and summarized on a monthly or yearly basis.
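A brief sketch with pandas, aggregating made-up daily sales into monthly totals:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-28"]),
    "amount": [120, 80, 200, 150],
})

# Aggregate individual transactions into monthly totals
monthly_sales = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(monthly_sales)
```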
Data Reduction
The size of the dataset in a data warehouse can be too large to be handled by data analysis and data
mining algorithms.
One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume yet produces nearly the same analytical results.
Dimensionality reduction
Dimensionality reduction techniques are used to perform feature extraction. The dimensionality of a dataset refers to the attributes or individual features of the data. This technique aims to reduce the number of redundant features we consider in machine learning algorithms. Dimensionality reduction can be performed with techniques such as Principal Component Analysis (PCA).
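As a short sketch, PCA from scikit-learn can project the four features of the classic iris dataset down to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original features per sample
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)     # (150, 4) -> (150, 2)
```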
Data compression
By using encoding technologies, the size of the data can be reduced significantly. Compression can be either lossy or lossless. If the original data can be recovered exactly after reconstruction from the compressed data, it is called lossless compression; otherwise, it is lossy compression.
Discretization
Data discretization is used to divide attributes of a continuous nature into data with intervals. This
is done because continuous features tend to have a smaller chance of correlation with the target variable.
Thus, it may be harder to interpret the results. After discretizing a variable, groups corresponding to the
target can be interpreted. For example, the attribute age can be discretized into bins like below 18, 18-44, and above 44.
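A quick sketch with pandas; the ages and bin edges are illustrative:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 40, 45, 70])

# Discretize the continuous age attribute into labelled intervals
age_groups = pd.cut(ages, bins=[0, 18, 44, 120], right=False,
                    labels=["below 18", "18-44", "above 44"])
print(age_groups)
```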
Numerosity reduction
The data can be represented as a model or equation like a regression model. This would save the burden of storing the full dataset, since only the model's parameters need to be kept.
Attribute subset selection
It is important to be specific in the selection of attributes; otherwise, we may end up with high-dimensional data, which is difficult to train due to underfitting/overfitting problems. Only attributes that add more value towards model training should be considered, and the rest can be discarded.
Data Quality Assessment
Data Quality Assessment includes the statistical approaches one needs to follow to ensure that the data has no issues. Since the data is used for operations, customer management, marketing analysis, and decision making, it must be of high quality.
1. Data profiling: It involves exploring the data to identify data quality issues. Once the analysis of the issues is done, the data is summarized, noting problems such as duplicates and blank values.
2. Data monitoring: It involves maintaining the data in a clean state and continuously checking that the business needs are satisfied by the data.
2. Clustering
Clustering in text extraction refers to the process of grouping similar textual documents
or pieces of text together based on their content. It is a technique used in natural language
processing and information retrieval to organize and categorize large amounts of text
data. The goal of text clustering is to discover patterns, relationships, and themes within
the text that might not be immediately apparent.
In the context of text extraction, clustering involves identifying and grouping text
documents or segments that share common characteristics or topics. This process helps in
organizing and summarizing large text datasets, making it easier to analyze and extract
valuable insights from the text. Text clustering is often used as a preliminary step before
more detailed analysis, such as sentiment analysis, topic modeling, or document
summarization.
Grouping for Exploration: Text clustering allows researchers or analysts to explore large
text collections by organizing them into meaningful groups. These groups might
represent topics, themes, or categories present in the text.
Algorithm Variety: Various clustering algorithms can be applied to text data, including k-
means, hierarchical clustering, DBSCAN, and more. Each algorithm has its own
strengths and weaknesses, making them suitable for different types of text analysis tasks.
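As a hedged sketch, a handful of documents can be clustered by representing them as TF-IDF vectors and applying k-means with scikit-learn; the documents and the number of clusters are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the stock market fell sharply today",
    "investors worry about rising interest rates",
    "the football team won the championship",
    "the striker scored twice in the final match",
]

# Represent each document as a TF-IDF vector, then group similar documents
X = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents on the same theme should share a cluster label
```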
3. Probabilistic Models
Probabilistic models refer to a class of statistical techniques and models used to analyze and
extract information from textual data while incorporating uncertainty and probability
distributions. These models are based on the principles of probability theory and are
particularly useful when dealing with uncertain or ambiguous text data.
In the context of text extraction, probabilistic models are employed to capture the likelihood
of certain events or patterns occurring in the text. These models help estimate the
probabilities of different outcomes, making them suitable for tasks such as information
retrieval, text classification, sentiment analysis, and more.
Uncertainty Handling: Text data often contains inherent uncertainty, ambiguity, and
variability. Probabilistic models provide a framework to handle and quantify this uncertainty
by assigning probabilities to different outcomes.
Bayesian Framework: Many probabilistic models in text extraction are based on the Bayesian
framework, which allows for the incorporation of prior knowledge and updating of
probabilities as new evidence (textual information) is observed.
Text Classification: Probabilistic models are commonly used for text classification tasks,
where the goal is to assign documents to predefined categories or classes. These models
estimate the probability of a document belonging to each class and make predictions based
on these probabilities.
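As a concrete illustration, here is a small sketch of a probabilistic (multinomial naive Bayes) text classifier with scikit-learn; the tiny training set and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great product, works well", "terrible, broke after a day",
               "really happy with this", "awful quality, do not buy"]
train_labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Multinomial naive Bayes estimates P(class | document) from word counts
clf = MultinomialNB().fit(X_train, train_labels)

X_new = vectorizer.transform(["works really well, happy with it"])
print(clf.predict(X_new), clf.predict_proba(X_new))
```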
Topic Modeling: Topic modeling techniques, such as Latent Dirichlet Allocation (LDA), are
probabilistic models that aim to discover the underlying topics in a collection of text
documents. These models estimate the probability distribution of words in each topic and the
probability distribution of topics in each document.
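A brief sketch of fitting an LDA topic model with scikit-learn; the corpus and the number of topics are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "doctors study disease and medicine",
    "the hospital treats patients with new medicine",
    "the team scored a goal in the football match",
    "fans cheered as the match went to extra time",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic distribution (each row sums to 1)
print(lda.transform(counts))
```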
Information Retrieval: In information retrieval, probabilistic models are used to rank and
retrieve relevant documents based on their likelihood of containing the information a user is
seeking. Models like the Okapi BM25 and language models are probabilistic in nature.
Named Entity Recognition (NER): Probabilistic models can be used for named entity
recognition, where the goal is to identify and classify entities (such as names of people,
places, organizations) in text. These models estimate the probability of a sequence of words
being an entity.
Sentiment Analysis: For sentiment analysis, probabilistic models can estimate the likelihood
of a given text expressing positive, negative, or neutral sentiment.
Language Generation: In natural language generation tasks, probabilistic models are used to
generate coherent and contextually appropriate text. For instance, generative language
models like GPT-3 are based on probabilistic modeling principles.
Probabilistic models provide a robust framework for dealing with uncertainty in text
extraction tasks. They enable researchers and practitioners to make informed decisions and
predictions based on probabilities, making them valuable tools for various applications in the
field of natural language processing and text analysis.
4. Browsing and Query Refinement
Browsing and query refinement on the presentation layer refer to user-interface and interaction
techniques that enhance the way users explore and interact with textual data. These techniques
are designed to facilitate efficient information retrieval, exploration of relevant content, and
improvement of search results by allowing users to interact with the extracted information in a
more intuitive and user-friendly manner.
Key points about browsing and query refinement on the presentation layer in text
extraction:
Faceted Search: Faceted search is a browsing technique that allows users to filter and navigate
through search results using multiple attributes or facets. Each facet corresponds to a specific
characteristic or category associated with the extracted text, such as author, date, topic, or
keyword.
Query Expansion: Query refinement techniques enable users to expand or modify their queries
to improve search results. Suggestions for query expansion can be based on synonyms, related
terms, or previously viewed documents.
Feedback Loop: Browsing and query refinement techniques often incorporate user feedback to
improve search accuracy and relevance over time. The system adapts to user preferences and
behavior.
Overall, browsing and query refinement on the presentation layer in text extraction aim to
enhance the user experience, support efficient exploration of textual content, and help users find
the information they need more effectively. These techniques leverage user interactions to refine
and optimize search results, providing a seamless and interactive approach to accessing and
navigating through extracted text data.
5. Link Analysis
In the context of text extraction, link analysis focuses on understanding how different pieces
of text are interconnected and how they contribute to the overall meaning or context. It is
often used to enhance information retrieval, knowledge discovery, and content
summarization from large collections of text data.
Semantic Relationships: Link analysis helps reveal semantic relationships between text
elements, such as synonyms, antonyms, hypernyms, hyponyms, and co-occurring terms.
Understanding these relationships contributes to a deeper understanding of the content.
Co-Occurrence Analysis: Link analysis can identify words, phrases, or entities that
frequently appear together in the same context. This can help uncover associations between
concepts and topics.
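A simple sketch of counting term co-occurrence within the same sentence, in plain Python; the sentences and the naive whitespace tokenization are only for illustration:

```python
from collections import Counter
from itertools import combinations

sentences = [
    "data mining extracts patterns from data",
    "text mining extracts information from text",
    "patterns emerge from large text collections",
]

pair_counts = Counter()
for sentence in sentences:
    terms = sorted(set(sentence.lower().split()))
    # Count every unordered pair of distinct terms appearing in the same sentence
    pair_counts.update(combinations(terms, 2))

print(pair_counts.most_common(5))
```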
Topic Modeling: Link analysis can be used to construct topic models by analyzing the links
between documents or keywords. This assists in summarizing and organizing textual
information based on underlying themes.
Entity Resolution: Link analysis can aid in resolving references to the same entity across
different documents, helping to build a more comprehensive and accurate representation of
the information.
Hyperlinks and Citations: In the context of web-based text extraction, link analysis
involves analyzing hyperlinks and citations between web pages or documents. This can
provide insights into content relevance, authority, and influence.
Sentiment and Opinion Analysis: Link analysis can help identify connections between
sentiment-bearing terms, contributing to a more nuanced understanding of opinions
expressed in text.
Overall, link analysis in text extraction is a technique that aids in revealing the intricate web
of relationships between different textual elements. It provides a powerful tool for
understanding the structure, context, and meaning embedded within large volumes of text
data, enabling more effective information extraction and analysis.
6. Visualization Approaches and Their Operations
Visualization approaches and their operations refer to techniques that transform textual
data into graphical representations to facilitate a better understanding of patterns,
relationships, and insights within the text. These visualizations enable users to explore
and interpret complex textual information more effectively.
Word Clouds: Word clouds visually represent the frequency of words in a text document
or collection. The size of each word corresponds to its frequency, allowing users to
quickly identify prominent terms.
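The computation behind a word cloud is essentially a word-frequency count; a tiny sketch (a plotting library such as wordcloud could then scale words by these counts):

```python
from collections import Counter

text = "text mining turns raw text into insight and insight drives decisions"

# Count word frequencies; a word cloud scales each word by this count
frequencies = Counter(text.lower().split())
print(frequencies.most_common(3))
```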
Bar Charts and Histograms: Bar charts and histograms display the frequency
distribution of words, topics, or concepts. They provide insights into the relative
importance or occurrence of different elements.
Heatmaps: Heatmaps visualize the co-occurrence or similarity between words, topics, or
documents. Brighter colors indicate stronger relationships.
Topic Models: Visualizations of topic models (e.g., LDA) show the distribution of words
across identified topics. These visualizations help users understand the main themes
within the text.
Document Clusters: Visualizing document clusters helps users identify groups of related
documents. Clusters can be represented as groups of points on a 2D plane using
dimensionality reduction techniques.
Time-Series Plots: Time-series plots display how the frequency or sentiment of certain
terms changes over time. They can reveal trends and patterns in text data.
Tree Maps: Tree maps represent hierarchical relationships within the text, showing how
concepts or topics are nested and organized.