UNIT 3
Feature Engineering
Outline
● Feature Generation and Feature Selection (Extracting Meaning from Data); motivating application: user (customer) retention
● Feature Generation (brainstorming, role of domain expertise, and place for
imagination)
● Feature Selection algorithms.
Data
Data science is all about the data. If data quality is poor, even the most
sophisticated analysis would generate only lackluster results.
Data Preparation
● The tabular form is most commonly used to represent data for analysis
(see Table 1).
● Each row indicates a data point representing a single observation, and
each column shows a variable describing the data point.
● Variables are also known as attributes, features, or dimensions.
Variable Types
There are four main types of variables, and it is important to distinguish between them
to ensure that they are appropriate for our selected algorithms.
● Binary. This is the simplest type of variable, with only two possible options. In
Table 1, a binary variable is used to indicate if customers bought fish.
● Categorical. When there are more than two options, the information can be
represented via categorical variable. In Table 1, a categorical variable is used to
describe the customers’ species.
● Integer. These are used when the information can be represented as a whole
number. In Table 1, an integer variable is used to indicate the number of fruits
purchased by each customer.
● Continuous. This is the most detailed variable, representing numbers with
decimal places. In Table 1, a continuous variable is used to indicate the amount
spent by each customer.
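As a minimal sketch of these four variable types (the column names and values below are made up for illustration, since Table 1 itself is not reproduced here), a customer table might look like this in pandas:

```python
import pandas as pd

# Hypothetical stand-in for Table 1: one row per customer (data point),
# one column per variable describing that customer.
customers = pd.DataFrame({
    "bought_fish": [True, False, True],       # binary: only two possible values
    "species": ["cat", "dog", "rabbit"],      # categorical: more than two options
    "fruits_purchased": [3, 0, 5],            # integer: whole-number counts
    "amount_spent": [12.50, 0.0, 27.80],      # continuous: numbers with decimals
})
print(customers.dtypes)
```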
Variable Selection
● Variables fall into two broad groups:
○ Qualitative (categorical) data: nominal, ordinal
○ Quantitative (numeric) data: discrete or continuous
Nominal Data
● Nominal means “relating to names.”
● The values of a nominal attribute are symbols or names of things.
● Each value represents some kind of category, code, or state, and so nominal attributes
are also referred to as categorical.
● The values do not have any meaningful order.
● In computer science, the values are also known as enumerations.
Example: Nominal attributes.
● Suppose that hair color and marital status are two attributes describing person
objects. In our application, possible values for hair color are black, brown, blond, red,
auburn, gray, and white. The attribute marital status can take on the values single,
married, divorced, and widowed.
● Another example of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on.
Ordinal Data
● An ordinal attribute is an attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.
● Example: Ordinal attributes.
● Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This
ordinal attribute has three possible values: small, medium, and large. The values have a
meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the
values how much bigger, say, a large is than a medium.
● Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on)
● Professional ranks can be enumerated in a sequential order: for example, assistant, associate,
and full for professors, and private, private first class, specialist, corporal, and sergeant for
army ranks.
● Ordinal attributes are useful for registering subjective assessments of qualities that cannot be
measured objectively; thus ordinal attributes are often used in surveys for ratings.
● In one survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2:
neutral, 3: satisfied, and 4: very satisfied.
● Ordinal attributes may also be obtained from the discretization of numeric quantities by splitting
the value range into a finite number of ordered categories.
Numeric Attributes
● Ratio-Scaled Attributes
○ A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
○ That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value.
○ In addition, the values are ordered, and we can also compute the difference between
values, as well as the mean, median, and mode.
Discrete vs Continuous Attributes
● A discrete attribute has a finite or countably infinite set of values (e.g., the number of fruits purchased); a continuous attribute takes real-number values (e.g., the amount spent).
Why Do We Engineer Features?
● Improve User Experience: The primary reason we engineer features is to enhance the user experience of a product or
service. By adding new features, we can make the product more intuitive, efficient, and user-friendly, which can increase
user satisfaction and engagement.
● Competitive Advantage: Another reason we engineer features is to gain a competitive advantage in the marketplace. By
offering unique and innovative features, we can differentiate our product from competitors and attract more customers.
● Meet Customer Needs: We engineer features to meet the evolving needs of customers. By analyzing user feedback,
market trends, and customer behavior, we can identify areas where new features could enhance the product’s value and
meet customer needs.
● Increase Revenue: Features can also be engineered to generate more revenue. For example, a new feature that
streamlines the checkout process can increase sales, or a feature that provides additional functionality could lead to more
upsells or cross-sells.
● Future-Proofing: Engineering features can also be done to future-proof a product or service. By anticipating future trends
and potential customer needs, we can develop features that ensure the product remains relevant and useful in the long
term.
Feature Processing Techniques
● Common feature engineering techniques:
○ Binarization
○ Quantization
○ Scaling (normalization)
○ log transforms
Feature Processing Techniques
● Binarization:
○ Count data can be binarized by clipping all counts greater than 1 down to 1.
○ E.g., consider a music-listening dataset: a user can put a favorite song on an infinite loop.
○ So the number of times a song was listened to can be binarized: if the user listened to a song at least once, then
we count it as the user liking the song.
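A minimal sketch of this binarization step, assuming a NumPy array of listen counts (the variable names are illustrative):

```python
import numpy as np

# Hypothetical listen counts: one entry per (user, song) pair.
listen_counts = np.array([0, 1, 5, 120, 0, 3])

# Binarize: any count greater than 0 becomes 1 ("the user liked the song").
liked = (listen_counts > 0).astype(int)
print(liked)  # [0 1 1 1 0 1]
```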
Feature Processing Techniques
● Quantization or Binning
○ Types of Binning
■ Fixed-width binning
■ Quantile binning
Feature Processing Techniques
● Quantization or Binning
■ Fixed-width binning
● In fixed-width binning, each bin covers a specific numeric range.
● The ranges can be custom designed or automatically segmented, and they can be linearly scaled or exponentially scaled.
● E.g., age ranges by decade: 0–9 years old in bin 1, 10–19 years in bin 2, etc.
Feature Processing Techniques
● Quantization or Binning
■ Fixed-width binning
● Advantage: easy to compute.
● Disadvantage: if there are large gaps in the counts, there will be many empty bins with no data.
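A minimal sketch of fixed-width binning with NumPy (the ages and counts below are made up):

```python
import numpy as np

ages = np.array([3, 12, 25, 37, 48, 61, 89])

# Linearly scaled fixed-width bins of width 10: 0-9 -> bin 0, 10-19 -> bin 1, etc.
decade_bins = np.floor_divide(ages, 10)
print(decade_bins)  # [0 1 2 3 4 6 8]

# Exponentially scaled bins (useful for heavy-tailed counts): bin by powers of 10.
counts = np.array([1, 7, 42, 305, 9000])
exp_bins = np.floor(np.log10(counts)).astype(int)
print(exp_bins)     # [0 0 1 2 3]
```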
Feature Processing Techniques
● Quantization or Binning
■ Quantile binning
● Quantiles are values that divide the data into equal portions.
● Quantile binning positions the bins based on the distribution of the data.
● E.g., the median divides the data into halves: half the data points are smaller and half are larger than the median.
● The quartiles divide the data into quarters, the deciles into tenths, etc.
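A minimal sketch of quantile (quartile) binning with pandas (the data values are illustrative):

```python
import pandas as pd

spend = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

# Quartile binning: bin edges are placed so each bin holds ~25% of the data.
quartile_bin = pd.qcut(spend, q=4, labels=False)
print(quartile_bin.values)  # [0 0 0 1 1 2 2 3 3 3]
```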
Feature Processing Techniques
● Scaling (Normalization)
○ The attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
○ An attribute is normalized by scaling its values so that they fall within that small specified range.
○ Normalization techniques:
■ Min-max normalization: v' = (v - min) / (max - min), then rescaled to the desired new range.
■ Z-score normalization: v' = (v - mean) / standard deviation.
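A minimal sketch of both normalization techniques using scikit-learn (the feature values are made up); the same formulas can also be applied by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Min-max normalization: v' = (v - min) / (max - min), mapped here to [0, 1].
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X).ravel())
# [0.         0.33333333 0.66666667 1.        ]

# Z-score normalization: v' = (v - mean) / std.
print(StandardScaler().fit_transform(X).ravel())
# [-1.34164079 -0.4472136   0.4472136   1.34164079]
```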
Feature Processing Techniques
● Log transforms
○ The log function maps the small range of numbers in (0, 1) to the entire range of negative numbers (-∞, 0).
○ The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on.
○ The log function compresses the range of large numbers and expands the range of small numbers.
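A minimal sketch of a log transform on heavy-tailed counts (log1p is used here as a common variant that handles zeros safely):

```python
import numpy as np

# Heavy-tailed counts: a few very large values dominate the raw scale.
counts = np.array([0, 1, 10, 100, 10000])

# log1p(x) = log(1 + x) compresses large values while leaving zeros defined.
print(np.log1p(counts))
# [0.         0.69314718 2.39789527 4.61512052 9.21044037]
```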
Feature Selection
● Feature selection techniques prune away non-useful features in order to reduce the complexity of the resulting
model.
● The goal is to make a model that is quicker to compute, with little or no degradation in predictive accuracy.
● For making such a model, different feature selection techniques can be used:
○ Filtering
○ Wrapper methods
○ Embedded methods
Feature Selection Techniques
● Filtering
○ Filtering techniques preprocess features to remove those that are unlikely to be useful for the model.
○ For example, the correlation or mutual information between each feature and the response variable can be computed, and features that fall below a threshold can be filtered out.
○ Because filtering does not consult the model itself, it may not select the features that are most useful for that model.
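A minimal sketch of filter-style selection using mutual information with scikit-learn (the synthetic dataset and the 0.01 threshold are illustrative choices, not part of the source):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Score each feature against the response, then keep those above a threshold.
scores = mutual_info_classif(X, y, random_state=0)
keep = scores > 0.01              # illustrative threshold
X_filtered = X[:, keep]
print(scores.round(3), X_filtered.shape)
```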
Feature Selection Techniques
● Wrapper methods
○ Wrapper methods will not accidentally prune away features that are uninformative by themselves but useful when taken in combination.
○ The wrapper method treats the model as a black box that provides a quality score for a proposed subset of features.
● Embedded Methods
○ These methods perform feature selection as part of the model training process.
○ For example, a decision tree inherently performs feature selection because it selects one feature on which to split the tree at each training step.
○ Compared to filtering, embedded methods select features that are specific to the model.
○ Embedded methods strike a balance between computational expense and quality of results.
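A minimal sketch contrasting a wrapper method (recursive feature elimination) with an embedded method (importances from a tree ensemble); the dataset and feature counts are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Wrapper: the model is treated as a black box that scores candidate subsets.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("wrapper keeps:", rfe.support_)

# Embedded: selection happens inside training, e.g. tree-based importances.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("embedded importances:", forest.feature_importances_.round(3))
```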
Example of a text data
● Emma knocked on the door. No answer. She knocked again and waited.
There was a large maple tree next to the house. Emma looked up the tree
and saw a giant raven perched at the tree top. Under the afternoon sun, the
raven gleamed magnificently. Its beak was hard and pointed, its claws sharp
and strong. It looked regal and imposing. It reigned the tree it stood on. The
raven was looking straight at Emma with its beady black eyes. Emma felt
slightly intimidated. She took a step back from the door and tentatively said,
“Hello?”
● The paragraph contains a lot of information.
● In a bag-of-words representation, each document becomes a vector with an entry for every possible word in the vocabulary.
● E.g., a small vocabulary might be:
○ “it”
○ “was”
○ “the”
○ “best”
○ “of”
○ “times”
○ “worst”
○ “age”
○ “wisdom”
○ “foolishness”
Bag-of-Words Model
● EG:
○ The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.
■ “it” = 1
■ “was” = 1
■ “the” = 1
■ “best” = 1
■ “of” = 1
■ “times” = 1
■ “worst” = 0
■ “age” = 0
■ “wisdom” = 0
■ “foolishness” = 0
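A minimal sketch of the bag-of-words transformation with scikit-learn's CountVectorizer, using a small toy corpus built from the vocabulary above (the documents themselves are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["it was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

# Each document becomes a vector with one count per vocabulary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```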
Bag-of-Words Model
● EG:
○ A 2-gram (bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”,
○ A 3-gram (trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”
● After tokenization, the counting mechanism can collate individual tokens into word counts, or
count overlapping sequences as n-grams.
● For example, the sentence “Emma knocked on the door” generates the 2-grams (bigrams) “Emma
knocked,” “knocked on,” “on the,” and “the door.”
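A minimal sketch of bigram counting for that sentence (note that CountVectorizer lowercases text by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["Emma knocked on the door"]

# ngram_range=(2, 2) counts overlapping two-word sequences instead of single words.
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(sentence)
print(bigrams.get_feature_names_out())
# ['emma knocked' 'knocked on' 'on the' 'the door']
```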
Processing Text Features
● Text cleaning techniques :
○ Ignoring case
○ Ignoring punctuation
○ Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
○ Classification and retrieval do not usually require an in-depth understanding of the text.
○ In the sentence “Emma knocked on the door,” the words “on” and “the” don’t change the
fact that this sentence is about a person and a door.
○ For coarse-grained tasks such as classification, the pronouns, articles, and prepositions
may not add much value.
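A minimal sketch of these cleaning steps in plain Python (the stop-word list here is deliberately tiny and illustrative):

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "on"}    # illustrative, not exhaustive

def clean(text):
    text = text.lower()                         # ignore case
    text = re.sub(r"[^\w\s]", " ", text)        # ignore punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words

print(clean("Emma knocked on the door."))
# ['emma', 'knocked', 'door']
```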
Processing Text Features
● Frequency based Filtering
○ Once the vocabulary has been built, the occurrence of each word in the example documents needs to be scored.
○ Frequencies: calculate the frequency with which each word appears in a document, out of all the words in that document.
Processing Text Features
● Frequency based Filtering
○ Rare words:
■ To a statistical model, a word that appears in only one or two documents is more like noise than useful
information.
■ Rare words can be easily identified and trimmed based on word count statistics.
■ Their counts can be aggregated into a special garbage bin, which can serve as an additional feature.
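A minimal sketch of trimming rare words by document frequency, using CountVectorizer's min_df parameter (the tiny corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the raven perched on the maple tree",
        "the raven looked at emma",
        "emma looked at the tree"]

# min_df=2 drops words that appear in fewer than 2 documents (the rare words).
vec = CountVectorizer(min_df=2)
vec.fit(docs)
print(vec.get_feature_names_out())
# rare words like 'perched' and 'maple' are trimmed away
```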
Processing Text Features
○ Parsing:
■ Parsing is necessary when the string contains more than plain text.
■ For example, if the raw data is a web page, an email, or a log of some sort, then it contains additional structure.
■ If the document is a web page, then the parser needs to handle URLs.
■ If it is an email, then fields like From, To, and Subject may require special handling; otherwise these headers will end up as normal words in the final count, which may not be useful.
Processing Text Features
○ After parsing, the plain-text portion of the document can go through tokenization.
○ The tokenizer needs to know what characters indicate that one token has ended and another
is beginning.
○ If the text contains tweets, then hash marks (#) should not be used as separators (delimiters)
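A minimal sketch of a tokenizer that keeps hash marks attached to hashtags instead of treating them as delimiters (the regular expression is an illustrative choice):

```python
import re

tweet = "Great game last night #GoTeam #Finals"

# \w+ splits on whitespace/punctuation; the optional leading '#' keeps hashtags intact.
tokens = re.findall(r"#?\w+", tweet)
print(tokens)
# ['Great', 'game', 'last', 'night', '#GoTeam', '#Finals']
```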
Processing Text Features
○ Complex text featurization methods like word2vec also work with sentences or paragraphs.
○ Here, documents are first parsed into sentences, and each sentence is further tokenized into words.
Processing Text Features
○ Words together can mean more than the sum of their parts (e.g., “The Times of India,” “disk drive”); such phrases can be detected via:
■ Frequency-based methods
■ Chunking and part-of-speech tagging
Categorical Variable
○ A categorical variable is used to represent categories, e.g., the seasons in a year.
○ The values of a categorical variable cannot be ordered with respect to one another; they are called nonordinal.
● The vocabulary of a document corpus can be interpreted as a large categorical variable, with the categories being unique words.
Encoding Categorical Variables
○ One-Hot Encoding
○ Dummy Coding
○ Effect Coding
Encoding Categorical Variables
○ One-Hot Encoding
■ Each of the k categories gets its own bit (feature), and exactly one bit is "on" for any data point.
■ Mathematically, one can write this constraint as “the sum of all bits must be equal to 1”: e1 + e2 + ... + ek = 1.
■ One-hot encoding is redundant, which allows for multiple valid models for the same problem.
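A minimal sketch of one-hot encoding with pandas (the seasons column is illustrative):

```python
import pandas as pd

seasons = pd.DataFrame({"season": ["spring", "summer", "autumn", "winter", "summer"]})

# One bit per category; exactly one bit is 1 in each row (the bits sum to 1).
one_hot = pd.get_dummies(seasons["season"], dtype=int)
print(one_hot)
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```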
Encoding Categorical Variables
○ Dummy Coding
■ Dummy coding uses only k-1 features; the category left out is represented by the all-zeros vector and serves as the reference category.
■ It cannot easily handle missing data, since the all-zeros vector is already mapped to the reference category.
■ It also encodes the effect of each category relative to the reference category.
Encoding Categorical Variables
○ Effect Coding
■ Effect coding is similar to dummy coding, except that the reference category is represented by a vector of all -1s instead of all 0s.
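A minimal sketch contrasting dummy coding and effect coding, built on top of pandas (the column values are illustrative; effect coding is constructed by hand since pandas has no built-in for it):

```python
import pandas as pd

seasons = pd.Series(["spring", "summer", "autumn", "winter"], name="season")

# Dummy coding: k-1 columns; the dropped category (here 'autumn', alphabetically
# first) becomes the all-zeros reference category.
dummy = pd.get_dummies(seasons, drop_first=True, dtype=int)
print(dummy)

# Effect coding: same k-1 columns, but the reference category's row is all -1s.
effect = dummy.copy()
effect.loc[(dummy == 0).all(axis=1)] = -1
print(effect)
```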
Feature Reduction
● Feature reduction is the process of reducing the dimension of the feature space.
● Its goal is to streamline the number of features our model has to ingest without losing important information.
https://www.engati.com/glossary/feature-engineering