UNIT 3

Chapter -3

Feature Engineering
Outline
● Feature Generation and Feature Selection (Extracting Meaning from Data)-
Motivating application: user (customer) retention
● Feature Generation (brainstorming, role of domain expertise, and place for
imagination)
● Feature Selection algorithms.
Data

Fig. Different types of real time data


Data Preparation

Data science is all about the data. If data quality is poor, even the most
sophisticated analysis would generate only lackluster results.
Data Preparation

● The tabular form is most commonly used to represent data for analysis
(see Table 1).
● Each row indicates a data point representing a single observation, and
each column shows a variable describing the data point.
● Variables are also known as attributes, features, or dimensions.
Variable Types
There are four main types of variables, and it is important to distinguish between them
to ensure that they are appropriate for our selected algorithms.
● Binary. This is the simplest type of variable, with only two possible options. In
Table 1, a binary variable is used to indicate if customers bought fish.
● Categorical. When there are more than two options, the information can be
represented via categorical variable. In Table 1, a categorical variable is used to
describe the customers’ species.
● Integer. These are used when the information can be represented as a whole
number. In Table 1, an integer variable is used to indicate the number of fruits
purchased by each customer.
● Continuous. This is the most detailed variable, representing numbers with
decimal places. In Table 1, a continuous variable is used to indicate the amount
spent by each customer.
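To make the four variable types concrete, here is a minimal Python sketch using pandas (assumed available). Table 1 itself is not reproduced in the text, so the values below are hypothetical stand-ins for its columns.

```python
import pandas as pd

# Hypothetical stand-in for Table 1: one row per customer (data point),
# one column per variable (attribute / feature / dimension)
customers = pd.DataFrame({
    "species":       ["rabbit", "wolf", "giraffe", "bear"],   # categorical
    "bought_fish":   [0, 1, 0, 1],                            # binary
    "fruits_bought": [4, 0, 7, 2],                            # integer
    "amount_spent":  [3.50, 12.80, 6.25, 15.10],              # continuous
})

print(customers)
print(customers.dtypes)  # shows how each variable type is stored
```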
Variable Selection

● While we might be handed an original dataset that contains many variables,
throwing too many variables into an algorithm might lead to slow
computation, or wrong predictions due to excess noise.
● Hence, we need to shortlist the important variables. Selecting variables is
often a trial-and-error process, with variables swapped in or out based on
feedback from our results. As a start, we can use simple plots to examine
correlations between variables, with promising ones selected for further
analysis.
Feature Engineering
● Sometimes, however, the best variables need to be engineered.
● For example, if we wanted to predict which customers in Table 1
would avoid fish, we could look at the customer species variable to
determine that rabbits, horses and giraffes would not purchase fish.
● Nonetheless, if we had grouped customer species into broader categories
of herbivores, omnivores and carnivores, we could reach a more
generalized conclusion: herbivores do not buy fish.
● Besides recoding a single variable, we could also combine multiple
variables in a technique known as dimension reduction
● Dimension reduction can be used to extract the most useful
information and condense that into a new but smaller set of variables
for analysis.
Types of Data

Qualitative (Categorical) Data:
  ● Nominal (e.g., hair color: red, brown, black)
  ● Ordinal (e.g., grades: AA, AB, BB, BC, …)
Quantitative Data:
  ● Discrete (e.g., number of students in a class, number of correctly answered questions in a question paper)
  ● Continuous
    ○ Interval scaled
    ○ Ratio scaled


Data

● Nominal
● Ordinal
● Numeric
● Categorical
● Discrete or continuous
Nominal Data
● Nominal means “relating to names.”
● The values of a nominal attribute are symbols or names of things.
● Each value represents some kind of category, code, or state, and so nominal attributes
are also referred to as categorical.
● The values do not have any meaningful order.
● In computer science, the values are also known as enumerations.
Example Nominal attributes.
● Suppose that hair color and marital status are two attributes describing person
objects. In our application, possible values for hair color are black, brown, blond, red,
auburn, gray, and white. The attribute marital status can take on the values single,
married, divorced, and widowed.
● Another example of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on
Ordinal Data
● An ordinal attribute is an attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.
● Ordinal attributes.
● Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This
ordinal attribute has three possible values: small, medium, and large. The values have a
meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the
values how much bigger, say, a medium is than a large.
● Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on)
● Professional ranks can be enumerated in a sequential order: for example, assistant, associate,
and full for professors, and private, private first class, specialist, corporal, and sergeant for
army ranks.
● Ordinal attributes are useful for registering subjective assessments of qualities that cannot be
measured objectively; thus ordinal attributes are often used in surveys for ratings.
● In one survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2:
neutral, 3: satisfied, and 4: very satisfied.
● Ordinal attributes may also be obtained from the discretization of numeric quantities by splitting
the value range into a finite number of ordered categories
Numeric Attributes

● A numeric attribute is quantitative; that is, it is a measurable quantity,


represented in integer or real values.
● Numeric attributes can be interval-scaled or ratio-scaled.
● Interval-Scaled Attributes
○ Interval-scaled attributes are measured on a scale of equal-size units.
○ The values of interval-scaled attributes have order and can be positive, 0, or negative
○ Thus, in addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.
○ EG: Interval-scaled attributes. A temperature attribute is interval-scaled. Suppose that
we have the outdoor temperature value for a number of different days, where each day is
an object. By ordering the values, we obtain a ranking of the objects with respect to
temperature.
Numeric Attributes

● Ratio-Scaled Attributes
○ A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
○ That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value.
○ In addition, the values are ordered, and we can also compute the difference between
values, as well as the mean, median, and mode.
Discrete vs Continuous Attributes

● A discrete attribute has a finite or countably infinite set of values, which
may or may not be represented as integers.
● The attributes hair color, smoker, medical test, and drink size each have a
finite number of values, and so are discrete.
● If an attribute is not discrete, it is continuous.
● In practice, real values are represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Discrete vs Continuous Attributes
Data

● Each piece of data provides a small window into a
limited aspect of reality.
● The collection of all of these observations gives us a
picture of the whole.
● But the picture is messy because it is composed of a
thousand little pieces, and there’s always
measurement noise and missing pieces.
Feature
● A feature is a numeric representation of data.
● Raw data can be changed into numeric measurements in many ways.
● Feature engineering is the process of formulating the most appropriate
features given the data.
● If there are not enough informative features, then the model will be
unable to perform the ultimate task.
● If there are too many features, or if most of them are irrelevant, then
the model will be more expensive and tricky to train.
● Features and models sit between raw data and the desired insights.
Feature
● In machine learning and pattern recognition, a feature is an individual
measurable property or characteristic of a phenomenon.
● Consider our training data as a matrix where each row is a vector and
each column is a dimension.
● For example consider the matrix for the data x1=(1, 9, 8), x2=(2, 6, 0),
and x3=(1, 3, 1)
● We call each dimension a feature or a column in our matrix.
● Choosing informative, discriminating and independent features is a
crucial element of effective algorithms in pattern recognition, and
machine learning.
Feature
● We need feature engineering as part of the machine learning process.
● It has three key components – feature construction, feature selection, and
feature transformation.
● A feature is an attribute of a data set that is used in a machine learning process.
● There is a view amongst certain machine learning practitioners that only those
attributes which are meaningful to a machine learning problem are to be called
features, but this view has to be taken with a pinch of salt.
● In fact, selection of the subset of features which are meaningful for machine
learning is a sub-area of feature engineering which draws a lot of research
interest.
● The features in a data set are also called its dimensions. So a data set having ‘n’
features is called an n-dimensional data set.
Feature

Fig. The place of feature engineering in the machine learning workflow


Feature
● In a machine learning workflow, both model and features are to be picked.
● It can be considered as a double-jointed lever, and the choice of one affects
the other.
● Good features make the subsequent modeling step easy and the resulting
model more capable of completing the desired task.
● Good features should not only represent salient aspects of the data, but also
conform to the assumptions of the model.
● Bad features may require a much more complicated model to achieve the
same level of performance.
● For selecting good features, feature engineering is required.
Need for Feature Engineering in Machine Learning?
We engineer features for various reasons, and some of the main reasons include:

● Improve User Experience: The primary reason we engineer features is to enhance the user experience of a product or
service. By adding new features, we can make the product more intuitive, efficient, and user-friendly, which can increase
user satisfaction and engagement.
● Competitive Advantage: Another reason we engineer features is to gain a competitive advantage in the marketplace. By
offering unique and innovative features, we can differentiate our product from competitors and attract more customers.
● Meet Customer Needs: We engineer features to meet the evolving needs of customers. By analyzing user feedback,
market trends, and customer behavior, we can identify areas where new features could enhance the product’s value and
meet customer needs.
● Increase Revenue: Features can also be engineered to generate more revenue. For example, a new feature that
streamlines the checkout process can increase sales, or a feature that provides additional functionality could lead to more
upsells or cross-sells.
● Future-Proofing: Engineering features can also be done to future-proof a product or service. By anticipating future trends
and potential customer needs, we can develop features that ensure the product remains relevant and useful in the long
term.
Feature Processing Techniques
● Common feature engineering techniques:

○ Binarization

○ Quantization

○ Scaling (normalization)

○ log transforms
Feature Processing Techniques
● Binarization:

○ The data values can be binarized by clipping all counts greater than 1 to 1.

○ Eg: Consider a music dataset in which a user can put his favorite song on infinite loop.

○ So, the number of times a song is listened to can be binarized: if the user listened to a song at least once, then
we count it as the user liking the song.

○ i.e. like (1) or dislike (0)
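A minimal Python sketch of this kind of binarization, assuming NumPy is available; the listen counts are hypothetical values.

```python
import numpy as np

# Hypothetical listen counts for five songs by one user
listen_counts = np.array([0, 3, 1, 12, 0])

# Binarize: any count of at least 1 becomes 1 (user "likes" the song), else 0
liked = (listen_counts >= 1).astype(int)
print(liked)  # [0 1 1 1 0]
```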


Feature Processing Techniques

● Binarization:

Fig. Binarization of an Image


Feature Processing Techniques
● Common feature engineering techniques:

○ Binarization

○ Quantization (Binning)

○ Scaling (normalization)

○ log transforms
Feature Processing Techniques
● Quantization or Binning

○ Group the counts into bins

○ Quantization maps a continuous number to a discrete one.

○ Types of Binning

■ Fixed-width binning
■ Quantile binning
Feature Processing Techniques

● Quantization or Binning
■ Fixed-width binning
● In fixed-width binning, each bin contains a specific
numeric range.
● The ranges can be custom designed or
automatically segmented, and they can be
linearly scaled or exponentially scaled.
● Eg: age ranges by decade: 0–9 years old in bin 1,
10–19 years in bin 2, etc.
Feature Processing Techniques

● Quantization or Binning

■ Fixed-width binning

Fig. 6 Binning example


Feature Processing Techniques

● Quantization or Binning

■ Fixed-width binning
● Adv.: Easy to compute
● Disadv. : if there are large gaps in the
counts, then there will be many empty bins
with no data
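A short sketch of fixed-width binning with NumPy (assumed available); the ages and counts arrays are hypothetical, and the second example shows exponentially scaled (powers-of-10) bins.

```python
import numpy as np

ages = np.array([3, 17, 25, 38, 41, 69, 88])  # hypothetical ages

# Fixed-width bins of width 10: ages 0-9 -> bin 0, 10-19 -> bin 1, ...
decade_bins = ages // 10
print(decade_bins)  # [0 1 2 3 4 6 8]

# Exponentially scaled bins for heavily skewed counts (bin = order of magnitude)
counts = np.array([1, 8, 120, 4500, 90000])  # hypothetical counts
log_bins = np.floor(np.log10(counts)).astype(int)
print(log_bins)  # [0 0 2 3 4]
```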
Feature Processing Techniques
● Quantization or Binning

■ Quantile binning

● Quantiles are values that divide the data into equal portions
● This is done by positioning the bins based on the distribution of the
data
● EG: the median divides the data in halves; half the data points are smaller
and half larger than the median.
● The quartiles divide the data into quarters, the deciles into tenths, etc.
Feature Processing Techniques

● Quantization or Binning

■ Quantile binning
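A small sketch of quantile binning with pandas (assumed available); the counts series is hypothetical, and pd.qcut places the bin edges at quantiles of the data so each bin holds roughly the same number of points.

```python
import pandas as pd

# Hypothetical skewed counts
counts = pd.Series([1, 2, 2, 3, 5, 8, 13, 40, 250, 10000])

# Quartile binning: 4 bins, each holding roughly a quarter of the data points
quartile_bins = pd.qcut(counts, q=4, labels=False)
print(quartile_bins.tolist())  # [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
```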
Feature Processing Techniques
● Common numeric feature engineering techniques:

○ Binarization

○ Quantization

○ Scaling (normalization)

○ log transforms (a type of power transform)


Feature Processing Techniques
● Normalization

○ where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

○ An attribute is normalized by scaling its values so that they fall within a small specified range

○ Normalization Techniques:

■ min-max normalization,

■ z-score normalization,

■ normalization by decimal scaling.


Feature Processing Techniques
● Normalization

■ Min-Max normalization

● Min-max normalization performs a linear transformation on the
original data.
● Suppose that minA and maxA are the minimum and maximum
values of an attribute, A.
● Min-max normalization maps a value, v, of A to v′ in the range
[new minA, new maxA] by computing

  v′ = ((v − minA) / (maxA − minA)) × (new maxA − new minA) + new minA
Feature Processing Techniques
● Normalization
■ Min-Max normalization
● Min-max normalization preserves the relationships
among the original data values.
● It will encounter an “out-of-bounds” error if a future
input case for normalization falls outside of the
original data range for A.
Feature Processing Techniques

● Normalization

■ z-score normalization

● Also called zero-mean normalization


● The values for an attribute, A, are normalized based on the mean
and standard deviation of A.
● A value, v, of A is normalized to v′ by computing

  v′ = (v − Ā) / σA

  where Ā and σA are the mean and standard deviation of A.
● This method of normalization is useful when the actual minimum
and maximum of attribute A are unknown, or when there are
outliers that dominate the min-max normalization.
Feature Processing Techniques
● Normalization
■ Normalization by decimal scaling
● Normalizes by moving the decimal point of values of
attribute A.
● The number of decimal places moved depends on the
maximum absolute value of A.
● A value, v, of A is normalized to v′ by computing

  v′ = v / 10^j

  where j is the smallest integer such that max(|v′|) < 1.
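A compact sketch of the three normalization techniques with NumPy (assumed available); the attribute values are hypothetical.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical values of attribute A

# Min-max normalization to the new range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# z-score (zero-mean) normalization
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that the maximum absolute value is below 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / (10 ** j)

print(min_max)         # [0.    0.125 0.25  0.5   1.   ]
print(z_score)
print(decimal_scaled)  # [0.02 0.03 0.04 0.06 0.1 ]
```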
Feature Processing Techniques
● Common feature engineering techniques:

○ Binarization

○ Quantization

○ Scaling (normalization)

○ log transforms
Feature Processing Techniques

● Log transforms

○ The log function is the inverse of the exponential function.

○ It is defined such that loga(a^x) = x,

○ where a is a positive constant, and x can be any positive number.

○ Since a^0 = 1, we have loga(1) = 0.

○ The log function maps the small range of numbers between (0, 1) to the
entire range of negative numbers (–∞, 0).
○ The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on.

○ The log function compresses the range of large numbers and expands the range of small numbers.
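A brief sketch of a log transform on skewed counts with NumPy (assumed available); log10(1 + x) is used here so that zero counts stay well defined.

```python
import numpy as np

counts = np.array([0, 1, 10, 100, 10000])  # hypothetical heavily skewed counts

# log10(1 + x) compresses large counts and keeps zero counts valid
log_counts = np.log10(1 + counts)
print(log_counts)  # approx. [0.  0.30  1.04  2.00  4.00]
```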
Feature Selection
● Feature selection techniques prune away non-useful features in order to reduce the complexity of the resulting
model.

● The goal is to make a model that is quicker to compute, with little or no degradation in predictive accuracy.

● For making such a model, different feature selection techniques can be used:

○ Filtering

○ Wrapper methods

○ Embedded methods
Feature Selection Techniques
● Filtering

○ Filtering techniques preprocess features to remove the features that are unlikely to be useful for the model.

○ For example, the correlation or mutual information between each feature and the response variable
can be computed, and the features that fall below a threshold can be filtered out.

○ Filtering techniques are much cheaper than wrapper methods.

○ However, they may not be able to select the right features for the model.
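A minimal sketch of the filtering idea using scikit-learn (assumed available): score each feature by its mutual information with the response and keep only the top-scoring ones. The feature matrix and labels are synthetic, hypothetical data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))             # hypothetical features
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # response depends only on features 0 and 2

# Keep the 2 features with the highest mutual information with y
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # expected to pick features 0 and 2
```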
Feature Selection Techniques
● Wrapper methods

○ They allow us to evaluate subsets of features.

○ This method will not accidentally prune away features that are uninformative by themselves but useful when taken in
combination.

○ The wrapper method treats the model as a black box that provides a quality score for a proposed subset of features.

○ A separate method iteratively refines the subset.


Feature Selection Techniques

● Embedded Methods

○ These methods perform feature selection as part of the model training process.

○ For example, a decision tree inherently performs feature selection because it selects one feature
on which to split the tree at each training step.

○ Embedded methods incorporate feature selection as part of the model training process.

○ They are not as powerful as wrapper methods,

○ but they are also not as expensive.

○ Compared to filtering, embedded methods select features that are specific to the model.

○ Embedded methods strike a balance between computational expense and quality of results
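A short sketch contrasting a wrapper method (recursive feature elimination around a black-box model) with an embedded method (a decision tree's built-in feature importances), using scikit-learn (assumed available) on synthetic, hypothetical data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))             # hypothetical features
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # only features 0 and 2 matter

# Wrapper: RFE treats the model as a black box and iteratively drops features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.get_support(indices=True))       # likely features 0 and 2

# Embedded: the tree performs feature selection while it is being trained
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(np.argsort(tree.feature_importances_)[::-1][:2])  # two most important features
```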
Example of a text data
● Emma knocked on the door. No answer. She knocked again and waited.
There was a large maple tree next to the house. Emma looked up the tree
and saw a giant raven perched at the tree top. Under the afternoon sun, the
raven gleamed magnificently. Its beak was hard and pointed, its claws sharp
and strong. It looked regal and imposing. It reigned the tree it stood on. The
raven was looking straight at Emma with its beady black eyes. Emma felt
slightly intimidated. She took a step back from the door and tentatively said,
“Hello?”
● The paragraph contains a lot of information.

● Which parts of this paragraph of information are salient (important)
features?

● So, we can apply feature processing on this piece of text


Bag-of-words

● A bag-of-words is a representation of text that describes the occurrence of
words within a document.

● It involves two things:

○ A vocabulary of known words.

○ A measure of the presence of known words.


Bag-of-words

● In bag-of-words (BoW) featurization, a text document is converted into a
vector of counts. (A vector is just a collection of n numbers.)

● The vector contains an entry for every possible word in the vocabulary.

● If the word—say, “is”—appears three times in the document, then the
feature vector has a count of 3 in the position corresponding to that word.

● If a word in the vocabulary doesn’t appear in the document, then it gets
a count of 0.
Bag-of-words

Fig. Turning raw text into a bag-of-words representation


Bag-of-Words Model

● EG:

● Data: “It was the best of times,


it was the worst of times,
it was the age of wisdom,
it was the age of foolishness”,

● Design the Vocabulary

● The unique words here (ignoring case and punctuation) are:

○ “it”

○ “was”

○ “the”

○ “best”

○ “of”

○ “times”

○ “worst”

○ “age”

○ “wisdom”

○ “foolishness”
Bag-of-Words Model

● EG:

● Create Document Vectors

○ The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.

○ Vector for first line: “It was the best of times”

■ “it” = 1

■ “was” = 1

■ “the” = 1

■ “best” = 1

■ “of” = 1

■ “times” = 1

■ “worst” = 0

■ “age” = 0

■ “wisdom” = 0

■ “foolishness” = 0
Bag-of-Words Model

● EG:

○ Similarly for other lines:

○ "it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

○ "it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

○ "it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]


○ These docs can also be represented as a document-term matrix:

        it  was  the  best  of  times  worst  age  wisdom  foolishness
Doc 1    1    1    1     0   1      1      1    0       0            0
Doc 2    1    1    1     0   1      0      0    1       1            0
Doc 3    1    1    1     0   1      0      0    1       0            1

Table 1 : An example document-term matrix
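The same document-term matrix can be produced with scikit-learn's CountVectorizer (assumed available, recent version); note that its columns come out in alphabetical order rather than in the order shown in Table 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectorizer = CountVectorizer()        # lowercases and strips punctuation by default
X = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary (alphabetical order)
print(X.toarray())                         # one row per document, one column per word
```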


Bag-of-words
● Bag-of-words converts a text document into a flat vector.
● It is “flat” because it doesn’t contain any of the original
textual structures.
● The original text is a sequence of words.
● But a bag-of-words has no sequence; it just remembers how
many times each word appears in the text.
Bag-of-words

Fig. Two equivalent BoW vectors


Bag-of-words
● In a bag-of-words vector, each word becomes a
dimension of the vector.
● If there are n words in the vocabulary, then a document
becomes a point in n-dimensional space.
• The fig. depicts data vectors in feature space.
• The axes denote individual words, which are features in the bag-of-words
representation, and the points in space denote data points (text documents).
• The fig. shows what our example sentence looks like in the
two-dimensional feature space.
Bag-of-words
● Drawbacks:
○ Bag-of-words is not perfect.
○ Breaking down a sentence into single words can destroy the semantic meaning.
○ For instance, “not bad” semantically means “decent” or “good”
○ But “not” and “bad” constitute a negation plus a negative sentiment.
○ “toy dog” and “dog toy” could be very different things and the meaning is lost with the singleton words “toy” and
“dog.”
Bag-of-n-Grams

● Bag-of-n-Grams is an extension of bag-of-words.

● Each word or token is called a “gram”.

● An n-gram is a sequence of n tokens.

● A word is a 1-gram, also known as a unigram.

● An N-gram is an N-token sequence of words:

○ A 2-gram (bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”,

○ A 3-gram (trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”

● After tokenization, the counting mechanism can collate individual tokens into word counts, or
count overlapping sequences as n-grams.

● For example, the sentence “Emma knocked on the door” generates the n-grams “Emma
knocked,” “knocked on,” “on the,” and “the door.”
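A small sketch of bag-of-n-grams with CountVectorizer (assumed available); ngram_range=(2, 2) counts the overlapping bigrams from the example sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Emma knocked on the door"]

# Count overlapping 2-token sequences (bigrams) instead of single words
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(docs)

print(bigram_vectorizer.get_feature_names_out())
# ['emma knocked' 'knocked on' 'on the' 'the door']
```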
Processing Text Features
● Text cleaning techniques :

○ Ignoring case

○ Ignoring punctuation

○ Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.

○ Reducing words to their stem (e.g. “play” from “playing”).
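A rough sketch of these cleaning steps using only the Python standard library; the stop-word list and the crude suffix-stripping "stemmer" are simplified stand-ins for real tools (e.g., NLTK's stop-word list and Porter stemmer).

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "of", "on", "and", "it", "was"}  # toy stop-word list

def clean(text):
    text = text.lower()                                                 # ignore case
    text = text.translate(str.maketrans("", "", string.punctuation))    # ignore punctuation
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]                 # drop stop words
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]            # crude stemming
    return tokens

print(clean("Emma was playing on the stairs, and the doors opened."))
# ['emma', 'play', 'stair', 'door', 'open']
```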


Processing Text Features
● Stopwords:

○ Classification and retrieval do not usually require an in-depth understanding of the text.

○ In the sentence “Emma knocked on the door,” the words “on” and “the” don’t change the
fact that this sentence is about a person and a door

○ For coarse-grained tasks such as classification, the pronouns, articles, and prepositions
may not add much value.
Processing Text Features
● Frequency based Filtering

○ After the vocabulary has been built, the occurrence of words in the example documents needs to be
scored.

○ Counts: Count the number of times each word appears in a document.

○ Frequencies: Calculate the frequency that each word appears in a document out of all the words in
the document.
Processing Text Features
● Frequency based Filtering

○ Rare words:

■ need to filter out rare words

■ To a statistical model, a word that appears in only one or two documents is more like noise than useful
information.

■ Rare words can be easily identified and trimmed based on word count statistics.

■ Their counts can be aggregated into a special garbage bin, which can serve as an additional feature
Processing Text Features

● Frequency based Filtering

○ Rare words:

Fig. Bag-of-words feature vector with a garbage bin


Processing Text Features

● Parsing and Tokenization

○ how does a computer know what a word is?

○ Parsing is necessary when the string contains more than plain text

○ For eg, if the raw data is a web page, an email, or a log of some sort, then it contains
additional structure.

■ If the document is a web page, then the parser needs to handle URLs.
■ If it is an email, then fields like From, To, and Subject may require
special
handling—otherwise these headers will end up as normal words in the final
count, which may not be useful.
Processing Text Features

● Parsing and Tokenization

○ After parsing, the plain-text portion of the document can go through tokenization.

○ This turns the string—a sequence of characters—into a sequence of tokens.

○ Each token can then be counted as a word.

○ The tokenizer needs to know what characters indicate that one token has ended and another
is beginning.

○ Space characters are usually good separators, as are punctuation characters.

○ If the text contains tweets, then hash marks (#) should not be used as separators (delimiters)
Processing Text Features

● Parsing and Tokenization

○ Complex text featurization methods like word2vec also work with sentences or paragraphs

○ Here, documents are first parsed into sentences, and each sentence is then further tokenized into
words.
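A tiny tokenization sketch using a regular expression; the pattern is a simplification and deliberately keeps a leading '#' so that hashtags are not split, as suggested above for tweets.

```python
import re

def tokenize(text):
    # Match runs of word characters, optionally preceded by '#' so hashtags survive
    return re.findall(r"#?\w+", text.lower())

print(tokenize("Emma knocked on the door. #raven"))
# ['emma', 'knocked', 'on', 'the', 'door', '#raven']
```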
Processing Text Features

● Collocation Extraction for Phrase Detection

○ A collocation is an expression consisting of two or more words that correspond to some
conventional way of saying things.

○ The words together can mean more than their sum of parts (The Times of India, disk drive)

○ Techniques for finding the collocation:

■ Frequency-based methods
■ Chunking and part-of-speech tagging
Categorical Variable

● A categorical variable is used to represent categories or labels.

○ Eg: major cities in the world,

○ seasons in a year,

○ The industry (oil, travel, technology) of a company

○ The values of a categorical variable cannot be ordered with respect to one another; they are called nonordinal.

● In the real world, one often needs to deal with large categorical variables

○ Eg: User ID: many web services track users using an ID

● The vocabulary of a document corpus can be interpreted as a large categorical variable, with the
categories being unique words
Encoding Categorical Variables
○ One-Hot Encoding

○ Dummy Coding

○ Effect Coding
Encoding Categorical Variables
○ One-Hot Encoding

■ A group of bits is used.

■ Each bit represents a possible category.
■ If the variable cannot belong to multiple categories at once, then only one bit in the
group can be “on.”

Table 2 : One-hot encoding of a category of three cities


Encoding Categorical Variables

○ One-Hot Encoding

■ One-hot encoding is very simple to understand, but it uses one more bit than is strictly
necessary.
■ If k–1 of the bits are 0, then the last bit must be 1, because the variable must take on one of
the k values.

■ Mathematically, one can write this constraint as “the sum of all bits must be equal to 1”:

  e1 + e2 + … + ek = 1

■ This makes one-hot encoding redundant, which allows for multiple valid models for the same
problem.
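A brief one-hot encoding sketch with pandas (assumed available); the three city names are illustrative, since the actual values of Table 2 are not shown in the text.

```python
import pandas as pd

# Hypothetical categorical variable with three cities
df = pd.DataFrame({"city": ["San Francisco", "New York", "Seattle", "New York"]})

# One bit (column) per category; exactly one bit is "on" in each row
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot)
```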
Encoding Categorical Variables
○ Dummy Coding

■ In dummy coding, one category is represented by the vector of
all zeros.
■ This is known as the reference category.


Table 3 : Dummy coding of a category of three cities


Encoding Categorical Variables
○ Dummy Coding

■ It cannot easily handle missing data, since the all-zeros vector is already mapped to
the reference category.

■ It also encodes the effect of each category relative to the reference category
Encoding Categorical Variables

○ Effect Coding

■ It is very similar to dummy coding,

■ but here the reference category is represented by the vector of all –1s.

Table 4 : Effect coding of a category of three cities


■ However, the all –1s vector is a dense vector, which is expensive for both storage and computation.
■ All the techniques discussed above have some drawbacks; they can’t be efficiently
used for dealing with very large categorical variables.
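A short sketch of dummy coding and effect coding built on the same hypothetical cities, using pandas (assumed available); the reference category here is simply the category dropped by drop_first.

```python
import pandas as pd

cities = pd.Series(["San Francisco", "New York", "Seattle", "New York"], name="city")

# Dummy coding: drop one column; the dropped ("reference") category becomes all zeros
dummy = pd.get_dummies(cities, prefix="city", drop_first=True).astype(int)

# Effect coding: like dummy coding, but the reference category's rows become all -1s
reference = sorted(cities.unique())[0]   # the category dropped by drop_first
effect = dummy.copy()
effect.loc[cities == reference, :] = -1

print(dummy)
print(effect)
```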
Feature Engineering

● Feature engineering refers to the process of translating a data set into
features such that these features are able to represent the data set
more effectively and result in a better learning performance.
● Feature engineering is an important pre-processing step for machine
learning. It has two major elements:
○ 1. feature transformation
○ 2. feature subset selection
● Feature engineering in machine learning mainly consists of 5 processes:
Feature Creation, Feature Transformation, Feature Extraction, Feature
Selection, and Feature Scaling.
Feature Transformation

● Feature transformation transforms the data – structured or unstructured – into a
new set of features which can represent the underlying problem that machine
learning is trying to solve.
● It is a technique by which we can boost our model performance. Feature
transformation is a mathematical transformation in which we apply a
mathematical formula to a particular column (feature) and transform the values
into a form that is more useful for our further analysis.
Data

● Given the type of data,


○ Suppose for numeric data, a check is to be made:
■ Does the magnitude of the data matter?
■ Is the sign of the data important?
■ Is the count of the data important (eg: rating of a movie, etc.)?
● In machine learning, data should generally be in numeric form.
● Data can be represented in the form of vectors.
Data

● Scalars, Vectors, and Spaces


○ A single numeric feature is also known as a scalar.
○ An ordered list of scalars is known as a vector.
○ Vectors sit within a vector space.
● The input to a model is usually represented as a
numeric vector.
● Raw data can be converted to vector of numbers
using different methods.
Data

● A vector can be visualized as a point in space


● Let a point be (1, –1); its two-dimensional vector is v = [1, –1].
● A collection of data can be visualized in feature space as a point cloud.

Fig. vector representation of a point in 2-D space


Data

Fig. vector representation of categorical data in 2-D space


Feature Generation
● Feature generation is the process of constructing new features from existing
ones.
● The goal of feature generation is to derive new combinations and
representations of our data that might be useful to the machine learning
model.
● Feature generation is the process of adding transformations of terms into the
model. Feature generation enhances the power of models to fit more complex
relationships between target and predictors.
Examples of Feature Generation techniques

A transformation is a mapping that is used to transform a feature into a new
feature. The right transformation depends on the type and structure of the data, the data
size and the goal. This can involve transforming a single feature into a new feature using
standard operators like log, square, power, exponential, reciprocal, addition, division,
multiplication, etc.

Often the relationship between dependent and independent variables is
assumed to be linear, but this is not always the case. There are feature combinations that
cannot be represented by a linear system. A new feature can be created based on a
polynomial combination of numeric features in a dataset (see the sketch after this
paragraph). Moreover, new features can be created using trigonometric combinations.
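A minimal sketch of polynomial feature generation with scikit-learn's PolynomialFeatures (assumed available, recent version); x1 and x2 are hypothetical predictors.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # hypothetical predictors x1, x2

# Degree-2 combinations: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2"]))
print(X_poly)

# Trigonometric combinations can be generated similarly, e.g. np.sin(X), np.cos(X)
```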
Feature Generation was an ad-hoc manual process that depended on domain
knowledge, intuition, data exploration and creativity. However, this process is
dataset-dependent, time-consuming, tedious, subjective, and it is not a scalable
solution. Automated Feature Generation automatically generates features using a
framework; these features can be filtered using Feature Selection to avoid feature
explosion.
Feature Reduction

Feature reduction is the process of reducing the dimension of the feature space.
Its goal is to streamline the number of features our model has to ingest without
losing important information.
https://www.engati.com/glossary/feature-engineering
