
Machine Learning

Feature Creation and Selection

Jeff Howbert, Introduction to Machine Learning, Winter 2014


Feature creation

• Well-conceived new features can sometimes capture the important
  information in a dataset much more effectively than the original features.
• Three general methodologies:
  • Feature extraction
    • typically results in significant reduction in dimensionality
    • domain-specific
  • Map existing features to new space
  • Feature construction
    • combine existing features
Scale-invariant feature transform (SIFT)

• Image content is transformed into local feature coordinates that are
  invariant to translation, rotation, scale, and other imaging parameters.

[Figure: SIFT features detected on an example image]
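
A minimal sketch of extracting SIFT features with OpenCV (an illustrative library choice, not from the slides; assumes the opencv-python package, version >= 4.4, and an example image file):

# Hypothetical example: detect SIFT keypoints and compute descriptors.
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # "scene.jpg" is a placeholder

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint has an image location, scale, and orientation; each
# descriptor is a 128-dimensional vector usable as a local feature.
print(len(keypoints), descriptors.shape)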
Extraction of power bands from EEG
1. Select time window
2. Apply a Fourier transform to each EEG channel to obtain the corresponding channel power spectrum
3. Segment power spectrum into bands
4. Create channel-band feature by summing values in band (see the sketch below)

[Figure: multi-channel EEG recording (time domain, with selected time window) → multi-channel power spectrum (frequency domain)]
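
A minimal sketch of steps 1-4 above using NumPy (the sampling rate, channel count, and band edges are illustrative assumptions, not from the slides):

import numpy as np

fs = 256                                   # sampling rate in Hz (assumed)
bands = {"delta": (0.5, 4), "theta": (4, 8),
         "alpha": (8, 13), "beta": (13, 30)}

def power_band_features(eeg_window):
    """eeg_window: array of shape (n_channels, n_samples) from the selected time window."""
    n_samples = eeg_window.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Step 2: power spectrum per channel = squared magnitude of the Fourier transform.
    spectrum = np.abs(np.fft.rfft(eeg_window, axis=1)) ** 2
    features = []
    for lo, hi in bands.values():
        mask = (freqs >= lo) & (freqs < hi)             # step 3: segment into bands
        features.append(spectrum[:, mask].sum(axis=1))  # step 4: sum within band
    return np.concatenate(features)                     # one feature per (channel, band)

window = np.random.randn(8, 2 * fs)       # stand-in for a 2-second, 8-channel window
print(power_band_features(window).shape)  # (32,) = 8 channels x 4 bands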
Map existing features to new space

• Fourier transform
  • Separates signal from noise: periodic components concentrate at a few
    frequencies, while noise spreads thinly across the spectrum (see the sketch below)

[Figure: two sine waves (time domain); two sine waves + noise (time domain); frequency-domain spectrum showing two sharp peaks]
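
A minimal sketch of the figure's idea (the frequencies and noise level are illustrative assumptions): two sine waves buried in noise are hard to see in the time domain, but appear as two sharp spectral peaks.

import numpy as np

fs = 1000                                  # samples per second (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 120 * t)
noisy = signal + np.random.normal(scale=1.5, size=t.shape)

freqs = np.fft.rfftfreq(len(t), d=1.0 / fs)
magnitude = np.abs(np.fft.rfft(noisy))

# The two largest spectral peaks recover the underlying frequencies (about 50 and 120 Hz).
print(freqs[np.argsort(magnitude)[-2:]])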
Attribute transformation
• Simple functions
  • Examples of transform functions: x^k, log(x), e^x, |x|
  • Often used to make the data more like some standard distribution, to
    better satisfy the assumptions of a particular algorithm (see the sketch below).
    • Example: discriminant analysis explicitly models each class distribution as a multivariate Gaussian

[Figure: effect of the log(x) transform on a skewed distribution]
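
A minimal sketch of a log transform making right-skewed data more Gaussian (the lognormal sample is an illustrative assumption):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed

print(stats.skew(x))          # large positive skew
print(stats.skew(np.log(x)))  # near 0: log(x) is approximately normal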
Modeling document similarity

• Vector space models

• Latent semantic indexing


Vector space models
• Vector of features for each document
• Word frequencies (usually weighted)
• Meta attributes, e.g. title, URL, PageRank

• Can use vector as document descriptor for classification tasks

• Can measure document relatedness by cosine similarity of vectors (see the sketch below)


• Useful for clustering, ranking tasks
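
A minimal sketch of cosine similarity between two document vectors (the toy term-frequency vectors are illustrative assumptions):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: dot(a, b) / (||a|| * ||b||).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([3, 0, 1, 2])   # counts of each vocabulary term in document 1
doc2 = np.array([2, 1, 0, 3])
print(cosine_similarity(doc1, doc2))  # 1.0 = same direction, 0.0 = orthogonal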
Vector space models
• Term frequency-inverse document frequency (tf-idf)
• Very widely used model
• Each feature represents a single word (term)
• Feature value is the product of tf and idf (see the sketch below):
  • tf = proportion of counts of that term relative to all terms in the document
  • idf = log( total number of documents / number of documents that contain the term )
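
A minimal from-scratch sketch of the tf-idf definition above (the toy corpus is an illustrative assumption; real systems typically use a library implementation such as scikit-learn's TfidfVectorizer):

import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                # proportion of counts in the document
    n_containing = sum(term in d for d in corpus)  # number of documents containing the term
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

print(tf_idf("cat", corpus[0], corpus))  # term in 1 of 3 documents -> larger idf
print(tf_idf("the", corpus[0], corpus))  # term in 2 of 3 documents -> smaller idf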
Vector space models
• Example of tf-idf vectors for a collection of documents
Vector space models
• Vector space models cannot detect:
  • Synonymy – multiple words with the same meaning
  • Polysemy – a single word with multiple meanings (e.g. play, table)
Latent semantic indexing

• Aggregate document term vectors into matrix X
  • Rows represent terms
  • Columns represent documents
• Factorize X using singular value decomposition (SVD)
  • X = T · S · Dᵀ
Latent semantic indexing
• Factorize X using singular value decomposition (SVD)
  • X = T · S · Dᵀ
  • Columns of T are orthogonal and contain the eigenvectors of XXᵀ
  • Columns of D are orthogonal and contain the eigenvectors of XᵀX
  • S is diagonal and contains the singular values (analogous to eigenvalues)
• Rows of T represent terms in a new orthogonal space
• Rows of D represent documents in a new orthogonal space
• In the new orthogonal spaces of T and D, the correlations originally present
  in X are captured in separate dimensions
  • Better exposes relationships among data items
  • Identifies dimensions of greatest variation within data
Latent semantic indexing
• Dimensional reduction with SVD
• Select the k largest singular values and the corresponding columns from T and D
• Xk = Tk · Sk · Dkᵀ is a reduced-rank reconstruction of the full X (see the sketch below)
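
A minimal sketch of the factorization and rank-k reduction with NumPy (the toy term-document matrix and the choice k = 2 are illustrative assumptions):

import numpy as np

X = np.array([[2., 0., 1.],    # rows = terms
              [1., 1., 0.],    # columns = documents
              [0., 3., 1.],
              [1., 0., 2.]])

T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T · diag(s) · Dᵀ

k = 2
Xk = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # reduced-rank reconstruction of X

# Documents as k-dimensional vectors in the latent space.
docs_k = (np.diag(s[:k]) @ Dt[:k, :]).T
print(Xk.round(2))
print(docs_k.round(2))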
Latent semantic indexing
• Dimensional reduction with SVD
• Reduced-rank representations of terms and documents are referred to as "latent concepts"
• Compare documents in lower dimensional space
• classification, clustering, matching, ranking
• Compare terms in lower dimensional space
• synonymy, polysemy, other cross-term semantic relationships
Latent semantic indexing

• Unsupervised

• May not learn a matching score that works well for a task of interest
Major tasks in NLP (Wikipedia)
• For most of the tasks on the following slides, there are:
• Well-defined problem setting
• Large volume of research
• Standard metric for evaluating the task
• Standard corpora on which to evaluate
• Competitions devoted to the specific task

http://en.wikipedia.org/wiki/Natural_language_processing
Machine translation

• Automatically translate text from one human language to another.
• This is one of the most difficult problems, and is a member of a class of
  problems colloquially termed "AI-complete", i.e. requiring all of the
  different types of knowledge that humans possess (grammar, semantics,
  facts about the real world, etc.) in order to solve properly.
Parse tree (grammatical analysis)
• The grammar of natural languages is ambiguous, and typical sentences have
  multiple possible analyses. For a typical sentence there may be thousands
  of potential parses (most of which will seem completely nonsensical to a
  human).
Part-of-speech tagging
• Given a sentence, determine the part of speech for each word. Many words,
especially common ones, can serve as multiple parts of speech. For example,
"book" can be a noun or verb, and "out" can be any of at least five different
parts of speech.
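
A minimal sketch using NLTK's off-the-shelf tagger (an illustrative library choice, not from the slides; assumes nltk is installed along with its averaged_perceptron_tagger model):

import nltk

for tokens in [["Book", "that", "flight", "."],
               ["I", "read", "a", "good", "book", "."]]:
    print(nltk.pos_tag(tokens))
# The same word form ("book") can receive different tags in different
# contexts, e.g. verb (VB) vs. noun (NN).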
Speech recognition

• Given a sound clip of a person or people speaking, determine the


textual representation of the speech.
• Another extremely difficult problem, also regarded as "AI-complete".
• In natural speech there are hardly any pauses between successive
words, and thus speech segmentation (separation into words) is a
necessary subtask of speech recognition.
• In most spoken languages, the sounds representing successive letters
blend into each other in a process termed coarticulation, so the
conversion of the analog signal to discrete characters can be a very
difficult process.
Speech recognition

• Hidden Markov model (HMM) for phoneme extraction

https://www.assembla.com/code/sonido/subversion/node/blob/7/sphinx4/index.html
Sentiment analysis

• Extract subjective information, usually from a set of documents such as
  online reviews, to determine "polarity" about specific objects.
• Especially useful for identifying trends of public opinion in social media,
  for the purpose of marketing.
Information extraction (IE)

• Concerned with the extraction of semantic information from text. Subtasks include:
• Named entity recognition
• Coreference resolution
• Relationship extraction
• Word sense disambiguation
• Automatic summarization
• etc.
Named entity recognition
• Given a stream of text, determine which items in the text map to proper
names, such as people or places, and what the type of each such name is
(e.g. person, location, organization).
• In English, capitalization can aid in recognizing named entities, but cannot
aid in determining the type of named entity, and in any case is often
insufficient. For example, the first word of a sentence is also capitalized, and
named entities often span several words.
• German capitalizes all nouns.
• French and Spanish do not capitalize names that serve as adjectives.
• Many languages (e.g. Chinese or Arabic) do not have capitalization at all.
Coreference resolution

• Given a chunk of text, determine which words ("mentions") refer to the
  same objects ("entities").
• Anaphora resolution is a specific example of this task,
concerned with matching up pronouns with the nouns or
names that they refer to.
• The more general task of coreference resolution also
includes identifying so-called "bridging relationships"
involving referring expressions.
• For example, in a sentence such as "He entered John's house
through the front door", "the front door" is a referring expression
and the bridging relationship to be identified is the fact that the
door being referred to is the front door of John's house (rather
than of some other structure that might also be referred to).
Relationship extraction

• Given a chunk of text, identify the relationships among named entities
  (e.g. who is the wife of whom).
Word sense disambiguation

• Many words have more than one meaning; we have to select the
meaning which makes the most sense in context.
• For this problem, we are typically given a list of words and
associated word senses, e.g. from a dictionary or from an
online resource such as WordNet.
Automatic summarization

• Produce a readable summary of a chunk of text. Often used to provide
  summaries of text of a known type, such as articles in the financial
  section of a newspaper.
Natural language generation

• Convert information from computer databases into readable human language.
Natural language understanding

• Convert chunks of text into more formal representations, such as first-order
  logic structures, that are easier for computer programs to manipulate.
• Natural language understanding involves identifying the intended semantics
  from the multiple possible semantics that can be derived from a natural
  language expression, which usually takes the form of organized notations of
  natural-language concepts. Introducing a language metamodel and ontology is
  an efficient, though empirical, solution. An explicit formalization of
  natural-language semantics, free of confusions with implicit assumptions
  such as the closed-world assumption (CWA) vs. the open-world assumption, or
  subjective yes/no vs. objective true/false, is expected to form the basis
  of a formalization of semantics.
Optical character recognition (OCR)

• Given an image representing printed text, determine the corresponding text.
Word segmentation

• Separate a chunk of continuous text into separate words.
• For English this is fairly trivial, since words are usually separated by
  spaces. However, some written languages, such as Chinese, Japanese, and
  Thai, do not mark word boundaries in this fashion; in those languages text
  segmentation is a significant task requiring knowledge of the vocabulary
  and morphology of words in the language (see the sketch below).
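
A minimal greedy maximum-matching sketch of dictionary-based segmentation (the tiny vocabulary and unsegmented input are illustrative assumptions; real segmenters also handle ambiguity and morphology):

def segment(text, vocab, max_len):
    words, i = [], 0
    while i < len(text):
        # Take the longest dictionary match starting at position i;
        # fall back to a single character if nothing matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"machine", "learning", "is", "fun"}
print(segment("machinelearningisfun", vocab, max_len=8))
# ['machine', 'learning', 'is', 'fun']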
Speech processing

• Speech recognition
• Text-to-speech
