12 Classification

The document discusses various classification methods that can be used to categorize samples based on developed models. It describes four levels of classification problems from simple to more complex categorization and prediction of properties. Several common classification methods are introduced, including linear learning machines, discriminant analysis, and classification trees. The document provides examples applying these methods to iris, wine, and coffee sample data sets to classify samples into predefined categories.


Classification Methods

Up to now we have been concerned with methods that display complex information and detect patterns or trends. Now we will introduce methods that can be used to classify samples based on models that are developed.

Classification problems
Level I: Simple classification into predefined categories.
Level II: Level I + detection of outliers.
Level III: Level II + prediction of an external property.
Level IV: Level II + prediction of more than one property.

Classification Methods
Many methods have been developed, with new ones being published all of the time. We'll look at some representative approaches:
Linear Learning Machine
Discriminant Analysis
Classification Trees
K Nearest Neighbor
SIMCA
Supported by XLStat; the available methods and approaches may vary based on the package used.

Classification Methods
All of these methods are considered supervised learning. Initial assumptions regarding membership or properties are made when developing a model. An initial evaluation of the data using exploratory data analysis is useful.

Data sets
Needed to develop and evaluate a classification model:
Training set - Representative samples used to build the model. The modeling software uses the class information.
Evaluation set - Samples of known class, used to test the model. The modeling software does not know the classes.
Test set - True unknowns.

Data pre-processing
With any of these methods, you may choose to do some sort of data pre-processing:
Raw - Is fastest.
Scaled - Gives equal weight to the variables.
PCA - Can be used to reduce noise and insignificant variables.
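As a rough illustration (not from the original slides), autoscaling and PCA-based reduction might look like this in Python with scikit-learn; the random stand-in data and the choice of 5 components are arbitrary:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is an (n_samples x n_variables) data matrix; random data stands in here.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))

X_scaled = StandardScaler().fit_transform(X)   # autoscale: mean 0, unit variance per variable
pca = PCA(n_components=5)                      # keep only the first few PCs to reduce noise
X_scores = pca.fit_transform(X_scaled)         # scores can be fed to the classifier instead of raw data
print(X_scores.shape, pca.explained_variance_ratio_.round(2))
```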

Data pre-processing
With some data sets, you may also want to do some other types of pre-processing. Example: spectral or chromatographic traces. Options may include smoothing, baseline correction, signal averaging, or using the first or second derivative.
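For spectral traces, one common route is a Savitzky-Golay filter, which can smooth and take derivatives in one step. This is a hedged sketch using SciPy on a synthetic trace; the window length, polynomial order, and the crude polynomial baseline are arbitrary choices:

```python
import numpy as np
from scipy.signal import savgol_filter

# A noisy synthetic "spectrum" stands in for a real trace.
x = np.linspace(0, 10, 500)
spectrum = np.exp(-(x - 5) ** 2) + 0.02 * np.random.default_rng(1).normal(size=x.size)

smoothed = savgol_filter(spectrum, window_length=21, polyorder=3)               # smoothing only
first_deriv = savgol_filter(spectrum, window_length=21, polyorder=3, deriv=1)   # first derivative
# A crude baseline correction: subtract a low-order polynomial fit.
baseline = np.polyval(np.polyfit(x, spectrum, 2), x)
corrected = spectrum - baseline
```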

Creating an evaluation set. The evaluation set is typically a subset of the training set that was omitted when building the model. Randomly pick a subset of the data, or randomly pick members from each class. Any approach that selectively removes a portion of the data could cause bias. A sketch of a class-balanced split follows below.
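One way to hold out an evaluation set while keeping each class represented is a stratified random split; a minimal sketch with scikit-learn, where the 25 % split fraction is just an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y draws members from each class in proportion, reducing the risk of bias
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(len(y_train), len(y_eval))
```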

Leave-one-out validation. A standardized approach for validation of a model where each sample serves as an evaluation set:
1. Omit a single sample from the set.
2. Build the model.
3. Test the omitted sample.
4. Repeat the above steps until each sample has been omitted and tested once.
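Those four steps map directly onto a leave-one-out loop. A minimal sketch using scikit-learn's LeaveOneOut, with an arbitrary classifier standing in for whichever model is being validated:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3)                   # 1. omit one sample, 2. build the model
    model.fit(X[train_idx], y[train_idx])
    correct += model.predict(X[test_idx])[0] == y[test_idx][0]    # 3. test the omitted sample
print(f"LOO accuracy: {correct / len(y):.3f}")                    # 4. repeat for every sample
```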

Your data
While leave-one-out testing is the best approach, it can be slow for large sets. Alternate approaches are to leave two or more samples out with each pass. Samples should be randomly ordered in the matrix. The same two (or more) samples should never be omitted together more than once.

Rule building methods


Methods where a set of rules is created to discriminate between classes.
Linear learning machine - One or more linear vectors are created to discriminate between classes.
Discriminant analysis - Linear or quadratic equations are used to separate classes.
Classification trees - A series of rules is used to sequentially classify.

Linear learning machine


The assumption is that one or more vectors can be found that can be used to discriminate between our classes. This can make use of our raw data or work in PC space. PC space would be better, as there would be noise reduction.

Linear learning machine


For simple classifications, there can be many linear vectors that give complete class discrimination. You would select the one that gives the best partitioning. You are not limited to just 1- or 2-D vectors.

Linear learning machine


As the number of classes increases, the potential number of usable vectors will decrease. The problem can become complex very rapidly. You can reach a point where simple linear boundaries can no longer solve the problem.

Linear learning machine


In this example, a linear solution can't be found that discriminates between the classes. Clearly, there should be a way to discriminate - the classes appear to be well defined. A non-linear function may offer the best approach (discriminant analysis).

Discriminant Analysis (DA)


First described by Fisher in 1936. Similar to LLM, but it can use both quantitative and qualitative variables. The approach uses linear models when sample classes have similar covariance matrices and quadratic models when classes have dissimilar covariance matrices. It can have problems if you have variables with null variance or multicollinearity - these must be eliminated.
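To make the linear vs. quadratic distinction concrete, here is a hedged sketch using scikit-learn's LDA and QDA on the iris data (this is not the XLStat workflow used in the slides, and the 5-fold cross-validation is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# LDA assumes the classes share a common covariance matrix; QDA fits one per class.
lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()
print("LDA CV accuracy:", cross_val_score(lda, X, y, cv=5).mean().round(3))
print("QDA CV accuracy:", cross_val_score(qda, X, y, cv=5).mean().round(3))
```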

Iris example
We'll return to the Iris example dataset, using XLStat's built-in DA function. We're going to use autoscaled data.

DA with XLStat

[Figure: DA variable loadings on F1 (99.01 %) vs F2 (0.99 %) for the iris data - sepal width, sepal length, petal width, and petal length.]

Coffee example
This consisted of 6 types of coffee, identified based on MS data. To avoid collinearity and null-variable problems, PCA scores were used (the first 5 components).

[Figure: DA scores for the coffee samples on F1 (56.19 %) vs F2 (22.79 %); the coffee types (labeled C, E, K, R, S, U) form distinct clusters.]

Classification trees
Predicts class membership by sequential application of rules based on predictor variables. With DA and LLM, you create a set of math models that are all applied at once. With classification trees, the predictor variables are evaluated as ordinal rules, one at a time.

Classification trees
[Figure: a simple example tree with sequential rules such as solid vs. liquid, density > 1, and red vs. green.]

Iris example (yet again!) XLStat supports the use of classification and regression trees. Classification if the Y variable (class) is qualitative, regression if the Y variable is quantitative. The iris example is a classification example.
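As a side illustration (not part of the original XLStat workflow), a similar tree can be grown on the iris data with scikit-learn and its splits printed as rules; the depth limit of 3 is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
# export_text lists the splits as ordinal rules, applied one variable at a time.
print(export_text(tree, feature_names=iris.feature_names))
```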

Iris example

If petal width is in [1, 8[, then assign to Species 1.

[Figure: classification tree for the iris data. The root node (Node 1: 150 samples, 33.3 % purity) splits on petal width ([1, 8[ vs [8, 25[). Node 2 ([1, 8[, 50 samples) is 100 % pure; Node 3 ([8, 25[, 100 samples, 50 % purity) is split further on petal width ([8, 16.5[ vs [16.5, 25[), then on petal length, sepal length, and sepal width, ending in leaf nodes with purities between 50 % and 100 %.]

Using the Classification Tree

[Figure: detail of the iris classification tree, showing node sizes and purities for the lower splits on petal length, sepal length, and sepal width.]
Purity is just the percentage of samples in a node that belong to the majority (predicted) class.

Using DA


Wine example
Riesling vs. Chardonnay; Ohio vs. California. Assayed 5 organic and 4 trace-metal components. Yes, you'll do the same with your homework.

Node summary (first three nodes):

Node   Class   Freq.   Purity
1      CaC     17      41.46%
2      CaC     17      58.62%
3      CaR     7       58.33%

(Class codes: Ca = California, Oh = Ohio; C = Chardonnay, R = Riesling.)

Rules:
If Ca in [17.5, 60.75[ then Class = CaC in 58.6% of cases.
If Ca in [60.75, 94.75[ then Class = CaR in 58.3% of cases.
If 2,3-butanediol in [0, 0.065[ and Ca in [17.5, 60.75[ then Class = CaR in 60% of cases.
If 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 70.8% of cases.
If Mn in [0.82, 1.625[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases.
If Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 70% of cases.
If K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 60% of cases.
If K in [881.75, 1147.5[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases.
If 1-hexanol in [0.638, 0.723[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases.
If 1-hexanol in [0.723, 1.056[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases.
If 1-hexanol in [0.409, 0.673[ and Ca in [60.75, 94.75[ then Class = OhR in 83.3% of cases.
If 1-hexanol in [0.673, 1.218[ and Ca in [60.75, 94.75[ then Class = CaR in 100% of cases.

[Figure: classification tree for the 41 wine samples. The root splits on Ca, with further splits on 2,3-butanediol, 1-hexanol, Mn, and K; node sizes and purities are shown for each branch.]

K nearest neighbor classification

A similarity-based classification method. It attempts to assign categories to unknown samples based on multivariate proximity to other samples. It works best with discrete classification types and is tolerant of poor data sets. K = the number of closest neighbors being compared. Consider this the supervised version of HCA.

Confusion matrix for the estimation sample (wine example):

from \ to    CaC      CaR     OhC     OhR     Total
CaC          17       0       0       0       17
CaR          0        9       1       1       11
OhC          0        0       7       0       7
OhR          0        1       0       5       6
Total        17       10      8       6       41
% correct    100.0%   90.0%   87.5%   83.3%   92.7%

K nearest neighbor classification
In its simplest form, KNN is conducted by: first, a training set is collected that contains examples of each class. Intersample distances are then calculated:

d_{ab} = \sqrt{ \sum_{j=1}^{N} \left( a_j - b_j \right)^2 }

where N = the number of variables or components used.

KNN
The distance matrix is sorted, and the distance of the unknown sample can be compared to:
1. The K nearest neighbors.
2. The nearest class cluster.
Option 2 requires that K = 1.
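A minimal sketch of this "simplest form": compute the Euclidean distances by hand, sort them, and take a majority vote among the K nearest training samples. This is illustrative only; the function and variable names are my own, and the toy data is made up:

```python
import numpy as np
from collections import Counter

def knn_classify(unknown, X_train, y_train, k=3):
    # d_ab = sqrt( sum_j (a_j - b_j)^2 ) to every training sample
    dists = np.sqrt(((X_train - unknown) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # sort the distances, keep the K nearest neighbors
    votes = Counter(y_train[nearest])      # the class with the most votes wins
    return votes.most_common(1)[0][0]

# Toy example: two well-separated 2-D classes.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=3))  # -> "A"
```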

KNN
When using the distance to a class, you can use the same link options that were discussed earlier. The distance can be based on:
Single link - closest member of the class.
Complete link - farthest member of the class.
Centroid - center of the class cluster.

KNN - single link

Single link: in this example (K = 3), the unknown is compared to the 3 closest known samples. In this case, the three closest samples are all red.

KNN - centroid link

Centroid link: with this approach, the distance to the center of each class cluster is determined and compared. Here, all of the blue samples would be closer to the unknown than any of the green.

KNN
Ideally, if a test sample falls well within a known class, its closest neighbors should all be of one class.

Mycobacteria - HCA

Mycobacteria - k means

A quick review of ALL of the ways that this data set was difficult to get useful information from.

[Figures: HCA dendrogram and k-means clustering of the mycobacteria samples (species labels 42-47 and 49).]

Mycobacteria - PCA

Mycobacteria - DA

[Figures: PCA scores and DA scores plots for the mycobacteria samples, labeled by species (42, 43, 44, 45, 46, 47, 49).]

Mycobacteria - DA

[Figure: DA scores on F1 (56.45 %) vs F2 (29.63 %); the mycobacteria samples group largely by species.]

Mycobacteria - DA

[Figure: expanded view of the DA scores plot (F1 56.45 %, F2 29.63 %) for the mycobacteria samples.]

Getting out the vote

What if a sample's distances are such that it could be in more than one class? When you have more than one possible class, we can take a vote. The class with the most votes wins. An example with K = 5 follows.

Getting out the vote

Example - K = 5
Sample   Class   Distance
1        A       0.134
2        B       0.145
3        A       0.158
4        B       0.234
5        B       0.502
Here you would end up with 3 votes for B and 2 for A. B would win.

Getting out the vote

Example - K = 3
Sample   Class   Distance
1        A       0.134
2        B       0.145
3        A       0.158
Here you would end up with 2 votes for A and one for B. A would win, and the distances would be smaller.

Getting out the vote

Example - K = 5
Sample   Class   Distance
1        A       0.134
2        B       0.145
3        A       0.158
4        B       0.234
5        C       0.502
Here, A and B would tie. The tie-breaker would be that A averages a smaller distance, so it would be made the winner.

KNN validation
The optimum number for K can be found by trial and error, but for a close match it should make no difference. The classifying power of your data can be evaluated by leave-one-out validation of your training set. This should be done before any sort of real classification begins.

KNN validation
Validation: you can sequentially leave out each of your samples and test it for votes at several K values. You end up with a vote matrix that will tell you the optimum K value for each class. You will also get a misclassification matrix - this tells you how often one of your knowns is incorrectly classified. A sketch of this loop follows below.
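A hedged sketch of that validation loop: leave-one-out over several K values, tallying how often the held-out samples are classified correctly, then a misclassification (confusion) matrix for one chosen K. scikit-learn is used for convenience, the iris data stands in for your own set, and the K range is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7):
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut())
    print(f"K={k}: LOO accuracy {(pred == y).mean():.3f}")

# Misclassification matrix for one K value: shows which knowns get confused.
best = cross_val_predict(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(confusion_matrix(y, best))
```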

K nearest neighbor classification


So KNN will always assign a class. What if you have a material that is not a member of an existing class? One option is to set a maximum distance. Example: if your intraclass distances run about 0.2 for all of your classes, you might want to omit votes with distances that exceed 0.2.

Iris (of course)


The Iris data set is included with a demo of the program Pirouette. We'll be using the Pirouette demo to show how to conduct KNN and SIMCA classifications. You can download a copy of the demo from www.infometrix.com. The demo is fully functional, but only with the data sets that are provided by Infometrix. The actual software is pretty easy to use but too expensive for our use in the course.

Iris example

Iris - scores by class

Iris - voting results

Iris - class partitions

Cola example
What? NOT the Iris data set? Headspace MS of 4 cola classes: two cola brands, diet and regular. m/e 44 - 149. May need to preprocess to eliminate any nonvariant data.

Class   Description
1       Brand 1
2       Diet brand 1
3       Brand 2
4       Diet brand 2

PCA scores

PCA scores

PCA loadings

KNN classification
Not a bad job!

KNN classifications

SIMCA
Soft Independent Modeling of Class Analogy
A method of classification that provides: Detection of outliers. Estimates of confidence for a classification. Determination of potential membership in more than a single class.

SIMCA
Basic approach. For each class of samples, a PCA model is constructed. This model is based on the optimum number of components that best clusters that individual class. The optimum number of components can vary from class to class and can be determined by cross-validation.

SIMCA models
Since the number of components used can vary, each class will be best described by its own hypervolume.

SIMCA models
Limiting a class hypervolume: you can limit the size of a hypervolume by setting a standard deviation cutoff. This results in better-defined classes.

SD = 3

SD = 2

SIMCA models
Once a model has been created for each class, you are ready to classify unknowns. For each model/sample combination: the sample is transformed into PC space and compared to see if it is a likely class member. If it is within the hypervolume of a single class, you have a match. A rough sketch of this procedure follows below.
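SIMCA itself is not in scikit-learn, but the basic approach just described can be sketched by hand: fit one PCA model per class and accept an unknown for every class whose residual cutoff it falls inside. This is a simplified illustration, not Pirouette's implementation; the fixed 2 components per class and the 3-standard-deviation style cutoff are placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

def fit_class_model(Xc, n_components=2):
    # One PCA model per class: class mean, PCA fit, and a residual cutoff.
    mean = Xc.mean(axis=0)
    pca = PCA(n_components=n_components).fit(Xc - mean)
    recon = pca.inverse_transform(pca.transform(Xc - mean))
    resid = np.linalg.norm((Xc - mean) - recon, axis=1)
    cutoff = resid.mean() + 3 * resid.std()     # crude "SD = 3"-style limit (placeholder)
    return mean, pca, cutoff

models = {c: fit_class_model(X[y == c]) for c in np.unique(y)}

def simca_classify(sample):
    # Project the sample into each class model; accept every class whose
    # residual falls inside that class's cutoff ("hypervolume").
    hits = []
    for c, (mean, pca, cutoff) in models.items():
        centered = sample - mean
        recon = pca.inverse_transform(pca.transform(centered.reshape(1, -1)))[0]
        if np.linalg.norm(centered - recon) <= cutoff:
            hits.append(int(c))
    return hits   # may be empty (no known class), one class, or several

print(simca_classify(X[0]))
```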

SIMCA classification
The potential still exists for a sample to be classified as a member of more than one class. It may also not be a member of any known class.

SIMCA classification
SIMCA will give you an estimate as to the probability of class membership. Example - two possible classes:
Class   Probability
A       0.90
B       0.45
Here, the sample is more likely to be a member of Class A.

SIMCA summary
Of the methods covered, SIMCA offers the most options for developing a classification model when the classes are well known. It also requires the most development time as you must determine the optimum model conditions for each class. If used, plan on spending quite a bit of time working with all of the available options.

SIMCA example - Iris.


Of course we'll look at the iris dataset again.

SIMCA example - Iris.

Note: We have a separate model for each class in the data set - in this case three.

SIMCA example - Iris.

SIMCA example - Iris.

Pirouette will provide an estimate as to the class hypervolumes based on the first three PCs.

SIMCA example - Iris.


These plots show the relative positions of each sample when projected into any of the three class models - two classes at a time - with color coding based on known class.

It appears that petal length is the most useful for classifying.

Cola example
With the cola example (two brands, diet and regular), we have 4 classes. Here you can see that the classes are pretty well resolved.

Cola example

Mycobacteria again
This data set is included with the Pirouette demo. File = Mycosing.wks. It is a subset of the version I've been using (only 72 samples).

Mycobacteria SIMCA
Perfect classifications - a first for this dataset.

Mycobacteria SIMCA

Mycobacteria SIMCA
The example shows that a different number of components was used in developing each of the individual SIMCA hypervolumes.

Discriminating Power is a measure of which variables show the biggest class differences.

Mycobacteria SIMCA
Modeling power indicates the relative importance of each variable for classification.

Mycobacteria SIMCA
PC plots are pretty boring, since each model contains only one class. However, they can be used to see if you have any sub-classes.

Loadings, as always, show the relative significance of each variable in constructing each PC. Here, they are relatively unimportant.

Outliers are tested for by plotting sample residuals (the difference between a sample and the center of the hypervolume) vs. the sample's Mahalanobis distance from the center of the cluster - similar to a Euclidean distance, but it takes into account correlations in the data and is scale invariant.
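For reference, the Mahalanobis distance can be computed directly from a class covariance matrix; a hedged sketch with SciPy on made-up data (this is not the Pirouette outlier plot itself):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Samples from one class; a correlated random cloud stands in for real data.
rng = np.random.default_rng(2)
cls = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=100)

center = cls.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(cls, rowvar=False))   # accounts for correlation and scale
sample = np.array([3.0, 2.5])
print(mahalanobis(sample, center, inv_cov))          # distance of the sample from the class center
```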

Mycobacteria SIMCA
