Statistics Toolbox™ User's Guide
Web: www.mathworks.com
Newsgroup: comp.soft-sys.matlab
Technical Support: www.mathworks.com/contact_TS.html
Phone: 508-647-7000
Fax: 508-647-7001
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
The MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 1993 First printing Version 1.0
March 1996 Second printing Version 2.0
January 1997 Third printing Version 2.11
November 2000 Fourth printing Revised for Version 3.0 (Release 12)
May 2001 Fifth printing Minor revisions
July 2002 Sixth printing Revised for Version 4.0 (Release 13)
February 2003 Online only Revised for Version 4.1 (Release 13.0.1)
June 2004 Seventh printing Revised for Version 5.0 (Release 14)
October 2004 Online only Revised for Version 5.0.1 (Release 14SP1)
March 2005 Online only Revised for Version 5.0.2 (Release 14SP2)
September 2005 Online only Revised for Version 5.1 (Release 14SP3)
March 2006 Online only Revised for Version 5.2 (Release 2006a)
September 2006 Online only Revised for Version 5.3 (Release 2006b)
March 2007 Eighth printing Revised for Version 6.0 (Release 2007a)
September 2007 Ninth printing Revised for Version 6.1 (Release 2007b)
March 2008 Online only Revised for Version 6.2 (Release 2008a)
October 2008 Online only Revised for Version 7.0 (Release 2008b)
March 2009 Online only Revised for Version 7.1 (Release 2009a)
September 2009 Online only Revised for Version 7.2 (Release 2009b)
March 2010 Online only Revised for Version 7.3 (Release 2010a)
Contents
Getting Started
1
Product Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Organizing Data
2
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
Descriptive Statistics
3
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Measures of Central Tendency . . . . . . . . . . . . . . . . . . . . . . 3-3
Statistical Visualization
4
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Probability Distributions
5
Using Probability Distributions . . . . . . . . . . . . . . . . . . . . . 5-2
Supported Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Parametric Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Nonparametric Distributions . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Random Number Generation
6
Common Generation Methods . . . . . . . . . . . . . . . . . . . . . . 6-5
Direct Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Inversion Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Acceptance-Rejection Methods . . . . . . . . . . . . . . . . . . . . . . . 6-9
Hypothesis Tests
7
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Available Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Analysis of Variance
8
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
Two-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
N-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Other ANOVA Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-27
Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-35
MANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-39
ANOVA with Multiple Responses . . . . . . . . . . . . . . . . . . . . 8-39
Regression Analysis
9
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-58
Nonlinear Regression Models . . . . . . . . . . . . . . . . . . . . . . . . 9-58
Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-59
Mixed-Effects Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-64
Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-94
Multivariate Methods
10
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
Cluster Analysis
11
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Classification
12
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Example: Classification Trees . . . . . . . . . . . . . . . . . . . . . . . 12-9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-13
Markov Models
13
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Design of Experiments
14
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-2
Function Reference
16
File I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-2
Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-31
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-42
Classification Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-42
Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-42
Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-42
Naive Bayes Classification . . . . . . . . . . . . . . . . . . . . . . . . . 16-43
Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-44
Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . 16-48
Response Surface Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 16-48
D-Optimal Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-48
Latin Hypercube Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-48
Quasi-Random Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-49
GUIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-52
Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-53
Class Reference
17
Data Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
Categorical Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
Dataset Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-5
Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-5
Naive Bayes Classification . . . . . . . . . . . . . . . . . . . . . . . . . 17-5
Ensemble Method Classes . . . . . . . . . . . . . . . . . . . . . . . . . . 17-5
Data Sets
A
Distribution Reference
B
Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
Definition of the Bernoulli Distribution . . . . . . . . . . . . . . . . B-3
See Also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
See Also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-25
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-25
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-25
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
See Also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-26
Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-30
Johnson System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-48
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-70
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-70
See Also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-71
Piecewise Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-88
Uniform Distribution (Discrete) . . . . . . . . . . . . . . . . . . . . . B-101
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-101
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-101
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-101
See Also . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-102
Bibliography
C
Index
1
Getting Started
Product Overview
Statistics Toolbox™ software extends MATLAB® to support a wide range of
common statistical tasks. The toolbox contains two categories of tools:

• Building-block statistical functions, for use in MATLAB programming
• Graphical tools, for interactive data analysis
Code for the building-block functions is open and extensible. Use the MATLAB
Editor to review, copy, and edit code for any function. Extend the toolbox by
copying code to new files or by writing files that call toolbox functions.
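For example, you can inspect the implementation of a toolbox function (nanmean
is used here only as an illustration; any toolbox function with viewable
source works the same way):

type nanmean   % display the source code in the Command Window
edit nanmean   % open the same file in the MATLAB Editor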
2
Organizing Data
Introduction
MATLAB data is placed into “data containers” in the form of workspace
variables. All workspace variables organize data into some form of array. For
statistical purposes, arrays are viewed as tables of values.
Data types determine the kind of data variables contain. (See “Classes (Data
Types)” in the MATLAB documentation.)
These variables are not specifically designed for statistical data, however.
Statistical data generally involves observations of multiple variables, with
measurements of heterogeneous type and size. Data may be numerical (of
type single or double), categorical, or in the form of descriptive metadata.
Fitting statistical data into basic MATLAB variables, and accessing it
efficiently, can be cumbersome.
MATLAB Arrays
In this section...
“Numerical Data” on page 2-4
“Heterogeneous Data” on page 2-7
“Statistical Functions” on page 2-9
Numerical Data
MATLAB two-dimensional numerical arrays (matrices) containing statistical
data use rows to represent observations and columns to represent measured
variables. For example,

load fisheriris

loads the variables meas and species into the MATLAB workspace. The meas
variable is a 150-by-4 numerical matrix, representing 150 observations of 4
different measured variables (by column: sepal length, sepal width, petal
length, and petal width, respectively).
setosa_indices = strcmp('setosa',species);
setosa = meas(setosa_indices,:);
To access and display the first five observations in the setosa data, use row,
column parenthesis indexing:
SetosaObs = setosa(1:5,:)
SetosaObs =
5.1000 3.5000 1.4000 0.2000
4.9000 3.0000 1.4000 0.2000
4.7000 3.2000 1.3000 0.2000
4.6000 3.1000 1.5000 0.2000
5.0000 3.6000 1.4000 0.2000
The data are organized into a table with implicit column headers “Sepal
Length,” “Sepal Width,” “Petal Length,” and “Petal Width.” Implicit row
headers are “Observation 1,” “Observation 2,” “Observation 3,” etc.
Similarly, 50 observations for iris versicolor and iris virginica can be extracted
from the meas container variable:
versicolor_indices = strcmp('versicolor',species);
versicolor = meas(versicolor_indices,:);
virginica_indices = strcmp('virginica',species);
virginica = meas(virginica_indices,:);
Because the data sets for the three species happen to be of the same size, they
can be reorganized into a single 50-by-4-by-3 multidimensional array:
iris = cat(3,setosa,versicolor,virginica);
The iris array is a three-layer table with the same implicit row and column
headers as the setosa, versicolor, and virginica arrays. The implicit layer
names, along the third dimension, are “Setosa,” “Versicolor,” and “Virginica.”
The utility of such a multidimensional organization depends on assigning
meaningful properties of the data to each dimension.
SetosaSL = iris(1:5,1,1)
SetosaSL =
5.1000
4.9000
4.7000
4.6000
5.0000
Heterogeneous Data
MATLAB data types include two container variables—cell arrays and
structure arrays—that allow you to combine metadata with variables of
different types and sizes.
For example, the iris data can be combined with its metadata in a single cell
array. First preallocate the array, then fill in the names and data:

iris1 = cell(51,5,3); % one header row and column added to the 50-by-4-by-3 data
obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris1(2:end,1,:) = repmat(obsnames,[1 1 3]);
varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris1(1,2:end,:) = repmat(varnames,[1 1 3]);
iris1(2:end,2:end,1) = num2cell(setosa);
iris1(2:end,2:end,2) = num2cell(versicolor);
iris1(2:end,2:end,3) = num2cell(virginica);
iris1{1,1,1} = 'Setosa';
iris1{1,1,2} = 'Versicolor';
iris1{1,1,3} = 'Virginica';
To access and display the cells, use parenthesis indexing. The following
displays the first five observations in the setosa sepal data:
SetosaSLSW = iris1(1:6,1:3,1)
SetosaSLSW =
'Setosa' 'SepalLength' 'SepalWidth'
'Obs1' [ 5.1000] [ 3.5000]
'Obs2' [ 4.9000] [ 3]
'Obs3' [ 4.7000] [ 3.2000]
'Obs4' [ 4.6000] [ 3.1000]
'Obs5' [ 5] [ 3.6000]
Here, the row and column headers have been explicitly labeled with metadata.
To extract the data subset, use row, column curly brace indexing:
subset = reshape([iris1{2:6,2:3,1}],5,2)
subset =
5.1000 3.5000
4.9000 3.0000
4.7000 3.2000
4.6000 3.1000
5.0000 3.6000
While cell arrays are useful for organizing heterogeneous data, they may
be cumbersome when it comes to manipulating and analyzing the data.
MATLAB and Statistics Toolbox statistical functions do not accept data in the
form of cell arrays. For processing, data must be extracted from the cell array
to a numerical container variable, as in the preceding example. The indexing
can become complicated for large, heterogeneous data sets. This limitation of
cell arrays is addressed by dataset arrays (see “Dataset Arrays” on page 2-23),
which are designed to store general statistical data and provide easy access.
The data in the preceding example can also be organized in a structure array,
as follows:
iris2.data = cat(3,setosa,versicolor,virginica);
iris2.varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris2.obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris2.species = {'setosa','versicolor','virginica'};
The data subset is then returned using a combination of dot and parenthesis
indexing:
subset = iris2.data(1:5,1:2,1)
subset =
5.1000 3.5000
4.9000 3.0000
4.7000 3.2000
4.6000 3.1000
5.0000 3.6000
For statistical data, structure arrays have many of the same limitations as
cell arrays. Once again, dataset arrays (see “Dataset Arrays” on page 2-23),
designed specifically for general statistical data, address these limitations.
Statistical Functions
One of the advantages of working in the MATLAB language is that functions
operate on entire arrays of data, not just on single scalar values. The
functions are said to be vectorized. Vectorization allows for both efficient
problem formulation, using array-based data, and efficient computation,
using vectorized statistical functions.
For example, compute the standard deviation of each measured variable
(column) in the setosa data:

std(setosa)
ans =
0.3525 0.3791 0.1737 0.1054
The four standard deviations are for measurements of sepal length, sepal
width, petal length, and petal width, respectively.
Compare this to
std(setosa(:))
ans =
1.8483
which gives the standard deviation across the entire array (all measurements).
Other MATLAB functions are vectorized elementwise. For example:

sin(setosa)
This operation returns a 50-by-4 array the same size as setosa. The sin
function is vectorized in a different way than the std function, computing one
scalar value for each element in the array.
Statistical Arrays
In this section...
“Introduction” on page 2-11
“Categorical Arrays” on page 2-13
“Dataset Arrays” on page 2-23
Introduction
As discussed in “MATLAB Arrays” on page 2-4, MATLAB data types include
arrays for numerical, logical, and character data, as well as cell and structure
arrays for heterogeneous collections of data.
Categorical arrays store data with values in a discrete set of levels. Each level
is meant to capture a single, defining characteristic of an observation. If no
ordering is encoded in the levels, the data and the array are nominal. If an
ordering is encoded, the data and the array are ordinal.
Categorical arrays also store labels for the levels. Nominal labels typically
suggest the type of an observation, while ordinal labels suggest the position
or rank.
Both categorical and dataset arrays have associated methods for assembling,
accessing, manipulating, and processing the collected data. Basic array
operations parallel those for numerical, cell, and structure arrays.
Categorical Arrays
• “Categorical Data” on page 2-13
• “Categorical Arrays” on page 2-14
• “Using Categorical Arrays” on page 2-16
Categorical Data
Categorical data take on values from only a finite, discrete set of categories
or levels. Levels may be determined before the data are collected, based on
the application, or they may be determined by the distinct values in the data
when converting them to categorical form. Predetermined levels, such as a
set of states or numerical intervals, are independent of the data they contain.
Any number of values in the data may attain a given level, or no data at all.
Categorical data show which measured values share common levels, and
which do not.
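As a small sketch of predetermined levels (the data here are hypothetical),
an ordinal array can carry a level that no observation attains:

sizes = ordinal({'medium';'small';'small';'large';'medium'},...
{},{'small','medium','large','xlarge'});
getlabels(sizes)   % lists all four levels; 'xlarge' is empty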
Categorical Arrays
Categorical data can be represented using MATLAB integer arrays, but
this method has a number of drawbacks. First, it removes all of the useful
metadata that might be captured in labels for the levels. Labels must be
stored separately, in character arrays or cell arrays of strings. Secondly, this
method suggests that values stored in the integer array have their usual
numeric meaning, which, for categorical data, they may not. Finally, integer
types have a fixed set of levels (for example, -128:127 for all int8 arrays),
which cannot be changed.
load fisheriris
ndata = nominal(species,{'A','B','C'});
creates a nominal array with levels A, B, and C from the species data in
fisheriris.mat, while
odata = ordinal(ndata,{},{'C','A','B'});
encodes an ordering of the levels with C < A < B. See “Using Categorical
Arrays” on page 2-16, and the reference pages for nominal and ordinal, for
further examples.
Categorical arrays are implemented as two classes: the nominal class and the
ordinal class. Use the corresponding constructors,
nominal or ordinal, to create categorical arrays. Methods of the classes are
used to display, summarize, convert, concatenate, and access the collected
data. Many of these methods are invoked using operations analogous to those
for numerical arrays, and do not need to be called directly (for example, []
invokes horzcat). Other methods, such as reorderlevels, must be called
directly.
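For instance (a brief sketch with hypothetical values), concatenation invokes
a method implicitly, while reorderlevels is called directly:

a = nominal({'x';'y'});
b = nominal({'y';'z'});
c = [a;b];                           % [;] invokes the vertcat method
c = reorderlevels(c,{'z','y','x'});  % direct method call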
1 Load the 150-by-4 numerical array meas and the 150-by-1 cell array of
strings species:

load fisheriris

The data are 150 observations of four measured variables (by column
number: sepal length, sepal width, petal length, and petal width,
respectively) over three species of iris (setosa, versicolor, and virginica).

2 Create a nominal array from species:

n1 = nominal(species);
3 Open species and n1 side by side in the Variable Editor (see “Viewing and
Editing Workspace Variables with the Variable Editor” in the MATLAB
documentation). Note that the string information in species has been
converted to categorical form, leaving only information on which data share
the same values, indicated by the labels for the levels.
By default, levels are labeled with the distinct values in the data (in this
case, the strings in species). Give alternate labels with additional input
arguments to the nominal constructor:
n2 = nominal(species,{'species1','species2','species3'});
4 Open n2 in the Variable Editor, and compare it with species and n1. The
levels have been relabeled.
5 Create an ordinal array from n2, specifying an ordering of the levels:

o1 = ordinal(n2,{},{'species1','species3','species2'});
The second input argument to ordinal is the same as for nominal—a list
of labels for the levels in the data. If it is unspecified, as above, the labels
are inherited from the data, in this case n2. The third input argument of
ordinal indicates the ordering of the levels, in ascending order.
6 When displayed side by side in the Variable Editor, o1 does not appear any
different than n2. This is because the data in o1 have not been sorted. It
is important to recognize the difference between the ordering of the levels
in an ordinal array and sorting the actual data according to that ordering.
Use sort to sort ordinal data in ascending order:
o2 = sort(o1);
When displayed in the Variable Editor, o2 shows the data sorted by diploid
chromosome count.
7 To find which elements moved up in the sort, use the < operator for ordinal
arrays:

moved_up = (o1 < o2);
8 Use getlabels to display the labels for the levels in ascending order:
labels2 = getlabels(o2)
labels2 =
'species1' 'species3' 'species2'
9 The sort function reorders the display of the data, but not the order of the
levels. To reorder the levels, use reorderlevels:
o3 = reorderlevels(o2,labels2([1 3 2]));
labels3 = getlabels(o3)
labels3 =
'species1' 'species2' 'species3'
o4 = sort(o3);
These operations return the levels in the data to their original ordering, by
species number, and then sort the data for display purposes.
low50 = o4(1:50);
Suppose you want to categorize the data in o4 with only two levels: low (the
data in low50) and high (the rest of the data). One way to do this is to use an
assignment with parenthesis indexing on the left-hand side:
o5 = o4; % Copy o4
o5(1:50) = 'low';
Warning: Categorical level 'low' being added.
o5(51:end) = 'high';
Warning: Categorical level 'high' being added.
Note the warnings: the assignments move data to new levels. The old levels,
though empty, remain:
getlabels(o5)
ans =
'species1' 'species2' 'species3' 'low' 'high'
Use droplevels to remove the empty levels:

o5 = droplevels(o5,{'species1','species2','species3'});
Alternatively, use mergelevels to merge the original levels into the new
levels directly:

o5 = mergelevels(o4,{'species1'},'low');
o5 = mergelevels(o5,{'species2','species3'},'high');
getlabels(o5)
ans =
'low' 'high'
The merged levels are removed and replaced with the new levels.
Categorical arrays can be combined by concatenation, subject to the following
rules:

• Only categorical arrays of the same type can be combined. You cannot
concatenate a nominal array with an ordinal array.
• Only ordinal arrays with the same levels, in the same order, can be
combined.
• Nominal arrays with different levels can be combined to produce a nominal
array whose levels are the union of the levels in the component arrays.
First use ordinal to create ordinal arrays from the variables for sepal length
and sepal width in meas. Categorize the data as short or long depending on
whether they are below or above the median of the variable, respectively:

sl = meas(:,1); % Sepal length data
sw = meas(:,2); % Sepal width data
SL1 = ordinal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW1 = ordinal(sw,{'short','long'},[],...
[min(sw),median(sw),max(sw)]);
Because SL1 and SW1 are ordinal arrays with the same levels, in the same
order, they can be concatenated:
S1 = [SL1,SW1];
S1(1:10,:)
ans =
short long
short long
short long
short long
short long
short long
short long
short long
short short
short long
If, on the other hand, the measurements are cast as nominal, different levels
can be used for the different variables, and the two nominal arrays can still
be combined:
SL2 = nominal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW2 = nominal(sw,{'skinny','wide'},[],...
[min(sw),median(sw),max(sw)]);
S2 = [SL2,SW2];
getlabels(S2)
ans =
'short' 'long' 'skinny' 'wide'
S2(1:10,:)
ans =
short wide
short wide
short wide
short wide
short wide
short wide
short wide
short wide
short skinny
short wide
SetosaObs = ismember(n1,'setosa');

Since the code above compares elements of n1 to a single value, the same
operation is carried out by the equality operator:

SetosaObs = (n1 == 'setosa');
The SetosaObs variable is used to index into meas to extract only the setosa
data:
SetosaData = meas(SetosaObs,:);
Categorical arrays are also used as grouping variables. The following plot
summarizes the sepal length data in meas by category:
boxplot(sl,n1)
Dataset Arrays
• “Statistical Data” on page 2-23
• “Dataset Arrays” on page 2-24
• “Using Dataset Arrays” on page 2-25
Statistical Data
MATLAB data containers (variables) are suitable for completely homogeneous
data (numeric, character, and logical arrays) and for completely heterogeneous
data (cell and structure arrays). Statistical data, however, are often a mixture
of homogeneous variables of heterogeneous types and sizes. Dataset arrays
are suitable containers for this kind of data.
Dataset Arrays
Dataset arrays are variables created with dataset. For example, the
following creates a dataset array from observations that are a combination of
categorical and numerical measurements:
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
species SL SW PL PW
Obs1 setosa 5.1 3.5 1.4 0.2
Obs2 setosa 4.9 3 1.4 0.2
Obs3 setosa 4.7 3.2 1.3 0.2
Obs4 setosa 4.6 3.1 1.5 0.2
Obs5 setosa 5 3.6 1.4 0.2
When creating a dataset array, variable names and observation names can be
assigned together with the data. Other metadata associated with the array
can be assigned with set and accessed with get, as shown in “Using Dataset
Arrays” below.
Constructing Dataset Arrays. Load the 150-by-4 numerical array meas and
the 150-by-1 cell array of strings species:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
Use dataset to create a dataset array iris from the data, assigning variable
names species, SL, SW, PL, and PW and observation names Obs1, Obs2, Obs3,
etc.:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
species SL SW PL PW
Obs1 setosa 5.1 3.5 1.4 0.2
Obs2 setosa 4.9 3 1.4 0.2
Obs3 setosa 4.7 3.2 1.3 0.2
Obs4 setosa 4.6 3.1 1.5 0.2
Obs5 setosa 5 3.6 1.4 0.2
desc = 'Fisher''s iris data (1936)';
units = {'' 'cm' 'cm' 'cm' 'cm'};
info = 'http://en.wikipedia.org/wiki/R.A._Fisher';

iris = set(iris,'Description',desc,...
'Units',units,...
'UserData',info);
get(iris)
Description: 'Fisher's iris data (1936)'
Units: {'' 'cm' 'cm' 'cm' 'cm'}
DimNames: {'Observations' 'Variables'}
UserData: 'http://en.wikipedia.org/wiki/R.A._Fisher'
ObsNames: {150x1 cell}
VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
get(iris(1:5,:),'ObsNames')
ans =
'Obs1'
'Obs2'
'Obs3'
'Obs4'
'Obs5'
Use parenthesis indexing to extract a subset of the observations and
variables into a new dataset array:

iris1 = iris(1:5,2:3)
iris1 =
SL SW
Obs1 5.1 3.5
Obs2 4.9 3
Obs3 4.7 3.2
Obs4 4.6 3.1
Obs5 5 3.6
Similarly, use parenthesis indexing to assign new data to the first variable
in iris1:
iris1(:,1) = dataset([5.2;4.9;4.6;4.6;5])
iris1 =
SL SW
Obs1 5.2 3.5
Obs2 4.9 3
Obs3 4.6 3.2
Obs4 4.6 3.1
Obs5 5 3.6
Observation names and variable names can also be used to index:

SepalObs = iris1({'Obs1','Obs3','Obs5'},'SL')
SepalObs =
SL
Obs1 5.2
Obs3 4.6
Obs5 5
The following code uses dot indexing to extract the sepal lengths in iris1
corresponding to sepal widths greater than 3:

SepalLengths = iris1.SL(iris1.SW > 3);
Dot indexing also allows entire variables to be deleted from a dataset array:
iris1.SL = []
iris1 =
SW
Obs1 3.5
Obs2 3
Obs3 3.2
Obs4 3.1
Obs5 3.6
Dynamic variable naming works for dataset arrays just as it does for structure
arrays. For example, the units of the SW variable are changed in iris1 as
follows:
varname = 'SW';
iris1.(varname) = iris1.(varname)*10
iris1 =
SW
Obs1 35
Obs2 30
Obs3 32
Obs4 31
Obs5 36
Update the units metadata accordingly:

iris1 = set(iris1,'Units',{'mm'});
Curly brace indexing is used to access individual data elements. The following
are equivalent:
iris1{1,1}
ans =
35
iris1{'Obs1','SW'}
ans =
35
Dataset arrays can be concatenated like numerical arrays. The following
combines two variable subsets of iris horizontally:

SepalData = iris(:,{'SL','SW'});
PetalData = iris(:,{'PL','PW'});
newiris = [SepalData,PetalData];
size(newiris)
ans =
150 4
The following concatenates variables within a dataset array and then deletes
the component variables:
newiris.SepalData = [newiris.SL,newiris.SW];
newiris.PetalData = [newiris.PL,newiris.PW];
newiris(:,{'SL','SW','PL','PW'}) = [];
size(newiris)
ans =
150 2
size(newiris.SepalData)
ans =
150 2
The join method combines dataset arrays using a key variable. The following
creates a second dataset array keyed by species and joins it with iris,
replicating each cc value across the corresponding observations:

snames = nominal({'setosa';'versicolor';'virginica'});
CC = dataset({snames,'species'},{[38;108;70],'cc'})
CC =
species cc
setosa 38
versicolor 108
virginica 70
iris2 = join(iris,CC);
Variables and observations can be deleted from a dataset array ds in several
equivalent ways:

ds.var = [];                     % Delete variable var by name
ds(:,j) = [];                    % Delete variable j by assignment
ds = ds(:,[1:(j-1) (j+1):end]);  % Delete variable j by indexing
ds(i,:) = [];                    % Delete observation i by assignment
ds = ds([1:(i-1) (i+1):end],:);  % Delete observation i by indexing
The summary method provides summary statistics for the component variables:

summary(newiris)
Fisher's iris data (1936)
SepalData: [150x2 double]
min 4.3000 2
1st Q 5.1000 2.8000
median 5.8000 3
3rd Q 6.4000 3.3250
max 7.9000 4.4000
PetalData: [150x2 double]
min 1 0.1000
1st Q 1.6000 0.3000
median 4.4000 1.3000
3rd Q 5.1000 1.8000
max 6.9000 4.2000
Use dot indexing to compute statistics on a single variable:

SepalMeans = mean(newiris.SepalData)
SepalMeans =
5.8294 3.0503
To compute statistics on all of the variables at once, use datasetfun:

means = datasetfun(@mean,newiris,'UniformOutput',false)
means =
[1x2 double] [1x2 double]
SepalMeans = means{1}
SepalMeans =
5.8294 3.0503
covs = datasetfun(@cov,newiris,'UniformOutput',false)
covs =
[2x2 double] [2x2 double]
SepalCovs = covs{1}
SepalCovs =
0.6835 -0.0373
-0.0373 0.2054
An equivalent computation converts a variable to a numerical array with
double:

SepalCovs = cov(double(newiris(:,1)))
SepalCovs =
0.6835 -0.0373
-0.0373 0.2054
Grouped Data
In this section...
“Grouping Variables” on page 2-34
“Functions for Grouped Data” on page 2-35
“Using Grouping Variables” on page 2-36
Grouping Variables
Grouping variables are utility variables used to indicate which elements
in a data set are to be considered together when computing statistics and
creating visualizations. They may be numeric vectors, string arrays, cell
arrays of strings, or categorical arrays. Logical vectors can be used to indicate
membership (or not) in a single group.
Grouping variables have the same length as the variables (columns) in a data
set. Observations (rows) i and j are considered to be in the same group if the
values of the corresponding grouping variable are identical at those indices.
Grouping variables with multiple columns are used to specify different groups
within multiple variables.
For example, the following loads the 150-by-4 numerical array meas and the
150-by-1 cell array of strings species into the workspace:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively)
over three species of iris (setosa, versicolor, and virginica). To group the
observations by species, the following are all acceptable (and equivalent)
grouping variables:

group1 = species;          % cell array of strings
group2 = nominal(species); % categorical (nominal) array
Functions for Grouped Data
Statistics Toolbox functions accept grouping variables as input arguments.
For a full description of the syntax of any particular function, and examples
of its use, consult its reference page, linked from the table. “Using Grouping
Variables” on page 2-36 also includes examples.
Using Grouping Variables
Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings
species:

load fisheriris
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
group = nominal(species);
Compute some basic statistics for the data (median and interquartile range),
by group, using the grpstats function:
[order,number,group_median,group_iqr] = ...
grpstats(meas,group,{'gname','numel',@median,@iqr})
order =
'setosa'
'versicolor'
'virginica'
number =
50 50 50 50
50 50 50 50
50 50 50 50
group_median =
5.0000 3.4000 1.5000 0.2000
5.9000 2.8000 4.3500 1.3000
6.5000 3.0000 5.5500 2.0000
group_iqr =
0.4000 0.5000 0.2000 0.1000
0.7000 0.5000 0.6000 0.3000
0.7000 0.4000 0.8000 0.5000
To improve the labeling of the data, create a dataset array (see “Dataset
Arrays” on page 2-23) from meas:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({group,'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
When you call grpstats with a dataset array as an argument, you invoke the
grpstats method of the dataset class, rather than the grpstats function.
The method has a slightly different syntax than the function, but it returns
the same results, with better labeling:
stats = grpstats(iris,'species',{@median,@iqr})
stats =
species GroupCount
setosa setosa 50
versicolor versicolor 50
virginica virginica 50
median_SL iqr_SL
setosa 5 0.4
versicolor 5.9 0.7
virginica 6.5 0.7
median_SW iqr_SW
setosa 3.4 0.5
versicolor 2.8 0.5
virginica 3 0.4
median_PL iqr_PL
setosa 1.5 0.2
versicolor 4.35 0.6
virginica 5.55 0.8
median_PW iqr_PW
setosa 0.2 0.1
versicolor 1.3 0.3
virginica 2 0.5
Grouping variables are also used to create visualizations by group. The
following creates a grouped scatter plot of the sepal data for two of the
species:

subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
iris.SW(subset),...
scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')
3
Descriptive Statistics
Introduction
You may need to summarize large, complex data sets—both numerically
and visually—to convey their essence to the data analyst and to allow for
further processing. This chapter focuses on numerical summaries; Chapter 4,
“Statistical Visualization” focuses on visual summaries.
Measures of Central Tendency
The following table lists the functions that calculate the measures of
central tendency.

Function
Name Description
geomean Geometric mean
harmmean Harmonic mean
mean Arithmetic average
median 50th percentile
trimmean Trimmed mean
The average is a simple and popular estimate of location. If the data sample
comes from a normal distribution, then the sample mean is also optimal (the
minimum variance unbiased estimator, or MVUE, of µ).
The median and trimmed mean are two measures that are resistant (robust)
to outliers. The median is the 50th percentile of the sample, which will only
change slightly if you add a large perturbation to any value. The idea behind
the trimmed mean is to ignore a small percentage of the highest and lowest
values of a sample when determining the center of the sample.
The geometric mean and harmonic mean, like the average, are not robust
to outliers. They are useful when the sample is distributed lognormal or
heavily skewed.
The following example shows the behavior of the measures of location for a
sample with one outlier.
x = [ones(1,6) 100]
x =
1 1 1 1 1 1 100

locate = [geomean(x) harmmean(x) mean(x) median(x) trimmean(x,25)]
locate =
1.9307 1.1647 15.1429 1.0000 1.0000
You can see that the mean is far from any data value because of the influence
of the outlier. The median and trimmed mean ignore the outlying value and
describe the location of the rest of the data values.
Measures of Dispersion
The purpose of measures of dispersion is to find out how spread out the data
values are on the number line. Another term for these statistics is measures
of spread.
Function
Name Description
iqr Interquartile range
mad Mean absolute deviation
moment Central moment of all orders
range Range
std Standard deviation
var Variance
The range (the difference between the maximum and minimum values) is the
simplest measure of spread. But if there is an outlier in the data, it will be the
minimum or maximum value. Thus, the range is not robust to outliers.
The standard deviation and the variance are popular measures of spread that
are optimal for normally distributed samples. The sample variance is the
MVUE of the normal parameter σ². The standard deviation is the square root
of the variance and has the desirable property of being in the same units as
the data. That is, if the data is in meters, the standard deviation is in meters
as well. The variance is in meters², which is more difficult to interpret.
Neither the standard deviation nor the variance is robust to outliers. A data
value that is separate from the body of the data can increase the value of the
statistics by an arbitrarily large amount.
The mean absolute deviation (MAD) is also sensitive to outliers. But the
MAD does not move quite as much as the standard deviation or variance in
response to bad data.
The interquartile range (IQR) is the difference between the 75th and 25th
percentile of the data. Since only the middle 50% of the data affects this
measure, it is robust to outliers.
The following example shows the behavior of the measures of dispersion for a
sample with one outlier.
x = [ones(1,6) 100]
x =
1 1 1 1 1 1 100

stats = [iqr(x) mad(x) range(x) std(x)]
stats =
0 24.2449 99.0000 37.4185
Measures of Shape
Quantiles and percentiles provide information about the shape of data as
well as its location and spread.
1 The n sorted data points are taken to be the 0.5/n, 1.5/n, ..., (n–0.5)/n
quantiles.

2 Linear interpolation is used to compute quantiles for probabilities between
0.5/n and (n–0.5)/n.

3 The data min or max are assigned to quantiles outside that range.

4 Missing values are treated as NaN, and removed from the data.
The following example shows the result of looking at every quartile (quantiles
with orders that are multiples of 0.25) of a sample containing a mixture of
two distributions.
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];
p = 100*(0:0.25:1);
y = prctile(x,p);
z = [p;y]
z =
0 25.0000 50.0000 75.0000 100.0000
1.8293 4.6728 5.6459 6.0766 7.1546
boxplot(x)
The long lower tail and plus signs show the lack of symmetry in the sample
values. For more information on box plots, see “Box Plots” on page 4-6.
Resampling Statistics
In this section...
“The Bootstrap” on page 3-9
“The Jackknife” on page 3-12
“Parallel Computing Support for Resampling Methods” on page 3-13
The Bootstrap
The bootstrap procedure involves choosing random samples with replacement
from a data set and analyzing each sample the same way. Sampling with
replacement means that each observation is selected separately at random
from the original dataset. So a particular data point from the original data
set could appear multiple times in a given bootstrap sample. The number of
elements in each bootstrap sample equals the number of elements in the
original data set. The range of sample estimates you obtain enables you to
establish the uncertainty of the quantity you are estimating.
This example from Efron and Tibshirani [33] compares Law School Admission
Test (LSAT) scores and subsequent law school grade point average (GPA) for
a sample of 15 law schools.
load lawdata
plot(lsat,gpa,'+')
lsline
The least-squares fit line indicates that higher LSAT scores go with higher
law school GPAs. But how certain is this conclusion? The plot provides some
intuition, but nothing quantitative.
You can calculate the correlation coefficient of the variables using the corr
function.
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Now you have a number describing the positive connection between LSAT
and GPA; though it may seem large, you still do not know if it is statistically
significant.
Using the bootstrp function you can resample the lsat and gpa vectors as
many times as you like and consider the variation in the resulting correlation
coefficients.
Here is an example.
rhos1000 = bootstrp(1000,'corr',lsat,gpa);
This command resamples the lsat and gpa vectors 1000 times and computes
the corr function on each sample. Here is a histogram of the result.
hist(rhos1000,30)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
To quantify this variation, compute a bootstrap confidence interval for the
correlation coefficient with bootci:

ci = bootci(5000,@corr,lsat,gpa)
ci =
0.3313
0.9427
Although the bootci function computes the Bias Corrected and accelerated
(BCa) interval as the default type, it is also able to compute various other
types of bootstrap confidence intervals, such as the studentized bootstrap
confidence interval.
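For example, a studentized interval can be requested with the 'type'
parameter (a sketch; the default number of inner bootstrap replications is
used to studentize the statistic):

ci_stud = bootci(5000,{@corr,lsat,gpa},'type','student')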
The Jackknife
Similar to the bootstrap is the jackknife, which uses resampling to estimate
the bias of a sample statistic. Sometimes it is also used to estimate standard
error of the sample statistic. The jackknife is implemented by the Statistics
Toolbox function jackknife.
As an example, estimate the bias of the sample correlation coefficient for the
law school data. Start by computing the sample correlation:

load lawdata
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Next compute the correlations for jackknife samples, and compute their mean:
jackrho = jackknife(@corr,lsat,gpa);
meanrho = mean(jackrho)
meanrho =
0.7759
Now compute the jackknife estimate of the bias:

n = length(lsat);
biasrho = (n-1) * (meanrho-rhohat)
biasrho =
-0.0065
Parallel Computing Support for Resampling Methods
The following Statistics Toolbox functions support resampling in parallel:

• bootci
• bootstrp
• crossval
• jackknife
• TreeBagger
• TreeBagger.growTrees
These functions compute in parallel when all of the following conditions hold:

• You have a license for Parallel Computing Toolbox™ software and the
software is installed.
• A group of processors has been prepared for parallel computation using the
matlabpool command of the Parallel Computing Toolbox.
• The option UseParallel is set to 'always'. The default value of this option
is 'never'. You specify this option using the 'Options' argument that all
of these resampling functions accept.
When these conditions hold, the functions resample in parallel. For more
information on the Parallel Computing Toolbox, see its User Guide.
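A minimal sketch of this setup, assuming the Parallel Computing Toolbox is
installed and using the lawdata variables from earlier in this chapter:

matlabpool open                 % prepare a pool of workers
opt = statset('UseParallel','always');
rhos = bootstrp(1000,@corr,lsat,gpa,'Options',opt);
matlabpool close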
Suppose, for example, you want to apply the jackknife to your function
userfcn, which calls parfor, and you wish to call jackknife in a loop.
Suppose also that the conditions for parallel resampling of bootstrp, as given
in the section above, are satisfied. The following figure shows three cases:
Data with Missing Values
MATLAB represents missing values with NaN (Not a Number). NaN values
propagate through arithmetic, so statistics computed on data containing
NaN values are themselves NaN. For example:
X = magic(3);
X([1 5]) = [NaN NaN]
X =
NaN 1 6
3 NaN 7
4 9 2
s1 = sum(X)
s1 =
NaN NaN 15
Removing the NaN values would destroy the matrix structure. Removing
the rows containing the NaN values would discard data. Statistics Toolbox
functions in the following table remove NaN values only for the purposes of
computation.
Function Description
nancov Covariance matrix, ignoring NaN values
nanmax Maximum, ignoring NaN values
nanmean Mean, ignoring NaN values
nanmedian Median, ignoring NaN values
nanmin Minimum, ignoring NaN values
nanstd Standard deviation, ignoring NaN values
nansum Sum, ignoring NaN values
nanvar Variance, ignoring NaN values
For example:
s2 = nansum(X)
s2 =
7 10 15
Other Statistics Toolbox functions also ignore NaN values. These include iqr,
kurtosis, mad, prctile, range, skewness, and trimmean.
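For instance, prctile removes the NaN values in each column of X before
computing the percentiles:

p50 = prctile(X,50)   % column medians of X above: 3.5 5 6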
4
Statistical Visualization
Introduction
Statistics Toolbox data visualization functions add to the extensive graphics
capabilities already in MATLAB.

• Scatter plots are a basic visualization tool for multivariate data. They
are used to identify relationships among variables. Grouped versions of
these plots use different plotting symbols to indicate group membership.
The gname function is used to label points on these plots with a text label
or an observation number.
• Box plots display a five-number summary of a set of data: the median,
the two ends of the interquartile range (the box), and two extreme values
(the whiskers) above and below the box. Because they show less detail
than histograms, box plots are most useful for side-by-side comparisons
of two distributions.
• Distribution plots help you identify an appropriate distribution family
for your data. They include normal and Weibull probability plots,
quantile-quantile plots, and empirical cumulative distribution plots.
Scatter Plots
A scatter plot is a simple plot of one variable against another. The MATLAB
functions plot and scatter produce scatter plots. The MATLAB function
plotmatrix can produce a matrix of such plots showing the relationship
between several pairs of variables.
Suppose you want to examine the weight and mileage of cars from three
different model years.
load carsmall
gscatter(Weight,MPG,Model_Year,'','xos')
This shows that not only is there a strong relationship between the weight of
a car and its mileage, but also that newer cars tend to be lighter and have
better gas mileage than older cars.
The default arguments for gscatter produce a scatter plot with the different
groups shown with the same symbol but different colors. The last two
arguments above request that all groups be shown in default colors and with
different symbols.
The carsmall data set contains other variables that describe different aspects
of cars. You can examine several of them in a single display by creating a
grouped plot matrix.
% One possible choice of variables (assumed here; Horsepower in the last
% column and MPG in the first row match the subplot described below):
xvars = [Weight Displacement Horsepower];
yvars = [MPG Acceleration];
gplotmatrix(xvars,yvars,Model_Year,'','xos')
The upper right subplot displays MPG against Horsepower, and shows that
over the years the horsepower of the cars has decreased but the gas mileage
has improved.
The gplotmatrix function can also graph all pairs from a single list of
variables, along with histograms for each variable. See “MANOVA” on page
8-39.
Box Plots
The graph below, created with the boxplot command, compares petal lengths
in samples from two species of iris.
load fisheriris
s1 = meas(51:100,3);
s2 = meas(101:150,3);
boxplot([s1 s2],'notch','on',...
'labels',{'versicolor','virginica'})
• The tops and bottoms of each “box” are the 25th and 75th percentiles of the
samples, respectively. The distances between the tops and bottoms are the
interquartile ranges.
• The line in the middle of each box is the sample median. If the median is
not centered in the box, it shows sample skewness.
• The whiskers are lines extending above and below each box. Whiskers are
drawn from the ends of the interquartile ranges to the furthest observations
within the whisker length (the adjacent values).
• Observations beyond the whisker length are marked as outliers. By
default, an outlier is a value that is more than 1.5 times the interquartile
range away from the top or bottom of the box, but this value can be adjusted
with additional input arguments. Outliers are displayed with a red + sign.
• Notches display the variability of the median between samples. The width
of a notch is computed so that box plots whose notches do not overlap (as
above) have different medians at the 5% significance level. The significance
level is based on a normal distribution assumption, but comparisons of
medians are reasonably robust for other distributions. Comparing box-plot
medians is like a visual hypothesis test, analogous to the t test used for
means.
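A formal counterpart to comparing notches is a two-sample test on the same
data. For example (a sketch using the samples defined above):

[h,p] = ttest2(s1,s2)   % h = 1 rejects equal means at the 5% level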
Distribution Plots
In this section...
“Normal Probability Plots” on page 4-8
“Quantile-Quantile Plots” on page 4-10
“Cumulative Distribution Plots” on page 4-12
“Other Probability Plots” on page 4-14
Normal Probability Plots
The following example shows a normal probability plot created with the
normplot function.
x = normrnd(10,1,25,1);
normplot(x)
The plus signs plot the empirical probability versus the data value for each
point in the data. A solid line connects the 25th and 75th percentiles in the
data, and a dashed line extends it to the ends of the data. The y-axis values
are probabilities from zero to one, but the scale is not linear. The distance
between tick marks on the y-axis matches the distance between the quantiles
of a normal distribution. The quantiles are close together near the median
(probability = 0.5) and stretch out symmetrically as you move away from
the median.
In a normal probability plot, if all the data points fall near the line, an
assumption of normality is reasonable. Otherwise, the points will curve away
from the line, and an assumption of normality is not justified.
For example:
x = exprnd(10,100,1);
normplot(x)
The plot is strong evidence that the underlying distribution is not normal.
Quantile-Quantile Plots
Quantile-quantile plots are used to determine whether two samples come from
the same distribution family. They are scatter plots of quantiles computed
from each sample, with a line drawn between the first and third quartiles. If
the data falls near the line, it is reasonable to assume that the two samples
come from the same distribution. The method is robust with respect to
changes in the location and scale of either distribution.
For example, the following compares two samples drawn from Poisson
distributions with different parameters:

x = poissrnd(10,50,1);
y = poissrnd(5,100,1);
qqplot(x,y);
Even though the parameters and sample sizes are different, the approximate
linear relationship suggests that the two samples may come from the same
distribution family. As with normal probability plots, hypothesis tests,
as described in Chapter 7, “Hypothesis Tests”, can provide additional
justification for such an assumption. For statistical procedures that depend
on the two samples coming from the same distribution, however, a linear
quantile-quantile plot is often sufficient.
The following example shows what happens when the underlying distributions
are not the same.
x = normrnd(5,1,100,1);
y = wblrnd(2,0.5,100,1);
qqplot(x,y);
These samples clearly are not from the same distribution family.
Cumulative Distribution Plots
To create an empirical cdf plot, use the cdfplot function (or ecdf and stairs).
The following example compares the empirical cdf for a sample from an
extreme value distribution with a plot of the cdf for the sampling distribution.
In practice, the sampling distribution would be unknown, and would be
chosen to match the empirical cdf.
y = evrnd(0,3,100,1);
cdfplot(y)
hold on
x = -20:0.1:10;
f = evcdf(x,0,3);
plot(x,f,'m')
legend('Empirical','Theoretical','Location','NW')
Other Probability Plots
The probplot function creates probability plots for distributions other than
the normal distribution.
For example, the following plot assesses two samples, one from a Weibull
distribution and one from a Rayleigh distribution, to see if they may have
come from a Weibull population.
x1 = wblrnd(3,3,100,1);
x2 = raylrnd(3,100,1);
probplot('weibull',[x1 x2])
legend('Weibull Sample','Rayleigh Sample','Location','NW')
The plot gives justification for modeling the first sample with a Weibull
distribution; much less so for the second sample.
5
Probability Distributions
The Statistics Toolbox provides several ways of working with both parametric
and nonparametric probability distributions, as described in the sections
that follow.
Supported Distributions
In this section...
“Parametric Distributions” on page 5-4
“Nonparametric Distributions” on page 5-8
Parametric Distributions
Discrete Distributions
Multivariate Distributions
Nonparametric Distributions
Name          pdf        cdf        inv        fit
Nonparametric ksdensity  ksdensity  ksdensity  ksdensity, dfittool
Working with Distributions Through GUIs
Exploring Distributions
To interactively see the influence of parameter changes on the shapes of the
pdfs and cdfs of supported Statistics Toolbox distributions, use the Probability
Distribution Function Tool.
To open the tool, enter the command disttool.

[Figure: the Probability Distribution Function Tool, with callouts for the
function plot, the function value, and the draggable reference lines.]
To open the Distribution Fitting Tool, enter the command dfittool.
The following figure shows the main window of the Distribution Fitting Tool.

[Figure: the main window, with callouts for the task buttons and the control
to import data from the workspace.]
Adjusting the Plot. Buttons at the top of the tool allow you to adjust the
plot displayed in the main window:
Displaying the Data. The Display Type field specifies the type of plot
displayed in the main window. Each type corresponds to a probability
function, for example, a probability density function. The following display
types are available:
Inputting and Fitting Data. The task buttons enable you to perform the
tasks necessary to fit distributions to data. Each button opens a new window
in which you perform the task. The buttons include
• Data — Import and manage data sets. See “Creating and Managing Data
Sets” on page 5-14.
• New Fit — Create new fits. See “Creating a New Fit” on page 5-19.
• Manage Fits — Manage existing fits. See “Managing Fits” on page 5-26.
• Evaluate — Evaluate fits at any points you choose. See “Evaluating Fits”
on page 5-28.
• Exclude — Create rules specifying which values to exclude when fitting a
distribution. See “Excluding Data” on page 5-32.
The display pane displays plots of the data sets and fits you create. Whenever
you make changes in one of the task windows, the results in the display pane
update.
The tool also enables you to:

• Save and load sessions. See “Saving and Loading Sessions” on page 5-38.
• Generate a file with which you can fit distributions to data and plot the
results independently of the Distribution Fitting Tool. See “Generating a
File to Fit and Plot Distributions” on page 5-46.
• Define and import custom distributions. See “Using Custom Distributions”
on page 5-48.
To begin, click the Data button in the main window of the Distribution Fitting
Tool to open the Data window shown in the following figure.
The Data window provides the following fields for importing data:

• Data—The drop-down list in the Data field contains the names of all
matrices and vectors, other than 1-by-1 matrices (scalars) in the MATLAB
workspace. Select the array containing the data you want to fit. The actual
data you import must be a vector. If you select a matrix in the Data field,
the first column of the matrix is imported by default. To select a different
column or row of the matrix, click Select Column or Row. This displays
the matrix in the Variable Editor, where you can select a row or column
by highlighting it with the mouse.
Alternatively, you can enter any valid MATLAB expression in the Data
field.
When you select a vector in the Data field, a histogram of the data is
displayed in the Data preview pane.
• Censoring—If some of the points in the data set are censored, enter
a Boolean vector, of the same size as the data vector, specifying the
censored entries of the data. A 1 in the censoring vector specifies that the
corresponding entry of the data vector is censored, while a 0 specifies that
the entry is not censored. If you enter a matrix, you can select a column or
row by clicking Select Column or Row. If you do not want to censor any
data, leave the Censoring field blank.
• Frequency—Enter a vector of positive integers of the same size as the
data vector to specify the frequency of the corresponding entries of the data
vector. For example, a value of 7 in the 15th entry of frequency vector
specifies that there are 7 data points corresponding to the value in the 15th
entry of the data vector. If all entries of the data vector have frequency 1,
leave the Frequency field blank.
• Data set name—Enter a name for the data set you import from the
workspace, such as My data.
After you have entered the information in the preceding fields, click Create
Data Set to create the data set My data.
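The workspace vectors themselves are created at the command line. A sketch
with hypothetical values, matching the field descriptions above:

data = [3.1 4.5 4.5 5.2 6.0]';  % data vector to import
cens = [0 0 0 1 0]';            % 1 marks a censored entry
freq = [1 2 1 1 3]';            % frequency of each data value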
Managing Data Sets. The Manage data sets pane enables you to view
and manage the data sets you create. When you create a data set, its name
appears in the Data sets list. The following figure shows the Manage data
sets pane after creating the data set My data.
For each data set in the Data sets list, you can
• Select the Plot check box to display a plot of the data in the main
Distribution Fitting Tool window. When you create a new data set, Plot is
selected by default. Clearing the Plot check box removes the data from the
plot in the main window. You can specify the type of plot displayed in the
Display Type field in the main window.
• If Plot is selected, you can also select Bounds to display confidence
interval bounds for the plot in the main window. These bounds are
pointwise confidence bounds around the empirical estimates of these
functions. The bounds are only displayed when you set Display Type in
the main window to one of the following:
- Cumulative probability (CDF)
- Survivor function
- Cumulative hazard
When you select a data set from the list, the following buttons are enabled:
Setting Bin Rules. To set bin rules for the histogram of a data set, click Set
Bin Rules. This opens the dialog box shown in the following figure.
• Bin width — Enter the width of each bin. If you select this option, you can
make the following choices:
- Automatic bin placement — Places the edges of the bins at integer
multiples of the Bin width.
- Bin boundary at — Enter a scalar to specify the boundaries of the
bins. The boundary of each bin is equal to this scalar plus an integer
multiple of the Bin width.
The Set Bin Width Rules dialog box also provides the following options:
• Apply to all existing data sets — When selected, the rule is applied to
all data sets. Otherwise, the rule is only applied to the data set currently
selected in the Data window.
• Save as default — When selected, the current rule is applied to any
new data sets that you create. You can also set default bin width rules
by selecting Set Default Bin Rules from the Tools menu in the main
window.
Creating a New Fit. To fit a distribution to a data set, click New Fit in
the main window of the Distribution Fitting Tool. This opens the New Fit
window. The following fields specify the fit:
• Fit Name—Enter a name for the fit in the Fit Name field.
• Data—The Data field contains a drop-down list of the data sets you
have created. Select the data set to which you want to fit a distribution.
• Distribution—Select the type of distribution you want to fit from the
Distribution drop-down list. See “Available Distributions” on page 5-22
for a list of distributions supported by the Distribution Fitting Tool.

Only the distributions that apply to the values of the selected data set are
displayed in the Distribution field. For example, positive distributions
are not displayed when the data include values that are zero or negative.

You can specify either a parametric or a nonparametric distribution. When
you select a parametric distribution from the drop-down list, a description
of its parameters is displayed in the pane below the Exclusion rule field.
The Distribution Fitting Tool estimates these parameters to fit the
distribution to the data set. When you select Nonparametric fit, options
for the fit appear in the pane, as described in “Further Options for
Nonparametric Fits” on page 5-23.
• Exclusion rule—You can specify a rule to exclude some of the data in
the Exclusion rule field. You can create an exclusion rule by clicking
Exclude in the main window of the Distribution Fitting Tool. For more
information, see “Excluding Data” on page 5-32.
Apply the New Fit. Click Apply to fit the distribution. For a parametric
fit, the Results pane displays the values of the estimated parameters. For a
nonparametric fit, the Results pane displays information about the fit.
When you click Apply, the main window of Distribution Fitting Tool displays
a plot of the distribution, along with the corresponding data.
Note When you click Apply, the title of the window changes to Edit Fit. You
can now make changes to the fit you just created and click Apply again to
save them. After closing the Edit Fit window, you can reopen it from the Fit
Manager window at any time to edit the fit.
Most, but not all, of the distributions available in the Distribution Fitting
Tool are supported elsewhere in Statistics Toolbox software (see “Supported
Distributions” on page 5-3), and have dedicated distribution fitting functions.
These functions are used to compute the majority of the fits in the Distribution
Fitting Tool, and are referenced in the list below.
Other fits are computed using functions internal to the Distribution Fitting
Tool. Distributions that do not have corresponding Statistics Toolbox
fitting functions are described in “Additional Distributions Available in the
Distribution Fitting Tool” on page 5-49.
Not all of the distributions listed below are available for all data sets. The
Distribution Fitting Tool determines the extent of the data (nonnegative, unit
interval, etc.) and displays appropriate distributions in the Distribution
drop-down list. Distribution data ranges are given parenthetically in the
list below.
• Beta (unit interval values) distribution, fit using the function betafit.
• Binomial (nonnegative integer values) distribution, fit using the function
binofit.
• Birnbaum-Saunders (positive values) distribution.
• Exponential (nonnegative values) distribution, fit using the function
expfit.
• Extreme value (all values) distribution, fit using the function evfit.
• Gamma (positive values) distribution, fit using the function gamfit.
• Generalized extreme value (all values) distribution, fit using the function
gevfit.
• Generalized Pareto (all values) distribution, fit using the function gpfit.
• Inverse Gaussian (positive values) distribution.
• Logistic (all values) distribution.
• Loglogistic (positive values) distribution.
• Lognormal (positive values) distribution, fit using the function lognfit.
• Nakagami (positive values) distribution.
• Negative binomial (nonnegative integer values) distribution, fit using the
function nbinfit.
• Nonparametric (all values) distribution, fit using the function ksdensity.
See “Further Options for Nonparametric Fits” on page 5-23 for a description
of available options.
• Normal (all values) distribution, fit using the function normfit.
• Poisson (nonnegative integer values) distribution, fit using the function
poissfit.
• Rayleigh (positive values) distribution, fit using the function raylfit.
• Rician (positive values) distribution.
• t location-scale (all values) distribution.
• Weibull (positive values) distribution, fit using the function wblfit.
Displaying Results
This section explains the different ways to display results in the main window
of the Distribution Fitting Tool. The main window displays plots of
• The data sets for which you select Plot in the Data window.
• The fits for which you select Plot in the Fit Manager window.
• Confidence bounds for
- Data sets for which you select Bounds in the Data window.
- Fits for which you select Bounds in the Fit Manager.
Display Type. The Display Type field in the main window specifies the type
of plot displayed. Each type corresponds to a probability function, for example,
a probability density function. The following display types are available:

• Density (PDF) — Displays a probability density function plot of the data.
• Cumulative probability (CDF) — Displays a cumulative probability
plot of the data.
• Quantile (inverse CDF) — Displays a quantile plot of the data.
• Probability plot — Displays a probability plot of the data against a
distribution that you select, such as:
- Lognormal
- Normal
- Rayleigh
- Weibull
In addition to these choices, you can create a probability plot against a
parametric fit that you create in the New Fit panel. These fits are added
at the bottom of the Distribution drop-down list when you create them.
• Survivor function — Displays a survivor function plot of the data.
• Cumulative hazard — Displays a cumulative hazard plot of the data.
Note Some of these distributions are not available if the plotted data
includes 0 or negative values.
Confidence Bounds. You can display confidence bounds for data sets and
fits, provided that you set Display Type to Cumulative probability (CDF),
Survivor function, Cumulative hazard, or Quantile for fits only.
• To display bounds for a data set, select Bounds next to the data set in the
Data sets pane of the Data window.
• To display bounds for a fit, select Bounds next to the fit in the Fit
Manager window. Confidence bounds are not available for all fit types.
To set the confidence level for the bounds, select Confidence Level from the
View menu in the main window and choose from the options.
Managing Fits
This section describes how to manage fits that you have created. To begin,
click the Manage Fits button in the main window of the Distribution Fitting
Tool. This opens the Fit Manager window as shown in the following figure.
The Table of fits displays a list of the fits you create, with the following
options:
• Plot—Select Plot to display a plot of the fit in the main window of the
Distribution Fitting Tool. When you create a new fit, Plot is selected by
default. Clearing the Plot check box removes the fit from the plot in the
main window.
• Bounds—If Plot is selected, you can also select Bounds to display
confidence bounds in the plot. The bounds are displayed when you set
Display Type in the main window to one of the following:
- Cumulative probability (CDF)
- Quantile (inverse CDF)
- Survivor function
- Cumulative hazard
Note You can only edit the currently selected fit in the Edit Fit window.
To edit a different fit, select it in the Table of fits and click Edit to open
another Edit Fit window.
Evaluating Fits
The Evaluate window enables you to evaluate any fit at whatever points you
choose. To open the window, click the Evaluate button in the main window
of the Distribution Fitting Tool. The following figure shows the Evaluate
window.
• Fit pane — Displays the names of existing fits. Select one or more fits
that you want to evaluate. Using your platform-specific functionality, you
can select multiple fits.
• Function — Select the type of probability function you want to evaluate
for the fit. The available functions are
- Density (PDF) — Computes a probability density function.
- Cumulative probability (CDF) — Computes a cumulative distribution
function.
- Quantile (inverse CDF) — Computes quantiles for specified cumulative
probabilities.
- Survivor function — Computes a survivor function.
- Cumulative hazard — Computes a cumulative hazard function.
Note The settings for Compute confidence bounds, Level, and Plot
function do not affect the plots that are displayed in the main window of
the Distribution Fitting Tool. The settings only apply to plots you create by
clicking Plot function in the Evaluate window.
Click Apply to apply these settings to the selected fit. The following figure
shows the results of evaluating the cumulative distribution function for the fit
My fit, created in “Example: Fitting a Distribution” on page 5-39, at the points
in the vector -3:0.5:3.
The window displays the following values in the columns of the table to the
right of the Fit pane:
• LB — The lower bounds for the confidence interval, if you select Compute
confidence bounds
• UB — The upper bounds for the confidence interval, if you select Compute
confidence bounds
Excluding Data
To exclude values from fit, click the Exclude button in the main window of
the Distribution Fitting Tool. This opens the Exclude window, in which you
can create rules for excluding specified values. You can use these rules to
exclude data when you create a new fit in the New Fit window. The following
figure shows the Exclude window.
To set a lower limit for the boundary of the excluded region, click Add
Lower Limit. This displays a vertical line on the left side of the plot
window. Move the line with the mouse to the point where you want
the lower limit, as shown in the following figure.
Moving the vertical line changes the value displayed in the Lower limit:
exclude data field in the Exclude window, as shown in the following figure.
Similarly, you can set the upper limit for the boundary of the excluded
region by clicking Add Upper Limit and moving the vertical line that
appears at the right side of the plot window. After setting the lower and
upper limits, click Close and return to the Exclude window.
3 Create Exclusion Rule—Once you have set the lower and upper limits
for the boundary of the excluded data, click Create Exclusion Rule
to create the new rule. The name of the new rule now appears in the
Existing exclusion rules pane.
When you select an exclusion rule in the Existing exclusion rules pane,
the following buttons are enabled:
• Copy — Creates a copy of the rule, which you can then modify. To save
the modified rule under a different name, click Create Exclusion Rule.
• View — Opens a new window in which you can see which data points
are excluded by the rule. The following figure shows a typical example.
The shaded areas in the plot graphically display which data points are
excluded. The table to the right lists all data points; the shaded rows
indicate excluded points.
• Rename — Renames the rule
• Delete — Deletes the rule
Once you define an exclusion rule, you can use it when you fit a distribution
to your data. The rule does not exclude points from the display of the data
set.
Saving a Session. To save the current session, select Save Session from
the File menu in the main window. This opens a dialog box that prompts you
to enter a filename, such as my_session.dfit, for the session. Clicking Save
saves the following items created in the current session:
• Data sets
• Fits
• Exclusion rules
• Plot settings
• Bin width rules
Example: Fitting a Distribution. This example fits a normal distribution
to random data using the Distribution Fitting Tool.
Step 1: Generate Random Data. To try the example, first generate some
random data to which you will fit a distribution. The following command
generates a vector data, of length 100, whose entries are random numbers
from a normal distribution with mean 0.36 and standard deviation 1.4.
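data = normrnd(0.36,1.4,100,1);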
Step 2: Import Data. To import the vector data into the Distribution
Fitting Tool, click the Data button in main window. This opens the window
shown in the following figure.
The Data field displays all numeric arrays in the MATLAB workspace. Select
data from the drop-down list, as shown in the following figure.
In the Data set name field, type a name for the data set, such as My data,
and click Create Data Set to create the data set. The main window of the
Distribution Fitting Tool now displays a larger version of the histogram in the
Data preview pane, as shown in the following figure.
Note Because the example uses random data, you might see a slightly
different histogram if you try this example for yourself.
Step 3: Create a New Fit. To fit a distribution to the data, click New Fit
in the main window of the Distribution Fitting Tool. This opens the window
shown in the following figure.
1 Enter a name for the fit, such as My fit, in the Fit name field.
2 Select My data from the Data drop-down list and Normal from the
Distribution drop-down list.
3 Click Apply.
The Results pane displays the mean and standard deviation of the normal
distribution that best fits My data, as shown in the following figure.
The main window of the Distribution Fitting Tool displays a plot of the
normal distribution with this mean and standard deviation, as shown in the
following figure.
Generating a File to Fit and Plot Distributions. Using the File menu
of the Distribution Fitting Tool, you can generate a file that
• Fits the distributions used in the current session to any data vector in the
MATLAB workspace.
• Plots the data and the fits.
After you end the current session, you can use the file to create plots in a
standard MATLAB figure window, without having to reopen the Distribution
Fitting Tool.
You can then apply the function normal_fit to any vector of data in the
MATLAB workspace. For example, the following commands
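% (A sketch: the new data and its parameter values are illustrative)
new_data = normrnd(4.1,12.5,100,1); % Any vector of data
normal_fit(new_data)                % Fit and plot with the generated file
legend('New Data','My fit')         % Relabel the data in the legend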
fit a normal distribution to a data set and generate a plot of the data and
the fit.
Note By default, the file labels the data in the legend using the same name as
the data set in the Distribution Fitting Tool. You can change the label using
the legend command, as illustrated by the preceding example.
Using Custom Distributions. You can define custom distributions for use
in the Distribution Fitting Tool by editing a file template provided with the
toolbox.
The template includes example code that computes the Laplace distribution,
beginning at the lines
% ------------------------------------------------------
% Remove the following return statement to define the
% Laplace distribution
% ------------------------------------------------------
return
To use this example, simply delete the command return and save the
file. If you save the template in a folder on the MATLAB path, under its
default name dfittooldists.m, the Distribution Fitting Tool reads it in
automatically when you start the tool. You can also save the template under a
different name, such as laplace.m, and then import the custom distribution
as described in the following section.
For a complete list of the distributions available for use with the Distribution
Fitting Tool, see “Supported Distributions” on page 5-3. Distributions listing
dfittool in the fit column of the tables in that section can be used with
the Distribution Fitting Tool.
Visually Exploring Random Number Generation

[Figure: the Random Number Generation Tool, showing a histogram of the
current sample, with controls for parameter values, parameter bounds,
additional parameters, sampling again from the same distribution, and
exporting the sample to the workspace.]
Using this tool, you can
• Use the controls at the bottom of the window to set parameter values for
the distribution and to change their upper and lower bounds.
• Draw another sample from the same distribution, with the same size and
parameters.
• Export the current sample to your workspace. A dialog box enables you
to provide a name for the sample.
Statistics Toolbox™ Distribution Functions

For example, the binomial pdf

$f(k) = \binom{n}{k} p^k (1-p)^{n-k}$

gives the probability of observing k successes in n independent trials, each
with success probability p.
The exponential pdf

$f(t) = \lambda e^{-\lambda t}$

is used to model the probability that a process with constant failure rate λ will
have a failure within time t. Each time t > 0 is assigned a positive probability
density. Densities are computed with the exppdf function:
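% (An illustrative rate: lambda = 2; exppdf takes the mean mu = 1/lambda)
t = 0.1:0.1:1;
f = exppdf(t,1/2) % Density values at the times t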
Probabilities for continuous pdfs can be computed with the quad function.
In the example above, the probability of failure in the time interval [0,1] is
computed as follows:
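% (Using the same illustrative rate lambda = 2, so mu = 1/2)
p = quad(@(t)exppdf(t,1/2),0,1) % Integrate the pdf over [0,1]

p =

    0.8647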
Kernel Smoothing. When no parametric form is assumed, you can estimate
a pdf by kernel smoothing. For example, load the car mileage data and view
its histogram:

cars = load('carsmall','MPG','Origin');
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The ksdensity function produces a smooth kernel density estimate of the
underlying distribution:
[f,x] = ksdensity(MPG);
plot(x,f);
title('Density estimate for MPG')
The first call to ksdensity returns the default bandwidth, u, of the kernel
smoothing function. Subsequent calls modify this bandwidth.
[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on
[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');
[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');
The green curve shows a density with the kernel bandwidth set too high.
This curve smooths out the data so much that the end result looks just like
the kernel function. The red curve has a smaller bandwidth and is rougher
looking than the blue curve. It may be too rough, but it does provide an
indication that there might be two major peaks rather than the single peak
of the blue curve. A reasonable choice of width might lead to a curve that is
intermediate between the red and blue curves.
Using default bandwidths, you can now plot the same mileage data, using
each of the available kernel functions.
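A minimal sketch of such a comparison (the subplot layout is a choice for
illustration, not part of the original example):

kernels = {'normal','box','triangle','epanechnikov'};
for k = 1:4
    subplot(2,2,k)
    [f,x] = ksdensity(MPG,'kernel',kernels{k});
    plot(x,f)
    title(kernels{k})
end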
The density estimates are roughly comparable, but the box kernel produces a
density that is rougher than the others.

Kernel density estimates can also compare subgroups of the data. For
example, group the MPG data by country of origin:

Origin = cellstr(cars.Origin);
I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);
[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on
[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')
[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')
legend('USA','Japan','Europe')
hold off
Cumulative Distribution Functions. A cumulative distribution function
(cdf) gives the probability that an observation is less than or equal to x. For
discrete distributions, the cdf is the sum

$F(x) = \sum_{y \le x} f(y)$

For continuous distributions, the cdf is the integral

$F(x) = \int_{-\infty}^{x} f(y)\,dy$

A cdf determines the probabilities of related events:
• P(y ≤ x) = F(x)
• P(y > x) = 1 – F(x)
• P(x1 < y ≤ x2) = F(x2) – F(x1)
For example, the t-statistic

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

compares a sample mean to a hypothesized population mean μ, in units of the
estimated standard error s/√n. Compute the statistic and the probability of
observing a larger value:
mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
0.4020
This probability is the same as the p value returned by a t-test of the null
hypothesis that the sample comes from a normal population with mean μ:
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020
Empirical Cumulative Distribution Functions. The ecdf function
computes an empirical cdf from sample data. It returns the values of a
function F such that F(x) represents the proportion of observations in a
sample less than or equal to x.
The idea behind the empirical cdf is simple. It is a function that assigns
probability 1/n to each of n observations in a sample. Its graph has a
stair-step appearance. If a sample comes from a distribution in a parametric
family (such as a normal distribution), its empirical cdf is likely to resemble
the parametric distribution. If not, its empirical distribution still gives an
estimate of the cdf for the distribution that generated the data.
x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);
stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)
For piecewise probability density estimation, using the empirical cdf in the
center of the distribution and Pareto distributions in the tails, see “Fitting
Piecewise Distributions” on page 5-72.
For continuous distributions, the inverse cdf returns the unique outcome
whose cdf value is the input cumulative probability.
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = expcdf(x,1) % Cumulative probabilities
p =
0.3935 0.5034 0.5934 0.6671 0.7275 0.7769
expinv(p,1) % Return original outcomes
ans =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
For discrete distributions, there may be no outcome whose cdf value is the
input cumulative probability. In these cases, the inverse cdf returns the first
outcome whose cdf value equals or exceeds the input cumulative probability.
For example, using a Poisson distribution with mean 1 (the probability
values here are illustrative):

p = [0.1 0.5 0.7 0.9]; % Cumulative probabilities
q = poissinv(p,1)

q =

     0     1     1     2
Distribution Statistics Functions. Statistics Toolbox functions compute
statistics, such as the mean and variance, of a distribution directly from
its parameters.
For example, the wblstat function can be used to visualize the mean of the
Weibull distribution as a function of its two distribution parameters:
a = 0.5:0.1:3;
b = 0.5:0.1:3;
[A,B] = meshgrid(a,b);
M = wblstat(A,B);
surfc(A,B,M)
Distribution Fitting Functions
The Statistics Toolbox function mle is a convenient front end to the individual
distribution fitting functions, and more. The function computes MLEs for
distributions beyond those for which Statistics Toolbox software provides
specific pdf functions.
For some pdfs, MLEs can be given in closed form and computed directly.
For other pdfs, a search for the maximum likelihood must be employed. The
search can be controlled with an options input argument, created using
the statset function. For efficient searches, it is important to choose a
reasonable distribution model and set appropriate convergence tolerances.
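For example, the following sketch fits a beta distribution by maximum
likelihood with a custom iteration limit (the data and the options here
are illustrative):

x = betarnd(2,5,100,1);        % Sample data on (0,1)
opts = statset('MaxIter',300); % Search control options
phat = mle(x,'distribution','beta','options',opts)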
MLEs can be heavily biased, especially for small samples. As sample size
increases, however, MLEs become unbiased minimum variance estimators
with approximate normal distributions. This is used to compute confidence
bounds for the estimates.
mu = 1; % Population parameter
n = 1e3; % Sample size
ns = 1e4; % Number of samples
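% Draw the samples and compute their means
% (an exponential population with mean mu is assumed for illustration)
samples = exprnd(mu,n,ns); % n-by-ns matrix of samples
means = mean(samples);     % ns sample means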
The Central Limit Theorem says that the means will be approximately
normally distributed, regardless of the distribution of the data in the samples.
The normfit function can be used to find the normal distribution that best
fits the means:
[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
1.0003
sigmahat =
0.0319
muci =
0.9997
1.0010
sigmaci =
0.0314
0.0323
The function returns MLEs for the mean and standard deviation and their
95% confidence intervals.
To visualize the distribution of sample means together with the fitted normal
distribution, you must scale the fitted pdf, with area = 1, to the area of the
histogram being used to display the means:
numbins = 50;
hist(means,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(means,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = normpdf(x,muhat,sigmahat);
plot(x,histarea*y,'r','LineWidth',2)
Fitting Piecewise Distributions

For example, generate sample data with a center and asymmetric tails:
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = fitdist(data,'tlocationscale');
h = probplot(gca,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Normal','Data','t','Location','NW')
On the other hand, the empirical distribution provides a perfect fit, but the
outliers make the tails very discrete:
ecdf(data)
The paretotails function provides a single, well-fit model for the entire
sample. The following uses generalized Pareto distributions (GPDs) for the
lower and upper 10% of the data:
pfit = paretotails(data,0.1,0.9)
pfit =
Piecewise distribution with 3 segments
-Inf < x < -1.30726 (0 < p < 0.1)
lower tail, GPD(-1.10167,1.12395)
x = -4:0.01:10;
plot(x,cdf(pfit,x))
Access information about the fit using the methods of the paretotails class.
Options allow for nonparametric estimation of the center of the cdf.
Negative Log-Likelihood Functions. Negative log-likelihood functions
measure how well candidate parameter values a explain a data set X through
the likelihood

$L(a) = \prod_{x \in X} f(a \mid x)$

where f(a|x) is the pdf evaluated at x, viewed as a function of the parameter
vector a.
For example, use gamrnd to generate a random sample from a specific gamma
distribution:
a = [1,2];
X = gamrnd(a(1),a(2),1e3,1);
Given X, the gamlike function can be used to visualize the likelihood surface
in the neighborhood of a:
mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
for j = 1:mesh
logL(i,j) = gamlike([a1(i),a2(j)],X);
end
end
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)
These can be compared to the MLEs returned by the gamfit function, which
uses a combination search and solve algorithm:
ahat = gamfit(X)
ahat =
1.0231 1.9728
The MLEs can be added to the surface plot (rotated to show the minimum):

hold on
plot3(ahat(1),ahat(2),gamlike(ahat,X),...
    'ro','MarkerSize',5,...
    'MarkerFaceColor','r')
Random Number Generators. The following Statistics Toolbox functions,
among others, use random number generators (RNGs):
• cvpartition
• hmmgenerate
• lhsdesign
• lhsnorm
• mhsample
• random
• randsample
• slicesample
By controlling the default random number stream and its state, you can
control how the RNGs in Statistics Toolbox software generate random values.
For example, to reproduce the same sequence of values from an RNG, you
can save and restore the default stream’s state, or reset the default stream.
For details on managing the default random number stream, see “Managing
the Default Stream”.
MATLAB initializes the default random number stream to the same state
each time it starts up. Thus, RNGs in Statistics Toolbox software will
generate the same sequence of values for each MATLAB session unless you
modify that state at startup. One simple way to do that is to add commands
to startup.m such as
stream = RandStream('mt19937ar','seed',sum(100*clock));
RandStream.setDefaultStream(stream);
Using Probability Distribution Objects
Probability distribution objects allow you to easily fit, access, and store
distribution information for a given data set. Operations that are easier to
perform using distribution objects include grouping a single data set by
category, fitting several distribution families to the same data, and saving
and sharing fitted distributions, as the examples in this section show.
If you are a novice statistician who would like to explore how various
distributions look without having to manipulate data, see “Working with
Distributions Through GUIs” on page 5-9.
If you have no data to fit, but want to calculate a pdf, cdf, and so on, for
various parameters, see “Statistics Toolbox Distribution Functions” on page 5-52.
Probability distribution objects are organized in a class hierarchy. One line of
inheritance runs from all probability distributions down to univariate
parametric probability distributions; another runs down to univariate
kernel distributions.
For example, create an object representing a normal distribution with mean
100 and standard deviation 10:
pd = ProbDistUnivParam('normal',[100 10])
Similarly, create a nonparametric kernel distribution object directly from
data:
load carsmall
pd = ProbDistUnivKernel(MPG)
Object-Supported Distributions
Object-oriented programming in the Statistics Toolbox supports the following
distributions.
Parametric Distributions
Use the following distributions to create ProbDistUnivParam objects using
fitdist. For more information on the cumulative distribution function (cdf)
and probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivParam class reference page.
Nonparametric Distributions
Use the following distributions to create ProbDistUnivKernel objects.
For more information on the cumulative distribution function (cdf) and
probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivKernel class reference page.
load carsmall
NormDist = fitdist(MPG,'normal')
NormDist =
normal distribution
mu = 23.7181
sigma = 8.03573
To fit distributions to data grouped by category, use the 'by' argument of
fitdist:

load carsmall
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin)
Warning: Error while fitting group 'Italy':
Not enough data in X to fit this distribution.
> In fitdist at 171
WeiByOrig =
  Columns 1 through 4

    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]

  Columns 5 through 6

    [1x1 ProbDistUnivParam]    []
Country =
'USA'
'France'
'Japan'
'Germany'
'Sweden'
'Italy'
A warning appears informing you that, since the data only represents one
Italian car, fitdist cannot fit a Weibull distribution to that group. Each
one of the five other groups now has a distribution object associated with it,
represented in the cell array WeiByOrig. Each object contains properties that
hold information about the data, the distribution, and the parameters. For
more information on what properties exist and what information they contain,
see ProbDistUnivParam or ProbDistUnivKernel.
Now you can easily compare PDFs using the pdf method of the
ProbDistUnivParam class:
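% Extract fits for American and Japanese cars from WeiByOrig
% (per the Country order above, USA is group 1 and Japan is group 3)
distusa = WeiByOrig{1};
distjapan = WeiByOrig{3};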
time = linspace(0,45);
pdfjapan = pdf(distjapan,time);
pdfusa = pdf(distusa,time);
hold on
plot(time,[pdfjapan;pdfusa])
l = legend('Japan','USA')
set(l,'Location','Best')
xlabel('MPG')
ylabel('Probability Density')
You could then further group the data and compare, for example, MPG by
year for American cars:
load carsmall
[WeiByYearOrig, Names] = fitdist(MPG,'weibull','by',...
{Origin Model_Year});
USA70 = WeiByYearOrig{1};
USA76 = WeiByYearOrig{2};
USA82 = WeiByYearOrig{3};
time = linspace(0,45);
pdf70 = pdf(USA70,time);
pdf76 = pdf(USA76,time);
pdf82 = pdf(USA82,time);
line(time,[pdf70;pdf76;pdf82])
l = legend('1970','1976','1982')
set(l,'Location','Best')
title('USA Car MPG by Year')
xlabel('MPG')
ylabel('Probability Density')
You can also fit several families of distributions to the same data and
compare the fits:
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Extract the fits for American cars and compare the fits visually against a
histogram of the original data:
WeiUSA = WeiByOrig{1};
NormUSA = NormByOrig{1};
LogUSA = LogByOrig{1};
KerUSA = KerByOrig{1};
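A sketch of one way to make the comparison (the grouping, bin count, and
colors here are choices for illustration):

data = MPG(strcmp(cellstr(Origin),'USA')); % American cars only
[counts,centers] = hist(data,15);
width = centers(2) - centers(1);
bar(centers,counts/(sum(counts)*width),1)  % Histogram scaled to unit area
hold on
xg = linspace(min(data),max(data),100);
plot(xg,pdf(WeiUSA,xg),'r',xg,pdf(NormUSA,xg),'g',...
     xg,pdf(LogUSA,xg),'m',xg,pdf(KerUSA,xg),'k')
legend('Data','Weibull','Normal','Logistic','Kernel')
hold off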
You can see that only the nonparametric kernel distribution, KerUSA, comes
close to revealing the two modes in the data.
To save fits so that you can use them later, first create the fit objects:
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Combine all four fits and the country labels into a single cell array, including
“headers” to indicate which distributions correspond to which objects. Then,
save the array to a .mat file:
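A sketch of one way to do this (the exact layout of the cell array is a choice):

AllFits = {'Weibull','Normal','Logistic','Kernel';
           WeiByOrig,NormByOrig,LogByOrig,KerByOrig};
save('CarSmallFits.mat','AllFits','Country'); % Creates CarSmallFits.mat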
To show that the data is both safely saved and easily restored, clear your
workspace of relevant variables. This command clears only those variables
associated with this example:
clear('Weight','Acceleration','AllFits','Country',...
'Cylinders','Displacement','Horsepower','KerByOrig',...
'LogByOrig','MPG','Model','Model_Year','NormByOrig',...
'Origin','WeiByOrig')
load CarSmallFits
AllFits
You can now access the distributions objects as in the previous examples.
Probability Distributions Used for Multivariate Modeling
Gaussian Mixture Models. You can create a gmdistribution object that
represents a Gaussian mixture distribution by specifying its component
means, covariances, and mixing proportions directly:
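% Component means, covariances, and mixing proportions
% (the values match the two-component example later in this section)
MU = [1 2; -3 -5];
SIGMA = cat(3,[2 0; 0 .5],[1 0; 0 1]); % 2-by-2-by-2 array
p = ones(1,2)/2;                       % Equal mixing proportions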
obj = gmdistribution(MU,SIGMA,p);
properties = fieldnames(obj)
properties =
'NDimensions'
'DistName'
'NComponents'
'PComponents'
'mu'
'Sigma'
'NlogL'
'AIC'
'BIC'
'Converged'
'Iters'
'SharedCov'
'CovType'
'RegV'
dimension = obj.NDimensions
dimension =
2
name = obj.DistName
name =
gaussian mixture distribution
Use the methods pdf and cdf to compute values and visualize the object:
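For example, visualize the pdf with ezsurf (a sketch; the plotting range
is illustrative):

ezsurf(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6])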
Fitting a Model to Data. You can also create Gaussian mixture models
by fitting a parametric model with a specified number of components to
data. The fit method of the gmdistribution class uses the syntax obj =
gmdistribution.fit(X,k), where X is a data matrix and k is the specified
number of components. Choosing a suitable number of components k is
essential for creating a useful model of the data—too few components fails to
model the data accurately; too many components leads to an over-fit model
with singular covariance matrices.
First, create some data from a mixture of two bivariate Gaussian distributions
using the mvnrnd function:
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
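% Second component and pooled sample
% (the values are consistent with the fitted model shown below)
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000); mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')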
options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
hold on
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
hold off
ComponentMeans = obj.mu
ComponentMeans =
0.9391 2.0322
-2.9823 -4.9737
ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
1.7786 -0.0528
-0.0528 0.5312
ComponentCovariances(:,:,2) =
1.0491 -0.0150
-0.0150 0.9816
MixtureProportions = obj.PComponents
MixtureProportions =
0.5000 0.5000
To choose the number of components, fit models with increasing numbers of
components and compare their AIC values:

AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
obj{k} = gmdistribution.fit(X,k);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
2
model = obj{2}
model =
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean: 0.9391 2.0322
Component 2:
Mixing proportion: 0.500000
Mean: -2.9823 -4.9737
Both the Akaike information criterion (AIC) and the Bayes information
criterion (BIC) are negative log-likelihoods for the data with penalty terms
for the number of estimated parameters. You can use them to determine an
appropriate number of components for a model when the number of
components is unspecified.
Simulating Gaussian Mixtures. To simulate data from a Gaussian mixture
distribution, use the random method:
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
Y = random(obj,1000);
scatter(Y(:,1),Y(:,2),10,'.')
Copulas
• “Determining Dependence Between Simulation Inputs” on page 5-108
• “Constructing Dependent Bivariate Distributions” on page 5-112
• “Using Rank Correlation Coefficients” on page 5-116
• “Using Bivariate Copulas” on page 5-119
• “Higher Dimension Copulas” on page 5-126
• “Archimedean Copulas” on page 5-128
• “Simulating Dependent Multivariate Data Using Copulas” on page 5-130
• “Example: Fitting Copulas to Data” on page 5-135
Determining Dependence Between Simulation Inputs
It can be difficult to generate random inputs with dependence when they have
distributions that are not from a standard multivariate distribution. Further,
some of the standard multivariate distributions can model only limited types
of dependence. It is always possible to make the inputs independent, and
while that is a simple choice, it is not always sensible and can lead to the
wrong conclusions.
For example, consider a Monte Carlo simulation with two lognormal inputs.
First generate them independently, using uncorrelated underlying normal
values:
n = 1000;
sigma = .5;
SigmaInd = sigma.^2 .* [1 0; 0 1]
SigmaInd =
0.25 0
0 0.25
ZInd = mvnrnd([0 0],SigmaInd,n);
XInd = exp(ZInd);
plot(XInd(:,1),XInd(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
Dependence between the inputs arises from correlation in the underlying
bivariate normal:

rho = .7;
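% Same construction as before, now with correlated underlying normals
SigmaDep = sigma.^2 .* [1 rho; rho 1];
ZDep = mvnrnd([0 0],SigmaDep,n);
XDep = exp(ZDep);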
A second scatter plot demonstrates the difference between these two bivariate
distributions:
plot(XDep(:,1),XDep(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
It is clear that there is a tendency in the second data set for large values of
X1 to be associated with large values of X2, and similarly for small values.
The correlation parameter, ρ, of the underlying bivariate normal determines
this dependence. The conclusions drawn from the simulation could well
depend on whether you generate X1 and X2 with dependence. The bivariate
lognormal distribution is a simple solution in this case; it easily generalizes
to higher dimensions in cases where the marginal distributions are different
lognormals.
Constructing Dependent Bivariate Distributions

To see how such constructions work, start with a sample from a standard
normal distribution:
n = 1000;
z = normrnd(0,1,n,1);
hist(z,-3.75:.5:3.75)
xlim([-4 4])
title('1000 Simulated N(0,1) Random Values')
xlabel('Z')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The normal cdf transforms these values so that they are uniformly
distributed on (0,1):
u = normcdf(z);
hist(u,.05:.1:.95)
title('1000 Simulated N(0,1) Values Transformed to Unif(0,1)')
xlabel('U')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
Any inverse cdf then transforms the uniform values to a desired distribution.
For example, apply a gamma inverse cdf:
x = gaminv(u,2,1);
hist(x,.25:.5:9.75)
title('1000 Simulated N(0,1) Values Transformed to Gamma(2,1)')
xlabel('X')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
This two-step transformation generalizes to the bivariate case. Starting with
correlated standard normal values

$Z = [Z_1, Z_2] \sim N\!\left([0,0],\ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$

apply the normal cdf Φ to obtain dependent uniform values

$U = [\Phi(Z_1), \Phi(Z_2)]$

and then transform to the desired marginals:

$X = [G_1(U_1), G_2(U_2)]$
where G1 and G2 are inverse cdfs of two possibly different distributions. For
example, the following generates random vectors from a bivariate distribution
with t5 and Gamma(2,1) marginals:
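A sketch of the construction (ρ = 0.7 is an illustrative value):

n = 1000;
rho = .7;
Z = mvnrnd([0 0],[1 rho; rho 1],n);
U = normcdf(Z);
X = [tinv(U(:,1),5) gaminv(U(:,2),2,1)];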
scatterhist(X(:,1),X(:,2))
This plot has histograms alongside a scatter plot to show both the marginal
distributions, and the dependence.
Using Rank Correlation Coefficients

The linear correlation of the transformed variables is not the same as the
correlation ρ of the underlying normals. For the bivariate lognormal
construction with common parameter σ, it is

$\mathrm{cor}(X_1, X_2) = \frac{e^{\rho \sigma^2} - 1}{e^{\sigma^2} - 1}$
which is strictly less than ρ, unless ρ is exactly 1. In more general cases such
as the Gamma/t construction, the linear correlation between X1 and X2 is
difficult or impossible to express in terms of ρ, but simulations show that the
same effect happens.
Rank correlation coefficients, such as Kendall’s τ and Spearman’s ρs, are
invariant under monotonic transformations like these. For a bivariate
normal with linear correlation ρ, Kendall’s τ satisfies

$\tau = \frac{2}{\pi} \arcsin(\rho) \quad\text{or}\quad \rho = \sin\!\left(\frac{\tau \pi}{2}\right)$

and Spearman’s ρs satisfies

$\rho_s = \frac{6}{\pi} \arcsin\!\left(\frac{\rho}{2}\right) \quad\text{or}\quad \rho = 2 \sin\!\left(\frac{\rho_s \pi}{6}\right)$

The following plot compares the three coefficients:
rho = -1:.01:1;
tau = 2.*asin(rho)./pi;
rho_s = 6.*asin(rho./2)./pi;
plot(rho,tau,'b-','LineWidth',2)
hold on
plot(rho,rho_s,'g-','LineWidth',2)
plot([-1 1],[-1 1],'k:','LineWidth',2)
axis([-1 1 -1 1])
xlabel('rho')
ylabel('Rank correlation coefficient')
legend('Kendall''s {\it\tau}', ...
'Spearman''s {\it\rho_s}', ...
'location','NW')
Thus, it is easy to create the desired rank correlation between X1 and X2,
regardless of their marginal distributions, by choosing the correct ρ parameter
value for the linear correlation between Z1 and Z2.
Using Bivariate Copulas

For example, use the copularnd function to create scatter plots of random
values from a bivariate Gaussian copula for various levels of ρ, to illustrate the
range of different dependence structures. The family of bivariate Gaussian
copulas is parameterized by the linear correlation matrix:
$P = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$
n = 500;
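A sketch of the scatter plots (the ρ values are illustrative):

rhos = [.8 .1 -.8];
for k = 1:3
    U = copularnd('Gaussian',[1 rhos(k); rhos(k) 1],n);
    subplot(1,3,k)
    plot(U(:,1),U(:,2),'.')
    title(['{\it\rho} = ' num2str(rhos(k))])
    xlabel('U1'), ylabel('U2')
end

A bivariate t copula adds a degrees-of-freedom parameter ν. For example,
plot the probability density of a t copula with ρ = 0.8 and ν = 5: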
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
Rho = [1 .8; .8 1];
f = copulapdf('t',[U1(:) U2(:)],Rho,5);
f = reshape(f,size(U1));
surf(u1,u2,log(f),'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Probability Density')
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
F = copulacdf('t',[U1(:) U2(:)],Rho,5);
F = reshape(F,size(U1));
surf(u1,u2,F,'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Cumulative Probability')
For example, use the copularnd function to create scatter plots of random
values from a bivariate t1 copula for various levels of ρ, to illustrate the range
of different dependence structures:
n = 500;
nu = 1;
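A sketch of the scatter plots (as before, the ρ values are illustrative):

rhos = [.8 .1 -.8];
for k = 1:3
    U = copularnd('t',[1 rhos(k); rhos(k) 1],nu,n);
    subplot(1,3,k)
    plot(U(:,1),U(:,2),'.')
    title(['{\it\rho} = ' num2str(rhos(k))])
    xlabel('U1'), ylabel('U2')
end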
As with the Gaussian copula, a t copula can be combined with arbitrary
marginal distributions:
n = 1000;
rho = .7;
nu = 1;
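% Generate from the t copula, then transform to the marginals
% (the Gamma(2,1) and t5 marginals are assumed, as in the Gaussian case)
U = copularnd('t',[1 rho; rho 1],nu,n);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];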
scatterhist(X(:,1),X(:,2))
Higher Dimension Copulas

Copulas generalize to more than two dimensions. For example, simulate data
from a trivariate distribution with Gamma(2,1), Beta(2,2), and t5 marginals
using a Gaussian copula:
n = 1000;
Rho = [1 .4 .2; .4 1 -.8; .2 -.8 1];
U = copularnd('Gaussian',Rho,n);
X = [gaminv(U(:,1),2,1) betainv(U(:,2),2,2) tinv(U(:,3),5)];
subplot(1,1,1)
plot3(X(:,1),X(:,2),X(:,3),'.')
grid on
view([-55, 15])
xlabel('X1')
ylabel('X2')
zlabel('X3')
Notice that the relationship between the linear correlation parameter ρ and,
for example, Kendall’s τ, holds for each entry in the correlation matrix P
used here. You can verify that the sample rank correlations of the data are
approximately equal to the theoretical values:
tauTheoretical = 2.*asin(Rho)./pi
tauTheoretical =
1 0.26198 0.12819
0.26198 1 -0.59033
0.12819 -0.59033 1
tauSample = corr(X,'type','Kendall')
tauSample =
1 0.27254 0.12701
0.27254 1 -0.58182
0.12701 -0.58182 1
Archimedean Copulas
Statistics Toolbox functions are available for three bivariate Archimedean
copula families:
• Clayton copulas
• Frank copulas
• Gumbel copulas
These are one-parameter families that are defined directly in terms of their
cdfs, rather than being defined constructively using a standard multivariate
distribution.
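For example, choose a linear correlation of ρ = 0.8 and compute the
corresponding Kendall’s τ (these starting values are illustrative):

rho = .8;
tau = 2*asin(rho)/pi % = 0.5903

Use copulaparam to find the Clayton copula parameter that corresponds
to this τ: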
alpha = copulaparam('Clayton',tau,'type','kendall')
alpha =
2.882
Finally, plot a random sample from the Clayton copula with copularnd.
Repeat the same procedure for the Frank and Gumbel copulas:
n = 500;
U = copularnd('Clayton',alpha,n);
subplot(3,1,1)
plot(U(:,1),U(:,2),'.');
title(['Clayton Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Frank',tau,'type','kendall');
U = copularnd('Frank',alpha,n);
subplot(3,1,2)
plot(U(:,1),U(:,2),'.')
title(['Frank Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Gumbel',tau,'type','kendall');
U = copularnd('Gumbel',alpha,n);
subplot(3,1,3)
plot(U(:,1),U(:,2),'.')
title(['Gumbel Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
Simulating Dependent Multivariate Data Using Copulas
Suppose you have return data for two stocks and want to run a Monte Carlo
simulation with inputs that follow the same distributions as the data:
load stockreturns
nobs = size(stocks,1);
subplot(2,1,1)
hist(stocks(:,1),10)
xlim([-3.5 3.5])
xlabel('X1')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
subplot(2,1,2)
hist(stocks(:,2),10)
xlim([-3.5 3.5])
xlabel('X2')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
You could fit a parametric model separately to each dataset, and use those
estimates as the marginal distributions. However, a parametric model may
not be sufficiently flexible. Instead, you can use a nonparametric model
to transform to the marginal distributions. All that is needed is a way to
compute the inverse cdf for the nonparametric model.
[Fi,xi] = ecdf(stocks(:,1));
stairs(xi,Fi,'b','LineWidth',2)
hold on
Fi_sm = ksdensity(stocks(:,1),xi,'function','cdf','width',.15);
plot(xi,Fi_sm,'r-','LineWidth',1.5)
xlabel('X1')
ylabel('Cumulative Probability')
legend('Empirical','Smoothed','Location','NW')
grid on
A t copula is specified by a linear correlation matrix and a degrees-of-freedom
parameter. For the correlation parameter, you can compute the rank
correlation of the data, and then find the corresponding linear correlation
parameter for the t copula using copulaparam:
nu = 5;
tau = corr(stocks(:,1),stocks(:,2),'type','kendall')
tau =
0.51798
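rho = copulaparam('t',tau,nu,'type','kendall')

rho =

    0.7268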
Next, use copularnd to generate random values from the t copula and
transform using the nonparametric inverse cdfs. The ksdensity function
allows you to make a kernel estimate of distribution and evaluate the inverse
cdf at the copula points all in one step:
n = 1000;
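% Simulate from the t copula and transform through the kernel inverse cdf
U = copularnd('t',[1 rho; rho 1],nu,n);
X1 = ksdensity(stocks(:,1),U(:,1),'function','icdf','width',.15);
X2 = ksdensity(stocks(:,2),U(:,2),'function','icdf','width',.15);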
Alternatively, when you have a large amount of data or need to simulate more
than one set of values, it may be more efficient to compute the inverse cdf
over a grid of values in the interval (0,1) and use interpolation to evaluate it
at the copula points:
p = linspace(0.00001,0.99999,1000);
G1 = ksdensity(stocks(:,1),p,'function','icdf','width',0.15);
X1 = interp1(p,G1,U(:,1),'spline');
G2 = ksdensity(stocks(:,2),p,'function','icdf','width',0.15);
X2 = interp1(p,G2,U(:,2),'spline');
scatterhist(X1,X2)
The marginal histograms of the simulated data are a smoothed version of the
histograms for the original data. The amount of smoothing is controlled by
the bandwidth input to ksdensity.
Example: Fitting Copulas to Data
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
scatterhist(x,y)
Transform the data to the copula scale (unit square) using a kernel estimator
of the cumulative distribution function:
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
scatterhist(u,v)
xlabel('u')
ylabel('v')
Fit a t copula:
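A sketch of the fit (the 'ApproximateML' method is one choice among the
copulafit options):

[Rho,nu] = copulafit('t',[u v],'Method','ApproximateML')

Then generate a random sample from the fitted copula: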
r = copularnd('t',Rho,nu,1000);
u1 = r(:,1);
v1 = r(:,2);
scatterhist(u1,v1)
xlabel('u')
ylabel('v')
set(get(gca,'children'),'marker','.')
Transform the random sample back to the original scale of the data:
x1 = ksdensity(u,u1,'function','icdf');
y1 = ksdensity(v,v1,'function','icdf');
scatterhist(x1,y1)
set(get(gca,'children'),'marker','.')
6 Random Number Generation
Random number generators (RNGs) like those in MATLAB are algorithms for
generating pseudorandom numbers with a specified distribution.
For more information on the GUI for generating random numbers from
supported distributions, see “Visually Exploring Random Number Generation”
on page 5-49.
Common Generation Methods
Direct Methods
Direct methods directly use the definition of the distribution.
function X = directbinornd(N,p,m,n)
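% (Body reconstructed as a sketch: a binomial value is the number of
% successes in N independent Bernoulli trials with success probability p)
X = zeros(m,n);        % Preallocate memory
for i = 1:m*n
    u = rand(N,1);     % N uniform values on (0,1)
    X(i) = sum(u < p); % Count the successes
end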
For example:
X = directbinornd(100,0.3,1e4,1);
hist(X,101)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The Statistics Toolbox function binornd uses a modified direct method, based
on the definition of a binomial random variable as the sum of Bernoulli
random variables.
You can easily convert the previous method to a random number generator
for the Poisson distribution with parameter λ. The Poisson distribution is
the limiting case of the binomial distribution as N approaches infinity, p
approaches zero, and Np is held fixed at λ. To generate Poisson random
numbers, create a version of the previous generator that inputs λ rather than
N and p, and internally sets N to some large number and p to λ/N.
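A sketch of such a generator (the function name and the choice of N are
illustrative):

function X = directpoissrnd(lambda,m,n)
N = 1e4;                    % Large N approximates the binomial limit
p = lambda/N;
X = directbinornd(N,p,m,n); % Binomial values approach Poisson(lambda)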
The Statistics Toolbox function poissrnd actually uses two direct methods: a
waiting-time method for small values of λ, and a method due to Ahrens and
Dieter for larger values of λ.
Inversion Methods
Inversion methods are based on the observation that continuous cumulative
distribution functions (cdfs) range uniformly over the interval (0,1). If u is a
uniform random number on (0,1), then using X = F -1(U) generates a random
number X from a continuous distribution with specified cdf F.
For example, the following code generates random numbers from a specific
exponential distribution using the inverse cdf and the MATLAB uniform
random number generator rand:
mu = 1;
X = expinv(rand(1e4,1),mu);
Compare the distribution of the generated random numbers to the pdf of the
specified exponential by scaling the pdf to the area of the histogram used
to display the distribution:
numbins = 50;
hist(X,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(X,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = exppdf(x,mu);
plot(x,histarea*y,'r','LineWidth',2)
Inversion methods also apply to discrete distributions. The following
function generates random numbers from a discrete distribution defined by
a probability vector p:
function X = discreteinvrnd(p,m,n)
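% (Body reconstructed as a sketch: invert the discrete cdf cumsum(p))
X = zeros(m,n);              % Preallocate memory
for i = 1:m*n
    u = rand;
    I = find(u < cumsum(p)); % Outcomes whose cdf value exceeds u
    X(i) = min(I);           % First such outcome
end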
Use the function to generate random numbers from any discrete distribution:
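For example (the probability vector here is illustrative):

p = [0.1 0.2 0.3 0.2 0.1 0.1]; % Probabilities for outcomes 1 to 6
X = discreteinvrnd(p,1e4,1);
[counts,centers] = hist(X,length(p));
bar(1:length(p),counts)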
Acceptance-Rejection Methods
The functional form of some distributions makes it difficult or time-consuming
to generate random numbers using direct or inversion methods.
Acceptance-rejection methods provide an alternative in these cases.
An acceptance-rejection method generates random numbers from a target
pdf f as follows:

1 Chooses a density g that is easy to sample from, and a constant c such
that f(x) ≤ c·g(x) for all x.

2 Generates a value v from the distribution with density g, and a uniform
random number u on (0,1).

3 Accepts v as a value from f if c·u ≤ f(v)/g(v); otherwise, returns to step 2.

The following function implements the method:
function X = accrejrnd(f,g,grnd,c,m,n)
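% (Body reconstructed as a sketch: f and g are pdf function handles,
% grnd draws from g, and c satisfies f(x) <= c*g(x) for all x)
X = zeros(m,n); % Preallocate memory
for i = 1:m*n
    accept = false;
    while ~accept
        u = rand;
        v = grnd();
        if c*u <= f(v)/g(v) % Accept with probability f(v)/(c*g(v))
            X(i) = v;
            accept = true;
        end
    end
end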
For example, the function f(x) = x·e^(-x^2/2) satisfies the conditions for a pdf on
[0,∞) (nonnegative and integrates to 1). The exponential pdf with mean 1,
g(x) = e^(-x), scaled by c, dominates f for c greater than about 2.2. Thus, you
can use rand and exprnd to generate random numbers from f:
f = @(x)x.*exp(-(x.^2)/2);
g = @(x)exp(-x);
grnd = @()exprnd(1);
X = accrejrnd(f,g,grnd,2.2,1e4,1);
Y = raylrnd(1,1e4,1);
hist([X Y])
h = get(gca,'Children');
set(h(1),'FaceColor',[.8 .8 1])
legend('A-R RNG','Rayleigh RNG')
Parallel Computing Support for Random Number Generation
The following functions use random number generators and support both
parallel and serial computation. They supply two options to control random
number generation, whether in serial or parallel mode.
• bootci
• bootstrp
• TreeBagger
• TreeBagger.growTrees
Reproducing Computations
The previous functions include the 'UseSubstreams' option. This option
provides a quick and easy way to reproduce computations performed using
random number generators. Use this option to rerun a command with
reproducible results, whether using serial or parallel computation. This
option is available only with RandStream types that support substreams. The
default is not to use substreams, since reproducing random number streams
is not commonly desired.
The second option, 'Streams', lets you specify the random number stream or
streams that the computation uses.
For more information on each of these options, see the function reference
pages.
Representing Sampling Distributions Using Markov Chain Samplers

Using the Metropolis-Hastings Algorithm. The Metropolis-Hastings
algorithm draws samples from a distribution with pdf f(x), using a proposal
distribution with conditional pdf q(y|x):

1 Choose a starting value x(0) and set t = 0.

2 Generate a candidate point y(t) from the proposal distribution q(y|x(t)).
3 Accept y(t) as the next sample x(t + 1) with probability r(x(t),y(t)), and keep
x(t) as the next sample x(t + 1) with probability 1 – r(x(t),y(t)), where:
$r(x,y) = \min\left\{ \frac{f(y)\,q(x \mid y)}{f(x)\,q(y \mid x)},\ 1 \right\}$
4 Increment t → t+1, and repeat steps 2 and 3 until you get the desired
number of samples.
You can generate random numbers using the Metropolis-Hastings method
with the mhsample function.

Using Slice Sampling. The slice sampling algorithm generates samples
from a distribution with pdf f(x) as follows:

1 Assume an initial value x(t) within the domain of f(x).
2 Draw a real value y uniformly from (0, f(x(t))), thereby defining a horizontal
“slice” as S = {x: y < f(x)}.
3 Find an interval I = (L, R) around x(t) that contains all, or much of the
“slice” S.
4 Draw the new point x(t + 1) uniformly from the part of the slice S within
the interval I.

5 Increment t → t+1 and repeat steps 2 through 4 until you get the desired
number of samples.
Generate random numbers using the slice sampling method with the
slicesample function.
Generating Quasi-Random Numbers
Quasi-Random Sequences
Quasi-random number generators (QRNGs) produce highly uniform samples
of the unit hypercube. QRNGs minimize the discrepancy between the
distribution of generated points and a distribution with equal proportions of
points in each sub-cube of a uniform partition of the hypercube. As a result,
QRNGs systematically fill the “holes” in any initial segment of the generated
quasi-random sequence.
For example, consider a sequence whose first ten points are
1,3,5,7,9,2,4,6,8,10. Two properties determine which elements of the
sequence you use:
• Skip — A Skip value specifies the number of initial points to ignore. In this
example, set the Skip value to 2. The sequence is now 5,7,9,2,4,6,8,10
and the first three points are [5,7,9]:
• Leap — A Leap value specifies the number of points to ignore for each one
you take. Continuing the example with the Skip set to 2, if you set the Leap
to 1, the sequence uses every other point. In this example, the sequence is
now 5,9,4,8 and the first three points are [5,9,4]:
Quasi-Random Point Sets
Quasi-random sequences are functions from the positive integers to the unit
hypercube. To be useful in application, an initial point set of a sequence must
be generated. Point sets are matrices of size n-by-d, where n is the number of
points and d is the dimension of the hypercube being sampled. The functions
haltonset and sobolset construct point sets with properties of a specified
quasi-random sequence. Initial segments of the point sets are generated by
the net method of the qrandset class (parent class of the haltonset class
and sobolset class), but points can be generated and accessed more generally
using parenthesis indexing.
p = haltonset(2,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
Use the scramble method to apply a scramble to the point set, such as the
reverse-radix scramble 'RR2':
p = scramble(p,'RR2')
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
X0 = net(p,500);
X0 = p(1:500,:);
Values of the point set X0 are not generated and stored in memory until you
access p using net or parenthesis indexing.
scatter(X0(:,1),X0(:,2),5,'r')
axis square
title('{\bf Quasi-Random Scatter}')
For comparison, generate 500 pseudorandom points with rand and plot them:
X = rand(500,2);
scatter(X(:,1),X(:,2),5,'b')
axis square
title('{\bf Uniform Random Scatter}')
The practical difference shows up in statistical tests of uniformity. For
example, perform a Kolmogorov-Smirnov test repeatedly on pseudorandom
samples from the rand function:
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = rand(sampSize,1);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
The results are quite different when the test is performed repeatedly on
uniform quasi-random samples:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
Small p-values call into question the null hypothesis that the data are
uniformly distributed. If the hypothesis is true, about 5% of the p-values are
expected to fall below 0.05. The results are remarkably consistent in their
failure to challenge the hypothesis.
Quasi-Random Streams
Quasi-random streams, produced by the qrandstream function, are used
to generate sequential quasi-random outputs, rather than point sets of a
specific size. Streams are used like pseudoRNGS, such as rand, when client
applications require a source of quasi-random numbers of indefinite size that
can be accessed intermittently. Properties of a quasi-random stream, such
as its type (Halton or Sobol), dimension, skip, leap, and scramble, are set
when the stream is constructed.
For example, the following code, which works directly with a point set, must
index into the point set to keep track of which points have been used:

p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
The same values can be generated with a quasi-random stream, drawing
samples sequentially with the qrand method:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p)
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
X = qrand(q,sampSize);
[h,pval] = kstest(X,[X,X]);
PVALS(test) = pval;
end
Generating Data Using Flexible Families of Distributions
Data Input
The following four parameters define each member of the Pearson and
Johnson systems: the mean, the standard deviation, the skewness, and
the kurtosis.
These statistics can also be computed with the moment function. The Johnson
system, while based on these four parameters, is more naturally described
using quantiles, estimated by the quantile function.
As an example, generate a sample that matches the distribution of the MPG
data in carbig.mat:
load carbig
MPG = MPG(~isnan(MPG));
[n,x] = hist(MPG,15);
bar(x,n)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The following two sections model the distribution with members of the
Pearson and Johnson systems, respectively.
Generating Data Using the Pearson System

The Pearson system is a family of distributions that includes a unique distribution for every valid combination of mean, standard deviation, skewness, and kurtosis. Given sample values for each of these moments from data, it is easy to find the distribution in the Pearson system that matches these four moments and to generate a random sample.
For a given set of moments, there are distributions that are not in the system
that also have those same first four moments, and the distribution in the
Pearson system may not be a good match to your data, particularly if the
data are multimodal. But the system does cover a wide range of distribution
shapes, including both symmetric and skewed distributions.
moments = {mean(MPG),std(MPG),skewness(MPG),kurtosis(MPG)};
[r,type] = pearsrnd(moments{:},10000,1);
The optional second output from pearsrnd indicates which type of distribution
within the Pearson system matches the combination of moments.
type
type =
1
In this case, pearsrnd has determined that the data are best described with a
Type I Pearson distribution, which is a shifted, scaled beta distribution.
Verify that the sample resembles the original data by overlaying the empirical
cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r);
hold on, stairs(xi,Fi,'r'); hold off
Generating Data Using the Johnson System

The Johnson system of distributions is based on transformations of a standard normal random variable Z of the form

$$ X = \gamma + \delta \cdot \Gamma\!\left(\frac{Z - \xi}{\lambda}\right) $$

where Γ is one of the transformations defining the types of distributions in the system, and γ, δ, ξ, and λ are parameters.
To generate a sample from the Johnson distribution that matches the MPG
data, first define the four quantiles to which the four evenly spaced standard
normal quantiles of -1.5, -0.5, 0.5, and 1.5 should be transformed. That is, you
compute the sample quantiles of the data for the cumulative probabilities of
0.067, 0.309, 0.691, and 0.933.
probs = normcdf([-1.5 -0.5 0.5 1.5]);
quantiles = quantile(MPG,probs)
quantiles =
13.0000 18.0000 27.2000 36.0000
[r1,type] = johnsrnd(quantiles,10000,1);
The optional second output from johnsrnd indicates which type of distribution
within the Johnson system matches the quantiles.
type
type =
SB
You can verify that the sample resembles the original data by overlaying the
empirical cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r1);
hold on, stairs(xi,Fi,'r'); hold off
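The sample r2 compared below is not constructed in the text above. A plausible construction, matching standard normal quantiles that emphasize the right tail of the data (the specific qnorm values are illustrative assumptions, not taken from the source), is:

qnorm = [-.5 .25 1 1.75];              % standard normal quantiles, shifted right
probs = normcdf(qnorm);                % corresponding cumulative probabilities
qemp = quantile(MPG,probs);            % matching empirical quantiles of the data
r2 = johnsrnd([qnorm; qemp],10000,1);  % Johnson sample matched at these quantiles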
However, while the new sample matches the original data better in the right
tail, it matches much worse in the left tail.
[Fj,xj] = ecdf(r2);
hold on, stairs(xj,Fj,'g'); hold off
7
Hypothesis Tests
Introduction
Hypothesis testing is a common method of drawing inferences about a
population based on statistical evidence from a sample.
Suppose, for example, someone claims that the average price of a gallon of regular gas across a state was $1.15 on a given day, and you estimate that average from prices at a random sample of gas stations. Sample averages differ from one another due to chance variability in the selection process. Suppose your sample average comes out to be $1.18. Is the $0.03 difference an artifact of random sampling, or significant evidence that the average price of a gallon of gas was in fact greater than $1.15? Hypothesis testing is a statistical method for making such decisions.
Hypothesis Test Terminology
Hypothesis Test Assumptions
For example, the z-test (ztest) and the t-test (ttest) both assume that
the data are independently sampled from a normal distribution. Statistics
Toolbox functions are available for testing this assumption, such as chi2gof,
jbtest, lillietest, and normplot.
Both the z-test and the t-test are relatively robust with respect to departures
from this assumption, so long as the sample size n is large enough. Both
tests compute a sample mean x̄, which, by the Central Limit Theorem,
has an approximately normal sampling distribution with mean equal to the
population mean μ, regardless of the population distribution being sampled.
The difference between the z-test and the t-test is in the assumption of the
standard deviation σ of the underlying normal distribution. A z-test assumes
that σ is known; a t-test does not. As a result, a t-test must compute an
estimate s of the standard deviation from the sample.
Test statistics for the z-test and the t-test are, respectively,

$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\,, \qquad t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$
Under the null hypothesis that the population is distributed with mean μ, the
z-statistic has a standard normal distribution, N(0,1). Under the same null
hypothesis, the t-statistic has Student’s t distribution with n – 1 degrees of
freedom. For small sample sizes, Student’s t distribution is flatter and wider
than N(0,1), compensating for the decreased confidence in the estimate s.
As sample size increases, however, Student’s t distribution approaches the
standard normal distribution, and the two tests become essentially equivalent.
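For a quick numerical illustration of this convergence (not part of the original text), compare 97.5% critical values of the two distributions at several sample sizes:

for n = [5 20 100 1000]
    fprintf('n = %4d:  t critical = %.4f,  z critical = %.4f\n', ...
            n, tinv(0.975,n-1), norminv(0.975,0,1));
end
% The t critical value decreases toward the fixed normal value 1.96.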
Knowing the distribution of the test statistic under the null hypothesis allows
for accurate calculation of p-values. Interpreting p-values in the context of
the test assumptions allows for critical analysis of test results.
Example: Hypothesis Testing
The file gas.mat contains two random samples of prices (in cents) for a gallon of gas: price1, recorded in January, and price2, recorded a month later in February.

load gas
prices = [price1 price2];
As a first step, you might want to test the assumption that the samples come
from normal distributions.
normplot(prices)
Both scatters approximately follow straight lines through the first and third
quartiles of the samples, indicating approximate normal distributions.
The February sample (the right-hand line) shows a slight departure from
normality in the lower tail. A shift in the mean from January to February is
evident.
A hypothesis test can be used to quantify this assessment of normality. Since each sample is relatively small, a Lilliefors test is recommended.
lillietest(price1)
ans =
0
lillietest(price2)
ans =
0

The default significance level of lillietest is 5%. The logical 0 returned by each test indicates a failure to reject the null hypothesis that the samples are normally distributed.

Compute the sample means:

sample_means = mean(prices)
sample_means =
115.1500 118.5000
You might want to test the null hypothesis that the mean price across the
state on the day of the January sample was $1.15. If you know that the
standard deviation in prices across the state has historically, and consistently,
been $0.04, then a z-test is appropriate.
[h,pvalue,ci] = ztest(price1/100,1.15,0.04)
h =
0
pvalue =
0.8668
ci =
1.1340
1.1690

The logical output h = 0 indicates a failure to reject the null hypothesis at the default 5% significance level: the p value is large, and the 95% confidence interval on the mean, [1.1340, 1.1690], includes the hypothesized mean of $1.15.
Does the later sample offer stronger evidence for rejecting a null hypothesis
of a state-wide average price of $1.15 in February? The shift shown in the
probability plot and the difference in the computed sample means suggest
this. The shift might indicate a significant fluctuation in the market, raising
questions about the validity of using the historical standard deviation. If a
known standard deviation cannot be assumed, a t-test is more appropriate.
[h,pvalue,ci] = ttest(price2/100,1.15)
h =
1
pvalue =
4.9517e-004
ci =
1.1675
1.2025
You might want to investigate the shift in prices a little more closely.
The function ttest2 tests if two independent samples come from normal
distributions with equal but unknown standard deviations and the same
mean, against the alternative that the means are unequal.
[h,sig,ci] = ttest2(price1,price2)
h =
1
sig =
0.0083
ci =
-5.7845
-0.9155

The small p value (sig = 0.0083) indicates that the difference in means is significant, and the 95% confidence interval on the difference, [-5.7845, -0.9155] cents, does not contain zero. A box plot gives a visual comparison of the two samples:
boxplot(prices,1)
set(gca,'XTick',[1 2])
set(gca,'XtickLabel',{'January','February'})
xlabel('Month')
ylabel('Prices ($0.01)')
The plot displays the distribution of the samples around their medians. The
heights of the notches in each box are computed so that the side-by-side
boxes have nonoverlapping notches when their medians are different at a
default 5% significance level. The computation is based on an assumption
of normality in the data, but the comparison is reasonably robust for other
distributions. The side-by-side plots provide a kind of visual hypothesis test,
comparing medians rather than means. The plot above appears to barely
reject the null hypothesis of equal medians.
The Wilcoxon rank sum test, implemented by the ranksum function, provides a nonparametric comparison of the sample medians:

[p,h] = ranksum(price1,price2)
p =
0.0095
h =
1
The test rejects the null hypothesis of equal medians at the default 5%
significance level.
Available Hypothesis Tests
Function     Description

ranksum      Wilcoxon rank sum test. Tests if two independent samples
             come from identical continuous distributions with equal
             medians, against the alternative that they do not have
             equal medians.

runstest     Runs test. Tests if a sequence of values comes in random
             order, against the alternative that the ordering is not
             random.

signrank     One-sample or paired-sample Wilcoxon signed rank test.
             Tests if a sample comes from a continuous distribution
             symmetric about a specified median, against the
             alternative that it does not have that median.

signtest     One-sample or paired-sample sign test. Tests if a sample
             comes from an arbitrary continuous distribution with a
             specified median, against the alternative that it does
             not have that median.

ttest        One-sample or paired-sample t-test. Tests if a sample
             comes from a normal distribution with unknown variance
             and a specified mean, against the alternative that it
             does not have that mean.

ttest2       Two-sample t-test. Tests if two independent samples come
             from normal distributions with unknown but equal (or,
             optionally, unequal) variances and the same mean, against
             the alternative that the means are unequal.

vartest      One-sample chi-square variance test. Tests if a sample
             comes from a normal distribution with specified variance,
             against the alternative that it comes from a normal
             distribution with a different variance.

vartest2     Two-sample F-test for equal variances. Tests if two
             independent samples come from normal distributions with
             the same variance, against the alternative that they come
             from normal distributions with different variances.

vartestn     Bartlett multiple-sample test for equal variances. Tests
             if multiple samples come from normal distributions with
             the same variance, against the alternative that they come
             from normal distributions with different variances.

ztest        One-sample z-test. Tests if a sample comes from a normal
             distribution with known variance and specified mean,
             against the alternative that it does not have that mean.
8
Analysis of Variance
Introduction
Analysis of variance (ANOVA) is a procedure for assigning sample variance to
different sources and deciding whether the variation arises within or among
different population groups. Samples are described in terms of variation
around group means and variation of group means around an overall mean. If
variations within groups are small relative to variations between groups, a
difference in group means may be inferred. Hypothesis tests, as described in Chapter 7, “Hypothesis Tests”, are used to quantify decisions.
This chapter treats ANOVA among groups, that is, among categorical
predictors. ANOVA for regression, with continuous predictors, is discussed in
“Tabulating Diagnostic Statistics” on page 9-13.
ANOVA
In this section...
“One-Way ANOVA” on page 8-3
“Two-Way ANOVA” on page 8-9
“N-Way ANOVA” on page 8-12
“Other ANOVA Models” on page 8-26
“Analysis of Covariance” on page 8-27
“Nonparametric Methods” on page 8-35
One-Way ANOVA
• “Introduction” on page 8-3
• “Example: One-Way ANOVA” on page 8-4
• “Multiple Comparisons” on page 8-6
• “Example: Multiple Comparisons” on page 8-7
Introduction
The purpose of one-way ANOVA is to find out whether data from several
groups have a common mean. That is, to determine whether the groups are
actually different in the measured characteristic.
One-way ANOVA is a simple special case of the linear model. The one-way
ANOVA form of the model is
$$ y_{ij} = \alpha_{.j} + \varepsilon_{ij} $$
where:
• yij is a matrix of observations, in which each column represents a different group.

• α.j is a matrix whose columns are the group means. (The “dot j” notation
means that α applies to all rows of column j. That is, the value αij is the
same for all i.)
• εij is a matrix of random disturbances.
The model assumes that the columns of y are a constant plus a random
disturbance. You want to know if the constants are all the same.
Example: One-Way ANOVA

The data below come from a study of bacteria counts in shipments of milk. The columns of the matrix hogg represent different shipments; the rows are bacteria counts from cartons of milk chosen randomly from each shipment. Do some shipments have higher counts than others?

load hogg
hogg
hogg =
24 14 11 7 19
15 7 9 7 24
21 12 7 4 19
27 17 13 7 15
33 14 12 12 10
23 16 18 18 20
[p,tbl,stats] = anova1(hogg);
p
p =
1.1971e-04
The standard ANOVA table has columns for the sums of squares, degrees of
freedom, mean squares (SS/df), F statistic, and p value.
You can use the F statistic to do a hypothesis test to find out if the bacteria
counts are the same. anova1 returns the p value from this hypothesis test.
In this case the p value is about 0.0001, a very small value. This is a strong
indication that the bacteria counts from the different shipments are not the
same. An F statistic as extreme as the observed F would occur by chance only
once in 10,000 times if the counts were truly equal.
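The connection between the F statistic and the p value can be sketched directly with the F cumulative distribution function; here df = 4 (between the five shipments) and df = 25 (within the 30 observations):

F0 = finv(1 - 1.1971e-4, 4, 25)   % F statistic implied by the reported p value
p0 = 1 - fcdf(F0, 4, 25)          % recovers p = 1.1971e-4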
You can get some graphical assurance that the means are different by
looking at the box plots in the second figure window displayed by anova1.
Note, however, that the notches are used for a comparison of medians, not a
comparison of means. For more information on this display, see “Box Plots”
on page 4-6.
Multiple Comparisons
Sometimes you need to determine not just whether there are any differences
among the means, but specifically which pairs of means are significantly
different. It is tempting to perform a series of t tests, one for each pair of
means, but this procedure has a pitfall.
In this example there are five means, so there are 10 pairs of means to
compare. It stands to reason that if all the means are the same, and if there is
a 5% chance of incorrectly concluding that there is a difference in one pair,
then the probability of making at least one incorrect conclusion among all 10
pairs is much larger than 5%.
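A back-of-the-envelope calculation shows the inflation (treating the 10 tests as independent, which they are not, so this is only an approximation):

% Probability of at least one false positive among 10 tests at the 5% level:
1 - 0.95^10   % about 0.40, much larger than 0.05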
Example: Multiple Comparisons

The multcompare function performs multiple pairwise comparisons of the group means, using the stats output from anova1:

load hogg
[p,tbl,stats] = anova1(hogg);
[c,m] = multcompare(stats)
c =
1.0000 2.0000 2.4953 10.5000 18.5047
1.0000 3.0000 4.1619 12.1667 20.1714
1.0000 4.0000 6.6619 14.6667 22.6714
1.0000 5.0000 -2.0047 6.0000 14.0047
2.0000 3.0000 -6.3381 1.6667 9.6714
2.0000 4.0000 -3.8381 4.1667 12.1714
2.0000 5.0000 -12.5047 -4.5000 3.5047
3.0000 4.0000 -5.5047 2.5000 10.5047
3.0000 5.0000 -14.1714 -6.1667 1.8381
4.0000 5.0000 -16.6714 -8.6667 -0.6619
m =
23.8333 1.9273
13.3333 1.9273
11.6667 1.9273
9.1667 1.9273
17.8333 1.9273
The first output from multcompare has one row for each pair of groups, with an estimate of the difference in group means and a confidence interval for that difference. For example, the second row compares groups 1 and 3, with confidence interval [4.1619, 20.1714]. This interval does not contain 0, so you can conclude that the means of groups 1 and 3 are different.
The second output contains the mean and its standard error for each group.
There are five groups. The graph instructs you to Click on the group you want to test. Three groups have means significantly different from group one.
The graph shows that group 1 is significantly different from groups 2, 3, and
4. By using the mouse to select group 4, you can determine that it is also
significantly different from group 5. Other pairs are not significantly different.
Two-Way ANOVA
• “Introduction” on page 8-9
• “Example: Two-Way ANOVA” on page 8-10
Introduction
The purpose of two-way ANOVA is to find out whether data from several
groups have a common mean. One-way ANOVA and two-way ANOVA differ
in that the groups in two-way ANOVA have two categories of defining
characteristics instead of one.
Suppose an automobile company has two factories, and each factory makes
the same three models of car. It is reasonable to ask if the gas mileage in the
cars varies from factory to factory as well as from model to model. There are
two predictors, factory and model, to explain differences in mileage.
There could be an overall difference in mileage between the two factories, and a difference among the three models; these are called additive effects. Finally, a factory might make high mileage cars in one model (perhaps
because of a superior production line), but not be different from the other
factory for other models. This effect is called an interaction. It is impossible
to detect an interaction unless there are duplicate observations for some
combination of factory and car model.
Two-way ANOVA is a special case of the linear model. The two-way ANOVA
form of the model is
$$ y_{ijk} = \mu + \alpha_{.j} + \beta_{i.} + \gamma_{ij} + \varepsilon_{ijk} $$

where:
• yijk is a matrix of gas mileage observations (with row index i, column index
j, and repetition index k).
• μ is a constant matrix of the overall mean gas mileage.
• α.j is a matrix whose columns are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s model. All
values in a given column of α.j are identical, and the values in each row of
α.j sum to 0.
• βi. is a matrix whose rows are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s factory. All
values in a given row of βi. are identical, and the values in each column
of βi. sum to 0.
• γij is a matrix of interactions. The values in each row of γij sum to 0, and the
values in each column of γij sum to 0.
• εijk is a matrix of random disturbances.
Example: Two-Way ANOVA

The purpose of the example is to determine the effect of car model and factory on the mileage rating of cars.

load mileage
mileage

mileage =

   33.3000   34.5000   37.4000
   33.4000   34.8000   36.8000
   32.9000   33.8000   37.6000
   32.6000   33.4000   36.6000
   32.5000   33.7000   37.0000
   33.0000   33.9000   36.7000
cars = 3;
[p,tbl,stats] = anova2(mileage,cars);
p
p =
0.0000 0.0039 0.8411
There are three models of cars (columns) and two factories (rows). The reason
there are six rows in mileage instead of two is that each factory provides
three cars of each model for the study. The data from the first factory is in the
first three rows, and the data from the second factory is in the last three rows.
The standard ANOVA table has columns for the sums of squares,
degrees-of-freedom, mean squares (SS/df), F statistics, and p-values.
You can use the F statistics to do hypotheses tests to find out if the mileage is
the same across models, factories, and model-factory pairs (after adjusting for
the additive effects). anova2 returns the p value from these tests.
The p value for the model effect is zero to four decimal places. This is a strong
indication that the mileage varies from one model to another. An F statistic
as extreme as the observed F would occur by chance less than once in 10,000
times if the gas mileage were truly equal from model to model. If you used the
multcompare function to perform a multiple comparison test, you would find
that each pair of the three models is significantly different.
The p value for the factory effect is 0.0039, which is also highly significant.
This indicates that one factory is out-performing the other in the gas mileage
of the cars it produces. The observed p value indicates that an F statistic as
extreme as the observed F would occur by chance about four out of 1000 times
if the gas mileage were truly equal from factory to factory.
There does not appear to be any interaction between factories and models.
The p value, 0.8411, means that the observed result is quite likely (84 out of 100
times) given that there is no interaction.
In addition, anova2 requires that data be balanced, which in this case means
there must be the same number of cars for each combination of model and
factory. The next section discusses a function that supports unbalanced data
with any number of predictors.
N-Way ANOVA
• “Introduction” on page 8-12
• “N-Way ANOVA with a Small Data Set” on page 8-13
• “N-Way ANOVA with a Large Data Set” on page 8-15
• “ANOVA with Random Effects” on page 8-19
Introduction
You can use N-way ANOVA to determine if the means in a set of data differ
when grouped by multiple factors. If they do differ, you can determine which
factors or combinations of factors are associated with the difference.
For example, a three-factor model with all interactions has the form

$$ y_{ijkl} = \mu + \alpha_{.j.} + \beta_{i..} + \gamma_{..k} + (\alpha\beta)_{ij.} + (\alpha\gamma)_{i.k} + (\beta\gamma)_{.jk} + (\alpha\beta\gamma)_{ijk} + \varepsilon_{ijkl} $$
The anovan function performs N-way ANOVA. Unlike the anova1 and anova2
functions, anovan does not expect data in a tabular form. Instead, it expects
a vector of response measurements and a separate vector (or text array)
containing the values corresponding to each factor. This input data format is
more convenient than matrices when there are more than two factors or when
the number of measurements per factor combination is not constant.
For example, consider the following two-way example using anova2:

m = [23 15 20;27 17 63;43 3 55;41 9 90];
anova2(m,2)

ans =

0.0197 0.2234 0.2663
The factor information is implied by the shape of the matrix m and the number
of measurements at each factor combination (2). Although anova2 does not
actually require arrays of factor values, for illustrative purposes you could
create them as follows.
cfactor = repmat(1:3,4,1)
cfactor =
1 2 3
1 2 3
1 2 3
1 2 3
rfactor = [ones(2,3); 2*ones(2,3)]

rfactor =
1 1 1
1 1 1
2 2 2
2 2 2
The cfactor matrix shows that each column of m represents a different level
of the column factor. The rfactor matrix shows that the top two rows of m
represent one level of the row factor, and bottom two rows of m represent a
second level of the row factor. In other words, each value m(i,j) represents
an observation at column factor level cfactor(i,j) and row factor level
rfactor(i,j).
To solve the above problem with anovan, you need to reshape the matrices m,
cfactor, and rfactor to be vectors.
m = m(:);
cfactor = cfactor(:);
rfactor = rfactor(:);
[m cfactor rfactor]
ans =
23 1 1
27 1 1
43 1 2
41 1 2
15 2 1
17 2 1
3 2 2
9 2 2
20 3 1
63 3 1
55 3 2
90 3 2
anovan(m,{cfactor rfactor},2)
ans =
0.0197
0.2234
0.2663
load carbig
whos
The example focuses on four variables. MPG is the number of miles per gallon
for each of 406 cars (though some have missing values coded as NaN). The
other three variables are factors: cyl4 (four-cylinder car or not), org (car
originated in Europe, Japan, or the USA), and when (car was built early in the
period, in the middle of the period, or late in the period).
First, fit the full model, requesting up to three-way interactions and Type 3
sums-of-squares.
varnames = {'Origin';'4Cyl';'MfgDate'};
anovan(MPG,{org cyl4 when},3,3,varnames)
ans =
0.0000
NaN
0
0.7032
0.0001
0.2072
0.6990
Note that many terms are marked by a # symbol as not having full rank,
and one of them has zero degrees of freedom and is missing a p value. This
can happen when there are missing factor combinations and the model has
higher-order terms. In this case, the cross-tabulation below shows that there
are no cars made in Europe during the early part of the period with other than
four cylinders, as indicated by the 0 in table(2,1,1).
[table,chi2,p,factorvals] = crosstab(org,when,cyl4)

table(:,:,1) =
82 75 25
0 4 3
3 3 4
table(:,:,2) =
12 22 38
23 26 17
12 25 32
chi2 =
207.7689
p =
     0
factorvals =
Using even the limited information available in the ANOVA table, you can see
that the three-way interaction has a p value of 0.699, so it is not significant.
So this time you examine only two-way interactions.
[p,tbl,stats,terms] = anovan(MPG,{org cyl4 when},2,3,varnames)

terms =
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
Now all terms are estimable. The p-values for interaction term 4
(Origin*4Cyl) and interaction term 6 (4Cyl*MfgDate) are much larger than
a typical cutoff value of 0.05, indicating these terms are not significant. You
could choose to omit these terms and pool their effects into the error term.
The output terms variable returns a matrix of codes, each of which is a bit
pattern representing a term. You can omit terms from the model by deleting
their entries from terms and running anovan again, this time supplying the
resulting vector as the model argument.
terms([4 6],:) = []
terms =
1 0 0
0 1 0
0 0 1
1 0 1
anovan(MPG,{org cyl4 when},terms,3,varnames)

ans =
1.0e-003 *
0.0000
0
0
0.1140
Now you have a more parsimonious model indicating that the mileage of
these cars seems to be related to all three factors, and that the effect of the
manufacturing date depends on where the car was made.
ANOVA with Random Effects

In an ordinary ANOVA model, each grouping variable represents a fixed factor whose levels are a fixed set of values. If instead the levels of a factor represent a random selection from a larger set of possible levels, the factor is a random effect. This section uses the mileage data to show how to fit such a model with anovan.
Setting Up the Model. To set up the example, first load the data, which is
stored in a 6-by-3 matrix, mileage.
load mileage
The anova2 function works only with balanced data, and it infers the values
of the grouping variables from the row and column numbers of the input
matrix. The anovan function, on the other hand, requires you to explicitly
create vectors of grouping variable values. To create these vectors, do the
following steps:
1 Create an array indicating the factory for each value in mileage. This
array is 1 for the first column, 2 for the second, and 3 for the third.
factory = repmat(1:3,6,1);
2 Create an array indicating the car model for each mileage value. This array
is 1 for the first three rows of mileage, and 2 for the remaining three rows.
carmod = [ones(3,3); 2*ones(3,3)];

Turn the matrices into vectors and display them:

mileage = mileage(:);
factory = factory(:);
carmod = carmod(:);
[mileage factory carmod]
ans =

   33.3000    1.0000    1.0000
   33.4000    1.0000    1.0000
   32.9000    1.0000    1.0000
   32.6000    1.0000    2.0000
   32.5000    1.0000    2.0000
   33.0000    1.0000    2.0000
   34.5000    2.0000    1.0000
   34.8000    2.0000    1.0000
   33.8000    2.0000    1.0000
   33.4000    2.0000    2.0000
   33.7000    2.0000    2.0000
   33.9000    2.0000    2.0000
   37.4000    3.0000    1.0000
   36.8000    3.0000    1.0000
   37.6000    3.0000    1.0000
   36.6000    3.0000    2.0000
   37.0000    3.0000    2.0000
   36.7000    3.0000    2.0000
In the fixed effects version of this fit, which you get by omitting the inputs
'random',1 in the preceding code, the effect of car model is significant, with a
p value of 0.0039. But in this example, which takes into account the random
variation of the effect of the variable 'Car Model' from one factory to another,
the effect is still significant, but with a higher p value of 0.0136.
In the example described in “Setting Up the Model” on page 8-20, the effect
of the variable 'Factory' could vary across car models. In this case, the
interaction mean square takes the place of the error mean square in the F
statistic. The F statistic for factory is:
F = 1.445 / 0.02
F =
72.2500
The degrees of freedom for the statistic are the degrees of freedom for the
numerator (1) and denominator (2) mean squares. Therefore the p value
for the statistic is:
pval = 1 - fcdf(F,1,2)
pval =
0.0136
With random effects, the expected value of each mean square depends not only
on the variance of the error term, but also on the variances contributed by
the random effects. You can see these dependencies by writing the expected
values as linear combinations of contributions from the various model terms.
To find the coefficients of these linear combinations, enter stats.ems, which
returns the ems field of the stats structure:
stats.ems
ans =
stats.txtems
ans =
'6*V(Factory)+3*V(Factory*Car Model)+V(Error)'
'9*Q(Car Model)+3*V(Factory*Car Model)+V(Error)'
'3*V(Factory*Car Model)+V(Error)'
'V(Error)'
The expected value for the mean square due to car model (second term)
includes contributions from a quadratic function of the car model effects, plus
three times the variance of the interaction term’s effect, plus the variance
of the error term. Notice that if the car model effects were all zero, the
expression would reduce to the expected mean square for the third term (the
interaction term). That is why the F statistic for the car model effect uses the
interaction mean square in the denominator.
In some cases there is no single term whose expected value matches the one
required for the denominator of the F statistic. In that case, the denominator is
a linear combination of mean squares. The stats structure contains fields
giving the definitions of the denominators for each F statistic. The txtdenom
field, stats.txtdenom, gives a text representation, and the denom field gives
a matrix that defines a linear combination of the variances of terms in the
model. For balanced models like this one, the denom matrix, stats.denom,
contains zeros and ones, because the denominator is just a single term’s mean
square:
stats.txtdenom
ans =
'MS(Factory*Car Model)'
'MS(Factory*Car Model)'
'MS(Error)'
stats.denom
ans =
stats.rtnames
ans =
'Factory'
'Factory*Car Model'
'Error'
You do not know those variances, but you can estimate them from the data.
Recall that the ems field of the stats structure expresses the expected value
of each term’s mean square as a linear combination of unknown variances for
random terms, and unknown quadratic forms for fixed terms. If you take
the expected mean square expressions for the random terms, and equate
those expected values to the computed mean squares, you get a system of
equations that you can solve for the unknown variances. These solutions
are the variance component estimates. The varest field contains a variance
component estimate for each term. The rtnames field contains the names
of the random terms.
stats.varest
ans =
4.4426
-0.0313
0.1139
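As a consistency check (using the displayed estimates and the ems relations above), the interaction mean square used earlier in the F statistic is recovered:

% ems relation for the interaction term:
%   3*V(Factory*Car Model) + V(Error) = MS(Factory*Car Model)
3*(-0.0313) + 0.1139   % returns 0.0200, the interaction mean square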
The estimate for the Factory*Car Model interaction is negative. A variance cannot be negative, but the unbiased estimator used here can produce negative estimates. When an estimate is negative, it is common to set the estimate to zero, which you might do, for example, to create a bar graph of the components.
bar(max(0,stats.varest))
set(gca,'xtick',1:3,'xticklabel',stats.rtnames)
You can also compute confidence bounds for the variance estimate. The
anovan function does this by computing confidence bounds for the variance
expected mean squares, and finding lower and upper limits on each variance
component containing all of these bounds. This procedure leads to a set
of bounds that is conservative for balanced data. (That is, 95% confidence
bounds will have a probability of at least 95% of containing the true variances
if the number of observations for each combination of grouping variables
is the same.) For unbalanced data, these are approximations that are not
guaranteed to be conservative.
stats.varci

ans =
Other ANOVA Models

The anovan function also accepts nested models, in which one factor takes different values within each level of another factor.

For example, the mileage data from the previous section assumed that the
two car models produced in each factory were the same. Suppose instead,
each factory produced two distinct car models for a total of six car models, and
we numbered them 1 and 2 for each factory for convenience. Then, the car
model is nested in factory. A more accurate and less ambiguous numbering of
car model would be as follows:

Factory    Car Model
1          1
1          2
2          3
2          4
3          5
3          6
Analysis of Covariance
• “Introduction” on page 8-27
• “Analysis of Covariance Tool” on page 8-27
• “Confidence Bounds” on page 8-32
• “Multiple Comparisons” on page 8-34
Introduction
Analysis of covariance is a technique for analyzing grouped data having a
response (y, the variable to be predicted) and a predictor (x, the variable
used to do the prediction). Using analysis of covariance, you can model y as
a linear function of x, with the coefficients of the line possibly varying from
group to group.
Same mean       y = α + ε
Separate means  y = (α + αi) + ε
Same line       y = α + βx + ε
Parallel lines  y = (α + αi) + βx + ε
Separate lines  y = (α + αi) + (β + βi)x + ε
For example, in the parallel lines model the intercept varies from one group
to the next, but the slope is the same for each group. In the same mean
model, there is a common intercept and no slope. In order to make the group
coefficients well determined, the tool imposes the constraints
$$ \sum_i \alpha_i = \sum_i \beta_i = 0 $$
The following steps describe the use of aoctool.
1 Load the data. The Statistics Toolbox data set carsmall.mat contains
information on cars from the years 1970, 1976, and 1982. This example
studies the relationship between the weight of a car and its mileage,
and whether this relationship has changed over the years. To start the
demonstration, load the data set.
load carsmall
2 Start the tool. The following command calls aoctool to fit a separate line
to the column vectors Weight and MPG for each of the three groups defined in Model_Year. The initial fit models the y variable, MPG, as a linear
function of the x variable, Weight.
[h,atab,ctab,stats] = aoctool(Weight,MPG,Model_Year);
See the aoctool function reference page for detailed information about
calling aoctool.
The coefficients of the three lines appear in the figure titled ANOCOVA
Coefficients. You can see that the slopes are roughly –0.0078, with a small
deviation for each group:
• Model year 1970: y = (45.9798 – 8.5805) + (–0.0078 + 0.002)x + ε
• Model year 1976: y = (45.9798 – 3.8902) + (–0.0078 + 0.0011)x + ε
• Model year 1982: y = (45.9798 + 12.4707) + (–0.0078 – 0.0031)x + ε
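As a quick check of these coefficients, the model year 1982 line evaluated at a weight of 3000 lb gives:

mpg82 = (45.9798 + 12.4707) + (-0.0078 - 0.0031)*3000   % about 25.75 MPG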
Because the three fitted lines have slopes that are roughly similar, you may
wonder if they really are the same. The Model_Year*Weight interaction
expresses the difference in slopes, and the ANOVA table shows a test for
the significance of this term. With an F statistic of 5.23 and a p value of
0.0072, the slopes are significantly different.
4 Constrain the slopes to be the same. To examine the fits when the
slopes are constrained to be the same, return to the ANOCOVA Prediction
Plot window and use the Model pop-up menu to select a Parallel Lines
model. The window updates to show the following graph.
Though this fit looks reasonable, it is significantly worse than the Separate
Lines model. Use the Model pop-up menu again to return to the original
model.
Confidence Bounds
The example in “Analysis of Covariance Tool” on page 8-27 provides estimates
of the relationship between MPG and Weight for each Model_Year, but how
accurate are these estimates? To find out, you can superimpose confidence
bounds on the fits by examining them one group at a time.
1 In the Model_Year menu at the lower right of the figure, change the
setting from All Groups to 82. The data and fits for the other groups are
dimmed, and confidence bounds appear around the 82 fit.
The dashed lines form an envelope around the fitted line for model year 82.
Under the assumption that the true relationship is linear, these bounds
provide a 95% confidence region for the true line. Note that the fits for the
other model years are well outside these confidence bounds for Weight
values between 2000 and 3000.
Like the polytool function, the aoctool function has cross hairs that you
can use to manipulate the Weight and watch the estimate and confidence
bounds along the y-axis update. These values appear only when a single
group is selected, not when All Groups is selected.
Multiple Comparisons
You can perform a multiple comparison test by using the stats output
structure from aoctool as input to the multcompare function. The
multcompare function can test either slopes, intercepts, or population
marginal means (the predicted MPG of the mean weight for each group). The
example in “Analysis of Covariance Tool” on page 8-27 shows that the slopes
are not all the same, but could it be that two are the same and only the other
one is different? You can test that hypothesis.
multcompare(stats,0.05,'on','','s')
ans =
1.0000 2.0000 -0.0012 0.0008 0.0029
1.0000 3.0000 0.0013 0.0051 0.0088
2.0000 3.0000 0.0005 0.0042 0.0079
This matrix shows that the estimated difference between the slopes of groups 1 and 2 (1970 and 1976) is 0.0008, and a confidence interval for the difference is [-0.0012, 0.0029]. There is no significant difference between the two. There are significant differences, however, between the slope for 1982 and each of the other two. The graph shows the same information.
Note that the stats structure was created in the initial call to the aoctool
function, so it is based on the initial model fit (typically a separate-lines
model). If you change the model interactively and want to base your multiple
comparisons on the new model, you need to run aoctool again to get another
stats structure, this time specifying your new model as the initial model.
Nonparametric Methods
• “Introduction” on page 8-36
• “Kruskal-Wallis Test” on page 8-36
• “Friedman’s Test” on page 8-37
Introduction
Statistics Toolbox functions include nonparametric versions of one-way and
two-way analysis of variance. Unlike classical tests, nonparametric tests
make only mild assumptions about the data, and are appropriate when the
distribution of the data is non-normal. On the other hand, they are less
powerful than classical methods for normally distributed data.
Kruskal-Wallis Test
The example “Example: One-Way ANOVA” on page 8-4 uses one-way
analysis of variance to determine if the bacteria counts of milk varied from
shipment to shipment. The one-way analysis rests on the assumption that
the measurements are independent, and that each has a normal distribution
with a common variance and with a mean that was constant in each column.
You can conclude that the column means were not all the same. The following
example repeats that analysis using a nonparametric procedure.
The Kruskal-Wallis test is a nonparametric version of one-way analysis of variance. It assumes that the measurements come from a continuous distribution, but not necessarily a normal distribution, and is based on an analysis of variance using the ranks of the data values rather than the data values themselves:

load hogg
p = kruskalwallis(hogg)
p =
0.0020
The low p value means the Kruskal-Wallis test results agree with the one-way
analysis of variance results.
Friedman’s Test
“Example: Two-Way ANOVA” on page 8-10 uses two-way analysis of variance
to study the effect of car model and factory on car mileage. The example
tests whether either of these factors has a significant effect on mileage, and
whether there is an interaction between these factors. The conclusion of
the example is there is no interaction, but that each individual factor has
a significant effect. The next example examines whether a nonparametric
analysis leads to the same conclusion.
Friedman’s test is a nonparametric test for data having a two-way layout (data
grouped by two categorical factors). Unlike two-way analysis of variance,
Friedman’s test does not treat the two factors symmetrically and it does not
test for an interaction between them. Instead, it is a test for whether the
columns are different after adjusting for possible row differences. The test is
based on an analysis of variance using the ranks of the data across categories
of the row factor. Output includes a table similar to an ANOVA table.
load mileage
p = friedman(mileage,3)
p =
7.4659e-004
Recall the classical analysis of variance gave a p value to test column effects,
row effects, and interaction effects. This p value is for column effects. Using
either this p value or the p value from ANOVA (p < 0.0001), you conclude that
there are significant column effects.
In order to test for row effects, you need to rearrange the data to swap the roles of the rows and columns. For a data matrix x with no replications, you
could simply transpose the data and type
p = friedman(x')
With replications, the rearrangement is more complicated: reshape the data into a three-dimensional array with the first dimension representing the replicates, swapping the other two dimensions, and restoring the two-dimensional shape.
x = reshape(mileage, [3 2 3]);
x = permute(x,[1 3 2]);
x = reshape(x,[9 2])
x =
33.3000 32.6000
33.4000 32.5000
32.9000 33.0000
34.5000 33.4000
34.8000 33.7000
33.8000 33.9000
37.4000 36.6000
36.8000 37.0000
37.6000 36.7000
friedman(x,3)
ans =
0.0082
You cannot use Friedman’s test to test for interactions between the row and
column factors.
MANOVA
In this section...
“Introduction” on page 8-39
“ANOVA with Multiple Responses” on page 8-39
Introduction
The analysis of variance technique in “Example: One-Way ANOVA” on page 8-4 takes a set of grouped data and determines whether the mean of a
variable differs significantly among groups. Often there are multiple response
variables, and you are interested in determining whether the entire set of
means is different from one group to the next. There is a multivariate version
of analysis of variance that can address the problem.
load carsmall
whos
Name Size Bytes Class
Acceleration 100x1 800 double array
Cylinders 100x1 800 double array
Displacement 100x1 800 double array
Horsepower 100x1 800 double array
MPG 100x1 800 double array
Model 100x36 7200 char array
Model_Year 100x1 800 double array
Origin 100x7 1400 char array
Weight 100x1 800 double array
Model_Year indicates the year in which the car was made. You can create a
grouped plot matrix of these variables using the gplotmatrix function.
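The construction of the matrix x used in the rest of this example is missing above; a reconstruction consistent with the four group means displayed at the end of this section (the variable list and marker symbols are inferred, not taken from the source) is:

x = [MPG Horsepower Displacement Weight];  % variable choice inferred from the group means below
gplotmatrix(x,[],Model_Year,[],'+xo')      % marker symbols are illustrative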
It appears the cars do differ from year to year. The upper right plot, for
example, is a graph of MPG versus Weight. The 1982 cars appear to have
higher mileage than the older cars, and they appear to weigh less on average.
But as a group, are the three years significantly different from one another?
The manova1 function can answer that question.
[d,p,stats] = manova1(x,Model_Year)
d =
2
p =
1.0e-006 *
0
0.1141
stats =
W: [4x4 double]
B: [4x4 double]
T: [4x4 double]
dfW: 90
dfB: 2
dfT: 92
lambda: [2x1 double]
chisq: [2x1 double]
chisqdf: [2x1 double]
eigenval: [4x1 double]
eigenvec: [4x4 double]
canon: [100x4 double]
mdist: [100x1 double]
gmdist: [3x3 double]
The next three fields are used to do a canonical analysis. Recall that in
principal components analysis (“Principal Component Analysis” on page
10-31) you look for the combination of the original variables that has the
largest possible variation. In multivariate analysis of variance, you instead
look for the linear combination of the original variables that has the largest
separation between groups. It is the single variable that would give the most
significant result in a univariate one-way analysis of variance. Having found
that combination, you next look for the combination with the second highest
separation, and so on.
The eigenvec field is a matrix that defines the coefficients of the linear
combinations of the original variables. The eigenval field is a vector
measuring the ratio of the between-group variance to the within-group
variance for the corresponding linear combination. The canon field is a matrix
of the canonical variable values. Each column is a linear combination of the
mean-centered original variables, using coefficients from the eigenvec matrix.
A grouped scatter plot of the first two canonical variables shows more
separation between groups then a grouped scatter plot of any pair of original
variables. In this example it shows three clouds of points, overlapping but
with distinct centers. One point in the bottom right sits apart from the others.
By using the gname function, you can see that this is the 20th point.
c1 = stats.canon(:,1);
c2 = stats.canon(:,2);
gscatter(c2,c1,Model_Year,[],'oxs')
gname
Roughly speaking, the first canonical variable, c1, separates the 1982 cars
(which have high values of c1) from the older cars. The second canonical
variable, c2, reveals some separation between the 1970 and 1976 cars.
The final two fields of the stats structure are Mahalanobis distances. The
mdist field measures the distance from each point to its group mean. Points
with large values may be outliers. In this data set, the largest outlier is the
one in the scatter plot, the Buick Estate station wagon. (Note that you could
have supplied the model name to the gname function above if you wanted to
label the point with its model name rather than its row number.)
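For example, labeling the scatter with model names instead of row numbers is a one-line change (a sketch using the variables defined above):

figure
gscatter(c2,c1,Model_Year,[],'oxs')
gname(Model)   % click points to label them with car model names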
max(stats.mdist)
ans =
31.5273
find(stats.mdist == ans)
ans =
20
Model(20,:)
ans =
buick_estate_wagon_(sw)
The gmdist field measures the distances between each pair of group means.
The following commands examine the group means and their distances:
grpstats(x, Model_Year)
ans =
1.0e+003 *
0.0177 0.1489 0.2869 3.4413
0.0216 0.1011 0.1978 3.0787
0.0317 0.0815 0.1289 2.4535
stats.gmdist
ans =
0 3.8277 11.1106
3.8277 0 6.1374
11.1106 6.1374 0
9
Regression Analysis
Introduction
Regression is the process of fitting models to data. The process depends on the
model. If a model is parametric, regression estimates the parameters from the
data. If a model is linear in the parameters, estimation is based on methods
from linear algebra that minimize the norm of a residual vector. If a model
is nonlinear in the parameters, estimation is based on search methods from
optimization that minimize the norm of a residual vector. Nonparametric
models, like “Regression Trees” on page 9-94, use methods all their own.
This chapter considers data and models with continuous predictors and
responses. Categorical predictors are the subject of Chapter 8, “Analysis
of Variance”. Categorical responses are the subject of Chapter 12,
“Classification”.
Linear Regression
In this section...
“Linear Regression Models” on page 9-3
“Multiple Linear Regression” on page 9-8
“Robust Regression” on page 9-14
“Stepwise Regression” on page 9-19
“Ridge Regression” on page 9-29
“Partial Least Squares” on page 9-32
“Polynomial Models” on page 9-37
“Response Surface Models” on page 9-45
“Generalized Linear Models” on page 9-52
“Multivariate Regression” on page 9-57
Linear regression models take forms such as

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon $$

or, in general,

$$ y = \beta_1 f_1(x) + \cdots + \beta_p f_p(x) + \varepsilon $$

where the response y is modeled as a linear combination of (not necessarily linear) functions fj of the predictors, plus a random error ε.
Given n independent observations (x1, y1), …, (xn, yn) of the predictor x and the
response y, the linear regression model becomes an n-by-p system of equations:
$$
\underbrace{\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}}_{y}
=
\underbrace{\begin{pmatrix} f_1(x_1) & \cdots & f_p(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_n) & \cdots & f_p(x_n) \end{pmatrix}}_{X}
\underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{\varepsilon}
$$

written compactly as y = Xβ + ε.
X is the design matrix of the system. The columns of X are the terms of the
model evaluated at the predictors. To fit the model to the data, the system
must be solved for the p coefficient values in β = (β1, …, βp)T.
In MATLAB, the backslash operator computes the least-squares solution of the system:

betahat = X\y
The least-squares estimate β̂ satisfies the normal equations

$$ X^T \left( y - X\hat{\beta} \right) = 0 $$

or

$$ X^T X \hat{\beta} = X^T y $$
If X is n-by-p, the normal equations are a p-by-p square system with solution
betahat = inv(X'*X)*X'*y, where inv is the MATLAB inverse operator.
The matrix inv(X'*X)*X' is the pseudoinverse of X, computed by the
MATLAB function pinv.
The normal equations are often badly conditioned relative to the original
system y = Xβ (the coefficient estimates are much more sensitive to the model
error ε), so the MATLAB backslash operator avoids solving them directly.
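The following sketch (with made-up, well-conditioned data) compares the equivalent computations discussed above:

X = [ones(10,1) (1:10)'];            % simple design matrix with intercept
y = 3 + 2*(1:10)' + 0.1*randn(10,1); % illustrative response
b1 = X\y;                            % backslash: QR-based, preferred
b2 = pinv(X)*y;                      % pseudoinverse
b3 = (X'*X)\(X'*y);                  % normal equations: avoid when X is ill-conditioned
[b1 b2 b3]                           % the three columns are nearly identical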
Substituting the QR decomposition X = QR, where Q has orthonormal columns (QTQ = I) and R is upper triangular, into the normal equations gives

$$
\begin{aligned}
X^T X \hat{\beta} &= X^T y \\
(QR)^T (QR) \hat{\beta} &= (QR)^T y \\
R^T Q^T Q R \hat{\beta} &= R^T Q^T y \\
R^T R \hat{\beta} &= R^T Q^T y \\
R \hat{\beta} &= Q^T y
\end{aligned}
$$
Statistics Toolbox functions like regress and regstats call the MATLAB
backslash operator to perform linear regression. The QR decomposition is also
used for efficient computation of confidence intervals.
Once betahat is computed, the model can be evaluated at the predictor data:
yhat = X*betahat
or
yhat = X*inv(X'*X)*X'*y
Multiple Linear Regression

Introduction
The system of linear equations

$$
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} f_1(x_1) & \cdots & f_p(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_n) & \cdots & f_p(x_n) \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

that is, y = Xβ + ε, introduced in “Linear Regression Models” on page 9-3, defines a multiple linear regression model when the terms fj(x) are separate predictor variables.
The Statistics Toolbox functions regress and regstats are used for multiple
linear regression analysis.
For example, the data in moore.mat has five predictor variables in the first five columns and a response in the sixth. Fit a model with a constant term by adding a column of ones:

load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
Alternatively, the backslash operator computes the same least-squares solution:

betahat = X1\y
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
The advantage of working with regress is that it allows for additional inputs
and outputs relevant to statistical analysis of the regression. For example:
alpha = 0.05;
[betahat,Ibeta,res,Ires,stats] = regress(y,X1,alpha);
Visualize the residuals, in case (row number) order, with the rcoplot
function:
rcoplot(res,Ires)
The interval around the first residual, shown in red when plotted, does not
contain zero. This indicates that the residual is larger than expected in 95%
of new observations, and suggests the data point is an outlier.
2 If there is a systematic error in the model (that is, if the model is not
appropriate for generating the data under model assumptions), the mean
of the residuals is not zero.
3 If the errors in the model are not normally distributed, the distributions
of the residuals may be skewed or leptokurtic (with heavy tails and more
outliers).
X2 = moore(:,1:5);
stats = regstats(y,X2);
Calling regstats without output arguments opens a graphical interface for selecting the diagnostic statistics to compute:

regstats(y,X2)
Select the check boxes corresponding to the statistics you want to compute and
click OK. Selected statistics are returned to the MATLAB workspace. Names
of container variables for the statistics appear on the right-hand side of the
interface, where they can be changed to any valid MATLAB variable name.
For example, use the tstat field of the stats output structure to tabulate coefficient statistics with the dataset function:

t = stats.tstat;
CoeffTable = dataset({t.beta,'Coef'},{t.se,'StdErr'}, ...
{t.t,'tStat'},{t.pval,'pVal'})
CoeffTable =
Coef StdErr tStat pVal
-2.1561 0.91349 -2.3603 0.0333
-9.0116e-006 0.00051835 -0.017385 0.98637
0.0013159 0.0012635 1.0415 0.31531
0.0001278 7.6902e-005 1.6618 0.11876
0.0078989 0.014 0.56421 0.58154
0.00014165 7.3749e-005 1.9208 0.075365
The MATLAB function fprintf gives you control over tabular formatting.
For example, the fstat field of the stats output structure of regstats is a
structure with statistics related to the analysis of variance (ANOVA) of the
regression. The following commands produce a standard regression ANOVA
table:
f = stats.fstat;
fprintf('\n')
fprintf('Regression ANOVA');
fprintf('\n\n')
fprintf('%6s','Source');
fprintf('%10s','df','SS','MS','F','P');
fprintf('\n')
fprintf('%6s','Regr');
fprintf('%10.4f',f.dfr,f.ssr,f.ssr/f.dfr,f.f,f.pval);
fprintf('\n')
fprintf('%6s','Resid');
fprintf('%10.4f',f.dfe,f.sse,f.sse/f.dfe);
fprintf('\n')
fprintf('%6s','Total');
fprintf('%10.4f',f.dfe+f.dfr,f.sse+f.ssr);
fprintf('\n')
Regression ANOVA
Source df SS MS F P
Regr 5.0000 4.1084 0.8217 11.9886 0.0001
Resid 14.0000 0.9595 0.0685
Total 19.0000 5.0679
Robust Regression
• “Introduction” on page 9-14
• “Programmatic Robust Regression” on page 9-15
• “Interactive Robust Regression” on page 9-16
Introduction
The models described in “Linear Regression Models” on page 9-3 are based on
certain assumptions, such as a normal distribution of errors in the observed
responses. If the distribution of errors is asymmetric or prone to outliers,
model assumptions are invalidated, and parameter estimates, confidence
intervals, and other computed statistics become unreliable. The Statistics
Toolbox function robustfit is useful in these cases. The function implements
a robust fitting method that is less sensitive than ordinary least squares to
large changes in small parts of the data.
Programmatic Robust Regression

The robustfit function implements an iterative method called iteratively reweighted least squares. In the first iteration, each point is assigned equal
weight and model coefficients are estimated using ordinary least squares. At
subsequent iterations, weights are recomputed so that points farther from
model predictions in the previous iteration are given lower weight. Model
coefficients are then recomputed using weighted least squares. The process
continues until the values of the coefficient estimates converge within a
specified tolerance.
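The following is a minimal sketch of this iteration with a bisquare weight function; robustfit's actual implementation differs in details (for example, it adjusts weights for leverage and uses a tuning constant of 4.685 by default):

X = [ones(20,1) (1:20)'];            % design matrix with intercept
y = 2 + 0.5*(1:20)' + randn(20,1);   % illustrative data
y(20) = 30;                          % inject an outlier
b = X\y;                             % iteration 0: ordinary least squares
for iter = 1:50
    r = y - X*b;                     % residuals from the previous fit
    s = mad(r,1)/0.6745;             % robust estimate of residual scale
    u = r/(4.685*s);                 % scaled residuals
    w = (abs(u) < 1).*(1 - u.^2).^2; % bisquare weights; distant points get 0
    sw = sqrt(w);
    bnew = (X.*[sw sw])\(sw.*y);     % weighted least-squares step
    if norm(bnew - b) < 1e-8, break; end
    b = bnew;
end
b                                    % robust coefficient estimates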
load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
X2 = moore(:,1:5);
robustbeta = robustfit(X2,y)
robustbeta =
-1.7516
0.0000
0.0009
0.0002
0.0060
0.0001
[robustbeta,stats] = robustfit(X2,y);
stats.w'
ans =
Columns 1 through 5
0.0246 0.9986 0.9763 0.9323 0.9704
Columns 6 through 10
0.8597 0.9180 0.9992 0.9590 0.9649
Columns 11 through 15
0.9769 0.9868 0.9999 0.9976 0.8122
Columns 16 through 20
0.9733 0.9892 0.9988 0.8974 0.6774
The first data point has a very low weight compared to the other data points,
and so is effectively ignored in the robust regression.
Interactive Robust Regression

The robustdemo function shows the difference between ordinary least squares and robust regression for data with a single predictor.

1 Start the demo. To begin using robustdemo with the built-in data, simply
type the function name:
robustdemo
The resulting figure shows a scatter plot with two fitted lines. The red line
is the fit using ordinary least-squares regression. The green line is the
fit using robust regression. At the bottom of the figure are the equations
for the fitted lines, together with the estimated root mean squared errors
for each fit.
2 View leverages and robust weights. Right-click on any data point to see its least-squares leverage and robust weight.

In the built-in data, the right-most point has a relatively high leverage of
0.35. The point exerts a large influence on the least-squares fit, but its
small robust weight shows that it is effectively excluded from the robust fit.
3 See how changes in the data affect the fits. With the left mouse
button, click and hold on any data point and drag it to a new location.
When you release the mouse button, the displays update.
Bringing the right-most data point closer to the least-squares line makes
the two fitted lines nearly identical. The adjusted right-most data point
has significant weight in the robust fit.
Stepwise Regression
• “Introduction” on page 9-19
• “Programmatic Stepwise Regression” on page 9-21
• “Interactive Stepwise Regression” on page 9-27
Introduction
Multiple linear regression models, as described in “Multiple Linear
Regression” on page 9-8, are built from a potentially large number of
predictive terms. The number of interaction terms, for example, increases
exponentially with the number of predictor variables. If there is no theoretical basis for choosing among these terms, stepwise regression is a systematic method for adding and removing terms based on their statistical significance in the regression. The method proceeds as follows:

1 Fit the initial model.
2 If any terms not in the model have p-values less than an entrance tolerance
(that is, if it is unlikely that they would have zero coefficient if added to
the model), add the one with the smallest p value and repeat this step;
otherwise, go to step 3.
3 If any terms in the model have p-values greater than an exit tolerance (that
is, if it is unlikely that the hypothesis of a zero coefficient can be rejected),
remove the one with the largest p value and go to step 2; otherwise, end.
Depending on the terms included in the initial model and the order in which
terms are moved in and out, the method may build different models from the
same set of potential terms. The method terminates when no single step
improves the model. There is no guarantee, however, that a different initial
model or a different sequence of steps will not lead to a better fit. In this
sense, stepwise models are locally optimal, but may not be globally optimal.
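One way to see this dependence, using the cement data introduced in the following section, is to start from a full initial model and compare the result with the forward fits shown there (a sketch, not from the source):

load hald
stepwisefit(ingredients,heat, ...
    'inmodel',true(1,4), ...     % start from the model containing all terms
    'penter',0.05,'premove',0.10);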
Programmatic Stepwise Regression

load hald
whos
Name Size Bytes Class Attributes
The response (heat) depends on the quantities of the four predictors (the
columns of ingredients).
stepwisefit(ingredients,heat,...
'penter',0.05,'premove',0.10);
Initial columns included: none
Step 1, added column 4, p=0.000576232
Step 2, added column 1, p=1.10528e-006
Final columns included: 1 4
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4400] [ 0.1384] 'In' [1.1053e-006]
[ 0.4161] [ 0.1856] 'Out' [ 0.0517]
[-0.4100] [ 0.1992] 'Out' [ 0.0697]
[-0.6140] [ 0.0486] 'In' [1.8149e-007]
initialModel = ...
[false true false false]; % Force in 2nd term
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10);
Initial columns included: 2
Step 1, added column 1, p=2.69221e-007
Final columns included: 1 2
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4683] [ 0.1213] 'In' [2.6922e-007]
[ 0.6623] [ 0.0459] 'In' [5.0290e-008]
[ 0.2500] [ 0.1847] 'Out' [ 0.2089]
[-0.2365] [ 0.1733] 'Out' [ 0.2054]
The preceding two models, built from different initial models, use different
subsets of the predictive terms. Terms 2 and 4, swapped in the two models,
are highly correlated:
term2 = ingredients(:,2);
term4 = ingredients(:,4);
R = corrcoef(term2,term4)
R =
1.0000 -0.9730
-0.9730 1.0000
[betahat1,se1,pval1,inmodel1,stats1] = ...
stepwisefit(ingredients,heat,...
'penter',.05,'premove',0.10,...
'display','off');
[betahat2,se2,pval2,inmodel2,stats2] = ...
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10,...
'display','off');
RMSE1 = stats1.rmse
RMSE1 =
2.7343
RMSE2 = stats2.rmse
RMSE2 =
2.4063
The second model has a lower Root Mean Square Error (RMSE).
An added variable plot is used to determine the unique effect of adding a new
term to a model. The plot shows the relationship between the part of the
response unexplained by terms already in the model and the part of the new
term unexplained by terms already in the model. The “unexplained” parts
are measured by the residuals of the respective regressions. A scatter of the
residuals from the two regressions forms the added variable plot.
For example, suppose you want to add term2 to a model that already contains
the single term term1. First, consider the ability of term2 alone to explain
the response:
load hald
term2 = ingredients(:,2);
b2 = regress(heat,[ones(size(term2)) term2]);
scatter(term2,heat)
xlabel('Term 2')
ylabel('Heat')
hold on
x2 = 20:80;
y2 = b2(1) + b2(2)*x2;
plot(x2,y2,'r')
title('{\bf Response Explained by Term 2: Ignoring Term 1}')
Next, consider the following regressions involving the model term term1:
term1 = ingredients(:,1);
[b1,Ib1,res1] = regress(heat,[ones(size(term1)) term1]);
[b21,Ib21,res21] = regress(term2,[ones(size(term1)) term1]);
bres = regress(res1,[ones(size(res21)) res21]);
A scatter of the residuals res1 vs. the residuals res21 forms the added
variable plot:
figure
scatter(res21,res1)
xlabel('Residuals: Term 2 on Term 1')
ylabel('Residuals: Heat on Term 1')
hold on
xres = -30:30;
yres = bres(1) + bres(2)*xres;
plot(xres,yres,'r')
title('{\bf Response Explained by Term 2: Adjusted for Term 1}')
Since the plot adjusted for term1 shows a stronger relationship (less variation
along the fitted line) than the plot ignoring term1, the two terms act jointly to
explain extra variation. In this case, adding term2 to a model consisting of
term1 would reduce the RMSE.
The function addedvarplot produces added variable plots directly:

figure
addedvarplot(ingredients,heat,2,[true false false false])
In addition to the scatter of residuals, the plot shows 95% confidence intervals
on predictions from the fitted line. The fitted line has intercept zero because,
under the assumptions outlined in “Linear Regression Models” on page 9-3,
both of the plotted variables have mean zero. The slope of the fitted line is the
coefficient that term2 would have if it was added to the model with term1.
The addedvarplot function is useful for considering the unique effect of adding a new term to an existing model with any number of terms.
Interactive Stepwise Regression

The stepwise interface provides an interactive environment for stepwise regression:

load hald
stepwise(ingredients,heat)
The upper left of the interface displays estimates of the coefficients for all
potential terms, with horizontal bars indicating 90% (colored) and 95% (grey)
confidence intervals. The red color indicates that, initially, the terms are not
in the model. Values displayed in the table are those that would result if
the terms were added to the model.
The middle portion of the interface displays summary statistics for the entire
model. These statistics are updated with each step.
The lower portion of the interface, Model History, displays the RMSE for
the model. The plot tracks the RMSE from step to step, so you can compare
the optimality of different models. Hover over the blue dots in the history to
see which terms were in the model at a particular step. Click on a blue dot
in the history to open a copy of the interface initialized with the terms in
the model at that step.
To center and scale the input data (compute z-scores) to improve conditioning
of the underlying least-squares problem, select Scale Inputs from the
Stepwise menu.
Proceed through a stepwise regression in one of two ways:
1 Click Next Step to select the recommended next step. The recommended
next step either adds the most significant term or removes the least
significant term. When the regression reaches a local minimum of RMSE,
the recommended next step is “Move no terms.” You can perform all of the
recommended steps at once by clicking All Steps.
2 Click a line in the plot or in the table to toggle the state of the corresponding
term. Clicking a red line, corresponding to a term not currently in the
model, adds the term to the model and changes the line to blue. Clicking
a blue line, corresponding to a term currently in the model, removes the
term from the model and changes the line to red.
To call addedvarplot and produce an added variable plot from the stepwise
interface, select Added Variable Plot from the Stepwise menu. A list of
terms is displayed. Select the term you want to add, and then click OK.
Click Export to display a dialog box that allows you to select information
from the interface to save to the MATLAB workspace. Check the information
you want to export and, optionally, change the names of the workspace
variables to be created. Click OK to export the information.
Ridge Regression
• “Introduction” on page 9-29
• “Example: Ridge Regression” on page 9-30
Introduction
Coefficient estimates for the models described in “Multiple Linear Regression”
on page 9-8 rely on the independence of the model terms. When terms are
correlated and the columns of the design matrix X have an approximate
linear dependence, the matrix $(X^TX)^{-1}$ becomes close to singular. As a result,
the least-squares estimate

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

becomes highly sensitive to random errors in the observed response y,
producing a large variance. Ridge regression addresses this problem by
estimating the coefficients with

$$\hat{\beta} = (X^TX + kI)^{-1}X^Ty$$

where k is the ridge parameter and I is the identity matrix. Small positive
values of k improve the conditioning of the problem and reduce the variance
of the estimates. While biased, the reduced variance of ridge estimates
often results in a smaller mean square error when compared to least-squares
estimates.
Example: Ridge Regression
Load the data in acetylene.mat, with observations of the predictor variables
x1, x2, x3, and the response y. Plot the predictor pairs to see their correlations:

load acetylene
subplot(1,3,1)
plot(x1,x2,'.')
xlabel('x1'); ylabel('x2'); grid on; axis square
subplot(1,3,2)
plot(x1,x3,'.')
xlabel('x1'); ylabel('x3'); grid on; axis square
subplot(1,3,3)
plot(x2,x3,'.')
xlabel('x2'); ylabel('x3'); grid on; axis square
Note the correlation between x1 and the other two predictor variables.
Use ridge and x2fx to compute coefficient estimates for a multilinear model
with interaction terms, for a range of ridge parameters:
X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % No constant term
k = 0:1e-5:5e-3; % Range of ridge parameters
betahat = ridge(y,D,k);
figure
plot(k,betahat,'LineWidth',2)
ylim([-100 100])
grid on
xlabel('Ridge Parameter')
ylabel('Standardized Coefficient')
title('{\bf Ridge Trace}')
legend('x1','x2','x3','x1x2','x1x3','x2x3')
The estimates stabilize to the right of the plot. Note that the coefficient of
the x2x3 interaction term changes sign at a ridge parameter value of
approximately $5 \times 10^{-4}$.
Partial Least Squares
Introduction
Partial least-squares (PLS) regression is a technique used with data that
contain correlated predictor variables. This technique constructs new
predictor variables, known as components, as linear combinations of the
original predictor variables. PLS constructs these components while
considering the observed response values, leading to a parsimonious model
with reliable predictive power.
• Multiple linear regression finds a combination of the predictors that best
fits a response.
• Principal component analysis finds combinations of the predictors with
large variance, reducing correlations. The technique makes no use of
response values.
• PLS finds combinations of the predictors that have a large covariance with
the response values.
PLS therefore combines information about the variances of both the predictors
and the responses, while also considering the correlations among them.
For example, consider the data in moore.mat, augmented below with noisy
copies of the predictors to introduce correlations:

load moore
y = moore(:,6); % Response
X0 = moore(:,1:5); % Original predictors
X1 = X0+10*randn(size(X0)); % Correlated predictors
X = [X0,X1];
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
Choosing the number of components in a PLS model is a critical step. The plot
gives a rough indication, showing nearly 80% of the variance in y explained
by six or fewer components. Compute the fitted response values for the
six-component model and its R-squared statistic:
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);
yfit = [ones(size(X,1),1) X]*beta;
plot(y,yfit,'o')
TSS = sum((y-mean(y)).^2);
RSS = sum((y-yfit).^2);
Rsquared = 1 - RSS/TSS
Rsquared =
0.8421
A plot of the weights of the ten predictors in each of the six components shows
that two of the components (the last two computed) explain the majority of
the variance in X:
plot(1:10,stats.W,'o-');
legend({'c1','c2','c3','c4','c5','c6'},'Location','NW')
xlabel('Predictor');
ylabel('Weight');
To see how the mean-squared errors for the predictors and the response
depend on the number of components, plot both rows of the MSE output:

[axes,h1,h2] = plotyy(0:6,MSE(1,:),0:6,MSE(2,:));
set(h1,'Marker','o')
set(h2,'Marker','o')
legend('MSE Predictors','MSE Response')
xlabel('Number of Components')
Polynomial Models
• “Introduction” on page 9-37
• “Programmatic Polynomial Regression” on page 9-38
• “Interactive Polynomial Regression” on page 9-43
Introduction
Polynomial models are a special case of the linear models discussed in “Linear
Regression Models” on page 9-3. Polynomial models have the advantages of
being simple, familiar in their properties, and reasonably flexible for following
data trends. They are also robust with respect to changes in the location and
scale of the data (see “Conditioning Polynomial Fits” on page 9-41). However,
polynomial models may be poor predictors of new values. They oscillate
between data points, especially as the degree is increased to improve the fit.
Asymptotically, they follow power functions, leading to inaccuracies when
extrapolating other long-term trends. Choosing a polynomial model is often a
trade-off between a simple description of overall data trends and the accuracy
of predictions made from the model.
Programmatic Polynomial Regression
The MATLAB function polyfit computes least-squares polynomial fits of a
specified degree. For example:

x = 0:5; % x data
y = [2 1 4 4 3 2]; % y data
p = polyfit(x,y,3) % Degree 3 fit
p =
-0.1296 0.6865 -0.1759 1.6746
The MATLAB function roots finds the roots of the polynomial:

r = roots(p)
r =
5.4786
-0.0913 + 1.5328i
-0.0913 - 1.5328i
The MATLAB function poly solves the inverse problem, finding a polynomial
with specified roots. poly is the inverse of roots up to ordering, scaling, and
round-off error.
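A quick check of this relationship, using the fit above (poly returns a monic
polynomial, so rescale by the leading coefficient of p):

p2 = poly(r)*p(1) % Reconstruct the coefficients of p from its roots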
The Statistics Toolbox function polyconf computes prediction intervals for
the fit, by default 95% intervals for new observations:

[p,S] = polyfit(x,y,3);
[yhat,delta] = polyconf(p,x,S);
PI = [yhat-delta;yhat+delta]'
PI =
-5.3022 8.6514
-4.2068 8.3179
-2.9899 9.0534
-2.1963 9.8471
-2.6036 9.9211
-5.2229 8.7308
The documentation example function polydemo (a helper used in this
documentation, not a toolbox function) combines polyfit and polyconf to
display data, a fit, and prediction intervals. For example:

x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
p = polydemo(x,y,2,0.05)
p =
0.8107 -4.5054 -1.1862
Conditioning Polynomial Fits. Polynomial fits can be badly conditioned
when predictor values are large and narrowly spread, as in the census data:

load census
x = cdate;
y = pop;
p = polyfit(x,y,3);
Warning: Polynomial is badly conditioned.
Add points with distinct X values,
reduce the degree of the polynomial,
or try centering and scaling as
described in HELP POLYFIT.
Follow the warning's advice: request the third output mu from polyfit to
center and scale the data, and pass it to polyval when evaluating the fit:

[p,S,mu] = polyfit(x,y,3); % Centered and scaled fit
xfit = linspace(x(1),x(end),100);
yfit = polyval(p,xfit,[],mu); % Evaluate conditioned fit
plot(x,y,'ro') % Plot data
hold on
plot(xfit,yfit,'b-') % Plot conditioned fit vs. x data
grid on
Interactive Polynomial Regression
The Basic Fitting Tool. The Basic Fitting Tool is a MATLAB interface,
discussed in “Interactive Fitting” in the MATLAB documentation. The tool
lets you fit polynomials of various degrees interactively, plot residuals, and
export results to the MATLAB workspace.
The Polynomial Fitting Tool. The Statistics Toolbox function polytool
opens an interactive polynomial fitting interface. For example, with quadratic
data and 95% intervals:

x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
polytool(x,y,2,0.05)
The interface allows you to:
• Interactively change the degree of the fit. Change the value in the Degree
text box at the top of the figure.
• Evaluate the fit and the bounds using a movable crosshair. Click, hold, and
drag the crosshair to change its position.
• Export estimated coefficients, predicted values, prediction intervals, and
residuals to the MATLAB workspace. Click Export to open a dialog box
with choices for exporting the data.
Options for the displayed bounds and the fitting method are available through
menu options at the top of the figure:
• The Bounds menu lets you choose between bounds on new observations
(the default) and bounds on estimated values. It also lets you choose
between nonsimultaneous (the default) and simultaneous bounds. See
polyconf for a description of these options.
• The Method menu lets you choose between ordinary least-squares
regression and robust regression, as described in “Robust Regression” on
page 9-14.
Response Surfaces
Introduction
Polynomial models are generalized to any number of predictor variables xi (i
= 1, ..., N) as follows:
$$y(x) = a_0 + \sum_{i=1}^{N} a_i x_i + \sum_{i<j} a_{ij} x_i x_j + \sum_{i=1}^{N} a_{ii} x_i^2 + \cdots$$
The model includes, from left to right, an intercept, linear terms, quadratic
interaction terms, and squared terms. Higher order terms would follow, as
necessary.
For example, load the data in reaction.mat, with observations of a reaction
rate and three reactant concentrations:

load reaction
The x2fx function converts predictor data to design matrices for quadratic
models. The regstats function calls x2fx when instructed to do so.
For example, the following fits a quadratic response surface model to the
data in reaction.mat:
stats = regstats(rate,reactants,'quadratic','beta');
b = stats.beta; % Model coefficients
The 10-by-1 vector b contains, in order, a constant term and then the
coefficients for the model terms $x_1$, $x_2$, $x_3$, $x_1x_2$, $x_1x_3$, $x_2x_3$, $x_1^2$, $x_2^2$, and $x_3^2$, where
$x_1$, $x_2$, and $x_3$ are the three columns of reactants. The order of coefficients for
quadratic models is described in the reference page for x2fx.
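For instance, a one-row sketch showing the term order x2fx produces for a
quadratic model:

D = x2fx([1 2 3],'quadratic')
% Columns: 1, x1, x2, x3, x1*x2, x1*x3, x2*x3, x1^2, x2^2, x3^2
% D = 1 1 2 3 2 3 6 1 4 9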
Since the model involves only three predictors, it is possible to visualize the
entire response surface using a color dimension for the reaction rate:
x1 = reactants(:,1);
x2 = reactants(:,2);
x3 = reactants(:,3);
xx1 = linspace(min(x1),max(x1),25);
xx2 = linspace(min(x2),max(x2),25);
xx3 = linspace(min(x3),max(x3),25);
[X1,X2,X3] = meshgrid(xx1,xx2,xx3);
RATE = reshape(x2fx([X1(:),X2(:),X3(:)],'quadratic')*b, ...
    size(X1)); % Model predictions on the grid
hmodel = scatter3(X1(:),X2(:),X3(:),5,RATE(:),'filled');
hold on
hdata = scatter3(x1,x2,x3,'ko','filled');
axis tight
xlabel(xn(1,:))
ylabel(xn(2,:))
zlabel(xn(3,:))
hbar = colorbar;
ylabel(hbar,yn);
title('{\bf Quadratic Response Surface Model}')
legend(hdata,'Data','Location','NE')
The plot shows a general increase in model response, within the space of
the observed data, as the concentration of n-pentane increases and the
concentrations of hydrogen and isopentane decrease.
The coefficients of the squared and interaction terms form a symmetric
matrix H that determines the curvature of the modeled surface:

H = [b(8),b(5)/2,b(6)/2; ...
b(5)/2,b(9),b(7)/2; ...
b(6)/2,b(7)/2,b(10)];
lambda = eig(H)
lambda =
1.0e-003 *
-0.1303
0.0412
0.4292
The mixed signs of the eigenvalues show that the surface is saddle-shaped
rather than having a single peak or trough. To view the model at a fixed
n-pentane concentration, remove the scatter of model points and use slice:

delete(hmodel)
X2slice = 200; % Fix n-Pentane concentration
slice(X1,X2,X3,RATE,[],X2slice,[])
The Statistics Toolbox function rstool opens an interactive Response Surface
Tool for exploring the model:

load reaction
alpha = 0.01; % Significance level
rstool(reactants,rate,'quadratic',alpha,xn,yn)
The interface displays the fitted response against each predictor, with the
other predictors held at the fixed values shown in the plots. Predictor values
are changed by editing the text boxes or by dragging the dashed blue lines.
When you change the value of a predictor, all plots update to show the new
point in predictor space.
Generalized Linear Models
Introduction
Linear regression models describe a linear relationship between a response
and one or more predictive terms. Many times, however, a nonlinear
relationship exists. “Nonlinear Regression” on page 9-58 describes general
nonlinear models. A special class of nonlinear models, known as generalized
linear models, makes use of linear methods.
A linear regression model makes the following assumptions:
• At each set of values for the predictors, the response has a normal
distribution with mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• The model is μ = Xb.
A generalized linear model relaxes these assumptions:
• At each set of values for the predictors, the response has a distribution
that may be normal, binomial, Poisson, gamma, or inverse Gaussian, with
parameters including a mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• A link function f defines the model as f(μ) = Xb.
For example, consider data on the proportion of cars with poor gas mileage
as a function of car weight, with counts poor out of total tested at each
weight w. Plot the observed proportions:

plot(w,poor./total,'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
The logistic model is useful for proportion data. It defines the relationship
between the proportion p and the weight w by:

$$\log\!\left(\frac{p}{1-p}\right) = b_1 + b_2 w$$
Some of the proportions in the data are 0 and 1, making the left-hand side of
this equation undefined. To keep the proportions within range, add relatively
small perturbations to the poor and total values. A semi-log plot then shows
a nearly linear relationship, as predicted by the model:
p_adjusted = (poor+.5)./(total+1);
semilogy(w,p_adjusted./(1-p_adjusted),'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Adjusted p / (1 - p)')
Use glmfit to fit the logistic model:

b = glmfit(w,[poor total],'binomial','link','logit')
b =
-13.3801
0.0042
Use glmval to compute fitted proportions from the model, and plot them
with the data:

x = 2100:100:4500;
y = glmval(b,x,'logit');
plot(w,poor./total,'x','LineWidth',2)
hold on
plot(x,y,'r-','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
Multivariate Regression
Whether or not the predictor x is a vector of predictor variables, multivariate
regression refers to the case where the response y = (y1, ..., yM) is a vector of
M response variables.
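The Statistics Toolbox function mvregress fits multivariate regression
models. A minimal sketch, with hypothetical data in which two responses
share three predictors and each observation gets its own design matrix:

n = 50; p = 3; M = 2;
x = randn(n,p);                 % Predictor data (hypothetical)
Btrue = [1 -2 0.5; 3 0 -1];     % M-by-p true coefficients
Y = x*Btrue' + 0.1*randn(n,M);  % n-by-M responses
X = cell(n,1);                  % One M-by-(M*p) design per observation
for i = 1:n
    X{i} = kron(eye(M),x(i,:));
end
[beta,Sigma] = mvregress(X,Y);  % Stacked coefficients, error covariance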
Nonlinear Regression
In this section...
“Nonlinear Regression Models” on page 9-58
“Parametric Models” on page 9-59
“Mixed-Effects Models” on page 9-64
“Regression Trees” on page 9-94
Parametric Models
• “A Parametric Nonlinear Model” on page 9-59
• “Confidence Intervals for Parameter Estimates” on page 9-61
• “Confidence Intervals for Predicted Responses” on page 9-61
• “Interactive Nonlinear Parametric Regression” on page 9-62
A Parametric Nonlinear Model
An example of a parametric nonlinear model is the Hougen-Watson model
for reaction kinetics:

$$\text{rate} = \frac{\beta_1 x_2 - x_3/\beta_5}{1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3}$$
where rate is the reaction rate, x1, x2, and x3 are concentrations of hydrogen,
n-pentane, and isopentane, respectively, and β1, β2, ... , β5 are the unknown
parameters.
load reaction
The function for the model is hougen, which looks like this:
type hougen
function yhat = hougen(beta,x)
%HOUGEN Hougen-Watson model for reaction kinetics.
b1 = beta(1);
b2 = beta(2);
b3 = beta(3);
b4 = beta(4);
b5 = beta(5);
x1 = x(:,1);
x2 = x(:,2);
x3 = x(:,3);

yhat = (b1*x2 - x3/b5)./(1 + b2*x1 + b3*x2 + b4*x3);
nlinfit requires the predictor data, the responses, and an initial guess of the
unknown parameters. It also requires a function handle to a function that
takes the predictor data and parameter estimates and returns the responses
predicted by the model.
To fit the reaction data, call nlinfit using the following syntax:
load reaction
betahat = nlinfit(reactants,rate,@hougen,beta)
betahat =
1.2526
0.0628
0.0400
0.1124
1.1914
The function nlinfit has robust options, similar to those for robustfit, for
fitting nonlinear models to data with outliers.
Confidence Intervals for Parameter Estimates
Use nlparci to compute confidence intervals for the parameters from the
residuals and Jacobian returned by nlinfit:

[betahat,resid,J] = nlinfit(reactants,rate,@hougen,beta);
betaci = nlparci(betahat,resid,J)
betaci =
-0.7467 3.2519
-0.0377 0.1632
-0.0312 0.1113
-0.0609 0.2857
-0.7381 3.1208
The columns of the output betaci contain the lower and upper bounds,
respectively, of the (default) 95% confidence intervals for each parameter.
Confidence Intervals for Predicted Responses
Use nlpredci to compute confidence intervals for the predicted responses:

[yhat,delta] = nlpredci(@hougen,reactants,betahat,resid,J);
opd = [rate yhat delta]
opd =
8.5500 8.4179 0.2805
3.7900 3.9542 0.2474
4.8200 4.9109 0.1766
0.0200 -0.0110 0.1875
2.7500 2.6358 0.1578
14.3900 14.3402 0.4236
2.5400 2.5662 0.2425
The output opd contains the observed rates in the first column and the
predicted rates in the second column. The (default) 95% simultaneous
confidence intervals on the predictions are the values in the second column ±
the values in the third column. These are not intervals for new observations
at the predictors, even though most of the confidence intervals do contain the
original observations.
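To obtain intervals for new observations at the predictors instead, nlpredci
accepts options; a sketch:

[yhat,delta] = nlpredci(@hougen,reactants,betahat,resid,J, ...
    'predopt','observation');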
Interactive Nonlinear Parametric Regression
Open nlintool with the reaction data and the hougen model by typing
load reaction
nlintool(reactants,rate,@hougen,beta,0.01,xn,yn)
You see three plots. The response variable for all plots is the reaction rate,
plotted in green. The red lines show confidence intervals on predicted
responses. The first plot shows hydrogen as the predictor, the second shows
n-pentane, and the third shows isopentane.
Each plot displays the fitted relationship of the reaction rate to one predictor
at a fixed value of the other two predictors. The fixed values are in the text
boxes below each predictor axis. Change the fixed values by typing in a new
value or by dragging the vertical lines in the plots to new positions. When
you change the value of a predictor, all plots update to display the model
at the new point in predictor space.
While this example uses only three predictors, nlintool can accommodate
any number of predictors.
Mixed-Effects Models
• “Introduction” on page 9-64
• “Mixed-Effects Model Hierarchy” on page 9-65
• “Specifying Mixed-Effects Models” on page 9-67
• “Specifying Covariate Models” on page 9-70
• “Choosing nlmefit or nlmefitsa” on page 9-71
• “Using Output Functions with Mixed-Effects Models” on page 9-74
• “Example: Mixed-Effects Models Using nlmefit and nlmefitsa” on page 9-80
Introduction
In statistics, an effect is anything that influences the value of a response
variable at a particular setting of the predictor variables. Effects are
translated into model parameters. In linear models, effects become
coefficients, representing the proportional contributions of model terms. In
nonlinear models, effects often have specific physical interpretations, and
appear in more general nonlinear combinations.
In the drug elimination model, for example, the concentration for individual
i at time t can be written

$$C_0\, e^{-[\bar{r} + (r_i - \bar{r})]\,t} = C_0\, e^{-(\beta + b_i)\,t},$$

where the mean elimination rate $\bar{r} = \beta$ is a fixed effect and the individual
deviation $r_i - \bar{r} = b_i$ is a random effect.
Random effects are useful when data falls into natural groups. In the drug
elimination model, the groups are simply the individuals under study. More
sophisticated models might group data by an individual’s age, weight, diet,
etc. Although the groups are not the focus of the study, adding random effects
to a model extends the reliability of inferences beyond the specific sample of
individuals.
Mixed-effects models account for both fixed and random effects. As with
all regression models, their purpose is to describe a response variable as a
function of the predictor variables. Mixed-effects models, however, recognize
correlations within sample subgroups. In this way, they provide a compromise
between ignoring data groups entirely and fitting each group with a separate
model.
In its simplest form, a nonlinear regression model for observation j in group
i is

$$y_{ij} = f(\varphi, x_{ij}) + \varepsilon_{ij}$$

In a mixed-effects model, the parameters vary by group:

$$y_{ij} = f(\varphi_i, x_{ij}) + \varepsilon_{ij}$$

In the simplest case, the group parameters combine fixed effects $\beta$ and
random effects $b_i$ additively:

$$\varphi_i = \beta + b_i$$

Design matrices $A$ and $B$ generalize this to

$$\varphi_i = A\beta + Bb_i$$

or, with group-specific designs,

$$\varphi_i = A_i\beta + B_ib_i$$

If the design matrices also differ among observations, the model becomes

$$\varphi_{ij} = A_{ij}\beta + B_{ij}b_i$$
$$y_{ij} = f(\varphi_{ij}, x_{ij}) + \varepsilon_{ij}$$
Some of the group-specific predictors in xij may not change with observation j.
Calling those vi, the model becomes
$$y_{ij} = f(\varphi_{ij}, x_{ij}, v_i) + \varepsilon_{ij}$$

Stacking the observations for each group $i$, the general nonlinear
mixed-effects model can be written

$$\varphi_i = A_i\beta + B_ib_i, \qquad y_i = f(\varphi_i, X_i) + \varepsilon_i,$$

with normally distributed random effects and errors:

$$b_i \sim N(0, \Psi), \qquad \varepsilon_i \sim N(0, \sigma^2)$$
For example, a two-compartment model describes the elimination of a drug
with two exponential rates:

$$y_{ij} = C_{pi}\, e^{-r_{pi} t_{ij}} + C_{qi}\, e^{-r_{qi} t_{ij}} + \varepsilon_{ij},$$
where yij is the observed concentration in individual i at time tij. The model
allows for different sampling times and different numbers of observations for
different individuals.
The elimination rates rpi and rqi must be positive to be physically meaningful.
Enforce this by introducing the log rates Rpi = log(rpi) and Rqi = log(rqi) and
reparametrizing the model:
$$y_{ij} = C_{pi}\, e^{-\exp(R_{pi})\, t_{ij}} + C_{qi}\, e^{-\exp(R_{qi})\, t_{ij}} + \varepsilon_{ij}$$
To introduce fixed effects β and random effects bi for all model parameters,
reexpress the model as follows:

$$y_{ij} = (\beta_1 + b_{1i})\, e^{-\exp(\beta_2 + b_{2i})\, t_{ij}} + (\beta_3 + b_{3i})\, e^{-\exp(\beta_4 + b_{4i})\, t_{ij}} + \varepsilon_{ij}$$
Fitting the model and estimating the covariance matrix Ψ often leads to
further refinements. A relatively small estimate for the variance of a random
effect suggests that it can be removed from the model. Likewise, relatively
small estimates for covariances among certain random effects suggests that a
full covariance matrix is unnecessary. Since random effects are unobserved,
Ψ must be estimated indirectly. Specifying a diagonal or block-diagonal
covariance pattern for Ψ can improve convergence and efficiency of the fitting
algorithm.
Statistics Toolbox functions nlmefit and nlmefitsa fit the general nonlinear
mixed-effects model to data, estimating the fixed and random effects. The
functions also estimate the covariance matrix Ψ for the random effects.
Additional diagnostic outputs allow you to assess tradeoffs between the
number of model parameters and the goodness of fit.
Specifying Covariate Models
If the model includes a group-dependent covariate such as weight ($w_i$), the
design becomes

$$\begin{pmatrix} \varphi_1 \\ \varphi_2 \\ \varphi_3 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & w_i \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} +
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

Thus, the parameter φi for any individual in the ith group is:

$$\begin{pmatrix} \varphi_{1i} \\ \varphi_{2i} \\ \varphi_{3i} \end{pmatrix} =
\begin{pmatrix} \beta_1 + \beta_4 w_i \\ \beta_2 \\ \beta_3 \end{pmatrix} +
\begin{pmatrix} b_{1i} \\ b_{2i} \\ b_{3i} \end{pmatrix}$$
To specify a covariate model, use the 'FEGroupDesign' option.
% Number of parameters in the model (Phi)
num_params = 3;
% Number of covariates
num_cov = 1;
% Assuming number of groups in the data set is 7
num_groups = 7;
% Array of covariate values
covariates = [75; 52; 66; 55; 70; 58; 62 ];
A = repmat(eye(num_params, num_params+num_cov),...
[1,1,num_groups]);
A(1,num_params+1,1:num_groups) = covariates(:,1)
options.FEGroupDesign = A;
Choosing nlmefit or nlmefitsa
The nlmefit function fits the model by maximizing an approximation to the
likelihood, selected with the 'ApproximationType' option:
• 'LME' — Use the likelihood for the linear mixed-effects model at the
current conditional estimates of beta and B. This is the default.
• 'RELME' — Use the restricted likelihood for the linear mixed-effects model
at the current conditional estimates of beta and B.
• 'FO' — First-order Laplacian approximation without random effects.
• 'FOCE' — First-order Laplacian approximation at the conditional estimates
of B.
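A sketch of selecting one of these approximations (assuming data, model
function, and starting values as in the example later in this section):

[phi,PSI] = nlmefit(time,concentration,subject,[],model,phi0, ...
    'ApproximationType','FOCE');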
- The fixed effect design must be constant in every group (for every
individual), so an observation-dependent design is not supported.
- The random effect design must be constant for the entire data set, so
neither an observation-dependent design nor a group-dependent design
is supported.
- As mentioned under Random Effects, the random effect design must
not specify random effects for slope coefficients. This implies that the
design must consist of zeros and ones.
- The random effect design must not use the same random effect for
multiple coefficients, and cannot use more than one random effect for
any single coefficient.
- The fixed effect design must not use the same coefficient for multiple
parameters. This implies that it can have at most one non-zero value
in each column.
If you want to use nlmefitsa for data in which the covariate effects are
random, include the covariates directly in the nonlinear model expression.
Don’t include the covariates in the fixed or random effect design matrices.
• Convergence — As described in the Model form, nlmefit and nlmefitsa
have different approaches to measuring convergence. nlmefit uses
traditional optimization measures, and nlmefitsa provides diagnostics to
help you judge the convergence of a random simulation.
Using Output Functions with Mixed-Effects Models
Use statset to set the value of OutputFcn to be a function handle, that is,
the name of the function preceded by the @ sign. For example, if the output
function is outfun.m, the command

options = statset('OutputFcn', @outfun);

specifies OutputFcn to be the handle to outfun. The output function must
have the form

stop = outfun(beta,status,state)

where beta is the current vector of fixed-effects estimates, status is a
structure describing the state of the fit, and state is a string describing the
stage of the algorithm. The solver passes the values of the input arguments
to outfun at each iteration.
Fields in status. The following table lists the fields of the status structure:

Field        Description
procedure    • 'ALT' — alternating algorithm for the optimization of
               the linear mixed effects or restricted linear mixed effects
               approximations
             • 'LAP' — optimization of the Laplacian approximation for
               first order or first order conditional estimation
iteration    An integer starting from 0.
inner        A structure describing the status of the inner iterations
             within the ALT and LAP procedures.
theta        The current parameterization of Psi.
mse          The current error variance.
States of the Algorithm. The following table lists the possible values for
state:
state Description
'init' The algorithm is in the initial state before the first
iteration.
'iter' The algorithm is at the end of an iteration.
'done' The algorithm is in the final state after the last iteration.
The following code illustrates how the output function might use the value of
state to decide which tasks to perform at the current iteration:
switch state
case 'iter'
% Make updates to plot or guis as needed
case 'init'
% Setup for plots or guis
case 'done'
% Cleanup of plots, guis, or final plot
otherwise
end
Stop Flag. The output argument stop is a flag that is true or false.
The flag tells the solver whether it should quit or continue. The following
examples show typical ways to use the stop flag.
The output function can stop the estimation at any iteration based on the
values of arguments passed into it. For example, the following code sets stop
to true based on the value of the log likelihood stored in the 'fval' field of
the status structure:
function stop = outfun(beta,status,state)
stop = false;
% Check if the log likelihood is more than -132.
if status.fval > -132
    stop = true;
end
If you design a GUI to perform nlmefit iterations, you can make the output
function stop when a user clicks a Stop button on the GUI. For example, the
following code implements a dialog to cancel calculations:
function stopper(varargin)
% Minimal sketch (assumed implementation): set a flag in the
% figure's application data when the user clicks Stop; the output
% function can then read the flag with getappdata.
setappdata(gcbf,'stop',true);
To prevent nlmefitsa from using this function, specify an empty value for
the output function:

options = statset('OutputFcn',[]);
Example: Mixed-Effects Models Using nlmefit and nlmefitsa
The following example fits a mixed-effects model to pharmacokinetic data.
Load the data and plot concentration against time, grouped by subject:

load indomethacin
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
A biexponential model with log rates, in the form reexpressed above, can be
coded as an anonymous function:

model = @(phi,t)(phi(1)*exp(-exp(phi(2))*t) + ...
                 phi(3)*exp(-exp(phi(4))*t));

Use the nlinfit function to fit the model to all of the data, ignoring
subject-specific effects:
phi0 = [1 1 1 1];
[phi,res] = nlinfit(time,concentration,model,phi0);
numObs = length(time);
numParams = 4;
df = numObs-numParams;
mse = (res'*res)/df
mse =
0.0304
tplot = 0:0.01:8;
plot(tplot,model(phi,tplot),'k','LineWidth',2)
hold off
A boxplot of residuals by subject shows that the boxes are mostly above or
below zero, indicating that the model has failed to account for subject-specific
effects:
colors = 'rygcbm';
h = boxplot(res,subject,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(res,subject,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
To account for subject-specific effects, fit the model separately to the data
for each subject:
phi0 = [1 1 1 1];
PHI = zeros(4,6);
RES = zeros(11,6);
for I = 1:6
tI = time(subject == I);
cI = concentration(subject == I);
[PHI(:,I),RES(:,I)] = nlinfit(tI,cI,model,phi0);
end
PHI
PHI =
0.1915 0.4989 1.6757 0.2545 3.5661 0.9685
-1.7878 -1.6354 -0.4122 -1.6026 1.0408 -0.8731
2.0293 2.8277 5.4683 2.1981 0.2915 3.0023
0.5794 0.8013 1.7498 0.2423 -1.5068 1.0882
numParams = 24;
df = numObs-numParams;
mse = (RES(:)'*RES(:))/df
mse =
0.0057
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
for I = 1:6
plot(tplot,model(PHI(:,I),tplot),'Color',colors(I))
end
axis([0 8 0 3.5])
hold off
PHI gives estimates of the four model parameters for each of the six subjects.
The estimates vary considerably, but taken as a 24-parameter model of the
data, the mean-squared error of 0.0057 is a significant reduction from 0.0304
in the original four-parameter model.
A boxplot of residuals by subject shows that the larger model accounts for
most of the subject-specific effects:
h = boxplot(RES,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(RES,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
The spread of the residuals (the vertical scale of the boxplot) is much smaller
than in the previous boxplot, and the boxes are now mostly centered on zero.
Fit a mixed-effects version of the model with nlmefit, with random effects for
all four parameters. Here nlme_model is the same biexponential model
function (the anonymous function above already has the required form):

nlme_model = model; % Same biexponential model function
phi0 = [1 1 1 1];
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0)
phi =
0.4606
-1.3459
2.8277
0.7729
PSI =
0.0124 0 0 0
0 0.0000 0 0
0 0 0.3264 0
0 0 0 0.0250
stats =
logl: 54.5884
mse: 0.0066
aic: -91.1767
bic: -71.4698
sebeta: NaN
dfe: 57
The estimated covariance matrix PSI shows that the variance of the second
random effect is essentially zero, suggesting that you can remove it to simplify
the model. To do this, use the REParamsSelect parameter to specify the
indices of the parameters to be modeled with random effects in nlmefit:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0, ...
    'REParamsSelect',[1 3 4])
phi =
0.4606
-1.3460
2.8277
0.7729
PSI =
0.0124 0 0
0 0.3270 0
0 0 0.0250
stats =
logl: 54.5876
mse: 0.0066
aic: -93.1752
bic: -75.6580
sebeta: NaN
dfe: 58
The log-likelihood logl is almost identical to what it was with random effects
for all of the parameters, the Akaike information criterion aic is reduced
from -91.1767 to -93.1752, and the Bayesian information criterion bic is
reduced from -71.4698 to -75.6580. These measures support the decision to
drop the second random effect.
Refitting the simplified model with a full covariance matrix allows for
identification of correlations among the random effects. To do this, use the
CovPattern parameter to specify the pattern of nonzero elements in the
covariance matrix:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0, ...
    'REParamsSelect',[1 3 4], ...
    'CovPattern',ones(3))
The estimated covariance matrix PSI shows that the random effects on the
last two parameters have a relatively strong correlation, and both have a
relatively weak correlation with the first random effect. This structure in
the covariance matrix is more apparent if you convert PSI to a correlation
matrix using corrcov:
RHO = corrcov(PSI)
RHO =
1.0000 0.4707 0.1179
0.4707 1.0000 0.9316
0.1179 0.9316 1.0000
clf; imagesc(RHO)
set(gca,'XTick',[1 2 3],'YTick',[1 2 3])
title('{\bf Random Effect Correlation}')
h = colorbar;
set(get(h,'YLabel'),'String','Correlation');
Incorporate this structure into the model by changing the specification of the
covariance pattern to block-diagonal:
P = [1 0 0; 0 1 1; 0 1 1]; % Block-diagonal covariance pattern
[phi,PSI,stats,b] = nlmefit(time,concentration,subject, ...
    [],nlme_model,phi0, ...
    'REParamsSelect',[1 3 4], ...
    'CovPattern',P)
phi =
-1.1087
2.8056
0.8476
PSI =
0.0331 0 0
0 0.4793 0.1069
0 0.1069 0.0294
stats =
logl: 57.4996
mse: 0.0061
aic: -96.9992
bic: -77.2923
sebeta: NaN
dfe: 57
b =
-0.2438 0.0723 0.2014 0.0592 -0.2181 0.1289
-0.8500 -0.1237 0.9538 -0.7267 0.5895 0.1571
-0.1591 0.0033 0.1568 -0.2144 0.1834 0.0300
The output b gives predictions of the three random effects for each of the six
subjects. These are combined with the estimates of the fixed effects in phi
to produce the mixed-effects model.
Use the following commands to plot the mixed-effects model for each of the six
subjects. For comparison, the model without random effects is also shown. The
per-subject parameters PHI combine the fixed effects phi with the predicted
random effects b (there is no random effect on the second parameter):

clf
PHI = repmat(phi,1,6) + ...                  % Fixed effects
      [b(1,:); zeros(1,6); b(2,:); b(3,:)];  % Random effects
RES = zeros(11,6);
for I = 1:6
    fitted_model = @(t)(PHI(1,I)*exp(-exp(PHI(2,I))*t) + ...
                        PHI(3,I)*exp(-exp(PHI(4,I))*t));
    tI = time(subject == I);
cI = concentration(subject == I);
RES(:,I) = cI - fitted_model(tI);
subplot(2,3,I)
scatter(tI,cI,20,colors(I),'filled')
hold on
plot(tplot,fitted_model(tplot),'Color',colors(I))
plot(tplot,model(phi,tplot),'k')
axis([0 8 0 3.5])
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
legend(num2str(I),'Subject','Fixed')
end
If obvious outliers in the data (visible in previous box plots) are ignored, a
normal probability plot of the residuals shows reasonable agreement with
model assumptions on the errors:
clf; normplot(RES(:))
Regression Trees
Introduction
Parametric models specify the form of the relationship between predictors and
a response, as in the Hougen-Watson model described in “Parametric Models”
on page 9-59. In many cases, the form of the relationship is unknown, and
a parametric model requires assumptions and simplifications. Regression
trees offer a nonparametric alternative. When response data are categorical,
classification trees are a natural modification.
Load the data and use the classregtree constructor of the classregtree
class to create the regression tree:
load carsmall
t = classregtree([Weight, Cylinders],MPG,...
'cat',2,'splitmin',20,...
'names',{'W','C'})
t =
9 fit = 29.6111
10 fit = 23.25
11 if W<2827.5 then node 14 elseif W>=2827.5 then node 15 else 27.2143
12 if W<3533.5 then node 16 elseif W>=3533.5 then node 17 else 14.8696
13 fit = 11
14 fit = 27.6389
15 fit = 24.6667
16 fit = 16.6
17 fit = 14.3889
Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
regression
To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the car at the triangular branching nodes. A true
answer to any question follows the branch to the left; a false follows the
branch to the right.
Use the tree to predict the mileage for a 2000-pound car with either 4, 6, or
8 cylinders:

mileage2K = t([2000 4; 2000 6; 2000 8])
Note that the object allows for functional evaluation, of the form t(X). This is
a shorthand way of calling the eval method of the classregtree class.
The predicted responses computed above are all the same. This is because they
follow a series of splits in the tree that depend only on weight, terminating
at the left-most leaf node in the view above. A 4000-pound car, following the
right branch from the top of the tree, leads to different predicted responses:

mileage4K = t([4000 4; 4000 6; 4000 8])
You can use a variety of other methods of the classregtree class, such as
cutvar, cuttype, and cutcategories, to get more information about the split
at node 3 that distinguishes the 8-cylinder car:
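For instance, a sketch of these calls for node 3 (the displayed values depend
on the fitted tree):

v = cutvar(t,3)           % Variable determining the split at node 3
ct = cuttype(t,3)         % Type of split
cats = cutcategories(t,3) % Categories assigned to each child node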
Regression trees fit the original (training) data well, but may do a poor job of
predicting new values. Lower branches, especially, may be strongly affected
by outliers. A simpler tree often avoids over-fitting. To find the best regression
tree, employing the techniques of resubstitution and cross-validation, use the
test method of the classregtree class.
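For instance, a sketch of using test with cross-validation and prune to obtain
the best tree (assuming the carsmall tree t above):

[cost,secost,ntnodes,bestlevel] = test(t,'cross',[Weight, Cylinders],MPG);
tbest = prune(t,'level',bestlevel) % Best pruned tree by cross-validation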
Multivariate Methods
Introduction
Large, high-dimensional data sets are common in the modern era
of computer-based instrumentation and electronic data storage.
High-dimensional data present many challenges for statistical visualization,
analysis, and modeling.
Multidimensional Scaling
In this section...
“Introduction” on page 10-3
“Classical Multidimensional Scaling” on page 10-3
“Nonclassical Multidimensional Scaling” on page 10-8
“Nonmetric Multidimensional Scaling” on page 10-10
Introduction
One of the most important goals in visualizing data is to get a sense of how
near or far points are from each other. Often, you can do this with a scatter
plot. However, for some analyses, the data that you have might not be in
the form of points at all, but rather in the form of pairwise similarities or
dissimilarities between cases, observations, or subjects. There are no points
to plot.
Even if your data are in the form of points rather than pairwise distances,
a scatter plot of those data might not be useful. For some kinds of data,
the relevant way to measure how near two points are might not be their
Euclidean distance. While scatter plots of the raw data make it easy to
compare Euclidean distances, they are not always useful when comparing
other kinds of inter-point distances, city block distance for example, or even
more general dissimilarities. Also, with a large number of variables, it is very
difficult to visualize distances unless the data can be represented in a small
number of dimensions. Some sort of dimension reduction is usually necessary.
Classical Multidimensional Scaling
Introduction
The function cmdscale performs classical (metric) multidimensional scaling,
also known as principal coordinates analysis. cmdscale takes as an input a
matrix of inter-point distances and creates a configuration of points. Ideally,
those points are in two or three dimensions, and the Euclidean distances
between them reproduce the original distance matrix. Thus, a scatter plot
of the points created by cmdscale provides a visual representation of the
original distances.
As a very simple example, you can reconstruct a set of points from only their
inter-point distances. First, create some four-dimensional points with a small
component in their fourth coordinate, and reduce them to distances.
X = [ normrnd(0,1,10,3), normrnd(0,.1,10,1) ];
D = pdist(X,'euclidean');
[Y,eigvals] = cmdscale(D);
cmdscale produces two outputs. The first output, Y, is a matrix containing the
reconstructed points. The second output, eigvals, is a vector containing the
sorted eigenvalues of what is often referred to as the “scalar product matrix,”
which, in the simplest case, is equal to Y*Y'. The relative magnitudes of those
eigenvalues indicate the relative contribution of the corresponding columns of
Y in reproducing the original distance matrix D with the reconstructed points.
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
12.623 1
4.3699 0.34618
1.9307 0.15295
0.025884 0.0020505
1.7192e-015 1.3619e-016
6.8727e-016 5.4445e-017
4.4367e-017 3.5147e-018
-9.2731e-016 -7.3461e-017
-1.327e-015 -1.0513e-016
-1.9232e-015 -1.5236e-016
If eigvals contains only positive and zero (within round-off error) eigenvalues,
the columns of Y corresponding to the positive eigenvalues provide an exact
reconstruction of D, in the sense that their inter-point Euclidean distances,
computed using pdist, for example, are identical (within round-off) to the
values in D.
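A quick check of this property for the example above, where the first four
eigenvalues are positive:

maxerr4 = max(abs(D - pdist(Y(:,1:4)))) % Near zero, up to round-off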
If two or three of the eigenvalues in eigvals are much larger than the rest,
then the distance matrix based on the corresponding columns of Y nearly
reproduces the original distance matrix D. In this sense, those columns
form a lower-dimensional representation that adequately describes the
data. However it is not always possible to find a good low-dimensional
reconstruction.
% good reconstruction in 3D
maxerr3 = max(abs(D - pdist(Y(:,1:3))))
maxerr3 =
0.029728
% poor reconstruction in 2D
maxerr2 = max(abs(D - pdist(Y(:,1:2))))
maxerr2 =
0.91641
max(max(D))
ans =
3.4686
For a real example, take a matrix of inter-city distances, in miles, among
ten U.S. cities:

cities = ...
{'Atl','Chi','Den','Hou','LA','Mia','NYC','SF','Sea','WDC'};
D = [ 0 587 1212 701 1936 604 748 2139 2182 543;
587 0 920 940 1745 1188 713 1858 1737 597;
1212 920 0 879 831 1726 1631 949 1021 1494;
701 940 879 0 1374 968 1420 1645 1891 1220;
1936 1745 831 1374 0 2339 2451 347 959 2300;
604 1188 1726 968 2339 0 1092 2594 2734 923;
748 713 1631 1420 2451 1092 0 2571 2408 205;
2139 1858 949 1645 347 2594 2571 0 678 2442;
2182 1737 1021 1891 959 2734 2408 678 0 2329;
543 597 1494 1220 2300 923 205 2442 2329 0];
[Y,eigvals] = cmdscale(D);
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
9.5821e+006 1
1.6868e+006 0.17604
8157.3 0.0008513
1432.9 0.00014954
508.67 5.3085e-005
25.143 2.624e-006
5.3394e-010 5.5722e-017
-897.7 -9.3685e-005
-5467.6 -0.0005706
-35479 -0.0037026
Some eigenvalues are negative, indicating that these distances are not exactly
Euclidean. However, in this case, the two largest positive eigenvalues are much
larger in magnitude than the remaining eigenvalues. So, despite the negative
eigenvalues, the first two coordinates of Y are sufficient for a reasonable
reproduction of D.
Dtriu = D(find(tril(ones(10),-1)))';
maxrelerr = max(abs(Dtriu-pdist(Y(:,1:2))))./max(Dtriu)
maxrelerr =
0.0075371
plot(Y(:,1),Y(:,2),'.')
text(Y(:,1)+25,Y(:,2),cities)
xlabel('Miles')
ylabel('Miles')
Nonclassical Multidimensional Scaling
The function mdscale performs nonclassical multidimensional scaling. As
with cmdscale, it takes dissimilarities as input. This example uses the
cereal data:

load cereal.mat
X = [Calories Protein Fat Sodium Fiber ...
Carbo Sugars Shelf Potass Vitamins];
dissimilarities = pdist(zscore(X),'cityblock');
size(dissimilarities)
ans =
1 231
This example code first standardizes the cereal data, and then uses city block
distance as a dissimilarity. The choice of transformation to dissimilarities is
application-dependent, and the choice here is only for simplicity. In some
applications, the original data are already in the form of dissimilarities.
Next, use mdscale to perform metric MDS. Unlike cmdscale, you must
specify the desired number of dimensions, and the method to use to construct
the output configuration. For this example, use two dimensions. The metric
STRESS criterion is a common method for computing the output; for other
choices, see the mdscale reference page in the online documentation. The
second output from mdscale is the value of that criterion evaluated for the
output configuration. It measures how well the inter-point distances of
the output configuration approximate the original input dissimilarities:
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','metricstress');
stress
stress =
0.1856
plot(Y(:,1),Y(:,2),'o','LineWidth',2);
gname(Name(strmatch('G',Mfg)))
Nonmetric Multidimensional Scaling
You use mdscale to perform nonmetric MDS in much the same way as for
metric scaling. The nonmetric STRESS criterion is a common method for
computing the output; for more choices, see the mdscale reference page in
the online documentation. As with metric scaling, the second output from
mdscale is the value of that criterion evaluated for the output configuration.
For nonmetric scaling, however, it measures how well the inter-point
distances of the output configuration approximate the disparities. The
disparities are returned in the third output. They are the transformed values
of the original dissimilarities:
[Y,stress,disparities] = ...
mdscale(dissimilarities,2,'criterion','stress');
stress
stress =
0.1562
distances = pdist(Y);
[dum,ord] = sortrows([disparities(:) dissimilarities(:)]);
plot(dissimilarities,distances,'bo', ...
dissimilarities(ord),disparities(ord),'r.-', ...
[0 25],[0 25],'k-')
xlabel('Dissimilarities')
ylabel('Distances/Disparities')
legend({'Distances' 'Disparities' '1:1 Line'},...
'Location','NorthWest');
This plot shows that mdscale has found a configuration of points in two
dimensions whose inter-point distances approximate the disparities, which
in turn are a nonlinear transformation of the original dissimilarities. The
concave shape of the disparities as a function of the dissimilarities indicates
that the fit tends to contract small distances relative to the corresponding
dissimilarities. This may be perfectly acceptable in practice.
To check whether the solution is a local minimum, repeat the scaling from
several random starting configurations:

opts = statset('Display','final');
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','stress',...
'start','random','replicates',5,'Options',opts);
90 iterations, Final stress criterion = 0.156209
100 iterations, Final stress criterion = 0.195546
116 iterations, Final stress criterion = 0.156209
85 iterations, Final stress criterion = 0.156209
106 iterations, Final stress criterion = 0.17121
Notice that mdscale finds several different local solutions, some of which
do not have as low a stress value as the solution found with the cmdscale
starting point.
Procrustes Analysis
In this section...
“Comparing Landmark Data” on page 10-14
“Data Input” on page 10-14
“Preprocessing Data for Accurate Results” on page 10-15
“Example: Comparing Handwritten Shapes” on page 10-16
Data Input
The procrustes function takes two matrices as input:
• X — the target shape, with rows specifying coordinates of landmark points
• Y — the comparison shape to be transformed, with rows specifying the
corresponding landmarks

procrustes transforms Y to a new shape Z of the form
$$Z = bYT + c \tag{10-1}$$

where b is a scale component, T is an orthogonal rotation and reflection
component, and c is a translation component. procrustes chooses the
transformation that minimizes the sum of squared differences between the
landmark points of X and Z:

$$\sum_{i=1}^{n}\sum_{j=1}^{p}\left(X_{ij} - Z_{ij}\right)^2$$
Example: Comparing Handwritten Shapes
This example compares two handwritten figures recorded as 10-by-2 matrices
A and B of landmark coordinates. Create X and Y from A and B, moving B to the
side to make each shape more visible:
X = A;
Y = B + repmat([25 0], 10,1);
Plot the shapes, using letters to designate the landmark points. Lines in the
figure join the points to indicate the drawing path of each shape.
Use procrustes to compare the shapes, returning the dissimilarity measure d,
the transformed shape Z, and a structure tr with the transformation details:

[d,Z,tr] = procrustes(X,Y);
d
d =
0.1502
The small value of d in this case shows that the two shapes are similar.
The dissimilarity d is the sum of squared errors between X and Z, normalized
by a measure of the scale of X, as you can verify:

numerator = sum(sum((X-Z).^2))
numerator =
166.5321
denominator = sum(sum(bsxfun(@minus,X,mean(X)).^2))
denominator =
1.1085e+003
ratio = numerator/denominator
ratio =
0.1502
The scale component of the transformation is returned in tr.b:

tr.b
ans =
0.9291
The sizes of the target and comparison shapes appear similar. This visual
impression is reinforced by the value of b = 0.93, which implies that the best
transformation results in shrinking the comparison shape by a factor .93
(only 7%).
To compare the shapes without scaling, set 'Scaling' to false:

ds = procrustes(X,Y,'Scaling',false)
ds =
0.1552

The determinant of the rotation matrix tr.T is 1, showing that the fitted
transformation does not include a reflection:

det(tr.T)
ans =
1.0000
Forcing the transformation to include a reflection gives a worse fit:

[dr,Zr,trr] = procrustes(X,Y,'Reflection',true);
dr
dr =
0.8130
The larger dissimilarity shows two effects of the forced reflection:
• The landmark data points are now further away from their target
counterparts.
• The transformed three is now an undesirable mirror image of the target
three.
It appears that the shapes might be better matched if you flipped the
transformed shape upside down. Flipping the shapes would make the
transformation even worse, however, because the landmark data points
would be further away from their target counterparts. From this example,
it is clear that manually adjusting the scaling and reflection parameters is
generally not optimal.
Feature Selection
In this section...
“Introduction” on page 10-23
“Sequential Feature Selection” on page 10-23
Introduction
Feature selection reduces the dimensionality of data by selecting only a subset
of measured features (predictor variables) to create a model. Selection criteria
usually involve the minimization of a specific measure of predictive error for
models fit to different subsets. Algorithms search for a subset of predictors
that optimally model measured responses, subject to constraints such as
required or excluded features and the size of the subset.
Sequential Feature Selection
Introduction
A common method of feature selection is sequential feature selection. This
method has two components:
• An objective function, called the criterion, which the method seeks to
minimize over all feasible feature subsets
• A sequential search algorithm, which adds or removes features from a
candidate subset while evaluating the criterion
For example, generate random data in which a binomial response depends
on only some of ten candidate features:

n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';
p = 1./(1+exp(-Xb));
N = 50;
y = binornd(N,p);
Y = [y N*ones(size(y))];
[b0,dev0,stats0] = glmfit(X,Y,'binomial');
This is the full model, using all of the features (and an initial constant term).
Sequential feature selection searches for a subset of the features in the full
model with comparative predictive power.
First, you must specify a criterion for selecting the features. The following
function, which calls glmfit and returns the deviance of the fit (a
generalization of the residual sum of squares), is a useful criterion in this case:

function dev = critfun(X,Y)
[b,dev] = glmfit(X,Y,'binomial');
Perform forward sequential feature selection, stopping when no candidate
feature decreases the deviance by more than a chi-square threshold:

maxdev = chi2inv(.95,1);
opt = statset('display','iter',...
'TolFun',maxdev,...
'TolTypeFun','abs');
inmodel = sequentialfs(@critfun,X,Y,...
'cv','none',...
'nullmodel',true,...
'options',opt,...
'direction','forward');
The iterative display shows a decrease in the criterion value as each new
feature is added to the model. The final result is a reduced model with only
four of the original ten features: columns 1, 4, 5, and 10 of X. These features
are indicated in the logical vector inmodel returned by sequentialfs.
The deviance of the reduced model is higher than for the full model, but
the addition of any other single feature would not decrease the criterion
by more than the absolute tolerance, maxdev, set in the options structure.
Adding a feature with no effect reduces the deviance by an amount that has
a chi-square distribution with one degree of freedom. Adding a significant
feature results in a larger change. By setting maxdev to chi2inv(.95,1), you
instruct sequentialfs to continue adding features so long as the change in
deviance is more than would be expected by random chance.
To fit the reduced model, restrict glmfit to the selected features:

[b,dev,stats] = glmfit(X(:,inmodel),Y,'binomial');
Feature Transformation
In this section...
“Introduction” on page 10-28
“Nonnegative Matrix Factorization” on page 10-28
“Principal Component Analysis” on page 10-31
“Factor Analysis” on page 10-45
Introduction
Feature transformation is a group of methods that create new features
(predictor variables). The methods are useful for dimension reduction when
the transformed features have a descriptive power that is more easily ordered
than the original features. In this case, less descriptive features can be
dropped from consideration when building models.
Introduction
Nonnegative matrix factorization (NMF) is a dimension-reduction technique
based on a low-rank approximation of the feature space. Besides providing
a reduction in the number of features, NMF guarantees that the features
are nonnegative, producing additive models that respect, for example, the
nonnegativity of physical quantities.
Example: Nonnegative Matrix Factorization
For example, compute a rank-two approximation of the five predictors in
moore.mat, using several replicates of the multiplicative algorithm to search
for a good factorization:

load moore
X = moore(:,1:5);
opt = statset('MaxIter',10,'Display','final');
[W0,H0] = nnmf(X,2,'replicates',5,...
'options',opt,...
'algorithm','mult');
rep iteration rms resid |delta x|
1 10 358.296 0.00190554
2 10 78.3556 0.000351747
3 10 230.962 0.0172839
4 10 326.347 0.00739552
5 10 361.547 0.00705539
Final root mean square residual = 78.3556
Continue from the best of these fits with the alternating least-squares
algorithm:

opt = statset('Maxiter',1000,'Display','final');
[W,H] = nnmf(X,2,'w0',W0,'h0',H0,...
'options',opt,...
'algorithm','als');
rep iteration rms resid |delta x|
1 3 77.5315 3.52673e-005
Final root mean square residual = 77.5315
The two columns of W are the transformed predictors. The two rows of H give
the relative contributions of each of the five predictors in X to the predictors
in W:
H
H =
0.0835 0.0190 0.1782 0.0072 0.9802
0.0558 0.0250 0.9969 0.0085 0.0497
The fifth predictor in X (weight 0.9802) strongly influences the first predictor
in W. The third predictor in X (weight 0.9969) strongly influences the second
predictor in W.
Visualize the relative contributions with a biplot of H:

biplot(H','scores',W,'varlabels',{'','','v3','','v5'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')
Principal Component Analysis
Introduction
One of the difficulties inherent in multivariate statistics is the problem of
visualizing data that has many variables. The MATLAB function plot
displays a graph of the relationship between two variables. The plot3
and surf commands display different three-dimensional views. But when
there are more than three variables, it is more difficult to visualize their
relationships.
The first principal component is a single axis in space. When you project
each observation on that axis, the resulting values form a new variable, and
the variance of this variable is the maximum among all possible choices of
the first axis.

The full set of principal components is as large as the original set of variables.
But it is commonplace for the sum of the variances of the first few principal
components to exceed 80% of the total variance of the original data. By
examining plots of these few new variables, researchers often develop a
deeper understanding of the driving forces that generated the original data.
You can use the function princomp to find the principal components. To use
princomp, you need to have the actual measured data you want to analyze.
However, if you lack the actual data, but have the sample covariance or
correlation matrix for the data, you can still use the function pcacov to
perform a principal components analysis. See the reference page for pcacov
for a description of its inputs and outputs.
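For instance, a minimal sketch of pcacov on a sample covariance matrix
(hypothetical data):

C = cov(randn(100,4));               % A sample covariance matrix
[coeff,latent,explained] = pcacov(C) % Coefficients, variances, percent explained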
Example: Principal Component Analysis
For example, consider the cities data set, with ratings of 329 U.S.
metropolitan areas in nine categories:

load cities
whos
Name Size Bytes Class
categories 9x14 252 char array
names 329x43 28294 char array
ratings 329x9 23688 double array
The whos command generates a table of information about all the variables
in the workspace.
categories
categories =
climate
housing
health
crime
transportation
education
arts
recreation
economics
first5 = names(1:5,:)
first5 =
Abilene, TX
Akron, OH
Albany, GA
Albany-Troy, NY
Albuquerque, NM
boxplot(ratings,'orientation','horizontal','labels',categories)
This command generates the plot below. Note that there is substantially
more variability in the ratings of the arts and housing than in the ratings
of crime and climate.
Ordinarily you might also graph pairs of the original variables, but there are
36 two-variable plots. Perhaps principal components analysis can reduce the
number of variables you need to consider.
Sometimes it makes sense to compute principal components for raw data. This
is appropriate when all the variables are in the same units. Standardizing the
data is often preferable when the variables are in different units or when the
variance of the different columns is substantial (as in this case).

You can standardize the data by dividing each column by its standard
deviation.
stdr = std(ratings);
sr = ratings./repmat(stdr,329,1);
Now find the principal components of the standardized data. The first output
contains the component coefficients; view the first three components:

[coefs,scores,variances,t2] = princomp(sr);
c3 = coefs(:,1:3)
c3 =
0.2064 0.2178 -0.6900
0.3565 0.2506 -0.2082
0.4602 -0.2995 -0.0073
0.2813 0.3553 0.1851
0.3512 -0.1796 0.1464
0.2753 -0.4834 0.2297
0.4631 -0.1948 -0.0265
0.3279 0.3845 -0.0509
0.1354 0.4713 0.6073
The largest coefficients in the first column (first principal component) are
the third and seventh elements, corresponding to the variables health and
arts. All the coefficients of the first principal component have the same sign,
making it a weighted average of all the original variables.
The principal component coefficient vectors are orthonormal, as you can
verify:

I = c3'*c3
I =
1.0000 -0.0000 -0.0000
-0.0000 1.0000 -0.0000
-0.0000 -0.0000 1.0000
A plot of the first two columns of scores shows the ratings data projected
onto the first two principal components. princomp computes the scores to
have mean zero.
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
The function gname is useful for graphically identifying a few points in a plot
like this. You can call gname with a string matrix containing as many case
10-37
10 Multivariate Methods
labels as points in the plot. The string matrix names works for labeling points
with the city names.
gname(names)
Move your cursor over the plot and click once near each point in the right
half. As you click each point, it is labeled with the proper row from the names
string matrix. Here is the plot after a few clicks:
When you are finished labeling points, press the Return key.
The labeled cities are some of the biggest population centers in the United
States. They are definitely different from the remainder of the data, so
perhaps they should be considered separately. To remove the labeled cities
from the data, first identify their corresponding row numbers as follows:
1 Close the plot window.
2 Redraw the plot, by entering

plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component');
ylabel('2nd Principal Component');

3 Enter gname without any arguments.
4 Click near the points you labeled in the preceding figure. This labels the
points by their row numbers, as shown in the following figure.
Then you can create an index variable containing the row numbers of all
the metropolitan areas you choose.
To remove these rows from the ratings matrix, enter the following.
rsubset = ratings;
nsubset = names;
nsubset(metro,:) = [];
rsubset(metro,:) = [];
size(rsubset)
ans =
322 9
Component Variances. The third output, variances, is a vector containing
the variance explained by the corresponding principal component:

variances
variances =
3.4083
1.2140
1.1415
0.9209
0.7533
0.6306
0.4930
0.3180
0.1204
You can easily calculate the percent of the total variability explained by each
principal component.
percent_explained = 100*variances/sum(variances)
percent_explained =
37.8699
13.4886
12.6831
10.2324
8.3698
7.0062
5.4783
3.5338
1.3378
Use the pareto function to make a scree plot of the percent variability
explained by each principal component.
pareto(percent_explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
The preceding figure shows that the only clear break in the amount of
variance accounted for by each component is between the first and second
components. However, that component by itself explains less than 40% of the
variance, so more components are probably needed. You can see that the first
three principal components explain roughly two-thirds of the total variability
in the standardized ratings, so that might be a reasonable way to reduce the
dimensions in order to visualize the data.
Hotelling’s T2. The last output of the princomp function, t2, is Hotelling’s T2,
a statistical measure of the multivariate distance of each observation from
the center of the data set. This is an analytical way to find the most extreme
points in the data.
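A sketch of locating the most extreme observation by sorting t2 (variable
names here are illustrative):

[st2,index] = sort(t2,'descend'); % Sort T-squared in descending order
extreme = index(1);               % Row of the most extreme observation
names(extreme,:)                  % Look up the corresponding city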
It is not surprising that the ratings for New York are the furthest from the
average U.S. town.
Visualizing the Results. Use the biplot function to help visualize both
the principal component coefficients for each variable and the principal
component scores for each observation in a single plot. For example, the
following command plots the results from the principal components analysis
on the cities and labels each of the variables.
biplot(coefs(:,1:2), 'scores',scores(:,1:2),...
'varlabels',categories);
axis([-.26 1 -.51 .51]);
For the first principal component, the variables health and arts are the most representative, consistent with their large coefficients noted earlier.
Each of the nine variables is represented in this plot by a vector, and the
direction and length of the vector indicates how each variable contributes to
the two principal components in the plot. For example, you have seen that the
first principal component, represented in this biplot by the horizontal axis,
has positive coefficients for all nine variables. That corresponds to the nine
vectors directed into the right half of the plot. You have also seen that the
second principal component, represented by the vertical axis, has positive
coefficients for the variables education, health, arts, and transportation, and
negative coefficients for the remaining five variables. That corresponds to
vectors directed into the top and bottom halves of the plot, respectively. This
indicates that this component distinguishes between cities that have high
values for the first set of variables and low for the second, and cities that
have the opposite.
The variable labels in this figure are somewhat crowded. You could either
leave out the VarLabels parameter when making the plot, or simply select
and drag some of the labels to better positions using the Edit Plot tool from
the figure window toolbar.
You can use the Data Cursor, in the Tools menu in the figure window, to
identify the items in this plot. By clicking on a variable (vector), you can read
off that variable’s coefficients for each principal component. By clicking on
an observation (point), you can read off that observation’s scores for each
principal component.
You can also make a biplot in three dimensions. This can be useful if the first
two principal coordinates do not explain enough of the variance in your data.
Selecting Rotate 3D in the Tools menu enables you to rotate the figure to
see it from different angles.
biplot(coefs(:,1:3), 'scores',scores(:,1:3),...
'obslabels',names);
axis([-.26 1 -.51 .51 -.61 .81]);
view([30 40]);
Factor Analysis
• “Introduction” on page 10-45
• “Example: Factor Analysis” on page 10-46
Introduction
Multivariate data often includes a large number of measured variables, and
sometimes those variables overlap, in the sense that groups of them might be
dependent. For example, in a decathlon, each athlete competes in 10 events,
but several of them can be thought of as speed events, while others can be
thought of as strength events, etc. Thus, you can think of a competitor’s 10
event scores as largely dependent on a smaller set of three or four types of
athletic ability.
Factor analysis is a way to fit a model to multivariate data to estimate just this
sort of interdependence. In a factor analysis model, the measured variables
depend on a smaller number of unobserved (latent) factors. Because each
factor might affect several variables in common, they are known as common
factors. Each variable is assumed to be dependent on a linear combination
of the common factors, and the coefficients are known as loadings. Each
measured variable also includes a component due to independent random
variability, known as specific variance because it is specific to one variable.
Specifically, factor analysis assumes that the covariance matrix of your data
is of the form
$\Sigma_x = \Lambda \Lambda^T + \Psi$
where Λ is the matrix of loadings, and the elements of the diagonal matrix
Ψ are the specific variances. The function factoran fits the factor analysis
model using maximum likelihood.
Factor Loadings. Over the course of 100 weeks, the percent change in stock
prices for ten companies has been recorded. Of the ten companies, the first
four can be classified as primarily technology, the next three as financial, and
the last three as retail. It seems reasonable that the stock prices for companies
that are in the same sector might vary together as economic conditions
change. Factor analysis can provide quantitative evidence that companies
within each sector do experience similar week-to-week changes in stock price.
In this example, you first load the data, and then call factoran, specifying a
model fit with three common factors. By default, factoran computes rotated
estimates of the loadings to try and make their interpretation simpler. But in
this example, you specify an unrotated solution.
load stockreturns
[Loadings,specificVar,T,stats] = ...
factoran(stocks,3,'rotate','none');
The first two factoran return arguments are the estimated loadings and the
estimated specific variances. Each row of the loadings matrix represents one
of the ten stocks, and each column corresponds to a common factor. With
unrotated estimates, interpretation of the factors in this fit is difficult because
most of the stocks contain fairly large coefficients for two or more factors.
Loadings
Loadings =
0.8885 0.2367 -0.2354
0.7126 0.3862 0.0034
0.3351 0.2784 -0.0211
0.3088 0.1113 -0.1905
0.6277 -0.6643 0.1478
0.4726 -0.6383 0.0133
0.1133 -0.5416 0.0322
0.6403 0.1669 0.4960
0.2363 0.5293 0.5770
0.1105 0.1680 0.5524
Note “Factor Rotation” on page 10-48 helps to simplify the structure in the
Loadings matrix, to make it easier to assign meaningful interpretations to
the factors.
From the estimated specific variances, you can see that the model indicates
that a particular stock price varies quite a lot beyond the variation due to
the common factors.
specificVar
specificVar =
0.0991
0.3431
0.8097
0.8559
0.1429
0.3691
0.6928
0.3162
0.3311
0.6544
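As a rough check of the fit, you can compare the model-implied correlation matrix with the sample correlation matrix; a minimal sketch using the outputs above (factoran works with standardized data, and this check is not part of the original example):

% Model-implied correlation matrix: Lambda*Lambda' + Psi
SigmaHat = Loadings*Loadings' + diag(specificVar);
% Largest absolute discrepancy from the sample correlation matrix
max(max(abs(SigmaHat - corr(stocks))))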
The p value returned in the stats structure fails to reject the null hypothesis
of three common factors, suggesting that this model provides a satisfactory
explanation of the covariation in these data.
stats.p
ans =
0.8144
To determine whether fewer than three factors can provide an acceptable fit,
you can try a model with two common factors. The p value for this second fit
is highly significant, and rejects the hypothesis of two factors, indicating that
the simpler model is not sufficient to explain the pattern in these data.
[Loadings2,specificVar2,T2,stats2] = ...
factoran(stocks, 2,'rotate','none');
stats2.p
ans =
3.5610e-006
Factor Rotation

As the preceding results illustrate, the estimated loadings from an unrotated factor analysis fit can have a complicated structure. The goal of factor rotation is to find a parameterization in which each variable has only a small number of large loadings. Rotation amounts to computing the
loadings in the rotated coordinate system. There are various ways to do this.
Some methods leave the axes orthogonal, while others are oblique methods
that change the angles between them. For this example, you can rotate the
estimated loadings by using the promax criterion, a common oblique method.
[LoadingsPM,specVarPM] = factoran(stocks,3,'rotate','promax');
LoadingsPM
LoadingsPM =
0.9452 0.1214 -0.0617
0.7064 -0.0178 0.2058
0.3885 -0.0994 0.0975
0.4162 -0.0148 -0.1298
0.1021 0.9019 0.0768
0.0873 0.7709 -0.0821
-0.1616 0.5320 -0.0888
0.2169 0.2844 0.6635
0.0016 -0.1881 0.7849
-0.2289 0.0636 0.6475
biplot(LoadingsPM,'varlabels',num2str((1:10)'));
axis square
view(155,27);
This plot shows that promax has rotated the factor loadings to a simpler
structure. Each stock depends primarily on only one factor, and it is possible
to describe each factor in terms of the stocks that it affects. Based on which
companies are near which axes, you could reasonably conclude that the first
factor axis represents the financial sector, the second retail, and the third
technology. The original conjecture, that stocks vary primarily within sector,
is apparently supported by the data.
Because the data in this example are the raw stock price changes, and not
just their correlation matrix, you can have factoran return estimates of the
value of each of the three rotated common factors for each week. You can
then plot the estimated scores to see how the different stock sectors were
affected during each week.
[LoadingsPM,specVarPM,TPM,stats,F] = ...
factoran(stocks, 3,'rotate','promax');
plot3(F(:,1),F(:,2),F(:,3),'b.')
line([-4 4 NaN 0 0 NaN 0 0], [0 0 NaN -4 4 NaN 0 0],...
[0 0 NaN 0 0 NaN -4 4], 'Color','black')
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
grid on
axis square
view(-22.5, 8)
Oblique rotation often creates factors that are correlated. This plot shows
some evidence of correlation between the first and third factors, and you can
investigate further by computing the estimated factor correlation matrix.
inv(TPM'*TPM)
ans =
1.0000 0.1559 0.4082
0.1559 1.0000 -0.0559
0.4082 -0.0559 1.0000
Visualizing the Results. You can use the biplot function to help visualize
both the factor loadings for each variable and the factor scores for each
observation in a single plot. For example, the following command plots the
results from the factor analysis on the stock data and labels each of the 10
stocks.
biplot(LoadingsPM,'scores',F,'varlabels',num2str((1:10)'))
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
axis square
view(155,27)
In this case, the factor analysis includes three factors, and so the biplot is
three-dimensional. Each of the 10 stocks is represented in this plot by a vector,
and the direction and length of the vector indicates how each stock depends
on the underlying factors. For example, you have seen that after promax
rotation, the first four stocks have positive loadings on the first factor, and
unimportant loadings on the other two factors. That first factor, interpreted
as a financial sector effect, is represented in this biplot as one of the horizontal
axes. The dependence of those four stocks on that factor corresponds to the
four vectors directed approximately along that axis. Similarly, the dependence
of stocks 5, 6, and 7 primarily on the second factor, interpreted as a retail
sector effect, is represented by vectors directed approximately along that axis.
Each of the 100 observations is represented in this plot by a point, and their
locations indicate the score of each observation for the three factors. For
example, points near the top of this plot have the highest scores for the
technology sector factor. The points are scaled to fit within the unit square, so
only their relative locations can be determined from the plot.
You can use the Data Cursor tool from the Tools menu in the figure window
to identify the items in this plot. By clicking a stock (vector), you can read off
that stock’s loadings for each factor. By clicking an observation (point), you
can read off that observation’s scores for each factor.
11
Cluster Analysis
Introduction
Cluster analysis, also called segmentation analysis or taxonomy analysis,
creates groups, or clusters, of data. Clusters are formed in such a way that
objects in the same cluster are very similar and objects in different clusters
are very distinct. Measures of similarity depend on the application.
Hierarchical Clustering
In this section...
“Introduction” on page 11-3
“Algorithm Description” on page 11-3
“Similarity Measures” on page 11-4
“Linkages” on page 11-6
“Dendrograms” on page 11-8
“Verifying the Cluster Tree” on page 11-10
“Creating Clusters” on page 11-16
Introduction
Hierarchical clustering groups data over a variety of scales by creating a
cluster tree or dendrogram. The tree is not a single set of clusters, but rather
a multilevel hierarchy, where clusters at one level are joined as clusters at
the next level. This allows you to decide the level or scale of clustering that
is most appropriate for your application. The Statistics Toolbox function
clusterdata supports agglomerative clustering and performs all of the
necessary steps for you. It incorporates the pdist, linkage, and cluster
functions, which you can use separately for more detailed analysis. The
dendrogram function plots the cluster tree.
Algorithm Description
To perform agglomerative hierarchical cluster analysis on a data set using
Statistics Toolbox functions, follow this procedure:
1 Find the similarity or dissimilarity between every pair of objects in the data set. In this step, you calculate the distance between objects using the pdist function.

2 Group the objects into a binary, hierarchical cluster tree. In this step, you link pairs of objects that are in close proximity using the linkage function.

3 Determine where to cut the hierarchical tree into clusters. In this step, you use the cluster function to prune branches off the bottom of the hierarchical tree.
The following sections provide more information about each of these steps.
Similarity Measures
You use the pdist function to calculate the distance between every pair of
objects in a data set. For a data set made up of m objects, there are m*(m –
1)/2 pairs in the data set. The result of this computation is commonly known
as a distance or dissimilarity matrix.
There are many ways to calculate this distance information. By default, the
pdist function calculates the Euclidean distance between objects; however,
you can specify one of several other options. See pdist for more information.
Note You can optionally normalize the values in the data set before
calculating the distance information. In a real world data set, variables can
be measured against different scales. For example, one variable can measure
Intelligence Quotient (IQ) test scores and another variable can measure head
circumference. These discrepancies can distort the proximity calculations.
Using the zscore function, you can convert all the values in the data set to
use the same proportional scale. See zscore for more information.
For example, consider a data set, X, made up of five objects where each object
is a set of x,y coordinates.
• Object 1: 1, 2
• Object 2: 2.5, 4.5
• Object 3: 2, 2
• Object 4: 4, 1.5
• Object 5: 4, 2.5
and pass it to pdist. The pdist function calculates the distance between
object 1 and object 2, object 1 and object 3, and so on until the distances
between all the pairs have been calculated. The following figure plots these
objects in a graph. The Euclidean distance between object 2 and object 3 is
shown to illustrate one interpretation of distance.
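To follow along in MATLAB, you can enter these coordinates as the rows of a matrix; a minimal sketch matching the objects listed above:

% Each row holds one object's x,y coordinates
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];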
Distance Information
The pdist function returns this distance information in a vector, Y, where
each element contains the distance between a pair of objects.
Y = pdist(X)
Y =
Columns 1 through 5
2.9155 1.0000 3.0414 3.0414 2.5495
Columns 6 through 10
3.3541 2.5000 2.0616 2.0616 1.0000
squareform(Y)
ans =
0 2.9155 1.0000 3.0414 3.0414
2.9155 0 2.5495 3.3541 2.5000
1.0000 2.5495 0 2.0616 2.0616
3.0414 3.3541 2.0616 0 1.0000
3.0414 2.5000 2.0616 1.0000 0
Linkages
Once the proximity between objects in the data set has been computed, you
can determine how objects in the data set should be grouped into clusters,
using the linkage function. The linkage function takes the distance
information generated by pdist and links pairs of objects that are close
together into binary clusters (clusters made up of two objects). The linkage
function then links these newly formed clusters to each other and to other
objects to create bigger clusters until all the objects in the original data set
are linked together in a hierarchical tree.
For example, given the distance vector Y generated by pdist from the sample
data set of x- and y-coordinates, the linkage function generates a hierarchical
cluster tree, returning the linkage information in a matrix, Z.
Z = linkage(Y)
Z =
    4.0000    5.0000    1.0000
    1.0000    3.0000    1.0000
    6.0000    7.0000    2.0616
    8.0000    2.0000    2.5000
In this output, each row identifies a link between objects or clusters. The first
two columns identify the objects that have been linked. The third column
contains the distance between these objects. For the sample data set of x-
and y-coordinates, the linkage function begins by grouping objects 4 and 5,
which have the closest proximity (distance value = 1.0000). The linkage
function continues by grouping objects 1 and 3, which also have a distance
value of 1.0000.
The third row indicates that the linkage function grouped objects 6 and 7. If
the original sample data set contained only five objects, what are objects 6
and 7? Object 6 is the newly formed binary cluster created by the grouping
of objects 4 and 5. When the linkage function groups two objects into a
new cluster, it must assign the cluster a unique index value, starting with
the value m+1, where m is the number of objects in the original data set.
(Values 1 through m are already used by the original data set.) Similarly,
object 7 is the cluster formed by grouping objects 1 and 3.
As the final cluster, the linkage function grouped object 8, the newly formed
cluster made up of objects 6 and 7, with object 2 from the original data set.
The following figure graphically illustrates the way linkage groups the
objects into a hierarchy of clusters.
Dendrograms
The hierarchical, binary cluster tree created by the linkage function is most
easily understood when viewed graphically. The Statistics Toolbox function
dendrogram plots the tree, as follows:
dendrogram(Z)
In the figure, the numbers along the horizontal axis represent the indices of
the objects in the original data set. The links between objects are represented
as upside-down U-shaped lines. The height of the U indicates the distance
between the objects. For example, the link representing the cluster containing
objects 1 and 3 has a height of 1. The link representing the cluster that groups
object 2 together with objects 1, 3, 4, and 5, (which are already clustered as
object 8) has a height of 2.5. The height represents the distance linkage
computes between objects 2 and 8. For more information about creating a
dendrogram diagram, see the dendrogram reference page.
Verifying the Cluster Tree

Verifying Dissimilarity
In a hierarchical cluster tree, any two objects in the original data set are
eventually linked together at some level. The height of the link represents
the distance between the two clusters that contain those two objects. This
height is known as the cophenetic distance between the two objects. One
way to measure how well the cluster tree generated by the linkage function
reflects your data is to compare the cophenetic distances with the original
distance data generated by the pdist function. If the clustering is valid, the
linking of objects in the cluster tree should have a strong correlation with
the distances between objects in the distance vector. The cophenet function
compares these two sets of values and computes their correlation, returning a
value called the cophenetic correlation coefficient. The closer the value of the
cophenetic correlation coefficient is to 1, the more accurately the clustering
solution reflects your data.
You can use the cophenetic correlation coefficient to compare the results of
clustering the same data set using different distance calculation methods or
clustering algorithms. For example, you can use the cophenet function to
evaluate the clusters created for the sample data set
c = cophenet(Z,Y)
c =
0.8615
where Z is the matrix output by the linkage function and Y is the distance
vector output by the pdist function.
Execute pdist again on the same data set, this time specifying the city block
metric. After running the linkage function on this new pdist output using
the average linkage method, call cophenet to evaluate the clustering solution.
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
c = cophenet(Z,Y)
c =
0.9047
Verifying Consistency
One way to determine the natural cluster divisions in a data set is to compare
the height of each link in a cluster tree with the heights of neighboring links
below it in the tree.
A link that is approximately the same height as the links below it indicates
that there are no distinct divisions between the objects joined at this level of
the hierarchy. These links are said to exhibit a high level of consistency,
because the distance between the objects being joined is approximately the
same as the distances between the objects they contain.
On the other hand, a link whose height differs noticeably from the height of
the links below it indicates that the objects joined at this level in the cluster
tree are much farther apart from each other than their components were when
they were joined. This link is said to be inconsistent with the links below it.
The following dendrogram illustrates inconsistent links. Note how the objects
in the dendrogram fall into two groups that are connected by links at a much
higher level in the tree. These links are inconsistent when compared with the
links below them in the hierarchy.
By default, the inconsistent function compares each link in the cluster hierarchy with adjacent links that are
are less than two levels below it in the cluster hierarchy. This is called the
depth of the comparison. You can also specify other depths. The objects at
the bottom of the cluster tree, called leaf nodes, that have no further objects
below them, have an inconsistency coefficient of zero. Clusters that join two
leaves also have a zero inconsistency coefficient.
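To use a different depth, pass it as a second input argument; a minimal sketch (the depth of 3 here is illustrative):

% Compare each link with links up to three levels below it
I3 = inconsistent(Z,3);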
For example, you can use the inconsistent function to calculate the
inconsistency values for the links created by the linkage function in
“Linkages” on page 11-6.
I = inconsistent(Z)
I =
1.0000 0 1.0000 0
1.0000 0 1.0000 0
1.3539 0.6129 3.0000 1.1547
2.2808 0.3100 2.0000 0.7071
Column Description
1 Mean of the heights of all the links included in the calculation
2 Standard deviation of all the links included in the calculation
3 Number of links included in the calculation
4 Inconsistency coefficient
In the sample output, the first row represents the link between objects 4
and 5. This cluster is assigned the index 6 by the linkage function. Because
both 4 and 5 are leaf nodes, the inconsistency coefficient for the cluster is zero.
The second row represents the link between objects 1 and 3, both of which are
also leaf nodes. This cluster is assigned the index 7 by the linkage function.
The third row evaluates the link that connects these two clusters, objects 6
and 7. (This new cluster is assigned index 8 in the linkage output). Column 3
indicates that three links are considered in the calculation: the link itself and
the two links directly below it in the hierarchy. Column 1 represents the mean
of the heights of these links. The inconsistent function uses the height
information output by the linkage function to calculate these values.
The following figure illustrates the links and heights included in this
calculation.
Note In the preceding figure, the lower limit on the y-axis is set to 0 to show
the heights of the links. To set the lower limit to 0, select Axes Properties
from the Edit menu, click the Y Axis tab, and enter 0 in the field immediately
to the right of Y Limits.
Row 4 in the output matrix describes the link between object 8 and object 2.
Column 3 indicates that two links are included in this calculation: the link
itself and the link directly below it in the hierarchy. The inconsistency
coefficient for this link is 0.7071.
The following figure illustrates the links and heights included in this
calculation.
Creating Clusters
After you create the hierarchical tree of binary clusters, you can prune the
tree to partition your data into clusters using the cluster function. The
cluster function lets you create clusters in two ways, as discussed in the
following sections:
• Finding natural divisions in the data, using the inconsistency coefficient

• Specifying an arbitrary number of clusters

Finding Natural Divisions in Data
For example, if you use the cluster function to group the sample data set
into clusters, specifying an inconsistency coefficient threshold of 1.2 as the
value of the cutoff argument, the cluster function groups all the objects
in the sample data set into one cluster. In this case, none of the links in the
cluster hierarchy had an inconsistency coefficient greater than 1.2.
T = cluster(Z,'cutoff',1.2)
T =
1
1
1
1
1
The cluster function outputs a vector, T, that is the same size as the original
data set. Each element in this vector contains the number of the cluster into
which the corresponding object from the original data set was placed.
T = cluster(Z,'cutoff',0.8)
T =
3
2
3
1
1
This output indicates that objects 4 and 5 were placed in cluster 1, object 2 was placed in cluster 2, and objects 1 and 3 were placed in cluster 3.
When clusters are formed in this way, the cutoff value is applied to the
inconsistency coefficient. These clusters may, but do not necessarily,
correspond to a horizontal slice across the dendrogram at a certain height.
If you want clusters corresponding to a horizontal slice of the dendrogram,
you can either use the criterion option to specify that the cutoff should be
based on distance rather than inconsistency, or you can specify the number of
clusters directly as described in the following section.
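For instance, a sketch of a distance-based cutoff for the sample data (the threshold of 2 here is illustrative):

% Group objects whose linkage height is below 2 into the same cluster
T = cluster(Z,'cutoff',2,'criterion','distance');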
Specifying Arbitrary Clusters

Instead of letting the cluster function create clusters determined by the natural divisions in the data set, you can specify the number of clusters you want created. For example, you can specify that you want the cluster function to partition
the sample data set into two clusters. In this case, the cluster function
creates one cluster containing objects 1, 3, 4, and 5 and another cluster
containing object 2.
T = cluster(Z,'maxclust',2)
T =
2
1
2
2
2
To help you visualize how the cluster function determines these clusters, the
following figure shows the dendrogram of the hierarchical cluster tree. The
horizontal dashed line intersects two lines of the dendrogram, corresponding
to setting 'maxclust' to 2. These two lines partition the objects into two
clusters: the objects below the left-hand line, namely 1, 3, 4, and 5, belong to
one cluster, while the object below the right-hand line, namely 2, belongs to
the other cluster.
On the other hand, if you set 'maxclust' to 3, the cluster function groups
objects 4 and 5 in one cluster, objects 1 and 3 in a second cluster, and object 2
in a third cluster. The following command illustrates this.
T = cluster(Z,'maxclust',3)
T =
1
3
1
2
2
This time, the cluster function cuts off the hierarchy at a lower point,
corresponding to the horizontal line that intersects three lines of the
dendrogram in the following figure.
K-Means Clustering
In this section...
“Introduction” on page 11-21
“Creating Clusters and Determining Separation” on page 11-22
“Determining the Correct Number of Clusters” on page 11-23
“Avoiding Local Minima” on page 11-26
Introduction
K-means clustering is a partitioning method. The function kmeans partitions
data into k mutually exclusive clusters, and returns the index of the cluster
to which it has assigned each observation. Unlike hierarchical clustering,
k-means clustering operates on actual observations (rather than the larger
set of dissimilarity measures), and creates a single level of clusters. The
distinctions mean that k-means clustering is often more suitable than
hierarchical clustering for large amounts of data.
Each cluster in the partition is defined by its member objects and by its
centroid, or center. The centroid for each cluster is the point to which the sum
of distances from all objects in that cluster is minimized. kmeans computes
cluster centroids differently for each distance measure, to minimize the sum
with respect to the measure that you specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from
each object to its cluster centroid, over all clusters. This algorithm moves
objects between clusters until the sum cannot be decreased further. The
result is a set of clusters that are as compact and well-separated as possible.
You can control the details of the minimization using several optional input
parameters to kmeans, including ones for the initial values of the cluster
centroids, and for the maximum number of iterations.
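A sketch of setting these options (the seed matrix and iteration limit here are illustrative, and X is assumed to be the data matrix being clustered):

% Use explicit starting centroids and allow up to 200 iterations
seeds = X(1:3,:);              % one starting centroid per cluster
opts = statset('MaxIter',200);
idx = kmeans(X,3,'start',seeds,'options',opts);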
Creating Clusters and Determining Separation

The following example explores possible clustering in four-dimensional data:
load kmeansdata;
size(X)
ans =
560 4
Even though these data are four-dimensional, and cannot be easily visualized,
kmeans enables you to investigate whether a group structure exists in them.
Call kmeans with k, the desired number of clusters, equal to 3. For this
example, specify the city block distance measure, and use the default starting
method of initializing centroids from randomly selected data points:
idx3 = kmeans(X,3,'distance','city');
To get an idea of how well-separated the resulting clusters are, you can make
a silhouette plot using the cluster indices output from kmeans. The silhouette
plot displays a measure of how close each point in one cluster is to points in
the neighboring clusters. This measure ranges from +1, indicating points that
are very distant from neighboring clusters, through 0, indicating points that
are not distinctly in one cluster or another, to -1, indicating points that are
probably assigned to the wrong cluster. silhouette returns these values in
its first output:
[silh3,h] = silhouette(X,idx3,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
From the silhouette plot, you can see that most points in the third cluster
have a large silhouette value, greater than 0.6, indicating that the cluster is
somewhat separated from neighboring clusters. However, the first cluster
contains many points with low silhouette values, and the second contains a
few points with negative values, indicating that those two clusters are not
well separated.
Determining the Correct Number of Clusters

Increase the number of clusters to see if kmeans can find a better grouping of the data. This time, use the optional 'display' parameter to print information about each iteration:

idx4 = kmeans(X,4,'dist','city','display','iter');
iter phase num sum
1 1 560 2897.56
2 1 53 2736.67
3 1 50 2476.78
4 1 102 1779.68
5 1 5 1771.1
6 2 0 1771.1
6 iterations, total sum of distances = 1771.1
Notice that the total sum of distances decreases at each iteration as kmeans
reassigns points between clusters and recomputes cluster centroids. In this
case, the second phase of the algorithm did not make any reassignments,
indicating that the first phase reached a minimum after five iterations. In
some problems, the first phase might not reach a minimum, but the second
phase always will.
A silhouette plot for this solution indicates that these four clusters are better
separated than the three in the previous solution:
[silh4,h] = silhouette(X,idx4,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
A more quantitative way to compare the two solutions is to look at the average
silhouette values for the two cases:
mean(silh3)
ans =
0.52594
mean(silh4)
ans =
0.63997
Now see what happens with five clusters:

idx5 = kmeans(X,5,'dist','city','replicates',5);
[silh5,h] = silhouette(X,idx5,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
mean(silh5)
ans =
0.52657
This silhouette plot indicates that this is probably not the right number of
clusters, since two of the clusters contain points with mostly low silhouette
values. Without some knowledge of how many clusters are really in the data,
it is a good idea to experiment with a range of values for k.
Avoiding Local Minima

Like many other types of numerical minimizations, the solution that kmeans reaches often depends on the starting points. It is possible for kmeans to reach a local minimum, where reassigning any one point to a new cluster would increase the total sum of point-to-centroid distances, but where a better solution does exist. However, you can use the optional 'replicates' parameter to overcome that problem.
For four clusters, specify five replicates, and use the 'display' parameter to
print out the final sum of distances for each of the solutions.
[idx4,cent4,sumdist] = kmeans(X,4,'dist','city',...
'display','final','replicates',5);
17 iterations, total sum of distances = 2303.36
5 iterations, total sum of distances = 1771.1
6 iterations, total sum of distances = 1771.1
5 iterations, total sum of distances = 1771.1
8 iterations, total sum of distances = 2303.36
The output shows that, even for this relatively simple problem, non-global
minima do exist. Each of these five replicates began from a different randomly
selected set of initial centroids, and kmeans found two different local minima.
However, the final solution that kmeans returns is the one with the lowest
total sum of distances, over all replicates.
sum(sumdist)
ans =
1771.1
Gaussian Mixture Models
Introduction
Gaussian mixture models are formed by combining multivariate normal
density components. For information on individual multivariate normal
densities, see “Multivariate Normal Distribution” on page B-58 and related
distribution functions listed under “Multivariate Distributions” on page 5-8.
Gaussian mixture models are often used for data clustering. Clusters are
assigned by selecting the component that maximizes the posterior probability.
Like k-means clustering, Gaussian mixture modeling uses an iterative
algorithm that converges to a local optimum. Gaussian mixture modeling may
be more appropriate than k-means clustering when clusters have different
sizes and correlation within them. Clustering using Gaussian mixture models
is sometimes considered a soft clustering method. The posterior probabilities
for each point indicate that each data point has some probability of belonging
to each cluster.
Example: Clustering Using Gaussian Mixture Distributions

1 Generate data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X = [mvnrnd(mu1,sigma1,200);mvnrnd(mu2,sigma2,100)];
scatter(X(:,1),X(:,2),10,'ko')
2 Fit a two-component Gaussian mixture distribution:
options = statset('Display','final');
gm = gmdistribution.fit(X,2,'Options',options);
This displays the number of iterations and the final log-likelihood at convergence.

3 Superimpose contours of the fitted two-component density on the scatter plot:
hold on
ezcontour(@(x,y)pdf(gm,[x y]),[-8 6],[-8 6]);
hold off
4 Partition the data into clusters using the cluster method for the fitted
mixture distribution. The cluster method assigns each point to one of the
two components in the mixture distribution.
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
For example, plot the posterior probability of the first component for each
point:
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
You can also rank the points by their posterior probability for the first component and plot the ordered membership scores for both components:
[~,order] = sort(P(:,1));
plot(1:size(X,1),P(order,1),'r-',1:size(X,1),P(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');
Although a clear separation of the data is hard to see in a scatter plot of the
data, plotting the membership scores indicates that the fitted distribution
does a good job of separating the data into groups. Very few points have
scores close to 0.5.
You can specify other covariance structures for the fitted model. For example, fit a model in which the two components share a single diagonal covariance matrix:
gm2 = gmdistribution.fit(X,2,'CovType','Diagonal',...
'SharedCov',true);
You can compute the soft cluster membership scores without computing hard
cluster assignments, using posterior, or as part of hard clustering, as the
second output from cluster:
P = posterior(gm2,X);           % scores only
[idx,nlogl,P] = cluster(gm2,X); % hard assignments plus scores

Assigning New Data to Clusters
1 Given a data set X, first fit a Gaussian mixture distribution. The previous
code has already done that.
gm
gm =
Gaussian mixture distribution with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.312592
Mean: -0.9082 -2.1109
Component 2:
Mixing proportion: 0.687408
Mean: 0.9532 1.8940
2 You can then use cluster to assign each point in a new data set, Y, to one
of the clusters defined for the original data:
Y = [mvnrnd(mu1,sigma1,50);mvnrnd(mu2,sigma2,25)];
idx = cluster(gm,Y);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(Y(cluster1,1),Y(cluster1,2),10,'r+');
hold on
scatter(Y(cluster2,1),Y(cluster2,2),10,'bo');
hold off
legend('Class 1','Class 2','Location','NW')
As with the previous example, the posterior probabilities for each point can
be treated as membership scores rather than determining "hard" cluster
assignments.
For cluster to provide meaningful results with new data, Y should come
from the same population as X, the original data used to create the mixture
distribution. In particular, the estimated mixing probabilities for the
Gaussian mixture distribution fitted to X are used when computing the
posterior probabilities for Y.
12

Classification
Introduction
Models of data with a categorical response are called classifiers. A classifier is
built from training data, for which classifications are known. The classifier
assigns new test data to one of the categorical levels of the response.
12-2
Discriminant Analysis
Discriminant Analysis
In this section...
“Introduction” on page 12-3
“Example: Discriminant Analysis” on page 12-3
Introduction
Discriminant analysis uses training data to estimate the parameters of discriminant functions of the predictor variables. Discriminant functions determine boundaries in predictor space between various classes; linear discriminant analysis, for example, maximizes the separation between the different classes. The resulting classifier discriminates among the classes (the categorical levels of the response) based on the predictor data.
Example: Discriminant Analysis

1 For training data, use Fisher's sepal measurements for iris versicolor and virginica:

load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')
2 Classify a grid of measurements on the same scale:
[X,Y] = meshgrid(linspace(4.5,8),linspace(2,4));
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y],[SL SW],...
group,'quadratic');
hold on;
gscatter(X,Y,C,'rb','.',1,'off');
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;
% Plot the curve K + [x,y]*L + [x,y]*Q*[x,y]' = 0:
f = @(x,y) K + L(1)*x + L(2)*y + Q(1,1)*x.^2 + ...
(Q(1,2)+Q(2,1))*x.*y + Q(2,2)*y.^2
h2 = ezplot(f,[4.5 8 2 4]);
set(h2,'Color','m','LineWidth',2)
axis([4.5 8 2 4])
xlabel('Sepal Length')
ylabel('Sepal Width')
title('{\bf Classification with Fisher Training Data}')
Naive Bayes Classification

The Naive Bayes classifier is designed for use when features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps:

1 Training step: Using the training data, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.
2 Prediction step: For any unseen test sample, the method computes the
posterior probability of that sample belonging to each class. The method
then classifies the test sample according to the largest posterior probability.
Supported Distributions
Naive Bayes classification is based on estimating P(X|Y), the probability or
probability density of features X given class Y. The Naive Bayes classification
object NaiveBayes provides support for normal (Gaussian), kernel,
multinomial, and multivariate multinomial distributions. It is possible to use
different distributions for different features.
Normal (Gaussian) Distribution

The 'normal' distribution is appropriate for features that have normal distributions in each class. For each feature you model with a normal distribution, the Naive Bayes classifier estimates a separate normal distribution for each class by computing the mean and standard deviation of
the training data in that class. For more information on normal distributions,
see “Normal Distribution” on page B-83.
Kernel Distribution
The 'kernel' distribution is appropriate for features that have a continuous
distribution. It does not require a strong assumption such as a normal
distribution and you can use it in cases where the distribution of a feature may
be skewed or have multiple peaks or modes. It requires more computing time
and more memory than the normal distribution. For each feature you model
with a kernel distribution, the Naive Bayes classifier computes a separate
kernel density estimate for each class based on the training data for that class.
By default the kernel is the normal kernel, and the classifier selects a width
automatically for each class and feature. It is possible to specify different
kernels for each feature, and different widths for each feature or class.
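A minimal sketch of requesting kernel densities (the training matrix Xtrain and class labels y here are hypothetical):

% Fit a Naive Bayes model with a kernel density estimate for every feature
nb = NaiveBayes.fit(Xtrain,y,'Distribution','kernel');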
Multinomial Distribution
The multinomial distribution (specify with the 'mn' keyword) is appropriate
when all features represent counts of a set of words or tokens. This is
sometimes called the "bag of words" model. For example, an e-mail spam
classifier might be based on features that count the number of occurrences
of various tokens in an e-mail. One feature might count the number of
exclamation points, another might count the number of times the word
"money" appears, and another might count the number of times the recipient’s
name appears. This is a Naive Bayes model under the further assumption
that the total number of tokens (or the total document length) is independent
of response class.
For the multinomial option, each feature represents the count of one token.
The classifier counts the set of relative token probabilities separately for
each class. The classifier defines the multinomial distribution for each row
by the vector of probabilities for the corresponding class, and by N, the total
token count for that row.
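A sketch of fitting the multinomial model (the token-count matrix counts and the class labels y here are hypothetical; each row of counts holds nonnegative token counts for one document):

% Fit and apply a multinomial Naive Bayes classifier
nb = NaiveBayes.fit(counts,y,'Distribution','mn');
pred = predict(nb,counts);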
Multivariate Multinomial Distribution

The multivariate multinomial distribution (specify with the 'mvmn' keyword) is appropriate for a categorical feature, that is, a feature that takes values from a finite set of levels.
For each feature you model with a multivariate multinomial distribution, the
Naive Bayes classifier computes a separate set of probabilities for the set of
feature levels for each class.
Classification Trees
In this section...
“Introduction” on page 12-9
“Example: Classification Trees” on page 12-9
“References” on page 12-13
Introduction
Parametric models specify the form of the relationship between predictors
and a response, as in the Hougen-Watson model described in “Parametric
Models” on page 9-59. In many cases, however, the form of the relationship is
unknown, and a parametric model requires assumptions and simplifications.
Regression trees offer a nonparametric alternative. When response data are categorical, classification trees are a natural modification.
Example: Classification Trees

This example uses Fisher's iris data in fisheriris.mat to create a classification tree for predicting species using the sepal and petal measurements as predictors.

1 Load the data and use the classregtree constructor of the classregtree
class to create the classification tree:
load fisheriris
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
2 Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
classification
3 To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the iris at the triangular branching nodes. A true
answer to any question follows the branch to the left; a false follows the
branch to the right.
4 The tree does not use sepal measurements for predicting species. These
can go unmeasured in new data, and you can enter them as NaN values for
predictions. For example, to use the tree to predict the species of an iris
with petal length 4.8 and petal width 1.6, type:
predicted = t([NaN NaN 4.8 1.6])
predicted =
    'versicolor'
Note that the object allows for functional evaluation, of the form t(X).
This is a shorthand way of calling the eval method of the classregtree
class. The predicted species is the left leaf node at the bottom of the tree
in the previous view.
5 You can use a variety of other methods of the classregtree class, such as
cutvar and cuttype to get more information about the split at node 6 that
makes the final distinction between versicolor and virginica:

var6 = cutvar(t,6) % What variable determines the split?
var6 =
    'PW'

type6 = cuttype(t,6) % What type of split is it?
type6 =
    'continuous'
6 Classification trees fit the original (training) data well, but may do a poor
job of classifying new values. Lower branches, especially, may be strongly
affected by outliers. A simpler tree often avoids overfitting. You can use
the prune method of the classregtree class to find the next largest tree
from an optimal pruning sequence:
pruned = prune(t,'level',1)
pruned =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 class = versicolor
7 class = virginica
view(pruned)
References
[1] Breiman, L., et al., Classification and Regression Trees, Chapman & Hall,
Boca Raton, 1993.
Classification Using Nearest Neighbors
Pairwise Distance
Categorizing query points based on their distance to points in a training data
set can be a simple yet effective way of classifying new points. You can use a
variety of metrics to determine the distance, described in the following section.
Use pdist2 to find the distance between a set of data points and a set of query points.
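For instance, a small sketch (the data and query points here are illustrative):

% Distances from each of three data points to each of two query points
Xdata = [0 0; 1 0; 0 1];
Q = [0.9 0.1; 0.2 0.2];
D = pdist2(Xdata,Q)   % a 3-by-2 matrix of Euclidean distances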
Distance Metrics
Given an mx-by-n data matrix X, treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, treated as my (1-by-n) row vectors y1, y2, ..., ymy, the various distances between the vectors xs and yt are defined as follows:
• Euclidean distance

$d_{st}^2 = (x_s - y_t)(x_s - y_t)'$

• Standardized Euclidean distance

$d_{st}^2 = (x_s - y_t)\,V^{-1}(x_s - y_t)'$

where V is the n-by-n diagonal matrix whose jth diagonal element is $S(j)^2$, where S is the vector containing the inverse weights.

• Mahalanobis distance

$d_{st}^2 = (x_s - y_t)\,C^{-1}(x_s - y_t)'$

where C is the covariance matrix.

• City block distance

$d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}|$

Notice that the city block distance is a special case of the Minkowski metric, where p = 1.

• Minkowski metric

$d_{st} = \left( \sum_{j=1}^{n} |x_{sj} - y_{tj}|^p \right)^{1/p}$

Notice that for the special case of p = 1, the Minkowski metric gives the city block metric; for the special case of p = 2, the Minkowski metric gives the Euclidean distance; and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.

• Chebychev distance

$d_{st} = \max_j \left\{ |x_{sj} - y_{tj}| \right\}$

Notice that the Chebychev distance is a special case of the Minkowski metric, where p = ∞.

• Cosine distance

$d_{st} = 1 - \dfrac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}}$

• Correlation distance

$d_{st} = 1 - \dfrac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'}\;\sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}}$

where $\bar{x}_s = \frac{1}{n}\sum_j x_{sj}$ and $\bar{y}_t = \frac{1}{n}\sum_j y_{tj}$

• Hamming distance, the percentage of coordinates that differ:

$d_{st} = \#(x_{sj} \neq y_{tj})/n$

• Jaccard distance

$d_{st} = \dfrac{\#\big[(x_{sj} \neq y_{tj}) \cap \big((x_{sj} \neq 0) \cup (y_{tj} \neq 0)\big)\big]}{\#\big[(x_{sj} \neq 0) \cup (y_{tj} \neq 0)\big]}$

• Spearman distance

$d_{st} = 1 - \dfrac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'}\;\sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}$

where

- $r_{sj}$ is the rank of $x_{sj}$ taken over $x_{1j}, x_{2j}, \ldots, x_{mx,j}$, as computed by tiedrank
- $r_{tj}$ is the rank of $y_{tj}$ taken over $y_{1j}, y_{2j}, \ldots, y_{my,j}$, as computed by tiedrank
- $r_s$ and $r_t$ are the coordinate-wise rank vectors of $x_s$ and $y_t$, i.e., $r_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$ and $r_t = (r_{t1}, r_{t2}, \ldots, r_{tn})$
- $\bar{r}_s = \frac{1}{n}\sum_j r_{sj} = \frac{n+1}{2}$
- $\bar{r}_t = \frac{1}{n}\sum_j r_{tj} = \frac{n+1}{2}$
k-Nearest Neighbor Search

Given a set of data points X and a distance measure, a k-nearest neighbor search finds the k points in X closest to each query point. The knnsearch function performs this search using either an exhaustive method or a Kd-tree. knnsearch uses the exhaustive search method by default when the number of columns of X is greater than 10, when X is sparse, or when the distance measure is not supported by Kd-trees.
knnsearch will also use the exhaustive search method if your search object
is an ExhaustiveSearcher object. The exhaustive search method finds the
distance from each query point to every point in X, ranks them in ascending
order, and returns the k points with the smallest distances. For example, this
diagram shows the k=3 nearest neighbors.
knnsearch uses a Kd-tree search method by default when the number of columns of X is 10 or fewer, X is not sparse, and the distance measure is one of the following:

- 'euclidean' (default)
- 'cityblock'
- 'minkowski'
- 'chebychev'
kd-trees divide your data into nodes with at most BucketSize (default is
50) points per node, based on coordinates (as opposed to categories). The
following diagrams illustrate this concept using patch objects to color code
the different “buckets.”
When you want to find the k-nearest neighbors to a given query point,
knnsearch performs the following steps:
1 Determine the node to which the query point belongs. In the following
example, the query point (73,21.5) belongs to Node 12.
2 Find the closest k points within that node and its distance to the query
point. In the following example, the points in red squares are equidistant
from the query point, and are the closest points to the query point within
Node 12.
3 Choose all other nodes having any area that is within the same distance,
in any direction, from the query point to the kth closest point. In this
example, only Node 13 overlaps the solid black circle centered at the query
point with radius equal to the distance to the closest points within Node 12.
4 Search nodes within that range for any points closer to the query point. In
the following example, the point circled in red is clearly closer to the query
point than those within Node 12.
Using a kd-tree for large datasets with fewer than 10 dimensions (columns)
can be much more efficient than using the exhaustive search method, as
knnsearch needs to calculate only a subset of the distances. To maximize the
efficiency of kd-trees, use a KDTreeSearcher object.
All search objects have a knnsearch method specific to that class. This allows
you to perform a k-nearest neighbors search on your object in the most efficient
way for that specific object type. In addition, there is a generic knnsearch
function which performs the search without creating or using an object.
To determine which type of object and search method is best for your data,
consider the following:
• Does your data have many columns, say more than 10? The
ExhaustiveSearcher object may perform better.
• Is your data sparse? Use the ExhaustiveSearcher object.
• Do you want to use one of the following distance measures to find the
nearest neighbors? Use the ExhaustiveSearcher object.
- 'seuclidean'
- 'mahalanobis'
- 'cosine'
- 'correlation'
- 'spearman'
- 'hamming'
- 'jaccard'
- A custom distance function
• Is your dataset very large (but with fewer than 10 columns)? Use the
KDTreeSearcher object.
• Are you searching for the nearest neighbors for a large number of query
points? Use the KDTreeSearcher object.
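A minimal sketch of the object-based workflow (the matrices X and Q here are assumed to hold the data and query points):

% Build the kd-tree once, then reuse it for repeated queries
ns = KDTreeSearcher(X);
[idx,dist] = knnsearch(ns,Q,'k',3);   % three nearest neighbors per query point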
Example: Classifying Query Data Using knnsearch

Classify a new point based on the last two columns of the Fisher iris data. Using only the last two columns makes it easier to plot:
load fisheriris
x = meas(:,3:4);
gscatter(x(:,1),x(:,2),species)
set(legend,'location','best')
newpoint = [5 1.45];
line(newpoint(1),newpoint(2),'marker','x','color','k',...
'markersize',10,'linewidth',2)
Find the 10 sample points closest to the new point:
[n,d] = knnsearch(x,newpoint,'k',10)
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
It appears that knnsearch has found only the nearest eight neighbors. In fact,
this particular dataset contains duplicate values:
x(n,:)
ans =
5.0000 1.5000
4.9000 1.5000
4.9000 1.5000
5.1000 1.5000
5.1000 1.6000
4.8000 1.4000
5.0000 1.7000
4.7000 1.4000
4.7000 1.4000
4.7000 1.5000
To make duplicate values visible on the plot, you can jitter the points by adding a small random offset before plotting. The jittered points do not affect any analysis of the data, only the visualization. This example does not jitter the points.

Make the axes equal so the calculated distances correspond to the apparent distances on the plot (axis equal), and zoom in to see the neighbors better. Then find the species of the 10 nearest neighbors:

tabulate(species(n))
     Value    Count   Percent
 virginica        2     20.00%
versicolor        8     80.00%
Using a rule based on the majority vote of the 10 nearest neighbors, you can
classify this new point as a versicolor.
You can also visually identify the neighbors by drawing a circle around the group of them.
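One way to draw such a circle, sketched here assuming d holds the sorted neighbor distances returned by knnsearch, so that d(end) is the distance to the farthest of the 10 neighbors:

% Circle centered on the query point, just enclosing the 10 neighbors
ctr = newpoint - d(end);
diameter = 2*d(end);
h = rectangle('Position',[ctr,diameter,diameter],'Curvature',[1 1]);
set(h,'LineStyle',':')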
Using the same dataset, find the 10 nearest neighbors to three new points:
figure
newpoint2 = [5 1.45;6 2;2.75 .75];
gscatter(x(:,1),x(:,2),species)
legend('location','best')
[n2,d2] = knnsearch(x,newpoint2,'k',10);
line(x(n2,1),x(n2,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
line(newpoint2(:,1),newpoint2(:,2),'marker','x','color','k',...
'markersize',10,'linewidth',2,'linestyle','none')
Find the species of the 10 nearest neighbors for each new point:
tabulate(species(n2(1,:)))
Value Count Percent
virginica 2 20.00%
versicolor 8 80.00%
tabulate(species(n2(2,:)))
Value Count Percent
virginica 10 100.00%
tabulate(species(n2(3,:)))
Value Count Percent
versicolor 7 70.00%
setosa 3 30.00%
For further examples using the knnsearch methods and function, see the
individual reference pages.
Regression and Classification by Bagging Decision Trees
Introduction
Bagging, which stands for “bootstrap aggregation”, is a type of ensemble
learning. To bag a weak learner such as a decision tree on a dataset, generate
many bootstrap replicas of this dataset and grow decision trees on these
replicas. Obtain each bootstrap replica by randomly selecting N observations
out of N with replacement, where N is the dataset size. To find the predicted
response of a trained ensemble, take an average over predictions from
individual trees.
Drawing N out of N observations with replacement omits on average 37% of observations for each decision tree. These are "out-of-bag" observations. You
can use them to estimate the predictive power and feature importance. For
each observation, you can estimate the out-of-bag prediction by averaging over
predictions from all trees in the ensemble for which this observation is out of
bag. You can then compare the computed prediction against the true response
for this observation. By comparing the out-of-bag predicted responses against
the true responses for all observations used for training, you can estimate the
average out-of-bag error. This out-of-bag average is an unbiased estimator of
the true ensemble error. You can also obtain out-of-bag estimates of feature
importance by randomly permuting out-of-bag data across one variable or
column at a time and estimating the increase in the out-of-bag error due to
this permutation. The larger the increase, the more important the feature.
Thus, you do not need to supply test data for bagged ensembles because you
obtain reliable estimates of the predictive power and feature importance in
the process of training, which is an attractive feature of bagging.
Examples
The following examples show how to use ensembles of decision trees for
regression and classification.
Regression of Insurance Risk Rating for Car Imports

First, load the dataset (a database of 1985 car imports) and split it into predictor and response arrays:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
Because bagging uses randomized data drawings, its exact outcome depends
on the initial random seed. To reproduce the exact results in this example,
use the random stream settings
s = RandStream('mt19937ar','seed',1945);
RandStream.setDefaultStream(s);
To choose a good minimal leaf size, grow ensembles of 50 trees for a range of leaf sizes and compare their out-of-bag mean squared errors:

leaf = [1 5 10 20 50 100];
col = 'rgbcmy';
figure(1);
for i=1:length(leaf)
b = TreeBagger(50,X,Y,'method','r','oobpred','on',...
'cat',16:25,'minleaf',leaf(i));
plot(oobError(b),col(i));
hold on;
end
xlabel('Number of Grown Trees');
ylabel('Mean Squared Error');
legend({'1' '5' '10' '20' '50' '100'},'Location','NorthEast');
hold off;
The red curve (leaf size 1) gives the lowest MSE values. Grow a larger ensemble of 100 trees using this leaf size, and turn on estimation of feature importance:
b = TreeBagger(100,X,Y,'method','r','oobvarimp','on',...
'cat',16:25,'minleaf',1);
Inspect the error curve again to make sure nothing went wrong during
training:
figure(2);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
For each feature, you can permute the values of this feature across all of
the observations in the data set and measure how much worse the mean
squared error (MSE) becomes after the permutation. You can repeat this
for each feature.
Using the following code, plot the increase in MSE due to permuting out-of-bag
observations across each input variable. The OOBPermutedVarDeltaError
array stores the increase in MSE averaged over all trees in the ensemble and
divided by the standard deviation taken over the trees, for each variable. The
larger this value, the more important the variable. Imposing an arbitrary
cutoff at 0.65, you can select the five most important features.
figure(3);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Number');
ylabel('Out-Of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.65)
idxvar =
1 2 4 16 19
Estimate the fraction of observations that are in bag for all trees, as a function of the number of grown trees:
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(4);
plot(finbag);
xlabel('Number of Grown Trees');
Now grow an ensemble of 100 trees on the reduced set of the five most important features:
b5v = TreeBagger(100,X(:,idxvar),Y,'method','r',...
'oobvarimp','on','cat',4:5,'minleaf',1);
figure(5);
plot(oobError(b5v));
These five most powerful features give the same MSE as the full set, and
the ensemble trained on the reduced set ranks these features similarly to
each other. Features 1 and 2 from the reduced set perhaps could be removed
without a significant loss in the predictive power.
Finding Outliers
To find outliers in the training data, compute the proximity matrix using
fillProximities:
b5v = fillProximities(b5v);
The method normalizes this measure by subtracting the mean outlier measure
for the entire sample, taking the magnitude of this difference and dividing the
result by the median absolute deviation for the entire sample.
figure(7);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
To visualize the structure of the data, apply multidimensional scaling to the proximity matrix using mdsProx, and plot the first two scaled coordinates:
figure(8);
[~,e] = mdsProx(b5v,'colors','k');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
Assess the relative importance of the scaled axes by plotting the first 20
eigenvalues:
figure(9);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
To reduce the amount of memory used by the trained ensemble, remove the training data and other information not required for prediction with the compact method:
c = compact(b5v);
Classifying Radar Returns for Ionosphere Data
The workflow is similar to the one for “Regression of Insurance Risk Rating
for Car Imports” on page 12-31. Again, fix the initial random seed, grow 50
trees, inspect how the ensemble error changes with accumulation of trees, and
estimate feature importance. For classification, it is best to set the minimal
leaf size to 1 and select the square root of the total number of features for
each decision split at random. These are the default settings for a TreeBagger
used for classification.
load ionosphere;
s = RandStream('mt19937ar','seed',1945);
RandStream.setDefaultStream(s);
b = TreeBagger(50,X,Y,'oobvarimp','on');
figure(10);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
For an ensemble with few trees, some observations may be in bag for all trees. For such observations, it is impossible to compute a true out-of-bag prediction, and TreeBagger returns the most probable class for classification and the sample mean for regression. You can change the default value returned for in-bag observations using the DefaultYfit property. If you set the default value to an empty string for classification, the method excludes in-bag observations from computation of the out-of-bag error. In this case, the curve is more variable when the number of trees is small, either because some observations are never out of bag (and are therefore excluded) or because their predictions are based on few trees.
b.DefaultYfit = '';
figure(11);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Error Excluding in-Bag Observations');
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    % count observations that are in bag in all first t trees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
figure(13);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.8)
idxvar =
3 4 5 7 8
Having selected the five most important features, grow a larger ensemble on
the reduced feature set. Save time by not permuting out-of-bag observations
to obtain new estimates of feature importance for the reduced feature set (set
oobvarimp to 'off'). You would still be interested in obtaining out-of-bag
estimates of classification error (set oobpred to 'on').
b5v = TreeBagger(100,X(:,idxvar),Y,'oobpred','on');
figure(14);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
Plot the out-of-bag mean classification margin as well:

figure(15);
plot(oobMeanMargin(b5v));
As in the regression example, compute the proximity matrix and examine the distribution of outlier measures:

b5v = fillProximities(b5v);
figure(16);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
All extreme outliers for this dataset come from the 'good' class:
b5v.Y(b5v.OutlierMeasure>40)
ans =
'g'
'g'
'g'
'g'
'g'
Just like for regression, you can plot scaled coordinates, displaying the two
classes in different colors using the colors argument of mdsProx. This
argument takes a string in which every character represents a color. To find
out the order of classes used by the ensemble, look at the ClassNames property:
b5v.ClassNames
ans =
'g'
'b'
The 'good' class is first and the 'bad' class is second. Display scaled
coordinates using red for 'good' and blue for 'bad' observations.
figure(17);
[s,e] = mdsProx(b5v,'colors','rb');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
Again, plot the first 20 eigenvalues obtained by scaling. The first eigenvalue in
this case clearly dominates and the first scaled coordinate is most important.
figure(18);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
To compute the ROC curve, obtain out-of-bag scores with oobPredict and pass the scores for the 'good' class (the first column of Sfit) to perfcurve, with 'g' as the positive class:

[Yfit,Sfit] = oobPredict(b5v);
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,1),'g');
figure(19);
plot(fpr,tpr);
xlabel('False Positive Rate');
ylabel('True Positive Rate');
Instead of the standard ROC curve, you might want to plot, for example,
ensemble accuracy versus threshold on the score for the 'good' class. The
ycrit input argument of perfcurve lets you specify the criterion for the
y-axis, and the third output argument of perfcurve returns an array of
thresholds for the positive class score. Accuracy is the fraction of correctly
classified observations, or equivalently, one minus classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,1),'g','ycrit','accu');
figure(20);
plot(thre,accu);
xlabel('Threshold for ''good'' Returns');
ylabel('Classification Accuracy');
The curve shows a flat region indicating that any threshold from 0.2 to 0.6
is a reasonable choice. By default, the function assigns classification labels
using 0.5 as the boundary between the two classes. You can find exactly
what accuracy this corresponds to:
i50 = find(accu>=0.50,1,'first')
accu(abs(thre-0.5)<eps)

returns

ans =
    0.9430
The maximum accuracy is a little higher:

[maxaccu,iaccu] = max(accu)

returns

maxaccu =
    0.9459

iaccu =
    91

The corresponding optimal threshold is

thre(iaccu)

ans =
    0.5056
Performance Curves
In this section...
“Introduction” on page 12-53
“What are ROC Curves?” on page 12-53
“Evaluating Classifier Performance Using perfcurve” on page 12-53
Introduction
After a classification algorithm such as NaiveBayes or TreeBagger has trained on data, you may want to examine its performance on a specific test dataset. One common way to do this is to compute a gross measure of performance, such as quadratic loss or accuracy, averaged over the entire test dataset.
Evaluating Classifier Performance Using perfcurve

You can use perfcurve with any classifier or, more broadly, with any method that returns a numeric score for an instance of input data. By the convention adopted here:

• A high score returned by a classifier for any given instance signifies that the instance is likely from the positive class.
• A low score signifies that the instance is likely from one of the negative classes.
For some classifiers, you can interpret the score as the posterior probability
of observing an instance of the positive class at point X. An example of such
a score is the fraction of positive observations in a leaf of a decision tree. In
this case, scores fall into the range from 0 to 1 and scores from positive and
negative classes add up to unity. Other methods can return scores ranging
between minus and plus infinity, without any obvious mapping from the
score to the posterior class probability.
perfcurve does not impose any requirements on the input score range.
Because of this lack of normalization, you can use perfcurve to process scores
returned by any classification, regression, or fit method. perfcurve does
not make any assumptions about the nature of input scores or relationships
between the scores for different classes. As an example, consider a problem
with three classes, A, B, and C, and assume that the scores returned by some
classifier for two instances are as follows:
             A     B     C
instance 1   0.4   0.5   0.1
instance 2   0.4   0.1   0.5
perfcurve is intended for use with classifiers that return scores, not those
that return only predicted classes. As a counter-example, consider a decision
tree that returns only hard classification labels, 0 or 1, for data with two
classes. In this case, the performance curve reduces to a single point because
classified instances can be split into positive and negative categories in one
way only.
For input, perfcurve takes true class labels for some data and scores assigned
by a classifier to these data. By default, this utility computes a Receiver
Operating Characteristic (ROC) curve and returns values of 1–specificity,
or false positive rate, for X and sensitivity, or true positive rate, for Y. You
can choose other criteria for X and Y by selecting one out of several provided
criteria or specifying an arbitrary criterion through an anonymous function.
You can display the computed performance curve using plot(X,Y).
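As a minimal sketch with simulated labels and scores (not data from this chapter's examples):

labels = [ones(100,1); zeros(100,1)];       % true classes
scores = [randn(100,1)+1; randn(100,1)-1];  % higher scores for the positive class
[X,Y] = perfcurve(labels,scores,1);         % 1 designates the positive class
plot(X,Y);
xlabel('False Positive Rate');
ylabel('True Positive Rate');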
perfcurve can compute values for various criteria to plot either on the x- or
the y-axis. All such criteria are described by a 2-by-2 confusion matrix, a
2-by-2 cost matrix, and a 2-by-1 vector of scales applied to class counts.
$$C = \begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}$$

where TP, FN, FP, and TN are the counts of true positive, false negative, false positive, and true negative classifications, respectively.
For example, the first row of the confusion matrix defines how the classifier
identifies instances of the positive class: C(1,1) is the count of correctly
identified positive instances and C(1,2) is the count of positive instances
misidentified as negative.
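For instance, a sketch of computing true and false positive rates from such a confusion matrix, using hypothetical counts:

C = [30 10; 5 55];               % [TP FN; FP TN], hypothetical counts
TPR = C(1,1)/(C(1,1)+C(1,2));    % sensitivity = TP/(TP+FN)
FPR = C(2,1)/(C(2,1)+C(2,2));    % 1 - specificity = FP/(FP+TN)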
The cost matrix defines the cost of misclassification for each category:
$$\begin{pmatrix} Cost(P|P) & Cost(N|P) \\ Cost(P|N) & Cost(N|N) \end{pmatrix}$$

where Cost(I|J) is the cost of assigning an instance of class J to class I. Usually Cost(I|J) = 0 for I = J. For flexibility, perfcurve allows you to specify nonzero costs for correct classification as well.
For example, the positive predictive value (PPV) is

$$PPV = \frac{\mathrm{scale}(P)\,TP}{\mathrm{scale}(P)\,TP + \mathrm{scale}(N)\,FP}$$

If all scores in the data are above a certain threshold, perfcurve classifies all instances as 'positive'. This means that TP is the total number of instances in the positive class and FP is the total number of instances in the negative class. In this case, PPV is simply given by the prior:

$$PPV = \frac{\mathrm{prior}(P)}{\mathrm{prior}(P) + \mathrm{prior}(N)}$$

The perfcurve function returns two vectors, X and Y, of performance
measures. Each measure is some function of confusion, cost, and scale
values. You can request specific measures by name or provide a function
handle to compute a custom measure. The function you provide should take
confusion, cost, and scale as its three inputs and return a vector of output
values.
By default, perfcurve computes values of the X and Y criteria for all possible
score thresholds. Alternatively, it can compute a reduced number of specific X
values supplied as an input argument. In either case, for M requested values,
perfcurve computes M+1 values for X and Y. The first value out of these M+1
values is special. perfcurve computes it by setting the TP instance count
to zero and setting TN to the total count in the negative class. This value
corresponds to the 'reject all' threshold. On a standard ROC curve, this
translates into an extra point placed at (0,0).
If there are NaN values among input scores, perfcurve can process them in either of two ways: it can discard them, or it can add them to false classification counts in their respective classes. In the latter case, for any threshold, instances with NaN scores from the positive class
are counted as false negative (FN), and instances with NaN scores from the
negative class are counted as false positive (FP). In this case, the first value
of X or Y is computed by setting TP to zero and setting TN to the total count
minus the NaN count in the negative class. For illustration, consider an
example with two rows in the positive and two rows in the negative class,
each pair having a NaN score:
Class Score
Negative 0.2
Negative NaN
Positive 0.7
Positive NaN
If you discard rows with NaN scores, then as the score cutoff varies, perfcurve
computes performance measures as in the following table. For example, a
cutoff of 0.5 corresponds to the middle row where rows 1 and 3 are classified
correctly, and rows 2 and 4 are omitted.
TP FN FP TN
0 1 0 1
1 0 0 1
1 0 1 0
If you add rows with NaN scores to the false category in their respective
classes, perfcurve computes performance measures as in the following table.
For example, a cutoff of 0.5 corresponds to the middle row where now rows
2 and 4 are counted as incorrectly classified. Notice that only the FN and FP
columns differ between these two tables.
TP FN FP TN
0 2 1 1
1 1 1 1
1 1 2 0
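A sketch of both options on the four-row example above; the 'ProcessNaN' argument of perfcurve selects the behavior, and 'ignore' is the default:

labels = {'neg';'neg';'pos';'pos'};
scores = [0.2; NaN; 0.7; NaN];
% Discard rows with NaN scores:
[X1,Y1] = perfcurve(labels,scores,'pos','ProcessNaN','ignore');
% Count NaN rows as misclassified in their true class:
[X2,Y2] = perfcurve(labels,scores,'pos','ProcessNaN','addtofalse');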
For data with three or more classes, perfcurve takes one positive class and a
list of negative classes for input. The function computes the X and Y values
using counts in the positive class to estimate TP and FN, and using counts in
all negative classes to estimate TN and FP. perfcurve can optionally compute
Y values for each negative class separately and, in addition to Y, return a
matrix of size M-by-C, where M is the number of elements in X or Y and C is
the number of negative classes. You can use this functionality to monitor
components of the negative class contribution. For example, you can plot TP
counts on the X-axis and FP counts on the Y-axis. In this case, the returned
matrix shows how the FP component is split across negative classes.
13
Markov Models
Introduction
Markov processes are examples of stochastic processes—processes that
generate random sequences of outcomes or states according to certain
probabilities. Markov processes are distinguished by being memoryless—their
next state depends only on their current state, not on the history that led them
there. Models of Markov processes are used in a wide variety of applications,
from daily stock prices to the positions of genes in a chromosome.
Markov Chains
A Markov model is given visual representation with a state diagram, such
as the one below.
The rectangles in the diagram represent the possible states of the process you
are trying to model, and the arrows represent transitions between states.
The label on each arrow represents the probability of that transition. At
each step of the process, the model may generate an output, or emission,
depending on which state it is in, and then make a transition to another
state. An important characteristic of Markov models is that the next state
depends only on the current state, and not on the history of transitions that
lead to the current state.
For example, for a sequence of coin tosses the two states are heads and tails.
The most recent coin toss determines the current state of the model and each
subsequent toss determines the transition to the next state. If the coin is fair,
the transition probabilities are all 1/2. The emission might simply be the
current state. In more complicated models, random processes at each state
will generate emissions. You could, for example, roll a die to determine the
emission at any step.
Markov chains begin in an initial state $i_0$ at step 0. The chain then transitions to state $i_1$ with probability $T_{1 i_1}$, and emits an output $s_{k_1}$ with probability $E_{i_1 k_1}$. Consequently, the probability of observing the sequence of states $i_1 i_2 \ldots i_r$ and the sequence of emissions $s_{k_1} s_{k_2} \ldots s_{k_r}$ in the first $r$ steps is

$$T_{1 i_1} E_{i_1 k_1} T_{i_1 i_2} E_{i_2 k_2} \cdots T_{i_{r-1} i_r} E_{i_r k_r}$$
Hidden Markov Models
Introduction
A hidden Markov model is one in which you observe a sequence of emissions,
but do not know the sequence of states the model went through to generate
the emissions. Analyses of hidden Markov models seek to recover the
sequence of states from the observed data.
As an example, consider a Markov model with two states and six possible
emissions. The model uses:

• A red die, having six sides, labeled 1 through 6.
• A green die, having twelve sides, five of which are labeled 2 through 6, while the remaining seven sides are labeled 1.
• A weighted red coin, for which the probability of heads is 0.9 and the probability of tails is 0.1.
• A weighted green coin, for which the probability of heads is 0.95 and the probability of tails is 0.05.
The model creates a sequence of numbers from the set {1, 2, 3, 4, 5, 6} with the
following rules:
• Begin by rolling the red die and writing down the number that comes up,
which is the emission.
• Toss the red coin and do one of the following:
- If the result is heads, roll the red die and write down the result.
- If the result is tails, roll the green die and write down the result.
• At each subsequent step, you flip the coin that has the same color as the die
you rolled in the previous step. If the coin comes up heads, roll the same die
as in the previous step. If the coin comes up tails, switch to the other die.
The state diagram for this model has two states, red and green, as shown in
the following figure.
You determine the emission from a state by rolling the die with the same color
as the state. You determine the transition to the next state by flipping the
coin with the same color as the state.
The transition matrix is

$$T = \begin{bmatrix} 0.9 & 0.1 \\ 0.05 & 0.95 \end{bmatrix}$$

and the emission matrix is

$$E = \begin{bmatrix} \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} \\ \tfrac{7}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} \end{bmatrix}$$
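You can enter these matrices directly in MATLAB; the names TRANS and EMIS match the calls used in the rest of this section:

TRANS = [0.90 0.10; 0.05 0.95];
EMIS  = [1/6  1/6  1/6  1/6  1/6  1/6; ...
         7/12 1/12 1/12 1/12 1/12 1/12];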
The model is not hidden because you know the sequence of states from the
colors of the coins and dice. Suppose, however, that someone else is generating
the emissions without showing you the dice or the coins. All you see is the
sequence of emissions. If you start seeing more 1s than other numbers, you
might suspect that the model is in the green state, but you cannot be sure
because you cannot see the color of the die being rolled.
This section shows how to use the Statistics Toolbox functions for analyzing hidden Markov models.
To generate a random sequence of states and emissions from the model, use
hmmgenerate:
[seq,states] = hmmgenerate(1000,TRANS,EMIS);
The output seq is the sequence of emissions and the output states is the
sequence of states.
To compute the most likely sequence of states the model went through to generate seq, use hmmviterbi:

likelystates = hmmviterbi(seq,TRANS,EMIS);

To test the accuracy of hmmviterbi, compute the fraction of steps at which the actual and estimated state sequences agree:
sum(states==likelystates)/1000
ans =
0.8200
In this case, the most likely sequence of states agrees with the random
sequence 82% of the time.
The hmmestimate function takes the emission and state sequences and returns estimates of the transition and emission matrices:

[TRANS_EST,EMIS_EST] = hmmestimate(seq,states)
TRANS_EST =
0.8989 0.1011
0.0585 0.9415
EMIS_EST =
0.1721 0.1721 0.1749 0.1612 0.1803 0.1393
0.5836 0.0741 0.0804 0.0789 0.0726 0.1104
You can compare the outputs with the original transition and emission
matrices, TRANS and EMIS:
TRANS
TRANS =
0.9000 0.1000
0.0500 0.9500
EMIS
EMIS =
0.1667 0.1667 0.1667 0.1667 0.1667 0.1667
0.5833 0.0833 0.0833 0.0833 0.0833 0.0833
Using hmmtrain. If you do not know the sequence of states states, but you
have initial guesses for TRANS and EMIS, you can still estimate TRANS and
EMIS using hmmtrain.
Suppose you have initial guesses TRANS_GUESS and EMIS_GUESS for TRANS and EMIS. Estimating the matrices with

[TRANS_EST2,EMIS_EST2] = hmmtrain(seq,TRANS_GUESS,EMIS_GUESS)

returns

TRANS_EST2 =
0.2286 0.7714
0.0032 0.9968
EMIS_EST2 =
0.1436 0.2348 0.1837 0.1963 0.2350 0.0066
0.4355 0.1089 0.1144 0.1082 0.1109 0.1220
If the algorithm fails to reach the desired tolerance, increase the default value
of the maximum number of iterations with the command:
hmmtrain(seq,TRANS_GUESS,EMIS_GUESS,'maxiterations',maxiter)
where maxiter is the maximum number of iterations. To change the default tolerance, use the command:

hmmtrain(seq,TRANS_GUESS,EMIS_GUESS,'tolerance',tol)

where tol is the desired value of the tolerance. Increasing the value of tol makes the algorithm halt sooner, but the results are less accurate.

If hmmtrain does not converge to satisfactory estimates, two factors may be responsible:
• The algorithm converges to a local maximum that does not represent the
true transition and emission matrices. If you suspect this, use different
initial guesses for the matrices TRANS_EST and EMIS_EST.
• The sequence seq may be too short to properly train the matrices. If you
suspect this, use a longer sequence for seq.
To compute the posterior state probabilities of a sequence seq, use hmmdecode:

PSTATES = hmmdecode(seq,TRANS,EMIS)

hmmdecode begins with the model in state 1 at step 0, prior to the first emission. PSTATES(i,1) is the probability that the model is in state i at the following step 1. To change the initial state, see “Changing the Initial State Distribution” on page 13-12.
To return the logarithm of the probability of the sequence seq, use the second
output argument of hmmdecode:
[PSTATES,logpseq] = hmmdecode(seq,TRANS,EMIS)
The probability of a sequence tends to 0 as its length increases, and the probability of a sufficiently long sequence becomes less than the smallest positive number your computer can represent. hmmdecode returns the logarithm of the probability to avoid this problem.
Changing the Initial State Distribution

By default, the model begins in state 1. To assign a different distribution p of initial probabilities to the M states, create an augmented transition matrix

$$\hat{T} = \begin{bmatrix} 0 & p \\ \mathbf{0} & T \end{bmatrix}$$

where T is the true transition matrix. The first column of $\hat{T}$ contains M+1 zeros. p must sum to 1. Similarly, augment the emission matrix with a row of zeros:

$$\hat{E} = \begin{bmatrix} \mathbf{0} \\ E \end{bmatrix}$$
If the transition and emission matrices are TRANS and EMIS, respectively, you
create the augmented matrices with the following commands:
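A sketch of these commands, following the matrix definitions above (the initial distribution p shown here is an assumed example value):

p = [0.5 0.5];                                   % assumed initial state distribution
TRANS_HAT = [0 p; zeros(size(TRANS,1),1) TRANS]; % first column of M+1 zeros
EMIS_HAT  = [zeros(1,size(EMIS,2)); EMIS];       % zero row for the added initial state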
14
Design of Experiments
Introduction
Passive data collection leads to a number of problems in statistical modeling.
Observed changes in a response variable may be correlated with, but
not caused by, observed changes in individual factors (process variables).
Simultaneous changes in multiple factors may produce interactions that are
difficult to separate into individual effects. Observations may be dependent,
while a model of the data considers them to be independent.
For example, suppose a response y depends on two factors, x1 and x2, through the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
Here ε includes both experimental error and the effects of any uncontrolled
factors in the experiment. The terms β1x1 and β2x2 are main effects and the
term β3x1x2 is a two-way interaction effect. A designed experiment would
systematically manipulate x1 and x2 while measuring y, with the objective of
accurately estimating β0, β1, β2, and β3.
Full Factorial Designs
Multilevel Designs
To systematically vary experimental factors, assign each factor a discrete
set of levels. Full factorial designs measure response variables using every
treatment (combination of the factor levels). A full factorial design for n
factors with N1, ..., Nn levels requires N1 × ... × Nn experimental runs—one for
each treatment. While advantageous for separating individual effects, full
factorial designs can make large demands on data collection.
For example, the following generates a full factorial design for two factors, one with three levels and one with four:

dFF = fullfact([3,4])
dFF =
1 1
2 1
3 1
1 2
2 2
3 2
1 3
2 3
3 3
1 4
2 4
3 4
Two-Level Designs
Many experiments can be conducted with two-level factors, using two-level
designs. For example, suppose the machine shop in the previous example
always keeps the same operator on the same machine, but wants to measure
production effects that depend on the composition of the day and night
shifts. The Statistics Toolbox function ff2n generates a full factorial list of
treatments:
dFF2 = ff2n(4)
dFF2 =
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1
Each of the 2^4 = 16 rows of dFF2 represents one schedule of operators for the day (0) and night (1) shifts.
Fractional Factorial Designs
Introduction
Two-level designs are sufficient for evaluating many production processes.
Factor levels of ±1 can indicate categorical factors, normalized factor extremes,
or simply “up” and “down” from current factor settings. Experimenters
evaluating process changes are interested primarily in the factor directions
that lead to process improvement.
For experiments with many factors, two-level full factorial designs can lead to large amounts of data. For example, a two-level full factorial design with 10 factors requires 2^10 = 1024 runs. Often, however, individual factors or their
interactions have no distinguishable effects on a response. This is especially
true of higher order interactions. As a result, a well-designed experiment can
use fewer runs for estimating model parameters.
Plackett-Burman Designs
Plackett-Burman designs are used when only main effects are considered
significant. Two-level Plackett-Burman designs require a number of
experimental runs that are a multiple of 4 rather than a power of 2. The
MATLAB function hadamard generates these designs:
dPB = hadamard(8)
dPB =
1 1 1 1 1 1 1 1
1 -1 1 -1 1 -1 1 -1
1 1 -1 -1 1 1 -1 -1
1 -1 -1 1 1 -1 -1 1
1 1 1 1 -1 -1 -1 -1
1 -1 1 -1 -1 1 -1 1
1 1 -1 -1 -1 -1 1 1
1 -1 -1 1 -1 1 1 -1
Binary factor levels are indicated by ±1. The design is for eight runs (the rows of dPB) manipulating seven two-level factors (the last seven columns of dPB). The number of runs is a fraction 8/2^7 = 0.0625 of the runs required by a full factorial design. Economy is achieved at the expense of confounding main effects with two-way interactions.
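For instance, a sketch of extracting the factor columns (the first column of dPB is the constant, all-ones column):

dPB = hadamard(8);
factors = dPB(:,2:8);   % eight runs by seven two-level factors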
Specify general fractional factorial designs using a full factorial design for
a selected subset of basic factors and generators for the remaining factors.
Generators are products of the basic factors, giving the levels for the
remaining factors. Use the Statistics Toolbox function fracfact to generate
these designs:
dfF = fracfact('a b c d bcd acd')

dfF =
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 1 1
-1 -1 1 -1 1 1
-1 -1 1 1 -1 -1
-1 1 -1 -1 1 -1
-1 1 -1 1 -1 1
-1 1 1 -1 -1 1
-1 1 1 1 1 -1
1 -1 -1 -1 -1 1
1 -1 -1 1 1 -1
1 -1 1 -1 1 -1
1 -1 1 1 -1 1
1 1 -1 -1 1 1
1 1 -1 1 -1 -1
1 1 1 -1 -1 -1
1 1 1 1 1 1
This is a six-factor design in which four two-level basic factors (a, b, c, and d in the first four columns of dfF) are measured in every combination of levels, while the two remaining factors (in the last two columns of dfF) are measured only at levels defined by the generators bcd and acd, respectively. Levels in the generated columns are products of corresponding levels in the columns that make up the generator.
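As a quick check of this relation on the design above (assuming dfF as generated by fracfact):

e_check = dfF(:,2).*dfF(:,3).*dfF(:,4);  % product bcd
f_check = dfF(:,1).*dfF(:,3).*dfF(:,4);  % product acd
isequal(e_check,dfF(:,5))                % returns 1 (true)
isequal(f_check,dfF(:,6))                % returns 1 (true)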
The generators above come from the fracfactgen function, which uses an efficient search algorithm to find generators that meet stated requirements. For this six-factor design with factors a through f, they achieve resolution IV using 2^4 = 16 runs.
[dfF,confounding] = fracfact(generators);
confounding
confounding =
'Term' 'Generator' 'Confounding'
'X1' 'a' 'X1'
'X2' 'b' 'X2'
'X3' 'c' 'X3'
'X4' 'd' 'X4'
'X5' 'bcd' 'X5'
'X6' 'acd' 'X6'
'X1*X2' 'ab' 'X1*X2 + X5*X6'
'X1*X3' 'ac' 'X1*X3 + X4*X6'
'X1*X4' 'ad' 'X1*X4 + X3*X6'
'X1*X5' 'abcd' 'X1*X5 + X2*X6'
'X1*X6' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X2*X3' 'bc' 'X2*X3 + X4*X5'
'X2*X4' 'bd' 'X2*X4 + X3*X5'
'X2*X5' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X2*X6' 'abcd' 'X1*X5 + X2*X6'
'X3*X4' 'cd' 'X1*X6 + X2*X5 + X3*X4'
'X3*X5' 'bd' 'X2*X4 + X3*X5'
'X3*X6' 'ad' 'X1*X4 + X3*X6'
'X4*X5' 'bc' 'X2*X3 + X4*X5'
'X4*X6' 'ac' 'X1*X3 + X4*X6'
'X5*X6' 'ab' 'X1*X2 + X5*X6'
The confounding pattern shows that main effects are effectively separated
by the design, but two-way interactions are confounded with various other
two-way interactions.
Response Surface Designs
Introduction
As discussed in “Response Surface Models” on page 9-45, quadratic response
surfaces are simple models that provide a maximum or minimum without
making additional assumptions about the form of the response. Quadratic
models can be calibrated using full factorial designs with three or more levels
for each factor, but these designs generally require more runs than necessary
to accurately estimate model parameters. This section discusses designs for
calibrating quadratic models that are much more efficient, using three or five
levels for each factor, but not using all combinations of levels.
Central Composite Designs
Each design consists of a factorial design (the corners of a cube) together with center and star points that allow for estimation of second-order effects. For a full quadratic model with n factors, CCDs have enough design points to estimate all (n+2)(n+1)/2 coefficients; for example, a full quadratic model with n = 3 factors has (3+2)(3+1)/2 = 10 coefficients.
The type of CCD used (the position of the factorial and star points) is
determined by the number of factors and by the desired properties of the
design. A design
is rotatable if the prediction variance depends only on the distance of the
design point from the center of the design.
dCC = ccdesign(3,'type','circumscribed')
dCC =
-1.0000 -1.0000 -1.0000
-1.0000 -1.0000 1.0000
-1.0000 1.0000 -1.0000
-1.0000 1.0000 1.0000
1.0000 -1.0000 -1.0000
1.0000 -1.0000 1.0000
1.0000 1.0000 -1.0000
1.0000 1.0000 1.0000
-1.6818 0 0
1.6818 0 0
0 -1.6818 0
0 1.6818 0
0 0 -1.6818
0 0 1.6818
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
The repeated center point runs allow for a more uniform estimate of the
prediction variance over the entire design space.
Box-Behnken Designs
Like the designs described in “Central Composite Designs” on page
14-9, Box-Behnken designs are used to calibrate full quadratic models.
Box-Behnken designs are rotatable and, for a small number of factors (four or
less), require fewer runs than CCDs. By avoiding the corners of the design
space, they allow experimenters to work around extreme factor combinations.
As with an inscribed CCD, however, responses at extreme factor settings are then poorly estimated.
Design points are at the midpoints of edges of the design space and at the
center, and do not contain an embedded factorial design.
dBB = bbdesign(3)
dBB =
-1 -1 0
-1 1 0
1 -1 0
1 1 0
-1 0 -1
-1 0 1
1 0 -1
1 0 1
0 -1 -1
0 -1 1
0 1 -1
0 1 1
0 0 0
0 0 0
0 0 0
Again, the repeated center point runs allow for a more uniform estimate of
the prediction variance over the entire design space.
D-Optimal Designs
In this section...
“Introduction” on page 14-15
“Generating D-Optimal Designs” on page 14-16
“Augmenting D-Optimal Designs” on page 14-19
“Specifying Fixed Covariate Factors” on page 14-20
“Specifying Categorical Factors” on page 14-21
“Specifying Candidate Sets” on page 14-21
Introduction
Traditional experimental designs (“Full Factorial Designs” on page 14-3,
“Fractional Factorial Designs” on page 14-5, and “Response Surface Designs”
on page 14-9) are appropriate for calibrating linear models in experimental
settings where factors are relatively unconstrained in the region of interest.
In some cases, however, models are necessarily nonlinear. In other cases,
certain treatments (combinations of factor levels) may be expensive or
infeasible to measure. D-optimal designs are model-specific designs that
address these limitations of traditional designs.
Function Description
candexch Uses a row-exchange algorithm to generate a D-optimal design
with a specified number of runs for a specified model and a
specified candidate set. This is the second component of the
algorithm used by rowexch.
candgen Generates a candidate set for a specified model. This is the
first component of the algorithm used by rowexch.
cordexch Uses a coordinate-exchange algorithm to generate a D-optimal
design with a specified number of runs for a specified model.
daugment Uses a coordinate-exchange algorithm to augment an existing
D-optimal design with additional runs to estimate additional
model terms.
dcovary Uses a coordinate-exchange algorithm to generate a D-optimal
design with fixed covariate factors.
rowexch Uses a row-exchange algorithm to generate a D-optimal design
with a specified number of runs for a specified model. The
algorithm calls candgen and then candexch. (Call candexch
separately to specify a candidate set.)
Note The Statistics Toolbox function rsmdemo generates simulated data for
experimental settings specified by either the user or by a D-optimal design
generated by cordexch. It uses the rstool interface to visualize response
surface models fit to the data, and it uses the nlintool interface to visualize
a nonlinear model fit to the data.
Both cordexch and rowexch use iterative search algorithms. They operate by incrementally changing an initial design matrix X to increase $D = |X^T X|$ at each step. In both algorithms, there is randomness built into the selection of
the initial design and into the choice of the incremental changes. As a result,
both algorithms may return locally, but not globally, D-optimal designs. Run
each algorithm multiple times and select the best result for your final design.
Both functions have a 'tries' parameter that automates this repetition
and comparison.
For example, suppose you want a design to estimate the parameters in the
following three-factor, seven-term interaction model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 + \varepsilon$$
nfactors = 3;
nruns = 7;
[dCE,X] = cordexch(nfactors,nruns,'interaction','tries',10)
dCE =
-1 1 1
-1 -1 -1
1 1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 -1 1
X =
1 -1 1 1 -1 -1 1
1 -1 -1 -1 1 1 1
1 1 1 1 1 1 1
1 -1 1 -1 -1 1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 -1 -1 1 1 -1 -1
Columns of the design matrix X are the model terms evaluated at each row of the design dCE. The terms appear in order from left to right:

1 Constant term
2 Linear terms (1, 2, 3)
3 Interaction terms (12, 13, 23)

For comparison, the following uses rowexch to generate a similar design with the row-exchange algorithm:
[dRE,X] = rowexch(nfactors,nruns,'interaction','tries',10)
dRE =
-1 -1 1
1 -1 1
1 -1 -1
1 1 1
-1 -1 -1
-1 1 -1
-1 1 1
X =
1 -1 -1 1 1 -1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 1 1 1 1 1 1
1 -1 -1 -1 1 1 1
1 -1 1 -1 -1 1 -1
1 -1 1 1 -1 -1 1
Augmenting D-Optimal Designs

The daugment function augments an existing D-optimal design with additional runs to estimate additional model terms. For example, the following eight-run design is adequate for estimating main effects in a four-factor model:
dCEmain = cordexch(4,8)
dCEmain =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
To estimate the six interaction terms in the model, augment the design with
eight additional runs:
dCEinteraction = daugment(dCEmain,8,'interaction')
dCEinteraction =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
-1 1 1 1
-1 -1 -1 -1
1 -1 1 -1
1 1 -1 1
-1 1 1 -1
1 1 -1 -1
1 -1 1 1
1 1 1 -1
The augmented design is full factorial, with the original eight runs in the
first eight rows.
Specifying Fixed Covariate Factors

The dcovary function generates a D-optimal design subject to fixed covariate factors. For example, the following generates a design for three controlled factors over eight runs with a fixed, linearly increasing time trend:

time = linspace(-1,1,8)';
[dCV,X] = dcovary(3,time,'linear')
dCV =
-1.0000 1.0000 1.0000 -1.0000
1.0000 -1.0000 -1.0000 -0.7143
-1.0000 -1.0000 -1.0000 -0.4286
1.0000 -1.0000 1.0000 -0.1429
1.0000 1.0000 -1.0000 0.1429
-1.0000 1.0000 -1.0000 0.4286
1.0000 1.0000 1.0000 0.7143
-1.0000 -1.0000 1.0000 1.0000
X =
1.0000 -1.0000 1.0000 1.0000 -1.0000
1.0000 1.0000 -1.0000 -1.0000 -0.7143
1.0000 -1.0000 -1.0000 -1.0000 -0.4286
1.0000 1.0000 -1.0000 1.0000 -0.1429
The column vector time is a fixed factor, normalized to values between ±1.
The number of rows in the fixed factor specifies the number of runs in the
design. The resulting design dCV gives factor settings for the three controlled
model factors at each time.
Specifying Categorical Factors

Categorical factors are specified with the 'categorical' parameter, and take levels labeled 1 through the number of levels. For example, the following eight-run design is for a linear additive model with five factors in which the final factor is categorical with three levels:
dCEcat = cordexch(5,8,'linear','categorical',5,'levels',3)
dCEcat =
-1 -1 1 1 2
-1 -1 -1 -1 3
1 1 1 1 3
1 1 -1 -1 2
1 -1 -1 1 3
-1 1 -1 1 1
-1 1 1 -1 3
1 -1 1 -1 1
Specifying Candidate Sets

For example, the following uses rowexch to generate a five-run design for a two-factor pure quadratic model using a candidate set that is produced internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
-1 1
0 0
1 -1
1 0
1 1
The same thing can be done using candgen and candexch in sequence:

[dC,C] = candgen(2,'purequadratic')    % generate candidate set and design matrix
treatments = candexch(C,5,'tries',10)  % select a D-optimal subset of five rows
dRE2 = dC(treatments,:) % Display design
dRE2 =
0 -1
-1 -1
-1 1
1 -1
-1 0
You can replace C in this example with a design matrix evaluated at your own
candidate set. For example, suppose your experiment is constrained so that
the two factors cannot have extreme settings simultaneously. The following
produces a restricted candidate set:
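One way to produce such a set, sketched here assuming dC is the full candidate set from candgen above:

constraint = sum(abs(dC),2) < 2;  % exclude points where both factors are extreme
my_dC = dC(constraint,:)          % restricted candidate set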
Use the x2fx function to convert the candidate set to a design matrix:
my_C = x2fx(my_dC,'purequadratic')
my_C =
1 0 -1 0 1
1 -1 0 1 0
1 0 0 0 0
1 1 0 1 0
1 0 1 0 1
Use candexch to select a D-optimal subset of the restricted candidates:

my_treatments = candexch(my_C,5,'tries',10)  % D-optimal subset
my_dRE = my_dC(my_treatments,:) % Display design
my_dRE =
-1 0
1 0
0 1
0 -1
0 0
15
Statistical Process Control
Introduction
Statistical process control (SPC) refers to a number of different methods for
monitoring and assessing the quality of manufactured goods. Combined
with methods from Chapter 14, “Design of Experiments”, SPC is used in
programs that define, measure, analyze, improve, and control development
and production processes. These programs are often implemented using
“Design for Six Sigma” methodologies.
Control Charts
A control chart displays measurements of process samples over time. The
measurements are plotted together with user-defined specification limits and
process-defined control limits. The process can then be compared with its
specifications—to see if it is in control or out of control.
The chart is just a monitoring tool. Control activity might occur if the chart
indicates an undesirable, systematic change in the process. The control
chart is used to discover the variation, so that the process can be adjusted
to reduce it.
Control charts are created with the controlchart function. Any of the
following chart types may be specified:
• Xbar or mean
• Standard deviation
• Range
• Exponentially weighted moving average
• Individual observation
• Moving range of individual observations
• Moving average of individual observations
• Proportion defective
• Number of defectives
• Defects per unit
• Count of defects
For example, the following commands create an xbar chart, using the
“Western Electric 2” rule (2 of 3 points at least 2 standard errors above the
center line) to mark out of control measurements:
load parts;
st = controlchart(runout,'rules','we2');
x = st.mean;                 % sample means plotted on the chart
cl = st.mu;                  % center line
se = st.sigma./sqrt(st.n);   % standard error of each plotted point
hold on
plot(cl+2*se,'m')            % line two standard errors above the center line
R = controlrules('we2',x,cl,se);  % evaluate the we2 rule at each point
I = find(R)                       % indices of out-of-control points
I =
21
23
24
25
26
27
Capability Studies
Before going into production, many manufacturers run a capability study to
determine if their process will run within specifications enough of the time.
Capability indices produced by such a study are used to estimate expected
percentages of defective parts.
Capability studies are conducted with the capability function. The following
capability indices are produced:
• mu — Sample mean
• sigma — Sample standard deviation
• P — Estimated probability of being within the lower (L) and upper (U)
specification limits
• Pl — Estimated probability of being below L
• Pu — Estimated probability of being above U
• Cp — (U-L)/(6*sigma)
• Cpl — (mu-L)./(3.*sigma)
• Cpu — (U-mu)./(3.*sigma)
• Cpk — min(Cpl,Cpu)
For example, simulate a sample from a process with a mean of 3 and a standard deviation of 0.005, and assess it against specification limits of 2.99 and 3.01:

data = normrnd(3,0.005,100,1);
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
Visualize the results of the capability study with capaplot:

capaplot(data,[2.99 3.01]);
grid on
16
Function Reference
File I/O
caseread Read case names from file
casewrite Write case names to file
tblread Read tabular data from file
tblwrite Write tabular data to file
tdfread Read tab-delimited file
xptread Create dataset array from data
stored in SAS XPORT format file
Data Organization
Categorical Arrays (p. 16-3)
Dataset Arrays (p. 16-6)
Grouped Data (p. 16-7)
Categorical Arrays
addlevels (categorical) Add levels to categorical array
cat (categorical) Concatenate categorical arrays
categorical Create categorical array
cellstr (categorical) Convert categorical array to cell
array of strings
char (categorical) Convert categorical array to
character array
circshift (categorical) Shift categorical array circularly
ctranspose (categorical) Transpose categorical matrix
double (categorical) Convert categorical array to double
array
droplevels (categorical) Drop levels
end (categorical) Last index in indexing expression for
categorical array
flipdim (categorical) Flip categorical array along specified
dimension
fliplr (categorical) Flip categorical matrix in left/right
direction
flipud (categorical) Flip categorical matrix in up/down
direction
getlabels (categorical) Access categorical array labels
getlevels (categorical) Get categorical array levels
Dataset Arrays
cat (dataset) Concatenate dataset arrays
cellstr (dataset) Create cell array of strings from
dataset array
dataset Construct dataset array
datasetfun (dataset) Apply function to dataset array
variables
double (dataset) Convert dataset variables to double
array
end (dataset) Last index in indexing expression for
dataset array
export (dataset) Write dataset array to file
get (dataset) Access dataset array properties
grpstats (dataset) Summary statistics by group for
dataset arrays
horzcat (dataset) Horizontal concatenation for dataset
arrays
isempty (dataset) True for empty dataset array
join (dataset) Merge observations
length (dataset) Length of dataset array
Grouped Data
gplotmatrix Matrix of scatter plots by group
grp2idx Create index vector from grouping
variable
grpstats Summary statistics by group
gscatter Scatter plot by group
Descriptive Statistics
Summaries (p. 16-8)
Measures of Central Tendency
(p. 16-8)
Measures of Dispersion (p. 16-8)
Measures of Shape (p. 16-9)
Statistics Resampling (p. 16-9)
Data with Missing Values (p. 16-9)
Data Correlation (p. 16-10)
Summaries
crosstab Cross-tabulation
grpstats Summary statistics by group
summary (categorical) Summary statistics for categorical
array
tabulate Frequency table
Measures of Dispersion
iqr Interquartile range
mad Mean or median absolute deviation
Measures of Shape
kurtosis Kurtosis
moment Central moments
prctile Calculate percentile values
quantile Quantiles
skewness Skewness
zscore Standardized z-scores
Statistics Resampling
bootci Bootstrap confidence interval
bootstrp Bootstrap sampling
jackknife Jackknife sampling
Data Correlation
canoncorr Canonical correlation
cholcov Cholesky-like covariance
decomposition
cophenet Cophenetic correlation coefficient
corr Linear or rank correlation
corrcov Convert covariance matrix to
correlation matrix
partialcorr Linear or rank partial correlation
coefficients
tiedrank Rank adjusted for ties
Statistical Visualization
Distribution Plots (p. 16-11)
Scatter Plots (p. 16-12)
ANOVA Plots (p. 16-12)
Regression Plots (p. 16-13)
Multivariate Plots (p. 16-13)
Cluster Plots (p. 16-13)
Classification Plots (p. 16-14)
DOE Plots (p. 16-14)
SPC Plots (p. 16-14)
Distribution Plots
boxplot Box plot
cdfplot Empirical cumulative distribution
function plot
dfittool Interactive distribution fitting
disttool Interactive density and distribution
plots
ecdfhist Empirical cumulative distribution
function histogram
fsurfht Interactive contour plot
hist3 Bivariate histogram
histfit Histogram with normal fit
normplot Normal probability plot
normspec Normal density plot between
specifications
pareto Pareto chart
probplot Probability plots
Scatter Plots
gline Interactively add line to plot
gname Add case names to plot
gplotmatrix Matrix of scatter plots by group
gscatter Scatter plot by group
lsline Add least-squares line to scatter plot
refcurve Add reference curve to plot
refline Add reference line to plot
scatterhist Scatter plot with marginal
histograms
ANOVA Plots
anova1 One-way analysis of variance
aoctool Interactive analysis of covariance
manovacluster Dendrogram of group mean clusters
following MANOVA
multcompare Multiple comparison test
Regression Plots
addedvarplot Added-variable plot
gline Interactively add line to plot
lsline Add least-squares line to scatter plot
polytool Interactive polynomial fitting
rcoplot Residual case order plot
refcurve Add reference curve to plot
refline Add reference line to plot
robustdemo Interactive robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
view (classregtree) Plot tree
Multivariate Plots
andrewsplot Andrews plot
biplot Biplot
glyphplot Glyph plot
parallelcoords Parallel coordinates plot
Cluster Plots
dendrogram Dendrogram plot
manovacluster Dendrogram of group mean clusters
following MANOVA
silhouette Silhouette plot
Classification Plots
perfcurve Compute Receiver Operating
Characteristic (ROC) curve or other
performance curve for classifier
output
view (classregtree) Plot tree
DOE Plots
interactionplot Interaction plot for grouped data
maineffectsplot Main effects plot for grouped data
multivarichart Multivari chart for grouped data
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
SPC Plots
capaplot Process capability plot
controlchart Shewhart control charts
histfit Histogram with normal fit
normspec Normal density plot between
specifications
Probability Distributions
Distribution Objects (p. 16-15)
Distribution Plots (p. 16-16)
Probability Density (p. 16-17)
Cumulative Distribution (p. 16-19)
Inverse Cumulative Distribution
(p. 16-21)
Distribution Statistics (p. 16-23)
Distribution Fitting (p. 16-24)
Negative Log-Likelihood (p. 16-26)
Random Number Generators
(p. 16-26)
Quasi-Random Numbers (p. 16-28)
Piecewise Distributions (p. 16-29)
Distribution Objects
cdf (ProbDist) Return cumulative distribution
function (CDF) for ProbDist object
fitdist Fit probability distribution to data
icdf (ProbDistUnivKernel) Return inverse cumulative
distribution function (ICDF) for
ProbDistUnivKernel object
icdf (ProbDistUnivParam) Return inverse cumulative
distribution function (ICDF) for
ProbDistUnivParam object
iqr (ProbDistUnivKernel) Return interquartile range (IQR) for
ProbDistUnivKernel object
iqr (ProbDistUnivParam) Return interquartile range (IQR) for
ProbDistUnivParam object
Distribution Plots
boxplot Box plot
cdfplot Empirical cumulative distribution
function plot
dfittool Interactive distribution fitting
disttool Interactive density and distribution
plots
ecdfhist Empirical cumulative distribution
function histogram
Probability Density
betapdf Beta probability density function
binopdf Binomial probability density
function
chi2pdf Chi-square probability density
function
copulapdf Copula probability density function
disttool Interactive density and distribution
plots
evpdf Extreme value probability density
function
exppdf Exponential probability density
function
Cumulative Distribution
betacdf Beta cumulative distribution
function
binocdf Binomial cumulative distribution
function
cdf Cumulative distribution functions
cdf (gmdistribution) Cumulative distribution function for
Gaussian mixture distribution
cdf (piecewisedistribution) Cumulative distribution function for
piecewise distribution
cdfplot Empirical cumulative distribution
function plot
chi2cdf Chi-square cumulative distribution
function
copulacdf Copula cumulative distribution
function
Distribution Statistics
betastat Beta mean and variance
binostat Binomial mean and variance
chi2stat Chi-square mean and variance
copulastat Copula rank correlation
evstat Extreme value mean and variance
expstat Exponential mean and variance
fstat F mean and variance
gamstat Gamma mean and variance
geostat Geometric mean and variance
gevstat Generalized extreme value mean
and variance
gpstat Generalized Pareto mean and
variance
hygestat Hypergeometric mean and variance
lognstat Lognormal mean and variance
nbinstat Negative binomial mean and
variance
ncfstat Noncentral F mean and variance
nctstat Noncentral t mean and variance
ncx2stat Noncentral chi-square mean and
variance
Distribution Fitting
Supported Distributions (p. 16-24)
Piecewise Distributions (p. 16-25)
Supported Distributions
Piecewise Distributions
Negative Log-Likelihood
betalike Beta negative log-likelihood
evlike Extreme value negative
log-likelihood
explike Exponential negative log-likelihood
gamlike Gamma negative log-likelihood
gevlike Generalized extreme value negative
log-likelihood
gplike Generalized Pareto negative
log-likelihood
lognlike Lognormal negative log-likelihood
mvregresslike Negative log-likelihood for
multivariate regression
normlike Normal negative log-likelihood
wbllike Weibull negative log-likelihood
Quasi-Random Numbers
addlistener (qrandstream) Add listener for event
delete (qrandstream) Delete handle object
end (qrandset) Last index in indexing expression for
point set
eq (qrandstream) Test handle equality
findobj (qrandstream) Find objects matching specified
conditions
findprop (qrandstream) Find property of MATLAB handle
object
ge (qrandstream) Greater than or equal relation for
handles
gt (qrandstream) Greater than relation for handles
haltonset Construct Halton quasi-random
point set
isvalid (qrandstream) Test handle validity
le (qrandstream) Less than or equal relation for
handles
Piecewise Distributions
boundary (piecewisedistribution) Piecewise distribution boundaries
cdf (piecewisedistribution) Cumulative distribution function for
piecewise distribution
icdf (piecewisedistribution) Inverse cumulative distribution
function for piecewise distribution
lowerparams (paretotails) Lower Pareto tails parameters
nsegments (piecewisedistribution) Number of segments
paretotails Construct Pareto tails object
Hypothesis Tests
ansaribradley Ansari-Bradley test
barttest Bartlett’s test
canoncorr Canonical correlation
chi2gof Chi-square goodness-of-fit test
dwtest Durbin-Watson test
friedman Friedman’s test
jbtest Jarque-Bera test
kruskalwallis Kruskal-Wallis test
kstest One-sample Kolmogorov-Smirnov
test
kstest2 Two-sample Kolmogorov-Smirnov
test
lillietest Lilliefors test
linhyptest Linear hypothesis test
ranksum Wilcoxon rank sum test
runstest Run test for randomness
sampsizepwr Sample size and power of test
signrank Wilcoxon signed rank test
signtest Sign test
ttest One-sample and paired-sample t-test
ttest2 Two-sample t-test
vartest Chi-square variance test
vartest2 Two-sample F-test for equal
variances
vartestn Bartlett multiple-sample test for
equal variances
Analysis of Variance
ANOVA Plots (p. 16-32)
ANOVA Operations (p. 16-32)
ANOVA Plots
anova1 One-way analysis of variance
aoctool Interactive analysis of covariance
manovacluster Dendrogram of group mean clusters
following MANOVA
multcompare Multiple comparison test
ANOVA Operations
anova1 One-way analysis of variance
anova2 Two-way analysis of variance
anovan N-way analysis of variance
aoctool Interactive analysis of covariance
dummyvar Create dummy variables
friedman Friedman’s test
kruskalwallis Kruskal-Wallis test
manova1 One-way multivariate analysis of
variance
Regression Analysis
Regression Plots (p. 16-33)
Linear Regression (p. 16-34)
Nonlinear Regression (p. 16-35)
Regression Trees (p. 16-35)
Ensemble Methods (p. 16-36)
Regression Plots
addedvarplot Added-variable plot
gline Interactively add line to plot
lsline Add least-squares line to scatter plot
polytool Interactive polynomial fitting
rcoplot Residual case order plot
refcurve Add reference curve to plot
refline Add reference line to plot
robustdemo Interactive robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
view (classregtree) Plot tree
Linear Regression
coxphfit Cox proportional hazards regression
dummyvar Create dummy variables
glmfit Generalized linear model regression
glmval Generalized linear model values
invpred Inverse prediction
leverage Leverage
mnrfit Multinomial logistic regression
mnrval Multinomial logistic regression
values
mvregress Multivariate linear regression
mvregresslike Negative log-likelihood for
multivariate regression
plsregress Partial least-squares regression
polyconf Polynomial confidence intervals
polytool Interactive polynomial fitting
regress Multiple linear regression
regstats Regression diagnostics
ridge Ridge regression
robustdemo Interactive robust regression
robustfit Robust regression
rsmdemo Interactive response surface
demonstration
rstool Interactive response surface
modeling
stepwise Interactive stepwise regression
stepwisefit Stepwise regression
x2fx Convert predictor matrix to design
matrix
Nonlinear Regression
dummyvar Create dummy variables
hougen Hougen-Watson model
nlinfit Nonlinear regression
nlintool Interactive nonlinear regression
nlmefit Nonlinear mixed-effects estimation
nlparci Nonlinear regression parameter
confidence intervals
nlpredci Nonlinear regression prediction
confidence intervals
Regression Trees
catsplit (classregtree) Categorical splits used for branches
in decision tree
children (classregtree) Child nodes
classcount (classregtree) Class counts
classprob (classregtree) Class probabilities
classregtree Construct classification and
regression trees
cutcategories (classregtree) Cut categories
cutpoint (classregtree) Returns decision tree cut point
values
cuttype (classregtree) Cut types
cutvar (classregtree) Cut variable names
eval (classregtree) Predicted responses
isbranch (classregtree) Test node for branch
nodeerr (classregtree) Return vector of node errors
nodeprob (classregtree) Node probabilities
Ensemble Methods
append (TreeBagger) Append new trees to ensemble
combine (CompactTreeBagger) Combine two ensembles
compact (TreeBagger) Compact ensemble of decision trees
CompactTreeBagger Create CompactTreeBagger object
error (CompactTreeBagger) Error (misclassification probability
or MSE)
error (TreeBagger) Error (misclassification probability
or MSE)
fillProximities (TreeBagger) Proximity matrix for training data
growTrees (TreeBagger) Train additional trees and add to
ensemble
margin (CompactTreeBagger) Classification margin
margin (TreeBagger) Classification margin
mdsProx (CompactTreeBagger) Multidimensional scaling of
proximity matrix
mdsProx (TreeBagger) Multidimensional scaling of
proximity matrix
Multivariate Methods
Multivariate Plots (p. 16-38)
Multidimensional Scaling (p. 16-38)
Procrustes Analysis (p. 16-38)
Feature Selection (p. 16-39)
Feature Transformation (p. 16-39)
Multivariate Plots
andrewsplot Andrews plot
biplot Biplot
glyphplot Glyph plot
parallelcoords Parallel coordinates plot
Multidimensional Scaling
cmdscale Classical multidimensional scaling
mahal Mahalanobis distance
mdscale Nonclassical multidimensional
scaling
pdist Pairwise distance between pairs of
objects
pdist2 Pairwise distance between two sets
of observations
squareform Format distance matrix
Procrustes Analysis
procrustes Procrustes analysis
Feature Selection
sequentialfs Sequential feature selection
Feature Transformation
Nonnegative Matrix Factorization
(p. 16-39)
Principal Component Analysis
(p. 16-39)
Factor Analysis (p. 16-39)
Factor Analysis
Cluster Analysis
Cluster Plots (p. 16-40)
Hierarchical Clustering (p. 16-40)
K-Means Clustering (p. 16-41)
Gaussian Mixture Models (p. 16-41)
Cluster Plots
dendrogram Dendrogram plot
manovacluster Dendrogram of group mean clusters
following MANOVA
silhouette Silhouette plot
Hierarchical Clustering
cluster Construct agglomerative clusters
from linkages
clusterdata Construct agglomerative clusters
from data
cophenet Cophenetic correlation coefficient
inconsistent Inconsistency coefficient
linkage Create agglomerative hierarchical
cluster tree
pdist Pairwise distance between pairs of
objects
pdist2 Pairwise distance between two sets
of observations
squareform Format distance matrix
K-Means Clustering
kmeans K-means clustering
mahal Mahalanobis distance
Model Assessment
confusionmat Confusion matrix
crossval Loss estimate using cross-validation
cvpartition Create cross-validation partition for
data
repartition (cvpartition) Repartition data for cross-validation
test (cvpartition) Test indices for cross-validation
training (cvpartition) Training indices for cross-validation
Classification
Classification Plots (p. 16-42)
Discriminant Analysis (p. 16-42)
Classification Trees (p. 16-42)
Naive Bayes Classification (p. 16-43)
Ensemble Methods (p. 16-44)
Classification Plots
perfcurve Compute Receiver Operating
Characteristic (ROC) curve or other
performance curve for classifier
output
view (classregtree) Plot tree
Discriminant Analysis
classify Discriminant analysis
mahal Mahalanobis distance
Classification Trees
catsplit (classregtree) Categorical splits used for branches
in decision tree
children (classregtree) Child nodes
classcount (classregtree) Class counts
classprob (classregtree) Class probabilities
classregtree Construct classification and
regression trees
cutcategories (classregtree) Cut categories
Ensemble Methods
append (TreeBagger) Append new trees to ensemble
combine (CompactTreeBagger) Combine two ensembles
compact (TreeBagger) Compact ensemble of decision trees
CompactTreeBagger Create CompactTreeBagger object
error (CompactTreeBagger) Error (misclassification probability
or MSE)
error (TreeBagger) Error (misclassification probability
or MSE)
fillProximities (TreeBagger) Proximity matrix for training data
growTrees (TreeBagger) Train additional trees and add to
ensemble
margin (CompactTreeBagger) Classification margin
margin (TreeBagger) Classification margin
mdsProx (CompactTreeBagger) Multidimensional scaling of
proximity matrix
mdsProx (TreeBagger) Multidimensional scaling of
proximity matrix
meanMargin (CompactTreeBagger) Mean classification margin
meanMargin (TreeBagger) Mean classification margin
oobError (TreeBagger) Out-of-bag error
oobMargin (TreeBagger) Out-of-bag margins
oobMeanMargin (TreeBagger) Out-of-bag mean margins
oobPredict (TreeBagger) Ensemble predictions for out-of-bag
observations
outlierMeasure Outlier measure for data
(CompactTreeBagger)
predict (CompactTreeBagger) Predict response
predict (TreeBagger) Predict response
Design of Experiments
DOE Plots (p. 16-47)
Full Factorial Designs (p. 16-47)
Fractional Factorial Designs
(p. 16-48)
Response Surface Designs (p. 16-48)
D-Optimal Designs (p. 16-48)
Latin Hypercube Designs (p. 16-48)
Quasi-Random Designs (p. 16-49)
DOE Plots
interactionplot Interaction plot for grouped data
maineffectsplot Main effects plot for grouped data
multivarichart Multivari chart for grouped data
rsmdemo Interactive response surface demonstration
rstool Interactive response surface modeling
D-Optimal Designs
candexch Candidate set row exchange
candgen Candidate set generation
cordexch Coordinate exchange
daugment D-optimal augmentation
dcovary D-optimal design with fixed covariates
rowexch Row exchange
rsmdemo Interactive response surface demonstration
Quasi-Random Designs
addlistener (qrandstream) Add listener for event
delete (qrandstream) Delete handle object
end (qrandset) Last index in indexing expression for point set
eq (qrandstream) Test handle equality
findobj (qrandstream) Find objects matching specified conditions
findprop (qrandstream) Find property of MATLAB handle object
ge (qrandstream) Greater than or equal relation for handles
gt (qrandstream) Greater than relation for handles
haltonset Construct Halton quasi-random point set
isvalid (qrandstream) Test handle validity
le (qrandstream) Less than or equal relation for handles
length (qrandset) Length of point set
lt (qrandstream) Less than relation for handles
ndims (qrandset) Number of dimensions in matrix
ne (qrandstream) Not equal relation for handles
net (qrandset) Generate quasi-random point set
notify (qrandstream) Notify listeners of event
qrand (qrandstream) Generate quasi-random points from stream
qrandset Abstract quasi-random point set class
qrandstream Construct quasi-random number stream
Statistical Process Control
SPC Plots
capaplot Process capability plot
controlchart Shewhart control charts
histfit Histogram with normal fit
normspec Normal density plot between specifications
SPC Functions
capability Process capability indices
controlrules Western Electric and Nelson control rules
gagerr Gage repeatability and reproducibility study
GUIs
aoctool Interactive analysis of covariance
dfittool Interactive distribution fitting
disttool Interactive density and distribution plots
fsurfht Interactive contour plot
polytool Interactive polynomial fitting
randtool Interactive random number generation
regstats Regression diagnostics
robustdemo Interactive robust regression
rsmdemo Interactive response surface demonstration
rstool Interactive response surface modeling
surfht Interactive contour plot
Utilities
combnk Enumeration of combinations
perms Enumeration of permutations
statget Access values in statistics options structure
statset Create statistics options structure
zscore Standardized z-scores
17
Class Reference
Data Organization
In this section...
“Categorical Arrays” on page 17-2
“Dataset Arrays” on page 17-2
Categorical Arrays
categorical Arrays for categorical data
nominal Arrays for nominal categorical data
ordinal Arrays for ordinal categorical data
Dataset Arrays
dataset Arrays for statistical data
Probability Distributions
In this section...
“Distribution Objects” on page 17-3
“Quasi-Random Numbers” on page 17-3
“Piecewise Distributions” on page 17-4
Distribution Objects
ProbDist Object representing probability distribution
ProbDistKernel Object representing nonparametric probability distribution defined by kernel smoothing
ProbDistParametric Object representing parametric probability distribution
ProbDistUnivKernel Object representing univariate kernel probability distribution
ProbDistUnivParam Object representing univariate parametric probability distribution
Quasi-Random Numbers
haltonset Halton quasi-random point sets
qrandset Quasi-random point sets
qrandstream Quasi-random number streams
sobolset Sobol quasi-random point sets
Piecewise Distributions
paretotails Empirical distributions with Pareto tails
piecewisedistribution Piecewise-defined distributions
Regression Analysis
In this section...
“Regression Trees” on page 17-4
“Ensemble Method Classes” on page 17-4
Regression Trees
classregtree Classification and regression trees
Model Assessment
cvpartition Data partitions for cross-validation
Classification
In this section...
“Classification Trees” on page 17-5
“Naive Bayes Classification” on page 17-5
“Ensemble Method Classes” on page 17-5
Classification Trees
classregtree Classification and regression trees
18
Functions — Alphabetical List
addedvarplot
Syntax addedvarplot(X,y,num,inmodel)
addedvarplot(X,y,num,inmodel,stats)
Examples Load the data in hald.mat, which contains observations of the heat of
reaction of various cement mixtures:
load hald
whos
The wide scatter and the low slope of the fitted line are evidence against
the statistical significance of adding the third column to the model.
categorical.addlevels
Syntax B = addlevels(A,newlevels)
Examples Example 1
Add levels for additional species in Fisher’s iris data:
load fisheriris
species = nominal(species,...
{'Species1','Species2','Species3'},...
{'setosa','versicolor','virginica'});
species = addlevels(species,{'Species4','Species5'});
getlabels(species)
ans =
'Species1' 'Species2' 'Species3' 'Species4' 'Species5'
Example 2
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add new levels to smoke for more detailed histories of the smokers:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the undifferentiated 'Yes' level:
patients.smoke = droplevels(patients.smoke,'Yes');
qrandstream.addlistener
Syntax el = addlistener(hsource,'eventname',callback)
el = addlistener(hsource,property,'eventname',callback)
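A minimal sketch: qrandstream is a handle class, so the built-in ObjectBeingDestroyed event is always available (the stream dimension and the callback shown are arbitrary illustrations):
q = qrandstream('halton',2);                % 2-D Halton stream
el = addlistener(q,'ObjectBeingDestroyed',...
    @(src,evt) disp('stream destroyed'));   % callback runs on delete
delete(q)                                   % displays 'stream destroyed'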
gmdistribution.AIC property
andrewsplot
Syntax andrewsplot(X)
andrewsplot(X,...,'Standardize',standopt)
andrewsplot(X,...,'Quantile',alpha)
andrewsplot(X,...,'Group',group)
andrewsplot(X,...,'PropName',PropVal,...)
h = andrewsplot(X,...)
load fisheriris
andrewsplot(meas,'group',species)
andrewsplot(meas,'group',species,'quantile',.25)
anova1
Syntax p = anova1(X)
p = anova1(X,group)
p = anova1(X,group,displayopt)
[p,table] = anova1(...)
[p,table,stats] = anova1(...)
4 The mean squares (MS) for each source, which is the ratio SS/df.
The box plot of the columns of X suggests the size of the F-statistic and
the p value. Large differences in the center lines of the boxes correspond
to large values of F and correspondingly small values of p.
Columns of X with NaN values are disregarded.
p = anova1(X,group) performs ANOVA by group.
If X is a matrix, anova1 treats each column as a separate group, and
evaluates whether the population means of the columns are equal. This
form of anova1 is appropriate when each group has the same number of
elements (balanced ANOVA). group can be a character array or a cell
array of strings, with one row per column of X, containing group names.
Enter an empty array ([]) or omit this argument if you do not want to
specify group names.
If X is a vector, group must be a categorical variable, vector, string
array, or cell array of strings with one name for each element of X. X
values corresponding to the same value of group are placed in the same
group. This form of anova1 is appropriate when groups have different
numbers of elements (unbalanced ANOVA).
If group contains empty or NaN-valued cells or strings, the corresponding
observations in X are disregarded.
p = anova1(X,group,displayopt) enables the ANOVA table and box
plot displays when displayopt is 'on' (default) and suppresses the
displays when displayopt is 'off'. Notches in the boxplot provide a
test of group medians (see boxplot) different from the F test for means
in the ANOVA table.
[p,table] = anova1(...) returns the ANOVA table (including
column and row labels) in the cell array table. Copy a text version of
the ANOVA table to the clipboard using the Copy Text item on the
Edit menu.
Examples Example 1
Create X with columns that are constants plus random normal
disturbances with mean zero and standard deviation one:
X = meshgrid(1:5)
X =
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
X = X + normrnd(0,1,5,5)
X =
1.3550 2.0662 2.4688 5.9447 5.4897
2.0693 1.7611 1.4864 4.8826 6.3222
2.1919 0.7276 3.1905 4.8768 4.6841
2.7620 1.8179 3.9506 4.4678 4.9291
p = anova1(X)
p =
7.9370e-006
Example 2
The following example is from a study of the strength of structural
beams in Hogg. The vector strength measures deflections of beams in
thousandths of an inch under 3,000 pounds of force. The vector alloy
identifies each beam as steel ('st'), alloy 1 ('al1'), or alloy 2 ('al2').
(Although alloy is sorted in this example, grouping variables do not
need to be sorted.) The null hypothesis is that steel beams are equal in
strength to beams made of the two more expensive alloys.
alloy = {'st','st','st','st','st','st','st','st',...
'al1','al1','al1','al1','al1','al1',...
'al2','al2','al2','al2','al2','al2'};
p = anova1(strength,alloy)
p =
1.5264e-004
The p value suggests rejection of the null hypothesis. The box plot
shows that steel beams deflect more than beams made of the more
expensive alloys.
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
anova2
Syntax p = anova2(X,reps)
p = anova2(X,reps,displayopt)
[p,table] = anova2(...)
[p,table,stats] = anova2(...)
With reps = 2, the data in X must be arranged so that columns correspond to the levels of factor A and each group of reps consecutive rows corresponds to a level of factor B:

$$
\begin{bmatrix}
x_{111} & x_{121} \\
x_{112} & x_{122} \\
x_{211} & x_{221} \\
x_{212} & x_{222} \\
x_{311} & x_{321} \\
x_{312} & x_{322}
\end{bmatrix}
$$

Here the two columns hold A = 1 and A = 2, rows 1-2 hold the replicates for B = 1, rows 3-4 for B = 2, and rows 5-6 for B = 3. anova2 returns the p-values for three null hypotheses:
1 The p value for the null hypothesis, H0A, that all samples from factor
A (i.e., all column-samples in X) are drawn from the same population
2 The p value for the null hypothesis, H0B, that all samples from factor
B (i.e., all row-samples in X) are drawn from the same population
3 The p value for the null hypothesis, H0AB, that the effects due to
factors A and B are additive (i.e., that there is no interaction between
factors A and B)
If any p value is near zero, this casts doubt on the associated null
hypothesis. A sufficiently small p value for H0A suggests that at least
one column-sample mean is significantly different from the other
column-sample means; i.e., there is a main effect due to factor A. A
sufficiently small p value for H0B suggests that at least one row-sample
mean is significantly different from the other row-sample means; i.e.,
there is a main effect due to factor B. A sufficiently small p value for
H0AB suggests that there is an interaction between factors A and B.
The choice of a limit for the p value to determine whether a result
is “statistically significant” is left to the researcher. It is common to
declare a result significant if the p value is less than 0.05 or 0.01.
anova2 also displays a figure showing the standard ANOVA table,
which divides the variability of the data in X into three or four parts
depending on the value of reps:
• The fourth shows the Mean Squares (MS), which is the ratio SS/df.
• The fifth shows the F statistics, which is the ratio of the mean
squares.
Examples The data below come from a study of popcorn brands and popper type
(Hogg 1987). The columns of the matrix popcorn are brands (Gourmet,
National, and Generic). The rows are popper type (Oil and Air). The
study popped a batch of each brand three times with each popper. The
values are the yield in cups of popped popcorn.
load popcorn
popcorn
popcorn =
5.5000 4.5000 3.5000
5.5000 4.5000 4.0000
6.0000 4.0000 3.0000
6.5000 5.0000 4.0000
7.0000 5.5000 5.0000
p = anova2(popcorn,3)
p =
0.0000 0.0001 0.7462
The vector p shows the p-values for the three brands of popcorn, 0.0000,
the two popper types, 0.0001, and the interaction between brand and
popper type, 0.7462. These values indicate that both popcorn brand and
popper type affect the yield of popcorn, but there is no evidence of a
synergistic (interaction) effect of the two.
The conclusion is that you can get the greatest yield using the Gourmet
brand and an Air popper (the three values popcorn(4:6,1)).
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
anovan
Syntax p = anovan(y,group)
p = anovan(y,group,param,val)
[p,table] = anovan(y,group,param,val)
[p,table,stats] = anovan(y,group,param,val)
[p,table,stats,terms] = anovan(y,group,param,val)
Parameter Value
'alpha' A number between 0 and 1 requesting 100(1 –
alpha)% confidence bounds (default 0.05 for 95%
confidence)
'continuous' A vector of indices indicating which grouping
variables should be treated as continuous predictors
rather than as categorical predictors.
'display' 'on' displays an ANOVA table (the default)
'off' omits the display
'model' The type of model used. See “Model Type” on page
18-25 for a description of this parameter.
'nested' A matrix M of 0’s and 1’s specifying the nesting
relationships among the grouping variables. M(i,j) is
1 if variable i is nested in variable j.
'random' A vector of indices indicating which grouping
variables are random effects (all are fixed by default).
See “ANOVA with Random Effects” on page 8-19 for
an example of how to use 'random'.
'sstype' 1, 2, 3 (default), or h specifies the type of sum of
squares. See “Sum of Squares” on page 18-26 for a
description of this parameter.
'varnames' A character matrix or a cell array of strings specifying
names of grouping variables, one per grouping
variable. When you do not specify 'varnames', the
default labels 'X1', 'X2', 'X3', ..., 'XN' are used.
See “ANOVA with Random Effects” on page 8-19 for
an example of how to use 'varnames'.
Model Type
This section explains how to use the argument 'model' with the syntax:
[...] = anovan(y,group,'model',modeltype)
The argument modeltype, which specifies the type of model the function
uses, can be any one of the following:
• 'linear' — The default 'linear' model computes the p-values for null hypotheses on the N main effects only.
• 'interaction' — The 'interaction' model computes the p-values for null hypotheses on the N main effects and the $\binom{N}{2}$ two-factor interactions.
• 'full' — The 'full' model computes the p-values for null
hypotheses on the N main effects and interactions at all levels.
• An integer — For an integer value of modeltype, k (k ≤ N),
anovan computes all interaction levels through the kth level. For
example, the value 3 means main effects plus two- and three-factor
interactions. The values k = 1 and k = 2 are equivalent to the
'linear' and 'interaction' specifications, respectively, while the
value k = N is equivalent to the 'full' specification.
• A matrix of term definitions having the same form as the input to the
x2fx function. All entries must be 0 or 1 (no higher powers).
For more precise control over the main and interaction terms that
anovan computes, modeltype can specify a matrix containing one row
for each main or interaction term to include in the ANOVA model. Each
row defines one term using a vector of N zeros and ones. The table
below illustrates the coding for a 3-factor ANOVA.
Sum of Squares
This section explains how to use the argument 'sstype' with the
syntax:
[...] = anovan(y,group,'sstype',type)
This syntax computes the ANOVA using the type of sum of squares
specified by type, which can be 1, 2, 3, or h. While the numbers 1 – 3
designate Type 1, Type 2, or Type 3 sum of squares, respectively, h
represents a hierarchical model similar to type 2, but with continuous
as well as categorical factors used to determine the hierarchy of
terms. The default value is 3. For a model containing main effects
but no interactions, the value of type only influences computations
on unbalanced data.
The sum of squares for any term is determined by comparing two
models. The Type 1 sum of squares for a term is the reduction in
residual sum of squares obtained by adding that term to a fit that
already includes the terms listed before it. The Type 2 sum of squares is the reduction in residual sum of squares obtained by adding that term to a model consisting of all other terms.
The models for Type 3 sum of squares have sigma restrictions imposed.
This means, for example, that in fitting R(B, AB), the array of AB
effects is constrained to sum to 0 over A for each value of B, and over B
for each value of A.
This defines a three-way ANOVA with two levels of each factor. Every
observation in y is identified by a combination of factor levels. If the
factors are A, B, and C, then observation y(1) is associated with
• Level 1 of factor A
• Level 'hi' of factor B
• Level 'may' of factor C
Similarly, observation y(2) is associated with
• Level 2 of factor A
• Level 'hi' of factor B
• Level 'june' of factor C
p = anovan(y,{g1 g2 g3})
p =
0.4174
0.0028
0.9140
Output vector p contains p-values for the null hypotheses on the N main
effects. Element p(1) contains the p value for the null hypothesis,
H0A, that samples at all levels of factor A are drawn from the same
population; element p(2) contains the p value for the null hypothesis,
H0B, that samples at all levels of factor B are drawn from the same
population; and so on.
If any p value is near zero, this casts doubt on the associated null
hypothesis. For example, a sufficiently small p value for H0A suggests
that at least one A-sample mean is significantly different from the other
A-sample means; that is, there is a main effect due to factor A. You
need to choose a bound for the p value to determine whether a result is
statistically significant. It is common to declare a result significant if
the p value is less than 0.05 or 0.01.
Two-Factor Interactions
By default, anovan computes p-values just for the three main effects.
To also compute p-values for the two-factor interactions, X1*X2, X1*X3, and X2*X3, set the 'model' parameter to 'interaction':
p = anovan(y,{g1 g2 g3},'model','interaction')
p =
0.0347
0.0048
0.2578
0.0158
0.1444
0.5000
The first three entries of p are the p-values for the main effects. The
last three entries are the p-values for the two-factor interactions. You
can determine the order in which the two-factor interactions occur from
the ANOVAN table shown in the following figure.
Field Description
coeffs Estimated coefficients
coeffnames Name of term for each coefficient
vars Matrix of grouping variable values for each term
resid Residuals from the fitted model
The stats structure also contains the following fields if there are
random effects:
Field Description
ems Expected mean squares
denom Denominator definition
rtnames Names of random terms
varest Variance component estimates (one per random term)
varci Confidence intervals for variance components
Examples “Example: Two-Way ANOVA” on page 8-10 shows how to use anova2 to
analyze the effects of two factors on a response in a balanced design.
For a design that is not balanced, use anovan instead.
The data in carbig.mat give measurements on 406 cars. Use anovan
to study how the mileage depends on where and when the cars were
made:
load carbig
p = anovan(MPG,{org when},'model',2,'sstype',3,...
'varnames',{'Origin';'Mfg date'})
p =
0
0
0.3059
The p value for the interaction term is not small, indicating little
evidence that the effect of the year of manufacture (when) depends on
where the car was made (org). The linear effects of those two factors,
however, are significant.
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
ansaribradley
Syntax h = ansaribradley(x,y)
h = ansaribradley(x,y,alpha)
h = ansaribradley(x,y,alpha,tail)
[h,p] = ansaribradley(...)
[h,p,stats] = ansaribradley(...)
[...] = ansaribradley(x,y,alpha,tail,exact)
[...] = ansaribradley(x,y,alpha,tail,exact,dim)
load carsmall
[h,p,stats] = ansaribradley(MPG(Model_Year==82),MPG(Model_Year==76))
h =
0
p =
0.8426
stats =
W: 526.9000
Wstar: 0.1986
aoctool
Syntax aoctool(x,y,group)
aoctool(x,y,group,alpha)
aoctool(x,y,group,alpha,xname,yname,gname)
aoctool(x,y,group,alpha,xname,yname,gname,displayopt)
aoctool(x,y,group,alpha,xname,yname,gname,displayopt,model)
h = aoctool(...)
[h,atab,ctab] = aoctool(...)
[h,atab,ctab,stats] = aoctool(...)
You can use the figures to change models and to test different parts
of the model. More information about interactive use of the aoctool
function appears in “Analysis of Covariance Tool” on page 8-27.
aoctool(x,y,group,alpha) determines the confidence levels of the
prediction intervals. The confidence level is 100(1-alpha)%. The
default value of alpha is 0.05.
aoctool(x,y,group,alpha,xname,yname,gname) specifies the name
to use for the x, y, and g variables in the graph and tables. If you
enter simple variable names for the x, y, and g arguments, the aoctool
function uses those names. If you enter an expression for one of these
arguments, you can specify a name to use in place of that expression by
supplying these arguments. For example, if you enter m(:,2) as the x
argument, you might choose to enter 'Col 2' as the xname argument.
aoctool(x,y,group,alpha,xname,yname,gname,displayopt) enables
the graph and table displays when displayopt is 'on' (default) and
suppresses those displays when displayopt is 'off'.
aoctool(x,y,group,alpha,xname,yname,gname,displayopt,model)
specifies the initial model to fit. The value of model can be any of the
following:
load carsmall
[h,a,c,s] = aoctool(Weight,MPG,Model_Year,0.05,...
'','','','off','separate lines');
c(:,1:2)
ans =
'Term' 'Estimate'
'Intercept' [45.97983716833132]
' 70' [-8.58050531454973]
' 76' [-3.89017396094922]
' 82' [12.47067927549897]
'Slope' [-0.00780212907455]
' 70' [ 0.00195840368824]
' 76' [ 0.00113831038418]
' 82' [-0.00309671407243]
[h,a,c,s] = aoctool(Weight,MPG,Model_Year,0.05,...
'','','','off','parallel lines');
c(:,1:2)
ans =
'Term' 'Estimate'
'Intercept' [43.38984085130596]
' 70' [-3.27948192983761]
' 76' [-1.35036234809006]
Again, there are different intercepts for each group, but this time the
slopes are constrained to be the same.
TreeBagger.append
Syntax B = append(B,other)
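A minimal sketch, assuming two ensembles grown on the same training data (the data set and tree counts are arbitrary choices):
load fisheriris
B = TreeBagger(10,meas,species);     % 10-tree classification ensemble
other = TreeBagger(5,meas,species);  % 5 more trees on the same data
B = append(B,other);                 % B now contains 15 trees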
ProbDistKernel.BandWidth property
Values For a distribution specified to cover only the positive numbers or only
a finite interval, the data are transformed before the kernel density is
applied, and the bandwidth is on the scale of the transformed data.
Use this information to view and compare the width of the kernel
smoothing function used to create distributions.
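For example, the BandWidth of a kernel fit created with fitdist (the sample data here is an assumption):
x = normrnd(0,1,100,1);
pd = fitdist(x,'kernel');   % returns a ProbDistUnivKernel object
pd.BandWidth                % width of the kernel smoothing function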
barttest
bbdesign
dBB = bbdesign(3)
dBB =
-1 -1 0
-1 1 0
1 -1 0
1 1 0
-1 0 -1
-1 0 1
1 0 -1
1 0 1
0 -1 -1
0 -1 1
0 1 -1
0 1 1
0 0 0
0 0 0
0 0 0
The center point is run 3 times to allow for a more uniform estimate of
the prediction variance over the entire design space.
Visualize the design as follows:
plot3(dBB(:,1),dBB(:,2),dBB(:,3),'ro',...
'MarkerFaceColor','b')
X = [1 -1 -1 -1 1 -1 -1 -1 1 1 -1 -1; ...
1 1 1 -1 1 1 1 -1 1 1 -1 -1];
Y = [-1 -1 1 -1 -1 -1 1 -1 1 -1 1 -1; ...
1 -1 1 1 1 -1 1 1 1 -1 1 -1];
Z = [1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1; ...
1 1 1 1 -1 -1 -1 -1 1 1 1 1];
line(X,Y,Z,'Color','b')
axis square equal
betacdf
Syntax p = betacdf(X,A,B)
$$p = F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1}(1-t)^{b-1}\,dt$$
Examples x = 0.1:0.2:0.9;
a = 2;
b = 2;
p = betacdf(x,a,b)
p =
0.0280 0.2160 0.5000 0.7840 0.9720
a = [1 2 3];
p = betacdf(0.5,a,a)
p =
0.5000 0.5000 0.5000
betafit
$$F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1}(1-t)^{b-1}\,dt$$
and B( · ) is the Beta function. The elements of data must lie in the
open interval (0, 1), where the beta distribution is defined. However,
it is sometimes also necessary to fit a beta distribution to data that
include exact zeros or ones. For such data, the beta likelihood function
is unbounded, and standard maximum likelihood estimation is not
possible. In that case, betafit maximizes a modified likelihood that
incorporates the zeros or ones by treating them as if they were values
that have been left-censored at sqrt(realmin) or right-censored at
1-eps/2, respectively.
[phat,pci] = betafit(data,alpha) returns confidence intervals on
the a and b parameters in the 2-by-2 matrix pci. The first column of the
matrix contains the lower and upper confidence bounds for parameter
a, and the second column contains the confidence bounds for parameter
b. The optional input argument alpha is a value in the range [0, 1]
specifying the width of the confidence intervals. By default, alpha is
0.05, which corresponds to 95% confidence intervals. The confidence
intervals are based on a normal approximation for the distribution of
the logs of the parameter estimates.
Examples This example generates 100 beta distributed observations. The true
a and b parameters are 4 and 3, respectively. Compare these to the
values returned in p by the beta fit. Note that the columns of ci both
bracket the true parameters.
data = betarnd(4,3,100,1);
[p,ci] = betafit(data,0.01)
p =
5.5328 3.8097
ci =
3.6538 2.6197
8.3781 5.5402
betainv
Syntax X = betainv(P,A,B)
$$x = F^{-1}(p \mid a,b) = \{\,x : F(x \mid a,b) = p\,\}$$

where

$$p = F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1}(1-t)^{b-1}\,dt$$
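For example, with a = 10 and b = 5 (parameter values inferred from the results quoted below):
x = betainv([0.5 0.99],10,5)
x =
0.6742 0.8981
According to this result,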
values less than or equal to 0.6742 and 0.8981 occur with respective
probabilities 0.5 and 0.99.
betalike
r = betarnd(4,3,100,1);
[nlogl,AVAR] = betalike(betafit(r),r)
nlogl =
-27.5996
AVAR =
0.2783 0.1316
0.1316 0.0867
betapdf
Syntax Y = betapdf(X,A,B)
$$y = f(x \mid a,b) = \frac{1}{B(a,b)}\,x^{a-1}(1-x)^{b-1}\,I_{(0,1)}(x)$$
Examples a = [0.5 1; 2 4]
a =
0.5000 1.0000
2.0000 4.0000
y = betapdf(0.5,a,a)
y =
0.6366 1.0000
1.5000 2.1875
betarnd
Syntax R = betarnd(A,B)
R = betarnd(A,B,v)
R = betarnd(A,B,m,n)
R = betarnd(A,B,m,n,o,...)
r = betarnd(a,b)
r =
0.6987 0.6139
0.9102 0.8067
r = betarnd(10,10,[1 5])
r =
0.5974 0.4777 0.5538 0.5465 0.6327
r = betarnd(4,2,2,3)
r =
0.3943 0.6101 0.5768
0.5990 0.2760 0.5474
betastat
Description [M,V] = betastat(A,B), with A>0 and B>0, returns the mean of and
variance for the beta distribution with parameters specified by A and
B. A and B can be vectors, matrices, or multidimensional arrays that
have the same size, which is also the size of M and V. A scalar input
for A or B is expanded to a constant array with the same dimensions
as the other input.
The mean of the beta distribution with parameters a and b is $a/(a+b)$, and the variance is

$$\frac{ab}{(a+b+1)(a+b)^2}$$
a = 1:6;
[m,v] = betastat(a,a)
m =
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
v =
0.0833 0.0500 0.0357 0.0278 0.0227 0.0192
gmdistribution.BIC property
binocdf
Syntax Y = binocdf(X,N,P)
$$y = F(x \mid n,p) = \sum_{i=0}^{x} \binom{n}{i} p^{i} q^{(n-i)} I_{(0,1,\ldots,n)}(i)$$
Examples If a baseball team plays 162 games in a season and has a 50-50 chance
of winning any game, then the probability of that team winning more
than 100 games in a season is:
1 - binocdf(100,162,0.5)
The result is 0.001 (i.e., 1-0.999). If a team wins 100 or more games
in a season, this result suggests that it is likely that the team’s true
probability of winning any game is greater than 0.5.
binofit
Examples This example generates a binomial sample of 100 elements, where the
probability of success in a given trial is 0.6, and then estimates this
probability from the outcomes in the sample.
r = binornd(100,0.6);
[phat,pci] = binofit(r,100)
phat =
0.5800
pci =
0.4771 0.6780
The 95% confidence interval, pci, contains the true value, 0.6.
binoinv
Syntax X = binoinv(Y,N,P)
Examples If a baseball team has a 50-50 chance of winning any game, what is a
reasonable range of games this team might win over a season of 162
games?
binoinv([0.05 0.95],162,0.5)
ans =
71 91
This result means that in 90% of baseball seasons, a .500 team should
win between 71 and 91 games.
binopdf
Syntax Y = binopdf(X,N,P)
$$y = f(x \mid n,p) = \binom{n}{x} p^{x} q^{(n-x)} I_{(0,1,\ldots,n)}(x)$$
Examples A quality inspector tests 200 circuit boards a day. If 2% of the boards have defects, the probability that the inspector will find no defective boards on any given day is:
binopdf(0,200,0.02)
ans =
0.0176
What is the most likely number of defective boards the inspector will
find?
defects=0:200;
y = binopdf(defects,200,.02);
[x,i]=max(y);
defects(i)
ans =
4
binornd
Syntax R = binornd(N,P)
R = binornd(N,P,v)
R = binornd(N,p,m,n)
Algorithm The binornd function uses the direct method, based on the definition of the binomial distribution as a sum of Bernoulli random variables.
Examples n = 10:10:60;
r1 = binornd(n,1./n)
r1 =
2 1 0 1 1 2
r2 = binornd(n,1./n,[1 6])
r2 =
0 1 2 1 3 1
r3 = binornd(n,1./n,1,6)
r3 =
0 1 1 1 0 3
binostat
Description [M,V] = binostat(N,P) returns the mean of and variance for the
binomial distribution with parameters specified by the number of trials,
N, and probability of success for each trial, P. N and P can be vectors,
matrices, or multidimensional arrays that have the same size, which
is also the size of M and V. A scalar input for N or P is expanded to a
constant array with the same dimensions as the other input.
The mean of the binomial distribution with parameters n and p is np.
The variance is npq, where q = 1–p.
Examples n = logspace(1,5,5)
n =
10 100 1000 10000 100000
[m,v] = binostat(n,1./n)
m =
1 1 1 1 1
v =
0.9000 0.9900 0.9990 0.9999 1.0000
[m,v] = binostat(n,1/2)
m =
5 50 500 5000 50000
v =
1.0e+04 *
0.0003 0.0025 0.0250 0.2500 2.5000
biplot
Purpose Biplot
Syntax biplot(coefs)
h = biplot(coefs,'Name',Value)
load carsmall
x = [Acceleration Displacement Horsepower MPG Weight];
x = x(all(~isnan(x),2),:);
[coefs,score] = princomp(zscore(x));
View the data and the original variables in the space of the first three
principal components:
vbls = {'Accel','Disp','HP','MPG','Wgt'};
biplot(coefs(:,1:3),'scores',score(:,1:3),...
'varlabels',vbls);
bootci
Syntax ci = bootci(nboot,bootfun,...)
ci = bootci(nboot,{bootfun,...},'alpha',alpha)
ci = bootci(nboot,{bootfun,...},...,'type',type)
ci = bootci(nboot,{bootfun,...},...,'type','student',
'nbootstd',nbootstd)
ci = bootci(nboot,{bootfun,...},...,'type','student','stderr',
stderr)
ci = bootci(nboot,{bootfun,...},...,'Weights',weights)
ci = bootci(nboot,{bootfun,...},...,'Options',options)
ci = bootci(nboot,{bootfun,...},...,'type','student','nbootstd',nbootstd)
computes the studentized bootstrap confidence interval of the statistic
defined by the function bootfun. The standard error of the
bootstrap statistics is estimated using bootstrap, with nbootstd
bootstrap data samples. nbootstd is a positive integer value.
The default value of nbootstd is 100.
ci = bootci(nboot,{bootfun,...},...,'type','student','stderr',stderr)
computes the studentized bootstrap confidence interval of statistics
defined by the function bootfun. The standard error of the bootstrap
statistics is evaluated by the function stderr. stderr is a function
handle. stderr takes the same arguments as bootfun and returns the
standard error of the statistic computed by bootfun.
ci = bootci(nboot,{bootfun,...},...,'Weights',weights)
specifies observation weights. weights must be a vector of non-negative
numbers with at least one positive element. The number of elements
in weights must be equal to the number of rows in non-scalar
input arguments to bootfun. To obtain one bootstrap replicate,
Examples Compute the confidence interval for the capability index in statistical
process control:
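A minimal sketch, assuming simulated process data and specification limits of ±3 (both are illustrative choices):
y = normrnd(1,1,30,1);                % simulated process measurements
LSL = -3; USL = 3;                    % assumed specification limits
capable = @(x)(USL-LSL)./(6*std(x));  % capability index Cp
ci = bootci(2000,capable,y)           % 95% BCa confidence interval for Cp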
bootstrp
with a call to statset. You can retrieve values of the individual fields
with a call to statget. Applicable statset parameters are:
bootstat(1:5,:)
ans =
0.6600
0.7969
0.5807
0.8766
0.9197
Display the indices of the data selected for the first 5 bootstrap samples.
bootsam(:,1:5)
ans =
9 8 15 11 15
14 7 6 7 14
4 6 10 3 11
3 10 11 9 2
15 4 13 4 14
9 4 5 2 10
8 5 4 3 13
1 9 1 15 11
10 8 6 12 3
1 4 5 2 8
1 1 10 6 2
3 10 15 10 8
14 6 10 3 8
13 12 1 2 4
12 6 4 9 8
hist(bootstat)
se = std(bootstat)
se =
0.1327
y = exprnd(5,100,1);
m = bootstrp(100,@mean,y);
[fi,xi] = ksdensity(m);
plot(xi,fi);
y = exprnd(5,100,1);
stats = bootstrp(100,@(x)[mean(x) std(x)],y);
plot(stats(:,1),stats(:,2),'o')
load hald
x = [ones(size(heat)),ingredients];
y = heat;
b = regress(y,x);
yfit = x*b;
resid = y - yfit;
se = std(bootstrp(...
1000,@(bootr)regress(yfit+bootr,x),resid));
boxplot
Syntax boxplot(X)
boxplot(X,G)
boxplot(axes,X,...)
boxplot(...,'Name',value)
Name Value
plotstyle • 'traditional' — Traditional box style.
This is the default.
• 'compact' — Box style designed for plots
with many groups. This style changes the
defaults for some other parameters, as
described in the following table.
boxstyle • 'outline' — Draws an unfilled box with
dashed whiskers. This is the default.
• 'filled' — Draws a narrow filled box with
lines for whiskers.
colorgroup One or more grouping variables, of the same
type as permitted for G, specifying that the
box color should change when the specified
variables change. The default is [] for no box
color change.
colors Colors for boxes, specified as a single color
(such as 'r' or [1 0 0]) or multiple colors
(such as 'rgbm' or a three-column matrix of
RGB values). The sequence is replicated or
truncated as required, so for example 'rb'
gives boxes that alternate in color. The default
when no 'colorgroup' is specified is to use
the same color scheme for all boxes. The
default when 'colorgroup' is specified is a
modified hsv colormap.
datalim A two-element vector containing lower and
upper limits, used by 'extrememode' to
determine which points are extreme. The
default is [-Inf Inf].
extrememode • 'clip' — Moves data outside the datalim
limits to the limit. This is the default.
• 'compress' — Evenly distributes data
outside the datalim limits in a region just
outside the limit, retaining the relative
order of the points.
fullfactors • 'off' — One group for each unique row of
G. This is the default.
• 'on' — Create a group for each possible
combination of group variable values,
including combinations that do not appear
in the data.
factorseparator Specifies which factors should have their
values separated by a grid line. The value
may be 'auto' or a vector of grouping variable
numbers. For example, [1 2] adds a separator
line when the first or second grouping variable
changes value. 'auto' is [] for one grouping
variable and [1] for two or more grouping
variables. The default is [].
factorgap Specifies an extra gap to leave between boxes
when the corresponding grouping factor
changes value, expressed as a percentage of
the width of the plot. For example, with [3 1],
the gap is 3% of the width of the plot between
groups with different values of the first
grouping variable, and 1% between groups
with the same value of the first grouping
variable but different values for the second.
'auto' specifies that boxplot should choose a
gap automatically. The default is [].
grouporder Order of groups for plotting, specified as a
cell array of strings. With multiple grouping
variables, separate values within each string
with a comma. Using categorical arrays as
grouping variables is an easier way to control
the order of the boxes. The default is [], which
does not reorder the boxes.
jitter Maximum distance d to displace outliers along
the factor axis by a uniform random amount, in
order to make duplicate points visible. A d of
1 makes the jitter regions just touch between
the closest adjacent groups. The default is 0.
labels A character array, cell array of strings, or
numeric vector of box labels. There may be
one label per group or one label per X value.
Multiple label variables may be specified via a
numeric matrix or a cell array containing any
of these types.
To remove labels from a plot, use the following command:
set(gca,'XTickLabel',{' '})
labelverbosity • 'all' — Displays every label. This is the
default.
• 'minor' — Displays a label for a factor only
when that factor has a different value from
the previous group.
• 'majorminor' — Displays a label for a
factor when that factor or any factor major
to it has a different value from the previous
group.
medianstyle • 'line' — Draws a line for the median. This
is the default.
• 'target' — Draws a black dot inside a
white circle for the median.
notch • 'on' — Draws comparison intervals using
notches when plotstyle is 'traditional',
or triangular markers when plotstyle is
'compact'.
• 'marker' — Draws comparison intervals
using triangular markers.
• 'off' — Omits notches. This is the default.
outliersize Size of the marker used for outliers, in points.
The default is 6 (6/72 inch).
positions Box positions specified as a numeric vector
with one entry per group or X value. The
default is 1:numGroups, where numGroups is
the number of groups.
symbol Symbol and color to use for outliers, using
the same values as the LineSpec parameter
in plot. The default is 'r+'. If the symbol
is omitted then the outliers are invisible; if
the color is omitted then the outliers have the
same color as their corresponding box.
whisker Maximum whisker length w. The default is a
w of 1.5. Points are drawn as outliers if they
are larger than q3 + w(q3 – q1) or smaller than
q1 – w(q3 – q1), where q1 and q3 are the 25th
and 75th percentiles, respectively. The default
of 1.5 corresponds to approximately +/–2.7σ
and 99.3% coverage if the data are normally
distributed. The plotted whisker extends to
the adjacent value, which is the most extreme
data value that is not an outlier. Set whisker
to 0 to give no whiskers and to make every
point outside of q1 and q3 an outlier.
widths A scalar or vector of box widths for when
boxstyle is 'outline'. The default is half
of the minimum separation between boxes,
which is 0.5 when the positions argument
takes its default value. The list of values is
replicated or truncated as necessary.
You can see data values and group names using the data cursor in the
figure window. The cursor shows the original values of any points
affected by the datalim parameter. You can label the group to which
an outlier belongs using the gname function.
To modify graphics properties of a box plot component, use findobj
with the Tag property to find the component’s handle. Tag values for
box plot components depend on parameter settings, and are listed in
the table below.
Examples Example 1
Create a box plot of car mileage, grouped by country:
load carsmall
boxplot(MPG,Origin)
Example 2
Create notched box plots for two groups of sample data:
x1 = normrnd(5,1,100,1);
x2 = normrnd(6,1,100,1);
boxplot([x1,x2],'notch','on')
boxplot([x1,x2],'notch','on','whisker',1)
Example 3
A plotstyle of 'compact' is useful for large numbers of groups:
X = randn(100,25);
subplot(2,1,1)
boxplot(X)
subplot(2,1,2)
boxplot(X,'plotstyle','compact')
piecewisedistribution.boundary
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
18-98
candexch
Parameter Value
display Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
init Initial design as an nruns-by-p matrix, where p is
the number of model terms. The default is a random
subset of the rows of C.
maxiter Maximum number of iterations. The default is 10.
start A matrix of treatments as a nobs-by-p matrix, where
p is the number of model terms, specifying a set of
nobs fixed treatments to include in the design. The
default matrix is empty. candexch finds nruns-nobs
additional rows to add to the 'start' design. The
parameter provides the same functionality as the
daugment function, using a row-exchange algorithm
rather than a coordinate-exchange algorithm.
tries Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Examples The following example uses rowexch to generate a five-run design for a
two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
-1 1
0 0
1 -1
1 0
1 1
The same thing can be done using candgen and candexch in sequence:
[dC,C] = candgen(2,'purequadratic') % Candidate set and design matrix
dC =
-1 -1
0 -1
1 -1
-1 0
0 0
1 0
-1 1
0 1
1 1
C =
1 -1 -1 1 1
1 0 -1 0 1
1 1 -1 1 1
1 -1 0 1 0
1 0 0 0 0
1 1 0 1 0
1 -1 1 1 1
1 0 1 0 1
1 1 1 1 1
treatments = candexch(C,5,'tries',10) % D-opt subset
treatments =
2
1
7
3
4
dRE2 = dC(treatments,:) % Display design
dRE2 =
0 -1
-1 -1
-1 1
1 -1
-1 0
Use the x2fx function to convert the candidate set to a design matrix:
my_C = x2fx(my_dC,'purequadratic')
my_C =
1 0 -1 0 1
1 -1 0 1 0
1 0 0 0 0
1 1 0 1 0
1 0 1 0 1
candgen
Syntax dC = candgen(nfactors,'model')
[dC,C] = candgen(nfactors,'model')
[...] = candgen(nfactors,'model','Name',value)
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
Name Value
bounds Lower and upper bounds for each factor, specified as
a 2-by-nfactors matrix. Alternatively, this value
can be a cell array containing nfactors elements,
each element specifying the vector of allowable
values for the corresponding factor.
categorical Indices of categorical predictors.
levels Vector of number of levels for each factor.
Examples The following example uses rowexch to generate a five-run design for a
two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
-1 1
0 0
1 -1
1 0
1 1
The same thing can be done using candgen and candexch in sequence:
canoncorr
U = (X-repmat(mean(X),N,1))*A
V = (Y-repmat(mean(Y),N,1))*B
hypotheses H0(k), that the (k+1)st through dth correlations are all zero,
for k = 0:(d-1). stats contains seven fields, each a 1-by-d vector with
elements corresponding to the values of k, as described in the following
table:
Field Description
Wilks Wilks’ lambda (likelihood ratio) statistic
chisq
Bartlett’s approximate chi-squared statistic for H0(k)
with Lawley’s modification
pChisq Right-tail significance level for chisq
F
Rao’s approximate F statistic for H0(k)
pF Right-tail significance level for F
df1 Degrees of freedom for the chi-squared statistic, and
the numerator degrees of freedom for the F statistic
df2 Denominator degrees of freedom for the F statistic
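The plot that follows assumes canonical variables computed from the carbig data; a sketch of that setup:
load carbig
X = [Displacement Horsepower Weight];
Y = [Acceleration MPG];
nans = sum(isnan(X),2) | sum(isnan(Y),2);    % drop incomplete rows
[A,B,r,U,V] = canoncorr(X(~nans,:),Y(~nans,:));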
plot(U(:,1),V(:,1),'.')
xlabel('0.0025*Disp+0.020*HP-0.000025*Wgt')
ylabel('-0.17*Accel-0.092*MPG')
capability
Syntax S = capability(data,specs)
• mu — Sample mean
• sigma — Sample standard deviation
• P — Estimated probability of being within limits
• Pl — Estimated probability of being below L
• Pu — Estimated probability of being above U
• Cp — (U-L)/(6*sigma)
• Cpl — (mu-L)./(3.*sigma)
• Cpu — (U-mu)./(3.*sigma)
• Cpk — min(Cpl,Cpu)
Indices are computed under the assumption that data values are
independent samples from a normal population with constant mean
and variance.
Indices divide a “specification width” (between specification limits) by
a “process width” (between control limits). Higher ratios indicate a
process with fewer measurements outside of specification.
data = normrnd(3,0.005,100,1);
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
capaplot(data,[2.99 3.01]);
grid on
capaplot
Syntax p = capaplot(data,specs)
[p,h] = capaplot(data,specs)
data = normrnd(3,0.005,100,1);
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
capaplot(data,[2.99 3.01]);
grid on
caseread
Examples Read the file months.dat created using the casewrite function.
type months.dat
January
February
March
April
May
names = caseread('months.dat')
names =
January
February
March
April
May
casewrite
Syntax casewrite(strmat,'filename')
casewrite(strmat)
strmat = char('January','February','March','April','May');
casewrite(strmat,'months.dat')
type months.dat
January
February
March
April
May
categorical.cat
Syntax c = cat(dim,A,B,...)
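A minimal sketch concatenating two nominal arrays along the first dimension (the labels are arbitrary; for nominal arrays the level sets are merged):
A = nominal({'a';'b'});
B = nominal({'c'});
c = cat(1,A,B)      % 3-by-1 nominal array with levels a, b, c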
categorical class
Copy Semantics Value. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
categorical
dataset.cat
Description ds = cat(dim, ds1, ds2, ...) concatenates the dataset arrays ds1,
ds2, ... along dimension dim by calling the dataset/horzcat or
dataset/vertcat method. dim must be 1 or 2.
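A minimal sketch (the variable name and values are arbitrary):
ds1 = dataset({[1;2],'x'});   % two observations of variable x
ds2 = dataset({[3;4],'x'});
ds = cat(1,ds1,ds2)           % vertical concatenation: 4 observations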
classregtree.catsplit
Syntax v=catsplit(t)
v=catsplit(t,j)
Description v=catsplit(t) returns an n-by-2 cell array v. Each row in v gives left
and right values for a categorical split. For each branch node j based on
a categorical predictor variable z, the left child is chosen if z is in v(j,1)
and the right child is chosen if z is in v(j,2). The splits are in the
same order as nodes of the tree. Nodes for these splits can be found by
running cuttype and selecting 'categorical' cuts from top to bottom.
v=catsplit(t,j) takes an array j of rows and returns the splits for
the specified rows.
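A sketch using the carsmall data, with Cylinders marked as a categorical predictor (whether, and where, categorical splits appear depends on the fitted tree):
load carsmall
X = [MPG Cylinders];
t = classregtree(X,cellstr(Origin),...
    'categorical',2,'names',{'MPG','Cyl'});
v = catsplit(t)   % left/right category values for each categorical split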
gmdistribution.cdf
Syntax y = cdf(obj,X)
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
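Evaluating the mixture cdf at a couple of query points (the points are arbitrary choices):
y = cdf(obj,[0 0; 1 2])   % cdf of the mixture at two 2-D points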
ccdesign
fraction — Fraction of full-factorial cube, expressed as an exponent of 1/2 (for example, a value of 2 specifies a 1/4 fraction).
dCC = ccdesign(2,'type','circumscribed')
dCC =
-1.0000 -1.0000
-1.0000 1.0000
1.0000 -1.0000
1.0000 1.0000
-1.4142 0
1.4142 0
0 -1.4142
0 1.4142
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
The center point is run 8 times to reduce the correlations among the
coefficient estimates.
Visualize the design as follows:
plot(dCC(:,1),dCC(:,2),'ro','MarkerFaceColor','b')
X = [1 -1 -1 -1; 1 1 1 -1];
Y = [-1 -1 1 -1; 1 -1 1 1];
line(X,Y,'Color','b')
axis square equal
cdf
Syntax Y = cdf('name',X,A)
Y = cdf('name',X,A,B)
Y = cdf('name',X,A,B,C)
Examples Compute the cdf of the normal distribution with mean 0 and standard
deviation 1 at inputs –2, –1, 0, 1, 2:
p1 = cdf('Normal',-2:2,0,1)
p1 =
0.0228 0.1587 0.5000 0.8413 0.9772
Compute the cdf of Poisson distributions with parameter values 1 through 5 at inputs 0 through 4:
p2 = cdf('Poisson',0:4,1:5)
p2 =
0.3679 0.4060 0.4232 0.4335 0.4405
piecewisedistribution.cdf
Syntax P = cdf(obj,X)
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
cdf(obj,q)
ans =
0.1000
0.9000
ProbDist.cdf
Syntax Y = cdf(PD, X)
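A minimal sketch, assuming a distribution object created with fitdist (the sample data is an assumption):
x = normrnd(0,1,100,1);
pd = fitdist(x,'normal');   % ProbDistUnivParam object
y = cdf(pd,[-1 0 1])        % cdf values at the query points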
cdfplot
Syntax cdfplot(X)
h = cdfplot(X)
[h,stats] = cdfplot(X)
Field Description
stats.min Minimum value
stats.max Maximum value
stats.mean Sample mean
stats.median Sample median (50th percentile)
stats.std Sample standard deviation
Examples The following example compares the empirical cdf for a sample from
an extreme value distribution with a plot of the cdf for the sampling
distribution. In practice, the sampling distribution would be unknown,
and would be chosen to match the empirical cdf.
y = evrnd(0,3,100,1);
cdfplot(y)
hold on
x = -20:0.1:10;
f = evcdf(x,0,3);
plot(x,f,'m')
legend('Empirical','Theoretical','Location','NW')
categorical.cellstr
Syntax B = cellstr(A)
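A minimal sketch (the levels are arbitrary):
A = nominal({'red';'green';'red'});
B = cellstr(A)    % cell array of strings {'red';'green';'red'}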
dataset.cellstr
Syntax B = cellstr(A)
B = cellstr(A,VARS)
categorical.char
Syntax B = char(A)
chi2cdf
Syntax P = chi2cdf(X,V)
$$p = F(x \mid \nu) = \int_0^x \frac{t^{(\nu-2)/2}\,e^{-t/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\,dt$$
probability = chi2cdf(1:5,1:5)
probability =
0.6827 0.6321 0.6084 0.5940 0.5841
chi2gof
Syntax h = chi2gof(x)
[h,p] = chi2gof(...)
[h,p,stats] = chi2gof(...)
[...] = chi2gof(X,'Name',value)
$$\chi^2 = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{E_i}$$
where Oi are the observed counts and Ei are the expected counts. The
statistic has an approximate chi-square distribution when the counts
are sufficiently large. Bins in either tail with an expected count less
than 5 are pooled with neighboring bins until the count in each extreme
bin is at least 5. If bins remain in the interior with counts less than 5,
chi2gof displays a warning. In this case, you should use fewer bins,
or provide bin centers or edges, to increase the expected counts in all
bins. (See the syntax for specifying optional argument name/value pairs
below.) chi2gof sets the number of bins, nbins, to 10 by default, and
compares the test statistic to a chi-square distribution with nbins – 3
degrees of freedom to take into account the two estimated parameters.
The following name/value pairs determine the null distribution for the
test. Do not specify both cdf and expected.
elements are parameter values, one per cell. The function must take
x as its first argument, and other parameters as later arguments.
• expected — A vector with one element per bin specifying the
expected counts for each bin.
• nparams — The number of estimated parameters; used to adjust
the degrees of freedom to be nbins – 1 – nparams, where nbins is
the number of bins.
• emin — The minimum allowed expected value for a bin; any bin in
either tail having an expected value less than this amount is pooled
with a neighboring bin. Use the value 0 to prevent pooling. The
default is 5.
• frequency — A vector the same length as x containing the frequency
of the corresponding xvalues
• alpha — Significance level for the test. The default is 0.05.
Examples Example 1
Equivalent ways to test against an unspecified normal distribution
with estimated parameters:
x = normrnd(50,5,100,1);
[h,p] = chi2gof(x)
h =
0
p =
0.7532
[h,p] = chi2gof(x,'cdf',@(z)normcdf(z,mean(x),std(x)),'nparams',2)
h =
0
p =
0.7532
[h,p] = chi2gof(x,'cdf',{@normcdf,mean(x),std(x)})
h =
0
p =
0.7532
Example 2
Test against the standard normal:
x = randn(100,1);
[h,p] = chi2gof(x,'cdf',@normcdf)
h =
0
p =
0.9443
Example 3
Test against the standard uniform:
x = rand(100,1);
n = length(x);
edges = linspace(0,1,11);
expectedCounts = n * diff(edges);
[h,p,st] = chi2gof(x,'edges',edges,...
'expected',expectedCounts)
h =
0
p =
0.3191
st =
chi2stat: 10.4000
df: 9
edges: [1x11 double]
O: [6 11 4 12 15 8 14 9 11 10]
E: [1x10 double]
Example 4
Test against the Poisson distribution by specifying observed and
expected counts:
bins = 0:5;
obsCounts = [6 16 10 12 4 2];
n = sum(obsCounts);
lambdaHat = sum(bins.*obsCounts)/n;
expCounts = n*poisspdf(bins,lambdaHat);
[h,p,st] = chi2gof(bins,'ctrs',bins,...
'frequency',obsCounts, ...
'expected',expCounts,...
'nparams',1)
h =
0
p =
0.4654
st =
chi2stat: 2.5550
df: 3
edges: [1x6 double]
O: [6 16 10 12 6]
E: [7.0429 13.8041 13.5280 8.8383 6.0284]
chi2inv
Syntax X = chi2inv(P,V)
$$x = F^{-1}(p \mid \nu) = \{\,x : F(x \mid \nu) = p\,\}$$

where

$$p = F(x \mid \nu) = \int_0^x \frac{t^{(\nu-2)/2}\,e^{-t/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\,dt$$
Examples Find a value that exceeds 95% of the samples from a chi-square
distribution with 10 degrees of freedom.
x = chi2inv(0.95,10)
x =
18.3070
You would observe values greater than 18.3 only 5% of the time by
chance.
chi2pdf
Syntax Y = chi2pdf(X,V)
$$y = f(x \mid \nu) = \frac{x^{(\nu-2)/2}\,e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}$$
Examples nu = 1:6;
x = nu;
y = chi2pdf(x,nu)
y =
0.2420 0.1839 0.1542 0.1353 0.1220 0.1120
chi2rnd
Syntax R = chi2rnd(V)
R = chi2rnd(V,u)
R = chi2rnd(V,m,n)
Examples Note that the first and third commands are the same, but are different
from the second command.
r = chi2rnd(1:6)
r =
0.0037 3.0377 7.8142 0.9021 3.2019 9.0729
r = chi2rnd(6,[1 6])
r =
6.5249 2.6226 12.2497 3.0388 6.3133 5.0388
r = chi2rnd(1:6,1,6)
r =
0.7638 6.0955 0.8273 3.2506 1.5469 10.9197
chi2stat
Description [M,V] = chi2stat(NU) returns the mean of and variance for the
chi-square distribution with degrees of freedom parameters specified
by NU.
The mean of the chi-square distribution is ν, the degrees of freedom
parameter, and the variance is 2ν.
Examples nu = 1:10;
nu = nu'*nu;
[m,v] = chi2stat(nu)
m =
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
6 12 18 24 30 36 42 48 54 60
7 14 21 28 35 42 49 56 63 70
8 16 24 32 40 48 56 64 72 80
9 18 27 36 45 54 63 72 81 90
10 20 30 40 50 60 70 80 90 100
v =
2 4 6 8 10 12 14 16 18 20
4 8 12 16 20 24 28 32 36 40
6 12 18 24 30 36 42 48 54 60
8 16 24 32 40 48 56 64 72 80
10 20 30 40 50 60 70 80 90 100
12 24 36 48 60 72 84 96 108 120
14 28 42 56 70 84 98 112 126 140
16 32 48 64 80 96 112 128 144 160
18 36 54 72 90 108 126 144 162 180
20 40 60 80 100 120 140 160 180 200
classregtree.children
Syntax C = children(t)
C = children(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
C = children(t)
C =
2 3
0 0
4 5
6 7
0 0
8 9
0 0
0 0
0 0
cholcov
Syntax T = cholcov(SIGMA)
[T,num] = cholcov(SIGMA)
[T,num] = cholcov(SIGMA,0)
C1 = [2 1 1 2; 1 2 1 2; 1 1 2 2; 2 2 2 3]; % rank-deficient covariance matrix
T = cholcov(C1)
T =
-0.2113 0.7887 -0.5774 0
0.7887 -0.2113 -0.5774 0
1.1547 1.1547 1.1547 1.7321
C2 = T'*T
C2 =
2.0000 1.0000 1.0000 2.0000
1.0000 2.0000 1.0000 2.0000
1.0000 1.0000 2.0000 2.0000
2.0000 2.0000 2.0000 3.0000
C3 = cov(randn(1e6,3)*T)
C3 =
1.9973 0.9982 0.9995 1.9975
0.9982 1.9962 0.9969 1.9956
0.9995 0.9969 1.9980 1.9972
1.9975 1.9956 1.9972 2.9951
categorical.circshift
Syntax B = circshift(A,shiftsize)
NaiveBayes.CIsNonEmpty property
classregtree.classcount
Syntax P = classcount(t)
P = classcount(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
P = classcount(t)
P =
50 50 50
50 0 0
0 50 50
0 49 5
0 1 45
0 47 1
0 2 4
0 47 0
0 0 1
classify
For the linear and diaglinear types, the quadratic field is absent,
and a row x from the sample array is classified into group I rather than
group J if 0 < K+x*L. For the other types, x is classified into group I if
0 < K+x*L+x*Q*x'.
Examples For training data, use Fisher’s sepal measurements for iris versicolor
and virginica:
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')
[X,Y] = meshgrid(linspace(4.5,8),linspace(2,4));
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y],[SL SW],...
group,'quadratic');
hold on;
gscatter(X,Y,C,'rb','.',1,'off');
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;
% Function to compute K + L*v + v'*Q*v for multiple vectors
% v=[x;y]. Accepts x and y as scalars or column vectors.
f = @(x,y) K + [x y]*L + sum(([x y]*Q) .* [x y], 2);
h2 = ezplot(f,[4.5 8 2 4]);
set(h2,'Color','m','LineWidth',2)
axis([4.5 8 2 4])
xlabel('Sepal Length')
ylabel('Sepal Width')
title('{\bf Classification with Fisher Training Data}')
CompactTreeBagger.ClassNames property
Description The ClassNames property is a cell array containing the class names for
the response variable Y supplied to TreeBagger. This property is empty
for regression trees.
TreeBagger.ClassNames property
Description The ClassNames property is a cell array containing the class names for
the response variable Y. This property is empty for regression trees.
classregtree.classprob
Syntax P = classprob(t)
P = classprob(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
P = classprob(t)
P =
0.3333 0.3333 0.3333
1.0000 0 0
0 0.5000 0.5000
0 0.9074 0.0926
0 0.0217 0.9783
0 0.9792 0.0208
0 0.3333 0.6667
0 1.0000 0
0 0 1.0000
classregtree class
Copy Semantics Value. To learn how this affects your use of the class,
see Comparing Handle and Value Classes in the MATLAB
Object-Oriented Programming documentation.
classregtree
Syntax t = classregtree(X,y)
t = classregtree(X,y,'Name',value)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
NaiveBayes.CLevels property
Description The CLevels property is a vector of the same type as the grouping
variable, containing the unique levels of the grouping variable.
cluster
Syntax T = cluster(Z,'cutoff',c)
T = cluster(Z,'cutoff',c,'depth',d)
T = cluster(Z,'cutoff',c,'criterion',criterion)
T = cluster(Z,'maxclust',n)
load fisheriris
d = pdist(meas);
Z = linkage(d);
c = cluster(Z,'maxclust',3:5);
crosstab(c(:,1),species)
ans =
0 0 2
0 50 48
50 0 0
crosstab(c(:,2),species)
ans =
0 0 1
0 50 47
0 0 2
50 0 0
crosstab(c(:,3),species)
ans =
0 4 0
0 46 47
0 0 1
0 0 2
50 0 0
gmdistribution.cluster
Note The data in X is typically the same as the data used to create
the Gaussian mixture distribution defined by obj. Clustering with
cluster is treated as a separate step, apart from density estimation.
For cluster to provide meaningful clustering with new data, X should
come from the same population as the data used to create obj.
cluster treats NaN values as missing data. Rows of X with NaN values
are excluded from the partition.
[idx,nlogl] = cluster(obj,X) also returns nlogl, the negative
log-likelihood of the data.
[idx,nlogl,P] = cluster(obj,X) also returns the posterior
probabilities of each component for each observation in the n-by-k
matrix P. P(I,J) is the probability of component J given observation I.
[idx,nlogl,P,logpdf] = cluster(obj,X) also returns the n-by-1
vector logpdf containing the logarithm of the estimated probability
density function for each observation. The density estimate for
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
hold on
obj = gmdistribution.fit(X,2);
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
idx = cluster(obj,X);
cluster1 = X(idx == 1,:);
cluster2 = X(idx == 2,:);
delete(h)
h1 = scatter(cluster1(:,1),cluster1(:,2),10,'r.');
h2 = scatter(cluster2(:,1),cluster2(:,2),10,'g.');
legend([h1 h2],'Cluster 1','Cluster 2','Location','NW')
clusterdata
Syntax T = clusterdata(X,cutoff)
T = clusterdata(X,'Name',value)
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'cutoff',cutoff);
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'maxclust',cutoff);
Parameter Value
distance Any of the distance metric names allowed by pdist
(follow the 'minkowski' option by the value of the
exponent p)
linkage Any of the linkage methods allowed by the linkage
function
cutoff Cutoff for inconsistent or distance measure
Examples The example first creates a sample data set of random numbers. It then
uses clusterdata to compute the distances between items in the data
set and create a hierarchical cluster tree from the data set. Finally,
the clusterdata function groups the items in the data set into three
clusters. The example uses the find function to list all the items in
cluster 2, and the scatter3 function to plot the data with each cluster
shown in a different color.
X = [gallery('uniformdata',[10 3],12);...
gallery('uniformdata',[10 3],13)+1.2;...
gallery('uniformdata',[10 3],14)+2.5];
T = clusterdata(X,'maxclust',3);
find(T==2)
ans =
11
12
13
14
15
16
17
18
19
20
scatter3(X(:,1),X(:,2),X(:,3),100,T,'filled')
cmdscale
Syntax Y = cmdscale(D)
[Y,e] = cmdscale(D)
Examples Generate some points in 4-D space, but close to 3-D space, then reduce
them to distances only.
X = [normrnd(0,1,10,3) normrnd(0,.1,10,1)];
D = pdist(X,'euclidean');
[Y,e] = cmdscale(D);
% Poor reconstruction
maxerr2 = max(abs(pdist(X)-pdist(Y(:,1:2))))
% Good reconstruction
maxerr3 = max(abs(pdist(X)-pdist(Y(:,1:3))))
% Exact reconstruction
maxerr4 = max(abs(pdist(X)-pdist(Y)))
% D is now non-Euclidean
D = pdist(X,'cityblock');
[Y,e] = cmdscale(D);
% Poor reconstruction
maxerr = max(abs(pdist(X)-pdist(Y)))
NaiveBayes.CNames property
CompactTreeBagger.combine
Syntax B1 = combine(B1,B2)
combnk
Syntax C = combnk(v,k)
C = combnk('tendril',4);
last5 = C(31:35,:)
last5 =
tedr
tenl
teni
tenr
tend
c = combnk(1:4,2)
c =
3 4
2 4
2 3
1 4
1 3
1 2
TreeBagger.compact
CompactTreeBagger class
Copy Semantics Value. To learn how this affects your use of the class,
see Comparing Handle and Value Classes in the MATLAB
Object-Oriented Programming documentation.
CompactTreeBagger
Description When you use the TreeBagger constructor to grow trees, it creates a
CompactTreeBagger object. You can obtain the compact object from the
full TreeBagger object using the TreeBagger/compact method. You do
not create an instance of CompactTreeBagger directly.
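For example, a minimal sketch (the ensemble size and the Fisher iris data are illustrative choices, not requirements):

load fisheriris
b = TreeBagger(50,meas,species);   % full ensemble of 50 classification trees
cb = compact(b);                   % corresponding CompactTreeBagger object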
TreeBagger.ComputeOOBPrediction property
• OOBIndices
• OOBInstanceWeight
• oobError
• oobMargin
• oobMeanMargin
TreeBagger.ComputeOOBVarImp property
• OOBPermutedVarDeltaError
• OOBPermutedVarDeltaMeanMargin
• OOBPermutedVarCountRaiseMargin
confusionmat
Syntax C = confusionmat(group,grouphat)
C = confusionmat(group,grouphat,'order',grouporder)
[C,order] = confusionmat(...)
Examples Example 1
Display the confusion matrix for data with two misclassifications and
one missing classification:
g1 = [1 1 2 2 3 3];   % known groups
g2 = [1 1 2 3 4 NaN]; % predicted groups
[C,order] = confusionmat(g1,g2)
C =
2 0 0 0
0 1 1 0
0 0 0 1
0 0 0 0
order =
1
2
3
4
Example 2
Randomize the measurements and groups in Fisher’s iris data:
load fisheriris
numObs = length(species);
p = randperm(numObs);
meas = meas(p,:);
species = species(p);
half = floor(numObs/2);
training = meas(1:half,:);
trainingSpecies = species(1:half);
sample = meas(half+1:end,:);
grouphat = classify(sample,training,trainingSpecies);
group = species(half+1:end);
[C,order] = confusionmat(group,grouphat)
C =
22 0 0
2 22 0
0 0 29
order =
'virginica'
'versicolor'
'setosa'
controlchart
Syntax controlchart(X)
controlchart(x,group)
controlchart(X,group)
[stats,plotdata] = controlchart(x,[group])
controlchart(x,group,'name',value)
Examples Create xbar and r control charts for the data in parts.mat:
load parts
st = controlchart(runout,'chart',{'xbar' 'r'});
controlrules
Syntax R = controlrules('rules',x,cl,se)
[R,RULES] = controlrules('rules',x,cl,se)
For multi-point rules, a rule violation at point i indicates that the set
of points ending at point i triggered the rule. Point i is considered to
have violated the rule only if it is one of the points violating the rule’s
condition.
Any points with NaN as their x, cl, or se values are not considered to
have violated rules, and are not counted in the rules for other points.
Control rules can be specified in the controlchart function as values
for the 'rules' parameter.
[R,RULES] = controlrules('rules',x,cl,se) returns a cell array
of text strings RULES listing the rules applied.
Examples Create an xbar chart using the we2 rule to mark out-of-control
measurements:
load parts;
st = controlchart(runout,'rules','we2');
x = st.mean;
cl = st.mu;
se = st.sigma./sqrt(st.n);
hold on
plot(cl+2*se,'m')
R = controlrules('we2',x,cl,se);
I = find(R)
I =
21
23
24
25
26
27
gmdistribution.Converged property
Description Logical true if the algorithm has converged; logical false if the
algorithm has not converged.
cophenet
Syntax c = cophenet(Z,Y)
[c,d] = cophenet(Z,Y)
The cophenetic correlation coefficient is defined as

$c = \frac{\sum_{i<j}(Y_{ij}-y)(Z_{ij}-z)}{\sqrt{\sum_{i<j}(Y_{ij}-y)^{2}\,\sum_{i<j}(Z_{ij}-z)^{2}}}$

where:
• $Y_{ij}$ is the distance between objects i and j in Y.
• $Z_{ij}$ is the cophenetic distance between objects i and j, from Z.
• y and z are the averages of Y and Z, respectively.
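A short usage sketch (random data; the 'average' linkage method is chosen for illustration):

X = [rand(10,2); rand(10,2)+1];
Y = pdist(X);
Z = linkage(Y,'average');
c = cophenet(Z,Y)   % values near 1 indicate the tree preserves the distances well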
copulacdf
Syntax Y = copulacdf('Gaussian',U,rho)
Y = copulacdf('t',U,rho,NU)
Y = copulacdf('family',U,alpha)
Examples u = linspace(0,1,10);
[U1,U2] = meshgrid(u,u);
F = copulacdf('Clayton',[U1(:) U2(:)],1);
surf(U1,U2,reshape(F,10,10))
xlabel('u1')
ylabel('u2')
copulafit
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
scatterhist(x,y)
Transform the data to the copula scale (unit square) using a kernel
estimator of the cumulative distribution function:
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
scatterhist(u,v)
xlabel('u')
ylabel('v')
Fit a t copula to the transformed data, then generate a random sample
from the fitted copula:
[Rho,nu] = copulafit('t',[u v],'Method','ApproximateML');
r = copularnd('t',Rho,nu,1000);
u1 = r(:,1);
v1 = r(:,2);
scatterhist(u1,v1)
xlabel('u')
ylabel('v')
set(get(gca,'children'),'marker','.')
Transform the random sample back to the original scale of the data:
x1 = ksdensity(x,u1,'function','icdf');
y1 = ksdensity(y,v1,'function','icdf');
scatterhist(x1,y1)
set(get(gca,'children'),'marker','.')
copulaparam
tau = -0.5
rho = copulaparam('gaussian',tau)
rho =
-0.7071
copulapdf
Syntax Y = copulapdf('Gaussian',U,rho)
Y = copulapdf('t',U,rho,NU)
Y = copulapdf('family',U,alpha)
Examples u = linspace(0,1,10);
[U1,U2] = meshgrid(u,u);
F = copulapdf('Clayton',[U1(:) U2(:)],1);
surf(U1,U2,reshape(F,10,10))
xlabel('u1')
ylabel('u2')
copulastat
Syntax R = copulastat('Gaussian',rho)
R = copulastat('t',rho,NU)
R = copulastat('family',alpha)
R = copulastat(...,'type','type')
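For example, the rank correlations implied by a Gaussian copula with linear correlation 0.7 (the value 0.7 is illustrative):

tau = copulastat('Gaussian',0.7)                     % Kendall's tau (the default type)
rhoS = copulastat('Gaussian',0.7,'type','Spearman')  % Spearman's rho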
copularnd
Syntax U = copularnd('Gaussian',rho,N)
U = copularnd('t',rho,NU,N)
U = copularnd('family',alpha,N)
tau = -0.5
rho = copulaparam('gaussian',tau)
rho =
-0.7071
Use rho to generate a random sample from the corresponding Gaussian
copula:
u = copularnd('Gaussian',rho,100);
cordexch
The order of the columns of X for a full quadratic model with n terms is:
1 The constant term
2 The linear terms in order 1, 2, ..., n
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
4 The squared terms in order 1, 2, ..., n
name Value
bounds Lower and upper bounds for each factor, specified as
a 2-by-nfactors matrix. Alternatively, this value
can be a cell array containing nfactors elements,
each element specifying the vector of allowable
values for the corresponding factor.
categorical Indices of categorical predictors.
display Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
excludefun Handle to a function that excludes undesirable
runs. If the function is f, it must support the syntax
b = f(S), where S is a matrix of treatments with
nfactors columns and b is a vector of Boolean
values with the same number of rows as S. b(i) is
true if the method should exclude the ith row of S.
init Initial design as a nruns-by-nfactors matrix. The
default is a randomly selected set of points.
maxiter Maximum number of iterations. The default is 10.
tries Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Algorithm Both cordexch and rowexch use iterative search algorithms. They
operate by incrementally changing an initial design matrix X to increase
$D = |X^{T}X|$ at each step. In both algorithms, there is randomness
built into the selection of the initial design and into the choice of the
incremental changes. As a result, both algorithms may return locally,
but not globally, D-optimal designs. Run each algorithm multiple times
and select the best result for your final design. Both functions have a
'tries' parameter that automates this repetition and comparison.
Unlike the row-exchange algorithm used by rowexch, cordexch does not
use a candidate set. (Or rather, the candidate set is the entire design
space.) At each step, the coordinate-exchange algorithm exchanges a
single element of X with a new element evaluated at a neighboring
point in design space. The absence of a candidate set reduces demands
on memory, but the smaller scale of the search means that the
coordinate-exchange algorithm is more likely to become trapped in a
local minimum.
Examples Suppose you want a design to estimate the parameters in the following
three-factor, seven-term interaction model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 + \varepsilon$
nfactors = 3;
nruns = 7;
[dCE,X] = cordexch(nfactors,nruns,'interaction','tries',10)
dCE =
-1 1 1
-1 -1 -1
1 1 1
-1 1 -1
1 -1 1
1 -1 -1
-1 -1 1
X =
1 -1 1 1 -1 -1 1
1 -1 -1 -1 1 1 1
1 1 1 1 1 1 1
1 -1 1 -1 -1 1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 -1 -1 1 1 -1 -1
Columns of the design matrix X are the model terms evaluated at each
row of the design dCE. The terms appear in order from left to right:
constant term, linear terms (1, 2, 3), interaction terms (12, 13, 23). Use
X to fit the model, as described in “Linear Regression” on page 9-3, to
response data measured at the design points in dCE.
corr
Description RHO = corr(X) returns a p-by-p matrix containing the pairwise linear
correlation coefficient between each pair of columns in the n-by-p
matrix X.
RHO = corr(X,Y) returns a p1-by-p2 matrix containing the pairwise
correlation coefficient between each pair of columns in the n-by-p1 and
n-by-p2 matrices X and Y.
[RHO,PVAL] = corr(X,Y) also returns PVAL, a matrix of p-values for
testing the hypothesis of no correlation against the alternative that
there is a nonzero correlation. Each element of PVAL is the p value for
the corresponding element of RHO. If PVAL(i, j) is small, say less than
0.05, then the correlation RHO(i,j) is significantly different from zero.
[RHO,PVAL] = corr(X,Y,'name',value) specifies one or more optional
name/value pairs. Specify name inside single quotes. The following
table lists valid parameters and their values.
Parameter Values
type • 'Pearson' (the default) computes Pearson’s
linear correlation coefficient
• 'Kendall' computes Kendall’s tau
• 'Spearman' computes Spearman’s rho
rows • 'all' (the default) uses all rows regardless of
missing values (NaNs)
• 'complete' uses only rows with no missing
values
• 'pairwise' computes RHO(i,j) using rows with no
missing values in column i or j
Using the 'pairwise' option for the rows parameter may return a
matrix that is not positive definite. The 'complete' option always
returns a positive definite matrix, but in general the estimates are
based on fewer observations.
corr computes p-values for Pearson’s correlation using a Student’s
t distribution for a transformation of the correlation. This correlation
is exact when X and Y are normal. corr computes p-values for
Kendall’s tau and Spearman’s rho using either the exact permutation
distributions (for small sample sizes), or large-sample approximations.
corr computes p-values for the two-tailed test by doubling the more
significant of the two one-tailed p-values.
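A short usage sketch (the data here are random and purely illustrative):

x = randn(30,1);
y = 2*x + randn(30,1);                        % y is correlated with x by construction
[rho,pval] = corr(x,y)                        % Pearson correlation (the default)
[rhoS,pvalS] = corr(x,y,'type','Spearman')    % rank correlation instead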
References [1] Gibbons, J.D. (1985) Nonparametric Statistical Inference, 2nd ed.,
M. Dekker.
[4] Best, D.J. and D.E. Roberts (1975) "Algorithm AS 89: The Upper
Tail Probabilities of Spearman’s rho", Applied Statistics, 24:377-379.
corrcov
Syntax R = corrcov(C)
[R,sigma] = corrcov(C)
load hospital
X = [hospital.Weight hospital.BloodPressure];
C = cov(X)
C =
706.0404 27.7879 41.0202
27.7879 45.0622 23.8194
41.0202 23.8194 48.0590
R = corrcoef(X)
R =
1.0000 0.1558 0.2227
0.1558 1.0000 0.5118
0.2227 0.5118 1.0000
corrcov(C)
ans =
1.0000 0.1558 0.2227
0.1558 1.0000 0.5118
0.2227 0.5118 1.0000
TreeBagger.Cost property
gmdistribution.CovType property
coxphfit
Syntax b = coxphfit(X,y)
b = coxphfit(X,y,'name',value)
[b,logl,H,stats] = coxphfit(...)
Name Value
baseline The X values at which to compute the baseline
hazard. Default is mean(X), so the hazard at X is
h(t)*exp((X-mean(X))*b). Enter 0 to compute
the baseline relative to 0, so the hazard at X is
h(t)*exp(X*b).
censoring A Boolean array of the same size as y that is 1
for observations that are right-censored and 0 for
observations that are observed exactly. Default is all
observations observed exactly.
frequency An array of the same size as y containing nonnegative
integer counts. The jth element of this vector gives
the number of times the method observes the jth
element of y and the jth row of X. Default is one
observation per row of X and y.
init A vector containing initial values for the estimated
coefficients b.
options A structure specifying control parameters for
the iterative algorithm used to estimate b. A
call to statset can create this argument. For
parameter names and default values, type
statset('coxphfit').
x = 4*rand(100,1);
A = 50*exp(-0.5*x); B = 2;
y = wblrnd(A,B);
[b,logL,H,stats] = coxphfit(x,y);
Show the Cox estimate of the baseline survivor function together with
the known Weibull function:
stairs(H(:,1),exp(-H(:,2)))
xx = linspace(0,100);
line(xx,1-wblcdf(xx,50*exp(-0.5*mean(x)),B),'color','r')
xlim([0,50])
legend('Survivor Function','Weibull Function')
References [1] Cox, D.R., and D. Oakes. Analysis of Survival Data. London:
Chapman & Hall, 1984.
NaiveBayes.CPrior property
Description The CPrior property is a vector of length NClasses containing the class
priors. The priors for empty classes are zero.
createns
Syntax NS = createns(X)
NS = createns(X,'Name',Value)
- 'minkowski'
- 'chebychev'
• 'exhaustive' — Create an ExhaustiveSearcher object. If
you do not specify NSMethod, this is the default value when the
default criteria for 'kdtree' do not apply.
Distance
A string or a function handle specifying the default distance
metric used when you call the knnsearch method to find nearest
neighbors for future query points. If you specify a distance metric
but not an NSMethod, this input determines the type of object
createns creates, according to the default values described in
NSMethod.
For both KDTreeSearcher and ExhaustiveSearcher objects, the
following options apply:
P
A positive scalar, p, indicating the exponent of the Minkowski
distance. This parameter is only valid when Distance is
'minkowski'. Default is 2.
Cov
A positive definite matrix indicating the covariance matrix when
computing the Mahalanobis distance. This parameter is only
valid when Distance is 'mahalanobis'. Default is nancov(X).
Scale
A vector S with the length equal to the number of columns in
X. Each coordinate of X and each query point is scaled by the
corresponding element of S when computing the standardized
Euclidean distance. This parameter is only valid when Distance
is 'seuclidean'. Default is nanstd(X).
load fisheriris
x = meas(:,3:4);
% Since x has only two columns and the Distance is Minkowski,
% createns creates a KDTreeSearcher object by default:
knnobj = createns(x,'Distance','minkowski','P',5)
knnobj =
KDTreeSearcher
Properties:
BucketSize: 50
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 5
crosstab
Purpose Cross-tabulation
Examples Example 1
Cross-tabulate two vectors with three and four distinct values,
respectively:
x = [1 1 2 3 1]; y = [1 2 5 3 1];
table = crosstab(x,y)
table =
2 1 0 0
0 0 0 1
0 0 1 0
Example 2
Generate two independent vectors, each containing 50 discrete uniform
random numbers in the range 1:3:
x1 = unidrnd(3,50,1);
x2 = unidrnd(3,50,1);
[table,chi2,p] = crosstab(x1,x2)
table =
1 6 7
5 5 2
11 7 6
chi2 =
7.5449
p =
0.1097
At the 95% confidence level, the p value of 0.1097 does not reject the
null hypothesis that table is independent in each dimension.
Example 3
The file carbig.mat contains measurements of large model cars during
the years 1970-1982:
load carbig
[table,chi2,p,labels] = crosstab(cyl4,when,org)
table(:,:,1) =
82 75 25
12 22 38
table(:,:,2) =
0 4 3
23 26 17
table(:,:,3) =
3 3 4
12 25 32
chi2 =
207.7689
p =
0
labels =
'Other' 'Early' 'USA'
'Four' 'Mid' 'Europe'
[] 'Late' 'Japan'
table and labels together show that the number of four-cylinder cars
made in the USA during the late period of the data was table(2,3,1)
or 38 cars.
crossval
testval = fun(XTRAIN,XTEST)
Each time it is called, fun should use XTRAIN to fit a model, then return
some criterion testval computed on XTEST using that fitted model.
X can be a column vector or a matrix. Rows of X correspond to
observations; columns correspond to variables or features. Each row of
vals contains the result of applying fun to one test set. If testval is
a non-scalar value, crossval converts it to a row vector using linear
indexing and stores it in one row of vals.
vals = crossval(fun,X,Y,...) is used when data are stored in
separate variables X, Y, ... . All variables (column vectors, matrices, or
arrays) must have the same number of rows. fun is called with the
training subsets of X, Y, ... , followed by the test subsets of X, Y, ... ,
as follows:
testvals = fun(XTRAIN,YTRAIN,...,XTEST,YTEST,...)
yfit = predfun(XTRAIN,ytrain,XTEST)
Each time it is called, predfun should use XTRAIN and ytrain to fit a
regression model and then return fitted values in a column vector yfit.
Each row of yfit contains the predicted values for the corresponding
row of XTEST. crossval computes the squared errors between yfit
and the corresponding response test set, and returns the overall mean
across all test sets.
mcr = crossval('mcr',X,y,'Predfun',predfun) returns mcr, a
scalar containing a 10-fold cross-validation estimate of misclassification
rate (the proportion of misclassified samples) for the function predfun.
The matrix X contains predictor values and the vector y contains class
labels. predfun should use XTRAIN and YTRAIN to fit a classification
model and return yfit as the predicted class labels for XTEST.
crossval computes the number of misclassifications between yfit
and the corresponding response test set, and returns the overall
misclassification rate across all test sets.
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun),
where criterion is 'mse' or 'mcr', returns a cross-validation estimate
of mean-squared error (for a regression model) or misclassification rate
(for a classification model) with predictor values in X1, X2, ... and,
respectively, response values or class labels in y. X1, X2, ... and y must
have the same number of rows. predfun is a function handle called
with the training subsets of X1, X2, ..., the training subset of y, and the
test subsets of X1, X2, ..., as follows:
yfit=predfun(X1TRAIN,X2TRAIN,...,ytrain,X1TEST,X2TEST,...)
Name Value
holdout A scalar specifying the ratio or the number
of observations p for holdout cross-validation.
When 0 < p < 1, approximately p*n observations
for the test set are randomly selected. When p
is an integer, p observations for the test set are
randomly selected.
kfold A scalar specifying the number of folds k for
k-fold cross-validation.
leaveout Specifies leave-one-out cross-validation. The
value must be 1.
mcreps A positive integer specifying the number of
Monte-Carlo repetitions for validation. If the
first input of crossval is 'mse' or 'mcr',
crossval returns the mean of mean-squared
error or misclassification rate across all of the
Monte-Carlo repetitions. Otherwise, crossval
concatenates the values vals from all of
the Monte-Carlo repetitions along the first
dimension.
partition An object c of the cvpartition class, specifying
the cross-validation type and partition.
stratify A column vector group specifying groups for
stratification. Both training and test sets have
roughly the same class proportions as in group.
NaNs or empty strings in group are treated as
missing values, and the corresponding rows of
the data are ignored.
options A struct that specifies options that govern the
computation of crossval. One option requests
that crossval conduct multiple function
evaluations using multiple processors, if the
Parallel Computing Toolbox is available.
If both partition and mcreps are specified, the first Monte-Carlo repetition uses the
partition information in the cvpartition object, and the repartition
method is called to generate new partitions for each of the remaining
repetitions. If no cross-validation type is specified, the default is 10-fold
cross-validation.
Examples Example 1
Compute mean-squared error for regression using 10-fold
cross-validation:
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf=@(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'predfun',regf)
cvMse =
0.1015
Example 2
Compute misclassification rate using stratified 10-fold cross-validation:
load('fisheriris');
y = species;
X = meas;
cp = cvpartition(y,'k',10); % Stratified cross-validation
classf = @(XTRAIN,ytrain,XTEST)(classify(XTEST,XTRAIN,ytrain));
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
Example 3
Compute the confusion matrix using stratified 10-fold cross-validation:
load('fisheriris');
y = species;
X = meas;
order = unique(y); % Order of the group labels
cp = cvpartition(y,'k',10); % Stratified cross-validation
f = @(xtr,ytr,xte,yte)confusionmat(yte,...
classify(xte,xtr,ytr),'order',order);
cfMat = crossval(f,X,y,'partition',cp);
cfMat = reshape(sum(cfMat),3,3)
cfMat =
50 0 0
0 48 2
0 1 49
categorical.ctranspose
Syntax B = ctranspose(A)
classregtree.cutcategories
Syntax C = cutcategories(t)
C = cutcategories(t,nodes)
load carsmall
t = classregtree([MPG Cylinders],Origin,...
'names',{'MPG' 'Cyl'},'cat',2)
t =
Decision tree for classification
1 if Cyl=4 then node 2 elseif Cyl in {6 8} then node 3 else USA
2 if MPG<31.5 then node 4 elseif MPG>=31.5 then node 5 else USA
3 if Cyl=6 then node 6 elseif Cyl=8 then node 7 else USA
4 if MPG<21.5 then node 8 elseif MPG>=21.5 then node 9 else USA
5 if MPG<41 then node 10 elseif MPG>=41 then node 11 else Japan
6 if MPG<17 then node 12 elseif MPG>=17 then node 13 else USA
7 class = USA
8 class = France
9 class = USA
10 class = Japan
11 class = Germany
12 class = Germany
13 class = USA
view(t)
C = cutcategories(t)
C =
[4] [1x2 double]
[] []
[6] [ 8]
[] []
[] []
[] []
[] []
[] []
[] []
[] []
[] []
[] []
[] []
C{1,2}
ans =
6 8
classregtree.cutpoint
Syntax v = cutpoint(t)
v = cutpoint(t,nodes)
load carsmall
t = classregtree([MPG Cylinders],Origin,...
'names',{'MPG' 'Cyl'},'cat',2)
t =
Decision tree for classification
1 if Cyl=4 then node 2 elseif Cyl in {6 8} then node 3 else USA
2 if MPG<31.5 then node 4 elseif MPG>=31.5 then node 5 else USA
3 if Cyl=6 then node 6 elseif Cyl=8 then node 7 else USA
4 if MPG<21.5 then node 8 elseif MPG>=21.5 then node 9 else USA
5 if MPG<41 then node 10 elseif MPG>=41 then node 11 else Japan
6 if MPG<17 then node 12 elseif MPG>=17 then node 13 else USA
7 class = USA
8 class = France
9 class = USA
10 class = Japan
11 class = Germany
12 class = Germany
13 class = USA
view(t)
v = cutpoint(t)
v =
NaN
31.5000
NaN
21.5000
41.0000
17.0000
NaN
NaN
NaN
NaN
NaN
NaN
NaN
classregtree.cuttype
Syntax c = cuttype(t)
c = cuttype(t,nodes)
load carsmall
t = classregtree([MPG Cylinders],Origin,...
'names',{'MPG' 'Cyl'},'cat',2)
t =
Decision tree for classification
1 if Cyl=4 then node 2 elseif Cyl in {6 8} then node 3 else USA
2 if MPG<31.5 then node 4 elseif MPG>=31.5 then node 5 else USA
3 if Cyl=6 then node 6 elseif Cyl=8 then node 7 else USA
4 if MPG<21.5 then node 8 elseif MPG>=21.5 then node 9 else USA
5 if MPG<41 then node 10 elseif MPG>=41 then node 11 else Japan
6 if MPG<17 then node 12 elseif MPG>=17 then node 13 else USA
7 class = USA
8 class = France
9 class = USA
10 class = Japan
11 class = Germany
12 class = Germany
13 class = USA
view(t)
c = cuttype(t)
c =
'categorical'
'continuous'
'categorical'
'continuous'
'continuous'
'continuous'
''
''
''
''
''
''
''
classregtree.cutvar
Syntax v = cutvar(t)
v = cutvar(t,nodes)
[v,num] = cutvar(...)
load carsmall
t = classregtree([MPG Cylinders],Origin,...
'names',{'MPG' 'Cyl'},'cat',2)
t =
Decision tree for classification
1 if Cyl=4 then node 2 elseif Cyl in {6 8} then node 3 else USA
2 if MPG<31.5 then node 4 elseif MPG>=31.5 then node 5 else USA
3 if Cyl=6 then node 6 elseif Cyl=8 then node 7 else USA
4 if MPG<21.5 then node 8 elseif MPG>=21.5 then node 9 else USA
5 if MPG<41 then node 10 elseif MPG>=41 then node 11 else Japan
6 if MPG<17 then node 12 elseif MPG>=17 then node 13 else USA
7 class = USA
8 class = France
9 class = USA
10 class = Japan
11 class = Germany
12 class = Germany
13 class = USA
view(t)
[v,num] = cutvar(t)
v =
'Cyl'
'MPG'
'Cyl'
'MPG'
'MPG'
'MPG'
''
''
''
''
''
''
''
num =
2
1
2
1
1
1
0
0
0
0
0
0
0
cvpartition class
Copy Semantics Value. To learn how this affects your use of the class,
see Comparing Handle and Value Classes in the MATLAB
Object-Oriented Programming documentation.
load('fisheriris');
CVO = cvpartition(species,'k',10);
err = zeros(CVO.NumTestSets,1);
for i = 1:CVO.NumTestSets
trIdx = CVO.training(i);
teIdx = CVO.test(i);
ytest = classify(meas(teIdx,:),meas(trIdx,:),...
species(trIdx,:));
err(i) = sum(~strcmp(ytest,species(teIdx)));
end
cvErr = sum(err)/sum(CVO.TestSize);
cvpartition
Syntax c = cvpartition(n,'kfold',k)
c = cvpartition(group,'kfold',k)
c = cvpartition(n,'holdout',p)
c = cvpartition(group,'holdout',p)
c = cvpartition(n,'leaveout')
c = cvpartition(n,'resubstitution')
load fisheriris;
y = species;
c = cvpartition(y,'k',10);
fun = @(xT,yT,xt,yt)(sum(~strcmp(yt,classify(xt,xT,yT))));
rate = sum(crossval(fun,meas,y,'partition',c))...
/sum(c.TestSize)
rate =
0.0200
dataset class
Description Dataset arrays are used to collect heterogeneous data and metadata
including variable and observation names into a single container
variable. Dataset arrays are suitable for storing column-oriented or
tabular data that are often stored as columns in a text file or in a
spreadsheet, and can accommodate variables of different types, sizes,
units, etc.
Dataset arrays can contain different kinds of variables, including
numeric, logical, character, categorical, and cell. However, a dataset
array is a different class than the variables that it contains. For
example, even a dataset array that contains only variables that are
double arrays cannot be operated on as if it were itself a double array.
However, using dot subscripting, you can operate on a variable in a
dataset array as if it were a workspace variable.
You can subscript dataset arrays using parentheses much like ordinary
numeric arrays, but in addition to numeric and logical indices, you can
use variable and observation names as indices.
Construction Use the dataset constructor to create a dataset array from variables in
the MATLAB workspace. You can also create a dataset array by reading
data from a text or spreadsheet file. You can access each variable in a
dataset array much like fields in a structure, using dot subscripting. See
the following section for a list of operations available for dataset arrays.
Copy Semantics Value. To learn how this affects your use of the class,
see Comparing Handle and Value Classes in the MATLAB
Object-Oriented Programming documentation.
Examples Load a dataset array from a .mat file and create some simple subsets:
load hospital
h1 = hospital(1:10,:)
h2 = hospital(:,{'LastName' 'Age' 'Sex' 'Smoker'})
dataset
Syntax A = dataset(VAR1,VAR2,...)
A = dataset(...,{VAR,name},...)
A = dataset(...,{VAR,name_1,...,name_m},...)
A = dataset(...,'VarNames',{name_1,...,name_m},...)
A = dataset(...,'ObsNames',{name_1,...,name_n},...)
A = dataset('File',filename,param1,val1,param2,val2,...)
A = dataset('XLSFile',filename,param1,val1,param2,val2,...)
A = dataset('XPTFile',xptfilename, ...)
A = dataset('File',filename,param1,val1,param2,val2,...)
creates dataset array A from column-oriented data in the text file
specified by the string filename. Variables in A are of type double
if data in the corresponding column of the file, following the column
header, are entirely numeric; otherwise the variables in A are cell arrays
of strings. dataset converts empty fields to either NaN (for a numeric
variable) or the empty string (for a string-valued variable). dataset
ignores insignificant white space in the file.
The following optional parameter name/value pairs are available:
A = dataset('XLSFile',filename,param1,val1,param2,val2,...)
creates dataset array A from column-oriented data in the Excel®
spreadsheet specified by the string filename. Variables in A are of
type double if data in the corresponding column of the spreadsheet,
following the column header, are entirely numeric; otherwise the
variables in A are cell arrays of strings. Optional parameter name/value
pairs are as follows:
load cereal
cereal = dataset(Calories,Protein,Fat,Sodium,Fiber,Carbo,...
Sugars,'ObsNames',Name)
cereal.Properties.VarDescription = Variables(4:10,2);
load cities
categories = cellstr(categories);
cities = dataset({ratings,categories{:}},...
'ObsNames',cellstr(names))
patients = dataset('File','hospital.dat',...
'Delimiter',',','ReadObsNames',true)
patients2 = dataset('XLSFile','hospital.xls',...
'ReadObsNames',true)
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'format','%s%s%s%f%f%f%f%f%f%f%f%f', ...
'Delimiter',',','ReadObsNames','on');
You can also load the data without specifying a format string.
dataset will automatically create dataset variables that are either
double arrays or cell arrays of strings, depending on the contents
of the file:
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add levels for smoking history:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the undifferentiated 'Yes' level:
patients.smoke = droplevels(patients.smoke,'Yes');
dataset.datasetfun
Syntax b = datasetfun(fun,A)
[b,c,...] = datasetfun(fun,A)
[b,...] = datasetfun(fun,A,...,'UniformOutput',false)
[b,...] = datasetfun(fun,A,...,'DatasetOutput',true)
[b,...] = datasetfun(fun,A,...,'DataVars',vars)
[b,...] = datasetfun(fun,A,...,'ObsNames',obsnames)
[b,...] = datasetfun(fun,A,...,'ErrorHandler',efun)
For example, an error handler can issue a warning and return
placeholder outputs that match the outputs of fun:
function [A,B] = errorFunc(S,varargin)
warning(S.identifier,S.message);
A = NaN;
B = NaN;
load hospital
stats = ...
datasetfun(@mean,hospital,...
'DataVars',{'Weight','BloodPressure'},...
'UniformOutput',false)
stats =
[154] [1x2 double]
stats{2}
ans =
122.7800 82.9600
datasetfun(@hist,hospital,...
'DataVars','BloodPressure',...
'UniformOutput',false);
title('{\bf Blood Pressure}')
legend('Systolic','Diastolic','Location','N')
daugment
The order of the columns of X for a full quadratic model with n terms is:
1 The constant term
2 The linear terms in order 1, 2, ..., n
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
4 The squared terms in order 1, 2, ..., n
of model are powers for the factors in the columns. For example, if a
model has factors X1, X2, and X3, then a row [0 1 2] in model specifies
the term (X1.^0).*(X2.^1).*(X3.^2). A row of all zeros in model
specifies a constant term, which can be omitted.
[dCE2,X] = daugment(...,param1,val1,param2,val2,...) specifies
additional parameter/value pairs for the design. Valid parameters and
their values are listed in the following table.
Parameter Value
'bounds' Lower and upper bounds for each factor, specified
as a 2-by-nfactors matrix, where nfactors is the
number of factors. Alternatively, this value can be
a cell array containing nfactors elements, each
element specifying the vector of allowable values for
the corresponding factor.
'categorical' Indices of categorical predictors.
'display' Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
'excludefun' Handle to a function that excludes undesirable
runs. If the function is f, it must support the syntax
b = f(S), where S is a matrix of treatments with
nfactors columns, where nfactors is the number
of factors, and b is a vector of Boolean values with
the same number of rows as S. b(i) is true if the ith
row of S should be excluded.
'init' Initial design as an mruns-by-nfactors matrix,
where nfactors is the number of factors. The
default is a randomly selected set of points.
'levels' Vector of number of levels for each factor.
'maxiter' Maximum number of iterations. The default is 10.
'tries' Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Examples The following eight-run design is adequate for estimating main effects
in a four-factor model:
dCEmain = cordexch(4,8)
dCEmain =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
To estimate the six interaction terms in the model, augment the design
with eight additional runs:
dCEinteraction = daugment(dCEmain,8,'interaction')
dCEinteraction =
1 -1 -1 1
-1 -1 1 1
-1 1 -1 1
1 1 1 -1
1 1 1 1
-1 1 -1 -1
1 -1 -1 -1
-1 -1 1 -1
-1 1 1 1
-1 -1 -1 -1
1 -1 1 -1
1 1 -1 1
-1 1 1 -1
1 1 -1 -1
1 -1 1 1
1 1 1 -1
The augmented design is full factorial, with the original eight runs in
the first eight rows.
dcovary
The order of the columns of X for a full quadratic model with n terms is:
1 The constant term
2 The linear terms in order 1, 2, ..., n
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
4 The squared terms in order 1, 2, ..., n
Parameter Value
'bounds' Lower and upper bounds for each factor, specified as
a 2-by-nfactors matrix. Alternatively, this value
can be a cell array containing nfactors elements,
each element specifying the vector of allowable
values for the corresponding factor.
'categorical' Indices of categorical predictors.
'display' Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
'excludefun' Handle to a function that excludes undesirable
runs. If the function is f, it must support the syntax
b = f(S), where S is a matrix of treatments with
nfactors columns and b is a vector of Boolean
values with the same number of rows as S. b(i) is
true if the ith row of S should be excluded.
'init' Initial design as an mruns-by-nfactors matrix. The
default is a randomly selected set of points.
'maxiter' Maximum number of iterations. The default is 10.
'tries' Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Examples Example 1
Suppose you want a design to estimate the parameters in a three-factor
linear additive model, with eight runs that necessarily occur at different
times. If the process experiences temporal linear drift, you may want
to include the run time as a variable in the model. Produce the design
as follows:
time = linspace(-1,1,8)';
[dCV1,X] = dcovary(3,time,'linear')
dCV1 =
-1.0000 1.0000 1.0000 -1.0000
1.0000 -1.0000 -1.0000 -0.7143
-1.0000 -1.0000 -1.0000 -0.4286
1.0000 -1.0000 1.0000 -0.1429
1.0000 1.0000 -1.0000 0.1429
-1.0000 1.0000 -1.0000 0.4286
1.0000 1.0000 1.0000 0.7143
-1.0000 -1.0000 1.0000 1.0000
X =
1.0000 -1.0000 1.0000 1.0000 -1.0000
1.0000 1.0000 -1.0000 -1.0000 -0.7143
1.0000 -1.0000 -1.0000 -1.0000 -0.4286
1.0000 1.0000 -1.0000 1.0000 -0.1429
1.0000 1.0000 1.0000 -1.0000 0.1429
1.0000 -1.0000 1.0000 -1.0000 0.4286
1.0000 1.0000 1.0000 1.0000 0.7143
1.0000 -1.0000 -1.0000 1.0000 1.0000
Example 2
The following example uses the dummyvar function to block an eight-run
experiment into 4 blocks of size 2 for estimating a linear additive model
with two factors:
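A sketch of the setup (the block assignment vector is illustrative; dropping one dummy column avoids a rank-deficient design):

blocks = dummyvar([1 1 2 2 3 3 4 4]');   % 8 runs in 4 blocks of size 2
[dCV2,X] = dcovary(2,blocks(:,1:3),'linear')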
The first two columns of dCV2 contain the settings for the two factors;
the last three columns are dummy variable codings for the four blocks.
CompactTreeBagger.DefaultYfit property
TreeBagger.DefaultYfit property
qrandstream.delete
Syntax delete(h)
Description delete(h) deletes the handle object h, where h is a scalar handle. The
delete method deletes a handle object but does not clear the handle
from the workspace. A deleted handle is no longer valid.
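For example (the Halton stream here is just an illustration):

q = qrandstream('halton',2);   % scalar handle to a quasi-random stream
delete(q)                      % q still appears in the workspace...
isvalid(q)                     % ...but is no longer valid: returns 0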
CompactTreeBagger.DeltaCritDecisionSplit property
TreeBagger.DeltaCritDecisionSplit property
dendrogram
Syntax H = dendrogram(Z)
H = dendrogram(Z,p)
[H,T] = dendrogram(...)
[H,T,perm] = dendrogram(...)
[...] = dendrogram(...,'colorthreshold',t)
[...] = dendrogram(...,'orientation','orient')
[...] = dendrogram(...,'labels',S)
Value Description
'top' Top to bottom (default)
'bottom' Bottom to top
'left' Left to right
'right' Right to left
Examples X = rand(100,2);
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
[H,T] = dendrogram(Z,'colorthreshold','default');
set(H,'LineWidth',2)
find(T==20)
ans =
20
49
62
65
73
96
This output indicates that leaf node 20 in the dendrogram contains the
original data points 20, 49, 62, 65, 73, and 96.
dataset.Description property
dfittool
Syntax dfittool
dfittool(y)
dfittool(y,cens)
dfittool(y,cens,freq)
dfittool(y,cens,freq,dsname)
Description dfittool opens a graphical user interface for fitting
distributions to data. To fit distributions to your data and display
them over plots of the empirical distributions, you can import data
from the workspace.
dfittool(y) displays the Distribution Fitting Tool and creates a data
set with data specified by the vector y.
dfittool(y,cens) uses the vector cens to specify whether each
observation y(j) is censored (cens(j)==1) or observed exactly
(cens(j)==0). If cens is omitted or empty, no y values are censored.
dfittool(y,cens,freq) uses the vector freq to specify the frequency
of each element of y. If freq is omitted or empty, all y values have a
frequency of 1.
dfittool(y,cens,freq,dsname) creates a data set with the name
dsname using the data vector y, censoring indicator cens, and frequency
vector freq.
For more information, see “Modeling Your Data Using the Distribution
Fitting GUI” on page 5-11.
qrandset.Dimensions property
dataset.DimNames property
Purpose Two-element cell array of strings giving names of dimensions of data set
Description A two-element cell array of strings giving the names of the two
dimensions of the data set. The default is {'Observations'
'Variables'}.
categorical.disp
Syntax disp(A)
Description disp(A) prints the categorical array A without printing the array
name. In all other ways it’s the same as leaving the semicolon off an
expression, except that empty arrays don’t display.
classregtree.disp
Syntax disp(t)
cvpartition.disp
Syntax disp(c)
dataset.disp
Syntax disp(ds)
Description disp(ds) prints the dataset array ds, including variable names and
observation names (if present), without printing the dataset name. In
all other ways it’s the same as leaving the semicolon off an expression.
For numeric or categorical variables that are 2-D and have three or
fewer columns, disp prints the actual data using either short g, long
g, or bank format, depending on the current command line setting.
Otherwise, disp prints the size and type of each dataset element.
For character variables that are 2-D and 10 or fewer characters wide,
disp prints quoted strings. Otherwise, disp prints the size and type of
each dataset element.
For cell variables that are 2-D and have three or fewer columns,
disp prints the contents of each cell (or its size and type if too large).
Otherwise, disp prints the size of each dataset element.
For time series variables, disp prints columns for both the time and
the data. If the variable is 2-D and has three or fewer columns, disp
prints the actual data. Otherwise, disp prints the size and type of each
dataset element.
For other types of variables, disp prints the size and type of each
dataset element.
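For example (hospital.mat ships with the toolbox; the chosen variables are arbitrary):

load hospital
disp(hospital(1:3,{'LastName','Sex','Age'}))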
gmdistribution.disp
Syntax disp(obj)
NaiveBayes.disp
Syntax disp(nb)
piecewisedistribution.disp
Syntax disp(A)
qrandset.disp
Syntax disp(p)
qrandstream.disp
Syntax disp(q)
categorical.display
Syntax display(A)
classregtree.display
Syntax display(t)
display(A)
cvpartition.display
Syntax display(c)
dataset.display
Syntax display(ds)
Description display(ds) prints the dataset array ds, including variable names and
observation names (if present). dataset calls display when you do
not use a semicolon to terminate a statement.
For numeric or categorical variables that are 2-D and have three or
fewer columns, display prints the actual data. Otherwise, display
prints the size and type of each dataset element.
For character variables that are 2-D and 10 or fewer characters wide,
display prints quoted strings. Otherwise, display prints the size and
type of each dataset element.
For cell variables that are 2-D and have three or fewer columns,
display prints the contents of each cell (or its size and type if too large).
Otherwise, display prints the size of each dataset element.
For time series variables, display prints columns for both the time and
the data. If the variable is 2-D and has three or fewer columns, display
prints the actual data. Otherwise, display prints the size and type of
each dataset element.
For other types of variables, display prints the size and type of each
dataset element.
gmdistribution.display
Syntax display(obj)
NaiveBayes.display
Syntax display(nb)
piecewisedistribution.display
Syntax display(A)
ProbDist.DistName property
• 'kernel'
• 'beta'
• 'binomial'
• 'birnbaumsaunders'
• 'exponential'
• 'extreme value'
• 'gamma'
• 'generalized extreme value'
• 'generalized pareto'
• 'inversegaussian'
• 'logistic'
• 'loglogistic'
• 'lognormal'
• 'nakagami'
• 'negative binomial'
• 'normal'
• 'poisson'
• 'rayleigh'
• 'rician'
• 'tlocationscale'
• 'weibull'
Use this information to view and compare the type of distribution used
to create distribution objects.
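For example, a fit produced by fitdist records its distribution type (the exponential sample here is illustrative):

pd = fitdist(exprnd(3,100,1),'exponential');
pd.DistName   % returns 'exponential'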
NaiveBayes.Dist property
gmdistribution.DistName property
disttool
Syntax disttool
categorical.double
Syntax B = double(A)
dataset.double
Syntax b = double(A)
b = double(a,vars)
categorical.droplevels
Syntax B = droplevels(A)
B = droplevels(A,oldlevels)
Examples Example 1
Drop unused age levels from the data in hospital.mat:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
AgeGroup = droplevels(AgeGroup);
getlabels(AgeGroup)
ans =
'20s' '30s' '40s' '50s'
Example 2
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add levels for smoking history:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the undifferentiated 'Yes' level:
patients.smoke = droplevels(patients.smoke,'Yes');
dummyvar
Syntax D = dummyvar(group)
Examples Suppose you are studying the effects of two machines and three
operators on a process. Use group to organize predictor data on
machine-operator combinations:
machine = [1 1 1 1 2 2 2 2]';
operator = [1 2 3 1 2 3 1 2]';
group = [machine operator]
group =
1 1
1 2
1 3
1 1
2 2
2 3
2 1
2 2
D = dummyvar(group)
D =
1 0 1 0 0
1 0 0 1 0
1 0 0 0 1
1 0 1 0 0
0 1 0 1 0
0 1 0 0 1
0 1 1 0 0
0 1 0 1 0
dwtest
Examples Fit a straight line to the census data and note the autocorrelation in
the residuals:
load census
n = length(cdate);
X = [ones(n,1),cdate];
[b,bint,r1] = regress(pop,X);
p1 = dwtest(r1,X)
plot(cdate,r1,'b-',cdate,zeros(n,1),'k:')
X = [ones(n,1),cdate,cdate.^2];
[b,bint,r2] = regress(pop,X);
p2 = dwtest(r2,X)
line(cdate,r2,'color','r')
ecdf
Parameter Value
'censoring' Boolean vector of the same size as x. Elements are
1 for observations that are right-censored and 0 for
observations that are observed exactly. Default is all
observations observed exactly.
'frequency' Vector of the same size as x containing nonnegative
integer counts. The jth element of this vector
gives the number of times the jth element of x was
observed. Default is 1 observation per element of x.
'alpha' Value between 0 and 1 for a confidence level of
100(1-alpha)%. Default is alpha=0.05 for 95%
confidence.
'function' Type of function returned as the f output argument,
chosen from 'cdf' (default), 'survivor', or
'cumulative hazard'.
'bounds' Either 'on' to include bounds, or 'off' (the default)
to omit them. Used only for plotting.
Examples Generate random failure times and random censoring times, and
compare the empirical cdf with the known true cdf:
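A minimal sketch (the exponential parameters and sample size are illustrative):

y = exprnd(10,50,1);                 % failure times
d = exprnd(20,50,1);                 % drop-out (censoring) times
t = min(y,d);                        % observed time is whichever comes first
censored = (y > d);                  % true where failure was censored
[f,x] = ecdf(t,'censoring',censored);
stairs(x,f)                          % empirical cdf
hold on
xx = 0:0.1:max(t);
plot(xx,expcdf(xx,10),'r')           % known true cdf
hold off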
References [1] Cox, D. R., and D. Oakes. Analysis of Survival Data. London:
Chapman & Hall, 1984.
18-352
ecdfhist
Syntax n = ecdfhist(f,x)
n = ecdfhist(f,x,m)
n = ecdfhist(f,x,c)
[n,c] = ecdfhist(...)
ecdfhist(...)
Examples The following code generates random failure times and random
censoring times, and compares the empirical pdf with the known true
pdf.
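A sketch of the setup (parameters illustrative); the styling commands below then act on the resulting histogram:

y = exprnd(10,50,1);                 % failure times
d = exprnd(20,50,1);                 % drop-out (censoring) times
t = min(y,d);
censored = (y > d);
[f,x] = ecdf(t,'censoring',censored);
ecdfhist(f,x)                        % histogram estimate of the pdf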
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
categorical.end
Syntax end(A,k,n)
dataset.end
Syntax end(A,k,n)
qrandset.end
Syntax end(p,k,n)
Description end(p,k,n) is called for indexing expressions involving the point set
p when end is part of the k-th index out of n indices. For example, the
expression p(end-1,:) calls p’s end method with end(p,1,2).
evcdf
Syntax P = evcdf(X,mu,sigma)
[P,PLO,PUP] = evcdf(X,mu,sigma,pcov,alpha)
evcdf computes confidence bounds for P using a normal approximation
to the distribution of the estimate

$(X - \hat{\mu})/\hat{\sigma}$
and then transforming those bounds to the scale of the output P. The
computed bounds give approximately the desired confidence level when
you estimate mu, sigma, and pcov from large samples, but in smaller
samples other methods of computing the confidence bounds might be
more accurate.
The type 1 extreme value distribution is also known as the Gumbel
distribution. The version used here is suitable for modeling minima;
the mirror image of this distribution can be used to model maxima by
negating X. See “Extreme Value Distribution” on page B-19 for more
details. If x has a Weibull distribution, then X = log(x) has the type 1
extreme value distribution.
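A quick numeric check of this relationship (the Weibull parameters are illustrative):

x = wblrnd(0.5,2,1000,1);     % Weibull sample with a = 0.5, b = 2
parmhat = evfit(log(x))       % roughly [log(0.5) 1/2] = [-0.6931 0.5000]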
evfit
evinv
Syntax X = evinv(P,mu,sigma)
[X,XLO,XUP] = evinv(P,mu,sigma,pcov,alpha)
evinv computes confidence bounds for X using a normal approximation
to the distribution of the estimate

$\hat{\mu} + \hat{\sigma}\,q$

where q is the Pth quantile from an extreme value distribution with
parameters µ = 0 and σ = 1.
qrandstream.eq
Syntax h1 == h2
tf = eq(h1, h2)
CompactTreeBagger.error
TreeBagger.error
classregtree.eval
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
sfit = eval(t,meas);
pct = mean(strcmp(sfit,species))
pct =
0.9800
evlike
evpdf
Syntax Y = evpdf(X,mu,sigma)
evrnd
Syntax R = evrnd(mu,sigma)
R = evrnd(mu,sigma,v)
R = evrnd(mu,sigma,m,n)
evstat
expcdf
Syntax P = expcdf(X,mu)
[P,PLO,PUP] = expcdf(X,mu,pcov,alpha)
$p = F(x \mid \mu) = \int_0^x \frac{1}{\mu}\,e^{-t/\mu}\,dt = 1 - e^{-x/\mu}$
Examples The following code shows that the median of the exponential
distribution is µ*log(2).
mu = 10:10:60;
p = expcdf(log(2)*mu,mu)
p =
0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
mu = 1:6;
x = mu;
p = expcdf(x,mu)
p =
0.6321 0.6321 0.6321 0.6321 0.6321 0.6321
expfit
mu = 3;
data = exprnd(mu,100,1); % Simulated data
[muhat,muci] = expfit(data)
muhat =
2.7511
muci =
2.2826
3.3813
ExhaustiveSearcher class
Superclasses NeighborSearcher
Name/Value Pairs
Both the ExhaustiveSearcher and the createns functions accept one
or more of the following optional name/value pairs as input:
Distance
A string or function handle specifying the default distance metric
used when you call the knnsearch method.
P
A positive scalar indicating the exponent of Minkowski distance.
This parameter is only valid when Distance is 'minkowski'.
Default is 2.
Cov
A positive definite matrix indicating the covariance matrix when
computing the Mahalanobis distance. This parameter is only
valid when Distance is 'mahalanobis'. Default is nancov(X).
Scale
A vector S with the length equal to the number of columns in
X. Each coordinate of X and each query point is scaled by the
corresponding element of S when computing the standardized
Euclidean distance. This parameter is only valid when Distance
is 'seuclidean'. Default is nanstd(X).
Properties X
A matrix used to create the object
Distance
A string specifying a built-in distance metric or a function handle
that you provide when you create the object. This property is the
default distance metric used when you call the knnsearch method
to find nearest neighbors for future query points.
DistParameter
Specifies the additional parameter for the chosen distance metric.
The value is:
load fisheriris
x = meas(:,3:4);
NS = ExhaustiveSearcher(x,'distance','minkowski')
NS =
ExhaustiveSearcher
Properties:
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 2
load fisheriris
x = meas(:,3:4);
NS = createns(x,'NsMethod','exhaustive',...
'distance','minkowski')
NS =
ExhaustiveSearcher
Properties:
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 2
For more in-depth examples using the knnsearch method, see the
method reference page or see “Example: Classifying Query Data Using
knnsearch” on page 12-22.
expinv
Syntax X = expinv(P,mu)
[X,XLO,XUP] = expinv(P,mu,pcov,alpha)
$x = F^{-1}(p \mid \mu) = -\mu\,\ln(1-p)$
Examples Let the lifetime of light bulbs be exponentially distributed with µ = 700
hours. What is the median lifetime of a bulb?
expinv(0.50,700)
ans =
485.2030
Suppose you buy a box of “700 hour” light bulbs. If 700 hours is the
mean life of the bulbs, half of them will burn out in less than 500 hours.
explike
dataset.export
Syntax export(DS,'file',filename)
export(DS)
export(DS,'file',filename,'Delimiter',delim)
export(DS,'XLSfile',filename)
export(DS,'XPTFile',filename)
export(DS,...,'WriteVarNames',false)
export(DS,...,'WriteObsNames',false)
In some cases, export can create a file with a mismatch between the
number of fields on each line and
the number of column headings on the first line. Writing a dataset
array that contains a cell-valued variable whose cell contents are not
all the same length will result in a different number of fields on each
line in the file.
Examples Move data between external text files and dataset arrays in the
MATLAB workspace:
A = dataset('file','sat2.dat','delimiter',',')
A =
Test Gender Score
'Verbal' 'Male' 470
'Verbal' 'Female' 530
'Quantitative' 'Male' 520
'Quantitative' 'Female' 480
B = dataset('file','HighScores.txt','delimiter','\t')
B =
Test Gender Score
'Verbal' 'Female' 530
'Quantitative' 'Male' 520
exppdf
Syntax Y = exppdf(X,mu)
$$y = f(x \mid \mu) = \frac{1}{\mu} e^{-x/\mu}$$
The exponential pdf is the gamma pdf with its first parameter equal to 1.
The exponential distribution is appropriate for modeling waiting
times when the probability of waiting an additional period of time is
independent of how long you have already waited. For example, the
probability that a light bulb will burn out in its next minute of use is
relatively independent of how many minutes it has already burned.
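For example, the following sketch checks the memoryless property numerically with expcdf (the parameter and time values are illustrative):
mu = 700; s = 400; t = 100;
condProb = (1 - expcdf(s+t,mu))/(1 - expcdf(s,mu)) % P(X > s+t | X > s)
uncondProb = 1 - expcdf(t,mu)                      % P(X > t)
Both results equal exp(-t/mu), approximately 0.8669.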
Examples y = exppdf(5,1:5)
y =
0.0067 0.0410 0.0630 0.0716 0.0736
y = exppdf(1:5,1:5)
y =
0.3679 0.1839 0.1226 0.0920 0.0736
exprnd
Syntax R = exprnd(mu)
R = exprnd(mu,v)
R = exprnd(mu,m,n)
Examples n1 = exprnd(5:10)
n1 =
7.5943 18.3400 2.7113 3.0936 0.6078 9.5841
n2 = exprnd(5:10,[1 6])
n2 =
3.2752 1.1110 23.5530 23.4303 5.7190 3.9876
n3 = exprnd(5,2,3)
n3 =
24.3339 13.5271 1.8788
4.7932 4.3675 2.6468
expstat
Description [m,v] = expstat(mu) returns the mean of and variance for the exponential distribution with mean parameter mu. mu can be a vector, matrix, or multidimensional array. The mean of the exponential distribution is µ, and the variance is µ^2.
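Examples A minimal illustration; the outputs follow directly from the formulas above:
[m,v] = expstat(1:6)
m =
1 2 3 4 5 6
v =
1 4 9 16 25 36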
factoran
$$x = \mu + \Lambda f + e$$
$$\operatorname{cov}(x) = \Lambda \Lambda^{T} + \Psi$$
Field Description
loglike Maximized log-likelihood value
dfe Error degrees of freedom = ((d-m)^2 - (d+m))/2
factoran does not compute the chisq and p fields unless dfe is positive
and all the specific variance estimates in psi are positive (see “Heywood
Case” on page 18-402 below). If X is a covariance matrix, then you must
also specify the 'nobs' parameter if you want factoran to compute the
chisq and p fields.
[lambda,psi,T,stats,F] = factoran(X,m) also returns, in F,
predictions of the common factors, known as factor scores. F is an n-by-m
matrix where each row is a prediction of m common factors. If X is a
covariance matrix, factoran cannot compute F. factoran rotates F
using the same criterion as for lambda.
[...] = factoran(...,param1,val1,param2,val2,...) enables
you to specify optional parameter name/value pairs to control the model
fit and the outputs. The following are the valid parameter/value pairs.
Parameter Value
'xtype' Type of input in the matrix X. 'xtype' can be one of:
'data'        Raw data (default)
'covariance'  Positive definite covariance or correlation matrix
'scores' Method for predicting factor scores. 'scores' is ignored if X is not raw data.
'wls' or 'Bartlett'        Synonyms for a weighted least-squares estimate that treats F as fixed (default)
'regression' or 'Thomson'  Synonyms for a minimum mean squared error prediction that is equivalent to a ridge regression
'start' Starting point for the specific variances psi in
the maximum likelihood optimization. Can be
specified as:
'random' Chooses d uniformly distributed
values on the interval [0,1].
'Rsquared' Chooses the starting vector
as a scale factor times
diag(inv(corrcoef(X))) (default).
For examples, see Jöreskog [2].
Positive Performs the given number of
integer maximum likelihood fits, each
initialized as with 'random'.
factoran returns the fit with the
highest likelihood.
Matrix Performs one maximum likelihood
fit for each column of the specified
matrix. The ith optimization is
initialized with the values from the
ith column. The matrix must have
d rows.
'rotate' Method used to rotate factor loadings and scores.
'rotate' can have the same values as the
'Method' parameter of rotatefactors. See
the reference page for rotatefactors for a full
description of the available methods.
'none' Performs no rotation.
'equamax' Special case of the orthomax
rotation. Use the 'normalize',
'reltol', and 'maxit' parameters
to control the details of the rotation.
'orthomax' Orthogonal rotation that maximizes
a criterion based on the variance of
the loadings.
Use the 'coeff', 'normalize',
'reltol', and 'maxit' parameters
to control the details of the rotation.
'parsimax' Special case of the orthomax rotation. Use the 'normalize', 'reltol', and 'maxit' parameters to control the details of the rotation.
'pattern' Performs either an oblique rotation
(the default) or an orthogonal
rotation to best match a specified
pattern matrix. Use the 'type'
parameter to choose the type
of rotation. Use the 'target'
parameter to specify the pattern
matrix.
'procrustes' Performs either an oblique (the default) or an orthogonal rotation to best match a specified target matrix in the least squares sense. Use the 'type' parameter to choose the type of rotation. Use 'target' to specify the target matrix.
'promax' Performs an oblique procrustes
rotation to a target matrix
determined by factoran as a
function of an orthomax solution.
Use the 'power' parameter to
specify the exponent for creating the
target matrix. Because 'promax'
uses 'orthomax' internally, you can
also specify the parameters that
apply to 'orthomax'.
'quartimax' Special case of the orthomax rotation. Use the 'normalize', 'reltol', and 'maxit' parameters to control the details of the rotation.
'varimax' Special case of the orthomax rotation (default). Use the 'normalize', 'reltol', and 'maxit' parameters to control the details of the rotation.
Function Function handle to a rotation function of the form [B,T] = myrotation(A,...)
'power' Exponent for creating the target matrix in the
'promax' rotation. Must be ≥ 1. Default is 4.
'userargs' Denotes the beginning of additional input values
for a user-defined rotation function. factoran
appends all subsequent values, in order and
without processing, to the rotation function
argument list, following the unrotated factor
loadings matrix A. See “Example 4” on page 18-407.
'nobs' If X is a covariance or correlation matrix, indicates
the number of observations that were used in its
estimation. This allows calculation of significance
for the null hypothesis even when the original data
are not available. There is no default. 'nobs' is
ignored if X is raw data.
'delta' Lower bound for the specific variances psi during
the maximum likelihood optimization. Default is
0.005.
'optimopts' Structure that specifies control parameters for
the iterative algorithm the function uses to
compute maximum likelihood estimates. Create
this structure with the function statset. Enter
statset('factoran') to see the names and
default values of the parameters that factoran
accepts in the options structure. See the reference
page for statset for more information about these
options.
Heywood Case
If elements of psi are equal to the value of the 'delta' parameter
(i.e., they are essentially zero), the fit is known as a Heywood case, and
interpretation of the resulting estimates is problematic. In particular,
there can be multiple local maxima of the likelihood, each with different
estimates of the loadings and the specific variances. Heywood cases
can indicate overfitting (i.e., m is too large), but can also be the result
of underfitting.
Examples Example 1
Load the carbig data, and fit the default model with two factors.
load carbig
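A plausible completion, consistent with the use of X in Example 2 below (the choice of variables is an assumption, not taken from this page):
X = [Acceleration Displacement Horsepower MPG Weight];
X = X(all(~isnan(X),2),:);            % keep complete rows only
[Lambda,Psi,T,stats] = factoran(X,2); % fit the two-factor model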
Example 2
Although the estimates are the same, the use of a covariance matrix
rather than raw data doesn’t let you request scores or significance level:
[Lambda,Psi,T] = factoran(cov(X),2,'xtype','cov')
[Lambda,Psi,T] = factoran(corrcoef(X),2,'xtype','cov')
Example 3
Use promax rotation:
[Lambda,Psi,T,stats,F] = factoran(X,2,'rotate','promax',...
'power',4);
invT = inv(T)
Lambda0 = Lambda*invT
biplot(Lambda,'LineWidth',2,'MarkerSize',20)
Example 4
Syntax for passing additional arguments to a user-defined rotation
function:
[Lambda,Psi,T] = ...
factoran(X,2,'rotate',@myrotation,'userargs',1,'two');
TreeBagger.FBoot property
fcdf
Syntax P = fcdf(X,V1,V2)
$$p = F(x \mid \nu_1,\nu_2) = \int_0^x \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \frac{t^{(\nu_1-2)/2}}{\left[1+\frac{\nu_1}{\nu_2}\,t\right]^{(\nu_1+\nu_2)/2}} \, dt$$
Examples
nu1 = 1:5;
nu2 = 6:10;
x = 2:6;
F1 = fcdf(x,nu1,nu2)
F1 =
0.7930 0.8854 0.9481 0.9788 0.9919
F2 = 1 - fcdf(1./x,nu2,nu1)
F2 =
0.7930 0.8854 0.9481 0.9788 0.9919
ff2n
Description dFF2 = ff2n(n) gives factor settings dFF2 for a two-level full factorial
design with n factors. dFF2 is m-by-n, where m is the number of
treatments in the full-factorial design. Each row of dFF2 corresponds
to a single treatment. Each column contains the settings for a single
factor, with values of 0 and 1 for the two levels.
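Examples A minimal sketch for three factors; ff2n enumerates the 2^3 = 8 treatments in binary counting order:
dFF2 = ff2n(3)
dFF2 =
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1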
TreeBagger.fillProximities
Syntax B = fillProximities(B)
B = fillProximities(B,'param1',val1,'param2',val2,...)
qrandstream.findobj
Description The findobj method of the handle class follows the same syntax as
the MATLAB findobj command, except that the first argument must
be an array of handles to objects.
hm = findobj(h, 'conditions') searches the handle object array
h and returns an array of handle objects matching the specified
conditions. Only the public members of the objects of h are considered
when evaluating the conditions.
qrandstream.findprop
Syntax p = findprop(h,'propname')
finv
Syntax X = finv(P,V1,V2)
$$x = F^{-1}(p \mid \nu_1,\nu_2) = \{ x : F(x \mid \nu_1,\nu_2) = p \}$$
where
$$p = F(x \mid \nu_1,\nu_2) = \int_0^x \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \frac{t^{(\nu_1-2)/2}}{\left[1+\frac{\nu_1}{\nu_2}\,t\right]^{(\nu_1+\nu_2)/2}} \, dt$$
Examples Find a value that should exceed 95% of the samples from an F
distribution with 5 degrees of freedom in the numerator and 10 degrees
of freedom in the denominator.
x = finv(0.95,5,10)
x =
3.3258
You would observe values greater than 3.3258 only 5% of the time by
chance.
gmdistribution.fit
Parameter Value
'Start' Method used to choose initial component parameters.
One of the following:
'Replicates' A positive integer giving the number of times to
repeat the EM algorithm, each time with a new set of
parameters. The solution with the largest likelihood
is returned. A value larger than 1 requires the
'randSample' start method. The default is 1.
'CovType' 'diagonal' if the covariance matrices are restricted
to be diagonal; 'full' otherwise. The default is
'full'.
'SharedCov' Logical true if all the covariance matrices are
restricted to be the same (pooled estimate); logical
false otherwise.
'Regularize' A nonnegative regularization number added to
the diagonal of covariance matrices to make them
positive-definite. The default is 0.
'Options' Options structure for the iterative EM algorithm, as created by statset. gmdistribution.fit uses the parameters 'Display' with a default value of 'off', 'MaxIter' with a default value of 100, and 'TolFun' with a default value of 1e-6.
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
hold on
options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
10 iterations, log-likelihood = -7046.78
ComponentMeans = obj.mu
ComponentMeans =
0.9391 2.0322
-2.9823 -4.9737
ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
1.7786 -0.0528
-0.0528 0.5312
ComponentCovariances(:,:,2) =
1.0491 -0.0150
-0.0150 0.9816
MixtureProportions = obj.PComponents
MixtureProportions =
0.5000 0.5000
AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
obj{k} = gmdistribution.fit(X,k);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
2
model = obj{2}
model =
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean: 0.9391 2.0322
Component 2:
Mixing proportion: 0.500000
Mean: -2.9823 -4.9737
Both the Akaike (AIC) and Bayes (BIC) information criteria are negative log-likelihoods for the data with penalty terms for the number of estimated parameters. They are often used to determine an appropriate number of components for a model when the number of components is unspecified.
References [1] McLachlan, G., and D. Peel. Finite Mixture Models. Hoboken, NJ:
John Wiley & Sons, Inc., 2000.
NaiveBayes.fit
If the prior probabilities don’t sum to one, fit will normalize them.
• 'KSWidth' – The bandwidth of the kernel smoothing window. The
default is to select a default bandwidth automatically for each
combination of feature and class, using a value that is optimal for
a Gaussian distribution. You can specify the value as one of the
following:
See Also “Naive Bayes Classification” on page 12-6, “Grouped Data” on page 2-34
fitdist
Parameter Values
'censoring' A Boolean vector the same size as X, containing 1s when the corresponding elements in X are right-censored observations and 0s when the corresponding elements are exact observations. Default is a vector of 0s.
'options' A structure created by the statset function to specify
control parameters for the iterative fitting algorithm.
'n' For 'binomial' distributions only, a positive integer
specifying the N parameter (number of trials).
'theta' For 'generalized pareto' distributions only, value
specifying the theta (threshold) parameter for the
generalized Pareto distribution. Default is 0.
'kernel' For 'kernel' distributions only, a string specifying the
type of kernel smoother to use. Choices are:
• 'normal' (default)
• 'box'
• 'triangle'
• 'epanechnikov'
'support' For 'kernel' distributions only, any of the following to specify the support:
• 'unbounded' (default), to allow the density to extend over the whole real line
• 'positive', to restrict the density to positive values
• A two-element vector giving finite lower and upper limits for the support of the density
load carsmall
ksd = fitdist(MPG,'kernel')
ksd =
kernel distribution
Kernel = normal
Bandwidth = 4.11428
Support = unbounded
load carsmall
wd = fitdist(MPG,'weibull','by',Origin)
Algorithm The fitdist function fits most distributions using maximum likelihood.
Two exceptions are the normal and lognormal distributions with
uncensored data. For the uncensored normal distribution, the estimated
value of the sigma parameter is the square root of the unbiased
estimate of the variance. For the uncensored lognormal distribution,
the estimated value of the sigma parameter is the square root of the
unbiased estimate of the variance of the log of the data.
categorical.flipdim
Syntax B = flipdim(A,dim)
categorical.fliplr
Syntax B = fliplr(A)
categorical.flipud
Syntax B = flipud(A)
fpdf
Syntax Y = fpdf(X,V1,V2)
$$y = f(x \mid \nu_1,\nu_2) = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \frac{x^{(\nu_1-2)/2}}{\left[1+\frac{\nu_1}{\nu_2}\,x\right]^{(\nu_1+\nu_2)/2}}$$
Examples y = fpdf(1:6,2,2)
y =
0.2500 0.1111 0.0625 0.0400 0.0278 0.0204
z = fpdf(3,5:10,5:10)
z =
0.0689 0.0659 0.0620 0.0577 0.0532 0.0487
fracfact
Examples Suppose you wish to determine the effects of four two-level factors, for which there may be two-way interactions. A full-factorial design would require 2^4 = 16 runs. The fracfactgen function finds generators for a resolution IV (separating main effects) fractional-factorial design that requires only 2^3 = 8 runs:
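The generators used below come from fracfactgen; a hedged sketch of one such call, requesting resolution IV explicitly:
generators = fracfactgen('a b c d',3,4); % 4 factors, 2^3 runs, resolution IV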
[dfF,confounding] = fracfact(generators)
dfF =
-1 -1 -1 -1
-1 -1 1 1
-1 1 -1 1
-1 1 1 -1
1 -1 -1 1
1 -1 1 -1
1 1 -1 -1
1 1 1 1
confounding =
'Term' 'Generator' 'Confounding'
'X1' 'a' 'X1'
'X2' 'b' 'X2'
'X3' 'c' 'X3'
'X4' 'abc' 'X4'
'X1*X2' 'ab' 'X1*X2 + X3*X4'
'X1*X3' 'ac' 'X1*X3 + X2*X4'
'X1*X4' 'bc' 'X1*X4 + X2*X3'
'X2*X3' 'bc' 'X1*X4 + X2*X3'
'X2*X4' 'ac' 'X1*X3 + X2*X4'
'X3*X4' 'ab' 'X1*X2 + X3*X4'
fracfactgen
Examples Suppose you wish to determine the effects of four two-level factors, for which there may be two-way interactions. A full-factorial design would require 2^4 = 16 runs. The fracfactgen function finds generators for a resolution IV (separating main effects) fractional-factorial design that requires only 2^3 = 8 runs:
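As on the fracfact page, a hedged sketch of the generator call:
generators = fracfactgen('a b c d',3,4); % 4 factors, 2^3 runs, resolution IV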
[dfF,confounding] = fracfact(generators)
dfF =
-1 -1 -1 -1
-1 -1 1 1
-1 1 -1 1
-1 1 1 -1
1 -1 -1 1
1 -1 1 -1
1 1 -1 -1
1 1 1 1
confounding =
'Term' 'Generator' 'Confounding'
'X1' 'a' 'X1'
'X2' 'b' 'X2'
friedman
Syntax p = friedman(X,reps)
p = friedman(X,reps,displayopt)
[p,table] = friedman(...)
[p,table,stats] = friedman(...)
$$x_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}$$
Assumptions
Friedman's test makes the following assumptions about the data in X:
• All data come from populations having the same continuous distribution, apart from possibly different locations due to column and row effects.
• All observations are mutually independent.
The classical two-way ANOVA replaces the first assumption with the stronger assumption that data come from normal distributions.
Examples Let’s repeat the example from the anova2 function, this time applying
Friedman’s test. Recall that the data below come from a study of
popcorn brands and popper type (Hogg 1987). The columns of the
matrix popcorn are brands (Gourmet, National, and Generic). The
rows are popper type (Oil and Air). The study popped a batch of each
brand three times with each popper. The values are the yield in cups
of popped popcorn.
load popcorn
popcorn
popcorn =
5.5000 4.5000 3.5000
5.5000 4.5000 4.0000
6.0000 4.0000 3.0000
6.5000 5.0000 4.0000
7.0000 5.5000 5.0000
7.0000 5.0000 4.5000
p = friedman(popcorn,3)
p =
0.0010
The small p value of 0.001 indicates the popcorn brand affects the yield
of popcorn. This is consistent with the results from anova2.
References [1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
frnd
Syntax R = frnd(V1,V2)
R = frnd(V1,V2,v)
R = frnd(V1,V2,m,n)
Examples n1 = frnd(1:6,1:6)
n1 =
0.0022 0.3121 3.0528 0.3189 0.2715 0.9539
n2 = frnd(2,2,[2 3])
n2 =
0.3186 0.9727 3.0268
0.2052 148.5816 0.2191
fstat
Description [M,V] = fstat(V1,V2) returns the mean of and variance for the F
distribution with numerator degrees of freedom V1 and denominator
degrees of freedom V2. V1 and V2 can be vectors, matrices, or
multidimensional arrays that all have the same size, which is also the
size of M and V. A scalar input for V1 or V2 is expanded to a constant array with the same dimensions as the other input.
The mean of the F distribution, for values of ν2 greater than 2, is
$$m = \frac{\nu_2}{\nu_2 - 2}$$
The variance, for values of ν2 greater than 4, is
$$v = \frac{2\,\nu_2^2\,(\nu_1 + \nu_2 - 2)}{\nu_1\,(\nu_2 - 2)^2\,(\nu_2 - 4)}$$
Examples fstat returns NaN when the mean and variance are undefined.
[m,v] = fstat(1:5,1:5)
m =
NaN NaN 3.0000 2.0000 1.6667
v =
NaN NaN NaN NaN 8.8889
fsurfht
Syntax fsurfht(fun,xlims,ylims)
fsurfht(fun,xlims,ylims,p1,p2,p3,p4,p5)
Examples Plot the Gaussian likelihood function for the gas.mat data.
load gas
function z = gauslike(mu,sigma,p1)
n = length(p1);
z = ones(size(mu));
for i = 1:n
z = z .* (normpdf(p1(i),mu,sigma));
end
The gauslike function calls normpdf, treating the data sample as fixed
and the parameters µ and σ as variables. Assume that the gas prices
are normally distributed, and plot the likelihood surface of the sample.
The sample mean is the x value at the maximum, but the sample
standard deviation is not the y value at the maximum.
mumax = mean(price1)
mumax =
115.1500
sigmamax = std(price1)*sqrt(19/20)
sigmamax =
3.7719
fullfact
Description dFF = fullfact(levels) gives factor settings dFF for a full factorial
design with n factors, where the number of levels for each factor is given
by the vector levels of length n. dFF is m-by-n, where m is the number
of treatments in the full-factorial design. Each row of dFF corresponds
to a single treatment. Each column contains the settings for a single
factor, with integer values from one to the number of levels.
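Examples A minimal sketch for one two-level and one three-level factor; fullfact cycles the first factor fastest:
dFF = fullfact([2 3])
dFF =
1 1
2 1
1 2
2 2
1 3
2 3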
gagerr
Syntax gagerr(y,{part,operator})
gagerr(y,GROUP)
gagerr(y,part)
gagerr(...,param1,val1,param2,val2,...)
[TABLE, stats] = gagerr(...)
Examples Conduct a gage R&R study for a simulated measurement system using
a mixed ANOVA model without interactions:
y = randn(100,1); % measurements
part = ceil(3*rand(100,1)); % parts
operator = ceil(4*rand(100,1)); % operators
gagerr(y,{part, operator},'randomoperator',true) % analysis
Note: The last column of the above table does not have to sum to 100%
gamcdf
Syntax P = gamcdf(X,A,B)
[P,PLO,PUP] = gamcdf(X,A,B,pcov,alpha)
$$p = F(x \mid a,b) = \frac{1}{b^{a}\,\Gamma(a)} \int_0^x t^{a-1} e^{-t/b} \, dt$$
Examples a = 1:6;
b = 5:10;
prob = gamcdf(a.*b,a,b)
prob =
0.6321 0.5940 0.5768 0.5665 0.5595 0.5543
See Also cdf, gampdf, gaminv, gamstat, gamfit, gamlike, gamrnd, gamma
“Gamma Distribution” on page B-27
gamfit
a = 2; b = 4;
data = gamrnd(a,b,100,1);
[p,ci] = gamfit(data)
p =
2.1990 3.7426
ci =
1.6840 2.8298
2.7141 4.6554
gaminv
Syntax X = gaminv(P,A,B)
[X,XLO,XUP] = gaminv(P,A,B,pcov,alpha)
$$x = F^{-1}(p \mid a,b) = \{ x : F(x \mid a,b) = p \}$$
where
$$p = F(x \mid a,b) = \frac{1}{b^{a}\,\Gamma(a)} \int_0^x t^{a-1} e^{-t/b} \, dt$$
Examples This example shows the relationship between the gamma cdf and its
inverse function.
a = 1:5;
b = 6:10;
x = gaminv(gamcdf(1:5,a,b),a,b)
x =
1.0000 2.0000 3.0000 4.0000 5.0000
gamlike
a = 2; b = 3;
r = gamrnd(a,b,100,1);
[nlogL,AVAR] = gamlike(gamfit(r),r)
nlogL =
267.5648
AVAR =
0.0788 -0.1104
-0.1104 0.1955
gampdf
Syntax Y = gampdf(X,A,B)
$$y = f(x \mid a,b) = \frac{1}{b^{a}\,\Gamma(a)}\, x^{a-1} e^{-x/b}$$
mu = 1:5;
y = gampdf(1,1,mu)
y =
0.3679 0.3033 0.2388 0.1947 0.1637
y1 = exppdf(1,mu)
y1 =
0.3679 0.3033 0.2388 0.1947 0.1637
gamrnd
Syntax R = gamrnd(A,B)
R = gamrnd(A,B,v)
R = gamrnd(A,B,m,n)
Examples n1 = gamrnd(1:5,6:10)
n1 =
9.1132 12.8431 24.8025 38.5960 106.4164
n2 = gamrnd(5,10,[1 5])
n2 =
30.9486 33.5667 33.6837 55.2014 46.8265
n3 = gamrnd(2:6,3,1,5)
n3 =
12.8715 11.3068 3.0982 15.6012 21.6739
See Also randg, random, gampdf, gamcdf, gaminv, gamstat, gamfit, gamlike
“Gamma Distribution” on page B-27
gamstat
Description [M,V] = gamstat(A,B) returns the mean of and variance for the
gamma distribution with shape parameters in A and scale parameters
in B. A and B can be vectors, matrices, or multidimensional arrays that
have the same size, which is also the size of M and V. A scalar input
for A or B is expanded to a constant array with the same dimensions
as the other input.
The mean of the gamma distribution with parameters a and b is ab.
The variance is ab2.
[m,v] = gamstat(1:5,1./(1:5))
m =
1 1 1 1 1
v =
1.0000 0.5000 0.3333 0.2500 0.2000
qrandstream.ge
Syntax h1 >= h2
geocdf
Syntax Y = geocdf(X,P)
$$y = F(x \mid p) = \sum_{i=0}^{\mathrm{floor}(x)} p\,q^{i}$$
where $q = 1 - p$.
The result, y, is the probability of observing up to x trials before a
success, when the probability of success in any given trial is p.
Examples Suppose you toss a fair coin repeatedly. If the coin lands face up (heads),
that is a success. What is the probability of observing three or fewer
tails before getting a heads?
p = geocdf(3,0.5)
p =
0.9375
geoinv
Syntax X = geoinv(Y,P)
Description X = geoinv(Y,P) returns the smallest positive integer X such that the
geometric cdf evaluated at X is equal to or exceeds Y. You can think of
Y as the probability of observing X successes in a row in independent
trials where P is the probability of success in each trial.
Y and P can be vectors, matrices, or multidimensional arrays that all
have the same size. A scalar input for P or Y is expanded to a constant
array with the same dimensions as the other input. The values in P
and Y must lie on the interval [0 1].
Examples The probability of correctly guessing the result of 10 coin tosses in a row
is less than 0.001 (unless the coin is not fair).
psychic = geoinv(0.999,0.5)
psychic =
9
The example below shows the inverse method for generating random
numbers from the geometric distribution.
rndgeo = geoinv(rand(2,5),0.5)
rndgeo =
0 1 3 1 0
0 1 0 2 0
geomean
Syntax m = geomean(X)
m = geomean(X,dim)
$$m = \left[ \prod_{i=1}^{n} x_i \right]^{1/n}$$
Examples The arithmetic mean is greater than or equal to the geometric mean.
x = exprnd(1,10,6);
geometric = geomean(x)
geometric =
0.7466 0.6061 0.6038 0.2569 0.7539 0.3478
average = mean(x)
average =
1.3509 1.1583 0.9741 0.5319 1.0088 0.8122
geopdf
Syntax Y = geopdf(X,P)
$$y = f(x \mid p) = p\,q^{x}\, I_{(0,1,\ldots)}(x)$$
where $q = 1 - p$.
Examples Suppose you toss a fair coin repeatedly. If the coin lands face up
(heads), that is a success. What is the probability of observing exactly
three tails before getting a heads?
p = geopdf(3,0.5)
p =
0.0625
geornd
Syntax R = geornd(P)
R = geornd(P,v)
R = geornd(P,m,n)
r2 = geornd(0.01,[1 5])
r2 =
65 18 334 291 63
r3 = geornd(0.5,1,6)
r3 =
0 7 1 3 1 0
geostat
Description [M,V] = geostat(P) returns the mean of and variance for the
geometric distribution with corresponding probabilities in P.
The mean of the geometric distribution with parameter p is q/p, where q
= 1-p. The variance is q/p^2.
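A minimal illustration; the outputs follow from the formulas above:
[m,v] = geostat(0.25)
m =
3
v =
12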
dataset.get
Syntax get(A)
s = get(A)
p = get(A,PropertyName)
p = get(A,{PropertyName1,PropertyName2,...})
Description get(A) displays a list of property/value pairs for the dataset array A.
s = get(A) returns the values in a scalar structure s with field names
given by the properties.
p = get(A,PropertyName) returns the value of the property specified
by the string PropertyName.
p = get(A,{PropertyName1,PropertyName2,...}) allows multiple
property names to be specified and returns their values in a cell array.
Examples Create a dataset array from Fisher’s iris data and access the
information:
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
get(iris)
Description: ''
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {150x1 cell}
VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
ON = get(iris,'ObsNames');
ON(1:3)
ans =
'Obs1'
'Obs2'
'Obs3'
categorical.getlabels
Examples Example 1
Display levels in a nominal and an ordinal array:
standings = nominal({'Leafs','Canadiens','Bruins'});
getlabels(standings)
ans =
'Bruins' 'Canadiens' 'Leafs'
standings = ordinal(1:3,{'Leafs','Canadiens','Bruins'});
getlabels(standings)
ans =
'Leafs' 'Canadiens' 'Bruins'
Example 2
Display age groups containing data in hospital.mat:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
AgeGroup = droplevels(AgeGroup);
getlabels(AgeGroup)
ans =
'20s' '30s' '40s' '50s'
categorical.getlevels
Syntax S = getlevels(A)
gevcdf
Syntax P = gevcdf(X,K,sigma,mu)
gevfit
gevinv
Syntax X = gevinv(P,K,sigma,mu)
gevlike
gevpdf
Syntax Y = gevpdf(X,K,sigma,mu)
gevrnd
Syntax R = gevrnd(K,sigma,mu)
R = gevrnd(K,sigma,mu,m,n,p)
R = gevrnd(K,sigma,mu,[m,n,p])
gevstat
gline
Syntax gline(h)
gline
hline = gline(...)
Description gline(h) allows you to draw a line segment in the figure with handle h
by clicking the pointer at the two endpoints. A rubber-band line tracks
the pointer movement.
gline with no input arguments defaults to h = gcf and draws in the
current figure.
hline = gline(...) returns the handle hline to the line.
x = 1:10;
y = x + randn(1,10);
scatter(x,y,25,'b','*')
lsline
mu = mean(y);
hold on
plot([1 10],[mu mu],'ro')
glmfit
Syntax b = glmfit(X,y,distr)
b = glmfit(X,y,distr,param1,val1,param2,val2,...)
[b,dev] = glmfit(...)
[b,dev,stats] = glmfit(...)
Example Fit a probit regression model for y on x. Each y(i) is the number
of successes in n(i) trials.
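A minimal sketch with simulated data (all values here are assumptions, not the manual's own data):
x = (1:10)';                        % predictor
n = 20*ones(10,1);                  % trials per observation
y = binornd(n,normcdf(-2 + 0.4*x)); % simulated success counts
b = glmfit(x,[y n],'binomial','link','probit')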
[3] Collett, D. Modeling Binary Data. New York: Chapman & Hall,
2002.
glmval
Parameter Value
'confidence'  The confidence level for the confidence bounds. A scalar between 0 and 1.
'size'        The size parameter (N) for a binomial model. A scalar, or a vector with one value for each row of X.
'offset'      A vector used as an additional predictor variable, but with a coefficient value fixed at 1.0.
'constant'    'on' includes a constant term in the model (the coefficient of the constant term is the first element of b); 'off' omits the constant term.
Examples Fit a probit regression model for y on x. Each y(i) is the number
of successes in n(i) trials.
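A minimal sketch, continuing the simulated probit fit shown under glmfit (b, x, n, and y as defined there):
yfit = glmval(b,x,'probit','size',n); % predicted counts
plot(x,y./n,'o',x,yfit./n,'-')        % observed vs. fitted proportions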
[3] Collett, D. Modeling Binary Data. New York: Chapman & Hall,
2002.
glyphplot
Syntax glyphplot(X)
glyphplot(X,'glyph','face')
glyphplot(X,'glyph','face','features',f)
glyphplot(X,...,'grid',[rows,cols])
glyphplot(X,...,'grid',[rows,cols],'page',p)
glyphplot(X,...,'centers',C)
glyphplot(X,...,'centers',C,'radius',r)
glyphplot(X,...,'obslabels',labels)
glyphplot(X,...,'standardize',method)
glyphplot(X,...,prop1,val1,...)
h = glyphplot(X,...)
Description glyphplot(X) creates a star plot from the multivariate data in the
n-by-p matrix X. Rows of X correspond to observations, columns
to variables. A star plot represents each observation as a “star”
whose ith spoke is proportional in length to the ith coordinate of that
observation. glyphplot standardizes X by shifting and scaling each
column separately onto the interval [0,1] before making the plot, and
centers the glyphs on a rectangular grid that is as close to square as
possible. glyphplot treats NaNs in X as missing values, and does not
plot the corresponding rows of X. glyphplot(X,'glyph','star') is a
synonym for glyphplot(X).
glyphplot(X,'glyph','face') creates a face plot from X. A face plot
represents each observation as a “face,” whose ith facial feature is
drawn with a characteristic proportional to the ith coordinate of that
observation. The features are described in “Face Features” on page
18-503.
glyphplot(X,'glyph','face','features',f) creates a face plot
where the ith element of the index vector f defines which facial feature
will represent the ith column of X. f must contain integers from 0 to
17, where 0 indicates that the corresponding column of X should not be
plotted. See “Face Features” on page 18-503 for more information.
to the lines making up each face and to the pupils, respectively. h(:,3)
contains handles to the text objects for the labels, if present.
Face Features
The following table describes the correspondence between the columns
of the vector f, the value of the 'Features' input parameter, and the
facial features of the glyph plot. If X has fewer than 17 columns, unused
features are displayed at their default value.
glyphplot(X,'standardize','column',...
'obslabels',Model,...
'grid',[2 2],...
'page','scroll');
glyphplot(X,'glyph','face',...
'obslabels',Model,...
'grid',[2 3],...
'page',9);
gmdistribution class
Properties All objects of the class have the properties listed in the following table.
Copy Semantics
Value. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
References McLachlan, G., and D. Peel, Finite Mixture Models, John Wiley & Sons,
New York, 2000.
gmdistribution
mu = [1 2;-3 -5];
sigma = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(mu,sigma,p);
References [1] McLachlan, G., and D. Peel. Finite Mixture Models. Hoboken, NJ:
John Wiley & Sons, Inc., 2000.
gname
Syntax gname(cases)
gname
h = gname(cases,line_handle)
Description gname(cases) displays a figure window and waits for you to press
a mouse button or a keyboard key. The input argument cases is a
character array or a cell array of strings, in which each row of the
character array or each element of the cell array contains the case
name of a point. Moving the mouse over the graph displays a pair of
cross-hairs. If you position the cross-hairs near a point with the mouse
and click once, the graph displays the label corresponding to that point.
Alternatively, you can click and drag the mouse to create a rectangle
around several points. When you release the mouse button, the graph
displays the labels for all points in the rectangle. Right-click a point to
remove its label. When you are done labeling points, press the Enter
or Escape key to stop labeling.
gname with no arguments labels each case with its case number.
cases typically contains unique case names for each point, and is a cell
array of strings or a character matrix with each row representing a
name. cases can also be any grouping variable, which gname converts
to labels.
h = gname(cases,line_handle) returns a vector of handles to the text
objects on the plot. Use the scalar line_handle to identify the correct
line if there is more than one line object on the plot.
You can use gname to label plots created by the plot, scatter,
gscatter, plotmatrix, and gplotmatrix functions.
Examples This example uses the city ratings data sets to find out which cities are
the best and worst for education and the arts.
load cities
education = ratings(:,6);
arts = ratings(:,7);
plot(education,arts,'+')
gname(names)
Click the point at the top of the graph to display its label, “New York.”
gpcdf
Syntax P = gpcdf(X,K,sigma,theta)
$$K < 0, \qquad 0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
gpfit
$$0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
gpinv
Syntax X = gpinv(P,K,sigma,theta)
$$K < 0, \qquad 0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
gplike
$$K < 0, \qquad 0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
gppdf
Syntax P = gppdf(X,K,sigma,theta)
$$K < 0, \qquad 0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
gplotmatrix
Syntax gplotmatrix(x,y,group)
gplotmatrix(x,y,group,clr,sym,siz)
gplotmatrix(x,y,group,clr,sym,siz,doleg)
gplotmatrix(x,y,group,clr,sym,siz,doleg,dispopt)
gplotmatrix(x,y,group,clr,sym,siz,doleg,dispopt,xnam,ynam)
[h,ax,bigax] = gplotmatrix(...)
Examples Load the cities data. The ratings array has ratings of the cities in
nine categories (category names are in the array categories). group
is a code whose value is 2 for the largest cities. You can make scatter
plots of the first three categories against the other four, grouped by
the city size code:
load discrim
gplotmatrix(ratings(:,1:2),ratings(:,[4 7]),group)
The output figure (not shown) has an array of graphs with each city
group represented by a different color. The graphs are a little easier to
read if you specify colors and plotting symbols, label the axes with the
rating categories, and move the legend off the graphs:
gplotmatrix(ratings(:,1:2),ratings(:,[4 7]),group,...
'br','.o',[],'on','',categories(1:2,:),...
categories([4 7],:))
gprnd
Syntax R = gprnd(K,sigma,theta)
R = gprnd(K,sigma,theta,m,n,p)
R = gprnd(K,sigma,theta,[m,n,p])
$$0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
gpstat
$$K < 0, \qquad 0 \le \frac{x - \theta}{\sigma} \le -\frac{1}{K}$$
References [1] Embrechts, P., C. Klüppelberg, and T. Mikosch. Modelling Extremal
Events for Insurance and Finance. New York: Springer, 1997.
TreeBagger.growTrees
Syntax B = growTrees(B,ntrees)
B = growTrees(B,ntrees,'param1',val1,'param2',val2,...)
• 'UseParallel' — If 'always' and if a matlabpool of the Parallel Computing Toolbox is open, the method grows trees in parallel.
grp2idx
Syntax [G,GN] = grp2idx(S)
[indices,names] = grp2idx(group)
[G,GN,GL] = grp2idx(S)
• For numeric and logical grouping variables, the order is the sorted
order of group.
• For categorical grouping variables, the order is the order of
getlabels(group).
• For string grouping variables, the order is the order of first
appearance in group.
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
ages = hospital.Age(1:5)
ages =
38
43
38
40
49
group = AgeGroup(1:5)
group =
30s
40s
30s
40s
40s
indices = grp2idx(group)
indices =
4
5
4
5
5
grpstats
Description means = grpstats(X) computes the mean of the entire sample without
grouping, where X is a matrix of observations.
means = grpstats(X,group) returns the means of each column of X by
group. The array, group defines the grouping such that two elements
of X are in the same group if their corresponding group values are
the same. (See “Grouped Data” on page 2-34.) The grouping variable
group can be a categorical variable, vector, string array, or cell array
of strings. It can also be a cell array containing several grouping
variables (such as {g1 g2 g3}) to group the values in X by each unique
combination of grouping variable values.
grpstats(X,group,alpha) displays a plot of the means versus index
with 100(1-alpha)% confidence intervals around each mean.
dsstats = grpstats(ds,groupvars), when ds is a dataset array,
returns a dataset dsstats that contains the mean, computed by group,
for variables in ds. groupvars specifies the grouping variables in ds
that define the groups, and is a positive integer, a vector of positive
integers, the name of a dataset variable, a cell array containing one or
more dataset variable names, or a logical vector. A grouping variable
may be a vector of categorical, logical, or numeric values, a character
array of strings, or a cell vector of strings. dsstats contains those
grouping variables, plus one variable giving the number of observations
in ds for each group, as well as one variable for each of the remaining
dataset variables in ds. These variables must be numeric or logical.
dsstats contains one observation for each group of observations in ds.
• 'mean' — mean
• 'sem' — standard error of the mean
• 'numel' — count, or number of non-NaN elements
• 'gname' — group name
• 'std' — standard deviation
• 'var' — variance
• 'min' — minimum
• 'max' — maximum
• 'range' — maximum - minimum
• 'meanci' — 95% confidence interval for the mean
• 'predci' — 95% prediction interval for a new observation
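For example, a minimal sketch computing group means and counts by model year for the carsmall data:
load carsmall
[means,counts] = grpstats(Weight,Model_Year,{'mean','numel'})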
dataset.grpstats
Syntax B = grpstats(A,groupvars)
B = grpstats(A,groupvars,whichstats)
B = grpstats(A,groupvars,whichstats,...,'DataVars',vars)
B = grpstats(A,groupvars,whichstats,...,'VarNames',names)
• 'mean' — mean
• 'sem' — standard error of the mean
• 'numel' — count, or number of non-NaN elements
• 'gname' — group name
• 'std' — standard deviation
• 'var' — variance
• 'meanci' — 95% confidence interval for the mean
• 'predci' — 95% prediction interval for a new observation
Examples Compute blood pressure statistics for the data in hospital.mat, by sex
and smoker status:
load hospital
grpstats(hospital,...
{'Sex','Smoker'},...
{@median,@iqr},...
'DataVars','BloodPressure')
ans =
Sex Smoker GroupCount
Female_0 Female false 40
Female_1 Female true 13
Male_0 Male false 26
Male_1 Male true 21
median_BloodPressure
Female_0 119.5 79
Female_1 129 91
Male_0 119 79
Male_1 129 92
iqr_BloodPressure
Female_0 6.5 5.5
Female_1 8 5.5
Male_0 7 6
Male_1 10.5 4.5
gscatter
Syntax gscatter(x,y,group)
gscatter(x,y,group,clr,sym,siz)
gscatter(x,y,group,clr,sym,siz,doleg)
gscatter(x,y,group,clr,sym,siz,doleg,xnam,ynam)
h = gscatter(...)
Examples Load the cities data and look at the relationship between the ratings
for climate (first column) and housing (second column) grouped by city
size. We’ll also specify the colors and plotting symbols.
load discrim
gscatter(ratings(:,1),ratings(:,2),group,'br','xo')
qrandstream.gt
Syntax h1 > h2
haltonset class
Superclasses qrandset
Description haltonset is a quasi-random point set class that produces points from
the Halton sequence.
Copy Semantics
Handle. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
haltonset
Syntax p = haltonset(d)
p = haltonset(d,prop1,val1,prop2,val2,...)
Examples Generate a 3-D Halton point set, skip the first 1000 values, and then
retain every 101st point:
p = haltonset(3,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
p = scramble(p,'RR2')
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
X0 = net(p,4)
X0 =
0.0928 0.6950 0.0029
0.6958 0.2958 0.8269
0.3013 0.6497 0.4141
0.9087 0.7883 0.2166
X = p(1:3:11,:)
X =
0.0928 0.6950 0.0029
0.9087 0.7883 0.2166
0.3843 0.9840 0.9878
0.6831 0.7357 0.7923
harmmean
Syntax m = harmmean(X)
harmmean(X,dim)
$$m = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Examples The arithmetic mean is greater than or equal to the harmonic mean.
x = exprnd(1,10,6);
harmonic = harmmean(x)
harmonic =
0.3382 0.3200 0.3710 0.0540 0.4936 0.0907
average = mean(x)
average =
1.3509 1.1583 0.9741 0.5319 1.0088 0.8122
hist3
Syntax hist3(X)
hist3(X,nbins)
hist3(X,ctrs)
hist3(X,'Edges',edges)
N = hist3(X,...)
[N,C] = hist3(X,...)
hist3(...,param1,val1,param2,val2,...)
Description hist3(X) bins the elements of the m-by-2 matrix X into a 10-by-10 grid
of equally spaced containers, and plots a histogram. Each column of X
corresponds to one dimension in the bin grid.
hist3(X,nbins) plots a histogram using an nbins(1)-by-nbins(2) grid
of bins. hist3(X,'Nbins',nbins) is equivalent to hist3(X,nbins).
hist3(X,ctrs), where ctrs is a two-element cell array of numeric
vectors with monotonically non-decreasing values, uses a 2-D grid
of bins centered on ctrs{1} in the first dimension and on ctrs{2}
in the second. hist3 assigns rows of X falling outside the range of
that grid to the bins along the outer edges of the grid, and ignores
rows of X containing NaNs. hist3(X,'Ctrs',ctrs) is equivalent to
hist3(X,ctrs).
hist3(X,'Edges',edges), where edges is a two-element cell array
of numeric vectors with monotonically non-decreasing values, uses a
2-D grid of bins with edges at edges{1} in the first dimension and at
edges{2} in the second. The (i, j)th bin includes the value X(k,:) if
Examples Example 1
Make a 3-D figure using a histogram with a density plot underneath:
load seamount
dat = [-y,x]; % Grid corrected for negative y-values
hold on
hist3(dat) % Draw histogram in 2D
n = hist3(dat); % Extract histogram data; default is a 10x10 bin grid
n1 = n';
n1(size(n,1)+1,size(n,2)+1) = 0;
% Generate grid for 2-D projected view of intensities
xb = linspace(min(dat(:,1)),max(dat(:,1)),size(n,1)+1);
yb = linspace(min(dat(:,2)),max(dat(:,2)),size(n,1)+1);
% Make a pseudocolor plot on this grid
h = pcolor(xb,yb,n1);
view(3);
Example 2
Use the car data to make a histogram on a 7-by-7 grid of bins.
load carbig
X = [MPG,Weight];
hist3(X,[7 7]);
xlabel('MPG'); ylabel('Weight');
hist3(X,[7 7],'FaceAlpha',.65);
xlabel('MPG'); ylabel('Weight');
set(gcf,'renderer','opengl');
Specify bin centers, different in each direction; get back counts, but
don’t make the plot.
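A hedged sketch (the bin centers are assumptions, not the manual's values):
ctrs = {20:5:45, 2000:500:4500}; % centers for MPG and Weight
N = hist3(X,ctrs);               % counts are returned and no plot is drawn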
Example 3
Make a histogram with bars colored according to height.
load carbig
X = [MPG,Weight];
hist3(X,[7 7]);
xlabel('MPG'); ylabel('Weight');
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode',...
'auto');
categorical.hist
Syntax hist(Y)
hist(Y,X)
hist(ax,...)
N = hist(...)
[N,X] = hist(...)
Description hist(Y) plots a histogram bar plot of the counts for each level of
the categorical vector Y. If Y is an m-by-n categorical matrix, hist
computes counts for each column of Y, and plots a group of n bars for
each categorical level.
hist(Y,X) plots bars only for the levels specified in X. X is a categorical
vector or a cell array of level names as strings.
hist(ax,...) plots into the axes with handle ax instead of gca.
N = hist(...) returns the counts for each categorical level. If Y is
a matrix, hist works down the columns of Y and returns a matrix of
counts with one column for each column of Y and one row for each
categorical level.
[N,X] = hist(...) returns the categorical levels corresponding to
each count in N, or corresponding to each column of N if Y is a matrix.
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
AgeGroup = droplevels(AgeGroup);
hist(AgeGroup)
histfit
Syntax histfit(data)
histfit(data,nbins)
histfit(data,nbins,dist)
h = histfit(...)
Description histfit(data) plots a histogram of the values in the vector data using
the number of bins equal to the square root of the number of elements
in data, then superimposes a fitted normal distribution.
histfit(data,nbins) uses nbins bins for the histogram.
histfit(data,nbins,dist) plots a histogram with a density from the
distribution specified by dist, one of the following strings:
• 'beta'
• 'birnbaumsaunders'
• 'exponential'
• 'extreme value' or 'ev'
• 'gamma'
• 'generalized extreme value' or 'gev'
• 'generalized pareto' or 'gp'
• 'inversegaussian'
• 'logistic'
• 'loglogistic'
• 'lognormal'
• 'nakagami'
• 'negative binomial' or 'nbin'
• 'normal' (default)
• 'poisson'
• 'rayleigh'
• 'rician'
• 'tlocationscale'
• 'weibull' or 'wbl'
Examples r = normrnd(10,1,100,1);
histfit(r)
h = get(gca,'Children');
set(h(2),'FaceColor',[.8 .8 1])
hmmdecode
Note The function hmmdecode begins with the model in state 1 at step
0, prior to the first emission. hmmdecode computes the probabilities in
PSTATES based on the fact that the model begins in state 1.
[seq,states] = hmmgenerate(100,trans,emis);
pStates = hmmdecode(seq,trans,emis);
[seq,states] = hmmgenerate(100,trans,emis,...
'Symbols',{'one','two','three','four','five','six'})
pStates = hmmdecode(seq,trans,emis,...
'Symbols',{'one','two','three','four','five','six'});
hmmestimate
Purpose Hidden Markov model parameter estimates from emissions and states
Markov model. If the i → j transition does not occur in states, you can
set PSEUDOTR(i,j) to be a positive number representing an estimate of
the expected number of such transitions in the sequence states.
[seq,states] = hmmgenerate(1000,trans,emis);
[estimateTR,estimateE] = hmmestimate(seq,states);
hmmgenerate
The length of both seq and states is len. TRANS(i,j) is the probability
of transition from state i to state j. EMIS(k,l) is the probability that
symbol l is emitted from state k.
[seq,states] = hmmgenerate(100,trans,emis)
[seq,states] = hmmgenerate(100,trans,emis,...
'Symbols',{'one','two','three','four','five','six'},...
'Statenames',{'fair';'loaded'})
hmmtrain
Tolerance
The input argument 'tolerance' controls how many steps the hmmtrain algorithm executes before the function returns an answer. The algorithm terminates when all of the following three quantities are less than the value that you specify for tolerance:
• The log likelihood that the input sequence seq is generated by the currently estimated values of the transition and emission matrices
• The change in the norm of the transition matrix, normalized by the size of the matrix
• The change in the norm of the emission matrix, normalized by the size of the matrix
maxiterations
The maximum number of iterations, 'maxiterations', controls the
maximum number of steps the algorithm executes before it terminates.
If the algorithm executes maxiter iterations before reaching the
specified tolerance, the algorithm terminates and the function returns a
warning. If this occurs, you can increase the value of 'maxiterations'
to make the algorithm reach the desired tolerance before terminating.
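For instance, both controls can be passed as name/value pairs (the numeric values here are illustrative):
[estTR,estE] = hmmtrain(seq,trans,emis,...
'tolerance',1e-5,'maxiterations',500);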
seq1 = hmmgenerate(100,trans,emis);
seq2 = hmmgenerate(200,trans,emis);
seqs = {seq1,seq2};
[estTR,estE] = hmmtrain(seqs,trans,emis);
hmmviterbi
Note The function hmmviterbi begins with the model in state 1 at step
0, prior to the first emission. hmmviterbi computes the most likely path
based on the fact that the model begins in state 1.
[seq,states] = hmmgenerate(100,trans,emis);
estimatedStates = hmmviterbi(seq,trans,emis);
[seq,states] = ...
hmmgenerate(100,trans,emis,...
'Statenames',{'fair';'loaded'});
estimatedStates = ...
hmmviterbi(seq,trans,emis,...
'Statenames',{'fair';'loaded'});
categorical.horzcat
Syntax C = horzcat(A,B,...)
dataset.horzcat
hougen
$$\hat{y} = \frac{\beta_1 x_2 - x_3/\beta_5}{1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3}$$
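For example, a hedged sketch evaluating the model at one point (the coefficient and predictor values are assumptions):
beta = [1.25 0.06 0.04 0.1 2]; % [beta1 ... beta5]
x = [470 300 10];              % one row of [x1 x2 x3]
yhat = hougen(beta,x)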
References [1] Bates, D. M., and D. G. Watts. Nonlinear Regression Analysis and
Its Applications. Hoboken, NJ: John Wiley & Sons, Inc., 1988.
hygecdf
Syntax hygecdf(X,M,K,N)
$$p = F(x \mid M,K,N) = \sum_{i=0}^{x} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}}$$
Examples Suppose you have a lot of 100 floppy disks and you know that 20 of them
are defective. What is the probability of drawing zero to two defective
floppies if you select 10 at random?
p = hygecdf(2,100,20,10)
p =
0.6812
hygeinv
Syntax hygeinv(P,M,K,N)
Examples Suppose you are the Quality Assurance manager for a floppy disk
manufacturer. The production line turns out floppy disks in batches of
1,000. You want to sample 50 disks from each batch to see if they have
defects. You want to accept 99% of the batches if there are no more
than 10 defective disks in the batch. What is the maximum number of
defective disks you should allow in your sample of 50?
x = hygeinv(0.99,1000,10,50)
x =
3
x = hygeinv(0.50,1000,10,50)
x =
0
hygepdf
Syntax Y = hygepdf(X,M,K,N)
$$y = f(x \mid M,K,N) = \frac{\binom{K}{x}\binom{M-K}{N-x}}{\binom{M}{N}}$$
Examples Suppose you have a lot of 100 floppy disks and you know that 20 of them
are defective. What is the probability of drawing 0 through 5 defective
floppy disks if you select 10 at random?
p = hygepdf(0:5,100,20,10)
p =
0.0951 0.2679 0.3182 0.2092 0.0841 0.0215
hygernd
Syntax R = hygernd(M,K,N)
R = hygernd(M,K,N,v)
R = hygernd(M,K,N,m,n)
hygestat
Description [MN,V] = hygestat(M,K,N) returns the mean of and variance for the
hypergeometric distribution with corresponding size of the population,
M, number of items with the desired characteristic in the population, K,
and number of samples drawn, N. Vector or matrix inputs for M, K, and
N must have the same size, which is also the size of MN and V. A scalar
input for M, K, or N is expanded to a constant matrix with the same
dimensions as the other inputs.
The mean of the hypergeometric distribution with parameters M, K, and
N is NK/M, and the variance is NK(M-K)(M-N)/[M^2(M-1)].
[m,v] = hygestat(10.^(1:4),10.^(0:3),9)
m =
0.9000 0.9000 0.9000 0.9000
v =
0.0900 0.7445 0.8035 0.8094
[m,v] = binostat(9,0.1)
m =
0.9000
v =
0.8100
icdf
Syntax Y = icdf(name,X,A)
Y = icdf(name,X,A,B)
Y = icdf(name,X,A,B,C)
Examples Compute the icdf of the normal distribution with mean 0 and standard
deviation 1 at inputs 0.1, 0.3, ..., 0.9:
x1 = icdf('Normal',0.1:0.2:0.9,0,1)
x1 =
-1.2816 -0.5244 0 0.5244 1.2816
x2 = icdf('Poisson',0.1:0.2:0.9,0:4)
x2 =
NaN 0 2 4 7
ProbDistUnivKernel.icdf
Syntax Y = icdf(PD, P)
ProbDistUnivParam.icdf
Syntax Y = icdf(PD, P)
piecewisedistribution.icdf
Syntax X = icdf(obj,P)
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
icdf(obj,p)
ans =
-1.7766
1.8432
inconsistent
Syntax Y = inconsistent(Z)
Y = inconsistent(Z,d)
Column Description
1 Mean of the heights of all the links included in the
calculation.
2 Standard deviation of the heights of all the links included
in the calculation.
3 Number of links included in the calculation.
4 Inconsistency coefficient.
For leaf nodes, nodes that have no further nodes under them, the
inconsistency coefficient is set to 0.
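The example below assumes a linkage tree Z; one hedged way to construct such a tree (the data-generation call is an assumption, so the W values shown may differ for other data):
X = gallery('uniformdata',[10 2],12); % sample data
Y = pdist(X);                         % pairwise distances
Z = linkage(Y,'single');              % hierarchical cluster tree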
W = inconsistent(Z,3)
W =
0.1313 0 1.0000 0
0.1386 0 1.0000 0
0.1463 0.0109 2.0000 0.7071
0.2391 0 1.0000 0
0.1951 0.0568 4.0000 0.9425
0.2308 0.0543 4.0000 0.9320
0.2395 0.0748 4.0000 0.7636
0.2654 0.0945 4.0000 0.9203
0.3769 0.0950 3.0000 1.1040
References [1] Jain, A., and R. Dubes. Algorithms for Clustering Data. Upper
Saddle River, NJ: Prentice-Hall, 1988.
ProbDist.InputData property
• data
• cens
• freq
Values Possible values for the three fields in the structure are any data supplied to the fitdist function.
Use this information to view and compare the data supplied to create
distributions.
categorical.int8
Syntax B = int8(A)
See Also For more information on signed integers, see “Integers” in the MATLAB
documentation.
double, uint8
categorical.int16
Syntax B = int16(A)
See Also For more information on signed integers, see “Integers” in the MATLAB
documentation.
double, uint16
categorical.int32
Syntax B = int32(A)
See Also For more information on signed integers, see “Integers” in the MATLAB
documentation.
double, uint32
categorical.int64
Syntax B = int64(A)
See Also For more information on signed integers, see “Integers” in the MATLAB
documentation.
double, uint64
interactionplot
Syntax interactionplot(Y,GROUP)
interactionplot(Y,GROUP,'varnames',VARNAMES)
[h,AX,bigax] = interactionplot(...)
Examples Display interaction plots for data with four 3-level factors named 'A',
'B','C', and 'D':
y = randn(1000,1); % response
group = ceil(3*rand(1000,4)); % four 3-level factors
interactionplot(y,group,'varnames',{'A','B','C','D'})
categorical.intersect
Syntax C = intersect(A,B)
invpred
Syntax X0 = invpred(X,Y,Y0)
[X0,DXLO,DXUP] = invpred(X,Y,Y0)
[X0,DXLO,DXUP] = invpred(X,Y,Y0,name1,val1,name2,val2,...)
Name Value
'alpha' A value between 0 and 1 specifying a
confidence level of 100*(1-alpha)%. Default
is alpha=0.05 for 95% confidence.
'predopt' Either 'observation', the default value to
compute the intervals for X0 at which a new
observation could equal Y0, or 'curve' to
compute intervals for the X0 value at which
the curve is equal to Y0.
Examples x = 4*rand(25,1);
y = 10 + 5*x + randn(size(x));
scatter(x,y)
x0 = invpred(x,y,20)
categorical.ipermute
Syntax A = ipermute(B,order)
iqr
Syntax y = iqr(X)
iqr(X,dim)
Description y = iqr(X) returns the interquartile range of the values in X. For vector
input, y is the difference between the 75th and the 25th percentiles
of the sample in X. For matrix input, y is a row vector containing the
interquartile range of each column of X. For N-dimensional arrays, iqr
operates along the first nonsingleton dimension of X.
iqr(X,dim) calculates the interquartile range along the dimension
dim of X.
Remarks The IQR is a robust estimate of the spread of the data, since changes in
the upper and lower 25% of the data do not affect it. If there are outliers
in the data, then the IQR is more representative than the standard
deviation as an estimate of the spread of the body of the data. The IQR
is less efficient than the standard deviation as an estimate of the spread
when the data is all from the normal distribution.
Multiply the IQR by 0.7413 to estimate σ (the second parameter of the normal distribution).
Examples This Monte Carlo simulation shows the relative efficiency of the IQR to
the sample standard deviation for normal data.
x = normrnd(0,1,100,100);
s = std(x);
s_IQR = 0.7413*iqr(x);
efficiency = (norm(s-1)./norm(s_IQR-1)).^2
efficiency =
0.3297
ProbDistUnivKernel.iqr
Syntax Y = iqr(PD)
ProbDistUnivParam.iqr
Syntax Y = iqr(PD)
classregtree.isbranch
Syntax ib = isbranch(t)
ib = isbranch(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
ib = isbranch(t)
ib =
1
0
1
1
0
1
0
0
0
categorical.isempty
Syntax TF = isempty(A)
dataset.isempty
Syntax tf = isempty(A)
Description tf = isempty(A) returns true (1) if A is an empty dataset and false (0)
otherwise. An empty array has no elements, that is prod(size(A))==0.
categorical.isequal
Syntax TF = isequal(A,B)
TF = isequal(A,B,C,...)
Description TF = isequal(A,B) is true (1) if the categorical arrays A and B are the
same class, have the same size and the same sets of levels, and contain
the same values, and false (0) otherwise.
TF = isequal(A,B,C,...) is true (1) if all the input arguments are
equal.
Elements with undefined levels are not considered equal to each other.
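A brief sketch (the values are assumptions): two nominal arrays with the same size, the same set of levels, and the same values compare as equal.
A = nominal({'lo','hi','lo'});
B = nominal({'lo','hi','lo'});
tf = isequal(A,B)   % returns 1 (true)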
categorical.islevel
Syntax I = islevel(levels,A)
Examples Display age levels in the data in hospital.mat, before and after dropping
unoccupied levels:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
disp(labels')
'0s' '10s' '20s' '30s' '40s' '50s' '60s' '70s' '80s' '90s'
AgeGroup = ordinal(hospital.Age,labels,[],edges);
I = islevel(labels,AgeGroup);
disp(I')
1 1 1 1 1 1 1 1 1 1
AgeGroup = droplevels(AgeGroup);
I = islevel(labels,AgeGroup);
disp(I')
0 0 1 1 1 1 0 0 0 0
ordinal.ismember
Syntax I = ismember(A,levels)
[I,IDX] = ismember(A,levels)
Examples Example 1
For nominal data:
load hospital
sex = hospital.Sex; % Nominal
smokers = hospital.Smoker; % Logical
I = ismember(sex(smokers),'Female');
I(1:5)
ans =
0
1
0
0
0
I = (sex(smokers) == 'Female');
Example 2
For ordinal data:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
I = ismember(AgeGroup(1:5),{'20s','30s'})
I =
1
0
1
0
0
categorical.ismember
Syntax TF = ismember(A,levels)
[TF,LOC] = ismember(A,levels)
categorical.isscalar
Syntax TF = isscalar(A)
categorical.isundefined
Syntax I = isundefined(A)
A = ordinal([1 2 3 2 1],{'lo','med','hi'})
A =
lo med hi med lo
A = droplevels(A,{'med','hi'})
Warning: OLDLEVELS contains categorical levels that
were present in A, caused some array elements to
have undefined levels.
A =
lo <undefined> <undefined> <undefined> lo
I = isundefined(A)
I =
0 1 1 1 0
qrandstream.isvalid
Syntax tf = isvalid(h)
categorical.isvector
Syntax TF = isvector(A)
gmdistribution.Iters property
iwishrnd
Syntax W = iwishrnd(Tau,df)
W = iwishrnd(Tau,df,DI)
[W,DI] = iwishrnd(Tau,df)
jackknife
s = x;
s(i,:) = [];
jackstat(i,:) = jackfun(s);
Examples Estimate the bias of the MLE variance estimator of random samples
taken from the vector y using jackknife. The bias has a known
formula in this problem, so you can compare the jackknife value to
this formula.
y = exprnd(5,100,1);
m = jackknife(@var,y,1);
n = length(y);
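The remainder of this example did not survive extraction; a plausible completion, assuming the known bias formula -(true variance)/n for the MLE variance estimator (exprnd(5) has true variance 5^2 = 25):
bias = -25/n                         % known bias of the MLE variance estimator
jbias = (n-1)*(mean(m)-var(y,1))     % jackknife estimate of the bias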
jbtest
Syntax h = jbtest(x)
h = jbtest(x,alpha)
[h,p] = jbtest(...)
[h,p,jbstat] = jbtest(...)
[h,p,jbstat,critval] = jbtest(...)
[h,p,...] = jbtest(x,alpha,mctol)
The Jarque-Bera test statistic is

$$JB = \frac{n}{6}\left( s^2 + \frac{(k-3)^2}{4} \right)$$

where n is the sample size, s is the sample skewness, and k is the sample kurtosis.
Examples Use jbtest to determine if car mileage, in miles per gallon (MPG),
follows a normal distribution across different makes of cars:
load carbig
[h,p] = jbtest(MPG)
h =
1
p =
0.0022
The p value is below the default significance level of 5%, and the test
rejects the null hypothesis that the distribution is normal.
With a log transformation, the distribution becomes closer to normal,
but the p value is still well below 5%:
[h,p] = jbtest(log(MPG))
h =
1
p =
0.0078
[h,p] = jbtest(log(MPG),0.0075)
h =
0
p =
0.0078
References [1] Jarque, C. M., and A. K. Bera. “A test for normality of observations
and regression residuals.” International Statistical Review. Vol. 55,
No. 2, 1987, pp. 163–172.
johnsrnd
Syntax r = johnsrnd(quantiles,m,n)
r = johnsrnd(quantiles)
[r,type] = johnsrnd(...)
[r,type,coefs] = johnsrnd(...)
Examples Generate random values with longer tails than a standard normal:
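The code for this example was lost in extraction; a plausible sketch, with the quantile values chosen as assumptions: pair standard normal z-values with heavier-tailed target quantiles and sample from the fitted Johnson-system distribution.
qnorm = [-1.5 -0.5 0.5 1.5];       % standard normal z-values
q     = [-2.5 -0.6 0.6 2.5];       % heavier-tailed target quantiles
r = johnsrnd([qnorm; q],1000,1);   % draw 1000 values
qqplot(r)                          % tails fall away from the normal reference line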
Generate random values that match some sample data well in the
right-hand tail:
load carbig;
qnorm = [.5 1 1.5 2];
q = quantile(Acceleration, normcdf(qnorm));
r = johnsrnd([qnorm;q],1000,1);
[q;quantile(r,normcdf(qnorm))]
ans =
16.7000 18.2086 19.5376 21.7263
16.8190 18.2474 19.4492 22.4156
[r,type,coefs] = johnsrnd([qnorm;q],0)
r =
[]
type =
SU
coefs =
1.0920 0.5829 18.4382 1.4494
dataset.join
Syntax C = join(A,B)
C = join(A,B,key)
C = join(A,B,param1,val1,param2,val2,...)
[C,IB] = join(...)
C = join(A,B,'Type',TYPE,...)
[C,IA,IB] = join(A,B,'Type',TYPE,...)
You may provide either the 'Keys' parameter, or both the 'LeftKeys'
and 'RightKeys' parameters. The value for each of these parameters
is a positive integer, a variable name, a cell array containing a
variable name, or a logical vector with one true entry. 'LeftKeys' and
'RightKeys' must specify the same number of key variables, and
join pairs the left and right keys in the order specified.
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
Create a separate dataset array with the diploid chromosome counts for
each species of iris:
snames = nominal({'setosa';'versicolor';'virginica'});
CC = dataset({snames,'species'},{[38;108;70],'cc'})
CC =
species cc
setosa 38
versicolor 108
virginica 70
Broadcast the data in CC to the rows of iris using the key variable
species in each dataset:
iris2 = join(iris,CC);
iris2([1 2 51 52 101 102],:)
ans =
species SL SW PL PW cc
Obs1 setosa 5.1 3.5 1.4 0.2 38
Create two datasets and join them using the 'MergeKeys' flag:
% Create two data sets that both contain the key variable
% 'Key1'. The two arrays contain observations with common
% values of Key1, but each array also contains observations
% with values of Key1 not present in the other.
a = dataset({'a' 'b' 'c' 'e' 'h'}',[1 2 3 11 17]',...
'VarNames',{'Key1' 'Var1'})
b = dataset({'a' 'b' 'd' 'e'}',[4 5 6 7]',...
'VarNames',{'Key1' 'Var2'})
KDTreeSearcher class
Superclasses NeighborSearcher
Name/Value Pairs
KDTreeSearcher and createns accept one or more of the following
optional name/value pairs as input:
Distance
A string specifying the default distance metric used when you call
the knnsearch method.
Properties X
A matrix used to create the object
Distance
A string specifying a built-in distance metric that you provide
when you create the object. This property is the default distance
metric used when you call the knnsearch method to find nearest
neighbors for future query points.
DistParameter
Specifies the additional parameter for the chosen distance metric.
The value is:
load fisheriris
x = meas(:,3:4);
kdtreeobj = kdtreesearcher(x,'distance','minkowski')
kdtreeobj =
KDTreeSearcher
Properties:
BucketSize: 50
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 2
load fisheriris
x = meas(:,3:4);
kdtreeobj = createns(x,'NsMethod','kdtree',...
'distance','minkowski')
kdtreeobj =
KDTreeSearcher
Properties:
BucketSize: 50
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 2
For more in-depth examples using the knnsearch method, see the
method reference page or see “Example: Classifying Query Data Using
knnsearch” on page 12-22.
References [1] Friedman, J. H., Bentley, J., and Finkel, R. A. (1977). An Algorithm
for Finding Best Matches in Logarithmic Expected Time, ACM
Transactions on Mathematical Software 3, 209.
ProbDistKernel.Kernel property
Values 'normal'
'box'
'triangle'
'epanechnikov'
Use this information to view and compare the kernel smoothing
function used to create distributions.
kmeans
Description IDX = kmeans(X,k) partitions the points in the n-by-p data matrix X
into k clusters. This iterative partitioning minimizes the sum, over
all clusters, of the within-cluster sums of point-to-cluster-centroid
distances. Rows of X correspond to points, columns correspond to
variables. kmeans returns an n-by-1 vector IDX containing the cluster
indices of each point. By default, kmeans uses squared Euclidean
distances.
[IDX,C] = kmeans(X,k) returns the k cluster centroid locations in
the k-by-p matrix C.
[IDX,C,sumd] = kmeans(X,k) returns the within-cluster sums of
point-to-centroid distances in the 1-by-k vector sumd.
[IDX,C,sumd,D] = kmeans(X,k) returns distances from each point to
every centroid in the n-by-k matrix D.
[...] = kmeans(...,param1,val1,param2,val2,...) enables
you to specify optional parameter/value pairs to control the iterative
algorithm used by kmeans. Valid parameter strings are listed in the
following table.
Parameter Value
'distance' Distance measure, in p-dimensional space. kmeans
minimizes with respect to this parameter. kmeans
computes centroid clusters differently for the
different supported distance measures.
'sqEuclidean' Squared Euclidean distance
(default). Each centroid is the
mean of the points in that cluster.
'cityblock' Sum of absolute differences, i.e.,
the L1 distance. Each centroid
is the component-wise median of
the points in that cluster.
'cosine' One minus the cosine of the
included angle between points
(treated as vectors). Each
centroid is the mean of the points
in that cluster, after normalizing
those points to unit Euclidean
length.
'correlation' One minus the sample correlation
between points (treated as
sequences of values). Each
centroid is the component-wise
mean of the points in that cluster,
after centering and normalizing
those points to zero mean and
unit standard deviation.
'Hamming' Percentage of bits that differ (only
suitable for binary data). Each
centroid is the component-wise
median of points in that cluster.
'emptyaction' Action to take if a cluster loses all its member
observations.
'error' Treat an empty cluster as an
error (default).
'drop' Remove any clusters that
become empty. kmeans sets the
corresponding return values in C
and D to NaN.
'singleton' Create a new cluster consisting
of the one point furthest from its
centroid.
'onlinephase' Flag indicating whether kmeans should perform an
online update phase in addition to a batch update
phase. The online phase can be time consuming
for large data sets, but guarantees a solution that
is a local minimum of the distance criterion, that
is, a partition of the data where moving any single
point to a different cluster increases the total sum
of distances.
'on' Perform online update (default).
'off' Do not perform online update.
'options' Options for the iterative algorithm used to minimize
the fitting criterion, as created by statset.
'replicates' Number of times to repeat the clustering, each
with a new set of initial cluster centroid positions.
kmeans returns the solution with the lowest value
for sumd. You can supply 'replicates' implicitly
by supplying a 3D array as the value for the
'start' parameter.
'start' Method used to choose the initial cluster centroid
positions, sometimes known as seeds.
'sample' Select k observations from X at
random (default).
'uniform' Select k points uniformly at
random from the range of X. Not
valid with Hamming distance.
'cluster' Perform a preliminary clustering
phase on a random 10%
subsample of X. This preliminary
phase is itself initialized using
'sample'.
Matrix k-by-p matrix of centroid starting
locations. In this case, you can
pass in [] for k, and kmeans infers
k from the first dimension of the
matrix. You can also supply a 3-D
array, implying a value for the
'replicates' parameter from
the array’s third dimension.
1 The first phase uses batch updates, where each iteration consists
of reassigning points to their nearest cluster centroid, all at once,
followed by recalculation of cluster centroids. This phase occasionally
does not converge to a solution that is a local minimum, that is, a
partition of the data where moving any single point to a different
cluster increases the total sum of distances. This is more likely
for small data sets. The batch phase is fast, but potentially only
approximates a solution as a starting point for the second phase.
2 The second phase uses online updates, where points are individually
reassigned if doing so will reduce the sum of distances, and cluster
centroids are recomputed after each reassignment. Each iteration
during the second phase consists of one pass through all the points.
The second phase will converge to a local minimum, although there
may be other local minima with lower total sum of distances. The
problem of finding the global minimum can only be solved in general
by an exhaustive (or clever, or lucky) choice of starting points, but
using several replicates with random starting points typically results
in a solution that is a global minimum.
Examples The following creates two clusters from separated random data:
X = [randn(100,2)+ones(100,2);...
randn(100,2)-ones(100,2)];
opts = statset('Display','final');
[idx,ctrs] = kmeans(X,2,...
'Distance','city',...
'Replicates',5,...
'Options',opts);
5 iterations, total sum of distances = 284.671
4 iterations, total sum of distances = 284.671
4 iterations, total sum of distances = 284.671
3 iterations, total sum of distances = 284.671
3 iterations, total sum of distances = 284.671
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(ctrs(:,1),ctrs(:,2),'kx',...
'MarkerSize',12,'LineWidth',2)
plot(ctrs(:,1),ctrs(:,2),'ko',...
'MarkerSize',12,'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
ExhaustiveSearcher.knnsearch
load fisheriris
x = meas(:,3:4);
exhaustiveobj = ExhaustiveSearcher(x,'Distance','cosine')
exhaustiveobj =
ExhaustiveSearcher
Properties:
X: [150x2 double]
Distance: 'cosine'
DistParameter: []
KDTreeSearcher.knnsearch
load fisheriris
x = meas(:,3:4);
kdtreeNS = KDTreeSearcher(x,'Distance','minkowski','P',5)
kdtreeNS =
KDTreeSearcher
Properties:
BucketSize: 50
X: [150x2 double]
Distance: 'minkowski'
DistParameter: 5
% Only the trailing arguments of this call survived extraction; a
% plausible completion (query matrix y and output names are assumptions):
[idx,dist] = knnsearch(kdtreeNS,y,'k',10,'distance','chebychev');
References [1] Friedman, J. H., Bentley, J., and Finkel, R. A. (1977). An Algorithm
for Finding Best Matches in Logarithmic Expected Time, ACM
Transactions on Mathematical Software 3, 209.
knnsearch
Description IDX = knnsearch(X,Y) finds the nearest neighbor in X for each point
in Y. X is an mx-by-n matrix and Y is an my-by-n matrix. Rows of X and
Y correspond to observations and columns correspond to variables. IDX
is a column vector with my rows. Each row in IDX contains the index of
nearest neighbor in X for the corresponding row in Y.
[IDX,Dist] = knnsearch(X,Y) returns an my-by-1 vector Dist
containing the distances between each observation in Y and the
corresponding closest observation in X. That is, Dist(i) is the distance
between X(IDX(i),:) and Y(i,:).
[IDX,Dist] = knnsearch(X,Y,'Name',Value) accepts one or more
optional comma-separated name/value pairs. Specify Name inside single
quotes.
knnsearch does not save a search object. To create a search object,
use createns.
Distance
A string or a function handle specifying the distance metric. The
value can be one of the following:
Examples Find the 10 nearest neighbors in x to each point in y using first the
'minkowski' distance metric with a p value of 5, and then using the
'chebychev' distance metric. Visually compare the results:
load fisheriris
x = meas(:,3:4);
y = [5 1.45;6 2;2.75 .75];
'linestyle','none','markersize',10)
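Only a fragment of the original plotting code survives above; a plausible reconstruction of the two searches the text describes (output names are assumptions):
[mIdx,mD] = knnsearch(x,y,'k',10,'distance','minkowski','p',5);
[cIdx,cD] = knnsearch(x,y,'k',10,'distance','chebychev');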
References [1] Friedman, J. H., Bentley, J., and Finkel, R. A. (1977). An Algorithm
for Finding Best Matches in Logarithmic Expected Time, ACM
Transactions on Mathematical Software 3, 209.
kruskalwallis
Syntax p = kruskalwallis(X)
p = kruskalwallis(X,group)
p = kruskalwallis(X,group,displayopt)
[p,table] = kruskalwallis(...)
[p,table,stats] = kruskalwallis(...)
The entries in the ANOVA table are the usual sums of squares, degrees
of freedom, and other quantities calculated on the ranks. The usual F
statistic is replaced by a chi-square statistic. The p value measures the
significance of the chi-square statistic.
The second figure displays box plots of each column of X (not the ranks
of X).
p = kruskalwallis(X,group) uses the values in group (a character
array or cell array) as labels for the box plot of the samples in X, when
X is a matrix. Each row of group contains the label for the data in the
corresponding column of X, so group must have length equal to the
number of columns in X. (See “Grouped Data” on page 2-34.)
When X is a vector, kruskalwallis performs a Kruskal-Wallis test on
the samples contained in X, as indexed by input group (a categorical
variable, vector, character array, or cell array). Each element in group
identifies the group (i.e., sample) to which the corresponding element in
vector X belongs, so group must have the same length as X. The labels
contained in group are also used to annotate the box plot.
It is not necessary to label samples sequentially (1, 2, 3, ...). For
example, if X contains measurements taken at three different
temperatures, -27°, 65°, and 110°, you could use these numbers as the
sample labels in group. If a row of group contains an empty cell or
empty string, that row and the corresponding observation in X are
disregarded. NaNs in either input are similarly ignored.
p = kruskalwallis(X,group,displayopt) enables the table and box
plot displays when displayopt is 'on' (default) and suppresses the
displays when displayopt is 'off'.
[p,table] = kruskalwallis(...) returns the ANOVA table
(including column and row labels) in cell array table.
[p,table,stats] = kruskalwallis(...) returns a stats structure
that you can use to perform a follow-up multiple comparison test. The
kruskalwallis test evaluates the hypothesis that all samples come
from populations that have the same median, against the alternative
that the medians are not all the same.
Assumptions
The Kruskal-Wallis test makes the following assumptions about the
data in X:
• All sample populations have the same continuous distribution, apart from a possibly different location.
• All observations are mutually independent.
The classical one-way ANOVA test replaces the first assumption with
the stronger assumption that the populations have normal distributions.
Examples This example compares the material strength study used with the
anova1 function, to see if the nonparametric Kruskal-Wallis procedure
leads to the same conclusion. The example studies the strength of
beams made from three alloys:
alloy = {'st','st','st','st','st','st','st','st',...
'al1','al1','al1','al1','al1','al1',...
'al2','al2','al2','al2','al2','al2'};
anova1(strength,alloy,'off')
ans =
1.5264e-004
kruskalwallis(strength,alloy,'off')
ans =
0.0018
Both tests find that the three alloys are significantly different, though
the result is less significant according to the Kruskal-Wallis test. It
is typical that when a data set has a reasonable fit to the normal
distribution, the classical ANOVA test is more sensitive to differences
between groups.
To understand when a nonparametric test may be more appropriate,
let’s see how the tests behave when the distribution is not normal. You
can simulate this by replacing one of the values by an extreme value
(an outlier).
strength(20)=120;
anova1(strength,alloy,'off')
ans =
0.2501
kruskalwallis(strength,alloy,'off')
ans =
0.0060
Now the classical ANOVA test does not find a significant difference, but
the nonparametric procedure does. This illustrates one of the properties
of nonparametric procedures: they are often not severely affected by
changes in a small portion of the data.
ksdensity
Parameter Value
'censoring' A logical vector of the same length as x, indicating
which entries are censoring times. Default is no
censoring.
'kernel' The type of kernel smoother to use. Choose the
value as 'normal' (default), 'box', 'triangle', or
'epanechnikov'.
Alternatively, you can specify some other function,
as a function handle or as a string, e.g., @normpdf
or 'normpdf'. The function must take a single
argument that is an array of distances between data
values and places where the density is evaluated. It
must return an array of the same size containing
corresponding values of the kernel function.
'npoints' The number of equally spaced points in xi. Default
is 100.
'support' • 'unbounded' allows the density to extend over the
whole real line (default).
• 'positive' restricts the density to positive
values.
• A two-element vector gives finite lower and upper
bounds for the support of the density.
'weights' Vector of the same length as x, assigning weight to
each x value.
'width' The bandwidth of the kernel-smoothing window. The
default is optimal for estimating normal densities,
but you may want to choose a smaller value to reveal
features such as multiple modes.
'function' The function type to estimate, chosen from among
'pdf', 'cdf', 'icdf', 'survivor', or 'cumhazard'
for the density, cumulative probability, inverse
cumulative probability, survivor, or cumulative
hazard functions, respectively.
For 'icdf',
f=ksdensity(x,yi,...,'function','icdf')
computes the estimated inverse CDF of the values
in x, and evaluates it at the probability values
specified in yi.
In place of the kernel functions listed above, you can specify another
kernel function by using @ (such as @normpdf) or quotes (such as
'normpdf'). ksdensity calls the function with a single argument that
is an array containing distances between data values in x and locations
in xi where the density is evaluated. The function must return an array
of the same size containing corresponding values of the kernel function.
When the 'function' parameter value is 'pdf', this kernel function
returns density values, otherwise it returns cumulative probability
values. Specifying a custom kernel when the 'function' parameter
value is 'icdf' returns an error.
If the 'support' parameter is 'positive', ksdensity transforms x
using a log function, estimates the density of the transformed values,
and transforms back to the original scale. If 'support' is a vector
[L U], ksdensity uses the transformation log((X-L)/(U-X)). The
'width' parameter and u outputs are on the scale of the transformed
values.
Examples Generate a mixture of two normal distributions and plot the estimated
density:
x = [randn(30,1); 5+randn(30,1)];
[f,xi] = ksdensity(x);
plot(xi,f);
x = [randn(30,1); 5+randn(30,1)];
xi = linspace(-10,15,201);
f = ksdensity(x,xi,'function','cdf');
plot(xi,f);
x = [randn(30,1); 5+randn(30,1)];
yi = linspace(.01,.99,99);
g = ksdensity(x,yi,'function','icdf');
plot(yi,g);
References [1] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for
Data Analysis. New York: Oxford University Press, 1997.
kstest
Syntax h = kstest(x)
h = kstest(x,CDF)
h = kstest(x,CDF,alpha)
h = kstest(x,CDF,alpha,type)
[h,p,ksstat,cv] = kstest(...)
$$\max_x \left( \left| F(x) - G(x) \right| \right)$$
where F(x) is the empirical cdf and G(x) is the standard normal cdf.
h = kstest(x,CDF) compares the distribution of x to the hypothesized
continuous distribution defined by CDF, which is either a two-column
matrix or a ProbDist object of the ProbDistUnivParam class or
ProbDistUnivKernel class. When CDF is a matrix, column 1 contains
a set of possible x values, and column 2 contains the corresponding
hypothesized cumulative distribution function values G(x). If possible,
define CDF so that column 1 contains the values in x. If there are
values in x not found in column 1 of CDF, kstest approximates G(x)
by interpolation. All values in x must lie in the interval between the
smallest and largest values in the first column of CDF. If the second
argument is empty ([]), kstest uses the standard normal distribution.
The Kolmogorov-Smirnov test requires that CDF be predetermined. It
is not accurate if CDF is estimated from the data. To test x against a
normal distribution without specifying the parameters, use lillietest
instead.
x = -2:1:4
x =
-2 -1 0 1 2 3 4
[h,p,k,c] = kstest(x,[],0.05,0)
h =
0
p =
0.13632
k =
0.41277
c =
0.48342
The test fails to reject the null hypothesis that the values come from a
standard normal distribution. This illustrates the difficulty of testing
normality in small samples. (The Lilliefors test, implemented by the
Statistics Toolbox function lillietest, may be more appropriate.)
The following figure illustrates the test statistic:
xx = -3:.1:5;
F = cdfplot(x);
hold on
G = plot(xx,normcdf(xx),'r-');
set(F,'LineWidth',2)
set(G,'LineWidth',2)
legend([F G],...
'Empirical','Standard Normal',...
'Location','NW')
[h,p,k] = kstest(x,[],0.05,'smaller')
h =
0
p =
0.068181
k =
0.41277
The test statistic is the same as before, but the p value is smaller.
[h,p,k] = kstest(x,[],0.05,'larger')
h =
0
p =
0.77533
k =
0.12706
kstest2
Syntax h = kstest2(x1,x2)
h = kstest2(x1,x2,alpha,type)
[h,p] = kstest2(...)
[h,p,ks2stat] = kstest2(...)
x = -1:1:5
y = randn(20,1);
[h,p,k] = kstest2(x,y)
h =
0
p =
0.0774
k =
0.5214
F1 = cdfplot(x);
hold on
F2 = cdfplot(y)
set(F1,'LineWidth',2,'Color','r')
set(F2,'LineWidth',2)
legend([F1 F2],'F1(x)','F2(x)','Location','NW')
kurtosis
Purpose Kurtosis
Syntax k = kurtosis(X)
k = kurtosis(X,flag)
k = kurtosis(X,flag,dim)
$$k = \frac{E(x-\mu)^4}{\sigma^4}$$
where μ is the mean of x, σ is the standard deviation of x, and E(t)
represents the expected value of the quantity t. kurtosis computes a
sample version of this population value.
$$k_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right)^2}$$

When you set flag to 0, the following equation applies:

$$k_0 = \frac{n-1}{(n-2)(n-3)}\left( (n+1)\,k_1 - 3(n-1) \right) + 3$$
This bias-corrected formula requires that X contain at least four
elements.
k = kurtosis(X)
k =
2.1658 1.2967 1.6378 1.9589
categorical.labels property
qrandstream.le
Syntax h1 <= h2
Description Handles are equal if they are handles for the same object. All
comparisons use a number associated with each handle object. Nothing
can be assumed about the result of a handle comparison except that the
repeated comparison of two handles in the same MATLAB session will
yield the same result. The order of handle values is purely arbitrary
and has no connection to the state of the handle objects being compared.
h1 <= h2 performs element-wise comparisons between handle arrays
h1 and h2. h1 and h2 must be of the same dimensions unless one is a
scalar. The result is a logical array of the same dimensions, where each
element is an element-wise <= result.
If one of h1 or h2 is scalar, scalar expansion is performed and the result
will match the dimensions of the array that is not scalar.
tf = le(h1, h2) stores the result in a logical array of the same
dimensions.
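A short sketch (the stream type and dimension are assumptions): comparing two quasi-random stream handles yields an arbitrary but session-stable ordering.
s1 = qrandstream('sobol',2);
s2 = qrandstream('sobol',2);
tf = le(s1,s2)   % same result as s1 <= s2 for these two handles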
qrandset.Leap property
Description Number of points to leap over and omit for each point taken from the
sequence. The Leap property of a point set contains a positive integer
which specifies the number of points in the sequence to leap over
and omit for every point taken. The default Leap value is 0, which
corresponds to taking every point from the sequence.
Leaping is a technique used to improve the quality of a point set.
However, you must choose the Leap values with care; many Leap values
create sequences that fail to touch on large sub-hyper-rectangles of the
unit hypercube, and so fail to be a uniform quasi-random point set.
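A brief sketch of setting the property at construction (the dimension and Leap value are assumptions): a Leap of 100 takes every 101st point of the underlying sequence.
p = sobolset(2,'Leap',100);
p.Leap        % returns 100
X = net(p,4)  % first four points of the leaped sequence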
dataset.length
Syntax n = length(A)
qrandset.length
Syntax length(p)
categorical.length
Syntax n = length(A)
categorical.levelcounts
Syntax C = levelcounts(A)
C = levelcounts(A,dim)
Examples Count the number of patients in each age group in the data in
hospital.mat:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
disp(labels')
'0s' '10s' '20s' '30s' '40s' '50s' '60s' '70s' '80s' '90s'
AgeGroup = ordinal(hospital.Age,labels,[],edges);
I = islevel(labels,AgeGroup);
disp(I')
0 1 1 1 1 1 1 1 1 1
c = levelcounts(AgeGroup);
disp(c')
0 0 15 41 42 2 0 0 0 0
AgeGroup = droplevels(AgeGroup);
I = islevel(labels,AgeGroup);
disp(I')
0 0 1 1 1 1 0 0 0 0
c = levelcounts(AgeGroup);
disp(c')
15 41 42 2
leverage
Purpose Leverage
Syntax h = leverage(data)
h = leverage(data,model)
Examples One rule of thumb is to compare the leverage to 2p/n where n is the
number of observations and p is the number of parameters in the model.
For the Hald data set this value is 0.7692.
load hald
h = max(leverage(ingredients,'linear'))
h =
0.7004
Since 0.7004 < 0.7692, there are no high leverage points using this rule.
lhsdesign
Syntax X = lhsdesign(n,p)
X = lhsdesign(...,'smooth','off')
X = lhsdesign(...,'criterion',criterion)
X = lhsdesign(...,'iterations',k)
Criterion Description
'none' No iteration
'maximin' Maximize minimum distance between points
'correlation' Reduce correlation
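A minimal usage sketch (the design size and iteration count are assumptions): generate a 10-point design in four variables, iterating to maximize the minimum inter-point distance.
X = lhsdesign(10,4,'criterion','maximin','iterations',20);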
lhsnorm
Syntax X = lhsnorm(mu,sigma,n)
X = lhsnorm(mu,sigma,n,flag)
[X,Z] = lhsnorm(...)
lillietest
Syntax h = lillietest(x)
h = lillietest(x,alpha)
h = lillietest(x,alpha,distr)
[h,p] = lillietest(...)
[h,p,kstat] = lillietest(...)
[h,p,kstat,critval] = lillietest(...)
[h,p,...] = lillietest(x,alpha,distr,mctol)
The test statistic is $\max_x\left(|SCDF(x) - CDF(x)|\right)$, where SCDF is the empirical cdf estimated from the sample and CDF is
the normal cdf with mean and standard deviation equal to the mean
and standard deviation of the sample.
lillietest uses a table of critical values computed using Monte Carlo
simulation for sample sizes less than 1000 and significance levels
between 0.001 and 0.50. The table is larger and more accurate than the
table introduced by Lilliefors. Critical values for a test are computed
by interpolating into the table, using an analytic approximation when
extrapolating for larger sample sizes.
Examples Use lillietest to determine if car mileage, in miles per gallon (MPG),
follows a normal distribution across different makes of cars:
load carbig.mat
[h,p] = lillietest(MPG)
Warning: P is less than the smallest tabulated value, returning 0.001.
h =
1
p =
1.0000e-003
[h,p] = lillietest(MPG,0.05,'norm',1e-4)
h =
1
p =
8.3333e-006
linhyptest
Syntax p = linhyptest(beta,COVB,c,H,dfe)
[p,t,r] = linhyptest(...)
• COVB = eye(k)
• c = zeros(k,1)
• H = eye(K)
• dfe = Inf
Note The following functions return outputs suitable for use as the
COVB input argument to linhyptest: nlinfit, coxphfit, glmfit,
mnrfit, regstats, robustfit. nlinfit returns COVB directly; the other
functions return COVB in stats.covb.
load hald
stats = regstats(heat,ingredients,'linear');
beta = stats.beta
beta =
62.4054
1.5511
0.5102
0.1019
-0.1441
SIGMA = stats.covb;
dfe = stats.fstat.dfe;
H = [0 0 0 1 0;0 0 0 0 1];
c = [0;0];
[p,F] = linhyptest(beta,SIGMA,c,H,dfe)
p =
0.4668
F =
0.8391
linkage
Syntax Z = linkage(y)
Z = linkage(y,method)
Z = linkage(X,method,metric)
Z = linkage(X,method,inputs)
Method Description
'average' Unweighted average distance (UPGMA).
'centroid' Centroid distance (UPGMC). Y must contain
Euclidean distances.
'complete' Furthest distance.
'median' Weighted center of mass distance (WPGMC). Y must
contain Euclidean distances.
'single' Shortest distance. This is the default.
'ward' Inner squared distance (minimum variance
algorithm). Y must contain Euclidean distances.
'weighted' Weighted average distance (WPGMA).
Linkages
The following notation is used to describe the linkages used by the
various methods:
• Average linkage uses the average distance between all pairs of objects
in any two clusters:

$$d(r,s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \mathrm{dist}\left(x_{ri}, x_{sj}\right)$$

• Centroid linkage uses the Euclidean distance between the centroids of
the two clusters:

$$d(r,s) = \left\| \bar{x}_r - \bar{x}_s \right\|_2$$

where

$$\bar{x}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri}$$

• Median linkage uses the Euclidean distance between weighted centroids
of the two clusters:

$$d(r,s) = \left\| \tilde{x}_r - \tilde{x}_s \right\|_2$$

where, if cluster r was created by combining clusters p and q, the
weighted centroid is defined recursively as

$$\tilde{x}_r = \frac{1}{2}\left( \tilde{x}_p + \tilde{x}_q \right)$$

• Ward’s linkage uses the incremental sum of squares; that is, the
increase in the total within-cluster sum of squares as a result of
joining two clusters. The within-cluster sum of squares is defined as
the sum of the squares of the distances between all objects in the
cluster and the centroid of the cluster. The equivalent distance is:

$$d^2(r,s) = \frac{n_r n_s \left\| \bar{x}_r - \bar{x}_s \right\|_2^2}{n_r + n_s}$$

where $\|\cdot\|_2$ is the Euclidean distance, and $\bar{x}_r$ and $\bar{x}_s$ are the centroids of
clusters r and s, as defined in the centroid linkage.
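A short sketch tying the methods together (the data are assumptions): Ward's and centroid linkage require Euclidean distances, which pdist produces by default.
X = randn(20,3);        % 20 observations in 3 dimensions
Y = pdist(X);           % Euclidean distances
Z = linkage(Y,'ward');  % Ward's minimum-variance linkage
dendrogram(Z)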
logncdf
Syntax P = logncdf(X,mu,sigma)
[P,PLO,PUP] = logncdf(X,mu,sigma,pcov,alpha)
$$\frac{\ln X - \hat{\mu}}{\hat{\sigma}}$$
and then transforming those bounds to the scale of the output P. The
computed bounds give approximately the desired confidence level when
you estimate mu, sigma, and pcov from large samples, but in smaller
samples other methods of computing the confidence bounds might be
more accurate.
The lognormal cdf is
$$p = F(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^x \frac{e^{-(\ln t - \mu)^2 / (2\sigma^2)}}{t}\,dt$$
Examples x = (0:0.2:10);
y = logncdf(x,0,1);
plot(x,y); grid;
xlabel('x'); ylabel('p');
lognfit
data = lognrnd(0,3,100,1);
[parmhat,parmci] = lognfit(data,0.01)
parmhat =
-0.2480 2.8902
parmci =
-1.0071 2.4393
0.5111 3.5262
logninv
Syntax X = logninv(P,mu,sigma)
[X,XLO,XUP] = logninv(P,mu,sigma,pcov,alpha)
$$\hat{\mu} + \hat{\sigma}\,q$$
where q is the Pth quantile from a normal distribution with mean 0 and
standard deviation 1. The computed bounds give approximately the
desired confidence level when you estimate mu, sigma, and pcov from
large samples, but in smaller samples other methods of computing the
confidence bounds might be more accurate.
The lognormal inverse function is defined in terms of the lognormal
cdf as
$$x = F^{-1}(p \mid \mu,\sigma) = \left\{ x : F(x \mid \mu,\sigma) = p \right\}$$
where
$$p = F(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^x \frac{e^{-(\ln t - \mu)^2 / (2\sigma^2)}}{t}\,dt$$
Examples p = (0.005:0.01:0.995);
crit = logninv(p,1,0.5);
plot(p,crit)
xlabel('Probability'); ylabel('Critical Value'); grid
lognlike
lognpdf
Syntax Y = lognpdf(X,mu,sigma)
The lognormal pdf is

$$y = f(x \mid \mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x - \mu)^2 / (2\sigma^2)}$$

The mean m and variance v of the distribution are related to the
parameters μ and σ by:

$$m = e^{\mu + \sigma^2/2}$$

$$v = e^{2\mu + \sigma^2}\left( e^{\sigma^2} - 1 \right)$$

equivalently,

$$\mu = \log\left( m^2 \Big/ \sqrt{v + m^2} \right), \qquad \sigma = \sqrt{\log\left( v/m^2 + 1 \right)}$$
Examples x = (0:0.02:10);
y = lognpdf(x,0,1);
plot(x,y); grid;
xlabel('x'); ylabel('p')
lognrnd
Syntax R = lognrnd(mu,sigma)
R = lognrnd(mu,sigma,v)
R = lognrnd(mu,sigma,m,n)
The mean m and variance v of the generated lognormal values are related
to the parameters mu and sigma by:

$$m = e^{\mu + \sigma^2/2}$$

$$v = e^{2\mu + \sigma^2}\left( e^{\sigma^2} - 1 \right)$$

equivalently,

$$\mu = \log\left( m^2 \Big/ \sqrt{v + m^2} \right), \qquad \sigma = \sqrt{\log\left( v/m^2 + 1 \right)}$$
Examples Generate one million lognormally distributed random numbers with
mean 1 and variance 2:
m = 1;
v = 2;
mu = log((m^2)/sqrt(v+m^2));
sigma = sqrt(log(v/(m^2)+1));
[M,V]= lognstat(mu,sigma)
M =
1
V =
2.0000
X = lognrnd(mu,sigma,1,1e6);
MX = mean(X)
MX =
0.9974
VX = var(X)
VX =
1.9776
lognstat
The mean m and variance v of a lognormal random variable with
parameters μ and σ satisfy:

$$m = e^{\mu + \sigma^2/2}$$

$$v = e^{2\mu + \sigma^2}\left( e^{\sigma^2} - 1 \right)$$

so that

$$\mu = \log\left( m^2 \Big/ \sqrt{v + m^2} \right), \qquad \sigma = \sqrt{\log\left( v/m^2 + 1 \right)}$$
Examples Generate one million lognormally distributed random numbers with
mean 1 and variance 2:
m = 1;
v = 2;
mu = log((m^2)/sqrt(v+m^2));
sigma = sqrt(log(v/(m^2)+1));
[M,V]= lognstat(mu,sigma)
M =
1
V =
2.0000
X = lognrnd(mu,sigma,1,1e6);
MX = mean(X)
MX =
0.9974
VX = var(X)
VX =
1.9776
paretotails.lowerparams
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
lowerparams(obj)
ans =
-0.1901 1.1898
upperparams(obj)
ans =
0.3646 0.5103
qrandstream.lt
Syntax h1 < h2
lsline
Syntax lsline
h = lsline
Examples Use lsline together with scatter plots produced by scatter and
various line styles of plot:
x = 1:10;
y1 = x + randn(1,10);
scatter(x,y1,25,'b','*')
hold on
y2 = 2*x + randn(1,10);
plot(x,y2,'mo')
y3 = 3*x + randn(1,10);
plot(x,y3,'rx:')
y4 = 4*x + randn(1,10);
plot(x,y4,'g+--')
lsline
mad
Syntax y = mad(X)
Y = mad(X,1)
Y = mad(X,0)
Description y = mad(X) returns the mean absolute deviation of the values in X. For
vector input, y is mean(abs(X-mean(X))). For a matrix input, y is a
row vector containing the mean absolute deviation of each column of
X. For N-dimensional arrays, mad operates along the first nonsingleton
dimension of X.
Y = mad(X,1) returns the median absolute deviation of the values in X.
For vector input, y is median(abs(X-median(X))). For a matrix input, y is
a row vector containing the median absolute deviation of each column of
X. For N-dimensional arrays, mad operates along the first nonsingleton
dimension of X.
Y = mad(X,0) is the same as mad(X), and returns the mean absolute
deviation of the values in X.
mad(X,flag,dim) computes absolute deviations along the dimension
dim of X. flag is 0 or 1 to indicate mean or median absolute deviation,
respectively.
mad treats NaNs as missing values and removes them.
For normally distributed data, multiply mad by one of the following
factors to obtain an estimate of the normal scale parameter σ:
Examples The following compares the robustness of different scale estimates for
normally distributed data in the presence of outliers:
x = normrnd(0,1,1,50);
xo = [x 10]; % Add outlier
r1 = std(xo)/std(x)
r1 =
1.7385
r2 = mad(xo,0)/mad(x,0)
r2 =
1.2306
r3 = mad(xo,1)/mad(x,1)
r3 =
1.0602
References [1] Mosteller, F., and J. Tukey. Data Analysis and Regression. Upper
Saddle River, NJ: Addison-Wesley, 1977.
mahal
Syntax d = mahal(Y,X)
d1 = mahal(Y,X) % Mahalanobis
d1 =
1.3592
21.1013
23.8086
1.4727
scatter(X(:,1),X(:,2))
hold on
scatter(Y(:,1),Y(:,2),100,d1,'*','LineWidth',2)
hb = colorbar;
ylabel(hb,'Mahalanobis Distance')
legend('X','Y','Location','NW')
gmdistribution.mahal
Syntax D = mahal(obj,X)
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
hold on
obj = gmdistribution.fit(X,2);
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
D = mahal(obj,X);
delete(h)
scatter(X(:,1),X(:,2),10,D(:,1),'.')
hb = colorbar;
ylabel(hb,'Mahalanobis Distance to Component 1')
maineffectsplot
Syntax maineffectsplot(Y,GROUP)
maineffectsplot(Y,GROUP,param1,val1,param2,val2,...)
[figh,AXESH] = maineffectsplot(...)
Examples Display main effects plots for car weight with two grouping variables,
model year and number of cylinders:
load carsmall;
maineffectsplot(Weight,{Model_Year,Cylinders}, ...
'varnames',{'Model Year','# of Cylinders'})
manova1
Syntax d = manova1(X,group)
d = manova1(X,group,alpha)
[d,p] = manova1(...)
[d,p,stats] = manova1(...)
If the ith p value is near zero, this casts doubt on the hypothesis that
the group means lie on a space of i-1 dimensions. The choice of a
critical p value to determine whether the result is judged statistically
significant is left to the researcher and is specified by the value of the
input argument alpha. It is common to declare a result significant if
the p value is less than 0.05 or 0.01.
[d,p,stats] = manova1(...) also returns stats, a structure
containing additional MANOVA results. The structure contains the
following fields.
Field Contents
W Within-groups sum of squares and cross-products
matrix
B Between-groups sum of squares and cross-products
matrix
T Total sum of squares and cross-products matrix
dfW Degrees of freedom for W
dfB Degrees of freedom for B
dfT Degrees of freedom for T
lambda Vector of values of Wilks' lambda test statistic for
testing whether the means have dimension 0, 1, etc.
chisq Transformation of lambda to an approximate
chi-square distribution
chisqdf Degrees of freedom for chisq
eigenval Eigenvalues of W-1B
eigenvec Eigenvectors of W-1B; these are the coefficients for
the canonical variables C, and they are scaled so the
within-group variance of the canonical variables is 1
canon Canonical variables C, equal to XC*eigenvec, where XC
is X with columns centered by subtracting their means
mdist A vector of Mahalanobis distances from each point
to the mean of its group
Examples You can use manova1 to determine whether there are differences in
the averages of four car characteristics, among groups defined by the
country where the cars were made.
load carbig
[d,p] = manova1([MPG Acceleration Weight Displacement],...
Origin)
d =
3
p =
0
0.0000
0.0075
0.1934
There are four dimensions in the input matrix, so the group means
must lie in a four-dimensional space. manova1 shows that you cannot
reject the hypothesis that the means lie in a 3-D subspace.
manovacluster
Syntax manovacluster(stats)
manovacluster(stats,method)
H = manovacluster(stats,method)
Method Description
'single' Shortest distance (default)
'complete' Largest distance
'average' Average distance
'centroid' Centroid distance
'ward' Incremental sum of squares
Examples Let’s analyze the larger car data set to determine which countries
produce cars with the most similar characteristics.
load carbig
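The rest of this example was lost in extraction; a plausible continuation runs manova1 on four car characteristics grouped by Origin and then clusters the group means using the returned stats structure:
X = [MPG Acceleration Weight Displacement];
[d,p,stats] = manova1(X,Origin);
manovacluster(stats)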
CompactTreeBagger.margin
TreeBagger.margin
mdscale
Syntax Y = mdscale(D,p)
[Y,stress] = mdscale(D,p)
[Y,stress,disparities] = mdscale(D,p)
[...] = mdscale(...,param1,val1,param2,val2,...)
In this case, you can pass in [] for p and mdscale infers p from
the second dimension of the matrix. You can also supply a 3-D
array, implying a value for 'Replicates' from the array’s third
dimension.
• 'Replicates' — Number of times to repeat the scaling, each with a
new initial configuration. The default is 1.
• 'Options' — Options for the iterative algorithm used to minimize
the fitting criterion. Pass in an options structure created by statset.
For example,
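(The example that completed this sentence was lost in extraction; a plausible sketch, assuming D is a dissimilarity matrix:)
opts = statset('MaxIter',500);
[Y,stress] = mdscale(D,2,'Options',opts);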
CompactTreeBagger.mdsProx
TreeBagger.mdsProx
ProbDistUnivKernel.median
Syntax M = median(PD)
ProbDistUnivParam.median
Syntax M = median(PD)
ProbDistUnivParam.mean
Syntax M = mean(PD)
CompactTreeBagger.meanMargin
TreeBagger.meanMargin
TreeBagger.MergeLeaves property
Description The MergeLeaves property is true if decision trees have their leaves
with the same parent merged for splits that do not decrease the total
risk, and false otherwise. The default value is false.
ordinal.mergelevels
Syntax B = mergelevels(A,oldlevels,newlevel)
B = mergelevels(A,oldlevels)
Examples Example 1
For nominal data:
load fisheriris
species = nominal(species);
species = mergelevels(species,...
{'setosa','virginica'},'parent');
species = setlabels(species,'hybrid','versicolor');
getlabels(species)
ans =
'hybrid' 'parent'
Example 2
For ordinal data:
A = ordinal([1 2 3 2 1],{'lo','med','hi'})
A =
lo med hi med lo
A = mergelevels(A,{'lo','med'},'bad')
A =
bad bad hi bad bad
CompactTreeBagger.Method property
TreeBagger.Method property
mhsample
Description smpl =
mhsample(start,nsamples,'pdf',pdf,'proppdf',proppdf,'proprnd',proprnd)
draws nsamples random samples from a target stationary distribution
pdf using the Metropolis-Hastings algorithm.
start is a row vector containing the start value of the Markov
Chain, nsamples is an integer specifying the number of samples to be
generated, and pdf, proppdf, and proprnd are function handles created
using @. proppdf defines the proposal distribution density, and proprnd
defines the random number generator for the proposal distribution. pdf
and proprnd take one argument as an input with the same type and
size as start. proppdf takes two arguments as inputs with the same
type and size as start.
smpl is a column vector or matrix containing the samples. If the log
density function is preferred, 'pdf' and 'proppdf' can be replaced
with 'logpdf' and 'logproppdf'. The density functions used in the
Metropolis-Hastings algorithm are not necessarily normalized.
The proposal distribution q(x,y) gives the probability density for
choosing x as the next point when y is the current point. It is sometimes
written as q(x|y).
If the proppdf or logproppdf satisfies q(x,y) = q(y,x), that is, the
proposal distribution is symmetric, mhsample implements Random
Walk Metropolis-Hastings sampling. If the proppdf or logproppdf
satisfies q(x,y) = q(x), that is, the proposal distribution is independent of
current values, mhsample implements Independent Metropolis-Hastings
sampling.
Examples Estimate the second-order moment of a Gamma distribution using
Independent Metropolis-Hastings sampling.
alpha = 2.43;
beta = 1;
pdf = @(x)gampdf(x,alpha,beta); %target distribution
proppdf = @(x,y)gampdf(x,floor(alpha),floor(alpha)/alpha);
proprnd = @(x)sum(...
exprnd(floor(alpha)/alpha,floor(alpha),1));
nsamples = 5000;
smpl = mhsample(1,nsamples,'pdf',pdf,'proprnd',proprnd,...
'proppdf',proppdf);
xxhat = cumsum(smpl.^2)./(1:nsamples)';
plot(1:nsamples,xxhat)
delta = .5;
pdf = @(x) normpdf(x);
proppdf = @(x,y) unifpdf(y-x,-delta,delta);
proprnd = @(x) x + rand*2*delta - delta;
nsamples = 15000;
x = mhsample(1,nsamples,'pdf',pdf,'proprnd',proprnd,'symmetric',1);
histfit(x,50)
h = get(gca,'Children');
set(h(2),'FaceColor',[.8 .8 1])
TreeBagger.MinLeaf property
mle
• 'beta'
• 'bernoulli'
• 'binomial'
• 'birnbaumsaunders'
• 'discrete uniform' or 'unid'
• 'exponential'
• 'extreme value' or 'ev'
• 'gamma'
• 'generalized extreme value' or 'gev'
• 'generalized pareto' or 'gp'
• 'geometric'
• 'inversegaussian'
• 'logistic'
• 'loglogistic'
• 'lognormal'
• 'nakagami'
• 'negative binomial' or 'nbin'
• 'normal'
• 'poisson'
• 'rayleigh'
• 'rician'
• 'tlocationscale'
• 'uniform'
• 'weibull' or 'wbl'
Name Value
'censoring' A Boolean vector of the same size as data,
containing ones when the corresponding
elements of data are right-censored
observations and zeros when the corresponding
elements are exact observations. The default
is that all observations are observed exactly.
Censoring is not supported for all distributions.
'frequency' A vector of the same size as data, containing
nonnegative integer frequencies for the
corresponding elements in data. The default is
one observation per element of data.
'alpha' A value between 0 and 1 specifying a confidence
level of 100(1-alpha)% for pci. The default is
0.05 for 95% confidence.
'ntrials' A scalar, or a vector of the same size as data,
containing the total number of trials for the
corresponding element of data. Applies only to
the binomial distribution.
'options' A structure created by a call to statset,
containing numerical options for the fitting
algorithm. Not applicable to all distributions.
mle can also fit custom distributions that you define using distribution
functions, in one of three ways.
[...] = mle(data,'pdf',pdf,'cdf',cdf,'start',start,...)
returns MLEs for the parameters of the distribution defined by the
probability density and cumulative distribution functions pdf and
cdf. pdf and cdf are function handles created using the @ sign. They
accept as inputs a vector data and one or more individual distribution
parameters, and return vectors of probability density values and
cumulative probability values, respectively. If the 'censoring'
name/value pair is not present, you can omit the 'cdf' name/value
pair. mle computes the estimates by numerically maximizing the
distribution’s log-likelihood, and start is a vector containing initial
values for the parameters.
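A hedged sketch of this calling form (the data and the fitted distribution are assumptions): fit a one-parameter exponential-type density through custom pdf and cdf handles.
data = exprnd(3,100,1);                  % sample to fit (assumed)
custpdf = @(x,mu) exp(-x./mu)./mu;       % pdf handle
custcdf = @(x,mu) 1 - exp(-x./mu);       % cdf handle
phat = mle(data,'pdf',custpdf,'cdf',custcdf,'start',1)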
[...] =
mle(data,'logpdf',logpdf,'logsf',logsf,'start',start,...)
returns MLEs for the parameters of the distribution defined by the log
probability density and log survival functions logpdf and logsf.
logpdf and logsf are function handles created using the @ sign. They
accept as inputs a vector data and one or more individual distribution
parameters, and return vectors of logged probability density values and
logged survival function values, respectively. This form is sometimes
more robust to the choice of starting point than using pdf and cdf
functions.
nloglf must accept all four arguments even if you do not supply the
'censoring' or 'frequency' name/value pairs (see above). However,
nloglf can safely ignore its cens and freq arguments in that case.
nloglf returns a scalar negative log-likelihood value and, optionally,
a negative log-likelihood gradient vector (see the 'GradObj' statset
parameter below). start is a vector containing initial values for the
distribution’s parameters.
pdf, cdf, logpdf, logsf, or nloglf can also be cell arrays whose first
element is a function handle as defined above, and whose remaining
elements are additional arguments to the function. mle places these
arguments at the end of the argument list in the function call.
The following optional argument name/value pairs are valid only when
'pdf' and 'cdf', 'logpdf' and 'logcdf', or 'nloglf' are given:
Parameter Value
'GradObj' 'on' or 'off', indicating whether or not fmincon
can expect the function provided with the 'nloglf'
name/value pair to return the gradient vector of
the negative log-likelihood as a second output. The
default is 'off'. Ignored when using fminsearch.
Examples The following returns an MLE and a 95% confidence interval for the
success probability of a binomial distribution with 20 trials:
[phat,pci] = mle(data,'distribution','binomial',...
'alpha',.05,'ntrials',20)
phat =
0.7370
pci =
0.7171
0.7562
See Also betafit, binofit, evfit, expfit, gamfit, gevfit, gpfit, lognfit,
nbinfit, normfit, mlecov, poissfit, raylfit, statset, unifit,
wblfit
mlecov
nloglf must accept all four arguments even if you do not supply the
'censoring' or 'frequency' name/value pairs (see below). However,
nloglf can safely ignore its cens and freq arguments in that case.
nloglf returns a scalar negative log-likelihood value and, optionally,
the negative log-likelihood gradient vector (see the 'gradient'
name/value pair below).
pdf, cdf, logpdf, logsf, and nloglf can also be cell arrays whose first
element is a function handle, as defined above, and whose remaining
elements are additional arguments to the function. The mle function
places these arguments at the end of the argument list in the function
call.
[...] =
mlecov(params,data,...,param1,val1,param2,val2,...) specifies
optional parameter name/value pairs chosen from the following table.
Parameter Value
'censoring' Boolean vector of the same size as data, containing
1’s when the corresponding elements of data are
right-censored observations and 0’s when the
corresponding elements are exact observations. The
default is that all observations are observed exactly.
Censoring is not supported for all distributions.
'frequency' A vector of the same size as data containing
nonnegative frequencies for the corresponding
elements in data. The default is one observation per
element of data.
'options' A structure opts containing numerical options for
the finite difference Hessian calculation. You create
opts by calling statset. The applicable statset
parameters are:
x = betarnd(1.23,3.45,25,1);
phat = mle(x,'dist','beta')
acov = mlecov(phat,x,'logpdf',@betalogpdf)
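The definition of the helper betalogpdf appeared earlier in the original entry but was lost in extraction; a plausible reconstruction of the beta log-density it names:
function logpdf = betalogpdf(x,a,b)
% Log of the beta pdf, written directly for numerical stability
logpdf = (a-1)*log(x) + (b-1)*log(1-x) - betaln(a,b);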
mnpdf
Syntax Y = mnpdf(X,PROB)
Note that the visualization does not show x3, which is determined by
the constraint x1 + x2 + x3 = n.
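A small usage sketch (the counts and probabilities are assumptions): evaluate the probability of one outcome vector of a trinomial with 10 trials.
y = mnpdf([5 3 2],[0.4 0.4 0.2])   % probability of 5, 3, and 2 outcomes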
mnrfit
Syntax B = mnrfit(X,Y)
B = mnrfit(X,Y,param1,val1,param2,val2,...)
[B,dev] = mnrfit(...)
[B,dev,stats] = mnrfit(...)
• 'model' — The type of model to fit; one of the text strings 'nominal'
(the default), 'ordinal', or 'hierarchical'
• t — t statistics for B
• p — p-values for B
• resid — Residuals
• residp — Pearson residuals
• residd — Deviance residuals
Examples Fit multinomial logistic regression models to data with one predictor
variable and three categories in the response variable:
x = [-3 -2 -1 0 1 2 3]';
Y = [1 11 13; 2 9 14; 6 14 5; 5 10 10; 5 14 6; 7 13 5;...
8 11 6];
bar(x,Y,'stacked'); ylim([0 25]);
% response categories:
xx = linspace(-4,4)';
pHatNom = mnrval(betaHatNom,xx,'model','nominal',...
'interactions','on');
line(xx,cumsum(25*pHatNom,2),'LineWidth',2);
References [1] McCullagh, P., and J. A. Nelder. Generalized Linear Models. New
York: Chapman & Hall, 1990.
mnrnd
Syntax r = mnrnd(n,p)
R = mnrnd(n,p,m)
R = mnrnd(N,P)
n = 1e3;
p = [0.2,0.3,0.5];
R = mnrnd(n,p,2)
R =
215 282 503
194 303 503
n = 1e3;
P = [0.2, 0.3, 0.5; ...
0.3, 0.4, 0.3;];
R = mnrnd(n,P)
R =
186 290 524
290 389 321
mnrval
• 'model' — The type of model that was fit by mnrfit; one of the text
strings 'nominal' (the default), 'ordinal', or 'hierarchical'.
Examples Fit multinomial logistic regression models to data with one predictor
variable and three categories in the response variable:
x = [-3 -2 -1 0 1 2 3]';
Y = [1 11 13; 2 9 14; 6 14 5; 5 10 10; 5 14 6; 7 13 5;...
8 11 6];
bar(x,Y,'stacked');
ylim([0 25]);
xx = linspace(-4,4)';
pHatNom = mnrval(betaHatNom,xx,'model','nominal',...
'interactions','on');
line(xx,cumsum(25*pHatNom,2),'LineWidth',2);
References [1] McCullagh, P., and J. A. Nelder. Generalized Linear Models. New
York: Chapman & Hall, 1990.
moment
Syntax m = moment(X,order)
moment(X,order,dim)
Remarks Note that the central first moment is zero, and the second central
moment is the variance computed using a divisor of n rather than n –
1, where n is the length of the vector x or the number of rows in the
matrix X.
The central moment of order k of a distribution is defined as
$$m_k = E(x - \mu)^k$$
where E(x) is the expected value of x.
m = moment(X,3)
m =
-0.0282 0.0571 0.1253 0.1460 -0.4486
gmdistribution.Mu property
multcompare
Syntax c = multcompare(stats)
c = multcompare(stats,param1,val1,param2,val2,...)
[c,m] = multcompare(...)
[c,m,h] = multcompare(...)
[c,m,h,gnames] = multcompare(...)
These numbers indicate that the mean of group 2 minus the mean of
group 5 is estimated to be 8.2206, and a 95% confidence interval for
the true mean is [1.9442, 14.4971].
In this example the confidence interval does not contain 0.0, so the
difference is significant at the 0.05 level. If the confidence interval did
contain 0.0, the difference would not be significant at the 0.05 level.
The multcompare function also displays a graph with each group mean
represented by a symbol and an interval around the symbol. Two means
are significantly different if their intervals are disjoint, and are not
significantly different if their intervals overlap. You can use the mouse
to select any group, and the graph will highlight any other groups that
are significantly different from it.
c = multcompare(stats,param1,val1,param2,val2,...) specifies
one or more of the parameter name/value pairs described in the
following table.
Parameter Values
'alpha' Scalar between 0 and 1 that determines the
confidence levels of the intervals in the matrix
c and in the figure (default is 0.05). The
confidence level is 100(1-alpha)%.
'display' Either 'on' (the default) to display a graph
of the estimates with comparison intervals
around them, or 'off' to omit the graph. See
“Examples” on page 18-797.
'ctype' Specifies the type of critical value to use for the
multiple comparison. “Values of ctype” on page
18-795 describes the allowed values for ctype.
'dimension' A vector specifying the dimension or dimensions
over which the population marginal means
are to be calculated. Use only if you create
stats with the function anovan. The default
is 1 to compute over the first dimension. See
“Dimension Parameter” on page 18-797 for more
information.
'estimate' Specifies the estimate to be compared. The
allowable values of estimate depend on the
function that was the source of the stats
structure, as described in “Values of estimate”
on page 18-796
anova1, if all means are based on the same sample size.) You can click
on any estimate to see which means are significantly different from it.
Values of ctype
The following table describes the allowed values for the parameter
ctype.
Value Description
'hsd' or Use Tukey’s honestly significant difference
'tukey-kramer' criterion. This is the default, and it is based on the
Studentized range distribution. It is optimal for
balanced one-way ANOVA and similar procedures
with equal sample sizes. It has been proven
to be conservative for one-way ANOVA with
different sample sizes. According to the unproven
Tukey-Kramer conjecture, it is also accurate for
problems where the quantities being compared
are correlated, as in analysis of covariance with
unbalanced covariate values.
'lsd' Use Tukey’s least significant difference procedure.
This procedure is a simple t-test. It is reasonable
if the preliminary test (say, the one-way ANOVA
F statistic) shows a significant difference. If it is
used unconditionally, it provides no protection
against multiple comparisons.
'bonferroni' Use critical values from the t distribution, after a
Bonferroni adjustment to compensate for multiple
comparisons. This procedure is conservative, but
usually less so than the Scheffé procedure.
'dunn-sidak' Use critical values from the t distribution, after
an adjustment for multiple comparisons that was
proposed by Dunn and proved accurate by Sidák.
This procedure is similar to, but less conservative
than, the Bonferroni procedure.
'scheffe' Use critical values from Scheffé’s S procedure,
derived from the F distribution. This procedure
provides a simultaneous confidence level for
comparisons of all linear combinations of the
means, and it is conservative for comparisons of
simple differences of pairs.
Values of estimate
The allowable values of the parameter 'estimate' depend on the
function that was the source of the stats structure, according to the
following table.
Source Values
'anova1' Ignored. Always compare the group means.
'anova2' Either 'column' (the default) or 'row' to
compare column or row means.
'anovan' Ignored. Always compare the population
marginal means as specified by the dim
argument.
'aoctool' Either 'slope', 'intercept', or 'pmm' to
compare slopes, intercepts, or population
marginal means. If the analysis of covariance
model did not include separate slopes, then
'slope' is not allowed. If it did not include
separate intercepts, then no comparisons are
possible.
'friedman' Ignored. Always compare average column ranks.
'kruskalwallis' Ignored. Always compare average group ranks.
Dimension Parameter
The dimension parameter is a vector specifying the dimension or
dimensions over which the population marginal means are to be
calculated. For example, if dim = 1, the estimates that are compared
are the means for each value of the first grouping variable, adjusted by
removing effects of the other grouping variables as if the design were
balanced. If dim = [1 3], population marginal means are computed for
each combination of the first and third grouping variables, removing
effects of the second grouping variable. If you fit a singular model, some
cell means may not be estimable and any population marginal means
that depend on those cell means will have the value NaN.
Population marginal means are described by Milliken and Johnson
(1992) and by Searle, Speed, and Milliken (1980). The idea behind
population marginal means is to remove any effect of an unbalanced
design by fixing the values of the factors specified by dim, and averaging
out the effects of other factors as if each factor combination occurred
the same number of times. The definition of population marginal
means does not depend on the number of observations at each
factor combination. For designed experiments where the number of
observations at each factor combination has no meaning, population
marginal means can be easier to interpret than simple means ignoring
other factors. For surveys and other studies where the number of
observations at each combination does have meaning, population
marginal means may be harder to interpret.
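For illustration, a brief sketch of comparing population marginal means over two dimensions of an anovan fit (the response y and grouping variables g1, g2, g3 here are hypothetical):

[p,tbl,stats] = anovan(y,{g1 g2 g3},'display','off');
c = multcompare(stats,'dimension',[1 3],'display','off');

This compares the marginal means for each combination of the first and third grouping variables, averaging out the second.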
Examples Example 1
The following example performs a 1-way analysis of variance (ANOVA)
and displays group means with their names.
load carsmall
[p,t,st] = anova1(MPG,Origin,'off');
[c,m,h,nms] = multcompare(st,'display','off');
[nms num2cell(m)]
ans =
'USA' [21.1328] [0.8814]
'Japan' [31.8000] [1.8206]
'Germany' [28.4444] [2.3504]
'France' [23.6667] [4.0711]
'Sweden' [22.5000] [4.9860]
'Italy' [28.0000] [7.0513]
You can click the graphs of each country to compare its mean to those of
other countries.
Example 2
The following continues the example described in the anova1 reference
page, which is related to testing the material strength in structural
beams. From the anova1 output you found significant evidence that
the three types of beams are not equivalent in strength. Now you can
determine where those differences lie. First you create the data arrays
and you perform one-way ANOVA.
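The setup follows the anova1 example (reproduced here as a sketch so the call below is self-contained):

strength = [82 86 79 83 84 85 86 87 74 82 ...
            78 75 76 77 79 79 77 78 82 79];
alloy = {'st','st','st','st','st','st','st','st',...
         'al1','al1','al1','al1','al1','al1',...
         'al2','al2','al2','al2','al2','al2'};
[p,a,s] = anova1(strength,alloy,'off');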
[c,m,h,nms] = multcompare(s);
[nms num2cell(c)]
ans =
'st' [1] [2] [ 3.6064] [ 7] [10.3936]
'al1' [1] [3] [ 1.6064] [ 5] [ 8.3936]
'al2' [2] [3] [-5.6280] [-2] [ 1.6280]
The third row of the output matrix shows that the difference in
strength between the two alloys is not significant. A 95% confidence
interval for the difference is [-5.6, 1.6], so you cannot reject the
hypothesis that the true difference is zero.
The first two rows show that both comparisons involving the first group
(steel) have confidence intervals that do not include zero. In other
words, those differences are significant. The graph shows the same
information.
multivarichart
Syntax multivarichart(y,GROUP)
multivarichart(Y)
multivarichart(...,param1,val1,param2,val2,...)
[charthandle,AXESH] = multivarichart(...)
Examples Display a multivari chart for data with two grouping variables:
y = randn(100,1); % response
group = [ceil(3*rand(100,1)) ceil(2*rand(100,1))];
multivarichart(y,group)
Display a multivari chart for data with four grouping variables:
y = randn(1000,1); % response
group = {ceil(2*rand(1000,1)),ceil(3*rand(1000,1)), ...
ceil(2*rand(1000,1)),ceil(3*rand(1000,1))};
multivarichart(y,group)
mvncdf
Syntax y = mvncdf(X)
y = mvncdf(X,mu,SIGMA)
y = mvncdf(xl,xu,mu,SIGMA)
[y,err] = mvncdf(...)
[...] = mvncdf(...,options)
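A usage sketch (parameter values are illustrative): evaluate and plot the bivariate normal cdf over a grid.

mu = [1 -1]; SIGMA = [.9 .4; .4 .3];
[X1,X2] = meshgrid(linspace(-1,3,25)',linspace(-3,1,25)');
X = [X1(:) X2(:)];
p = mvncdf(X,mu,SIGMA);
surf(X1,X2,reshape(p,25,25))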
[5] Genz, A., and F. Bretz. “Comparison of Methods for the Computation
of Multivariate t Probabilities.” Journal of Computational and
Graphical Statistics. Vol. 11, No. 4, 2002, pp. 950–971.
mvnpdf
Syntax y = mvnpdf(X)
y = mvnpdf(X,MU)
y = mvnpdf(X,MU,SIGMA)
Examples mu = [1 -1];
SIGMA = [.9 .4; .4 .3];
X = mvnrnd(mu,SIGMA,10);
p = mvnpdf(X,mu,SIGMA);
mvregress
• 'tolbeta' — Convergence tolerance for changes in beta. The default
is sqrt(eps). The test at iteration k is

norm(beta(k)-beta(k-1)) < sqrt(p)*tolbeta*(1+norm(beta(k)))

where p = length(beta).
• 'tolobj' — Convergence tolerance for changes in the objective
function. The default is eps^(3/4). The test is

abs(obj(k)-obj(k-1)) < tolobj*(1+abs(obj(k)))

where obj is the objective function. If both tolobj and tolbeta are
0, the function performs maxiter iterations with no convergence test.
• 'beta0' — A vector of p elements to be used as the initial estimate
for beta. Default is a zero vector. Not used for the 'mvn' algorithm.
• 'covar0' — A d-by-d matrix to be used as the initial estimate for
SIGMA. Default is the identity matrix. For the 'cwls' algorithm, this
matrix is usually a diagonal matrix and it is not changed during
the iterations, so the input value is used as the weighting matrix at
each iteration.
• 'outputfcn' — An output function.
• 'varformat' — Either 'beta' to compute COVB for beta only
(default), or 'full' to compute COVB for both beta and SIGMA.
• 'vartype' — Either 'hessian' to compute COVB using the Hessian
or observed information (default), or 'fisher' to compute COVB using
the complete-data Fisher or expected information.
Examples Predict regional flu estimates based on Google™ queries using the
national CDC estimates as a predictor:
load flu
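The code below assumes the flu data set has been reshaped into a response matrix and a predictor vector; a minimal sketch of that setup (treat the details as illustrative):

y = double(flu(:,2:end-1));   % regional flu estimates form the responses
[nobs,nregions] = size(y);
x = flu.WtdILI;               % national CDC estimate is the predictor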
for j=1:nobs
X{j} = [eye(nregions), x(j)*eye(nregions)];
end
[beta,sig,resid,vars,loglik2] = mvregress(X,y);
for j=1:nregions;
set(h(nregions+j),'color',get(h(j),'color'));
end
chisq =
96.4556
p =
References [1] Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis with
Missing Data. 2nd ed., Hoboken, NJ: John Wiley & Sons, Inc., 2002.
mvregresslike
mvnrnd
Syntax R = mvnrnd(MU,SIGMA)
r = mvnrnd(MU,SIGMA,cases)
Examples mu = [2 3];
SIGMA = [1 1.5; 1.5 3];
r = mvnrnd(mu,SIGMA,100);
plot(r(:,1),r(:,2),'+')
mvtcdf
Syntax y = mvtcdf(X,C,DF)
y = mvtcdf(xl,xu,C,DF)
[y,err] = mvtcdf(...)
[...] = mvtcdf(...,options)
[3] Genz, A., and F. Bretz. “Comparison of Methods for the Computation
of Multivariate t Probabilities.” Journal of Computational and
Graphical Statistics. Vol. 11, No. 4, 2002, pp. 950–971.
mvtpdf
Syntax y = mvtpdf(X,C,df)
Examples Plot the pdf of a bivariate t distribution:
[X1,X2] = meshgrid(linspace(-2,2,25)',linspace(-2,2,25)');
X = [X1(:) X2(:)];
C = [1 .4; .4 1];
df = 2;
p = mvtpdf(X,C,df);
surf(X1,X2,reshape(p,25,25))
mvtrnd
Syntax R = mvtrnd(C,df,cases)
R = mvtrnd(C,df)
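A usage sketch (the correlation and degrees of freedom are illustrative):

SIGMA = [1 0.8; 0.8 1];
r = mvtrnd(SIGMA,3,100);   % 100 draws from a bivariate t with 3 df
plot(r(:,1),r(:,2),'+')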
cvpartition.N property
NaiveBayes class
Copy Semantics Value. To learn how this affects your use of the class, see
Comparing Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
Examples Predict the class label using the Naive Bayes classifier:
load fisheriris
O1 = NaiveBayes.fit(meas,species);
C1 = O1.predict(meas);
cMat1 = confusionmat(species,C1)
This returns:
cMat1 =
50 0 0
0 47 3
0 3 47
Use the Gaussian distribution for features 1 and 3 and use the kernel
density estimation for features 2 and 4:
O2 = NaiveBayes.fit(meas,species,'dist',...
{'normal','kernel','normal','kernel'});
C2 = O2.predict(meas);
cMat2 = confusionmat(species,C2)
This returns:
cMat2 =
50 0 0
0 47 3
0 3 47
[2] Vangelis M., Ion A., and Georgios P. "Spam Filtering with Naive
Bayes - Which Naive Bayes?" Third Conference on Email and
Anti-Spam, 2006.
NaiveBayes
nancov
Syntax Y = nancov(X)
Y = nancov(X1,X2)
Y = nancov(...,1)
Y = nancov(...,'pairwise')
Examples Generate random data for two variables (columns) with random missing
values:
X = rand(10,2);
p = randperm(numel(X));
X(p(1:5)) = NaN
X =
0.8147 0.1576
NaN NaN
0.1270 0.9572
0.9134 NaN
0.6324 NaN
0.0975 0.1419
0.2785 0.4218
0.5469 0.9157
0.9575 0.7922
0.9649 NaN
Add a third variable that is the sum of the first two:
X(:,3) = sum(X,2)
X =
0.8147 0.1576 0.9723
NaN NaN NaN
0.1270 0.9572 1.0842
0.9134 NaN NaN
0.6324 NaN NaN
0.0975 0.1419 0.2394
0.2785 0.4218 0.7003
0.5469 0.9157 1.4626
0.9575 0.7922 1.7497
0.9649 NaN NaN
Compute the covariance matrix for the three variables after removing
observations (rows) with NaN values:
Y = nancov(X)
Y =
0.1311 0.0096 0.1407
0.0096 0.1388 0.1483
0.1407 0.1483 0.2890
nanmax
Syntax y = nanmax(X)
Y = nanmax(X1,X2)
y = nanmax(X,[],dim)
[y,indices] = nanmax(...)
Examples Find column maxima and their indices for data with missing values:
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
[y,indices] = nanmax(X)
y =
4 5 NaN
indices =
3 2 1
nanmean
Syntax y = nanmean(X)
y = nanmean(X,dim)
Note If X contains a vector of all NaN values along some dimension, the
vector is empty once the NaN values are removed, so the sum of the
remaining elements is 0. Since the mean involves division by 0, its
value is NaN. The output NaN is not a mean of NaN values.
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
y = nanmean(X)
y =
3.5000 3.0000 NaN
nanmedian
Syntax y = nanmedian(X)
y = nanmedian(X,dim)
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
y = nanmedian(X)
y =
3.5000 3.0000 NaN
nanmin
Syntax y = nanmin(X)
Y = nanmin(X1,X2)
y = nanmin(X,[],dim)
[y,indices] = nanmin(...)
Examples Find column minima and their indices for data with missing values:
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
[y,indices] = nanmin(X)
y =
3 1 NaN
indices =
2 1 1
nanstd
Syntax y = nanstd(X)
y = nanstd(X,1)
y = nanstd(X,flag,dim)
Examples Find column standard deviations for data with missing values:
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
y = nanstd(X)
y =
0.7071 2.8284 NaN
nansum
Syntax y = nansum(X)
y = nansum(X,dim)
Note If X contains a vector of all NaN values along some dimension, the
vector is empty once the NaN values are removed, so the sum of the
remaining elements is 0. The output 0 is not a sum of NaN values.
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
y = nansum(X)
y =
7 6 0
nanvar
Syntax y = nanvar(X)
y = nanvar(X,1)
y = nanvar(X,w)
y = nanvar(X,w,dim)
Examples Find column variances for data with missing values:
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
y = nanvar(X)
y =
0.5000 8.0000 NaN
nbincdf
Syntax Y = nbincdf(X,R,P)
The negative binomial cdf is

$$y = F(x \mid r,p) = \sum_{i=0}^{x} \binom{r+i-1}{i}\, p^{r} q^{i}\, I_{(0,1,\ldots)}(i)$$

where q = 1 – p. When r is not an integer, the binomial coefficient is
evaluated using gamma functions:

$$\binom{r+i-1}{i} = \frac{\Gamma(r+i)}{\Gamma(r)\,\Gamma(i+1)}$$
Examples x = (0:15);
p = nbincdf(x,3,0.5);
stairs(x,p)
nbinfit
nbininv
Syntax X = nbininv(Y,R,P)
Examples How many times would you need to flip a fair coin to have a 99%
probability of having observed 10 heads?
flips = nbininv(0.99,10,0.5) + 10
flips =
33
Note that you have to flip at least 10 times to get 10 heads. That is why
the second term on the right side of the equals sign is a 10.
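As a quick check (a brief sketch), the cdf confirms that 23 tails is the smallest count meeting the target probability:

nbincdf(flips-10,10,0.5)   % at least 0.99
nbincdf(flips-11,10,0.5)   % below 0.99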
nbinpdf
Syntax Y = nbinpdf(X,R,P)
The negative binomial pdf is

$$y = f(x \mid r,p) = \binom{r+x-1}{x}\, p^{r} q^{x}\, I_{(0,1,\ldots)}(x)$$

where q = 1 – p. When r is not an integer, the binomial coefficient is
evaluated as $\Gamma(r+x)/\bigl(\Gamma(r)\,\Gamma(x+1)\bigr)$.
Examples x = (0:10);
y = nbinpdf(x,3,0.5);
plot(x,y,'+')
set(gca,'Xlim',[-0.5,10.5])
nbinrnd
Examples Suppose you want to simulate a process that has a defect probability of
0.01. How many units might Quality Assurance inspect before finding
three defective items?
r = nbinrnd(3,0.01,1,6)+3
r =
496 142 420 396 851 178
nbinstat
Description [M,V] = nbinstat(R,P) returns the mean M and variance V of the
negative binomial distribution with number of successes R and
probability of success in a single trial P. R and P can be vectors,
matrices, or multidimensional arrays that all have the same size, which
is also the size of M and V. A scalar input for R or P is expanded to a
constant array with the same dimensions as the other input.
The mean of the negative binomial distribution with parameters r and p
is rq / p, where q = 1 – p. The variance is rq / p2.
The simplest motivation for the negative binomial is the case of
successive random trials, each having a constant probability P of
success. The number of extra trials you must perform in order to
observe a given number R of successes has a negative binomial
distribution. However, consistent with a more general interpretation
of the negative binomial, nbinstat allows R to be any positive value,
including nonintegers.
Examples p = 0.1:0.2:0.9;
r = 1:5;
[R,P] = meshgrid(r,p);
[M,V] = nbinstat(R,P)
M =
9.0000 18.0000 27.0000 36.0000 45.0000
2.3333 4.6667 7.0000 9.3333 11.6667
1.0000 2.0000 3.0000 4.0000 5.0000
0.4286 0.8571 1.2857 1.7143 2.1429
0.1111 0.2222 0.3333 0.4444 0.5556
V =
90.0000 180.0000 270.0000 360.0000 450.0000
7.7778 15.5556 23.3333 31.1111 38.8889
2.0000 4.0000 6.0000 8.0000 10.0000
0.6122 1.2245 1.8367 2.4490 3.0612
0.1235 0.2469 0.3704 0.4938 0.6173
ncfcdf
Syntax P = ncfcdf(X,NU1,NU2,DELTA)
The noncentral F cdf is

$$F(x \mid \nu_1,\nu_2,\delta) = \sum_{j=0}^{\infty} \left(\frac{\left(\tfrac{1}{2}\delta\right)^{j}}{j!}\, e^{-\delta/2}\right) I\!\left(\frac{\nu_1 x}{\nu_2 + \nu_1 x} \,\middle|\, \frac{\nu_1}{2}+j,\; \frac{\nu_2}{2}\right)$$

where $I(x \mid a,b)$ is the incomplete beta function with parameters a and b.
Examples Compare the noncentral F cdf with δ = 10 to the F cdf with the same
number of numerator and denominator degrees of freedom (5 and 20
respectively).
x = (0.01:0.1:10.01)';
p1 = ncfcdf(x,5,20,10);
p = fcdf(x,5,20);
plot(x,p,'-',x,p1,'-')
ncfinv
Syntax X = ncfinv(P,NU1,NU2,DELTA)
Examples One hypothesis test for comparing two sample variances is to take
their ratio and compare it to an F distribution. If the numerator and
denominator degrees of freedom are 5 and 20 respectively, then you
reject the hypothesis that the first variance is equal to the second
variance if their ratio exceeds the critical value computed below.
critical = finv(0.95,5,20)
critical =
2.7109
Suppose the truth is that the first variance is twice as big as the second
variance. How likely is it that you would detect this difference?
prob = 1 - ncfcdf(critical,5,20,2)
prob =
0.1297
If the true ratio of variances is 2, what is the typical (median) value you
would expect for the F statistic?
ncfinv(0.5,5,20,2)
ans =
1.2786
ncfpdf
Syntax Y = ncfpdf(X,NU1,NU2,DELTA)
Examples Compare the noncentral F pdf with δ = 10 to the F pdf with the same
number of numerator and denominator degrees of freedom (5 and 20
respectively).
x = (0.01:0.1:10.01)';
p1 = ncfpdf(x,5,20,10);
p = fpdf(x,5,20);
plot(x,p,'-',x,p1,'-')
ncfrnd
Syntax R = ncfrnd(NU1,NU2,DELTA)
R = ncfrnd(NU1,NU2,DELTA,v)
R = ncfrnd(NU1,NU2,DELTA,m,n)
Examples r = ncfrnd(10,100,4,1,6)
r =
2.5995 0.8824 0.8220 1.4485 1.4415 1.4864
r1 = frnd(10,100,1,6)
r1 =
0.9826 0.5911 1.0967 0.9681 2.0096 0.6598
ncfstat
The mean of the noncentral F distribution with parameters ν1, ν2, and δ is

$$\frac{\nu_2(\nu_1 + \delta)}{\nu_1(\nu_2 - 2)}$$

where ν2 > 2.

The variance is

$$2\left(\frac{\nu_2}{\nu_1}\right)^{2} \left[\frac{(\nu_1+\delta)^{2} + (\nu_1 + 2\delta)(\nu_2-2)}{(\nu_2-2)^{2}(\nu_2-4)}\right]$$

where ν2 > 4.
NaiveBayes.NClasses property
Description The NClasses property specifies the number of classes in the grouping
variable used to create the Naive Bayes classifier.
gmdistribution.NComponents property
nctcdf
Syntax P = nctcdf(X,NU,DELTA)
Examples Compare the noncentral t cdf with DELTA = 1 to the t cdf with the same
number of degrees of freedom (10).
x = (-5:0.1:5)';
p1 = nctcdf(x,10,1);
p = tcdf(x,10);
plot(x,p,'-',x,p1,'-')
nctinv
Syntax X = nctinv(P,NU,DELTA)
nctpdf
Syntax Y = nctpdf(X,V,DELTA)
Examples Compare the noncentral t pdf with DELTA = 1 to the t pdf with the same
number of degrees of freedom (10):
x = (-5:0.1:5)';
nct = nctpdf(x,10,1);
t = tpdf(x,10);
plot(x,nct,'b-','LineWidth',2)
hold on
plot(x,t,'g--','LineWidth',2)
legend('nct','t')
nctrnd
Syntax R = nctrnd(V,DELTA)
R = nctrnd(V,DELTA,v)
R = nctrnd(V,DELTA,m,n)
Examples nctrnd(10,1,5,1)
ans =
1.6576
1.0617
1.4491
0.2930
3.6297
nctstat
Description [M,V] = nctstat(NU,DELTA) returns the mean M and variance V of the
noncentral t distribution with NU degrees of freedom and noncentrality parameter
DELTA. NU and DELTA can be vectors, matrices, or multidimensional
arrays that all have the same size, which is also the size of M and V. A
scalar input for NU or DELTA is expanded to a constant array with the
same dimensions as the other input.
The mean of the noncentral t distribution with parameters ν and δ is
$$\delta \left(\frac{\nu}{2}\right)^{1/2} \frac{\Gamma\bigl((\nu-1)/2\bigr)}{\Gamma(\nu/2)}$$

where ν > 1.

The variance is

$$\frac{\nu}{\nu-2}\left(1+\delta^{2}\right) - \frac{\nu}{2}\,\delta^{2}\left[\frac{\Gamma\bigl((\nu-1)/2\bigr)}{\Gamma(\nu/2)}\right]^{2}$$
where ν > 2.
Examples
[m,v] = nctstat(10,1)
m =
1.0837
v =
1.3255
ncx2cdf
Syntax P = ncx2cdf(X,V,DELTA)
The noncentral chi-square cdf is

$$F(x \mid \nu,\delta) = \sum_{j=0}^{\infty} \left(\frac{\left(\tfrac{1}{2}\delta\right)^{j}}{j!}\, e^{-\delta/2}\right) \Pr\left[\chi^{2}_{\nu+2j} \le x\right]$$
Examples Compare the noncentral chi-square cdf with δ = 2 to the chi-square cdf
with the same number of degrees of freedom (4):
x = (0:0.1:10)';
ncx2 = ncx2cdf(x,4,2);
chi2 = chi2cdf(x,4);
plot(x,ncx2,'b-','LineWidth',2)
hold on
plot(x,chi2,'g--','LineWidth',2)
legend('ncx2','chi2','Location','NW')
ncx2inv
Syntax X = ncx2inv(P,V,DELTA)
ncx2pdf
Syntax Y = ncx2pdf(X,V,DELTA)
Examples Compare the noncentral chi-square pdf with δ = 2 to the chi-square pdf
with the same number of degrees of freedom (4):
x = (0:0.1:10)';
ncx2 = ncx2pdf(x,4,2);
chi2 = chi2pdf(x,4);
plot(x,ncx2,'b-','LineWidth',2)
hold on
plot(x,chi2,'g--','LineWidth',2)
legend('ncx2','chi2')
ncx2rnd
Syntax R = ncx2rnd(V,DELTA)
R = ncx2rnd(V,DELTA,v)
R = ncx2rnd(V,DELTA,m,n)
Examples ncx2rnd(4,2,6,3)
ans =
6.8552 5.9650 11.2961
5.2631 4.2640 5.9495
9.1939 6.7162 3.8315
10.3100 4.4828 7.1653
2.1142 1.9826 4.6400
3.8852 5.3999 0.9282
ncx2stat
categorical.ndims
Syntax n = ndims(A)
gmdistribution.NDimensions property
dataset.ndims
Syntax n = ndims(A)
qrandset.ndims
Syntax n = ndims(p)
NaiveBayes.NDims property
Description The NDims property specifies the number of dimensions, which is equal
to the number of features in the training data used to create the Naive
Bayes classifier.
qrandstream.ne
Syntax h1 ~= h2
Description Handles are equal if they are handles for the same object and are
unequal otherwise.
h1 ~= h2 performs element-wise comparisons between handle arrays
h1 and h2. h1 and h2 must be of the same dimensions unless one is a
scalar. The result is a logical array of the same dimensions, where each
element is an element-wise ~= result.
If one of h1 or h2 is scalar, scalar expansion is performed and the result
will match the dimensions of the array that is not scalar.
tf = ne(h1, h2) stores the result in a logical array of the same
dimensions.
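A brief sketch of handle comparison (the stream settings are illustrative):

h1 = qrandstream('halton',3);
h2 = h1;                      % h2 refers to the same stream object
h3 = qrandstream('halton',3); % a distinct object with identical settings
h1 ~= h2   % 0: same handle
h1 ~= h3   % 1: different handles, even though the settings match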
NeighborSearcher class
Properties X
A matrix used to create the object.
Distance
A string specifying a built-in distance metric (applies to both
ExhaustiveSearcher and KDTreeSearcher) or a function handle
(only applies to ExhaustiveSearcher) that you provide when you
create the object. This property is the default distance metric used
when you call the knnsearch method to find nearest neighbors
for future query points.
DistParameter
Specifies the additional parameter for the chosen distance metric.
The value is:
qrandset.net
Syntax X = net(p,n)
Description X = net(p,n) returns the first n points X from the point set p of the
qrandset class. X is n-by-d, where d is the dimension of the point set.
Objects p of the @qrandset class encapsulate properties of a specified
quasi-random sequence. Values of the point set are not generated and
stored in memory until p is accessed using net or parenthesis indexing.
Examples Use haltonset to generate a 3-D Halton point set, skip the first 1000
values, and then retain every 101st point:
p = haltonset(3,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
Apply a reverse-radix scramble:
p = scramble(p,'RR2')
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
Use net to generate the first four points:
X0 = net(p,4)
X0 =
0.0928 0.6950 0.0029
0.6958 0.2958 0.8269
Use parenthesis indexing to generate every third point, up to the eleventh:
X = p(1:3:11,:)
X =
0.0928 0.6950 0.0029
0.9087 0.7883 0.2166
0.3843 0.9840 0.9878
0.6831 0.7357 0.7923
nlinfit
The model function has the form yhat = modelfun(b,X), returning the
fitted values yhat for coefficient vector b and predictor array X.
load reaction
beta = nlinfit(reactants,rate,@hougen,beta)
beta =
1.2526
0.0628
0.0400
0.1124
1.1914
References [1] Seber, G. A. F., and C. J. Wild. Nonlinear Regression. Hoboken, NJ:
Wiley-Interscience, 2003.
nlintool
Syntax nlintool(X,y,fun,beta0)
nlintool(X,y,fun,beta0,alpha)
nlintool(X,y,fun,beta0,alpha,'xname','yname')
load reaction
nlintool(reactants,rate,@hougen,beta,0.01,xn,yn)
nlmefit
yfit = modelfun(PHI,XFUN,VFUN)
Note If modelfun can compute yfit for more than one vector of model
parameters per call, use the 'Vectorization' parameter (described
later) for improved performance.
Parameter Value
FEParamsSelect A vector specifying which elements of
the parameter vector PHI include a fixed
effect, given as a numeric vector of indices
between 1 and p or as a 1-by-p logical
vector. If q is the specified number of
elements, then the model includes q fixed
effects.
FEConstDesign A p-by-q design matrix ADESIGN, where
ADESIGN*beta are the fixed components of
the p elements of PHI.
FEGroupDesign A p-by-q-by-m array specifying a different
p-by-q fixed-effects design matrix for each
of the m groups.
FEObsDesign A p-by-q-by-n array specifying a different
p-by-q fixed-effects design matrix for each
of the n observations.
REParamsSelect A vector specifying which elements of the
parameter vector PHI include a random
effect, given as a numeric vector of indices
between 1 and p or as a 1-by-p logical
vector. The model includes r random
effects, where r is the specified number of
elements.
REConstDesign A p-by-r design matrix BDESIGN, where
BDESIGN*B are the random components of
the p elements of PHI.
REGroupDesign A p-by-r-by-m array specifying a different
p-by-r random-effects design matrix for
each of m groups.
REObsDesign A p-by-r-by-n array specifying a different
p-by-r random-effects design matrix for
each of n observations.
RefineBeta0 Determines whether nlmefit makes an
initial refinement of beta0 by first fitting
modelfun without random effects and
replacing beta0 with beta. Choices are
'on' and 'off'. The default value is 'on'.
ApproximationType The method used to approximate the
likelihood of the model. Choices are:
Vectorization Indicates acceptable sizes for the PHI,
XFUN, and VFUN input arguments to
modelfun. Choices are:
CovParameterization Specifies the parameterization used
internally for the scaled covariance matrix.
Choices are 'chol' for the Cholesky
factorization or 'logm' for the matrix
logarithm. The default is 'logm'.
CovPattern Specifies an r-by-r logical or numeric
matrix P that defines the pattern of the
random-effects covariance matrix PSI.
nlmefit estimates the variances along
the diagonal of PSI and the covariances
specified by nonzeroes in the off-diagonal
elements of P. Covariances corresponding
to zero off-diagonal elements in P are
constrained to be zero. If P does not specify
a row-column permutation of a block
diagonal matrix, nlmefit adds nonzero
elements to P as needed. The default
value of P is eye(r), corresponding to
uncorrelated random effects.
Alternatively, P may be a 1-by-r vector
containing values in 1:r, with equal values
specifying groups of random effects. In this
case, nlmefit estimates covariances only
within groups, and constrains covariances
across groups to be zero.
ParamTransform A vector of P values specifying a
transformation function f() for each of the
P parameters:
XB = ADESIGN*BETA + BDESIGN*B
PHI = f(XB)
Each element of the vector must be one
of the following integer codes specifying
the transformation for the corresponding
value of PHI:
• 0: PHI = XB (no transformation)
• 1: log(PHI) = XB
• 2: probit(PHI) = XB
• 3: logit(PHI) = XB
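Examples The example below uses growth data for five orange trees measured at seven times. A sketch of the setup assumed by the plotting and fitting calls (the circumference values are the classic orange tree data; the initial estimates are illustrative):

time = [118 484 664 1004 1231 1372 1582];   % days
CIRC = [30  58  87 115 120 142 145;         % tree 1
        33  69 111 156 172 203 203;         % tree 2
        30  51  75 108 115 139 140;         % tree 3
        32  62 112 167 179 209 214;         % tree 4
        30  49  81 125 142 174 177];        % tree 5
beta0 = [100 100 100];                      % initial fixed-effects estimates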
h = plot(time,CIRC','o','LineWidth',2);
xlabel('Time (days)')
ylabel('Circumference (mm)')
Specify a logistic growth model:
model = @(PHI,t)(PHI(:,1))./(1+exp(-(t-PHI(:,2))./PHI(:,3)));
Fit the model using nlmefit with default settings (that is, assuming
each parameter is the sum of a fixed and a random effect, with no
correlation among the random effects):
TIME = repmat(time,5,1);
NUMS = repmat((1:5)',size(time));
[beta2,PSI2,stats2,b2] = nlmefit(TIME(:),CIRC(:),...
NUMS(:),[],model,beta0,'REParamsSelect',[1 3])
beta2 =
191.3190
723.7610
346.2527
PSI2 =
962.0491 0
0 298.1869
stats2 =
logl: -131.5457
mse: 59.7881
aic: 275.0913
bic: 284.4234
sebeta: NaN
dfe: 29
b2 =
-28.5254 31.6061 -36.5071 39.0738 -5.6475
10.0034 -0.7633 6.0080 -9.4630 -5.7853
The log-likelihood logl is unaffected, and both the Akaike and Bayesian
information criteria (aic and bic) are reduced, supporting the decision
to drop the second random effect from the model.
Use the estimated fixed effects in beta2 and the estimated random
effects for each tree in b2 to plot the model through the data:
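Here PHI holds the per-tree parameters: the fixed effects plus the random effects estimated for parameters 1 and 3 (a sketch of this assembly step; the zero row reflects parameter 2, which has no random effect):

PHI = repmat(beta2,1,5) + ...            % fixed effects
      [b2(1,:); zeros(1,5); b2(2,:)];    % random effects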
colors = get(h,'Color');
tplot = 0:0.1:1600;
for I = 1:5
fitted_model=@(t)(PHI(1,I))./(1+exp(-(t-PHI(2,I))./ ...
PHI(3,I)));
plot(tplot,fitted_model(tplot),'Color',colors{I}, ...
'LineWidth',2)
end
nlmefitsa
Input Arguments
Definitions: In the following list of arguments, these variable definitions apply:
• n — number of observations
• h — number of predictor variables
• m — number of groups
• g — number of group-specific predictor variables
• p — number of parameters
X
An n-by-h matrix of n observations on h predictor variables.
Y
An n-by-1 vector of responses.
GROUP
A grouping variable indicating which of m groups each observation
belongs to. GROUP can be a categorical variable, a numeric vector,
a character matrix with rows for group names, or a cell array
of strings.
V
An m-by-g matrix of g group-specific predictor variables for each
of the m groups in the data. These are predictor values that take
on the same value for all observations in a group. Rows of V are
ordered according to GRP2IDX(GROUP). Use an m-by-g cell array
for V if any of the group-specific predictor values vary in size
across groups. Specify [] for V if there are no group predictors.
MODELFUN
A handle to a function that accepts predictor values and model
parameters, and returns fitted values. MODELFUN has the form
YFIT = MODELFUN(PHI,XFUN,VFUN) with input arguments
Name/Value Pairs
By default, nlmefitsa fits a model where each model parameter is the
sum of a corresponding fixed and random effect. Use the following
parameter name/value pairs to fit a model with a different number of
fixed or random effects, or with a different dependence on them. Use at most one parameter
name with an 'FE' prefix and one parameter name with an 'RE' prefix.
Note that some choices change the way nlmefitsa calls MODELFUN, as
described further below.
FEParamsSelect
A vector specifying which elements of the model parameter vector
PHI include a fixed effect, as a numeric vector with elements in
1:p, or as a 1-by-p logical vector. The model will include f fixed
effects, where f is the specified number of elements.
FEConstDesign
CovPattern
Specifies an r-by-r logical or numeric matrix PAT that defines the
pattern of the random effects covariance matrix PSI. nlmefitsa
computes estimates for the variances along the diagonal of
PSI as well as covariances that correspond to non-zeroes in
the off-diagonal of PAT. nlmefitsa constrains the remaining
covariances, i.e., those corresponding to off-diagonal zeroes in
PAT, to be zero. PAT must be a row-column permutation of a block
diagonal matrix, and nlmefitsa adds non-zero elements to PAT
as needed to produce such a pattern. The default value of PAT is
eye(r), corresponding to uncorrelated random effects.
ErrorModel
The form of the error term; e is a standard normal variable, f the
function value, and a and b are error parameters (see the sketch after
this list). Choices are:
• 'constant' — y = f + a*e
• 'proportional' — y = f + b*f*e
• 'combined' — y = f + (a+b*f)*e
• 'exponential' — y = f*exp(a*e), or equivalently log(y) = log(f)
+ a*e
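For instance, a hedged sketch of requesting a proportional error model (X, y, group, model, and phi0 are placeholders for the arguments shown in the example later in this entry):

[beta,PSI,stats] = nlmefitsa(X,y,group,[],model,phi0,...
                             'ErrorModel','proportional');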
ErrorParameters
A scalar or two-element vector specifying starting values for
parameters of the error model. This specifies the a, b, or [a b]
values depending on the ErrorModel parameter.
LogLikMethod
Specifies the method for approximating the log likelihood. Choices
are:
NBurnIn
Number of initial burn-in iterations during which the parameter
estimates are not recomputed. Default is 5.
NChains
Number c of "chains" simulated. Default is 1. Setting c>1 causes
c simulated coefficient vectors to be computed for each group
during each iteration. Default depends on the data, and is chosen
to provide about 100 groups across all chains.
NIterations
Number of iterations. This can be a scalar or a three-element
vector. Controls how many iterations are performed for each of
three phases of the algorithm:
1 simulated annealing
ParamTransform
A vector of p values specifying a transformation function f() for
each of the p parameters:
XB = ADESIGN*BETA + BDESIGN*B
PHI = f(XB)
Replicates
Number REPS of estimations to perform starting from the starting
values in the vector BETA0. If BETA0 is a matrix, REPS must
match the number of columns in BETA0. Default is the number of
columns in BETA0.
Vectorization
Determines the possible sizes of the PHI, XFUN, and VFUN input
arguments to MODELFUN. Possible values are:
Output Arguments
BETA
Estimates of the fixed effects
PSI
An r-by-r estimated covariance matrix for the random effects. By
default, r is equal to the number of model parameters p.
STATS
A structure with the following fields:
load indomethacin
model = @(phi,t)(phi(:,1).*exp(-phi(:,2).*t)+phi(:,3).*exp(-phi(:,4).*t));
phi0 = [1 1 1 1];
% log transform for 2nd and 4th parameters
xform = [0 1 0 1];
[beta,PSI,stats,br] = nlmefitsa(time,concentration,...
subject,[],model,phi0,'ParamTransform',xform)
The marginal likelihood of the model is

$$p\left(y \mid \beta, \sigma^{2}, \Sigma\right) = \int p\left(y \mid \beta, b, \sigma^{2}\right) p\left(b \mid \Sigma\right)\, db$$
gmdistribution.NlogL property
ProbDistParametric.NLogL property
Purpose Read-only value specifying negative log likelihood for input data to
ProbDistParametric object
Values The value is a numeric scalar for a distribution fit to input data, that
is, a distribution created using the fitdist function. This property is
empty for distributions created without fitting to data, that is, by using
the ProbDistUnivParam.ProbDistUnivParam constructor. Use this
information to view and compare the negative log likelihood for input
data supplied to create distributions.
ProbDistUnivKernel.NLogL property
Purpose Read-only value specifying negative log likelihood for input data to
ProbDistUnivKernel object
Values The value is a numeric scalar for a distribution fit to input data, that is,
a distribution created using the fitdist function. Use this information
to view and compare the negative log likelihood for input data used to
create distributions.
nlparci
Syntax ci = nlparci(beta,resid,'covar',sigma)
ci = nlparci(beta,resid,'jacobian',J)
ci = nlparci(...,'alpha',alpha)
load reaction
[beta,resid,J,Sigma] = ...
nlinfit(reactants,rate,@hougen,beta);
ci = nlparci(beta,resid,'jacobian',J)
ci =
-0.7467 3.2519
-0.0377 0.1632
-0.0312 0.1113
-0.0609 0.2857
-0.7381 3.1208
nlpredci
Description [ypred,delta] =
nlpredci(modelfun,x,beta,resid,'covar',sigma) returns
predictions, ypred, and 95% confidence interval half-widths, delta, for
the nonlinear regression model defined by modelfun, at input values x.
modelfun is a function handle, specified using @, that accepts two
arguments—a coefficient vector and the array x—and returns a vector
of fitted y values. Before calling nlpredci, use nlinfit to fit modelfun
by nonlinear least squares and get estimated coefficient values beta,
residuals resid, and estimated coefficient covariance matrix sigma.
[ypred,delta] =
nlpredci(modelfun,x,beta,resid,'jacobian',J) is an alternative
syntax that also computes 95% confidence intervals. J is the Jacobian
computed by nlinfit. If the 'robust' option is used with nlinfit, use
the 'covar' input rather than the 'jacobian' input so that the
required sigma parameter takes the robust fitting into account.
[...] = nlpredci(...,param1,val1,param2,val2,...) accepts
optional parameter name/value pairs.
Parameter Value
'alpha' A value between 0 and 1 that specifies the confidence
level as 100(1-alpha)%. Default is 0.05.
'mse' The mean squared error returned by nlinfit. This is
required to predict new observations (see 'predopt') if
the robust option is used with nlinfit; otherwise, the
'mse' is computed from the residuals and does not take
the robust fitting into account.
Parameter Value
'predopt' Either 'curve' (the default) to compute confidence
intervals for the estimated curve (function value) at
x, or 'observation' for prediction intervals for a
new observation at x. If 'observation' is specified
after using a robust option with nlinfit, the 'mse'
parameter must be supplied to specify the robust
estimate of the mean squared error.
'simopt' Either 'on' for simultaneous bounds, or 'off' (the
default) for nonsimultaneous bounds.
Examples Continuing the example from nlinfit, you can determine the predicted
function value at the value newX and the half-width of a confidence
interval for it.
load reaction;
[beta,resid,J,Sigma] = nlinfit(reactants,rate,@hougen,...
beta);
newX = reactants(1:2,:);
[ypred, delta] = nlpredci(@hougen,newX,beta,resid,...
'Covar',Sigma);
ypred =
8.4179
3.9542
delta =
0.2805
0.2474
nnmf
The root mean square residual D between the n-by-m matrix A and its
approximation W*H is
D = norm(A-W*H,'fro')/sqrt(N*M)
Parameter Value
'algorithm' Either 'als' (the default) to use an alternating
least-squares algorithm, or 'mult' to use a
multiplicative update algorithm.
In general, the 'als' algorithm converges faster
and more consistently. The 'mult' algorithm is
more sensitive to initial values, which makes it a
good choice when using 'replicates' to find W
and H from multiple random starting values.
'w0' An n-by-k matrix to be used as the initial value
for W.
'h0' A k-by-m matrix to be used as the initial value
for H.
'options' An options structure as created by the statset
function. nnmf uses the following fields of the
options structure: Display, TolX, TolFun,
and MaxIter. Unlike in optimization settings,
reaching MaxIter iterations is treated as
convergence.
'replicates' The number of times to repeat the factorization,
using new random starting values for W and H,
except at the first replication if 'w0' and 'h0'
are given. This is most beneficial with the 'mult'
algorithm. The default is 1.
Examples Example 1
Compute a nonnegative rank-two approximation of the measurements
of the four variables in Fisher’s iris data:
load fisheriris
[W,H] = nnmf(meas,2);
H
H =
0.6852 0.2719 0.6357 0.2288
0.8011 0.5740 0.1694 0.0087
The first and third variables in meas (sepal length and petal length,
with coefficients 0.6852 and 0.6357, respectively) provide relatively
strong weights to the first column of W. The first and second variables
in meas (sepal length and sepal width, with coefficients 0.8011 and
0.5740) provide relatively strong weights to the second column of W.
Create a biplot of the data and the variables in meas in the column
space of W:
biplot(H','scores',W,'varlabels',{'sl','sw','pl','pw'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')
Example 2
Starting from a random array X with rank 20, try a few iterations at
several replicates using the multiplicative algorithm:
X = rand(100,20)*rand(20,50);
opt = statset('MaxIter',5,'Display','final');
[W0,H0] = nnmf(X,5,'replicates',10,...
'options',opt,...
'algorithm','mult');
rep iteration rms resid |delta x|
1 5 0.560887 0.0245182
2 5 0.66418 0.0364471
3 5 0.609125 0.0358355
4 5 0.608894 0.0415491
5 5 0.619291 0.0455135
6 5 0.621549 0.0299965
7 5 0.640549 0.0438758
8 5 0.673015 0.0366856
9 5 0.606835 0.0318931
10 5 0.633526 0.0319591
Final root mean square residual = 0.560887
Continue with more iterations from the best of these results using
alternating least squares:
opt = statset('Maxiter',1000,'Display','final');
[W,H] = nnmf(X,5,'w0',W0,'h0',H0,...
'options',opt,...
'algorithm','als');
rep iteration rms resid |delta x|
1 80 0.256914 9.78625e-005
Final root mean square residual = 0.256914
References [1] Berry, M. W., et al. “Algorithms and Applications for Approximate
Nonnegative Matrix Factorization.” Computational Statistics and Data
Analysis. Vol. 52, No. 1, 2007, pp. 155–173.
classregtree.nodeerr
Syntax e = nodeerr(t)
e = nodeerr(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
e = nodeerr(t)
e =
0.6667
0
0.5000
0.0926
0.0217
0.0208
0.3333
0
0
classregtree.nodeprob
Syntax p = nodeprob(t)
p = nodeprob(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
p = nodeprob(t)
p =
1.0000
0.3333
0.6667
0.3600
0.3067
0.3200
0.0400
0.3133
0.0067
classregtree.nodesize
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
sizes = nodesize(t)
sizes =
150
50
100
54
46
48
6
47
1
qrandstream.notify
Syntax notify(h,'eventname')
notify(h,'eventname',data)
nominal class
Superclasses categorical
Description Nominal arrays are used to store discrete values that are not
numeric and that do not have an ordering. A nominal array provides
efficient storage and convenient manipulation of such data, while also
maintaining meaningful labels for the values. Nominal arrays are often
used as grouping variables.
You can subscript, concatenate, reshape, etc. nominal arrays much
like ordinary numeric arrays. You can test equality between elements
of two nominal arrays, or between a nominal array and a single string
representing a nominal value.
Construction Use the nominal constructor to create a nominal array from a numeric,
logical, or character array, or from a cell array of strings.
Methods Each nominal array carries along a list of possible values that it
can store, known as its levels. The list is created when you create a
nominal array, and you can access it using the getlevels method, or
modify it using the addlevels, mergelevels, or droplevels methods.
Assignment to the array will also add new levels automatically if the
values assigned are not already levels of the array.
You can change the order of the list of levels for a nominal array using
the reorderlevels method, however, that order has no significance for
the values in the array. The order is used only for display purposes, or
when you convert the nominal array to numeric values using methods
such as double or subsindex, or compare two arrays using isequal. If
you need to work with values that have a mathematical ordering, you
should use an ordinal array instead.
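A minimal sketch of levels and level order (the values are illustrative):

colors = nominal({'r';'g';'b'});              % levels default to b, g, r
colors = reorderlevels(colors,{'r' 'g' 'b'}); % affects display order only
double(colors)                                % 1 2 3, following the new order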
Inherited Methods
Methods in the following table are inherited from categorical.
Copy Semantics Value. To learn how this affects your use of the class, see
Comparing Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
Test membership against level labels (a sketch):
colors = nominal({'r';'g';'b'},{'blue' 'green' 'red'});
ismember(colors,{'red' 'blue'})
nominal
Syntax B = nominal(A)
B = nominal(A,labels)
B = nominal(A,labels,levels)
B = nominal(A,labels,[],edges)
load fisheriris
species = nominal(species);
summary(species)
setosa versicolor virginica
50 50 50
Create a nominal array from characters, and provide explicit labels:
colors1 = nominal({'r' 'b' 'g'; 'g' 'r' 'b'; 'b' 'r' 'g'},...
{'blue' 'green' 'red'})
Create a nominal array from characters, and provide both explicit labels
and an explicit order for display:
colors2 = nominal({'r' 'b' 'g'; 'g' 'r' 'b'; 'b' 'r' 'g'}, ...
{'red' 'green' 'blue'},{'r' 'g' 'b'})
Create a nominal array from integer data, merging odd and even values
into only two nominal levels. Provide explicit labels:
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add levels to smoke for more detailed smoking histories:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
patients.smoke = droplevels(patients.smoke,'Yes');
normcdf
Syntax P = normcdf(X,mu,sigma)
[P,PLO,PUP] = normcdf(X,mu,sigma,pcov,alpha)
normcdf computes confidence bounds for P using a normal approximation
to the distribution of the estimate

$$\frac{X - \hat{\mu}}{\hat{\sigma}}$$
and then transforming those bounds to the scale of the output P. The
computed bounds give approximately the desired confidence level when
you estimate mu, sigma, and pcov from large samples, but in smaller
samples other methods of computing the confidence bounds might be
more accurate.
The normal cdf is

$$p = F(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{\frac{-(t-\mu)^{2}}{2\sigma^{2}}}\, dt$$
p = normcdf([-1 1]);
p(2)-p(1)
ans =
0.6827
normfit
Examples In this example the data is a two-column random normal matrix. Both
columns have µ = 10 and σ = 2. Note that the confidence intervals below
contain the "true values."
data = normrnd(10,2,100,2);
[mu,sigma,muci,sigmaci] = normfit(data)
mu =
10.1455 10.0527
sigma =
1.9072 2.1256
muci =
9.7652 9.6288
10.5258 10.4766
sigmaci =
1.6745 1.8663
2.2155 2.4693
norminv
Syntax X = norminv(P,mu,sigma)
[X,XLO,XUP] = norminv(P,mu,sigma,pcov,alpha)
norminv computes confidence bounds for X using a normal approximation
to the distribution of the estimate

$$\hat{\mu} + \hat{\sigma}\, q$$
where q is the Pth quantile from a normal distribution with mean 0 and
standard deviation 1. The computed bounds give approximately the
desired confidence level when you estimate mu, sigma, and pcov from
large samples, but in smaller samples other methods of computing the
confidence bounds may be more accurate.
The normal inverse function is defined in terms of the normal cdf as

$$x = F^{-1}(p \mid \mu,\sigma) = \left\{ x : F(x \mid \mu,\sigma) = p \right\}$$

where

$$p = F(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{\frac{-(t-\mu)^{2}}{2\sigma^{2}}}\, dt$$
The result, x, is the solution of the integral equation above where you
supply the desired probability, p.
Examples Find an interval that contains 95% of the values from a standard
normal distribution.
x = norminv([0.025 0.975],0,1)
x =
-1.9600 1.9600
Note that the interval x is not the only such interval, but it is the
shortest.
xl = norminv([0.01 0.96],0,1)
xl =
-2.3263 1.7507
The interval xl also contains 95% of the probability, but it is longer
than x.
normlike
normpdf
Syntax Y = normpdf(X,mu,sigma)
The normal pdf is

$$y = f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$$
Examples mu = [0:0.1:2];
[y i] = max(normpdf(1.5,mu,1));
MLE = mu(i)
MLE =
1.5000
See Also pdf, normcdf, norminv, normstat, normfit, normlike, normrnd, mvnpdf
“Normal Distribution” on page B-83
normplot
Syntax h = normplot(X)
Examples Generate a normal sample and a normal probability plot of the data.
x = normrnd(10,1,25,1);
normplot(x)
normrnd
Syntax R = normrnd(mu,sigma)
R = normrnd(mu,sigma,v)
R = normrnd(mu,sigma,m,n)
Examples n1 = normrnd(1:6,1./(1:6))
n1 =
2.1650 2.3134 3.0250 4.0879 4.8607 6.2827
n2 = normrnd(0,1,[1 5])
n2 =
0.0591 1.7971 0.2641 0.8717 -1.4462
normspec
Syntax normspec(specs)
normspec(specs,mu,sigma)
normspec(specs,mu,sigma,region)
p = normspec(...)
[p,h] = normspec(...)
Examples A production process fills cans of paint. The average amount of paint in
any can is 1 gallon, but variability in the process produces a standard
deviation of 2 ounces (2/128 gallons). What is the probability that cans
will be filled under specification by 3 or more ounces?
p = normspec([1-3/128,Inf],1,2/128,'outside')
p =
0.0668
normstat
Examples n = 1:5;
[m,v] = normstat(n'*n,n'*n)
m =
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
v =
1 4 9 16 25
4 16 36 64 100
9 36 81 144 225
16 64 144 256 400
25 100 225 400 625
piecewisedistribution.nsegments
Syntax n = nsegments(obj)
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
n = nsegments(obj)
n =
3
CompactTreeBagger.NTrees property
Description The NTrees property is a scalar equal to the number of decision trees
in the ensemble.
TreeBagger.NTrees property
Description The NTrees property is a scalar equal to the number of decision trees
in the ensemble.
ProbDistParametric.NumParams property
Values This value is an integer that counts both the specified parameters and
parameters that are fit to the data. Use this information to view and
compare the number of parameters supplied to create distributions.
dataset.numel
Syntax n = numel(A)
n = numel(A, varargin)
categorical.numel
Syntax n = numel(A)
n = numel(A, varargin)
classregtree.numnodes
Syntax n = numnodes(t)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t=
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
n = numnodes(t)
n =
9
cvpartition.NumTestSets property
TreeBagger.NVarToSample property
dataset.ObsNames property
Description A cell array of nonempty, distinct strings giving the names of the
observations in the data set. This property may be empty, but if not
empty, the number of strings must equal the number of observations.
TreeBagger.oobError
TreeBagger.OOBIndices property
TreeBagger.OOBInstanceWeight property
TreeBagger.oobMargin
TreeBagger.oobMeanMargin
TreeBagger.OOBPermutedVarCountRaiseMargin
property
TreeBagger.OOBPermutedVarDeltaError property
TreeBagger.OOBPermutedVarDeltaMeanMargin
property
TreeBagger.oobPredict
Syntax Y = oobPredict(B)
Y = oobPredict(B,'param1',val1,'param2',val2,...)
ordinal class
Superclasses categorical
Description Ordinal arrays are used to store discrete values that have an ordering
but are not numeric. An ordinal array provides efficient storage
and convenient manipulation of such data, while also maintaining
meaningful labels for the values. Ordinal arrays are often used as
grouping variables.
Like a numerical array, an ordinal array can have any size or
dimension. You can subscript, concatenate, reshape, sort, etc. ordinal
arrays, much like ordinary numeric arrays. You can make comparisons
between elements of two ordinal arrays, or between an ordinal array
and a single string representing an ordinal value.
Construction Use the ordinal constructor to create an ordinal array from a numeric,
logical, or character array, or from a cell array of strings.
Methods Each ordinal array carries along a list of possible values that it can
store, known as its levels. The list is created when you create an
ordinal array, and you can access it using the getlevels method, or
modify it using the addlevels, mergelevels, or droplevels methods.
Assignment to the array will also add new levels automatically if the
values assigned are not already levels of the array. The ordering on
values stored in an ordinal array is defined by the order of the list of
levels. You can change that order using the reorderlevels method.
The following table lists operations available for ordinal arrays.
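For instance (a minimal sketch), relational operators and sorting follow the level order:

sz = ordinal({'small';'large';'medium'},...
             {'small' 'medium' 'large'},{'small' 'medium' 'large'});
sz(1) < sz(2)   % 1 (true): 'small' is below 'large' in the level order
sort(sz)        % small, medium, large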
Inherited Methods
Methods in the following table are inherited from categorical.
Copy Semantics Value. To learn how this affects your use of the class, see
Comparing Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
ordinal
Syntax B = ordinal(A)
B = ordinal(A,labels)
B = ordinal(A,labels,levels)
B = ordinal(A,labels,[],edges)
Examples Create an ordinal array from integer data, and provide explicit labels:
Create an ordinal array from integer data, and provide both explicit
labels and an explicit order:
load fisheriris
m = floor(min(meas(:)));
M = floor(max(meas(:)));
labels = num2str((m:M)');
edges = m:M+1;
cms = ordinal(meas,labels,[],edges)
meas(1:5,:)
ans =
5.1000 3.5000 1.4000 0.2000
4.9000 3.0000 1.4000 0.2000
Bin ages from the hospital data into decades:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
hospital.Age(1:5)
ans =
38
43
38
40
49
AgeGroup(1:5)
ans =
30s
40s
30s
40s
40s
CompactTreeBagger.outlierMeasure
TreeBagger.OutlierMeasure property
parallelcoords
Syntax parallelcoords(X)
parallelcoords(X,...,'Standardize','on')
parallelcoords(X,...,'Standardize','PCA')
parallelcoords(X,...,'Standardize','PCAStd')
parallelcoords(X,...,'Quantile',alpha)
parallelcoords(X,...,'Group',group)
parallelcoords(X,...,'Labels',labels)
parallelcoords(X,...,PropertyName,PropertyValue,...)
h = parallelcoords(X,...)
parallelcoords(axes,...)
ProbDistUnivParam.paramci
Syntax CI = paramci(PD)
CI = paramci(PD, Alpha)
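A usage sketch (assuming a distribution object created with fitdist):

pd = fitdist(normrnd(10,2,100,1),'normal');
ci = paramci(pd)        % default 95% intervals for mu and sigma
ci99 = paramci(pd,0.01) % 99% intervals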
ProbDistParametric.ParamCov property
Values This covariance matrix includes estimates for both the specified
parameters and parameters that are fit to the data. For specified
parameters, the covariance is 0, indicating the parameter is known
exactly. Use this information to view and compare the descriptions of
parameters supplied to create distributions.
ProbDistParametric.ParamDescription property
Values This cell array includes a brief description of the meaning of both
the specified parameters and parameters that are fit to the data.
The description is the same as the parameter name when no further
description information is available. Use this information to view and
compare the descriptions of parameters used to create distributions.
ProbDistParametric.ParamIsFixed property
Values This array specifies a 1 (true) for fixed parameters, and a 0 (false)
for parameters that are estimated from the input data. Use this
information to view and compare the fixed parameters used to create
distributions.
ProbDistParametric.ParamNames property
Values This cell array includes the names of both the specified parameters and
parameters that are fit to the data. Use this information to view and
compare the names of parameters used to create distributions.
NaiveBayes.Params property
ProbDistParametric.Params property
Values This array includes the values of both the specified parameters and
parameters that are fit to the data. Use this information to view and
compare the values of parameters used to create distributions.
classregtree.parent
Syntax p = parent(t)
p = parent(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 else node 3
2 class = setosa
3 if PW<1.75 then node 4 else node 5
4 if PL<4.95 then node 6 else node 7
5 class = virginica
6 if PW<1.65 then node 8 else node 9
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
p = parent(t)
p =
0
1
1
3
3
4
4
6
6
pareto
Syntax pareto(y,names)
[h,ax] = pareto(...)
Description pareto(y,names) displays a Pareto chart where the values in the vector
y are drawn as bars in descending order. Each bar is labeled with the
associated value in the string matrix or cell array, names. pareto(y)
labels each bar with the index of the corresponding element in y.
The line above the bars shows the cumulative percentage.
[h,ax] = pareto(...) returns a combination of patch and line object
handles to the two axes created in ax.
Examples Create a Pareto chart from data measuring the number of manufactured
parts rejected for various types of defects.
defects = {'pits';'cracks';'holes';'dents'};
quantity = [5 3 19 25];
pareto(quantity,defects)
paretotails class
Superclasses piecewisedistribution
Inherited Methods
Methods in the following table are inherited from
piecewisedistribution.
Copy Semantics Value. To learn how this affects your use of the class, see
Comparing Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
paretotails
The pdf method in the tails is the GPD density, but in the center it is
computed as the slope of the interpolated cdf.
The paretotails class is a subclass of the piecewisedistribution
class, and many of its methods are derived from that class.
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj);
x = linspace(-5,5);
plot(x,cdf(obj,x),'b-','LineWidth',2)
hold on
plot(x,tcdf(x,3),'r:','LineWidth',2)
plot(q,p,'bo','LineWidth',2,'MarkerSize',5)
legend('Pareto Tails Object','t Distribution',...
'Location','NW')
partialcorr
If the covariance of [X, Z] is partitioned as

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^{T} & S_{22} \end{pmatrix}$$

then the partial correlation matrix of X, controlling for Z, can be defined
formally as a normalized version of the covariance matrix

$$S_{xy} = S_{11} - S_{12}\, S_{22}^{-1}\, S_{12}^{T}$$
[RHO,PVAL] = partialcorr(...) also returns PVAL, a matrix of
p-values for testing the hypothesis of no partial correlation against the
alternative that there is a nonzero partial correlation. Each element of
PVAL is the p value for the corresponding element of RHO. If PVAL(I,J)
is small, say less than 0.05, then the partial correlation, RHO(I,J), is
significantly different from zero.
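A minimal sketch of reading RHO and PVAL (the data are synthetic; two variables co-move only through a third):

z = randn(500,1);
x = [z + 0.5*randn(500,1), z + 0.5*randn(500,1)];
corr(x(:,1),x(:,2))           % strong raw correlation
[rho,pval] = partialcorr(x,z) % near zero once z is controlled for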
[...] = partialcorr(...,param1,val1,param2,val2,...)
specifies additional parameters and their values. Valid parameter/value
pairs are listed in the following table.
Parameter Values
'type'  • 'Pearson' — Compute Pearson (linear) partial correlations. This is the default.
        • 'Spearman' — Compute Spearman (rank) partial correlations.
'rows'  • 'all' — Use all rows regardless of missing (NaN) values. This is the default.
        • 'complete' — Use only rows with no missing values.
        • 'pairwise' — Compute RHO(I,J) using rows with no missing values in column I or J.
'tail'  The alternative hypothesis against which to compute p-values for testing the hypothesis of no partial correlation:
        • 'both' (the default) — The correlation is not zero.
        • 'right' — The correlation is greater than zero.
        • 'left' — The correlation is less than zero.
A 'pairwise' value for the rows parameter can produce a RHO that is
not positive definite. A 'complete' value always produces a positive
definite RHO, but when data is missing, the estimates will be based on
fewer observations, in general.
partialcorr computes p-values for linear and rank partial correlations
using a Student’s t distribution for a transformation of the correlation.
This is exact for linear partial correlation when X and Z are normal, but
is a large-sample approximation otherwise.
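For illustration, a minimal call might look like this (the variable names and sizes here are arbitrary, not from the original text):

x = randn(100,2);               % two variables of interest
z = randn(100,1);               % a variable to control for
[rho,pval] = partialcorr(x,z);  % 2-by-2 partial correlation matrix and p-values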
pcacov
latent =
517.7969
67.4964
12.4054
0.2372
explained =
86.5974
11.2882
2.0747
0.0397
pcares
Examples This example shows the drop in the residuals from the first row of the
Hald data as the number of component dimensions increases from one
to three.
load hald
r1 = pcares(ingredients,1);
r2 = pcares(ingredients,2);
r3 = pcares(ingredients,3);
r11 = r1(1,:)
r11 =
2.0350 2.8304 -6.8378 3.0879
r21 = r2(1,:)
r21 =
-2.4037 2.6930 -1.6482 2.3425
r31 = r3(1,:)
r31 =
0.2008 0.1957 0.2045 0.1921
References [1] Jackson, J. E., A User’s Guide to Principal Components, John Wiley
and Sons, 1991.
gmdistribution.PComponents property
pdf
Syntax Y = pdf(name,X,A)
Y = pdf(name,X,A,B)
Y = pdf(name,X,A,B,C)
Examples Compute the pdf of the normal distribution with mean 0 and standard
deviation 1 at inputs –2, –1, 0, 1, 2:
p1 = pdf('Normal',-2:2,0,1)
p1 =
0.0540 0.2420 0.3989 0.2420 0.0540
Compute the pdf of Poisson distributions with parameter values 1 through 5,
at corresponding inputs 0 through 4:
p2 = pdf('Poisson',0:4,1:5)
p2 =
0.3679 0.2707 0.2240 0.1954 0.1755
gmdistribution.pdf
Syntax y = pdf(obj,X)
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
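For example, the density of this two-component mixture can be evaluated at a few points (an illustrative call, not the original figure-producing code):

y = pdf(obj,[0 0; 1 2; -3 -5])  % mixture density at three 2-D points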
piecewisedistribution.pdf
Syntax P = pdf(obj,X)
Note For a Pareto tails object, the pdf is computed using the
generalized Pareto distribution in the tails. In the center, the pdf is
computed using the slopes of the cdf, which are interpolated between
a set of discrete values. Therefore the pdf in the center is piecewise
constant. It is noisy for a cdffun specified in paretotails via the
'ecdf' option, and somewhat smoother for the 'kernel' option, but
generally not a good estimate of the underlying density of the original
data.
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
pdf(obj,q)
ans =
0.2367
0.1960
ProbDist.pdf
Syntax Y = pdf(PD, X)
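As a minimal illustration (assuming a distribution object created with fitdist, which is not part of this page):

pd = fitdist(randn(100,1),'normal');  % fit a normal distribution to sample data
y = pdf(pd,[-1 0 1]);                 % evaluate the fitted pdf at three points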
pdist
Syntax D = pdist(X)
D = pdist(X,distance)
D = pdist(X,'minkowski',P)
D = pdist(X,'mahalanobis',C)
Metric Description
'euclidean' Euclidean distance (default).
'seuclidean' Standardized Euclidean distance. Each
coordinate difference between rows in X is
scaled by dividing by the corresponding
element of the standard deviation
S=nanstd(X). To specify another value for
S, use D=pdist(X,'seuclidean',S).
'cityblock' City block metric.
'minkowski' Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',P), where P is a scalar positive value of the exponent.
'chebychev' Chebychev distance (maximum coordinate difference).
'mahalanobis' Mahalanobis distance, using the sample
covariance of X as computed by nancov. To
compute the distance with a different covariance,
use D = pdist(X,'mahalanobis',C), where
the matrix C is symmetric and positive definite.
'cosine' One minus the cosine of the included angle
between points (treated as vectors).
'correlation' One minus the sample correlation between
points (treated as sequences of values).
'spearman' One minus the sample Spearman’s rank
correlation between observations (treated as
sequences of values).
'hamming' Hamming distance, which is the percentage of
coordinates that differ.
'jaccard' One minus the Jaccard coefficient, which is the
percentage of nonzero coordinates that differ.
custom distance A distance function specified using @, for example D = pdist(X,@distfun). A distance function must be of the form
d2 = distfun(XI,XJ)
taking as arguments a 1-by-n vector XI and an m2-by-n matrix XJ, and returning an m2-by-1 vector of distances d2.
• Euclidean distance
$d_{st}^2 = (x_s - x_t)(x_s - x_t)'$
• Standardized Euclidean distance
$d_{st}^2 = (x_s - x_t) V^{-1} (x_s - x_t)'$
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2, and S is the vector of standard deviations.
• Mahalanobis distance
$d_{st}^2 = (x_s - x_t) C^{-1} (x_s - x_t)'$
where C is the covariance matrix.
• City block metric
$d_{st} = \sum_{j=1}^{n} |x_{sj} - x_{tj}|$
Notice that the city block distance is a special case of the Minkowski
metric, where p = 1.
• Minkowski metric
$d_{st} = \sqrt[p]{\sum_{j=1}^{n} |x_{sj} - x_{tj}|^p}$
Notice that for the special case of p = 1, the Minkowski metric gives
the city block metric, for the special case of p = 2, the Minkowski
metric gives the Euclidean distance, and for the special case of p = ∞,
the Minkowski metric gives the Chebychev distance.
• Chebychev distance
$d_{st} = \max_j \{ |x_{sj} - x_{tj}| \}$
Notice that the Chebychev distance is a special case of the Minkowski
metric, where p = ∞.
• Cosine distance
$d_{st} = 1 - \frac{x_s x_t'}{\sqrt{(x_s x_s')(x_t x_t')}}$
• Correlation distance
$d_{st} = 1 - \frac{(x_s - \bar{x}_s)(x_t - \bar{x}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'} \sqrt{(x_t - \bar{x}_t)(x_t - \bar{x}_t)'}}$
where $\bar{x}_s = \frac{1}{n} \sum_j x_{sj}$ and $\bar{x}_t = \frac{1}{n} \sum_j x_{tj}$
• Hamming distance
$d_{st} = \#(x_{sj} \neq x_{tj})/n$
• Jaccard distance
$d_{st} = \frac{\#\left[(x_{sj} \neq x_{tj}) \cap \left((x_{sj} \neq 0) \cup (x_{tj} \neq 0)\right)\right]}{\#\left[(x_{sj} \neq 0) \cup (x_{tj} \neq 0)\right]}$
• Spearman distance
$d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'} \sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}$
where
- rsj is the rank of xsj taken over x1j, x2j, ...xmj, as computed by
tiedrank
- rs and rt are the coordinate-wise rank vectors of xs and xt, i.e.,
rs = (rs1, rs2, ... rsn)
- $\bar{r}_s = \frac{1}{n} \sum_j r_{sj} = \frac{n+1}{2}$
- $\bar{r}_t = \frac{1}{n} \sum_j r_{tj} = \frac{n+1}{2}$
Examples Generate random data and find the unweighted Euclidean distance and
then find the weighted distance using two different methods:
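A minimal sketch of such a computation (the variable names and data here are illustrative assumptions):

X = randn(100,5);                 % random data
D = pdist(X,'euclidean');         % unweighted Euclidean distance
Dwgt = pdist(X,'seuclidean');     % weighted by standard deviations, S = nanstd(X)
w = 1./var(X);                    % equivalent weights, applied explicitly
wdist = @(XI,XJ) sqrt(bsxfun(@minus,XI,XJ).^2 * w(:));  % custom distance function
Dwgt2 = pdist(X,wdist);           % matches Dwgt up to rounding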
pdist2
Syntax D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)
Metric Description
'euclidean' Euclidean distance (default).
'seuclidean' Standardized Euclidean distance. Each coordinate
difference between rows in X and Y is scaled
by dividing by the corresponding element
of the standard deviation computed from X,
S=nanstd(X). To specify another value for S, use
D = pdist2(X,Y,'seuclidean',S).
'cityblock' City block metric.
'minkowski' Minkowski distance. The default exponent is 2. To
compute the distance with a different exponent,
use D = pdist2(X,Y,'minkowski',P), where the
exponent P is a scalar positive value.
'chebychev' Chebychev distance (maximum coordinate
difference).
'mahalanobis' Mahalanobis distance, using the sample covariance
of X as computed by nancov. To compute the
distance with a different covariance, use D =
pdist2(X,Y,'mahalanobis',C) where the matrix C
is symmetric and positive definite.
'cosine' One minus the cosine of the included angle between
points (treated as vectors).
'correlation' One minus the sample correlation between points
(treated as sequences of values).
• Euclidean distance
$d_{st}^2 = (x_s - y_t)(x_s - y_t)'$
• Standardized Euclidean distance
$d_{st}^2 = (x_s - y_t) V^{-1} (x_s - y_t)'$
where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2.
• Mahalanobis distance
$d_{st}^2 = (x_s - y_t) C^{-1} (x_s - y_t)'$
• City block metric
$d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}|$
Notice that the city block distance is a special case of the Minkowski
metric, where p=1.
• Minkowski metric
$d_{st} = \sqrt[p]{\sum_{j=1}^{n} |x_{sj} - y_{tj}|^p}$
Notice that for the special case of p = 1, the Minkowski metric gives
the City Block metric, for the special case of p = 2, the Minkowski
metric gives the Euclidean distance, and for the special case of p=∞,
the Minkowski metric gives the Chebychev distance.
• Chebychev distance
$d_{st} = \max_j \{ |x_{sj} - y_{tj}| \}$
Notice that the Chebychev distance is a special case of the Minkowski
metric, where p=∞.
• Cosine distance
$d_{st} = 1 - \frac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}}$
• Correlation distance
$d_{st} = 1 - \frac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'} \sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}}$
where $\bar{x}_s = \frac{1}{n} \sum_j x_{sj}$ and $\bar{y}_t = \frac{1}{n} \sum_j y_{tj}$
• Hamming distance
$d_{st} = \#(x_{sj} \neq y_{tj})/n$
• Jaccard distance
$d_{st} = \frac{\#\left[(x_{sj} \neq y_{tj}) \cap \left((x_{sj} \neq 0) \cup (y_{tj} \neq 0)\right)\right]}{\#\left[(x_{sj} \neq 0) \cup (y_{tj} \neq 0)\right]}$
• Spearman distance
$d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'} \sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}$
where
- rsj is the rank of xsj taken over x1j, x2j, ...xmx,j, as computed by
tiedrank
- rtj is the rank of ytj taken over y1j, y2j, ...ymy,j, as computed by
tiedrank
- rs and rt are the coordinate-wise rank vectors of xs and yt, i.e. rs =
(rs1, rs2, ... rsn) and rt = (rt1, rt2, ... rtn)
- $\bar{r}_s = \frac{1}{n} \sum_j r_{sj} = \frac{n+1}{2}$
- $\bar{r}_t = \frac{1}{n} \sum_j r_{tj} = \frac{n+1}{2}$
Examples Generate random data and find the unweighted Euclidean distance,
then find the weighted distance using two different methods:
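A minimal sketch (the names and data here are illustrative assumptions):

X = randn(100,5);                 % mx-by-n data
Y = randn(25,5);                  % my-by-n data
D = pdist2(X,Y,'euclidean');      % unweighted Euclidean distances
Dwgt = pdist2(X,Y,'seuclidean');  % weighted by standard deviations computed from X
w = 1./var(X);                    % the same weighting via a custom distance function
Dwgt2 = pdist2(X,Y,@(XI,XJ) sqrt(bsxfun(@minus,XI,XJ).^2 * w(:)));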
pearsrnd
Syntax r = pearsrnd(mu,sigma,skew,kurt)
r = pearsrnd(mu,sigma,skew,kurt,m,n)
[r,type] = pearsrnd(...)
[r,type,coefs] = pearsrnd(...)
Some combinations of moments are not valid for any random variable,
and in particular, the kurtosis must be greater than the square of the
skewness plus 1. The kurtosis of the normal distribution is defined
to be 3.
r = pearsrnd(mu,sigma,skew,kurt) returns a scalar value.
r = pearsrnd(mu,sigma,skew,kurt,m,n,...) or r =
pearsrnd(mu,sigma,skew,kurt,[m,n,...]) returns an m-by-n-by-...
array.
[r,type] = pearsrnd(...) returns the type of the specified
distribution within the Pearson system. type is a scalar integer from
0 to 7. Set m and n to zero to identify the distribution type without
generating any random values.
The seven distribution types in the Pearson system correspond to the
following distributions:
• 0 — Normal distribution
• 1 — Four-parameter beta distribution
The distributions in the Pearson system satisfy the differential equation
$\frac{d(\log(p(x)))}{dx} = \frac{-(a + x)}{c(0) + c(1) \cdot x + c(2) \cdot x^2}$
Examples Identify the Pearson distribution type for mean 0, standard deviation 1, skewness 1, and kurtosis 4, without generating any random values:
[r,type] = pearsrnd(0,1,1,4,0,0);
r =
[]
type =
1
perfcurve
Input Arguments
labels labels can be a numeric vector, logical vector, character matrix, cell array of strings, or categorical vector.
scores scores is a numeric vector of scores returned by a
classifier for some data. This vector must have as
many elements as labels does.
posclass posclass is the positive class label (scalar),
either numeric (for numeric labels) or char. The
specified positive class must be in the array of
input labels.
Name/Value Pairs
UseNearest 'on' to use nearest values found in the data instead of the
specified numeric XVals or TVals and 'off' otherwise. If you
specify numeric XVals and set UseNearest to 'on', perfcurve
returns nearest unique values X found in the data, as well
as corresponding values of Y and T. If you specify numeric
XVals and set UseNearest to 'off', perfcurve returns these
XVals sorted. By default this parameter is set to 'on'. If you
compute confidence bounds by cross-validation or bootstrap, this
parameter is always 'off'.
$S = \frac{\mathrm{cost}(P|N) - \mathrm{cost}(N|N)}{\mathrm{cost}(N|P) - \mathrm{cost}(P|P)} \cdot \frac{N}{P}$
where cost(I|J) is the cost of assigning an instance
of class J to class I, and P=TP+FN and N=TN+FP are the
total instance counts in the positive and negative class,
respectively. perfcurve then finds the optimal operating
point by moving the straight line with slope S from the
upper left corner of the ROC plot (FPR=0, TPR=1) down
and to the right until it intersects the ROC curve.
SUBY An array of Y values for negative subclasses. If you
only specify one negative class, SUBY is identical to Y.
Otherwise SUBY is a matrix of size M-by-K, where M is
the number of returned values for X and Y, and K is
the number of negative classes. perfcurve computes
Y values by summing counts over all negative classes.
SUBY gives values of the Y criterion for each negative
class separately. For each negative class, perfcurve
places a new column in SUBY and fills it with Y values for
TN and FP counted just for this class.
SUBYNAMES A cell array of negative class names. If you provide
an input array, negClass, of negative class names,
perfcurve copies it into SUBYNAMES. If you do not provide
negClass, perfcurve extracts SUBYNAMES from input
labels. The order of SUBYNAMES is the same as the order
of columns in SUBY, that is, SUBY(:,1) is for negative
class SUBYNAMES{1} etc.
load fisheriris
x = meas(51:end,1:2);
% iris data, 2 classes and 2 features
y = (1:100)'>50;
% versicolor=0, virginica=1
b = glmfit(x,y,'binomial');
% logistic regression
p = glmval(b,x,'logit');
% fit probabilities for scores
[X,Y,T,AUC] = perfcurve(species(51:end,:),p,'virginica');
plot(X,Y)
xlabel('False positive rate'); ylabel('True positive rate')
title('ROC for classification by logistic regression')
[X,Y] = perfcurve(species(51:end,:),p,'virginica',...
'nboot',1000,'xvals','all');
% plot errors
errorbar(X,Y(:,1),Y(:,2)-Y(:,1),Y(:,3)-Y(:,1));
References [1] T. Fawcett, ROC Graphs: Notes and Practical Considerations for
Researchers, 2004.
[6] W. Briggs and R. Zaretzki, The Skill Plot: A Graphical Technique for
Evaluating Continuous Diagnostic Tests, Biometrics 63, 250-261, 2008.
[7] http://www2.cs.uregina.ca/~hamilton/courses/831/notes/lift_chart/lift_chart.html;
http://www.dmreview.com/news/5329-1.html.
[9] http://www.stata.com/statalist/archive/2003-02/msg00060.html
perms
Syntax P = perms(v)
perms([2 4 6])
ans =
6 4 2
6 2 4
4 6 2
4 2 6
2 4 6
2 6 4
See Also
combnk
categorical.permute
Syntax B = permute(A,order)
piecewisedistribution class
Copy Semantics Value. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
piecewisedistribution
plsregress
[XL,YL,XS,YS,BETA,PCTVAR,MSE] =
plsregress(...,param1,val1,param2,val2,...) specifies optional
parameter name/value pairs from the following table to control the
calculation of MSE.
Parameter Value
'cv' The method used to compute MSE.
[XL,YL,XS,YS,BETA,PCTVAR,MSE,stats] =
plsregress(X,Y,ncomp,...) returns a structure stats with the
following fields:
load spectra
X = NIR;
y = octane;
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
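Compute the fitted response and residuals from the coefficient estimates (beta returned by plsregress includes the intercept as its first element), then plot the residuals:

yfit = [ones(size(X,1),1) X]*beta;
residuals = y - yfit;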
stem(residuals)
xlabel('Observation');
ylabel('Residual');
sobolset.PointOrder property
Description The PointOrder property contains a string that specifies the order in
which the Sobol sequence points are produced. The property value must
be one of 'standard' or 'graycode'. When set to 'standard' the points
produced match the original Sobol sequence implementation. When set
to 'graycode', the sequence is generated using an implementation that
uses the Gray code of the index instead of the index itself.
qrandstream.PointSet property
Description The PointSet property contains a copy of the point set from which the
stream is providing points. The point set is specified during construction
of a quasi-random stream and cannot subsequently be altered.
poisscdf
Syntax P = poisscdf(X,lambda)
$p = F(x|\lambda) = e^{-\lambda} \sum_{i=0}^{\lfloor x \rfloor} \frac{\lambda^i}{i!}$
probability = 1-poisscdf(4,2)
probability =
0.0527
probability = poisscdf(4,4)
probability =
0.6288
poissfit
$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Examples r = poissrnd(5,10,2);
[l,lci] = poissfit(r)
l =
7.4000 6.3000
lci =
5.8000 4.8000
9.1000 7.9000
poissinv
Syntax X = poissinv(P,lambda)
Examples If the average number of defects (λ) is two, what is the 95th percentile
of the number of defects?
poissinv(0.95,2)
ans =
5
median_defects = poissinv(0.50,2)
median_defects =
2
poisspdf
Syntax Y = poisspdf(X,lambda)
$y = f(x|\lambda) = \frac{\lambda^x}{x!} e^{-\lambda} I_{(0,1,\ldots)}(x)$
Examples A computer hard disk manufacturer has observed that flaws occur
randomly in the manufacturing process at the average rate of two flaws
in a 4 GB hard disk and has found this rate to be acceptable. What is
the probability that a disk will be manufactured with no defects?
In this problem, λ = 2 and x = 0.
p = poisspdf(0,2)
p =
0.1353
poissrnd
Syntax R = poissrnd(lambda)
R = poissrnd(lambda,m)
R = poissrnd(lambda,m,n)
lambda = 2;
random_sample1 = poissrnd(lambda,1,10)
random_sample1 =
1 0 1 2 1 3 4 2 0 0
random_sample3 = poissrnd(lambda(ones(1,10)))
random_sample3 =
3 2 1 1 0 0 4 0 2 0
poisstat
Syntax M = poisstat(lambda)
[M,V] = poisstat(lambda)
Examples Find the mean and variance for the Poisson distribution with λ = 2.
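Both the mean and the variance of a Poisson distribution equal λ:

[m,v] = poisstat(2)
m =
2
v =
2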
polyconf
Syntax Y = polyconf(p,X)
[Y,DELTA] = polyconf(p,X,S)
[Y,DELTA] = polyconf(p,X,S,param1,val1,param2,val2,...)
Parameter Value
'alpha' A value between 0 and 1 specifying a confidence level
of 100*(1-alpha)%. The default is 0.05.
'mu' A two-element vector containing centering and
scaling parameters. With this option, polyconf uses
(X-mu(1))/mu(2) in place of X.
'predopt' Either 'observation' (the default) to compute
prediction intervals for new observations at the
values in X, or 'curve' to compute confidence
intervals for the fit evaluated at the values in X. See
below.
'simopt' Either 'off' (the default) for nonsimultaneous
bounds, or 'on' for simultaneous bounds. See below.
Examples This example uses code from the documentation example function
polydemo, and calls the documentation example function polystr
to convert the coefficient vector p into a string for the polynomial
expression displayed in the figure title. It combines the functions
polyfit, polyval, roots, and polyconf to produce a formatted display
of data with a polynomial fit.
x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
% Add a legend.
legend([hdata,hfit,hroots,hconf],...
'Data','Fit','Real Roots of Fit',...
polytool
Syntax polytool
polytool(x,y)
polytool(x,y,n)
polytool(x,y,n,alpha)
polytool(x,y,n,alpha,xname,yname)
h = polytool(...)
Description polytool
polytool(x,y) fits a line to the vectors x and y and displays an
interactive plot of the result in a graphical interface. You can use the
interface to explore the effects of changing the parameters of the fit and
to export fit results to the workspace.
polytool(x,y,n) initially fits a polynomial of degree n. The default
is 1, which produces a linear fit.
polytool(x,y,n,alpha) initially plots 100(1 - alpha)% confidence
intervals on the predicted values. The default is 0.05 which results in
95% confidence intervals.
polytool(x,y,n,alpha,xname,yname) labels the x and y values on the
graphical interface using the strings xname and yname. Specify n and
alpha as [] to use their default values.
h = polytool(...) outputs a vector of handles, h, to the line objects
in the plot. The handles are returned in the order: data, fit, lower
bounds, upper bounds.
gmdistribution.posterior
Syntax P = posterior(obj,X)
[P,nlogl] = posterior(obj,X)
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
hold on
obj = gmdistribution.fit(X,2);
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
P = posterior(obj,X);
delete(h)
scatter(X(:,1),X(:,2),10,P(:,1),'.')
hb = colorbar;
ylabel(hb,'Component 1 Probability')
NaiveBayes.posterior
prctile
Syntax Y = prctile(X,p)
Y = prctile(X,p,dim)
Examples x = (1:5)'*(1:5)
x =
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
y = prctile(x,[25 50 75])
y =
1.7500 3.5000 5.2500 7.0000 8.7500
3.0000 6.0000 9.0000 12.0000 15.0000
4.2500 8.5000 12.7500 17.0000 21.2500
CompactTreeBagger.predict
NaiveBayes.predict
Description cpre = predict(nb,test) classifies each row of data in test into one
of the classes according to the NaiveBayes classifier nb, and returns the
predicted class level cpre. test is an N-by-nb.ndims matrix, where N is
the number of observations in the test data. Rows of test correspond to
points, columns of test correspond to features. cpre is an N-by-1 vector
of the same type as nb.CLevels, and it indicates the class to which
each row of test has been assigned.
cpre = predict(...,'HandleMissing',val) specifies how predict
treats NaN (missing values). val can be one of the following:
TreeBagger.predict
Syntax Y = predict(B,X)
[Y,stdevs] = predict(B,X)
[Y,scores] = predict(B,X)
[Y,scores,stdevs] = predict(B,X)
Y = predict(B,X,'param1',val1,'param2',val2,...)
princomp
Examples Compute principal components for the ingredients data in the Hald
data set, and the variance accounted for by each component.
load hald;
[pc,score,latent,tsquare] = princomp(ingredients);
pc,latent
pc =
0.0678 -0.6460 0.5673 -0.5062
0.6785 -0.0200 -0.5440 -0.4933
-0.0290 0.7553 0.4036 -0.5156
-0.7309 -0.1085 -0.4684 -0.4844
latent =
517.7969
67.4964
12.4054
0.2372
The following command and plot show that two components account for
98% of the variance:
cumsum(latent)./sum(latent)
ans =
0.86597
0.97886
0.9996
1
biplot(pc(:,1:2),'Scores',score(:,1:2),'VarLabels',...
{'X1' 'X2' 'X3' 'X4'})
References [1] Jackson, J. E., A User’s Guide to Principal Components, John Wiley
and Sons, 1991, p. 592.
TreeBagger.Prior property
Description The Prior property is a vector with prior probabilities for classes. This
property is empty for ensembles of regression trees.
ProbDist class
Copy Semantics Value. To learn how this affects your use of the class, see Copying Objects in the MATLAB Programming Fundamentals documentation.
ProbDistKernel class
Superclasses ProbDist
Note The above methods are inherited from the ProbDist class.
Note Some of the above properties are inherited from the ProbDist
class.
Copy Semantics Value. To learn how this affects your use of the class, see Copying Objects in the MATLAB Programming Fundamentals documentation.
ProbDistParametric class
Superclasses ProbDist
Note The above methods are inherited from the ProbDist class.
Note Some of the above properties are inherited from the ProbDist
class.
Copy Semantics Value. To learn how this affects your use of the class, see Copying Objects in the MATLAB Programming Fundamentals documentation.
ProbDistUnivKernel class
Superclasses ProbDistKernel
Copy Semantics Value. To learn how this affects your use of the class, see Copying Objects in the MATLAB Programming Fundamentals documentation.
References [1] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for
Data Analysis. New York: Oxford University Press, 1997.
ProbDistUnivKernel
Syntax PD = ProbDistUnivKernel(X)
PD = ProbDistUnivKernel(X, param1, val1, param2, val2, ...)
Description
Tip Although you can use this constructor function to create a
ProbDistUnivKernel object, using the fitdist function is an easier way
to create the ProbDistUnivKernel object.
Parameter Values
'censoring' A Boolean vector the same size as X, containing 1s when
the corresponding elements in X are right-censored
observations and 0s when the corresponding elements
are exact observations. Default is a vector of 0s.
'kernel' The type of kernel smoother to use. Choose from:
• 'normal' (default)
• 'box'
• 'triangle'
• 'epanechnikov'
Parameter Values
'support' Any of the following to specify the support:
References [1] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for
Data Analysis. New York: Oxford University Press, 1997.
ProbDistUnivParam class
Superclasses ProbDistParametric
Copy Semantics Value. To learn how this affects your use of the class, see Copying Objects in the MATLAB Programming Fundamentals documentation.
ProbDistUnivParam
• 'rayleigh'
• 'rician'
• 'tlocationscale'
• 'weibull' or 'wbl'
pd = ProbDistUnivParam('normal',[100 10])
pd =
normal distribution
mu = 100
sigma = 10
random(pd,4,5)
returns a 4-by-5 array of random values drawn from this distribution.
probplot
Syntax probplot(Y)
probplot(distribution,Y)
probplot(Y,cens,freq)
probplot(ax,Y)
probplot(...,'noref')
probplot(ax,PD)
probplot(ax,fun,params)
h = probplot(...)
The y axis scale is based on the selected distribution. The x axis has a
log scale for the Weibull and lognormal distributions, and a linear scale
for the others.
Not all distributions are appropriate for all data sets, and probplot will
error when asked to create a plot with a data set that is inappropriate
for a specified distribution. Appropriate data ranges for each
distribution are given parenthetically in the list above.
probplot(Y,cens,freq) or probplot(distname,Y,cens,freq)
requires a vector Y. cens is a vector of the same size as Y and contains
1 for observations that are right-censored and 0 for observations that
are observed exactly. freq is a vector of the same size as Y, containing
integer frequencies for the corresponding elements in Y.
probplot(ax,Y) takes a handle ax to an existing probability plot, and
adds additional lines for the samples in Y. ax is a handle for a set of axes.
probplot(...,'noref') omits the reference line.
probplot(ax,PD) takes a probability distribution object, PD, and
adds a fitted line to the axes specified by ax to represent the
probability distribution specified by PD. PD is a ProbDist object of the
ProbDistUnivParam class or ProbDistUnivKernel class.
probplot(ax,fun,params) takes a function fun and a set of
parameters, params, and adds fitted lines to the axes of an existing
probability plot specified by ax. fun is a function handle to a cdf
function, specified with @ (for example, @wblcdf). params is the set of
parameters required to evaluate fun, and is specified as a cell array
or vector. The function must accept a vector of X values as its first
argument, then the optional parameters, and must return a vector of
cdf values evaluated at X.
h = probplot(...) returns handles to the plotted lines.
Examples Example 1
The following plot assesses two samples, one from a Weibull distribution
and one from a Rayleigh distribution, to see if they may have come
from a Weibull population.
x1 = wblrnd(3,3,100,1);
x2 = raylrnd(3,100,1);
probplot('weibull',[x1 x2])
legend('Weibull Sample','Rayleigh Sample','Location','NW')
Example 2
Consider the following data, with about 20% outliers:
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = mle(data,'dist','tlo');
t = @(data,mu,sig,df)cdf('tlocationscale',data,mu,sig,df);
h = probplot(gca,t,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Data','Normal','t','Location','NW')
procrustes
Syntax d = procrustes(X,Y)
[d,Z] = procrustes(X,Y)
[d,Z,transform] = procrustes(X,Y)
[...] = procrustes(...,'scaling',flag)
[...] = procrustes(...,'reflection',flag)
d is standardized by a measure of the scale of X, given by
sum(sum((X-repmat(mean(X,1),size(X,1),1)).^2,1))
that is, the sum of squared elements of a centered version of X. The transform output is a structure with fields:
• c — Translation component
• T — Orthogonal rotation and reflection component
• b — Scale component
That is:
c = transform.c;
T = transform.T;
b = transform.b;
Z = b*Y*T + c;
Examples This example creates some random points in two dimensions, then
rotates, scales, translates, and adds some noise to those points. It uses
procrustes to conform Y to X, then plots the original X and Y with the
transformed Y.
n = 10;
X = normrnd(0,1,[n 2]);
S = [0.5 -sqrt(3)/2; sqrt(3)/2 0.5];
Y = normrnd(0.5*X*S+2,0.05,n,2);
[d,Z,tr] = procrustes(X,Y);
plot(X(:,1),X(:,2),'rx',...
Y(:,1),Y(:,2),'b.',...
Z(:,1),Z(:,2),'bx');
CompactTreeBagger.proximity
TreeBagger.Proximity property
classregtree.prune
Syntax t2 = prune(t1,'level',level)
t2 = prune(t1,'nodes',nodes)
t2 = prune(t1)
load fisheriris;
t1 = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'},...
'splitmin',5)
t1 =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
view(t1)
Display the next largest tree from the optimal pruning sequence:
t2 = prune(t1,'level',1)
t2 =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 class = versicolor
7 class = virginica
view(t2)
TreeBagger.Prune property
Description The Prune property is true if decision trees are pruned and false if they
are not. Pruning decision trees is not recommended for ensembles. The
default value is false.
qrandstream.qrand
Syntax x = qrand(q)
X = qrand(q,n)
q = qrandstream('halton',3,'Skip',1e3,'Leap',1e2)
q =
Halton quasi-random stream in 3 dimensions
Point set properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
nextIdx = q.State
nextIdx =
1
X1 = qrand(q,4)
X1 =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
X2 = qrand(q,4)
X2 =
0.2446 0.0238 0.8102
0.5298 0.7540 0.0438
0.3843 0.5112 0.2758
0.8335 0.2245 0.4694
nextIdx = q.State
nextIdx =
9
reset(q)
nextIdx = q.State
nextIdx =
1
X = qrand(q,4)
X =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
qrandset class
Copy Semantics Value. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
qrandset
qrandstream class
Copy Semantics Handle. To learn how this affects your use of the class, see Comparing Handle and Value Classes in the MATLAB Object-Oriented Programming documentation.
qrandstream
Syntax q = qrandstream(type,d)
q = qrandstream(type,d,prop1,val1,prop2,val2,...)
q = qrandstream(p)
Examples Construct a 3-D Halton stream, based on a point set that skips the first
1000 values and then retains every 101st point:
q = qrandstream('halton',3,'Skip',1e3,'Leap',1e2)
q =
Halton quasi-random stream in 3 dimensions
Point set properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
nextIdx = q.State
nextIdx =
1
X1 = qrand(q,4)
X1 =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
nextIdx = q.State
nextIdx =
5
X2 = qrand(q,4)
X2 =
0.2446 0.0238 0.8102
0.5298 0.7540 0.0438
0.3843 0.5112 0.2758
0.8335 0.2245 0.4694
nextIdx = q.State
nextIdx =
9
Use reset to reset the stream, and then generate another sample:
reset(q)
nextIdx = q.State
nextIdx =
1
X = qrand(q,4)
X =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
qqplot
Syntax qqplot(X)
qqplot(X,Y)
qqplot(X,PD)
qqplot(X,Y,pvec)
h = qqplot(X,Y,pvec)
x = poissrnd(10,50,1);
y = poissrnd(5,100,1);
qqplot(x,y);
quantile
Purpose Quantiles
Syntax Y = quantile(X,p)
Y = quantile(X,p,dim)
quantile computes quantiles as follows:
1 The sorted values in X are taken as the (0.5/n), (1.5/n), ..., ([n–0.5]/n)
quantiles.
2 Linear interpolation is used to compute quantiles for probabilities
between (0.5/n) and ([n–0.5]/n).
3 The minimum or maximum values in X are assigned to quantiles for
probabilities outside that range.
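For example, for the vector 1:5 the sorted values are the 0.1, 0.3, 0.5, 0.7, and 0.9 quantiles; probabilities in between are interpolated, and probabilities outside that range receive the extreme values:

quantile(1:5,[0.05 0.3 0.5 0.99])
ans =
1 2 3 5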
qrandstream.rand
Syntax rand
rand(q,n)
rand(q)
rand(q,m,n)
rand(q,[m,n])
rand(q,m,n,p,...)
rand(q,[m,n,p,...])
Examples Generate the first 256 points from a 5-D Sobol sequence:
q = qrandstream('sobol',5);
X = rand(q,256,5);
randg
Syntax Y = randg
Y = randg(A)
Y = randg(A,m)
Y = randg(A,m,n,...)
Y = randg(A,[m,n,p])
reset(RandStream.getDefaultStream,0);
r = randg(1,[10,1]);
Calling randg changes the current states of rand, randn, and randi,
and therefore alters the outputs of subsequent calls to those functions.
To generate gamma random numbers and specify both the scale and
shape parameters, you should call gamrnd rather than calling randg
directly.
References [1] Marsaglia, G., and W. W. Tsang. “A Simple Method for Generating
Gamma Variables.” ACM Transactions on Mathematical Software. Vol.
26, 2000, pp. 363–372.
random
Syntax Y = random(name,A)
Y = random(name,A,B)
Y = random(name,A,B,C)
Y = random(...,m,n,...)
Y = random(...,[m,n,...])
Examples Generate a 2-by-4 array of random values from the normal distribution
with mean 0 and standard deviation 1:
x1 = random('Normal',0,1,2,4)
x1 =
1.1650 0.0751 -0.6965 0.0591
0.6268 0.3516 1.6961 1.7971
Generate a 1-by-6 array of random values from Poisson distributions
with parameter values 1 through 6:
x2 = random('Poisson',1:6,1,6)
x2 =
0 0 1 2 5 7
gmdistribution.random
Syntax y = random(obj)
Y = random(obj,n)
[Y,idx] = random(obj,n)
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
Y = random(obj,1000);
scatter(Y(:,1),Y(:,2),10,'.')
piecewisedistribution.random
Syntax r = random(obj)
R = random(obj,n)
R = random(obj,m,n)
R = random(obj,[m,n])
R = random(obj,m,n,p,...)
R = random(obj,[m,n,p,...])
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
r = random(obj)
r =
0.8285
ProbDist.random
Syntax Y = random(PD)
Y = random(PD, N)
Y = random(PD, N, M, ...)
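As a minimal illustration (assuming a distribution object created with fitdist, which is not part of this page):

pd = fitdist(randn(100,1),'normal');
Y = random(pd,2,3);   % 2-by-3 array of random values from the fitted distribution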
randsample
Syntax y = randsample(n,k)
y = randsample(population,k)
y = randsample(...,replace)
y = randsample(...,true,w)
y = randsample(s, ...)
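A few illustrative calls (the values here are arbitrary):

y = randsample(10,3);              % 3 values from 1:10, without replacement
y = randsample([1 3 5 7],2,true);  % 2 values from the population, with replacement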
randtool
Syntax randtool
[Figure: the randtool window, showing a histogram of the current sample together with controls for parameter bounds, parameter values, parameter selection, additional parameters, resampling from the same distribution, and exporting to the workspace.]
• Use the controls at the bottom of the window to set parameter values
for the distribution and to change their upper and lower bounds.
• Draw another sample from the same distribution, with the same
size and parameters.
• Export the current sample to your workspace. A dialog box enables
you to provide a name for the sample.
range
Syntax range(X)
y = range(X,dim)
Description range(X) returns the difference between the maximum and the
minimum of a sample. For vectors, range(x) is the range of the
elements. For matrices, range(X) is a row vector containing the range
of each column of X. For N-dimensional arrays, range operates along
the first nonsingleton dimension of X.
y = range(X,dim) operates along the dimension dim of X.
range treats NaNs as missing values and ignores them.
The range is an easily calculated estimate of the spread of a sample.
Outliers have an undue influence on this statistic, which makes it an
unreliable estimator.
rv = normrnd(0,1,1000,5);
near6 = range(rv)
near6 =
6.1451 6.4986 6.2909 5.8894 7.0002
ranksum
Syntax p = ranksum(x,y)
[p,h] = ranksum(x,y)
[p,h] = ranksum(x,y,'alpha',alpha)
[p,h] = ranksum(...,'method',method)
[p,h,stats] = ranksum(...)
Examples Test the hypothesis of equal medians for two independent unequal-sized
samples. The sampling distributions are identical except for a shift
of 0.25.
x = unifrnd(0,1,10,1);
y = unifrnd(0.25,1.25,15,1);
[p,h] = ranksum(x,y)
p =
0.0375
h =
1
The test rejects the null hypothesis of equal medians at the default
5% significance level.
raylcdf
Syntax P = raylcdf(X,B)
$y = F(x|b) = \int_0^x \frac{t}{b^2} e^{-t^2/(2b^2)} \, dt$
Examples x = 0:0.1:3;
p = raylcdf(x,1);
plot(x,p)
[Figure: plot of the Rayleigh cdf for b = 1 over 0 ≤ x ≤ 3.]
raylfit
Syntax raylfit(data,alpha)
[phat,pci] = raylfit(data,alpha)
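A minimal illustration (the data here are simulated, not from the original text):

r = raylrnd(3,100,1);          % sample from a Rayleigh distribution with b = 3
[phat,pci] = raylfit(r,0.05);  % MLE of b and a 95% confidence interval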
raylinv
Syntax X = raylinv(P,B)
Examples x = raylinv(0.9,1)
x =
2.1460
raylpdf
Syntax Y = raylpdf(X,B)
$y = f(x|b) = \frac{x}{b^2} e^{-x^2/(2b^2)}$
Examples x = 0:0.1:3;
p = raylpdf(x,1);
plot(x,p)
raylrnd
Syntax R = raylrnd(B)
R = raylrnd(B,v)
R = raylrnd(B,m,n)
Examples r = raylrnd(1:5)
r =
1.7986 0.8795 3.3473 8.9159 3.5182
raylstat
Description [M,V] = raylstat(B) returns the mean of and variance for the
Rayleigh distribution with scale parameter B.
The mean of the Rayleigh distribution with scale parameter b is
$b \sqrt{\pi/2}$
and the variance is
$\frac{4 - \pi}{2} b^2$
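For example, with b = 1:

[m,v] = raylstat(1)
m =
1.2533
v =
0.4292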
rcoplot
Syntax rcoplot(r,rint)
Examples The following plots residuals and prediction intervals from a regression
of a linearly additive model to the data in moore.mat:
load moore
X = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
alpha = 0.05;
[betahat,Ibeta,res,Ires,stats] = regress(y,X,alpha);
rcoplot(res,Ires)
The interval around the first residual, shown in red, does not contain
zero. This indicates that the residual is larger than expected in 95% of
new observations, and suggests the data point is an outlier.
refcurve
Syntax refcurve(p)
refcurve
hcurve = refcurve(...)
Examples Example 1
Plot data from a population with a polynomial trend and use refcurve
to add both the population and fitted mean functions:
p = [1 -2 -1 0];
t = 0:0.1:3;
y = polyval(p,t) + 0.5*randn(size(t));
plot(t,y,'ro')
h = refcurve(p);
set(h,'Color','r')
q = polyfit(t,y,3);
refcurve(q)
legend('Data','Population Mean','Fitted Mean',...
'Location','NW')
Example 2
Plot trajectories of a batted baseball, with and without air resistance.
Relevant physical constants are:
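The definitions below use representative values (all assumptions, chosen only to make the simulation runnable; they are not values from the original text):

M = 0.145;                      % Mass of baseball (kg) -- assumed
R = 0.0366;                     % Radius (m) -- assumed
A = pi*R^2;                     % Cross-sectional area (m^2)
rho = 1.2;                      % Density of air (kg/m^3) -- assumed
C = 0.5;                        % Drag coefficient -- assumed
r0 = [0 1];                     % Initial position (m) -- assumed
v0 = 50*[cos(pi/4) sin(pi/4)];  % Initial velocity (m/s) -- assumed
dt = 0.01;                      % Simulation time step (s) -- assumed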
D = rho*C*A/2;
% Drag proportional to the square of the speed
g = 9.8; % Acceleration due to gravity (m/s^2)
r = r0;
v = v0;
trajectory = r0;
while r(2) > 0
a = [0 -g]-(D/M)*norm(v)*v;
v = v + a*dt;
r = r + v*dt + (1/2)*a*(dt^2);
trajectory = [trajectory;r];
end
plot(trajectory(:,1),trajectory(:,2),'m','LineWidth',2)
xlim([0,250])
h = refcurve([-g/(2*v0(1)^2),...
(g*r0(1)/v0(1)^2)+(v0(2)/v0(1)),...
(-g*r0(1)^2/(2*v0(1)^2))-(v0(2)*r0(1)/v0(1))+r0(2)]);
set(h,'Color','c','LineWidth',2)
axis equal
ylim([0,50])
grid on
xlabel('Distance (m)')
ylabel('Height (m)')
title('{\bf Baseball Trajectories}')
refline
Syntax refline(m,b)
refline(coeffs)
refline
hline = refline(...)
refline(coeffs), where coeffs is a two-element coefficient vector, adds the line
y = coeffs(1)*x + coeffs(2)
to the figure.
refline with no input arguments is equivalent to lsline.
hline = refline(...) returns the handle hline to the line.
Examples Add a reference line at the mean of a data scatter and its least-squares
line:
x = 1:10;
y = x + randn(1,10);
scatter(x,y,25,'b','*')
lsline
mu = mean(y);
hline = refline([0 mu]);
set(hline,'Color','r')
regress
Syntax b = regress(y,X)
[b,bint] = regress(y,X)
[b,bint,r] = regress(y,X)
[b,bint,r,rint] = regress(y,X)
[b,bint,r,rint,stats] = regress(y,X)
[...] = regress(y,X,alpha)
degrees of freedom. The intervals returned in rint are shifts of the 95%
confidence intervals of these t distributions, centered at the residuals.
[b,bint,r,rint,stats] = regress(y,X) returns a 1-by-4 vector
stats that contains, in order, the R2 statistic, the F statistic and its p
value, and an estimate of the error variance.
load carsmall
x1 = Weight;
x2 = Horsepower; % Contains NaN data
y = MPG;
X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X) % Removes NaN data
b =
60.7104
-0.0102
-0.1882
0.0000
scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
regstats
Syntax regstats(y,X,model)
stats = regstats(...)
stats = regstats(y,X,model,whichstats)
When you select check boxes corresponding to the statistics you want
to compute and click OK, regstats returns the selected statistics to
the MATLAB workspace. The names of the workspace variables are
displayed on the right-hand side of the interface. You can change the
name of the workspace variable to any valid MATLAB variable name.
stats = regstats(...) creates the structure stats, whose fields
contain all of the diagnostic statistics for the regression. This syntax
does not open the GUI. The fields of stats are listed in the following
table.
Field Description
Q Q from the QR decomposition of the design matrix
R R from the QR decomposition of the design matrix
beta Regression coefficients
covb Covariance of regression coefficients
yhat Fitted values of the response data
r Residuals
mse Mean squared error
rsquare R2 statistic
adjrsquare Adjusted R2 statistic
leverage Leverage
hatmat Hat matrix
s2_i Delete-1 variance
beta_i Delete-1 coefficients
standres Standardized residuals
studres Studentized residuals
dfbetas Scaled change in regression coefficients
dffit Change in fitted values
dffits Scaled change in fitted values
covratio Change in covariance
cookd Cook’s distance
tstat t statistics and p-values for coefficients
fstat F statistic and p-value
dwstat Durbin-Watson statistic and p-value
Note that the field names of stats correspond to the names of the
variables returned to the MATLAB workspace when you use the GUI.
For example, stats.beta corresponds to the variable beta that is
returned when you select Coefficients in the GUI and click OK.
stats = regstats(y,X,model,whichstats) returns only the statistics
that you specify in whichstats. whichstats can be a single string
such as 'leverage' or a cell array of strings such as {'leverage'
'standres' 'studres'}. Set whichstats to 'all' to return all of
the statistics.
Note The F statistic is computed under the assumption that the model
contains a constant term. It is not correct for models without a constant.
The R2 statistic can be negative for models without a constant, which
indicates that the model is not appropriate for the data.
load hald
regstats(heat,ingredients,'linear');
whichstats = {'yhat','r'};
stats = regstats(heat,ingredients,'linear',whichstats);
yhat = stats.yhat;
r = stats.r;
gmdistribution.RegV property
categorical.reorderlevels
Syntax B = reorderlevels(A,newlevels)
standings = ordinal(1:3,{'Leafs','Canadiens','Bruins'});
getlabels(standings)
ans =
'Leafs' 'Canadiens' 'Bruins'
standings = reorderlevels(standings,...
{'Canadiens','Leafs','Bruins'});
getlabels(standings)
ans =
'Canadiens' 'Leafs' 'Bruins'
cvpartition.repartition
c = cvpartition(100,'kfold',3)
c =
K-fold cross validation partition
N: 100
NumTestSets: 3
TrainSize: 67 66 67
TestSize: 33 34 33
cnew = repartition(c)
cnew =
K-fold cross validation partition
N: 100
NumTestSets: 3
TrainSize: 67 66 67
TestSize: 33 34 33
isequal(test(c,1),test(cnew,1))
ans =
0
dataset.replacedata
Syntax B = replacedata(A,X)
B = replacedata(A,X,vars)
B = replacedata(A,fun)
B = replacedata(A,fun,vars)
dataset.replacedata
X = double(data);
X = zscore(X);
data = replacedata(data,X)
categorical.repmat
Syntax B = repmat(A,m,n)
B = repmat(A,[m n p ...])
qrandstream.reset
Syntax reset(q)
Description reset(q) resets the state of the quasi-random number stream q of the
qrandstream class back to its initial state, 1. Subsequent points drawn
from the stream will be the same as those drawn from a new stream.
The command is equivalent to q.State = 1.
q = qrandstream('halton',3,'Skip',1e3,'Leap',1e2)
q =
Halton quasi-random stream in 3 dimensions
Point set properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
nextIdx = q.State
nextIdx =
1
X1 = qrand(q,4)
X1 =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
nextIdx = q.State
nextIdx =
5
X2 = qrand(q,4)
X2 =
0.2446 0.0238 0.8102
0.5298 0.7540 0.0438
0.3843 0.5112 0.2758
0.8335 0.2245 0.4694
nextIdx = q.State
nextIdx =
9
reset(q)
nextIdx = q.State
nextIdx =
1
X = qrand(q,4)
X =
0.0928 0.3475 0.0051
0.6958 0.2035 0.2371
0.3013 0.8496 0.4307
0.9087 0.5629 0.6166
categorical.reshape
Syntax B = reshape(A,M,N)
B = reshape(A,m,n,p,...)
reshape(A,[m n p ...])
B = reshape(A,...,[],...)
ridge
Syntax b = ridge(y,X,k)
b = ridge(y,X,k,scaled)
The following code restores coefficients b1, estimated from centered and scaled data, to the scale of the original data:
m = mean(X);
s = std(X,0,1)';
b1_scaled = b1./s;
b0 = [mean(y)-m*b1_scaled; b1_scaled]
Ordinary least squares estimates the coefficients as
$\hat{\beta} = (X^T X)^{-1} X^T y$
Ridge regression shrinks these estimates using a ridge parameter k:
$\hat{\beta} = (X^T X + kI)^{-1} X^T y$
load acetylene
subplot(1,3,1)
plot(x1,x2,'.')
xlabel('x1'); ylabel('x2'); grid on; axis square
subplot(1,3,2)
plot(x1,x3,'.')
xlabel('x1'); ylabel('x3'); grid on; axis square
subplot(1,3,3)
plot(x2,x3,'.')
xlabel('x2'); ylabel('x3'); grid on; axis square
Note the correlation between x1 and the other two predictor variables.
Use ridge and x2fx to compute coefficient estimates for a multilinear
model with interaction terms, for a range of ridge parameters:
X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % No constant term
k = 0:1e-5:5e-3;
b = ridge(y,D,k);
figure
plot(k,b,'LineWidth',2)
ylim([-100 100])
grid on
xlabel('Ridge Parameter')
ylabel('Standardized Coefficient')
title('{\bf Ridge Trace}')
legend('x1','x2','x3','x1x2','x1x3','x2x3')
The estimates stabilize to the right of the plot. Note that the coefficient
of the x2x3 interaction term changes sign at a value of the ridge
parameter ≈ 5 × 10^-4.
classregtree.risk
Syntax r = risk(t)
r = risk(t,nodes)
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
e = nodeerr(t);
p = nodeprob(t);
r = risk(t);
r
r =
0.6667
0
0.3333
0.0333
0.0067
0.0067
0.0133
0
0
e.*p
ans =
0.6667
0
0.3333
0.0333
0.0067
0.0067
0.0133
0
0
robustdemo
Syntax robustdemo
robustdemo(x,y)
Description robustdemo shows the difference between ordinary least squares and
robust regression for data with a single predictor. With no input
arguments, robustdemo displays a scatter plot of a sample of roughly
linear data with one outlier. The bottom of the figure displays equations
of lines fitted to the data using ordinary least squares and robust
methods, together with estimates of the root mean squared errors.
Use the right mouse button to click on a point and view its least-squares
leverage and robust weight.
Use the left mouse button to click-and-drag a point. The displays will
update.
robustdemo(x,y) uses x and y data vectors you supply, in place of the
sample data supplied with the function.
1 Start the demo. To begin using robustdemo with the built-in data,
simply type the function name:
robustdemo
The resulting figure shows a scatter plot with two fitted lines. The
red line is the fit using ordinary least-squares regression. The green
line is the fit using robust regression. At the bottom of the figure are
the equations for the fitted lines, together with the estimated root
mean squared errors for each fit.
3 See how changes in the data affect the fits. With the left mouse
button, click and hold on any data point and drag it to a new location.
When you release the mouse button, the displays update:
robustfit
Syntax b = robustfit(X,y)
b = robustfit(X,y,wfun,tune)
b = robustfit(X,y,wfun,tune,const)
[b,stats] = robustfit(...)
Weight Function Equation Default Tuning Constant
'andrews' w = (abs(r)<pi) .* sin(r) ./ r 1.339
'bisquare' (default) w = (abs(r)<1) .* (1 - r.^2).^2 4.685
'cauchy' w = 1 ./ (1 + r.^2) 2.385
'fair' w = 1 ./ (1 + abs(r)) 1.400
'huber' w = 1 ./ max(1, abs(r)) 1.345
'logistic' w = tanh(r) ./ r 1.205
'ols' Ordinary least squares (no None
weighting function)
'talwar' w = 1 * (abs(r)<1) 2.795
'welsch' w = exp(-(r.^2)) 2.985
r = resid/(tune*s*sqrt(1-h))
s = MAD/0.6745
Here MAD is the median absolute deviation of the residuals from their
median. The constant 0.6745 makes the estimate unbiased for the
normal distribution. If there are p columns in X, the smallest p absolute
deviations are excluded when computing the median.
You can write your own weight function. The function must take a
vector of scaled residuals as input and produce a vector of weights as
output.
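A minimal sketch of a custom weight function (the function, data, and tuning constant here are illustrative, not from the original text):

x = (1:10)';
y = 10 - 2*x + randn(10,1);
mywfun = @(r) (abs(r)<2) .* (1 - (r/2).^2);  % a hypothetical bisquare-like weight
b = robustfit(x,y,mywfun,1);                 % pass the handle and a tuning constant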
Examples Generate data with the trend y = 10-2*x, then change one value to
simulate an outlier:
x = (1:10)';
y = 10 - 2*x + randn(10,1);
y(10) = 0;
brob = robustfit(x,y)
brob =
9.1063
-1.8231
A scatter plot of the data together with the fits shows that the robust fit
is less influenced by the outlier than the least-squares fit:
[3] Huber, P. J. Robust Statistics. Hoboken, NJ: John Wiley & Sons,
Inc., 1981.
categorical.rot90
Syntax B = rot90(A)
B = rot90(A,k)
rotatefactors
Syntax B = rotatefactors(A)
B = rotatefactors(A,'Method','orthomax','Coeff',gamma)
B = rotatefactors(A,'Method','procrustes','Target',target)
B = rotatefactors(A,'Method','pattern','Target',target)
B = rotatefactors(A,'Method','promax')
[B,T] = rotatefactors(A,...)
B = rotatefactors(A,'Method','orthomax','Coeff',gamma) rotates A to maximize the orthomax criterion
sum(D*sum(B.^4,1) - GAMMA*sum(B.^2,1).^2)
where D is the number of rows of A and GAMMA is the orthomax coefficient gamma.
B = rotatefactors(A,'Method','procrustes','Target',target)
performs an oblique procrustes rotation of A to the d-by-m target
loadings matrix target.
B = rotatefactors(A,'Method','pattern','Target',target)
performs an oblique rotation of the loadings matrix A to the d-by-m
target pattern matrix target, and returns the result in B. target
defines the "restricted" elements of B, i.e., elements of B corresponding
to zero elements of target are constrained to have small magnitude,
while elements of B corresponding to nonzero elements of target are
allowed to take on any magnitude.
If 'Method' is 'procrustes' or 'pattern', an additional parameter is
'Type', the type of rotation. If 'Type' is 'orthogonal', the rotation
is orthogonal, and the factors remain uncorrelated. If 'Type' is
'oblique' (the default), the rotation is oblique, and the rotated factors
might be correlated.
When 'Method' is 'pattern', there are restrictions on target. If A has
m columns, then for orthogonal rotation, the jth column of target must
contain at least m - j zeros. For oblique rotation, each column of target
must contain at least m - 1 zeros.
B = rotatefactors(A,'Method','promax') rotates A to maximize
the promax criterion, equivalent to an oblique Procrustes rotation
with a target created by an orthomax rotation. Use the four orthomax
parameters to control the orthomax rotation used internally by promax.
An additional parameter for 'promax' is 'Power', the exponent for
creating the promax target matrix. 'Power' must be 1 or greater. The
default is 4.
[B,T] = rotatefactors(A,...) returns the rotation matrix T used to
create B, that is, B = A*T. inv(T'*T) is the correlation matrix of the
rotated factors.
Examples X = randn(100,10);
LPC = princomp(X); % principal component loadings
% Equamax rotation:
% first three principal components.
[L2,T] = rotatefactors(LPC(:,1:3),...
'method','equamax');
% Promax rotation:
% first three factors.
LFA = factoran(X,3,'Rotate','none');
[L3,T] = rotatefactors(LFA(:,1:3),...
'method','promax',...
'power',2);
% Pattern rotation:
% first three factors.
Tgt = [1 1 1 1 1 0 1 0 1 1; ...
0 0 0 1 1 1 0 0 0 0; ...
1 0 0 1 0 1 1 1 1 0]';
[L4,T] = rotatefactors(LFA(:,1:3),...
'method','pattern',...
'target',Tgt);
inv(T'*T) % Correlation matrix of the rotated factors
References [1] Harman, H. H. Modern Factor Analysis. 3rd ed. Chicago: University
of Chicago Press, 1976.
rowexch
The order of the columns of X for a full quadratic model with n terms is:
1 The constant term
2 The linear terms in order 1, 2, ..., n
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
4 The squared terms in order 1, 2, ..., n
Parameter Value
'bounds' Lower and upper bounds for each factor, specified as
a 2-by-nfactors matrix. Alternatively, this value
can be a cell array containing nfactors elements,
each element specifying the vector of allowable
values for the corresponding factor.
'categorical' Indices of categorical predictors.
'display' Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
'excludefun' Handle to a function that excludes undesirable
runs. If the function is f, it must support the syntax
b = f(S), where S is a matrix of treatments with
nfactors columns and b is a vector of Boolean
values with the same number of rows as S. b(i) is
true if the ith row of S should be excluded.
'init' Initial design as an nruns-by-nfactors matrix. The
default is a randomly selected set of points.
'levels' Vector of number of levels for each factor.
'maxiter' Maximum number of iterations. The default is 10.
'tries' Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Algorithm Both cordexch and rowexch use iterative search algorithms. They
operate by incrementally changing an initial design matrix X to increase
D = |XTX| at each step. In both algorithms, there is randomness
built into the selection of the initial design and into the choice of the
incremental changes. As a result, both algorithms may return locally,
but not globally, D-optimal designs. Run each algorithm multiple times
and select the best result for your final design. Both functions have a
'tries' parameter that automates this repetition and comparison.
At each step, the row-exchange algorithm exchanges an entire row of
X with a row from a design matrix C evaluated at a candidate set of
feasible treatments. The rowexch function automatically generates a C
appropriate for a specified model, operating in two steps by calling the
candgen and candexch functions in sequence. Provide your own C by
calling candexch directly. In either case, if C is large, its static presence
in memory can affect computation.
Examples Suppose you want a design to estimate the parameters in the following
three-factor, seven-term interaction model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 + \varepsilon$
nfactors = 3;
nruns = 7;
[dRE,X] = rowexch(nfactors,nruns,'interaction','tries',10)
dRE =
-1 -1 1
1 -1 1
1 -1 -1
1 1 1
-1 -1 -1
-1 1 -1
-1 1 1
X =
1 -1 -1 1 1 -1 -1
1 1 -1 1 -1 1 -1
1 1 -1 -1 -1 -1 1
1 1 1 1 1 1 1
1 -1 -1 -1 1 1 1
1 -1 1 -1 -1 1 -1
1 -1 1 1 -1 -1 1
Columns of the design matrix X are the model terms evaluated at each
row of the design dRE. The terms appear in order from left to right:
constant term, linear terms (1, 2, 3), interaction terms (12, 13, 23). Use
X to fit the model, as described in “Linear Regression” on page 9-3, to
response data measured at the design points in dRE.
rsmdemo
Syntax rsmdemo
$\mathrm{rate} = \frac{\beta_1 x_2 - x_3/\beta_5}{1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3}$
where rate is the reaction rate, x1, x2, and x3 are the concentrations of
hydrogen, n-pentane, and isopentane, respectively, and β1, β2, ... , β5 are
fixed parameters. Random errors are used to perturb the reaction rate
for each combination of reactants.
Collect data using one of two methods:
When you click Run, the concentrations and simulated reaction rate
are recorded on the Trial and Error Data interface.
Fit a response surface model to the data by clicking the Analyze button
below the trial-and-error data or the Response Surface button below
the experimental data. Both buttons load the data into the Response
Surface Tool rstool. By default, trial-and-error data is fit with a linear
additive model and experimental data is fit with a full quadratic model,
but the models can be adjusted in the Response Surface Tool.
For experimental data, you have the additional option of fitting a
Hougen-Watson model. Click the Nonlinear Model button to load the
data and the model in hougen into the Nonlinear Fitting Tool nlintool.
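Outside the demo, a comparable Hougen-Watson fit can be run programmatically; a minimal sketch using the shipped reaction data (which provides reactants, rate, and starting values beta):

load reaction                              % loads reactants, rate, beta
betahat = nlinfit(reactants,rate,@hougen,beta);   % fitted parameters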
rstool
Syntax rstool
rstool(X,Y,model)
rstool(x,y,model,alpha)
rstool(x,y,model,alpha,xname,yname)
By default, the interface opens with the data from hald.mat and a fitted
response surface with constant, linear, and interaction terms.
A sequence of plots is displayed, each showing a contour of the response
surface against a single predictor, with all other predictors held fixed.
The dialog allows you to save information about the fit to MATLAB
workspace variables with valid names.
rstool(X,Y,model) opens the interface with the predictor data
in X, the response data in Y, and the fitted model model. Distinct
predictor variables should appear in different columns of X. Y can be a
vector, corresponding to a single response, or a matrix, with columns corresponding to multiple responses.
load reaction
alpha = 0.01; % Significance level
rstool(reactants,rate,'quadratic',alpha,xn,yn)
runstest
Syntax h = runstest(x)
h = runstest(x,v)
h = runstest(x,'ud')
h = runstest(...,param1,val1,param2,val2,...)
[h,p] = runstest(...)
[h,p,stats] = runstest(...)
• 'method' — Either 'exact' to compute the p value exactly, or 'approximate' to use a normal approximation. The default is 'exact' for runs above/below, and for runs up/down when
the length of x is 50 or less. The 'exact' method is not available for
runs up/down when the length of x is 51 or greater.
• 'tail' — Performs the test against one of the following alternative
hypotheses:
- 'both' — two-tailed test (sequence is not random)
- 'right' — right-tailed test (like values separate for runs
above/below, direction alternates for runs up/down)
- 'left' — left-tailed test (like values cluster for runs above/below,
values trend for runs up/down)
Examples x = randn(40,1);
[h,p] = runstest(x,median(x))
h =
0
p =
0.6286
TreeBagger.SampleWithReplacement property
sampsizepwr
Syntax n = sampsizepwr(testtype,p0,p1)
n = sampsizepwr(testtype,p0,p1,power)
power = sampsizepwr(testtype,p0,p1,[],n)
p1 = sampsizepwr(testtype,p0,[],power,n)
[...] = sampsizepwr(...,n,param1,val1,param2,val2,...)
there may be values smaller than the returned n value that also
produce the desired size and power.
napprox = sampsizepwr('p',0.2,0.26,0.6)
Warning: Values N>200 are approximate. Plotting the power as a function
of N may reveal lower N values that have the required power.
napprox =
244
nn = 1:250;
pwr = sampsizepwr('p',0.2,0.26,[],nn);
nexact = min(nn(pwr>=0.6))
nexact =
213
scatterhist
Syntax scatterhist(x,y)
scatterhist(x,y,nbins)
h = scatterhist(...)
Examples Example 1
Independent normal and lognormal random samples:
x = randn(1000,1);
y = exp(.5*randn(1000,1));
scatterhist(x,y)
Example 2
Marginal uniform samples that are not independent:
u = copularnd('Gaussian',.8,1000);
scatterhist(u(:,1),u(:,2))
Example 3
Mixed discrete and continuous data:
cars = load('carsmall');
scatterhist(cars.Weight,cars.Cylinders,[10 3])
qrandset.scramble
Syntax ps = scramble(p,type)
ps = scramble(p,'clear')
ps = scramble(p)
Examples Use haltonset to generate a 3-D Halton point set, skip the first 1000
values, and then retain every 101st point:
p = haltonset(3,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
p = scramble(p,'RR2')
p =
Halton point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
X0 = net(p,4)
X0 =
0.0928 0.6950 0.0029
0.6958 0.2958 0.8269
0.3013 0.6497 0.4141
0.9087 0.7883 0.2166
X = p(1:3:11,:)
X =
0.0928 0.6950 0.0029
0.9087 0.7883 0.2166
0.3843 0.9840 0.9878
0.6831 0.7357 0.7923
qrandset.ScrambleMethod property
Examples Apply a random linear scramble combined with a random digital shift
to a sobolset point set class:
P = sobolset(5);
P = scramble(P, 'MatousekAffineOwen');
P.ScrambleMethod
piecewisedistribution.segment
Syntax S = segment(obj,X,P)
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
pvals = 0:0.2:1;
s = segment(obj,[],pvals)
s =
1 2 2 2 2 3
sequentialfs
criterion = fun(XTRAIN,ytrain,XTEST,ytest)
XTRAIN and ytrain contain the same subset of rows of X and Y, while
XTEST and ytest contain the complementary subset of rows. XTRAIN
and XTEST contain the data taken from the columns of X that correspond
to the current candidate feature set.
Each time it is called, fun must return a scalar value criterion.
Typically, fun uses XTRAIN and ytrain to train or fit a model, then
predicts values for XTEST using that model, and finally returns some
measure of distance, or loss, of those predicted values from ytest.
In the cross-validation calculation for a given candidate feature set,
sequentialfs sums the values returned by fun and divides that sum
by the total number of test observations. It then uses that mean value
to evaluate each candidate feature subset.
Typical loss measures include sum of squared errors for regression
models (sequentialfs computes the mean-squared error in this case),
and the number of misclassified observations for classification models
(sequentialfs computes the misclassification rate in this case).
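As a minimal sketch of such a regression criterion (assuming a plain linear model fit by least squares; the handle name is illustrative), fun returns the sum of squared prediction errors, which sequentialfs then averages over the test observations:

% SSE of a linear fit: train on (XT,yT), score predictions on (Xt,yt)
fun = @(XT,yT,Xt,yt) sum((yt - Xt*(XT\yT)).^2);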
criterion = fun(XTRAIN,YTRAIN,ZTRAIN,...,
XTEST,YTEST,ZTEST,...)
[] = sequentialfs(...,param1,val1,param2,val2,...) specifies
optional parameter name/value pairs from the following table.
Parameter Value
'cv' The validation method used to compute the
criterion for each candidate feature subset.
'mcreps' A positive integer indicating the number of
Monte-Carlo repetitions for cross-validation. The
default value is 1. The value must be 1 if the
value of 'cv' is 'resubstitution' or 'none'.
'direction' The direction of the sequential search. The
default is 'forward'. A value of 'backward'
specifies an initial candidate set including all
features and an algorithm that removes features
sequentially until the criterion increases.
'keepin' A logical vector or a vector of column numbers
specifying features that must be included. The
default is empty.
'keepout' A logical vector or a vector of column numbers
specifying features that must be excluded. The
default is empty.
'nfeatures' The number of features at which sequentialfs
should stop. inmodel includes exactly this
many features. The default value is empty,
indicating that sequentialfs should stop when
a local minimum of the criterion is found. A
nonempty value overrides values of 'MaxIter'
and 'TolFun' in 'options'.
'nullmodel' A logical value, indicating whether or not the null
model (containing no features from X) should be
included in feature selection and in the history
output. The default is false.
'options' Options structure for the iterative sequential
search algorithm, as created by statset.
sequentialfs uses the following statset
parameters:
load fisheriris;
X = randn(150,10);
X(:,[1 3 5 7 ])= meas;
y = species;
c = cvpartition(y,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
(sum(~strcmp(yt,classify(Xt,XT,yT,'quadratic'))));
[fs,history] = sequentialfs(fun,X,y,'cv',c,'options',opts)
fs =
0 0 0 0 1 0 1 0 0 0
history =
In: [2x10 logical]
Crit: [0.0400 0.0267]
history.In
ans =
0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 1 0 0 0
dataset.set
Syntax set(A)
set(A,PropertyName)
A = set(A,PropertyName,PropertyValue,...)
B = set(A,PropertyName,value)
Description set(A) displays all properties of the dataset array A and their possible
values.
set(A,PropertyName) displays possible values for the property
specified by the string PropertyName.
A = set(A,PropertyName,PropertyValue,...) sets property
name/value pairs.
B = set(A,PropertyName,value) returns a dataset array B that is a
copy of A, but with the property 'PropertyName' set to the value value.
Examples Create a dataset array from Fisher’s iris data and add a description:
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris = set(iris,'Description','Fisher''s Iris Data');
get(iris)
Description: 'Fisher's Iris Data'
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {150x1 cell}
VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
CompactTreeBagger.SetDefaultYfit
Syntax B = SetDefaultYfit(B,Yfit)
categorical.setdiff
Syntax C = setdiff(A,B)
[C,I] = setdiff(A,B)
categorical.setlabels
Syntax A = setlabels(A,labels)
A = setlabels(A,labels,levels)
Examples Example 1
Relabel the species in Fisher’s iris data using new categories:
load fisheriris
species = nominal(species);
species = mergelevels(...
species,{'setosa','virginica'},'parent');
species = setlabels(species,'hybrid','versicolor');
getlabels(species)
ans =
'hybrid' 'parent'
Example 2
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add new levels for various durations of smoking:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the undifferentiated 'Yes' level:
patients.smoke = droplevels(patients.smoke,'Yes');
categorical.setxor
Syntax C = setxor(A,B)
[C,IA,IB] = setxor(A,B)
gmdistribution.SharedCov property
Description Logical true if all the covariance matrices are restricted to be the same
(pooled estimate); logical false otherwise.
categorical.shiftdim
Syntax B = shiftdim(A,n)
[B,nshifts] = shiftdim(A)
gmdistribution.Sigma property
signrank
Syntax p = signrank(x)
p = signrank(x,m)
p = signrank(x,y)
[p,h] = signrank(...)
[p,h] = signrank(...,'alpha',alpha)
[p,h] = signrank(...,'method',method)
[p,h,stats] = signrank(...)
Examples Test the hypothesis of zero median for the difference between two
paired samples.
before = lognrnd(2,.25,10,1);
after = before+trnd(2,10,1);
[p,h] = signrank(before,after)
p =
0.5566
h =
0
signtest
Syntax p = signtest(x)
p = signtest(x,m)
p = signtest(x,y)
[p,h] = signtest(...)
[p,h] = signtest(...,'alpha',alpha)
[p,h] = signtest(...,'method',method)
[p,h,stats] = signtest(...)
unspecified, is the exact method for small samples and the approximate
method for large samples.
[p,h,stats] = signtest(...) returns the structure stats with the
following fields:
Examples Test the hypothesis of zero median for the difference between two
paired samples.
before = lognrnd(2,.25,10,1);
after = before + (lognrnd(0,.5,10,1) - 1);
[p,h] = signtest(before,after)
p =
0.3438
h =
0
silhouette
Syntax silhouette(X,clust)
s = silhouette(X,clust)
[s,h] = silhouette(X,clust)
[...] = silhouette(X,clust,metric)
[...] = silhouette(X,clust,distfun,p1,p2,...)
Metric Description
'Euclidean' Euclidean distance
'sqEuclidean' Squared Euclidean distance (default)
'cityblock' Sum of absolute differences
'cosine' One minus the cosine of the included angle
between points (treated as vectors)
'correlation' One minus the sample correlation between points
(treated as sequences of values)
'Hamming' Percentage of coordinates that differ
'Jaccard' Percentage of nonzero coordinates that differ
Vector A numeric distance matrix in upper triangular
vector form, such as is created by pdist. X is
not used in this case, and can safely be set to [].
For more information on each metric, see “Distance Metrics” on page
12-14.
[...] = silhouette(X,clust,distfun,p1,p2,...) accepts a
function handle distfun to a metric of the form
d = distfun(X0,X,p1,p2,...)
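As a minimal sketch of such a handle (the city-block metric here is an illustrative assumption; X0 is a single point, X an m-by-n matrix, and the handle returns an m-by-1 vector of distances):

% City-block distance from each row of X to the point X0
cityblk = @(X0,X) sum(abs(bsxfun(@minus,X,X0)),2);
s = silhouette(X,clust,cityblk);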
Remarks The silhouette value for each point is a measure of how similar that
point is to points in its own cluster compared to points in other clusters,
and ranges from -1 to +1. It is defined as

$$S(i) = \frac{\min_k b(i,k) - a(i)}{\max\left\{a(i),\ \min_k b(i,k)\right\}}$$
where a(i) is the average distance from the ith point to the other
points in its cluster, and b(i,k) is the average distance from the ith
point to points in another cluster k.
Examples X = [randn(10,2)+ones(10,2);
randn(10,2)-ones(10,2)];
cidx = kmeans(X,2,'distance','sqeuclid');
s = silhouette(X,cidx,'sqeuclid');
categorical.single
Syntax B = single(A)
dataset.single
Syntax B = single(A)
B = single(A,vars)
categorical.size
Syntax d = size(A)
[m,n] = size(A)
[m1,m2,m3,...,mn] = size(A)
m = size(A,dim)
dataset.size
Syntax d = size(A)
[m,n] = size(A)
m = size(A,dim)
qrandset.size
Syntax d = size(p)
[m,n] = size(p)
m = size(p,dim)
P = sobolset(12);
d = size(P)
returns
d = [9.0072e+015 12]
The command
[m,n] = size(P)
returns
m = 9.0072e+015
n = 12
The command
m2 = size(P, 2)
returns
m2 = 12
slicesample
Next, use the slicesample function to generate the random samples for
the function defined above.
x = slicesample(1,2000,'pdf',f,'thin',5,'burnin',1000);
hist(x,50)
set(get(gca,'child'),'facecolor',[0.8 .8 1]);
hold on
xd = get(gca,'XLim'); % Gets the xdata of the bins
binwidth = (xd(2)-xd(1)); % Finds the width of each bin
% Use linspace to normalize the histogram
y = 5.6398*binwidth*f(linspace(xd(1),xd(2),1000));
plot(linspace(xd(1),xd(2),1000),y,'r','LineWidth',2)
skewness
Purpose Skewness
Syntax y = skewness(X)
y = skewness(X,flag)
Algorithm Skewness is a measure of the asymmetry of the data around the sample
mean. If skewness is negative, the data are spread out more to the
left of the mean than to the right. If skewness is positive, the data are
spread out more to the right. The skewness of the normal distribution
(or any perfectly symmetric distribution) is zero.
The skewness of a distribution is defined as
$$s = \frac{E(x-\mu)^3}{\sigma^3}$$
where µ is the mean of x, σ is the standard deviation of x, and E(t)
represents the expected value of the quantity t. skewness computes a
sample version of this population value.
When you set flag to 1, the following equation applies:
$$s_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}\right)^3}$$
When you set flag to 0, the following equation applies:
$$s_0 = \frac{\sqrt{n(n-1)}}{n-2}\, s_1$$
This bias-corrected formula requires that X contain at least three
elements.
y = skewness(X)
y =
-0.2933 0.0482 0.2735 0.4641
qrandset.Skip property
Description The Skip property of a point set contains a nonnegative integer that
specifies the number of initial points in the sequence to omit from the
point set. The default Skip value is 0.
Initial points of a sequence sometimes exhibit undesirable properties,
for example the first point is often (0,0,0,...) and this may
"unbalance" the sequence since its counterpart, (1,1,1,...), never
appears. Another common reason is that initial points often exhibit
correlations among different dimensions which disappear later in the
sequence.
Examples Examine the difference between skipping and not skipping points:
% Assumed setup (not shown in this excerpt): a 5-D Sobol point set.
P = sobolset(5);
% Skip the first point of the sequence. The point set now
% starts at the second point of the basic Sobol sequence.
P.Skip = 1;
P(1:3,:)
sobolset class
Superclasses qrandset
Description sobolset is a quasi-random point set class that produces points from
the Sobol sequence. The Sobol sequence is a base-2 digital sequence
that fills space in a highly uniform manner.
Inherited Properties
Properties in the following table are inherited from qrandset.
Copy Semantics Value. To learn how this affects your use of the class, see Comparing
Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
[4] Matousek, J., "On the L2-discrepancy for anchored boxes," Journal
of Complexity, Vol. 14, pp. 527-556, 1998.
sobolset
Syntax p = sobolset(d)
p = sobolset(d,prop1,val1,prop2,val2,...)
Examples Generate a 3-D Sobol point set, skip the first 1000 values, and then
retain every 101st point:
p = sobolset(3,'Skip',1e3,'Leap',1e2)
p =
Sobol point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
PointOrder : standard
p = scramble(p,'MatousekAffineOwen')
p =
Sobol point set in 3 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : MatousekAffineOwen
PointOrder : standard
X0 = net(p,4)
X0 =
0.7601 0.5919 0.9529
0.1795 0.0856 0.0491
0.5488 0.0785 0.8483
0.3882 0.8771 0.8755
X = p(1:3:11,:)
X =
0.7601 0.5919 0.9529
0.3882 0.8771 0.8755
0.6905 0.4951 0.8464
0.1955 0.5679 0.3192
ordinal.sort
Syntax B = sort(A)
B = sort(A,dim)
B = sort(A,dim,mode)
[B,I] = sort(A,...)
A = ordinal([6 2 5; 2 4 1; 3 2 4],...
{'lo','med','hi'},[],[0 2 4 6])
A =
hi med hi
med hi lo
med med hi
B = sort(A)
B =
med med lo
med med hi
hi hi hi
dataset.sortrows
Syntax B = sortrows(A)
B = sortrows(A,vars)
B = sortrows(A,'obsnames')
B = sortrows(A,vars,mode)
[B,idx] = sortrows(A)
Examples Sort the data in hospital.mat by age and then by last name:
load hospital
hospital(1:5,1:3)
ans =
LastName Sex Age
YPL-320 'SMITH' Male 38
GLI-532 'JOHNSON' Male 43
PNI-258 'WILLIAMS' Female 38
MIJ-579 'JONES' Female 40
hospital = sortrows(hospital,{'Age','LastName'});
hospital(1:5,1:3)
ans =
LastName Sex Age
REV-997 'ALEXANDER' Male 25
FZR-250 'HALL' Male 25
LIM-480 'HILL' Female 25
XUE-826 'JACKSON' Male 25
SCQ-914 'JAMES' Male 25
ordinal.sortrows
Syntax B = sortrows(A)
B = sortrows(A,col)
[B,I] = sortrows(A)
[B,I] = sortrows(A,col)
Examples Sort the rows of an ordinal array in ascending order for the first column,
and then in descending order for the second column:
A = ordinal([6 2 5; 2 4 1; 3 2 4],...
{'lo','med','hi'},[],[0 2 4 6])
A =
hi med hi
med hi lo
med med hi
B = sortrows(A,[1 -2])
B =
med hi lo
med med hi
hi med hi
squareform
Syntax Z = squareform(y)
y = squareform(Z)
Z = squareform(y,'tovector')
Y = squareform(Z,'tomatrix')
Examples y = 1:6
y =
1 2 3 4 5 6
X = [0 1 2 3; 1 0 4 5; 2 4 0 6; 3 5 6 0]
X =
0 1 2 3
1 0 4 5
2 4 0 6
3 5 6 0
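Since the matrix X above is exactly the square form of the vector y, each call converts one form into the other:

Z = squareform(y)   % returns the 4-by-4 symmetric matrix X shown above
v = squareform(X)   % returns the vector y = 1 2 3 4 5 6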
categorical.squeeze
Syntax B = squeeze(A)
dataset.stack
Examples Convert a wide format data set to tall format, and then back to a
different wide format:
load flu
flu2 = stack(flu, 2:11, 'NewDataVarName','FluRate',...
'IndVarName','Region')
dateNames = cellstr(datestr(flu.Date,'mmm_DD_YYYY'));
qrandstream.State property
Description The State property of a quasi-random stream contains the index into
the associated point set of the next point to draw in the stream. Getting
and resetting the State property allows you to return a stream to a
previous state. The initial value of State is 1.
statget
Input Arguments

DerivStep
Relative difference used in finite difference derivative calculations.
A positive scalar, or a vector of positive scalars the same size
as the vector of parameters estimated by the Statistics Toolbox
function using the options structure.
Display
Amount of information displayed by the algorithm.
FunValCheck
Check for invalid values, such as NaN or Inf, from the objective
function.
• 'off'
• 'on'
GradObj
Flags whether the objective function returns a gradient vector
as a second output.
• 'off'
• 'on'
Jacobian
Flags whether the objective function returns a Jacobian as a
second output.
• 'off'
• 'on'
MaxFunEvals
Maximum number of objective function evaluations allowed.
Positive integer.
MaxIter
Maximum number of iterations allowed. Positive integer.
OutputFcn
The solver calls all output functions after each iteration.
Robust
Invoke robust fitting option.
• 'off'
• 'on'
Streams
A single instance of the RandStream class, or a cell array of
RandStream instances. The Streams option is accepted by some
functions to govern what stream(s) to use in generating random
numbers within the function. If 'UseSubstreams' is 'always',
the Streams value must be a scalar, or must be empty. If
'UseParallel' is 'always' and 'UseSubstreams' is 'never',
then the Streams argument must either be empty, or its length
must match the number of processors used in the computation:
equal to the matlabpool size if a matlabpool is open, a scalar
otherwise.
TolBnd
Parameter bound tolerance. Positive scalar.
TolFun
Termination tolerance for the objective function value. Positive
scalar.
TolTypeFun
Use TolFun for absolute or relative objective function tolerances.
• 'abs'
• 'rel'
TolTypeX
Use TolX for absolute or relative parameter tolerances.
• 'abs'
• 'rel'
TolX
Termination tolerance for the parameters. Positive scalar.
Tune
The tuning constant used in robust fitting to normalize the
residuals before applying the weight function. The default value
depends upon the weight function. This parameter is necessary
if you specify the weight function as a function handle. Positive
scalar.
UseParallel
Flag indicating whether eligible functions should use capabilities
of the Parallel Computing Toolbox (PCT), if the capabilities are
available. That is, if the PCT is installed, and a PCT matlabpool
is in effect. Eligible functions are bootci, bootstrp, crossval,
jackknife, and the TreeBagger constructor. Valid values are
'never' (the default), for serial computation, and 'always', for
parallel computation.
UseSubstreams
Flag indicating whether the random number generator in eligible
functions should use the Substream property of the RandStream
class. 'never' (default) or 'always'. If 'always', high-level
iterations within the function set the Substream property
to the value of the iteration. This behavior helps to generate
reproducible random number streams in parallel and/or serial
mode computation. Eligible functions are bootci, bootstrp,
crossval, and the TreeBagger constructor.
WgtFun
A weight function for robust fitting. Valid only when Robust is
'on'. Can also be a function handle that accepts a normalized
residual as input and returns the robust weights as output, as
sketched after this list.
• 'bisquare'
• 'andrews'
• 'cauchy'
• 'fair'
• 'huber'
• 'logistic'
• 'talwar'
• 'welsch'
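As a minimal sketch of setting these robust-fitting options (the particular weight function chosen here is an illustrative assumption), they are typically created with statset and passed to a fitting function such as nlinfit:

% Options structure enabling robust fitting with the Cauchy weight function
opts = statset('Robust','on','WgtFun','cauchy');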
Examples This statement returns the value of the Display statistics options
parameter from the structure called my_options.
val = statget(my_options,'Display')
optnew = statget(my_options,'Display','final');
statset
Syntax statset
statset(statfun)
options = statset(...)
options = statset(fieldname1,val1,fieldname2,val2,...)
options = statset(oldopts,fieldname1,val1,fieldname2,val2,
...)
options = statset(oldopts,newopts)
Description statset with no input arguments and no output arguments displays all
fields of a statistics options structure and their possible values.
statset(statfun) displays fields and default values used by the
Statistics Toolbox function statfun. Specify statfun using a string
name or a function handle.
options = statset(...) creates a statistics options structure options.
With no input arguments, all fields of the options structure are an
empty array ([]). With a specified statfun, function-specific fields are
default values and the remaining fields are []. Function-specific fields
set to [] indicate that the function is to use its default value for that
parameter. For available options, see Inputs.
options = statset(fieldname1,val1,fieldname2,val2,...)
creates an options structure in which the named fields have the
specified values. Any unspecified values are []. Use strings for field
names. For fields that are string-valued, you must input the complete
string for the value. If you provide an invalid string for a value, statset
uses the default.
options =
statset(oldopts,fieldname1,val1,fieldname2,val2,...)
creates a copy of oldopts with the named parameters changed to
the specified values.
options = statset(oldopts,newopts) combines an existing options
structure, oldopts, with a new options structure, newopts. Any fields
in newopts with nonempty values overwrite the corresponding fields
in oldopts.
Input Arguments

DerivStep
Relative difference used in finite difference derivative calculations.
A positive scalar, or a vector of positive scalars the same size
as the vector of parameters estimated by the Statistics Toolbox
function using the options structure.
Display
Amount of information displayed by the algorithm.
FunValCheck
Check for invalid values, such as NaN or Inf, from the objective
function.
• 'off'
• 'on'
GradObj
Flags whether the objective function returns a gradient vector
as a second output.
• 'off'
• 'on'
Jacobian
Flags whether the objective function returns a Jacobian as a
second output.
• 'off'
• 'on'
MaxFunEvals
Maximum number of objective function evaluations allowed.
Positive integer.
MaxIter
Maximum number of iterations allowed. Positive integer.
OutputFcn
The solver calls all output functions after each iteration.
Robust
Invoke robust fitting option.
• 'off'
• 'on'
Streams
A single instance of the RandStream class, or a cell array of
RandStream instances. The Streams option is accepted by some
functions to govern what stream(s) to use in generating random
numbers within the function. If 'UseSubstreams' is 'always',
the Streams value must be a scalar, or must be empty. If
'UseParallel' is 'always' and 'UseSubstreams' is 'never',
• 'abs'
• 'rel'
TolTypeX
Use TolX for absolute or relative parameter tolerances.
• 'abs'
• 'rel'
TolX
Termination tolerance for the parameters. Positive scalar.
Tune
The tuning constant used in robust fitting to normalize the
residuals before applying the weight function. The default value
depends upon the weight function. This parameter is necessary
if you specify the weight function as a function handle. Positive
scalar.
UseParallel
• 'bisquare'
• 'andrews'
• 'cauchy'
• 'fair'
• 'huber'
• 'logistic'
• 'talwar'
• 'welsch'
Examples Suppose you want to change the default parameter values for the
function evfit, which fits an extreme value distribution to data. The
default parameter values are:
statset('evfit')
ans =
Display: 'off'
MaxFunEvals: []
MaxIter: []
TolBnd: []
TolFun: []
TolX: 1.0000e-006
GradObj: []
DerivStep: []
FunValCheck: []
Robust: []
WgtFun: []
Tune: []
The only parameters that evfit uses are Display and TolX. To create
an options structure with the value of TolX set to 1e-8, enter:
options = statset('TolX',1e-8)
% Pass options to evfit:
mu = 1;
sigma = 1;
data = evrnd(mu,sigma,1,100);
paramhat = evfit(data,[],[],[],options)
ProbDistUnivParam.std
Syntax S = std(PD)
stepwise
Syntax stepwise
stepwise(X,y)
stepwise(X,y,inmodel,penter,premove)
Description stepwise uses the sample data in hald.mat to display a graphical user
interface for performing stepwise regression of the response values in
heat on the predictive terms in ingredients.
The upper left of the interface displays estimates of the coefficients for
all potential terms, with horizontal bars indicating 90% (colored) and
95% (grey) confidence intervals. The red color indicates that, initially,
the terms are not in the model. Values displayed in the table are those
that would result if the terms were added to the model.
The middle portion of the interface displays summary statistics for the
entire model. These statistics are updated with each step.
The lower portion of the interface, Model History, displays the RMSE
for the model. The plot tracks the RMSE from step to step, so you can
compare the optimality of different models. Hover over the blue dots
in the history to see which terms were in the model at a particular
step. Click on a blue dot in the history to open a copy of the interface
initialized with the terms in the model at that step.
Initial models, as well as entrance/exit tolerances for the p-values
of F-statistics, are specified using additional input arguments to
stepwise. Defaults are an initial model with no terms, an entrance
tolerance of 0.05, and an exit tolerance of 0.10.
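As a minimal sketch of supplying these arguments explicitly (assuming that an empty initial model selects the default of no terms):

load hald
initial = [];                                % no terms in the initial model
stepwise(ingredients,heat,initial,0.05,0.10) % entrance 0.05, exit 0.10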
To center and scale the input data (compute z-scores) to improve
conditioning of the underlying least-squares problem, select Scale
Inputs from the Stepwise menu.
You proceed through a stepwise regression in one of two ways:
1 Click Next Step to carry out the step that the interface recommends,
adding or removing a single term from the model.
2 Click a line in the plot or in the table to toggle the state of the
corresponding term. Clicking a red line, corresponding to a term not
currently in the model, adds the term to the model and changes the
line to blue. Clicking a blue line, corresponding to a term currently
in the model, removes the term from the model and changes the line
to red.
Check the information you want to export and, optionally, change the
names of the workspace variables to be created. Click OK to export
the information.
stepwise(X,y) displays the interface using the p predictive terms in
the n-by-p matrix X and the response values in the n-by-1 vector y.
Distinct predictive terms should appear in different columns of X.
1 Fit the initial model.
2 If any terms not in the model have p-values less than an entrance
tolerance (that is, if it is unlikely that they would have zero coefficient
if added to the model), add the one with the smallest p value and
repeat this step; otherwise, go to step 3.
3 If any terms in the model have p-values greater than an exit tolerance
(that is, if it is unlikely that the hypothesis of a zero coefficient can
be rejected), remove the one with the largest p value and go to step
2; otherwise, end.
Depending on the terms included in the initial model and the order in
which terms are moved in and out, the method may build different
models from the same set of potential terms. The method terminates
when no single step improves the model. There is no guarantee,
however, that a different initial model or a different sequence of steps
will not lead to a better fit. In this sense, stepwise models are locally
optimal, but may not be globally optimal.
stepwisefit
Syntax b = stepwisefit(X,y)
[b,se,pval,inmodel,stats,nextstep,history] = stepwisefit(...)
[...] = stepwisefit(X,y,param1,val1,param2,val2,...)
[...] = stepwisefit(X,y,param1,val1,param2,val2,...)
specifies one or more of the name/value pairs described in the following
table.
Parameter Value
'inmodel' A logical vector specifying terms to include in the
initial fit. The default is to specify no terms.
'penter' The maximum p value for a term to be added. The
default is 0.05.
1 Fit the initial model.
2 If any terms not in the model have p-values less than an entrance
tolerance (that is, if it is unlikely that they would have zero coefficient
if added to the model), add the one with the smallest p value and
repeat this step; otherwise, go to step 3.
3 If any terms in the model have p-values greater than an exit tolerance
(that is, if it is unlikely that the hypothesis of a zero coefficient can
be rejected), remove the one with the largest p value and go to step
2; otherwise, end.
Depending on the terms included in the initial model and the order in
which terms are moved in and out, the method may build different
models from the same set of potential terms. The method terminates
when no single step improves the model. There is no guarantee,
however, that a different initial model or a different sequence of steps
will not lead to a better fit. In this sense, stepwise models are locally
optimal, but may not be globally optimal.
Examples Load the data in hald.mat, which contains observations of the heat of
reaction of various cement mixtures:
load hald
whos
Name Size Bytes Class Attributes
stepwisefit(ingredients,heat,...
'penter',0.05,'premove',0.10);
Initial columns included: none
Step 1, added column 4, p=0.000576232
Step 2, added column 1, p=1.10528e-006
Final columns included: 1 4
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4400] [ 0.1384] 'In' [1.1053e-006]
[ 0.4161] [ 0.1856] 'Out' [ 0.0517]
[-0.4100] [ 0.1992] 'Out' [ 0.0697]
[-0.6140] [ 0.0486] 'In' [1.8149e-007]
initialModel = ...
[false true false false]; % Force in 2nd term
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10);
Initial columns included: 2
Step 1, added column 1, p=2.69221e-007
Final columns included: 1 2
'Coeff' 'Std.Err.' 'Status' 'P'
[ 1.4683] [ 0.1213] 'In' [2.6922e-007]
[ 0.6623] [ 0.0459] 'In' [5.0290e-008]
[ 0.2500] [ 0.1847] 'Out' [ 0.2089]
The preceding two models, built from different initial models, use
different subsets of the predictive terms. Terms 2 and 4, swapped in the
two models, are highly correlated:
term2 = ingredients(:,2);
term4 = ingredients(:,4);
R = corrcoef(term2,term4)
R =
1.0000 -0.9730
-0.9730 1.0000
[betahat1,se1,pval1,inmodel1,stats1] = ...
stepwisefit(ingredients,heat,...
'penter',.05,'premove',0.10,...
'display','off');
[betahat2,se2,pval2,inmodel2,stats2] = ...
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10,...
'display','off');
RMSE1 = stats1.rmse
RMSE1 =
2.7343
RMSE2 = stats2.rmse
RMSE2 =
2.4063
The second model has a lower Root Mean Square Error (RMSE).
References [1] Draper, N. R., and H. Smith. Applied Regression Analysis. Hoboken,
NJ: Wiley-Interscience, 1998. pp. 307–312.
categorical.subsasgn
Syntax A = subsasgn(A,S,B)
classregtree.subsasgn
Syntax
dataset.subsasgn
• positive integers
• vectors of positive integers
• observation/variable names
• cell arrays containing one or more observation/variable names
• logical vectors
The assignment does not use observation names, variable names, or any
other properties of B to modify properties of A; however properties of A
are extended with default values if the assignment expands the number
of observations or variables in A. Elements of B are assigned into A by
position, not by matching names.
A{i,j} = B assigns the value B into an element of the dataset array A.
i and j are positive integers, or logical vectors. Cell indexing cannot
assign into multiple dataset elements, that is, the subscripts i and
j must each refer to only a single observation or variable. B is cast
to the type of the target variable if necessary. If the dataset element
already exists, A{i,j} may also be followed by further subscripting as
supported by the variable.
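As a minimal sketch of such an element assignment (the dataset and values here are hypothetical):

ds = dataset({[1;2],'x'},{[10;20],'y'});  % small two-variable dataset array
ds{2,2} = 25;                             % assign into the single element ds.y(2)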
• 'ObsNames'
• 'VarNames'
• 'Description'
• 'Units'
• 'DimNames'
• 'UserData'
• 'VarDescription'
[A.StructVar(1:2).field] = B, or [A.Properties.ObsNames{1:2}]
= B. Use multiple assignments of the form A.CellVar{1} = B instead.
Similarly, if a dataset variable is a cell array with multiple columns
or is an n-D cell array, then the contents of that variable for a single
observation consists of multiple cells, and you cannot assign to all of
them using the syntax A{1,'CellVar'} = B. Use multiple assignments
of the form [A.CellVar{1,1}] = B instead.
gmdistribution.subsasgn
NaiveBayes.subsasgn
categorical.subsindex
Syntax I = subsindex(A)
classregtree.subsref
Syntax B = subsref(T,S)
categorical.subsref
Syntax B = subsref(A,S)
dataset.subsref
Syntax B = subsref(A,S)
• positive integers
• vectors of positive integers
• observation/variable names
• cell arrays containing one or more observation/variable names
• logical vectors
• 'ObsNames'
• 'VarNames'
• 'Description'
• 'Units'
• 'DimNames'
• 'UserData'
• 'VarDescription'
Limitations
Subscripting expressions such as A.CellVar{1:2},
A.StructVar(1:2).field, or A.Properties.ObsNames{1:2}
are valid, but result in subsref returning multiple outputs in the
form of a comma-separated list. If you explicitly assign to output
gmdistribution.subsref
Syntax B = subsref(T,S)
NaiveBayes.subsref
Syntax b = subsref(nb,s)
qrandset.subsref
Syntax x = p(i,j)
x = subsref(p,s)
Description x = p(i,j) returns a matrix that contains a subset of the points from
the point set p. The indices in i select points from the set and the
indices in j select columns from those points. i and j are vectors of
positive integers or logical vectors. A colon used as a subscript, as in
p(i,:), indicates the entire row (or column).
x = subsref(p,s) is called for the syntax p(i), p{i}, or p.i. s is a
structure array with the fields:
categorical.summary
Syntax summary(A)
C = summary(A)
[C,labels] = summary(A)
Examples Count the number of patients in each age group in the data in
hospital.mat:
load hospital
edges = 0:10:100;
labels = strcat(num2str((0:10:90)','%d'),{'s'});
AgeGroup = ordinal(hospital.Age,labels,[],edges);
[c,labels] = summary(AgeGroup);
Table = dataset({labels,'AgeGroup'},{c,'Count'});
Table(3:6,:)
ans =
AgeGroup Count
'20s' 15
'30s' 41
'40s' 42
'50s' 2
dataset.summary
Syntax summary(A)
s = summary(A)
load fisheriris
species = nominal(species);
data = dataset(species,meas);
summary(data)
species: [150x1 nominal]
setosa versicolor virginica
50 50 50
load hospital
summary(hospital)
ProbDist.Support property
Description Structure describing the support of the distribution, with the fields:
• range
• closedbound
• iscontinuous
Values The values for the three fields in the structure are as follows; see the
sketch after this list:
• range — A two-element vector [L, U], such that all of the probability
is contained from L to U.
• closedbound — A two-element logical vector indicating whether the
corresponding range endpoint is included. Possible values for each
endpoint are 1 (true) or 0 (false).
• iscontinuous — A logical value indicating whether the distribution takes
values on the entire interval from L to U (true), or if it takes only
integer values within this range (false). Possible values are 1 (true)
or 0 (false).
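As a minimal sketch of reading the property (the fitted normal distribution here is an illustrative assumption):

pd = fitdist(randn(100,1),'normal');  % returns a ProbDistUnivParam object
s = pd.Support;        % structure with range, closedbound, iscontinuous
s.range                % [-Inf Inf] for the normal distribution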
surfht
Syntax surfht(Z)
surfht(x,y,Z)
tabulate
tblread
varnames =
Male
Female
casenames =
Verbal
Quantitative
tblwrite
Syntax tblwrite(data,varnames,casenames)
tblwrite(data,varnames,casenames,filename)
tblwrite(data,varnames,casenames,filename,delimiter)
Character String
' ' 'space'
'\t' 'tab'
',' 'comma'
';' 'semi'
'|' 'bar'
tblwrite(data,varnames,casenames,'sattest.dat')
type sattest.dat
Male Female
Verbal 470 530
Quantitative 520 480
tcdf
Syntax P = tcdf(X,V)
$$p = F(x \mid \nu) = \int_{-\infty}^{x} \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \frac{1}{\left(1 + \frac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}}\, dt$$
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020
tdfread
Syntax tdfread
tdfread(filename)
tdfread(filename,delimiter)
s = tdfread(filename,...)
Description tdfread displays the File Open dialog box for interactive selection
of a data file, then reads data from the file. The file should have
variable names separated by tabs in the first row, and data values
separated by tabs in the remaining rows. tdfread creates variables in
the workspace, one for each column of the file. The variable names
are taken from the first row of the file. If a column of the file contains
only numeric data in the second and following rows, tdfread creates a
double variable. Otherwise, tdfread creates a char variable. After all
values are imported, tdfread displays information about the imported
values using the format of the whos command.
tdfread(filename) allows command line specification of the name of a
file in the current folder, or the complete path name of any file, using
the string filename.
tdfread(filename,delimiter) indicates that the character specified
by delimiter separates columns in the file. Accepted values for
delimiter are:
type sat2.dat
Test,Gender,Score
Verbal,Male,470
Verbal,Female,530
Quantitative,Male,520
Quantitative,Female,480
The following creates the variables Gender, Score, and Test from the
file sat2.dat and displays the contents of the MATLAB workspace:
tdfread('sat2.dat',',')
classregtree.test
subtree, and the scalar bestlevel containing the estimated best level
of pruning. A bestlevel of 0 means no pruning. The best level is the
one that produces the smallest tree that is within one standard error of
the minimum-cost subtree.
[...] = test(...,param1,val1,param2,val2,...) specifies
optional parameter name/value pairs for methods other than
'resubstitution', chosen from the following:
Examples Find the best tree for Fisher’s iris data using cross-validation. Start
with a large tree:
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'},...
'splitmin',5)
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 if PW<1.55 then node 10 elseif PW>=1.55 then node 11 else virginica
8 class = versicolor
9 class = virginica
10 class = virginica
11 class = versicolor
view(t)
[c,s,n,best] = test(t,'cross',meas,species);
tmin = prune(t,'level',best)
tmin =
Decision tree for classification
view(tmin)
Plot the smallest tree within one standard error of the minimum cost
tree:
[mincost,minloc] = min(c);
plot(n,c,'b-o',...
n(best+1),c(best+1),'bs',...
n,(mincost+s(minloc))*ones(size(n)),'k--')
xlabel('Tree size (number of terminal nodes)')
ylabel('Cost')
The solid line shows the estimated cost for each tree size, the dashed
line marks one standard error above the minimum, and the square
marks the smallest tree under the dashed line.
cvpartition.test
Description idx = test(c) returns the logical vector idx of test indices for an object
c of the cvpartition class of type 'holdout' or 'resubstitution'.
If c.Type is 'holdout', idx specifies the observations in the test set.
If c.Type is 'resubstitution', idx specifies all observations.
idx = test(c,i) returns the logical vector idx of test indices for
repetition i of an object c of the cvpartition class of type 'kfold'
or 'leaveout'.
If c.Type is 'kfold', idx specifies the observations in the test set in
fold i.
If c.Type is 'leaveout', idx specifies the observation left out at
repetition i.
Examples Identify the test indices in the first fold of a partition of 10 observations
for 3-fold cross-validation:
c = cvpartition(10,'kfold',3)
c =
K-fold cross validation partition
N: 10
NumTestSets: 3
TrainSize: 7 6 7
TestSize: 3 4 3
test(c,1)
ans =
1
1
0
0
0
0
0
0
1
0
cvpartition.TestSize property
tiedrank
Examples Counting from smallest to largest, the two 20 values are 2nd and 3rd,
so they both get rank 2.5 (average of 2 and 3):
tiedrank([10 20 30 40 20])
ans =
1.0000 2.5000 4.0000 5.0000 2.5000
categorical.times
Syntax C = times(A,B)
tinv
Syntax X = tinv(P,V)
Description X = tinv(P,V) computes the inverse of Student’s t cdf using the degrees
of freedom in V for the corresponding probabilities in P. P and V can be
vectors, matrices, or multidimensional arrays that are the same size. A
scalar input is expanded to a constant array with the same dimensions
as the other inputs. The values in P must lie on the interval [0 1].
The t inverse function in terms of the t cdf is
$$x = F^{-1}(p \mid \nu) = \{\, x : F(x \mid \nu) = p \,\}$$

where

$$p = F(x \mid \nu) = \int_{-\infty}^{x} \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \frac{1}{\left(1 + \frac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}}\, dt$$
The result, x, is the solution of the cdf integral with parameter ν, where
you supply the desired probability p.
Examples What is the 99th percentile of the t distribution for one to six degrees
of freedom?
percentile = tinv(0.99,1:6)
percentile =
31.8205 6.9646 4.5407 3.7469 3.3649 3.1427
tpdf
Syntax Y = tpdf(X,V)
$$y = f(x \mid \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{\frac{\nu+1}{2}}}$$
tpdf(0,1:6)
ans =
0.3183 0.3536 0.3676 0.3750 0.3796 0.3827
difference = tpdf(-2.5:2.5,30)-normpdf(-2.5:2.5)
difference =
0.0035 -0.0006 -0.0042 -0.0042 -0.0006 0.0035
cvpartition.training
Description idx = training(c) returns the logical vector idx of training indices
for an object c of the cvpartition class of type 'holdout' or
'resubstitution'.
If c.Type is 'holdout', idx specifies the observations in the training
set.
If c.Type is 'resubstitution', idx specifies all observations.
idx = training(c,i) returns the logical vector idx of training indices
for repetition i of an object c of the cvpartition class of type 'kfold'
or 'leaveout'.
If c.Type is 'kfold', idx specifies the observations in the training
set in fold i.
If c.Type is 'leaveout', idx specifies the observations left in at
repetition i.
Examples Identify the training indices in the first fold of a partition of 10
observations for 3-fold cross-validation:
c = cvpartition(10,'kfold',3)
c =
K-fold cross validation partition
N: 10
NumTestSets: 3
TrainSize: 7 6 7
TestSize: 3 4 3
training(c,1)
ans =
0
0
1
1
1
1
1
1
0
1
cvpartition.TrainSize property
categorical.transpose
Syntax B = transpose(A)
TreeBagger.TreeArgs property
TreeBagger class
Copy Semantics Value. To learn how this affects your use of the class, see Comparing
Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
See Also “Regression and Classification by Bagging Decision Trees” on page 12-30
Classification Trees
Regression Trees
Grouped Data
TreeBagger
Syntax B = TreeBagger(ntrees,X,Y)
B = TreeBagger(ntrees,X,Y,'param1',val1,'param2',val2,...)
treedisp
Syntax treedisp(t)
treedisp(t,param1,val1,param2,val2,...)
Description
Note This function is superseded by the view method of the
classregtree class and is maintained only for backwards compatibility.
It accepts objects t created with the classregtree constructor.
After you select the type of information you want, click any node to
display the information for that node.
The Pruning level button displays the number of levels that have
been cut from the tree and the number of levels in the unpruned tree.
For example, 1 of 6 indicates that the unpruned tree has six levels,
and that one level has been cut from the tree. Use the spin button to
change the pruning level.
treedisp(t,param1,val1,param2,val2,...) specifies optional
parameter name-value pairs, listed in the following table.
Parameter Value
'names' A cell array of names for the predictor variables,
in the order in which they appear in the X matrix
from which the tree was created (see treefit)
'prunelevel' Initial pruning level to display
Examples Create and graph classification tree for Fisher’s iris data. The names in
this example are abbreviations for the column contents (sepal length,
sepal width, petal length, and petal width).
load fisheriris;
t = treefit(meas,species);
treedisp(t,'names',{'SL' 'SW' 'PL' 'PW'});
treefit
Syntax t = treefit(X,y)
t = treefit(X,y,param1,val1,param2,val2,...)
Description
Note This function is superseded by the classregtree constructor
of the classregtree class and is maintained only for backwards
compatibility. It returns objects t in the classregtree class.
Parameter Value
'catidx' Vector of indices of the columns of X. treefit
treats these columns as unordered categorical
values.
'method' Either 'classification' (default if y is text) or
'regression' (default if y is numeric).
'splitmin' A number n such that impure nodes must have n
or more observations to be split (default 10).
'prune' 'on' (default) to compute the full tree and a
sequence of pruned subtrees, or 'off' for the full
tree without pruning.
Parameter Value
'cost' p-by-p matrix C, where p is the number of distinct
response values or class names in the input y.
C(i,j) is the cost of classifying a point into class
j if its true class is i. (The default has C(i,j)=1
if i~=j, and C(i,j)=0 if i=j.) C can also be a
structure S with two fields: S.group containing
the group names (see “Grouped Data” on page
2-34), and S.cost containing a matrix of cost
values.
'splitcriterion' Criterion for choosing a split: either 'gdi'
(default) for Gini’s diversity index, 'twoing' for
the twoing rule, or 'deviance' for maximum
deviance reduction.
'priorprob' Prior probabilities for each class, specified as a
vector (one value for each distinct group name)
or as a structure S with two fields: S.group
containing the group names, and S.prob
containing a vector of corresponding probabilities.
Examples load fisheriris;
t = treefit(meas,species);
treeprune
Syntax t2 = treeprune(t1,'level',level)
t2 = treeprune(t1,'nodes',nodes)
t2 = treeprune(t1)
Description
Note This function is superseded by the prune method of the
classregtree class and is maintained only for backwards compatibility.
It accepts objects t1 created with the classregtree constructor and
returns objects t2 in the classregtree class.
Examples Display the full tree for Fisher’s iris data, as well as the next largest
tree from the optimal pruning sequence:
load fisheriris;
t1 = treefit(meas,species,'splitmin',5);
treedisp(t1,'names',{'SL' 'SW' 'PL' 'PW'});
t2 = treeprune(t1,'level',1);
treedisp(t2,'names',{'SL' 'SW' 'PL' 'PW'});
CompactTreeBagger.Trees property
Description The Trees property is a cell array of size NTrees-by-1 containing the
trees in the ensemble.
TreeBagger.Trees property
Description The Trees property is a cell array of size NTrees-by-1 containing the
trees in the ensemble.
treetest
Description
Note This function is superseded by the test method of the
classregtree class and is maintained only for backwards
compatibility. It accepts objects t created with the classregtree
constructor.
Examples Find the best tree for Fisher’s iris data using cross-validation. The
solid line shows the estimated cost for each tree size, the dashed line
marks one standard error above the minimum, and the square marks
the smallest tree under the dashed line.
treeval
Description
Note This function is superseded by the eval method of the
classregtree class and is maintained only for backwards compatibility.
It accepts objects t created with the classregtree constructor.
load fisheriris;
t = treefit(meas,species); % Create decision tree
sfit = treeval(t,meas); % Find assigned class numbers
sfit = t.classname(sfit); % Get class names
mean(strcmp(sfit,species)) % Proportion in correct class
ans =
0.9800
trimmean
Syntax m = trimmean(X,percent)
m = trimmean(X,percent,dim)
m = trimmean(X,percent,flag)
m = trimmean(X,percent,flag,dim)
then the trimmed mean is less efficient than the sample mean as an
estimator of the location of the data.
Examples Example 1
This example shows a Monte Carlo simulation of the efficiency of the
10% trimmed mean relative to the sample mean for normal data.
x = normrnd(0,1,100,100);
m = mean(x);
trim = trimmean(x,10);
sm = std(m);
strim = std(trim);
efficiency = (sm/strim).^2
efficiency =
0.9702
Example 2
Generate random data from the t distribution, which tends to have
outliers:
reset(RandStream.getDefaultStream)
x = trnd(1,40,1);
probplot(x)
mean(x)
ans =
2.7991
trimmean(x,25)
ans =
0.8797
trnd
Syntax R = trnd(V)
R = trnd(V,m)
R = trnd(V,m,n)
numbers = trnd(3,2,6)
numbers =
-0.3177 -0.0812 -0.6627 0.1905 -1.5585 -0.0433
0.2536 0.5502 0.8646 0.8060 -0.5216 0.0891
tstat
Description [M,V] = tstat(NU) returns the mean and variance of Student's t
distribution using the degrees of freedom in NU. M and V are the same
size as NU.
The mean of the Student’s t distribution with parameter ν is zero for
values of ν greater than 1. If ν is one, the mean does not exist. The
variance for values of ν greater than 2 is ν/(ν-2).
[m,v] = tstat(reshape(1:30,6,5))
m =
NaN 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
v =
NaN 1.4000 1.1818 1.1176 1.0870
NaN 1.3333 1.1667 1.1111 1.0833
3.0000 1.2857 1.1538 1.1053 1.0800
2.0000 1.2500 1.1429 1.1000 1.0769
1.6667 1.2222 1.1333 1.0952 1.0741
1.5000 1.2000 1.1250 1.0909 1.0714
Note that the variance does not exist for one and two degrees of freedom.
ttest
Syntax h = ttest(x)
h = ttest(x,m)
h = ttest(x,y)
h = ttest(...,alpha)
h = ttest(...,alpha,tail)
h = ttest(...,alpha,tail,dim)
[h,p] = ttest(...)
[h,p,ci] = ttest(...)
[h,p,ci,stats] = ttest(...)
Description h = ttest(x) performs a t-test of the null hypothesis that data in the
vector x are a random sample from a normal distribution with mean 0
and unknown variance, against the alternative that the mean is not 0.
The result of the test is returned in h. h = 1 indicates a rejection of the
null hypothesis at the 5% significance level. h = 0 indicates a failure to
reject the null hypothesis at the 5% significance level.
x can also be a matrix or an N-dimensional array. For matrices, ttest
performs separate t-tests along each column of x and returns a vector
of results. For N-dimensional arrays, ttest works along the first
non-singleton dimension of x.
The test treats NaN values as missing data, and ignores them.
h = ttest(x,m) performs a t-test of the null hypothesis that data in
the vector x are a random sample from a normal distribution with mean
m and unknown variance, against the alternative that the mean is not m.
h = ttest(x,y) performs a paired t-test of the null hypothesis
that data in the difference x-y are a random sample from a normal
distribution with mean 0 and unknown variance, against the alternative
that the mean is not 0. x and y must be vectors of the same length,
or arrays of the same size.
h = ttest(...,alpha) performs the test at the (100*alpha)%
significance level. The default, when unspecified, is alpha = 0.05.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where x is the sample mean, μ = 0 (or m) is the hypothesized population
mean, s is the sample standard deviation, and n is the sample size.
Under the null hypothesis, the test statistic will have Student’s t
distribution with n – 1 degrees of freedom.
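As a cross-check, the statistic and its two-sided p value can be computed by hand; a minimal sketch (the sample x and hypothesized mean m are hypothetical):

x = randn(20,1); m = 0;
t = (mean(x) - m)/(std(x)/sqrt(numel(x)));   % the t statistic above
p = 2*tcdf(-abs(t),numel(x)-1);              % two-sided p value, as in ttest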
[h,p,ci] = ttest(...) returns a 100*(1 – alpha)% confidence
interval on the population mean, or on the difference of population
means for a paired test.
[h,p,ci,stats] = ttest(...) returns the structure stats with the
following fields:
Examples Simulate a random sample of size 100 from a normal distribution with
mean 0.1:
x = normrnd(0.1,1,1,100);
Test the null hypothesis that the sample comes from a normal
distribution with mean 0:
[h,p,ci] = ttest(x,0)
h =
0
p =
0.8323
ci =
-0.1650 0.2045
The test fails to reject the null hypothesis at the default α = 0.05
significance level. Under the null hypothesis, the probability of
observing a value as extreme or more extreme of the test statistic, as
indicated by the p value, is much greater than α. The 95% confidence
interval on the mean contains 0.
Simulate a larger random sample of size 1000 from the same
distribution:
y = normrnd(0.1,1,1,1000);
Test again if the sample comes from a normal distribution with mean 0:
[h,p,ci] = ttest(y,0)
h =
1
p =
0.0160
ci =
0.0142 0.1379
This time the test rejects the null hypothesis at the default α = 0.05
significance level. The p value has fallen below α = 0.05 and the 95%
confidence interval on the mean does not contain 0.
Because the p value of the sample y is greater than 0.01, the test will
fail to reject the null hypothesis when the significance level is lowered
to α = 0.01:
[h,p,ci] = ttest(y,0,0.01)
h =
0
p =
0.0160
ci =
-0.0053 0.1574
Notice that at the lowered significance level the 99% confidence interval
on the mean widens to contain 0.
This example will produce slightly different results each time it is run,
because of the random sampling.
ttest2
Syntax h = ttest2(x,y)
h = ttest2(x,y,alpha)
h = ttest2(x,y,alpha,tail)
h = ttest2(x,y,alpha,tail,vartype)
h = ttest2(x,y,alpha,tail,vartype,dim)
[h,p] = ttest2(...)
[h,p,ci] = ttest2(...)
[h,p,ci,stats] = ttest2(...)
• 'both' — Means are not equal (two-tailed test). This is the default,
when tail is unspecified.
• 'right' — Mean of x is greater than mean of y (right-tail test)
• 'left' — Mean of x is less than mean of y (left-tail test)
$$s = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}}$$
where sx and sy are the sample standard deviations of x and y,
respectively, and n and m are the sample sizes of x and y, respectively.
h = ttest2(x,y,alpha,tail,vartype,dim) works along dimension
dim of x and y. Use [] to pass in default values for alpha, tail, or
vartype.
[h,p] = ttest2(...) returns the p value of the test. The p value is the probability, under the null hypothesis, of observing a value as extreme as, or more extreme than, the observed test statistic

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}}$$
where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations (replaced by the pooled standard deviation s in the default case where vartype is 'equal'), and n and m are the sample sizes.
In the default case where vartype is 'equal', the test statistic, under
the null hypothesis, has Student’s t distribution with n + m – 2 degrees
of freedom.
In the case where vartype is 'unequal', the test statistic, under the
null hypothesis, has an approximate Student’s t distribution with a
number of degrees of freedom given by Satterthwaite’s approximation.
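For reference, the approximate degrees of freedom can be computed from the two samples with the standard Welch-Satterthwaite formula; this is a sketch of the calculation, not the toolbox's internal code:

x = randn(10,1); y = randn(15,1);   % illustrative samples
vx = var(x)/numel(x);               % s_x^2 / n
vy = var(y)/numel(y);               % s_y^2 / m
nu = (vx + vy)^2 / (vx^2/(numel(x)-1) + vy^2/(numel(y)-1))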
[h,p,ci] = ttest2(...) returns a 100*(1 – alpha)% confidence
interval on the difference of population means.
[h,p,ci,stats] = ttest2(...) returns structure stats with the following fields:
• tstat — Value of the test statistic
• df — Degrees of freedom of the test
• sd — Pooled estimate of the population standard deviation (equal variance case), or a vector containing the unpooled estimates (unequal variance case)
Examples Generate random samples from two normal distributions with different means and variances:
x = normrnd(0,1,1,1000);
y = normrnd(0.1,2,1,1000);
Test the null hypothesis that the samples come from populations with
equal means, against the alternative that the means are unequal.
Perform the test assuming unequal variances:
[h,p,ci] = ttest2(x,y,[],[],'unequal')
h =
1
p =
0.0102
ci =
-0.3227 -0.0435
The test rejects the null hypothesis at the default α = 0.05 significance level. Under the null hypothesis, the probability of observing a value as extreme as, or more extreme than, the observed test statistic, as indicated by the p value, is less than α. The 95% confidence interval on the mean of the difference does not contain 0.
This example will produce slightly different results each time it is run,
because of the random sampling.
classregtree.type
Description ttype = type(t) returns the type of the tree t. ttype is 'regression'
for regression trees and 'classification' for classification trees.
Examples load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
ttype = type(t)
ttype =
classification
cvpartition.Type property
qrandset.Type property
Description P.Type returns a string that contains the name of the sequence on
which the point set P is based, for example 'Sobol'. You cannot change
the Type property for a point set.
categorical.uint8
Syntax B = uint8(A)
categorical.uint16
Syntax B = uint16(A)
categorical.uint32
Syntax B = uint32(A)
categorical.uint64
Syntax B = uint64(A)
categorical.undeflabel property
Description Text label for undefined levels. Constant property with value
'<undefined>'.
dataset.unique
Syntax B = unique(A)
B = unique(A,vars)
[B,i,j] = unique(A)
[...] = unique(A,vars,'first')
Description B = unique(A) returns a copy of the dataset A that contains only the
sorted unique observations. A must contain only variables whose class
has a unique method, including:
• numeric
• character
• logical
• categorical
• cell arrays of strings
For a variable with multiple columns, its class’s unique method must
support the 'rows' flag.
B = unique(A,vars) returns a dataset that contains only one
observation for each unique combination of values for the variables in A
specified in vars. vars is a positive integer, a vector of positive integers,
a variable name, a cell array containing one or more variable names,
or a logical vector. B includes all variables from A. The values in B for
the variables not specified in vars are taken from the last occurrence
among observations in A with each unique combination of values for the
variables specified in vars.
[B,i,j] = unique(A) also returns index vectors i and j such that B =
A(i,:) and A = B(j,:).
[...] = unique(A,vars,'first') returns the vector i to
index the first occurrence of each unique observation in A.
unique(A,vars,'last'), the default, returns the vector i to index
the last occurrence. Specify vars as [] to use the default value of all
variables.
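A minimal sketch on a made-up dataset array (the values and variable names are illustrative):

ds = dataset([1;2;1;2],{'a';'b';'a';'b'},'VarNames',{'x','y'});
B = unique(ds)        % sorted unique observations
B2 = unique(ds,'x')   % one observation per unique value of x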
dataset.Units property
Description A cell array of strings giving the units of the variables in the data set.
This property may be empty, but if not empty, the number of strings
must equal the number of variables. Any individual string may be
empty for a variable that does not have units defined. The default is
an empty cell array.
unidcdf
Syntax P = unidcdf(X,N)
The discrete uniform cdf is

$$p = F(x\,|\,N) = \frac{\mathrm{floor}(x)}{N}\, I_{(1,\ldots,N)}(x)$$
Examples What is the probability of drawing a number 20 or less from a hat with
the numbers from 1 to 50 inside?
probability = unidcdf(20,50)
probability =
0.4000
unidinv
Syntax X = unidinv(P,N)
Description X = unidinv(P,N) returns the smallest positive integer X such that the
discrete uniform cdf evaluated at X is equal to or exceeds P. You can
think of P as the probability of drawing a number as large as X out of a
hat with the numbers 1 through N inside.
P and N can be vectors, matrices, or multidimensional arrays that have
the same size, which is also the size of X. A scalar input for N or P is
expanded to a constant array with the same dimensions as the other
input. The values in P must lie on the interval [0 1] and the values in N
must be positive integers.
Examples x = unidinv(0.7,20)
x =
14
y = unidinv(0.7 + eps,20)
y =
15
unidpdf
Syntax Y = unidpdf(X,N)
The discrete uniform pdf is

$$y = f(x\,|\,N) = \frac{1}{N}\, I_{(1,\ldots,N)}(x)$$
Examples y = unidpdf(1:6,10)
y =
0.1000 0.1000 0.1000 0.1000 0.1000 0.1000
likelihood = unidpdf(5,4:9)
likelihood =
0 0.2000 0.1667 0.1429 0.1250 0.1111
unidrnd
Syntax R = unidrnd(N)
R = unidrnd(N,v)
R = unidrnd(N,m,n)
Examples numbers = unidrnd(10000,1,6)-1
numbers =
4564 185 8214 4447 6154 7919
unidstat
Description [M,V] = unidstat(N) returns the mean and variance of the discrete uniform distribution with maximum observable value N.
The mean of the discrete uniform distribution with parameter N is (N + 1)/2. The variance is (N² − 1)/12.
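As a quick check of these formulas:

[m,v] = unidstat(6)   % returns m = 3.5 and v = (36-1)/12 = 2.9167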
unifcdf
Syntax P = unifcdf(X,A,B)
The continuous uniform cdf is

$$p = F(x\,|\,a,b) = \frac{x-a}{b-a}\, I_{[a,b]}(x)$$
Examples probability = unifcdf(0.75)
probability =
0.7500
probability = unifcdf(0.75,-1,1)
probability =
0.8750
unifinv
Syntax X = unifinv(P,A,B)
The inverse of the continuous uniform cdf is

$$x = F^{-1}(p\,|\,a,b) = a + p\,(b-a)\, I_{[0,1]}(p)$$
Examples median_value = unifinv(0.5)
median_value =
0.5000
percentile = unifinv(0.99,-1,1)
percentile =
0.9800
unifit
Examples r = unifrnd(10,12,100,2);
[ahat,bhat,aci,bci] = unifit(r)
ahat =
10.0154 10.0060
bhat =
11.9989 11.9743
aci =
9.9551 9.9461
10.0154 10.0060
bci =
11.9989 11.9743
12.0592 12.0341
unifpdf
Syntax Y = unifpdf(X,A,B)
The continuous uniform pdf is

$$y = f(x\,|\,a,b) = \frac{1}{b-a}\, I_{[a,b]}(x)$$
Examples x = 0.1:0.1:0.6;
y = unifpdf(x)
y =
1 1 1 1 1 1
y = unifpdf(-1,0,1)
y =
0
unifrnd
Syntax R = unifrnd(A,B)
R = unifrnd(A,B,m,n,...)
R = unifrnd(A,B,[m,n,...])
Examples Generate one random number each from the continuous uniform
distributions on the intervals (0,1), (0,2), ..., (0,5):
a = 0; b = 1:5;
r1 = unifrnd(a,b)
r1 =
0.8147 1.8116 0.3810 3.6535 3.1618
B = repmat(b,5,1);
R = unifrnd(a,B)
R =
0.0975 0.3152 0.4257 2.6230 3.7887
0.2785 1.9412 1.2653 0.1428 3.7157
0.5469 1.9143 2.7472 3.3965 1.9611
0.9575 0.9708 2.3766 3.7360 3.2774
0.9649 1.6006 2.8785 2.7149 0.8559
r2 = unifrnd(a,b(2),1,5)
r2 =
1.4121 0.0637 0.5538 0.0923 0.1943
unifstat
Description [M,V] = unifstat(A,B) returns the mean and variance of the continuous uniform distribution with lower endpoint (minimum) A and upper endpoint (maximum) B. Vector or matrix
inputs for A and B must have the same size, which is also the size of M
and V. A scalar input for A or B is expanded to a constant matrix with
the same dimensions as the other input.
The mean of the continuous uniform distribution with parameters a and b is (a + b)/2, and the variance is (a − b)²/12.
Examples a = 1:6;
b = 2.*a;
[m,v] = unifstat(a,b)
m =
1.5000 3.0000 4.5000 6.0000 7.5000 9.0000
v =
0.0833 0.3333 0.7500 1.3333 2.0833 3.0000
categorical.union
Syntax C = union(A,B)
[C,IA,IB] = union(A,B)
categorical.unique
Syntax B = unique(A)
[B,I,J] = unique(A)
[B,I,J] = unique(A,'first')
dataset.unstack
You can also specify more than one data variable in tall, each of
which becomes a set of m variables in wide. In this case, specify
datavar as a vector of positive integers, a cell array containing variable
names, or a logical vector. You may specify only one variable with
indvar. The names of each set of data variables in wide are the
name of the corresponding data variable in tall concatenated with
the names specified in 'NewDataVarNames'. The function specified in
'AggregationFun' must return a value with a single row.
Examples Convert a "wide format" data set to "tall format", and then back to
a different "wide format":
load flu
% FLU has a 'Date' variable, and 10 variables for estimated
% influenza rates (in 9 different regions, estimated from
% Google searches, plus a nationwide estimate from the
% CDC). Combine those 10 variables into a "tall" array
% that has a single data variable, 'FluRate', and an
% indicator variable, 'Region', that says which region
% each estimate is from.
flu2 = stack(flu, 2:11, 'NewDataVarName','FluRate',...
'IndVarName','Region')
dateNames = cellstr(datestr(flu.Date,'mmm_DD_YYYY'));
paretotails.upperparams
Examples t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
lowerparams(obj)
ans =
-0.1901 1.1898
upperparams(obj)
ans =
0.3646 0.5103
dataset.UserData property
ProbDistUnivParam.var
Syntax V = var(PD)
dataset.VarDescription property
Description A cell array of strings giving the descriptions of the variables in the
data set. This property may be empty, but if not empty, the number
of strings must equal the number of variables. Any individual string
may be empty for a variable that does not have a description defined.
The default is an empty cell array.
classregtree.varimportance
CompactTreeBagger.VarNames property
Description The VarNames property is a cell array containing the names of the
predictor variables (features). These names are taken from the optional
'names' parameter supplied to TreeBagger. The default names
are 'x1', 'x2', etc.
dataset.VarNames property
Description A cell array of nonempty, distinct strings giving the names of the
variables in the data set. The number of strings must equal the number
of variables. The default is the cell array of string names for the
variables used to create the data set.
TreeBagger.VarNames property
Description The VarNames property is a cell array containing the names of the
predictor variables (features). TreeBagger takes these names from the
optional 'names' parameter. The default names are 'x1', 'x2', etc.
vartest
Syntax H = vartest(X,V)
H = vartest(X,V,alpha)
H = vartest(X,V,alpha,tail)
[H,P] = vartest(...)
[H,P,CI] = vartest(...)
[H,P,CI,STATS] = vartest(...)
[...] = vartest(X,V,alpha,tail,dim)
vartest2
Syntax H = vartest2(X,Y)
H = vartest2(X,Y,alpha)
H = vartest2(X,Y,alpha,tail)
[H,P] = vartest2(...)
[H,P,CI] = vartest2(...)
[H,P,CI,STATS] = vartest2(...)
[...] = vartest2(X,Y,alpha,tail,dim)
Examples Is the variance significantly different for two model years, and what is a
confidence interval for the ratio of these variances?
load carsmall
[H,P,CI] = vartest2(MPG(Model_Year==82),MPG(Model_Year==76))
vartestn
Syntax vartestn(X)
vartestn(X,group)
p = vartestn(...)
[p,STATS] = vartestn(...)
[...] = vartestn(...,displayopt)
[...] = vartestn(...,testtype)
Description vartestn(X) performs Bartlett’s test for equal variances for the
columns of the matrix X. This is a test of the null hypothesis that the
columns of X come from normal distributions with the same variance,
against the alternative that they come from normal distributions with
different variances. The result is a display of a box plot of the groups,
and a summary table of statistics.
vartestn(X,group) requires a vector X, and a group argument that is
a categorical variable, vector, string array, or cell array of strings with
one row for each element of X. The X values corresponding to the same
value of group are placed in the same group. (See “Grouped Data” on
page 2-34.) The function tests for equal variances across groups.
vartestn treats NaNs as missing values and ignores them.
p = vartestn(...) returns the p value, i.e., the probability of
observing the given result, or one more extreme, by chance if the null
hypothesis of equal variances is true. Small values of p cast doubt on
the validity of the null hypothesis.
[p,STATS] = vartestn(...) returns a structure with the following fields:
• chisqstat — Value of the test statistic
• df — Degrees of freedom of the test
Examples load carsmall
p = vartestn(MPG,Model_Year)
p =
0.8327
categorical.vertcat
Syntax C = vertcat(A,B,...)
C = vertcat(A,B)
dataset.vertcat
classregtree.view
Syntax view(t)
view(t,param1,val1,param2,val2,...)
For each branch node, the left child node corresponds to the points that
satisfy the condition, and the right child node corresponds to the points
that do not satisfy the condition.
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1 if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2 class = setosa
3 if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4 if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5 class = virginica
6 if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7 class = virginica
8 class = versicolor
9 class = virginica
view(t)
wblcdf
Syntax P = wblcdf(X,A,B)
[P,PLO,PUP] = wblcdf(X,A,B,PCOV,alpha)
[P,PLO,PUP] = wblcdf(X,A,B,PCOV,alpha) returns confidence bounds for P when the input parameters A and B are estimates. wblcdf computes the bounds using a normal approximation for the distribution of the estimate $\hat{b}\,(\log x - \log \hat{a})$
and then transforms those bounds to the scale of the output P. The
computed bounds give approximately the desired confidence level when
you estimate mu, sigma, and PCOV from large samples, but in smaller
samples other methods of computing the confidence bounds might be
more accurate.
The Weibull cdf is
$$p = F(x\,|\,a,b) = \int_0^x b\,a^{-b}\,t^{\,b-1} e^{-(t/a)^b}\,dt = 1 - e^{-(x/a)^b}\, I_{(0,\infty)}(x)$$
Examples What is the probability that a value from a Weibull distribution with
parameters a = 0.15 and b = 0.8 is less than 0.5?
[A, B] = meshgrid(0.1:0.05:0.2,0.2:0.05:0.3);
probability = wblcdf(0.5, A, B)
probability =
0.7484 0.7198 0.6991
0.7758 0.7411 0.7156
0.8022 0.7619 0.7319
wblfit
The Weibull pdf is

$$y = f(x\,|\,a,b) = b\,a^{-b}\,x^{\,b-1} e^{-(x/a)^b}\, I_{(0,\infty)}(x)$$
wblinv
Syntax X = wblinv(P,A,B)
[X,XLO,XUP] = wblinv(P,A,B,PCOV,alpha)
[X,XLO,XUP] = wblinv(P,A,B,PCOV,alpha) returns confidence bounds for X when A and B are estimates. wblinv computes the bounds using a normal approximation for the distribution of the estimate

$$\log a + \frac{\log q}{b}$$

where $q = -\log(1-p)$. The inverse of the Weibull cdf is

$$x = F^{-1}(p\,|\,a,b) = a\left[-\ln(1-p)\right]^{1/b} I_{[0,1]}(p)$$
Examples The lifetimes (in hours) of a batch of light bulbs have a Weibull distribution with parameters a = 200 and b = 6.
Generate 100 random values from this distribution, and estimate the
90th percentile (with confidence bounds) from the random sample
x = wblrnd(200,6,100,1);
p = wblfit(x)
[nlogl,pcov] = wbllike(p,x)
[q90,q90lo,q90up] = wblinv(0.9,p(1),p(2),pcov)
p =
204.8918 6.3920
nlogl =
496.8915
pcov =
11.3392 0.5233
0.5233 0.2573
q90 =
233.4489
q90lo =
226.0092
q90up =
241.1335
wbllike
The Weibull negative log-likelihood is

$$-\log L = -\log \prod_{i=1}^{n} f(a,b\,|\,x_i) = -\sum_{i=1}^{n} \log f(a,b\,|\,x_i)$$
Examples r = wblrnd(0.5,0.8,100,1);
wblpdf
Syntax Y = wblpdf(X,A,B)
$$y = f(x\,|\,a,b) = b\,a^{-b}\,x^{\,b-1} e^{-(x/a)^b}\, I_{(0,\infty)}(x)$$
Examples lambda = 1:6;
y = wblpdf(0.1:0.1:0.6,lambda,1)
y =
0.9048 0.4524 0.3016 0.2262 0.1810 0.1508
y1 = exppdf(0.1:0.1:0.6,lambda)
y1 =
0.9048 0.4524 0.3016 0.2262 0.1810 0.1508
See Also pdf, wblcdf, wblfit, wblinv, wbllike, wblplot, wblrnd, wblstat
“Weibull Distribution” on page B-103
wblplot
Syntax wblplot(X)
h = wblplot(X)
Examples r = wblrnd(1.2,1.5,50,1);
wblplot(r)
wblrnd
Syntax R = wblrnd(A,B)
R = wblrnd(A,B,v)
R = wblrnd(A,B,m,n)
Examples n1 = wblrnd(0.5:0.5:2,0.5:0.5:2)
n1 =
0.0178 0.0860 2.5216 0.9124
n2 = wblrnd(1/2,1/2,[1 6])
n2 =
0.0046 1.7214 2.2108 0.0367 0.0531 0.0917
wblstat
Description [M,V] = wblstat(A,B) returns the mean and variance of the Weibull distribution with scale parameter A and shape parameter B.
Vector or matrix inputs for A and B must have the same size, which
is also the size of M and V. A scalar input for A or B is expanded to a
constant matrix with the same dimensions as the other input.
The mean of the Weibull distribution with parameters a and b is

$$a\,\Gamma\!\left(1 + b^{-1}\right)$$

and the variance is

$$a^{2}\left[\Gamma\!\left(1 + 2b^{-1}\right) - \Gamma\!\left(1 + b^{-1}\right)^{2}\right]$$
Examples wblstat(0.5,0.7)
ans =
0.6329
wishrnd
Syntax W = wishrnd(Sigma,df)
W = wishrnd(Sigma,df,D)
[W,D] = wishrnd(Sigma,df)
TreeBagger.X property
xptread
Purpose Create dataset array from data stored in SAS XPORT format file
Description data = xptread displays a dialog box for selecting a file, then reads
data from the file into a dataset array. The file must be in the SAS
XPORT format.
data = xptread(filename) retrieves data from a SAS XPORT format
file filename. The XPORT format allows for 28 missing data types,
represented in the file by an uppercase letter, '.' or '_'. xptread
converts all missing data to NaN values in data. However, if you need
the specific missing types then you can recover this information by
specifying a second output.
[data,missing] = xptread(filename) returns a nominal array,
missing, of the same size as data containing the missing data type
information from the XPORT format file. The entries are undefined for
values that are not present and are one of '.', '_', 'A',...,'Z' for
missing values.
xptread(...,'ReadObsNames',true) treats the first variable in the
file as observation names. The default value is false.
xptread only supports single data sets per file.
Examples data = xptread('sample.xpt')
x2fx
Syntax D = x2fx(X,model)
D = x2fx(X,model,categ)
D = x2fx(X,model,categ,catlevels)
D = x2fx(X,model) converts a matrix of predictors X to a design matrix D for regression analysis. Each row of model specifies the powers to which the corresponding columns of X are raised; for
example, if X has columns X1, X2, and X3, then a row [0 1 2] in model
specifies the term (X1.^0).*(X2.^1).*(X3.^2). A row of all zeros in
model specifies a constant term, which can be omitted.
D = x2fx(X,model,categ) treats columns with numbers listed in
the vector categ as categorical variables. Terms involving categorical
variables produce dummy variable columns in D. Dummy variables
are computed under the assumption that possible categorical levels
are completely enumerated by the unique values that appear in the
corresponding column of X.
D = x2fx(X,model,categ,catlevels) accepts a vector catlevels
the same length as categ, specifying the number of levels in each
categorical variable. In this case, values in the corresponding column of
X must be integers in the range from 1 to the specified number of levels.
Not all of the levels need to appear in X.
Examples Example 1
The following converts 2 predictors X1 and X2 (the columns of X) into a
design matrix for a full quadratic model with terms constant, X1, X2,
X1.*X2, X1.^2, and X2.^2.
X = [1 10
2 20
3 10
4 20
5 15
6 15];
D = x2fx(X,'quadratic')
D =
1 1 10 10 1 100
1 2 20 40 4 400
1 3 10 30 9 100
1 4 20 80 16 400
1 5 15 75 25 225
1 6 15 90 36 225
Example 2
The following converts 2 predictors X1 and X2 (the columns of X) into
a design matrix for a quadratic model with terms constant, X1, X2,
X1.*X2, and X1.^2.
X = [1 10
2 20
3 10
4 20
5 15
6 15];
model = [0 0
1 0
0 1
1 1
2 0];
D = x2fx(X,model)
D =
1 1 10 10 1
1 2 20 40 4
1 3 10 30 9
1 4 20 80 16
1 5 15 75 25
1 6 15 90 36
TreeBagger.Y property
zscore
Syntax Z = zscore(X)
[Z,mu,sigma] = zscore(X)
[...] = zscore(X,1)
[...] = zscore(X,flag,dim)
Examples Compare the predictors in the Moore data on original and standardized
scales:
load moore
predictors = moore(:,1:5);
subplot(2,1,1),plot(predictors)
subplot(2,1,2),plot(zscore(predictors))
ztest
Purpose z-test
Syntax h = ztest(x,m,sigma)
h = ztest(...,alpha)
h = ztest(...,alpha,tail)
h = ztest(...,alpha,tail,dim)
[h,p] = ztest(...)
[h,p,ci] = ztest(...)
[h,p,ci,zval] = ztest(...)
[h,p] = ztest(...) returns the p value of the test. The p value is the probability, under the null hypothesis, of observing a value as extreme as, or more extreme than, the observed test statistic

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
where $\bar{x}$ is the sample mean, μ = m is the hypothesized population
mean, σ is the population standard deviation, and n is the sample
size. Under the null hypothesis, the test statistic will have a standard
normal distribution, N(0,1).
[h,p,ci] = ztest(...) returns a 100*(1 – alpha)% confidence
interval on the population mean.
[h,p,ci,zval] = ztest(...) returns the value of the test statistic.
Examples Simulate a random sample of size 100 from a normal distribution with
mean 0.1 and standard deviation 1:
x = normrnd(0.1,1,1,100);
Test the null hypothesis that the sample comes from a standard normal
distribution:
[h,p,ci] = ztest(x,0,1)
h =
0
p =
0.1391
ci =
-0.0481 0.3439
The test fails to reject the null hypothesis at the default α = 0.05 significance level. Under the null hypothesis, the probability of observing a value as extreme as, or more extreme than, the observed test statistic, as indicated by the p value, is greater than α. The 95% confidence interval on the mean contains 0.
Simulate a larger random sample of size 1000 from the same distribution:
y = normrnd(0.1,1,1,1000);
Test again if the sample comes from a normal distribution with mean 0:
[h,p,ci] = ztest(y,0,1)
h =
1
p =
5.5160e-005
ci =
0.0655 0.1895
This time the test rejects the null hypothesis at the default α = 0.05
significance level. The p value has fallen below α = 0.05 and the 95%
confidence interval on the mean does not contain 0.
Because the p value of the sample y is less than 0.01, the test will still
reject the null hypothesis when the significance level is lowered to α
= 0.01:
[h,p,ci] = ztest(y,0,1,0.01)
h =
1
p =
5.5160e-005
ci =
0.0461 0.2090
This example will produce slightly different results each time it is run,
because of the random sampling.
A Data Sets
Statistics Toolbox software includes the sample data sets in the following table. To load a data set into the MATLAB workspace, type
load filename
where filename is one of the files listed in the table.
B Distribution Reference
Bernoulli Distribution
See Also
“Discrete Distributions” on page 5-7
Beta Distribution
In this section...
“Definition” on page B-4
“Background” on page B-4
“Parameters” on page B-5
“Example” on page B-6
“See Also” on page B-6
Definition
The beta pdf is
$$y = f(x\,|\,a,b) = \frac{1}{B(a,b)}\, x^{\,a-1}(1-x)^{\,b-1}\, I_{(0,1)}(x)$$
where B( · ) is the Beta function. The indicator function I(0,1)(x) ensures that
only values of x in the range (0 1) have nonzero probability.
Background
The beta distribution describes a family of curves that are unique in that they
are nonzero only on the interval (0 1). A more general version of the function
assigns parameters to the endpoints of the interval.
If Y has a Student’s t distribution with ν degrees of freedom, then

$$X = \frac{1}{2} + \frac{1}{2}\,\frac{Y}{\sqrt{\nu + Y^{2}}}$$

has a beta distribution with parameters ν/2 and ν/2, that is, $X \sim \beta\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right)$.
This relationship is used to compute values of the t cdf and inverse function, as well as to generate t distributed random numbers.
Parameters
Suppose you are collecting data that has hard lower and upper bounds of zero
and one respectively. Parameter estimation is the process of determining the
parameters of the beta distribution that fit this data best in some sense.
The function betafit returns the MLEs and confidence intervals for the
parameters of the beta distribution. Here is an example using random
numbers from the beta distribution with a = 5 and b = 0.2.
r = betarnd(5,0.2,100,1);
[phat, pci] = betafit(r)
phat =
4.5330 0.2301
pci =
2.8051 0.1771
6.2610 0.2832
The MLE for parameter a is 4.5330, compared to the true value of 5. The
95% confidence interval for a goes from 2.8051 to 6.2610, which includes
the true value.
Similarly the MLE for parameter b is 0.2301, compared to the true value
of 0.2. The 95% confidence interval for b goes from 0.1771 to 0.2832, which
also includes the true value. In this made-up example you know the “true
value.” In experimentation you do not.
Example
The shape of the beta distribution is quite variable depending on the values of
the parameters, as illustrated by the plot below.
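A sketch that reproduces a plot of this kind (the parameter pairs are illustrative choices, including a = b = 1, which gives the flat standard uniform pdf):

x = 0:.01:1;
plot(x,betapdf(x,0.75,0.75),'-', x,betapdf(x,1,1),':', x,betapdf(x,4,4),'-.')
legend({'a = b = 0.75','a = b = 1','a = b = 4'})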
The constant pdf (the flat line) shows that the standard uniform distribution
is a special case of the beta distribution.
See Also
“Continuous Distributions (Data)” on page 5-4
Binomial Distribution
In this section...
“Definition” on page B-7
“Background” on page B-7
“Parameters” on page B-8
“Example” on page B-9
“See Also” on page B-9
Definition
The binomial pdf is
$$y = f(k\,|\,n,p) = \binom{n}{k}\, p^{k}(1-p)^{\,n-k}$$
Background
The binomial distribution models the total number of successes in repeated
trials from an infinite population under the following conditions:
• The sample consists of a fixed number, n, of trials.
• Each trial results in one of two outcomes, success or failure.
• The trials are independent, and the probability of success is constant from trial to trial.
Parameters
Suppose you are collecting data from a widget manufacturing process, and
you record the number of widgets within specification in each batch of 100.
You might be interested in the probability that an individual widget is
within specification. Parameter estimation is the process of determining the
parameter, p, of the binomial distribution that fits this data best in some
sense.
The function binofit returns the MLEs and confidence intervals for the
parameters of the binomial distribution. Here is an example using random
numbers from the binomial distribution with n = 100 and p = 0.9.
r = binornd(100,0.9)
r =
88
[phat, pci] = binofit(r,100)
phat =
0.8800
pci =
0.7998
0.9364
The MLE for parameter p is 0.8800, compared to the true value of 0.9. The
95% confidence interval for p goes from 0.7998 to 0.9364, which includes
the true value. In this made-up example you know the “true value” of p. In
experimentation you do not.
Example
The following commands generate a plot of the binomial pdf for n = 10 and
p = 1/2.
x = 0:10;
y = binopdf(x,10,0.5);
plot(x,y,'+')
See Also
“Discrete Distributions” on page 5-7
Birnbaum-Saunders Distribution
In this section...
“Definition” on page B-10
“Background” on page B-10
“Parameters” on page B-11
“See Also” on page B-11
Definition
The Birnbaum-Saunders distribution has the density function
$$f(x\,|\,\beta,\gamma) = \frac{1}{\sqrt{2\pi}}\,\exp\left\{-\frac{\left(\sqrt{x/\beta}-\sqrt{\beta/x}\right)^{2}}{2\gamma^{2}}\right\}\left(\frac{\sqrt{x/\beta}+\sqrt{\beta/x}}{2\gamma x}\right)$$

with scale parameter β > 0 and shape parameter γ > 0, for x > 0. If x has a Birnbaum-Saunders distribution with parameters β and γ, then

$$z = \frac{\sqrt{x/\beta}-\sqrt{\beta/x}}{\gamma}$$

has a standard normal distribution.
Background
The Birnbaum-Saunders distribution was originally proposed as a lifetime
model for materials subject to cyclic patterns of stress and strain, where the
ultimate failure of the material comes from the growth of a prominent flaw.
In materials science, Miner’s Rule suggests that the damage occurring after n cycles, at a stress level with an expected lifetime of N cycles, is proportional to n/N.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Chi-Square Distribution
In this section...
“Definition” on page B-12
“Background” on page B-12
“Example” on page B-13
“See Also” on page B-13
Definition
The χ2 pdf is
x(
−2 ) / 2 − x / 2
e
y = f ( x | ) =
22 Γ ( / 2)
Background
The χ2 distribution is a special case of the gamma distribution where b = 2 in
the equation for gamma distribution below.
$$y = f(x\,|\,a,b) = \frac{1}{b^{a}\,\Gamma(a)}\, x^{\,a-1} e^{-x/b}$$
If a set of n observations is normally distributed with variance σ² and sample variance s², then

$$\frac{(n-1)\,s^{2}}{\sigma^{2}} \sim \chi^{2}(n-1)$$
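A quick simulation sketch of this sampling relationship (the sample size and replication count are arbitrary):

n = 10;
s2 = var(randn(n,100000));   % sample variances of standard normal samples
mean((n-1)*s2)               % approximately n-1 = 9, the chi-square mean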
Example
The χ2 distribution is skewed to the right especially for few degrees of freedom
(ν). The plot shows the χ2 distribution with four degrees of freedom.
x = 0:0.2:15;
y = chi2pdf(x,4);
plot(x,y)
See Also
“Continuous Distributions (Statistics)” on page 5-6
Copulas
See “Copulas” on page 5-107.
Custom Distributions
Custom Distributions
User-defined custom distributions, created using files and function handles,
are supported by the Statistics Toolbox functions pdf, cdf, icdf, and mle, and
the Statistics Toolbox GUI dfittool.
Exponential Distribution
In this section...
“Definition” on page B-16
“Background” on page B-16
“Parameters” on page B-16
“Example” on page B-17
“See Also” on page B-18
Definition
The exponential pdf is
$$y = f(x\,|\,\mu) = \frac{1}{\mu}\, e^{-x/\mu}$$
Background
Like the chi-square distribution, the exponential distribution is a special case
of the gamma distribution (obtained by setting a = 1)
$$y = f(x\,|\,a,b) = \frac{1}{b^{a}\,\Gamma(a)}\, x^{\,a-1} e^{-x/b}$$
Parameters
Suppose you are stress testing light bulbs and collecting data on their
lifetimes. You assume that these lifetimes follow an exponential distribution.
You want to know how long you can expect the average light bulb to last.
Parameter estimation is the process of determining the parameters of the
exponential distribution that fit this data best in some sense.
The function expfit returns the MLEs and confidence intervals for the
parameters of the exponential distribution. Here is an example using random
numbers from the exponential distribution with µ = 700.
lifetimes = exprnd(700,100,1);
[muhat, muci] = expfit(lifetimes)
muhat =
672.8207
muci =
547.4338
810.9437
The MLE for parameter µ is 672, compared to the true value of 700. The 95%
confidence interval for µ goes from 547 to 811, which includes the true value.
In the life tests you do not know the true value of µ so it is nice to have a
confidence interval on the parameter to give a range of likely values.
Example
For exponentially distributed lifetimes, the probability that an item will
survive an extra unit of time is independent of the current age of the item.
The example shows a specific case of this special property.
l = 10:10:60;
lpd = l+0.1;
deltap = (expcdf(lpd,50)-expcdf(l,50))./(1-expcdf(l,50))
deltap =
0.0020 0.0020 0.0020 0.0020 0.0020 0.0020
The following commands generate a plot of the exponential pdf with its
parameter (and mean), µ, set to 2.
x = 0:0.1:10;
y = exppdf(x,2);
plot(x,y)
See Also
“Continuous Distributions (Data)” on page 5-4
Extreme Value Distribution
Definition
The probability density function for the extreme value distribution with
location parameter µ and scale parameter σ is
$$y = f(x\,|\,\mu,\sigma) = \sigma^{-1} \exp\left(\frac{x-\mu}{\sigma}\right) \exp\left(-\exp\left(\frac{x-\mu}{\sigma}\right)\right)$$
Background
Extreme value distributions are often used to model the smallest or largest
value among a large set of independent, identically distributed random values
representing measurements or observations. The extreme value distribution
is appropriate for modeling the smallest value from a distribution whose tails
decay exponentially fast, for example, the normal distribution. It can also
model the largest value from a distribution, such as the normal or exponential
distributions, by using the negative of the original values.
The following sketch fits an extreme value distribution to simulated block minima and overlays the fitted pdf on a histogram (the setup lines and their values are assumptions for illustration):
xMinima = min(randn(1000,500));     % 500 block minima of normal samples
paramEstsMinima = evfit(xMinima);   % fit the extreme value parameters
y = linspace(-5,-1.5,1001);         % grid for the fitted pdf
hist(xMinima,-4.75:.25:-1.75);
p = evpdf(y,paramEstsMinima(1),paramEstsMinima(2));
line(y,.25*length(xMinima)*p,'color','r')
Although the extreme value distribution is most often used as a model for
extreme values, you can also use it as a model for other types of continuous
data. For example, extreme value distributions are closely related to the
Weibull distribution. If T has a Weibull distribution, then log(T) has a type 1
extreme value distribution.
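A quick numerical check of this relationship (the Weibull parameters a = 2 and b = 3 are arbitrary): if T has a Weibull distribution with scale a and shape b, then log(T) has an extreme value distribution with location log(a) and scale 1/b.

t = wblrnd(2,3,100000,1);
evfit(log(t))   % approximately [log(2) 1/3] = [0.6931 0.3333]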
Parameters
The function evfit returns the maximum likelihood estimates (MLEs) and
confidence intervals for the parameters of the extreme value distribution. The
following example shows how to fit some sample data using evfit, including
estimates of the mean and variance from the fitted distribution.
Suppose you want to model the size of the smallest washer in each batch
of 1000 from a manufacturing process. If you believe that the sizes are
independent within and between each batch, you can fit an extreme value
distribution to measurements of the minimum diameter from a series of eight
experimental batches. The following code returns the MLEs of the distribution
parameters as parmhat and the confidence intervals as the columns of parmci.
parmhat =
20.2506 0.8223
parmci =
19.644 0.49861
20.857 1.3562
You can find mean and variance of the extreme value distribution with these
parameters using the function evstat:
[meanfit,varfit] = evstat(parmhat(1),parmhat(2))
meanfit =
19.776
varfit =
1.1123
Example
The following code generates a plot of the pdf for the extreme value
distribution.
t = [-5:.01:2];
y = evpdf(t);
plot(t,y)
The extreme value distribution is skewed to the left, and its general shape
remains the same for all parameter values. The location parameter, mu, shifts
the distribution along the real line, and the scale parameter, sigma, expands
or contracts the distribution. This example plots the probability function for
different combinations of mu and sigma.
x = -15:.01:5;
plot(x,evpdf(x,2,1),'-', ...
x,evpdf(x,0,2),':', ...
x,evpdf(x,-2,4),'-.');
legend({'mu = 2, sigma = 1', ...
'mu = 0, sigma = 2', ...
'mu = -2, sigma = 4'}, ...
'Location','NW')
xlabel('x')
ylabel('f(x|mu,sigma)')
See Also
“Continuous Distributions (Data)” on page 5-4
F Distribution
In this section...
“Definition” on page B-25
“Background” on page B-25
“Example” on page B-26
“See Also” on page B-26
Definition
The pdf for the F distribution is
$$y = f(x\,|\,\nu_1,\nu_2) = \frac{\Gamma\!\left[\frac{\nu_1+\nu_2}{2}\right]}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}\, \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} \frac{x^{\frac{\nu_1-2}{2}}}{\left[1+\left(\frac{\nu_1}{\nu_2}\right)x\right]^{\frac{\nu_1+\nu_2}{2}}}$$
Background
The F distribution has a natural relationship with the chi-square distribution.
If χ1 and χ2 are both chi-square with ν1 and ν2 degrees of freedom respectively,
then the statistic F below is F-distributed.
$$F(\nu_1,\nu_2) = \frac{\chi_1^{2}/\nu_1}{\chi_2^{2}/\nu_2}$$
The two parameters, ν1 and ν2, are the numerator and denominator degrees
of freedom. That is, ν1 and ν2 are the number of independent pieces of
information used to calculate χ1 and χ2, respectively.
Example
The most common application of the F distribution is in standard tests of
hypotheses in analysis of variance and regression.
The plot shows that the F distribution exists on the positive real numbers
and is skewed to the right.
x = 0:0.01:10;
y = fpdf(x,5,3);
plot(x,y)
See Also
“Continuous Distributions (Statistics)” on page 5-6
Gamma Distribution
In this section...
“Definition” on page B-27
“Background” on page B-27
“Parameters” on page B-28
“Example” on page B-29
“See Also” on page B-29
Definition
The gamma pdf is
−x
1 a −1
y = f ( x | a, b) = x eb
ba Γ(a)
Background
The gamma distribution models sums of exponentially distributed random
variables.
The gamma cdf has the following relationship with the incomplete gamma function:

$$F(x\,|\,a,b) = \mathrm{gammainc}\!\left(\frac{x}{b},\, a\right)$$
Parameters
Suppose you are stress testing computer memory chips and collecting data on
their lifetimes. You assume that these lifetimes follow a gamma distribution.
You want to know how long you can expect the average computer memory chip
to last. Parameter estimation is the process of determining the parameters of
the gamma distribution that fit this data best in some sense.
The function gamfit returns the MLEs and confidence intervals for the
parameters of the gamma distribution. Here is an example using random
numbers from the gamma distribution with a = 10 and b = 5.
lifetimes = gamrnd(10,5,100,1);
[phat, pci] = gamfit(lifetimes)
phat =
10.9821 4.7258
pci =
7.4001 3.1543
14.5640 6.2974
The MLE for parameter a is 10.98, compared to the true value of 10. The 95% confidence interval for a goes from 7.4 to 14.6, which includes the true value.
Similarly the MLE for parameter b is 4.7, compared to the true value of 5.
The 95% confidence interval for b goes from 3.2 to 6.3, which also includes
the true value.
In the life tests you do not know the true value of a and b so it is nice to have
a confidence interval on the parameters to give a range of likely values.
Example
In the example the gamma pdf is plotted with the solid line. The normal
pdf has a dashed line type.
x = gaminv((0.005:0.01:0.995),100,10);
y = gampdf(x,100,10);
y1 = normpdf(x,1000,100);
plot(x,y,'-',x,y1,'-.')
See Also
“Continuous Distributions (Data)” on page 5-4
Gaussian Distribution
See “Normal Distribution” on page B-83.
Gaussian Mixture Distributions
Generalized Extreme Value Distribution
Definition
The probability density function for the generalized extreme value distribution
with location parameter µ, scale parameter σ, and shape parameter k ≠ 0 is
$$y = f(x\,|\,k,\mu,\sigma) = \left(\frac{1}{\sigma}\right) \exp\left(-\left(1+k\,\frac{(x-\mu)}{\sigma}\right)^{-1/k}\right) \left(1+k\,\frac{(x-\mu)}{\sigma}\right)^{-1-\frac{1}{k}}$$

for

$$1 + k\,\frac{(x-\mu)}{\sigma} > 0$$

k > 0 corresponds to the Type II case, while k < 0 corresponds to the Type III case. For k = 0, corresponding to the Type I case, the density is

$$y = f(x\,|\,0,\mu,\sigma) = \left(\frac{1}{\sigma}\right) \exp\left(-\exp\left(-\frac{(x-\mu)}{\sigma}\right) - \frac{(x-\mu)}{\sigma}\right)$$
Background
Like the extreme value distribution, the generalized extreme value
distribution is often used to model the smallest or largest value among a
large set of independent, identically distributed random values representing
measurements or observations. For example, you might have batches of 1000
washers from a manufacturing process. If you record the size of the largest
washer in each batch, the data are known as block maxima (or minima if you
record the smallest). You can use the generalized extreme value distribution
as a model for those block maxima.
The three cases covered by the generalized extreme value distribution are
often referred to as the Types I, II, and III. Each type corresponds to the
limiting distribution of block maxima from a different class of underlying
distributions. Distributions whose tails decrease exponentially, such as the
normal, lead to the Type I. Distributions whose tails decrease as a polynomial,
such as Student’s t, lead to the Type II. Distributions whose tails are finite,
such as the beta, lead to the Type III.
Types I, II, and III are sometimes also referred to as the Gumbel, Frechet,
and Weibull types, though this terminology can be slightly confusing. The
Type I (Gumbel) and Type III (Weibull) cases actually correspond to the
mirror images of the usual Gumbel and Weibull distributions, for example,
as computed by the functions evcdf and evfit , or wblcdf and wblfit,
respectively. Finally, the Type II (Frechet) case is equivalent to taking the
reciprocal of values from a standard Weibull distribution.
Parameters
If you generate 250 blocks of 1000 random values drawn from Student’s t
distribution with 5 degrees of freedom, and take their maxima, you can fit a
generalized extreme value distribution to those maxima.
blocksize = 1000;
nblocks = 250;
t = trnd(5,blocksize,nblocks);
x = max(t); % 250 column maxima
paramEsts = gevfit(x)
Notice that the shape parameter estimate (the first element) is positive,
which is what you would expect based on block maxima from a Student’s t
distribution.
hist(x,2:20);
set(get(gca,'child'),'FaceColor',[.8 .8 1])
xgrid = linspace(2,20,1000);
line(xgrid,nblocks*...
gevpdf(xgrid,paramEsts(1),paramEsts(2),paramEsts(3)));
Example
The following code generates examples of probability density functions for the
three basic forms of the generalized extreme value distribution.
x = linspace(-3,6,1000);
y1 = gevpdf(x,-.5,1,0);
y2 = gevpdf(x,0,1,0);
y3 = gevpdf(x,.5,1,0);
plot(x,y1,'-', x,y2,'-', x,y3,'-')
legend({'K<0, Type III' 'K=0, Type I' 'K>0, Type II'});
Notice that for k > 0, the distribution has zero probability density for x such that

$$x < -\frac{\sigma}{k} + \mu$$

For k < 0, the distribution has zero probability density for

$$x > -\frac{\sigma}{k} + \mu$$
See Also
“Continuous Distributions (Data)” on page 5-4
Generalized Pareto Distribution
Definition
The probability density function for the generalized Pareto distribution with
shape parameter k ≠ 0, scale parameter σ, and threshold parameter θ, is
$$y = f(x\,|\,k,\sigma,\theta) = \left(\frac{1}{\sigma}\right)\left(1+k\,\frac{(x-\theta)}{\sigma}\right)^{-1-\frac{1}{k}}$$

for θ < x when k > 0, or for θ < x < θ − σ/k when k < 0. For k = 0 the density is

$$y = f(x\,|\,0,\sigma,\theta) = \left(\frac{1}{\sigma}\right) e^{-\frac{(x-\theta)}{\sigma}}$$

for θ < x.
Background
Like the exponential distribution, the generalized Pareto distribution is often
used to model the tails of another distribution. For example, you might
have washers from a manufacturing process. If random influences in the
process lead to differences in the sizes of the washers, a standard probability
distribution, such as the normal, could be used to model those sizes. However,
while the normal distribution might be a good model near its mode, it might
not be a good fit to real data in the tails and a more complex model might
be needed to describe the full range of the data. On the other hand, only
recording the sizes of washers larger (or smaller) than a certain threshold
means you can fit a separate model to those tail data, which are known as
exceedences. You can use the generalized Pareto distribution in this way, to
provide a good fit to extremes of complicated data.
The generalized Pareto distribution has three basic forms, each corresponding
to a limiting distribution of exceedence data from a different class of
underlying distributions.
Parameters
If you generate a large number of random values from a Student’s t
distribution with 5 degrees of freedom, and then discard everything less than
2, you can fit a generalized Pareto distribution to those exceedences.
t = trnd(5,5000,1);
y = t(t > 2) - 2;
paramEsts = gpfit(y)
paramEsts =
0.1267 0.8134
Notice that the shape parameter estimate (the first element) is positive, which
is what you would expect based on exceedences from a Student’s t distribution.
hist(y+2,2.25:.5:11.75);
set(get(gca,'child'),'FaceColor',[.8 .8 1])
xgrid = linspace(2,12,1000);
line(xgrid,.5*length(y)*...
gppdf(xgrid,paramEsts(1),paramEsts(2),2));
Example
The following code generates examples of the probability density functions for
the three basic forms of the generalized Pareto distribution.
x = linspace(0,10,1000);
y1 = gppdf(x,-.25,1,0);
y2 = gppdf(x,0,1,0);
y3 = gppdf(x,1,1,0);
plot(x,y1,'-', x,y2,'-', x,y3,'-')
legend({'K<0' 'K=0' 'K>0'});
Notice that for k < 0, the distribution has zero probability density for $x > -\sigma/k$, while for k ≥ 0, there is no upper bound.
See Also
“Continuous Distributions (Data)” on page 5-4
Geometric Distribution
In this section...
“Definition” on page B-41
“Background” on page B-41
“Example” on page B-41
“See Also” on page B-42
Definition
The geometric pdf is
$$y = f(x\,|\,p) = p\,q^{x}\, I_{(0,1,\ldots)}(x)$$
Background
The geometric distribution is discrete, existing only on the nonnegative
integers. It is useful for modeling the runs of consecutive successes (or
failures) in repeated independent trials of a system.
The geometric distribution models the number of successes before one failure
in an independent succession of tests where each test results in success or
failure.
Example
Suppose the probability of a five-year-old battery failing in cold weather is 0.03. What is the probability of the car starting on 25 consecutive days during a long cold snap?
1 - geocdf(25,0.03)
ans =
0.4530
x = 0:25;
y = geocdf(x,0.03);
stairs(x,y)
See Also
“Discrete Distributions” on page 5-7
Hypergeometric Distribution
In this section...
“Definition” on page B-43
“Background” on page B-43
“Example” on page B-44
“See Also” on page B-44
Definition
The hypergeometric pdf is
$$y = f(x\,|\,M,K,n) = \frac{\dbinom{K}{x}\dbinom{M-K}{n-x}}{\dbinom{M}{n}}$$
Background
The hypergeometric distribution models the total number of successes in a
fixed-size sample drawn without replacement from a finite population.
The distribution is discrete, existing only for nonnegative integers no greater than the number of samples or the number of possible successes, whichever is smaller. The hypergeometric distribution differs from the binomial only in
that the population is finite and the sampling from the population is without
replacement.
Example
The plot shows the cdf of an experiment taking 20 samples from a group of
1000 where there are 50 items of the desired type.
x = 0:10;
y = hygecdf(x,1000,50,20);
stairs(x,y)
See Also
“Discrete Distributions” on page 5-7
Inverse Gaussian Distribution
Definition
The inverse Gaussian distribution has the density function
$$f(x\,|\,\mu,\lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}}\, \exp\left\{-\frac{\lambda}{2\mu^{2}x}\,(x-\mu)^{2}\right\}$$
Background
Also known as the Wald distribution, the inverse Gaussian is used to model
nonnegative positively skewed data. The distribution originated in the theory
of Brownian motion, but has been used to model diverse phenomena. Inverse
Gaussian distributions have many similarities to standard Gaussian (normal)
distributions, which lead to applications in inferential statistics.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Inverse Wishart Distribution
Definition
The probability density function of the d-dimensional inverse Wishart distribution is given by

$$y = f(X,\Sigma,\nu) = \frac{\left|T\right|^{\nu/2}\, e^{-\frac{1}{2}\,\mathrm{trace}\left(T X^{-1}\right)}}{2^{\nu d/2}\,\pi^{d(d-1)/4}\,\left|X\right|^{(\nu+d+1)/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\cdots\Gamma\!\left(\frac{\nu-(d-1)}{2}\right)}$$

where X and T are d-by-d symmetric positive definite matrices, and ν is a scalar greater than or equal to d. While it is possible to define the inverse Wishart for singular T, the density cannot be written as above.
If a random matrix has a Wishart distribution with parameters T⁻¹ and ν, then the inverse of that random matrix has an inverse Wishart distribution with parameters T and ν. The mean of the distribution is given by

$$\frac{1}{\nu - d - 1}\, T$$

where d is the number of rows and columns in T.
Background
The inverse Wishart distribution is based on the Wishart distribution. In
Bayesian statistics it is used as the conjugate prior for the covariance matrix
of a multivariate normal distribution.
Example
Notice that the sampling variability is quite large when the degrees of
freedom is small.
Tau = [1 .5; .5 2];
df = 10; S1 = iwishrnd(Tau,df)*(df-2-1)
S1 =
1.7959 0.64107
0.64107 1.5496
df = 1000; S2 = iwishrnd(Tau,df)*(df-2-1)
S2 =
0.9842 0.50158
0.50158 2.1682
See Also
“Multivariate Distributions” on page 5-8
Johnson System
See “Pearson and Johnson Systems” on page 6-27.
Logistic Distribution
In this section...
“Definition” on page B-49
“Background” on page B-49
“Parameters” on page B-49
“See Also” on page B-49
Definition
The logistic distribution has the density function
$$\frac{e^{\frac{x-\mu}{\sigma}}}{\sigma\left(1+e^{\frac{x-\mu}{\sigma}}\right)^{2}}$$
with location parameter µ and scale parameter σ > 0, for all real x.
Background
The logistic distribution originated with Verhulst’s work on demography in
the early 1800s. The distribution has been used for various growth models,
and is used in logistic regression. It has longer tails and a higher kurtosis
than the normal distribution.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Loglogistic Distribution
In this section...
“Definition” on page B-50
“Parameters” on page B-50
“See Also” on page B-50
Definition
The variable x has a loglogistic distribution with location parameter µ and
scale parameter σ > 0 if ln x has a logistic distribution with parameters µ
and σ. The relationship is similar to that between the lognormal and normal
distribution.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Lognormal Distribution
In this section...
“Definition” on page B-51
“Background” on page B-51
“Example” on page B-52
“See Also” on page B-53
Definition
The lognormal pdf is
$$y = f(x\,|\,\mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}}\; e^{\frac{-\left(\ln x - \mu\right)^{2}}{2\sigma^{2}}}$$
Background
The normal and lognormal distributions are closely related. If X is distributed
lognormally with parameters µ and σ, then log(X) is distributed normally
with mean µ and standard deviation σ.
The mean m and variance v of a lognormal random variable are functions of µ and σ:

$$m = \exp\left(\mu + \sigma^{2}/2\right)$$

$$v = \exp\left(2\mu + \sigma^{2}\right)\left(\exp\left(\sigma^{2}\right) - 1\right)$$

Conversely, a lognormal distribution with mean m and variance v has parameters

$$\mu = \log\left(m^{2}\big/\sqrt{v + m^{2}}\right)$$

$$\sigma = \sqrt{\log\left(v/m^{2} + 1\right)}$$
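A quick sketch that converts a target mean and variance into lognormal parameters and checks the round trip (the values m = 2, v = 1 are arbitrary):

m = 2; v = 1;
mu = log(m^2/sqrt(v+m^2));
sigma = sqrt(log(v/m^2 + 1));
[mCheck,vCheck] = lognstat(mu,sigma)   % returns 2 and 1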
Example
Suppose the income of a family of four in the United States follows a lognormal
distribution with µ = log(20,000) and σ2 = 1.0. Plot the income density.
x = (10:1000:125010)';
y = lognpdf(x,log(20000),1.0);
plot(x,y)
set(gca,'xtick',[0 30000 60000 90000 120000])
set(gca,'xticklabel',{'0','$30,000','$60,000',...
'$90,000','$120,000'})
See Also
“Continuous Distributions (Data)” on page 5-4
Multinomial Distribution
In this section...
“Definition” on page B-54
“Background” on page B-54
“Example” on page B-54
Definition
The multinomial pdf is
$$f(x\,|\,n,p) = \frac{n!}{x_1!\cdots x_k!}\; p_1^{x_1}\cdots p_k^{x_k}$$
Background
The multinomial distribution is a generalization of the binomial distribution.
The binomial distribution gives the probability of the number of “successes”
and “failures” in n independent trials of a two-outcome process. The
probability of “success” and “failure” in any one trial is given by the fixed
probabilities p and q = 1–p. The multinomial distribution gives the probability
of each combination of outcomes in n independent trials of a k-outcome
process. The probability of each outcome in any one trial is given by the fixed
probabilities p1,...,pk.
The expected value of outcome i is npi. The variance of outcome i is npi(1 – pi).
The covariance of outcomes i and j is –npipj for distinct i and j.
Example
The following uses mnpdf to produce a visualization of a trinomial distribution:
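A minimal sketch of one such visualization (the parameters n = 10 and p = [1/2 1/3 1/6] are illustrative choices, not from the original text):

n = 10; p = [1/2 1/3 1/6];
[x1,x2] = meshgrid(0:n,0:n);
x3 = n - x1 - x2;
valid = x3 >= 0;                    % keep outcomes with x1+x2+x3 = n
y = zeros(size(x1));
y(valid) = mnpdf([x1(valid) x2(valid) x3(valid)],p);
bar3(y)                             % probability over the (x1,x2) grid
xlabel('x_1'); ylabel('x_2'); zlabel('Probability');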
Note that the visualization does not show x3, which is determined by the
constraint x1 + x2 + x3 = n.
Multivariate Gaussian Distribution
See “Multivariate Normal Distribution” on page B-58.
Multivariate Normal Distribution
Definition
The probability density function of the d-dimensional multivariate normal
distribution is given by
$$y = f(x,\mu,\Sigma) = \frac{1}{\sqrt{\left|\Sigma\right|\,(2\pi)^{d}}}\; e^{-\frac{1}{2}\,(x-\mu)\,\Sigma^{-1}(x-\mu)'}$$
where x and μ are 1-by-d vectors and Σ is a d-by-d symmetric positive definite
matrix. While it is possible to define the multivariate normal for singular Σ,
the density cannot be written as above. Only random vector generation is
supported for the singular case. Note that while most textbooks define the
multivariate normal with x and μ oriented as column vectors, for the purposes
of data analysis software, it is more convenient to orient them as row vectors,
and Statistics Toolbox software uses that orientation.
Background
The multivariate normal distribution is a generalization of the univariate
normal to two or more variables. It is a distribution for random vectors
of correlated variables, each element of which has a univariate normal
distribution. In the simplest case, there is no correlation among variables, and
elements of the vectors are independent univariate normal random variables.
The multivariate normal distribution is parameterized with a mean vector, μ, and a covariance matrix, Σ. The diagonal elements of Σ contain the variances for each variable, while the off-diagonal elements of Σ
contain the covariances between variables.
Example
This example shows the probability density function (pdf) and cumulative
distribution function (cdf) for a bivariate normal distribution with unequal
standard deviations. You can use the multivariate normal distribution in a
higher number of dimensions as well, although visualization is not easy.
mu = [0 0];
Sigma = [.25 .3; .3 1];
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('x1'); ylabel('x2'); zlabel('Probability Density');
F = mvncdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
axis([-3 3 -3 3 0 1])
xlabel('x1'); ylabel('x2'); zlabel('Cumulative Probability');
Since the bivariate normal distribution is defined on the plane, you can also
compute cumulative probabilities over rectangular regions. For example,
this contour plot illustrates the computation that follows, of the probability
contained within the unit square.
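A sketch of that computation, reusing mu and Sigma from the code above:

F = mvncdf([0 0],[1 1],mu,Sigma)   % probability over the unit square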
See Also
“Multivariate Distributions” on page 5-8
Multivariate t Distribution
In this section...
“Definition” on page B-64
“Background” on page B-64
“Example” on page B-65
“See Also” on page B-69
Definition
The probability density function of the d-dimensional multivariate Student’s t
distribution is given by
$$y = f(x,P,\nu) = \frac{1}{\left|P\right|^{1/2}}\, \frac{1}{(\nu\pi)^{d/2}}\, \frac{\Gamma\!\left((\nu+d)/2\right)}{\Gamma\!\left(\nu/2\right)}\, \left(1 + \frac{x\,P^{-1}\,x'}{\nu}\right)^{-(\nu+d)/2}$$
where x is a 1-by-d vector, P is a d-by-d symmetric, positive definite matrix,
and ν is a positive scalar. While it is possible to define the multivariate
Student’s t for singular P, the density cannot be written as above. For the
singular case, only random number generation is supported. Note that while
most textbooks define the multivariate Student’s t with x oriented as a column
vector, for the purposes of data analysis software, it is more convenient to
orient x as a row vector, and Statistics Toolbox software uses that orientation.
Background
The multivariate Student’s t distribution is a generalization of the univariate
Student’s t to two or more variables. It is a distribution for random vectors
of correlated variables, each element of which has a univariate Student’s t
distribution. In the same way as the univariate Student’s t distribution can
be constructed by dividing a standard univariate normal random variable by
the square root of a univariate chi-square random variable, the multivariate
Student’s t distribution can be constructed by dividing a multivariate
normal random vector having zero mean and unit variances by a univariate
chi-square random variable.
Example
This example shows the probability density function (pdf) and cumulative
distribution function (cdf) for a bivariate Student’s t distribution. You can use
the multivariate Student’s t distribution in a higher number of dimensions as
well, although visualization is not easy.
Rho = [1 .6; .6 1]; nu = 5;   % illustrative correlation matrix and dof
x1 = -3:.2:3; x2 = -3:.2:3;
[X1,X2] = meshgrid(x1,x2);
F = mvtcdf([X1(:) X2(:)],Rho,nu);
F = reshape(F,length(x2),length(x1));
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
axis([-3 3 -3 3 0 1])
xlabel('x1'); ylabel('x2'); zlabel('Cumulative Probability');
Since the bivariate Student’s t distribution is defined on the plane, you can
also compute cumulative probabilities over rectangular regions. For example,
this contour plot illustrates the computation that follows, of the probability
contained within the unit square.
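A sketch of that computation, reusing Rho and nu from above:

F = mvtcdf([0 0],[1 1],Rho,nu)   % probability over the unit square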
See Also
“Multivariate Distributions” on page 5-8
Nakagami Distribution
In this section...
“Definition” on page B-70
“Background” on page B-70
“Parameters” on page B-70
“See Also” on page B-71
Definition
The Nakagami distribution has the density function
$$2\left(\frac{\mu}{\omega}\right)^{\mu} \frac{1}{\Gamma(\mu)}\; x^{(2\mu-1)}\, e^{-\frac{\mu}{\omega}\,x^{2}}$$
with shape parameter µ and scale parameter ω > 0, for x > 0. If x has a
Nakagami distribution with parameters µ and ω, then x2 has a gamma
distribution with shape parameter µ and scale parameter ω/µ.
Background
In communications theory, Nakagami distributions, Rician distributions,
and Rayleigh distributions are used to model scattered signals that reach
a receiver by multiple paths. Depending on the density of the scatter, the
signal will display different fading characteristics. Rayleigh and Nakagami
distributions are used to model dense scatters, while Rician distributions
model fading with a stronger line-of-sight. Nakagami distributions can be
reduced to Rayleigh distributions, but give more control over the extent
of the fading.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Negative Binomial Distribution
Definition
When the r parameter is an integer, the negative binomial pdf is

$$y = f(x\,|\,r,p) = \binom{r+x-1}{x}\, p^{r} q^{x}\, I_{(0,1,\ldots)}(x)$$

where q = 1 − p. When r is not an integer, the binomial coefficient in the definition is replaced by the equivalent expression

$$\frac{\Gamma(r+x)}{\Gamma(r)\,\Gamma(x+1)}$$
Background
In its simplest form (when r is an integer), the negative binomial distribution
models the number of failures x before a specified number of successes is
reached in a series of independent, identical trials. Its parameters are the
probability of success in a single trial, p, and the number of successes, r. A
special case of the negative binomial distribution, when r = 1, is the geometric
distribution, which models the number of failures before the first success.
More generally, r can take on non-integer values. This form of the negative
binomial distribution has no interpretation in terms of repeated trials, but,
like the Poisson distribution, it is useful in modeling count data. The negative
binomial distribution is more general than the Poisson distribution because it
has a variance that is greater than its mean, making it suitable for count data
that do not meet the assumptions of the Poisson distribution. In the limit,
as r increases to infinity, the negative binomial distribution approaches the
Poisson distribution.
Parameters
Suppose you are collecting data on the number of auto accidents on a busy
highway, and would like to be able to model the number of accidents per day.
Because these are count data, and because there are a very large number of
cars and a small probability of an accident for any specific car, you might
think to use the Poisson distribution. However, the probability of having an
accident is likely to vary from day to day as the weather and amount of traffic
change, and so the assumptions needed for the Poisson distribution are not
met. In particular, the variance of this type of count data sometimes exceeds
the mean by a large amount. The data below exhibit this effect: most days
have few or no accidents, and a few days have a large number.
accident = [2 3 4 2 3 1 12 8 14 31 23 1 10 7 0];
mean(accident)
ans =
8.0667
var(accident)
ans =
79.352
The negative binomial distribution is more general than the Poisson, and is
often suitable for count data when the Poisson is not. The function nbinfit
returns the maximum likelihood estimates (MLEs) and confidence intervals
for the parameters of the negative binomial distribution. Here are the results
from fitting the accident data:
[phat,pci] = nbinfit(accident)
phat =
1.0060 0.1109
pci =
0.2152 0.0171
1.7968 0.2046
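As a quick sanity check (a sketch, not part of the original example), compare
the mean and variance implied by the fitted parameters with the sample
moments computed earlier:

[m,v] = nbinstat(phat(1),phat(2))   % fitted mean and variance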
The following commands plot the cumulative distribution function of the
fitted model:
plot(0:50,nbincdf(0:50,phat(1),phat(2)),'.-');
xlabel('Accidents per Day')
ylabel('Cumulative Probability')
Example
The negative binomial distribution can take on a variety of shapes ranging
from very skewed to nearly symmetric. This example plots the probability
function for different values of r, the desired number of successes: .1, 1, 3, 6.
x = 0:10;
plot(x,nbinpdf(x,.1,.5),'s-', ...
x,nbinpdf(x,1,.5),'o-', ...
x,nbinpdf(x,3,.5),'d-', ...
x,nbinpdf(x,6,.5),'^-');
legend({'r = .1' 'r = 1' 'r = 3' 'r = 6'})
xlabel('x')
ylabel('f(x|r,p)')
See Also
“Discrete Distributions” on page 5-7
Noncentral Chi-Square Distribution
Definition
There are many equivalent formulas for the noncentral chi-square distribution
function. One formulation uses a modified Bessel function of the first
kind. Another uses the generalized Laguerre polynomials. The cumulative
distribution function is computed using a weighted sum of χ² probabilities,
with the weights equal to the probabilities of a Poisson distribution.
The Poisson parameter is one-half of the noncentrality parameter of the
noncentral chi-square:

$$ F(x \mid \nu, \delta) = \sum_{j=0}^{\infty} \left(\frac{\left(\frac{1}{2}\delta\right)^j}{j!}\, e^{-\frac{\delta}{2}}\right) \Pr\left[\chi^2_{\nu + 2j} \le x\right] $$
Background
The χ² distribution is actually a simple special case of the noncentral
chi-square distribution. One way to generate random numbers with a χ²
distribution (with ν degrees of freedom) is to sum the squares of ν standard
normal random numbers (each with mean equal to zero).
What if the normally distributed quantities have a mean other than zero? The
sum of squares of these numbers yields the noncentral chi-square distribution.
The noncentral chi-square distribution requires two parameters: the degrees
of freedom and the noncentrality parameter. The noncentrality parameter is
the sum of the squared means of the normally distributed quantities.
Example
The following commands generate a plot of the noncentral chi-square pdf.
x = (0:0.1:10)';
p1 = ncx2pdf(x,4,2);
p = chi2pdf(x,4);
plot(x,p,'-',x,p1,'-')
Noncentral F Distribution
In this section...
“Definition” on page B-78
“Background” on page B-78
“Example” on page B-79
“See Also” on page B-79
Definition
As with the noncentral χ² distribution, the toolbox computes noncentral
F distribution probabilities as a weighted sum of incomplete beta functions,
using Poisson probabilities as the weights:

$$ F(x \mid \nu_1, \nu_2, \delta) = \sum_{j=0}^{\infty} \left(\frac{\left(\frac{1}{2}\delta\right)^j}{j!}\, e^{-\frac{\delta}{2}}\right) I\!\left(\frac{\nu_1 \cdot x}{\nu_2 + \nu_1 \cdot x} \;\Big|\; \frac{\nu_1}{2} + j,\, \frac{\nu_2}{2}\right) $$

where I(x|a,b) is the incomplete beta function with parameters a and b, and
δ is the noncentrality parameter.
Background
As with the χ² distribution, the F distribution is a special case of the
noncentral F distribution. The F distribution is the result of taking the
ratio of two χ² random variables, each divided by its degrees of freedom.
Example
The following commands generate a plot of the noncentral F pdf.
x = (0.01:0.1:10.01)';
p1 = ncfpdf(x,5,20,10);
p = fpdf(x,5,20);
plot(x,p,'-',x,p1,'-')
See Also
“Continuous Distributions (Statistics)” on page 5-6
Noncentral t Distribution
In this section...
“Definition” on page B-80
“Background” on page B-80
“Example” on page B-81
“See Also” on page B-81
Definition
The most general representation of the noncentral t distribution is quite
complicated. Johnson and Kotz [60] give a formula for the probability that a
noncentral t variate falls in the range [–t, t].
$$ \Pr\left((-t) < x < t \mid \nu, \delta\right) = \sum_{j=0}^{\infty} \left(\frac{\left(\frac{1}{2}\delta^2\right)^j}{j!}\, e^{-\frac{\delta^2}{2}}\right) I\!\left(\frac{x^2}{\nu + x^2} \;\Big|\; \frac{1}{2} + j,\, \frac{\nu}{2}\right) $$
Background
The noncentral t distribution is a generalization of Student’s t
distribution. Student’s t distribution with n – 1 degrees of freedom models
the t-statistic

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$

where x̄ is the sample mean and s is the sample standard deviation of a
random sample of size n from a normal population with mean μ. If the
population mean is actually μ0, then the t-statistic has a noncentral t
distribution with noncentrality parameter

$$ \delta = \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} $$
The noncentral t distribution gives the probability that a t test will correctly
reject a false null hypothesis of mean μ when the population mean is actually
μ0; that is, it gives the power of the t test. The power increases as the
difference μ0 – μ increases, and also as the sample size n increases.
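For example, here is a hedged sketch of a power computation for a two-sided,
one-sample t test; the sample size, effect size, and significance level are
illustrative choices, not values from the original text:

n = 20; d = 0.5; sigma = 1; alpha = 0.05;   % illustrative settings
delta = d/(sigma/sqrt(n));                  % noncentrality parameter
tcrit = tinv(1-alpha/2,n-1);                % two-sided critical value
power = 1 - nctcdf(tcrit,n-1,delta) + nctcdf(-tcrit,n-1,delta)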
Example
The following commands generate a plot of the noncentral t pdf, compared
with the central t pdf.
x = (-5:0.1:5)';
p1 = nctpdf(x,10,1);
p = tpdf(x,10);
plot(x,p,'-',x,p1,'-')
See Also
“Continuous Distributions (Statistics)” on page 5-6
Nonparametric Distributions
See the discussion of ksdensity in “Estimating PDFs without Parameters”
on page 5-55.
Normal Distribution
In this section...
“Definition” on page B-83
“Background” on page B-83
“Parameters” on page B-84
“Example” on page B-85
“See Also” on page B-85
Definition
The normal pdf is
$$ y = f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{\frac{-(x - \mu)^2}{2\sigma^2}} $$
Background
The normal distribution is a two-parameter family of curves. The first
parameter, µ, is the mean. The second, σ, is the standard deviation. The
standard normal distribution (written Φ(x)) sets µ to 0 and σ to 1.
Φ(x) is functionally related to the error function, erf:

$$ \operatorname{erf}(x) = 2\Phi\!\left(x\sqrt{2}\,\right) - 1 $$
The first use of the normal distribution was as a continuous approximation
to the binomial.
The usual justification for using the normal distribution for modeling is the
Central Limit Theorem, which states (roughly) that the sum of independent
samples from any distribution with finite mean and variance converges to the
normal distribution as the sample size goes to infinity.
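As a quick illustration (a sketch; the sample counts are arbitrary), summing
independent uniform draws produces an approximately normal histogram:

sums = sum(rand(12,10000));   % 10,000 sums of 12 independent U(0,1) draws
hist(sums - 6,50)             % each sum has mean 6 and variance 1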
Parameters
To use statistical parameters such as mean and standard deviation reliably,
you need to have a good estimator for them. The maximum likelihood
estimates (MLEs) provide one such estimator. However, an MLE might be
biased, which means that its expected value might not equal the parameter
being estimated. For example, the MLE is biased for
estimating the variance of a normal distribution. An unbiased estimator
that is commonly used to estimate the parameters of the normal distribution
is the minimum variance unbiased estimator (MVUE). The MVUE has the
minimum variance of all unbiased estimators of a parameter.
The MVUEs of parameters µ and σ2 for the normal distribution are the sample
mean and variance. The sample mean is also the MLE for µ. The following
are two common formulas for the variance:

$$ s^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 \qquad \text{(B-1)} $$

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 \qquad \text{(B-2)} $$

where

$$ \bar{x} = \sum_{i=1}^{n} \frac{x_i}{n} $$

Equation B-1 is the maximum likelihood estimator of σ², and equation B-2 is
the MVUE.
As an example, suppose you want to estimate the mean, µ, and the variance,
σ2, of the heights of all fourth grade children in the United States. The
function normfit returns the MVUE for µ, the square root of the MVUE for
σ2, and confidence intervals for µ and σ2. Here is a playful example modeling
the heights in inches of a randomly chosen fourth grade class.
height = normrnd(50,2,30,1);   % simulated heights; the original sampling
                               % command was lost, so this is an assumed
                               % reconstruction
[mu,s,muci,sci] = normfit(height)

mu =
50.2025
s =
1.7946
muci =
49.5210
50.8841
sci =
1.4292
2.4125

s^2

ans =
3.2206
Example
The plot shows the bell curve of the standard normal pdf, with µ = 0 and σ = 1.
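The plotting commands did not survive extraction; a minimal sketch (the grid
spacing is an assumption):

x = -3:0.2:3;
y = normpdf(x,0,1);
plot(x,y)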
See Also
“Continuous Distributions (Data)” on page 5-4
Pareto Distribution
See “Generalized Pareto Distribution” on page B-37.
Pearson System
See “Pearson and Johnson Systems” on page 6-27.
Piecewise Distributions
See the discussion of the @piecewisedistribution class in “Fitting Piecewise
Distributions” on page 5-72.
Poisson Distribution
In this section...
“Definition” on page B-89
“Background” on page B-89
“Parameters” on page B-90
“Example” on page B-90
“See Also” on page B-90
Definition
The Poisson pdf is
$$ y = f(x \mid \lambda) = \frac{\lambda^x}{x!}\, e^{-\lambda}\, I_{(0,1,\ldots)}(x) $$
Background
The Poisson distribution is appropriate for applications that involve counting
the number of times a random event occurs in a given amount of time,
distance, area, etc. Sample applications that involve Poisson distributions
include the number of Geiger counter clicks per second, the number of people
walking into a store in an hour, and the number of flaws per 1000 feet of
video tape.
Parameters
Both the MLE and the MVUE of the Poisson parameter, λ, equal the sample mean.
The sum of independent Poisson random variables is also Poisson distributed,
with parameter equal to the sum of the individual parameters. This fact is
used to calculate confidence intervals for λ. As λ gets large, the Poisson
distribution can be approximated by a normal distribution with µ = λ and
σ² = λ. This approximation is used to calculate confidence intervals for
values of λ greater than 100.
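For example, a minimal fitting sketch (the counts vector is hypothetical):

counts = [2 1 0 3 2 4 1 2];                % hypothetical daily counts
[lambdahat,lambdaci] = poissfit(counts)    % MLE and 95% confidence interval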
Example
The plot shows the probability for each nonnegative integer when λ = 5.
x = 0:15;
y = poisspdf(x,5);
plot(x,y,'+')
See Also
“Discrete Distributions” on page 5-7
Rayleigh Distribution
In this section...
“Definition” on page B-91
“Background” on page B-91
“Parameters” on page B-92
“Example” on page B-92
“See Also” on page B-92
Definition
The Rayleigh pdf is
$$ y = f(x \mid b) = \frac{x}{b^2}\, e^{\left(\frac{-x^2}{2b^2}\right)} $$
Background
The Rayleigh distribution is a special case of the Weibull distribution. If
A and B are the parameters of the Weibull distribution, then the Rayleigh
distribution with parameter b is equivalent to the Weibull distribution with
parameters $A = \sqrt{2}\,b$ and B = 2.
Parameters
The raylfit function returns the MLE of the Rayleigh parameter. This
estimate is
$$ \hat{b} = \sqrt{\frac{1}{2n} \sum_{i=1}^{n} x_i^2} $$
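A minimal usage sketch (the sample is simulated):

r = raylrnd(0.5,100,1);   % simulated Rayleigh sample with b = 0.5
[bhat,bci] = raylfit(r)   % MLE and 95% confidence interval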
Example
The following commands generate a plot of the Rayleigh pdf.
x = [0:0.01:2];
p = raylpdf(x,0.5);
plot(x,p)
See Also
“Continuous Distributions (Data)” on page 5-4
Rician Distribution
In this section...
“Definition” on page B-93
“Background” on page B-93
“Parameters” on page B-93
“See Also” on page B-94
Definition
The Rician distribution has the density function
$$ f(x \mid s, \sigma) = I_0\!\left(\frac{x s}{\sigma^2}\right) \frac{x}{\sigma^2}\, e^{-\left(\frac{x^2 + s^2}{2\sigma^2}\right)} $$

with noncentrality parameter s ≥ 0 and scale parameter σ > 0, for x > 0.
Here I₀ is the zero-order modified Bessel function of the first kind.
Background
In communications theory, Nakagami distributions, Rician distributions,
and Rayleigh distributions are used to model scattered signals that reach
a receiver by multiple paths. Depending on the density of the scatter, the
signal will display different fading characteristics. Rayleigh and Nakagami
distributions are used to model dense scatters, while Rician distributions
model fading with a stronger line-of-sight. Nakagami distributions can be
reduced to Rayleigh distributions, but give more control over the extent
of the fading.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Data)” on page 5-4
Student’s t Distribution
In this section...
“Definition” on page B-95
“Background” on page B-95
“Example” on page B-96
“See Also” on page B-96
Definition
Student’s t pdf is
$$ y = f(x \mid \nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\, \frac{1}{\sqrt{\nu \pi}}\, \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{\frac{\nu + 1}{2}}} $$
Background
The t distribution is a family of curves depending on a single parameter ν
(the degrees of freedom). As ν goes to infinity, the t distribution
approaches the standard normal distribution. If x is a random sample of size
n from a normal distribution with mean μ, then the statistic

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$

where x̄ is the sample mean and s is the sample standard deviation, has
Student’s t distribution with n – 1 degrees of freedom.
Example
The plot compares the t distribution with ν = 5 (solid line) to the
shorter-tailed, standard normal distribution (dashed line).
x = -5:0.1:5;
y = tpdf(x,5);
z = normpdf(x,0,1);
plot(x,y,'-',x,z,'-.')
See Also
“Continuous Distributions (Statistics)” on page 5-6
t Location-Scale Distribution
In this section...
“Definition” on page B-97
“Background” on page B-97
“Parameters” on page B-97
“See Also” on page B-98
Definition
The t location-scale distribution has the density function

$$ \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sigma \sqrt{\nu \pi}\, \Gamma\!\left(\frac{\nu}{2}\right)} \left[\frac{\nu + \left(\frac{x - \mu}{\sigma}\right)^2}{\nu}\right]^{-\left(\frac{\nu + 1}{2}\right)} $$

with location parameter µ, scale parameter σ > 0, and shape parameter ν > 0.
If x has a t location-scale distribution, with parameters µ, σ, and ν, then

$$ \frac{x - \mu}{\sigma} $$

has a Student’s t distribution with ν degrees of freedom.
Background
The t location-scale distribution is useful for modeling data distributions
with heavier tails (more prone to outliers) than the normal distribution. It
approaches the normal distribution as ν approaches infinity, and smaller
values of ν yield heavier tails.
Parameters
See mle, dfittool.
See Also
“Continuous Distributions (Statistics)” on page 5-6
Uniform Distribution (Continuous)
Definition
The uniform cdf is
$$ p = F(x \mid a, b) = \frac{x - a}{b - a}\, I_{[a,b]}(x) $$
Background
The uniform distribution (also called rectangular) has a constant pdf between
its two parameters a (the minimum) and b (the maximum). The standard
uniform distribution (a = 0 and b = 1) is a special case of the beta distribution,
obtained by setting both of its parameters to 1.
Parameters
The sample minimum and maximum are the MLEs of a and b respectively.
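For example, a minimal fitting sketch with unifit (the sample is simulated
from a uniform distribution on [2, 5]):

data = 2 + 3*rand(100,1);    % simulated sample from U(2,5)
[ahat,bhat] = unifit(data)   % MLEs: the sample minimum and maximum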
Example
The example illustrates the inversion method for generating normal random
numbers using rand and norminv. Note that the MATLAB function, randn,
does not use inversion since it is not efficient for this case.
u = rand(1000,1);
x = norminv(u,0,1);
hist(x)
See Also
“Continuous Distributions (Data)” on page 5-4
Uniform Distribution (Discrete)
Definition
The discrete uniform pdf is
$$ y = f(x \mid N) = \frac{1}{N}\, I_{(1,\ldots,N)}(x) $$
Background
The discrete uniform distribution is a simple distribution that puts equal
weight on the integers from one to N.
Example
As for all discrete distributions, the cdf is a step function. The plot shows
the discrete uniform cdf for N = 10.
x = 0:10;
y = unidcdf(x,10);
stairs(x,y)
set(gca,'Xlim',[0 11])
You can also generate random samples from the discrete uniform distribution.
For example, pick a random sample of 10 from a list of 553 items:
numbers = unidrnd(553,1,10)
numbers =
293 372 5 213 37 231 380 326 515 468
See Also
“Discrete Distributions” on page 5-7
Weibull Distribution
In this section...
“Definition” on page B-103
“Background” on page B-103
“Parameters” on page B-104
“Example” on page B-104
“See Also” on page B-105
Definition
The Weibull pdf is
$$ y = f(x \mid a, b) = b a^{-b} x^{b-1} e^{-\left(\frac{x}{a}\right)^b} I_{(0,\infty)}(x) $$
Background
Waloddi Weibull offered the distribution that bears his name as an
appropriate analytical tool for modeling the breaking strength of materials.
Current usage also includes reliability and lifetime modeling. The Weibull
distribution is more flexible than the exponential for these purposes.
To see why, consider the hazard rate function (instantaneous failure rate). If
f(t) and F(t) are the pdf and cdf of a distribution, then the hazard rate is
$$ h(t) = \frac{f(t)}{1 - F(t)} $$
Substituting the pdf and cdf of the exponential distribution for f(t) and F(t)
above yields a constant. The example below shows that the hazard rate for
the Weibull distribution can vary.
Parameters
Suppose you want to model the tensile strength of a thin filament using
the Weibull distribution. The function wblfit gives maximum likelihood
estimates and confidence intervals for the Weibull parameters.
strength = wblrnd(0.5,2,100,1);   % simulated strengths; the original
                                  % sampling command was lost, so this is
                                  % an assumed reconstruction
[p,ci] = wblfit(strength)

p =
0.4715 1.9811
ci =
0.4248 1.7067
0.5233 2.2996
The default 95% confidence interval for each parameter contains the true
value.
Example
The exponential distribution has a constant hazard function, which is not
generally the case for the Weibull distribution.
The plot shows the hazard functions for exponential (dashed line) and Weibull
(solid line) distributions having the same mean life. The Weibull hazard rate
here increases with age (a reasonable assumption).
t = 0:0.1:4.5;
h1 = exppdf(t,0.6267) ./ (1-expcdf(t,0.6267));
h2 = wblpdf(t,2,2) ./ (1-wblcdf(t,2,2));
plot(t,h1,'-',t,h2,'-')
See Also
“Continuous Distributions (Data)” on page 5-4
Wishart Distribution
In this section...
“Definition” on page B-106
“Background” on page B-106
“Example” on page B-107
“See Also” on page B-107
Definition
The probability density function of the d-dimensional Wishart distribution is
given by
$$ y = f(\mathbf{X} \mid \Sigma, \nu) = \frac{\left|\mathbf{X}\right|^{(\nu - d - 1)/2}\, e^{-\frac{1}{2} \operatorname{trace}\left(\Sigma^{-1} \mathbf{X}\right)}}{2^{\nu d/2}\, \pi^{d(d-1)/4}\, \left|\Sigma\right|^{\nu/2}\, \Gamma\!\left(\frac{\nu}{2}\right) \cdots \Gamma\!\left(\frac{\nu - d + 1}{2}\right)} $$
Background
The Wishart distribution is a generalization of the univariate chi-square
distribution to two or more variables. It is a distribution for symmetric
positive semidefinite matrices, typically covariance matrices, the diagonal
elements of which are each chi-square random variables. In the same way
as the chi-square distribution can be constructed by summing the squares of
independent, identically distributed, zero-mean univariate normal random
variables, the Wishart distribution can be constructed by summing the inner
products of independent, identically distributed, zero-mean multivariate
normal random vectors.
The Wishart distribution is often used as a model for the distribution of the
sample covariance matrix for multivariate normal random data, after scaling
by the sample size.
Example
If x is a bivariate normal random vector with mean zero and covariance matrix
$$ \Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2 \end{pmatrix} $$
then you can use the Wishart distribution to generate a sample covariance
matrix without explicitly generating x itself. Notice how the sampling
variability is quite large when the degrees of freedom is small.
Sigma = [1 .5; .5 2];
df = 10; S1 = wishrnd(Sigma,df)/df   % the generating commands were lost;
                                     % df = 10 is an assumed reconstruction

S1 =
1.7959 0.64107
0.64107 1.5496
df = 1000; S2 = wishrnd(Sigma,df)/df
S2 =
0.9842 0.50158
0.50158 2.1682
See Also
“Multivariate Distributions” on page 5-8
C
Bibliography
[2] Bates, D. M., and D. G. Watts. Nonlinear Regression Analysis and Its
Applications. Hoboken, NJ: John Wiley & Sons, Inc., 1988.
[7] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for Data
Analysis. New York: Oxford University Press, 1997.
[12] Breiman, L., et al. Classification and Regression Trees. Boca Raton,
FL: Chapman & Hall, 1993.
[16] Collett, D. Modeling Binary Data. New York: Chapman & Hall, 2002.
[19] Cox, D. R., and D. Oakes. Analysis of Survival Data. London: Chapman
& Hall, 1984.
[21] Deb, P., and M. Sefton. “The Distribution of a Lagrange Multiplier Test
of Normality.” Economics Letters. Vol. 51, 1996, pp. 123–130.
[30] Drezner, Z., and G. O. Wesolowsky. “On the Computation of the Bivariate
Normal Integral.” Journal of Statistical Computation and Simulation. Vol.
35, 1989, pp. 101–107.
[37] Genz, A., and F. Bretz. “Comparison of Methods for the Computation
of Multivariate t Probabilities.” Journal of Computational and Graphical
Statistics. Vol. 11, No. 4, 2002, pp. 950–971.
[52] Huber, P. J. Robust Statistics. Hoboken, NJ: John Wiley & Sons, Inc.,
1981.
[54] Jain, A., and R. Dubes. Algorithms for Clustering Data. Upper Saddle
River, NJ: Prentice-Hall, 1988.
[56] Joe, S., and F. Y. Kuo. “Remark on Algorithm 659: Implementing Sobol’s
Quasirandom Sequence Generator.” ACM Transactions on Mathematical
Software. Vol. 29, No. 1, 2003, pp. 49–57.
[68] Kotz, S., and S. Nadarajah. Extreme Value Distributions: Theory and
Applications. London: Imperial College Press, 2000.
[75] Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis with
Missing Data. 2nd ed., Hoboken, NJ: John Wiley & Sons, Inc., 2002.
[78] Marquardt, D. W., and R. D. Snee. “Ridge Regression in Practice.” The
American Statistician. Vol. 29, No. 1, 1975, pp. 3–20.
[84] McLachlan, G., and D. Peel. Finite Mixture Models. Hoboken, NJ: John
Wiley & Sons, Inc., 2000.
[85] McCullagh, P., and J. A. Nelder. Generalized Linear Models. New York:
Chapman & Hall, 1990.
[96] Mosteller, F., and J. Tukey. Data Analysis and Regression. Upper Saddle
River, NJ: Addison-Wesley, 1977.
[107] Sexton, Joe, and A. R. Swensen. “ECM Algorithms that Converge at the
Rate of EM.” Biometrika. Vol. 87, No. 3, 2000, pp. 651–662.
[113] Student. “On the Probable Error of the Mean.” Biometrika. Vol. 6,
No. 1, 1908, pp. 1–25.