Anomaly Detection - Problem Motivation
More formally
We have a dataset which contains normal (non-anomalous) examples
How we ensure they're normal is up to us
In reality it's OK if there are a few which aren't actually normal
Using that dataset as a reference point we can see if other examples are anomalous
How do we do this?
First, using our training dataset we build a model
We can access this model using p(x)
This asks, "What is the probability that example x is normal?"
Having built a model
if p(xtest) < ε --> flag this as an anomaly
if p(xtest) >= ε --> this is OK
ε is some threshold probability value which we define, depending on how sure we need/want to be
We expect our model to (graphically) look something like this:
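The flagging rule above can be sketched in Python. This is a minimal illustration, not from the lecture: the function names and the numbers (μ = 5, σ2 = 4, ε = 0.02) are assumptions chosen for the example.

```python
import math

def gaussian_p(x, mu, sigma2):
    """Probability density of x under a Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold epsilon."""
    return gaussian_p(x, mu, sigma2) < epsilon

# Example: model fitted with mu=5, sigma2=4; threshold epsilon=0.02
print(is_anomaly(5.5, mu=5, sigma2=4, epsilon=0.02))   # near the mean -> False
print(is_anomaly(15.0, mu=5, sigma2=4, epsilon=0.02))  # far from the mean -> True
```

Raising ε flags more examples as anomalous; lowering it flags fewer — that is the "how sure we need/want to be" trade-off.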
Applications
Fraud detection
Users have activity associated with them, such as
Length of time online
Location of login
Spending frequency
Using this data we can build a model of what normal users' activity is like
What is the probability of "normal" behavior?
Identify unusual users by sending their data through the model
Flag up anything that looks a bit weird
Automatically block cards/transactions
Manufacturing
Already spoke about aircraft engine example
Monitoring computers in data center
If you have many machines in a cluster
Compute features of each machine
x1 = memory use
x2 = number of disk accesses/sec
x3 = CPU load
In addition to the measurable features you can also define your own complex features
x4 = CPU load/network traffic
If you see an anomalous machine
Maybe about to fail
Look at replacing bits from it
What is it?
Say we have a data set of m examples
Given each example is a real number, we can plot the data on the x-axis as shown below
Seems like a reasonable fit - the data has a higher probability of being in the central region and a lower probability of being further away
Estimating μ and σ2
μ = average of the examples: (1/m) Σi x(i)
σ2 = the variance (standard deviation squared): (1/m) Σi (x(i) - μ)^2
As a side comment
These parameters are the maximum likelihood estimation values for μ and σ2
You can use either 1/m or 1/(m-1) - it doesn't make too much difference
Slightly different mathematical problems, but in practice it makes little difference
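The 1/m (maximum likelihood) estimates above can be sketched as follows; the function name and the sample data are assumptions for illustration.

```python
def fit_gaussian(xs):
    """Maximum-likelihood estimates: mu = sample mean, sigma2 = mean squared deviation."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m  # note 1/m, not 1/(m-1)
    return mu, sigma2

mu, sigma2 = fit_gaussian([3, 4, 5, 6, 7])
print(mu, sigma2)  # 5.0 2.0
```

Swapping the divisor for (m - 1) would give the unbiased sample variance instead — as noted above, in practice it makes little difference.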
x1
Mean is about 5
Standard deviation looks to be about 2
x2
Mean is about 3
Standard deviation about 1
So we have the following system
If you plot the product of these things you get a surface plot like this
With this surface plot, the height of the surface is the probability - p(x)
We can't always do surface plots, but for this example it's quite a nice way to show the probability of a 2D feature vector
Check if a value is anomalous
Set epsilon as some value
Say we have two new data points with the following values
x1test
x2test
We compute
p(x1 test) = 0.436 >= epsilon (~40% chance it's normal)
Normal
p(x2 test) = 0.0021 < epsilon (~0.2% chance it's normal)
Anomalous
What this is saying is if you look at the surface plot, all values above a certain height are normal, and all the values below that threshold are probably anomalous
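Putting the pieces together, here is a sketch of the full check for a 2-feature example, using the rough per-feature estimates from earlier (μ1 ≈ 5, σ1 ≈ 2; μ2 ≈ 3, σ2 ≈ 1) and an illustrative ε = 0.02 — all of these numbers are assumptions for the example.

```python
import math

def gaussian_p(x, mu, sigma2):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p(example, mus, sigma2s):
    """p(x) = product over features j of p(x_j; mu_j, sigma2_j)."""
    result = 1.0
    for x_j, mu_j, s2_j in zip(example, mus, sigma2s):
        result *= gaussian_p(x_j, mu_j, s2_j)
    return result

mus, sigma2s = [5, 3], [4, 1]  # variances are the squared standard deviations
epsilon = 0.02
print(p([5, 3], mus, sigma2s) >= epsilon)   # near both means -> True (normal)
print(p([12, 0], mus, sigma2s) >= epsilon)  # far from both means -> False (anomalous)
```

This is exactly the rule p(xtest) < ε → anomaly, with p(x) modelled as a product of independent per-feature Gaussians.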
Anomaly detection vs. supervised learning
Can play with different transformations of the data to make it look more Gaussian
Might take a log transformation of the data
i.e. if you have some feature x1 , replace it with log( x1 )
This looks much more Gaussian
Or do log(x1 +c)
Play with c to make it look as Gaussian as possible
Or do x1^(1/2) (i.e. take the square root)
Or do x1^(1/3)
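The transformations listed above might be sketched like this; the feature values are made up purely for illustration.

```python
import math

# Hypothetical right-skewed feature values (assumed for illustration)
x1 = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 8.0]

c = 1.0  # shift constant; play with c to make the result look as Gaussian as possible
x1_log = [math.log(v + c) for v in x1]   # log(x1 + c) transformation
x1_sqrt = [v ** 0.5 for v in x1]         # x1^(1/2) transformation
x1_cbrt = [v ** (1 / 3) for v in x1]     # x1^(1/3) transformation

print(max(x1_log))  # the long right tail is pulled in compared to max(x1) = 8.0
```

In practice you would plot a histogram of each transformed version and pick the one that looks most Gaussian.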
Here we have one dimension, and our anomalous value is sort of buried in it (in green - Gaussian superimposed in blue)
Look at data - see what went wrong
Can looking at that example help develop a new feature (x2) which can help distinguish further anomalous examples?
Example - data center monitoring
Features
x1 = memory use
x2 = number of disk accesses/sec
x3 = CPU load
x4 = network traffic
We suspect CPU load and network traffic grow linearly with one another
If server is serving many users, CPU is high and network is high
Fail case is infinite loop, so CPU load grows but network traffic is low
New feature - CPU load/network traffic
May need to do feature scaling
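The derived feature could be computed like this; the measurements below are hypothetical numbers chosen for illustration.

```python
# Raw per-machine measurements (hypothetical values)
machine = {"memory": 0.4, "disk_accesses": 120, "cpu_load": 0.9, "network": 0.05}

# Derived feature: high CPU load combined with low network traffic
# (e.g. a server stuck in an infinite loop) produces an unusually large ratio.
machine["cpu_per_network"] = machine["cpu_load"] / machine["network"]
print(machine["cpu_per_network"])  # roughly 18 -- stands out against healthy ratios near 1
```

Because the ratio can span several orders of magnitude, this is also the kind of feature where the scaling or log transformations above may help.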
Say you can fit a Gaussian distribution to CPU load and memory use
Let's say in the test set we have an example which looks like an anomaly (e.g. x1 = 0.4, x2 = 1.5)
Looks like most of data lies in a region far away from this example
Here memory use is high and CPU load is low (if we plot x1 vs. x2 our green example looks miles away from the others)
The problem is, if we look at each feature individually it may fall within acceptable limits - the issue is we know we shouldn't get those kinds of values together
But individually, they're both acceptable
This is because our function makes probability predictions in concentric circles around the means of both features
Probability of the two red circled examples is basically the same, even though we can clearly see the green one as an outlier
The model doesn't understand the correlation between the features
For the sake of completeness, the formula for the multivariate Gaussian distribution is as follows
p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(-(1/2)(x - μ)^T Σ^-1 (x - μ))
where μ is the n-dimensional mean vector, Σ is the [n x n] covariance matrix and |Σ| is the determinant of Σ
Using these values we can, therefore, define the shape of this to better fit the data, rather than assuming symmetry in every
dimension
One of the cool things is you can use it to model correlation between data
If you start to change the off-diagonal values in the covariance matrix you can control how the various dimensions correlate
So we see here the final example gives a very tall thin distribution, showing a strong positive correlation
We can also make the off-diagonal values negative to show a negative correlation
Hopefully this shows an example of the kinds of distribution you can get by varying sigma
We can, of course, also move the mean ( μ) which varies the peak of the distribution
1) Fit the model - take the dataset and calculate μ and Σ
μ = (1/m) Σi x(i)
Σ = (1/m) Σi (x(i) - μ)(x(i) - μ)^T
2) We're next given a new example ( xtest) - see below
For it, compute p(xtest) using the multivariate Gaussian formula
If you put the feature variances on the diagonal of the covariance matrix (with zeros off the diagonal), the two models are actually identical - the original model is a special case of the multivariate Gaussian with axis-aligned contours
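As a sketch, the 2D multivariate density can be written out explicitly without a linear-algebra library (the helper name multivariate_p_2d is an assumption for this example). With a diagonal Σ it reproduces the product of the two 1-D Gaussians, illustrating the special-case claim above.

```python
import math

def multivariate_p_2d(x, mu, Sigma):
    """p(x; mu, Sigma) for n = 2, computed directly:
    exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / ((2*pi)^(n/2) |Sigma|^(1/2))."""
    a, b = Sigma[0]
    c, d = Sigma[1]
    det = a * d - b * c                      # |Sigma|
    inv = [[d / det, -b / det],
           [-c / det, a / det]]              # Sigma^-1 for a 2x2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]        # (x - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Diagonal Sigma: variances 4 and 1 on the diagonal, zeros off-diagonal
mu, Sigma = [5, 3], [[4, 0], [0, 1]]
p_diag = multivariate_p_2d([5, 3], mu, Sigma)
print(round(p_diag, 4))  # 0.0796 -- same as the product of the per-feature 1-D Gaussians

# Off-diagonal entries model correlation: with positive covariance, points along
# the x1 = x2 diagonal score higher than points across it
Sigma_corr = [[1, 0.8], [0.8, 1]]
print(multivariate_p_2d([1, 1], [0, 0], Sigma_corr)
      > multivariate_p_2d([1, -1], [0, 0], Sigma_corr))  # True
```

This is why the multivariate model catches the high-memory/low-CPU example earlier: the correlated Σ gives elliptical contours instead of concentric circles.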