DLP 10.3
Revision A
Forcepoint DLP 10.3 | Forcepoint DLP Machine Learning
Contents
■ Supervised learning algorithms
The algorithms are given labeled examples for the various types of data that need to be learned.
■ Unsupervised learning algorithms
Data is unlabeled and the algorithms attempt to find patterns within the data or to cluster the data into groups
or sets.
Forcepoint DLP machine learning uses both types of algorithms.
This article offers a general introduction to Forcepoint DLP machine learning and explores the types of data that
can be effectively protected using machine learning. See:
■ Knowing when to use machine learning
■ How Forcepoint DLP machine learning works
■ Selecting examples for training
■ What happens during training
■ Accuracy of machine learning
■ Using the classifier
■ Tuning the classifiers
■ Comparison with other types of classifiers
Related concepts
Knowing when to use machine learning
How Forcepoint DLP machine learning works
Selecting examples for training
What happens during training
Using the classifier
Related tasks
Tuning the classifiers
Related reference
Accuracy of machine learning
Comparison with other types of classifiers
Counterexamples are documents that are thematically related to the positive set, yet are not meant to be
protected. Examples might be public patents versus drafts of patent applications, or non-proprietary source
code versus proprietary source code.
Because it can be difficult and quite labor-intensive to find a sufficient number of documents for the negative
set (while ensuring that no positive examples are in the set), Forcepoint has developed methods that allow the
system to use a generic ensemble of documents as counterexamples. (See Negative examples consisting of “All
documents” and Positive examples.)
For text-based data, some of the algorithms automatically create an optimal “weighted dictionary” that assigns
positive weights to terms and phrases that are more likely to be included in the positive set and negative weights
to terms and phrases that are more likely to be included in the negative set. The algorithms also find an optimal
threshold. When the weighted sum of the terms that are found in a given document is greater than that threshold,
the algorithm decides that the document belongs to the positive set. The assumption is that positive examples are
more likely to have common themes.
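The weighted-dictionary decision described above can be sketched in a few lines of Python. This is an illustration only, not Forcepoint's implementation: the terms, weights, and threshold below are invented, whereas in the product they are produced by training. Matching here is naive substring counting, so a term such as "claims" would also match inside "disclaims".

```python
# Hypothetical weighted dictionary: positive weights for terms typical of the
# positive set, negative weights for terms typical of the negative set.
WEIGHTS = {
    "patent draft": 2.5,
    "confidential": 1.5,
    "claims": 0.5,
    "published": -1.0,
    "public domain": -2.0,
}
THRESHOLD = 2.0  # hypothetical; a real threshold would be learned


def score(text, weights=WEIGHTS):
    """Return the weighted sum of dictionary terms found in the text."""
    lowered = text.lower()
    # Each occurrence of a term or phrase contributes its weight once.
    return sum(weight * lowered.count(term) for term, weight in weights.items())


def is_positive(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Decide that a document belongs to the positive set when its score exceeds the threshold."""
    return score(text, weights) > threshold


draft = "Confidential patent draft: claims 1-3"
public = "Published patent, now in the public domain"
print(score(draft), is_positive(draft))
print(score(public), is_positive(public))
```

The first document accumulates only positive weights and crosses the threshold; the second accumulates only negative weights and does not.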
Most machine learning algorithms are designed to be used with several hundred or several thousand positive and
negative examples and require “clean” data, or data that is correctly labeled. Forcepoint DLP machine learning,
however, utilizes different algorithms for different data sizes and attempts to automatically match the type of
algorithm to the size of the data.
In addition, Forcepoint DLP machine learning algorithms can detect “outliers” among a set of positive examples.
These are examples that should probably not be labeled “positive.” Forcepoint algorithms also allow learning to
take place even when negative examples are not provided.
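The source does not say how outliers are detected. As one possible way to picture the idea, the sketch below flags positive examples that share very few terms with the rest of the set. The similarity measure (Jaccard overlap of word sets), the cutoff value, and all function names are assumptions made for this example.

```python
def term_set(text):
    """Lower-case the text and split it into a set of whitespace-separated terms."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_outliers(docs, min_avg_similarity=0.1):
    """Return indices of documents whose mean similarity to the others is below the cutoff.

    The cutoff is a hypothetical value chosen for this example.
    """
    sets = [term_set(d) for d in docs]
    outliers = []
    for i, current in enumerate(sets):
        sims = [jaccard(current, other) for j, other in enumerate(sets) if j != i]
        if sims and sum(sims) / len(sims) < min_avg_similarity:
            outliers.append(i)
    return outliers


positives = [
    "merger agreement draft terms",
    "merger agreement final terms",
    "merger terms agreement summary",
    "cafeteria lunch menu friday",  # labeled positive, but shares no terms with the rest
]
print(find_outliers(positives))
```

Here the last document has no terms in common with the other three, so it is the only one flagged.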
Related concepts
Positive examples
Negative examples consisting of “All documents”
Positive examples
For effective machine learning to occur, it is most important to select the best positive examples.
■ These are textual examples of the data to protect.
■ The documents in this set should be related to the same theme or share other commonalities.
Without the commonalities, the learning algorithm will not be able to find a way to categorize the data.
The required number of examples depends on the level of commonality. If the positive examples share many
terms that are otherwise very rare, a small number of examples generally suffices. On the other hand, if the
differences between the positive and the negative set are more subtle, more examples will be required. A positive
set typically consists of 100–200 text documents.
Negative examples
Negative examples are samples of data that are semantically or thematically similar to the set of positive
samples, but that should not be protected.
The size of this set of negative examples can be similar to the size of the positive set, although a larger set is
preferable.
If learning is successful (meaning that the data is “learnable”), the following window appears:
By default:
■ The sensitivity level is set to “Default,” an optimal trade-off between false positives (unintended matches) and
false negatives (undetected matches). To change the sensitivity level, click Default, which opens the Update
machine learning Content Classifier window:
It is important to consider the percentage of unintended and undetected matches before changing the
sensitivity level. For example, selecting “Narrow” increases the expected number of undetected matches
without reducing the expected number of unintended matches. It is, therefore, highly undesirable.
■ The training is performed ignoring outliers, or examples that could be labeled “positive,” but that do not seem
to belong to the positive set.
To avoid ignoring outliers, administrators can click Yes next to “Ignore outlier documents” and change it to No.
By adjusting the sensitivity level of the classifier, administrators can reduce the number of false negatives
(undetected matches) while accepting a higher level of false positives (unintended matches), accept some false
negatives to reduce the rate of false positives, or find an acceptable balance in between.
Factors influencing the choice include:
■ The level of commonality in the positive set of examples (a low level tends to decrease accuracy)
■ The business implications of false positives
■ The resources that are available to deal with false positives
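The trade-off between the two error types can be illustrated with a small Python sketch. It assumes that a classifier assigns each document a numeric score and that the sensitivity level corresponds to a score threshold; that mapping, the scores, and the threshold values are all invented for illustration and are not taken from the product.

```python
def count_errors(scored, threshold):
    """Count errors for a list of (score, should_be_protected) pairs.

    Returns (false_positives, false_negatives):
    - false positive: matched (score above threshold) but should not be protected
    - false negative: not matched but should be protected
    """
    false_positives = sum(1 for s, protect in scored if s > threshold and not protect)
    false_negatives = sum(1 for s, protect in scored if s <= threshold and protect)
    return false_positives, false_negatives


# Hypothetical classifier scores with the correct label for each document.
SCORED = [
    (0.9, True), (0.8, True), (0.6, True), (0.4, True),
    (0.7, False), (0.5, False), (0.3, False), (0.1, False),
]

# Hypothetical thresholds: a lower threshold matches more documents.
for threshold in (0.35, 0.55, 0.75):
    fp, fn = count_errors(SCORED, threshold)
    print(f"threshold={threshold}: unintended matches={fp}, undetected matches={fn}")
```

On this sample data, raising the threshold removes unintended matches one by one while adding undetected matches, which is the balance the factors above help to weigh.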
Steps
1) Assign a folder to each subject.
Next steps
In many cases, several small specific classifiers can provide better accuracy than one general classifier.
Accuracy | Depends on the data | Very High | High for data types that are common enough | Medium
© 2024 Forcepoint
Forcepoint and the FORCEPOINT logo are trademarks of Forcepoint.
All other trademarks used in this document are the property of their respective owners.
Published 30 October 2024