Forcepoint

DLP
10.3

Forcepoint DLP Machine Learning

Revision A
Forcepoint DLP 10.3 | Forcepoint DLP Machine Learning

Contents

■ Introduction to Forcepoint DLP Machine Learning
■ Knowing when to use machine learning
■ How Forcepoint DLP machine learning works
■ Selecting examples for training
■ What happens during training
■ Accuracy of machine learning
■ Using the classifier
■ Tuning the classifiers
■ Comparison with other types of classifiers

Introduction to Forcepoint DLP Machine Learning

Machine learning is a branch of artificial intelligence, comprising algorithms and techniques that allow computers
to learn from examples instead of predefined rules.
Administrators can provide examples that train the Forcepoint DLP machine learning system to help protect
sensitive, proprietary, and confidential information. After training, the system creates a classifier to identify
documents based on how similar they are to the positive examples provided during the learning process.
There are two main types of machine learning algorithms:
■ Supervised learning algorithms

The algorithms are given labeled examples for the various types of data that need to be learned.
■ Unsupervised learning algorithms

Data is unlabeled and the algorithms attempt to find patterns within the data or to cluster the data into groups
or sets.
Forcepoint DLP machine learning uses both types of algorithms.
This article offers a general introduction to Forcepoint DLP machine learning and explores the types of data that
can be effectively protected using machine learning. See:
■ Knowing when to use machine learning
■ How Forcepoint DLP machine learning works
■ Selecting examples for training
■ What happens during training
■ Accuracy of machine learning
■ Using the classifier
■ Tuning the classifiers
■ Comparison with other types of classifiers


Knowing when to use machine learning

Machine learning offers advantages and disadvantages compared with other Forcepoint DLP classification
methods. It is important to assess whether machine learning is the best solution for a particular deployment.
Like any other decision system that handles complicated data, Forcepoint DLP machine learning may generate
false positives (unintended matches) and false negatives (undetected matches). The total fraction of false
positives and false negatives is the error rate of the system; the fraction of documents that are classified
correctly is sometimes referred to as the accuracy of the system.
Accuracy of machine learning is derived from the properties of the data, and finding the best data sets can
sometimes be challenging. Because of this, before considering machine learning, administrators may want to
determine if other types of classifiers, such as fingerprinting or pre-defined policies, are sufficient to classify and
protect their data.
An example of when machine learning could be most effective is in differentiating between proprietary and
non-proprietary data found in source code. It can be hard to fingerprint source code that is under constant
development and continually changing, and predefined policies cannot distinguish between proprietary and
non-proprietary source code.
Forcepoint DLP provides several predefined content types that address common use cases, including source
code (in C, C++, Java, Perl, and F#), patents, software design documents, and documents related to financial
investments. To protect content that belongs to these content types, consider using machine learning, and ensure
that you select the appropriate predefined content type.
Machine learning can also be used to complement and enhance fingerprinting and predefined policies and other
Forcepoint DLP detection and classification methods.

How Forcepoint DLP machine learning works

Supervised machine learning for data protection requires, in general, two types of examples:


■ Content that needs to be protected (“positive” examples)

■ Counterexamples (“negative” examples)

Counterexamples are documents that are thematically related to the positive set, yet are not meant to be
protected. Examples might be public patents versus drafts of patent applications, or non-proprietary source
code versus proprietary source code.
Because it can be difficult and quite labor-intensive to find a sufficient number of documents for the negative
set (while ensuring that no positive examples are in the set), Forcepoint has developed methods that allow the
system to use a generic ensemble of documents as counterexamples. (See Negative examples consisting of “All
documents” and Positive examples.)
For text-based data, some of the algorithms automatically create an optimal “weighted dictionary” that assigns
positive weights to terms and phrases that are more likely to be included in the positive set and negative weights
to terms and phrases that are more likely to be included in the negative set. The algorithms also find an optimal
threshold. When the weighted sum of the terms that are found in a given document is greater than that threshold,
the algorithm decides that the document belongs to the positive set. The assumption is that positive examples are
more likely to have common themes.
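
The weighted-dictionary decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Forcepoint implementation: the terms, the weights, the threshold value, and the function names are all hypothetical, and the sketch assumes that every occurrence of a term contributes to the sum. In a real classifier, the weights and the threshold are learned from the training examples.

```python
import re

# Hypothetical weights and threshold, chosen by hand for illustration.
# A trained classifier would learn these values from the examples.
WEIGHTS = {
    "proprietary": 2.0,
    "confidential": 1.5,
    "draft": 1.0,
    "public": -1.5,
    "license": -1.0,
}
THRESHOLD = 2.0


def score(text, weights=WEIGHTS):
    """Return the weighted sum of the dictionary terms found in the text.

    Assumption: each occurrence of a term adds its weight to the sum.
    Terms that are not in the dictionary contribute nothing.
    """
    terms = re.findall(r"[a-z']+", text.lower())
    return sum(weights.get(term, 0.0) for term in terms)


def is_positive(text, weights=WEIGHTS, threshold=THRESHOLD):
    """Assign the document to the positive set when the sum exceeds the threshold."""
    return score(text, weights) > threshold
```

For example, `is_positive("Confidential draft of a proprietary design")` evaluates the sum 1.5 + 1.0 + 2.0 and compares it with the threshold, while a text that contains mostly negatively weighted terms falls below it.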
Most machine learning algorithms are designed to be used with several hundred or several thousand positive and
negative examples and require “clean” data, or data that is correctly labeled. Forcepoint DLP machine learning,
however, utilizes different algorithms for different data sizes and attempts to automatically match the type of
algorithm to the size of the data.
In addition, Forcepoint DLP machine learning algorithms can detect “outliers” among a set of positive examples.
These are examples that should probably not be labeled “positive.” Forcepoint algorithms also allow learning to
take place even when negative examples are not provided.

Related concepts
Positive examples
Negative examples consisting of “All documents”

Selecting examples for training

The choice of examples is important for machine learning training.

Positive examples
For effective machine learning to occur, it is most important to select the best positive examples.
■ These are textual examples of the data to protect.
■ The documents in this set should be related to the same theme or share other commonalities.

Without the commonalities, the learning algorithm will not be able to find a way to categorize the data.
The required number of examples depends on the level of commonality. If the positive examples share many
terms that are very rare in general, a small number of examples suffices. On the other hand, if the differences
between the positive and the negative set are more subtle, more examples are required. A positive set
typically consists of 100–200 text documents.


Negative examples
Negative examples are samples of data that are semantically or thematically similar to the set of positive
samples, but that should not be protected.
The size of this set of negative examples can be similar to the size of the positive set, although a larger set is
preferable.

Negative examples consisting of “All documents”

To create a generic ensemble of documents that Forcepoint DLP machine learning can use as negative
examples, select the path to a large folder with a representative sample of documents from the organization. This
folder can contain both positive and negative examples, but substantially more negative examples should exist.
The size of this set of counterexamples can be similar to the size of the positive set, although a larger set is
recommended.

What happens during training

After the examples are submitted, the crawler starts examining the files and providing them to the learning
algorithms. If the number of files in a folder is very large, a sampling algorithm samples the folder several times
and checks for convergence.
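
The document does not say which quantity the sampling algorithm monitors or how it decides that the samples have converged. As a rough sketch of the general technique only, the following Python function draws repeated random samples, keeps a running average of some statistic, and stops when that average changes by less than a tolerance. The function name, the parameters, and the stopping rule are assumptions made for illustration.

```python
import random


def sample_until_stable(items, statistic, sample_size=100, tolerance=0.01,
                        max_rounds=20, seed=0):
    """Sample repeatedly and stop once the running estimate stops changing.

    Returns (estimate, rounds_used, converged). The running estimate is the
    mean of the statistic over all samples drawn so far (an assumed rule).
    """
    rng = random.Random(seed)
    estimates = []
    running = None
    for round_number in range(1, max_rounds + 1):
        # Draw one random sample without replacement from the full collection.
        sample = rng.sample(items, min(sample_size, len(items)))
        estimates.append(statistic(sample))
        new_running = sum(estimates) / len(estimates)
        # Converged when two consecutive running estimates are close enough.
        if running is not None and abs(new_running - running) < tolerance:
            return new_running, round_number, True
        running = new_running
    return running, max_rounds, False
```

Here `statistic` could be, for instance, the fraction of sampled documents that contain a given term; any function that maps a list of items to a number works.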


If learning is successful (meaning that the data is “learnable”), the following window appears:


By default:
■ The sensitivity level is set to “Default,” an optimal trade-off between false positives (unintended matches) and
false negatives (undetected matches). To change the sensitivity level, click Default, which opens the Update
machine learning Content Classifier window:


It is important to consider the percentage of unintended and undetected matches before changing the
sensitivity level. For example, if selecting “Narrow” increases the expected number of undetected matches
without reducing the expected number of unintended matches, that setting is highly undesirable.
■ Training ignores outliers, that is, examples that are labeled “positive” but that do not seem
to belong to the positive set.

To include outliers in training, administrators can click Yes next to “Ignore outlier documents” and change it to No.

Accuracy of machine learning

The ability of the system to accurately classify data depends to a large extent on the examples provided. If
Forcepoint DLP machine learning fails to find enough common elements, its results may not be accurate. Should
this happen, the system performs another stage of validation to assess the level of false positives (unintended
matches) and false negatives (undetected matches) on new data that is not used during the training phase,
sometimes referred to as “zero-day documents.”
If the “recall” level of the classifier (the number of true positives divided by the sum of true positives and
false negatives in the new data) is below 70 percent, the system returns a FAIL message that includes the likely
reason the attempt to accurately classify data failed.
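
As a small worked example, the following Python sketch computes recall with its standard definition, true positives divided by the sum of true positives and false negatives, and compares it with the 70 percent level mentioned above. The function names are hypothetical, and the sketch says nothing about how the product counts matches internally.

```python
RECALL_THRESHOLD = 0.70  # the 70 percent level mentioned in the text


def recall(true_positives, false_negatives):
    """Standard recall: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no positive documents")
    return true_positives / total


def passes_validation(true_positives, false_negatives, threshold=RECALL_THRESHOLD):
    """Return True when recall on held-out documents is not below the threshold."""
    return recall(true_positives, false_negatives) >= threshold
```

For instance, if 100 held-out sensitive documents are checked and 69 of them are detected, recall is 0.69 and validation would fail under this sketch; with 70 detected, it would pass.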
Error messages include:

Error Code            Error Message

DSCV_ERR_-420_CODE    There are not enough examples in your positive examples folder. X were provided and at least Y are required. Please add more examples then restart the machine learning process.

DSCV_ERR_-421_CODE    There are not enough examples in your negative examples folder. X were provided and at least Y are required. Please add more examples then restart the machine learning process.

DSCV_ERR_-422_CODE    The files in your positive examples folder do not contain enough text. Of X files provided, only Y have enough text. At least Z are required. Please update the files or point to another folder, then restart the machine learning process.

DSCV_ERR_-423_CODE    The files in your negative examples folder do not contain enough text. Of X files provided, only Y have enough text. At least Z are required. Please update the files or point to another folder, then restart the machine learning process.

DSCV_ERR_-424_CODE    Your positive and negative examples are too similar. No significant difference in words distribution was found. Please provide new examples.

DSCV_ERR_-425_CODE    Your positive and negative examples are too similar, or your positive examples may not be consistent enough to draw conclusions. There were bad error rates on both training X and validation Y. Use different example folders in the classifier.

DSCV_ERR_-426_CODE    The examples you provided were not sufficient for accurate training. Though the accuracy of the training set is good X, the machine learning process cannot make accurate conclusions on unseen data X. Your positive examples may not be homogeneous enough. Please provide more consistent examples then restart the machine learning process.

DSCV_ERR_-427_CODE    Your examples do not fit the content type you specified. You provided X positive examples, but only {2} of them fit the type.

DSCV_ERR_-428_CODE    The files in your example folders don't contain enough meaningful text (only X words). Please add files with more meaningful content or point to other folders, then restart the machine learning process.

DSCV_ERR_-429_CODE    More than one file in your examples folders doesn't contain enough text (only X words). Please update the files or point to other folders, then restart the machine learning process.

By adjusting the sensitivity level of the classifier, administrators can reduce the number of false negatives
(undetected matches) while accepting a higher level of false positives (unintended matches), or accept some false
negatives to reduce the rate of false positives (or find an acceptable balance in between).
Factors influencing the choice include:
■ The level of commonality in the positive set of examples (a low level tends to decrease accuracy)
■ The business implications of false positives
■ The resources that are available to deal with false positives
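
The trade-off between the two kinds of error can be illustrated with a short Python sketch. It assumes, purely for illustration, that a sensitivity setting corresponds to a decision threshold on a document score; the scores, labels, and threshold values below are made up, and the function name is hypothetical.

```python
def count_errors(scored_docs, threshold):
    """Return (false_positives, false_negatives) for a decision threshold.

    scored_docs is a list of (score, is_sensitive) pairs. A document is
    treated as a match when its score is above the threshold.
    """
    false_positives = sum(
        1 for score, sensitive in scored_docs if score > threshold and not sensitive
    )
    false_negatives = sum(
        1 for score, sensitive in scored_docs if score <= threshold and sensitive
    )
    return false_positives, false_negatives


# Made-up validation scores, for illustration only.
DOCS = [
    (0.9, True), (0.8, True), (0.6, False), (0.55, True),
    (0.4, False), (0.3, True), (0.2, False),
]

# A higher threshold removes unintended matches but misses more sensitive documents.
for threshold in (0.25, 0.5, 0.7):
    fp, fn = count_errors(DOCS, threshold)
    print(f"threshold={threshold:.2f} false positives={fp} false negatives={fn}")
```

With this sample data, raising the threshold lowers the count of false positives and raises the count of false negatives, which is the balance an administrator weighs against the factors listed above.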


Using the classifier

After successful training, the machine learning classifier can be used to create rules and policies. An incident that
resulted from a match with a classifier might look like this:

Tuning the classifiers

In some cases, administrators may want to tune the classifiers. For example, if too many false positives occur,
start by setting the sensitivity level to “Narrow.”
It is also possible to combine the classifier with other classifiers, for example one that matches only certain
file types, such as Microsoft Office files and PDF files.
If the overall accuracy level is too low, check to see if all of the positive examples are related to the same subject.
If there is a small number of subjects and enough samples for each of them, optionally create a different classifier
for each subject:

Steps
1) Assign a folder to each subject.

2) Place documents related to the subject in the corresponding folder.

3) Train the system separately on each folder.


Next steps
In many cases, several small specific classifiers can provide better accuracy than one general classifier.
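
The idea of several small, subject-specific classifiers can be sketched as follows. This is an illustration only: the subjects, the term weights, the thresholds, and the rule that a document is reported under every subject whose classifier matches are assumptions, not a description of how Forcepoint DLP combines classifiers.

```python
def make_keyword_classifier(weights, threshold):
    """Build a tiny classifier from hand-picked term weights (hypothetical values)."""
    def classify(text):
        words = text.lower().split()
        return sum(weights.get(word, 0.0) for word in words) > threshold
    return classify


# One small classifier per subject, standing in for one training folder each.
SUBJECT_CLASSIFIERS = {
    "patents": make_keyword_classifier(
        {"claim": 1.0, "invention": 1.0, "embodiment": 1.0}, 1.5
    ),
    "finance": make_keyword_classifier(
        {"forecast": 1.0, "revenue": 1.0, "acquisition": 1.0}, 1.5
    ),
}


def matching_subjects(text, classifiers=SUBJECT_CLASSIFIERS):
    """Return the sorted names of all subject classifiers that match the text."""
    return sorted(name for name, classify in classifiers.items() if classify(text))
```

Because each classifier only has to separate one coherent subject from everything else, its terms can be more specific than those of a single general classifier that covers all subjects at once.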

Comparison with other types of classifiers

The following table summarizes the advantages and disadvantages of the various classifier types:

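Classifier types compared: Machine Learning, Fingerprinting, Pre-Defined Policies, and User-Defined Dictionaries and Regular Expressions.

Coverage
■ Machine Learning: High. Covers any document with semantic similarities to the learned data
■ Fingerprinting: Medium. Detects only derivatives of fingerprinted documents
■ Pre-Defined Policies: Limited to the existing predefined types
■ User-Defined Dictionaries and Regular Expressions: Unlimited, providing that the user has properly defined the dictionaries and the regular expressions

Accuracy
■ Machine Learning: Depends on the data
■ Fingerprinting: Very High
■ Pre-Defined Policies: High for data types that are common enough
■ User-Defined Dictionaries and Regular Expressions: Medium

“Zero-Day” Protection
■ Machine Learning: High
■ Fingerprinting: Very Low
■ Pre-Defined Policies: High
■ User-Defined Dictionaries and Regular Expressions: High

Size/Footprint
■ Machine Learning: Medium
■ Fingerprinting: High
■ Pre-Defined Policies: Low
■ User-Defined Dictionaries and Regular Expressions: Low

Deployment and Config Effort
■ Machine Learning: Medium (may require some tuning)
■ Fingerprinting: Medium
■ Pre-Defined Policies: Low
■ User-Defined Dictionaries and Regular Expressions: High (requires careful setting and tuning)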

For more information on how to use machine learning, see:

■ Forcepoint DLP Administrator Help

© 2024 Forcepoint
Forcepoint and the FORCEPOINT logo are trademarks of Forcepoint.
All other trademarks used in this document are the property of their respective owners.
Published 30 October 2024
