
CDGC Classification

This document discusses data classification in Informatica Cloud Data Governance and Catalog (CDGC), detailing its importance in managing risk, ensuring compliance, and enhancing data security. It explains the two main types of data classification: rule-based, which uses predefined rules built on metadata and/or data, and CLAIRE-generated, which uses Informatica's AI engine to automate the process. It also covers the prerequisites for effective data classification, such as metadata extraction and profiling, and provides examples of how to implement rule-based classifications for identifying sensitive information.

"This is Vive In this
0:15
video we will explore data
0:17
classification in Informatica Cloud Data
0:19
Governance and catalog CDGC Let's have a
0:23
quick look at what we will be covering
0:25
in this video
0:34
Data classification is the process of
0:37
identifying and tagging the data into
0:40
relevant categories based on metadata
0:43
and or the content of the
0:46
columns Data classification can be
0:49
designed to work in three ways Using the
0:52
metadata of the table or column using
0:55
the data of the column or using the
0:57
combination of both metadata and data
1:01
Classifying the data helps organization
1:04
manage risks ensure compliance and
1:07
enhance data security For example data
1:10
classification can help to identify
1:13
where personally identifiable
1:15
information or any sensitive information
1:19
such as credit card number or customer
1:21
address is stored within the
1:24
organization Once such sensitive data is
1:27
identified organizations can then take
1:30
appropriate steps to protect this
1:32
sensitive data In CDGC data
Two types of classification in CDGC

1. Rule-based data classifications: predefined rules created using the metadata and/or data of columns or fields. CDGC provides 200+ out-of-the-box rule-based classifications, and users can also create their own custom ones.
2. CLAIRE-generated data classifications: classifications automated by CLAIRE, Informatica's AI engine. CLAIRE-generated classifications work only on the metadata of columns or fields.
Rule-based classification: element vs. entity

Rule-based data classification is further split into two types: data element classification and data entity classification.

Data element classifications are rules that apply at the column or field level and are written in Spark SQL. A rule can use only the metadata of the column or field, only the data in it, or a combination of both. For example: classify a column as Social Security Number if the column name contains keywords like SSN or SOC SEC, and more than 80% of its values follow the pattern three-digit number, hyphen, two-digit number, hyphen, four-digit number. This example rule uses both metadata (the name of the column) and data (the contents of the column).
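To make the shape of such a rule concrete, here is a minimal, runnable Spark SQL sketch of its two conditions. The literals stand in for the real column name and a profiled value, and the expression style is illustrative only; it is not CDGC's actual rule syntax.

-- Sketch of the two conditions behind an SSN element classification.
SELECT
  -- Metadata condition: does the column name contain an SSN-like keyword?
  upper('SOC_SEC_NO') RLIKE 'SSN|SOC.?SEC|SOCIAL.?SECURITY' AS name_matches,
  -- Data condition: does a value match the ###-##-#### pattern? CDGC would
  -- require more than 80% of profiled values to conform, not just one value.
  '123-45-6789' RLIKE '^[0-9]{3}-[0-9]{2}-[0-9]{4}$' AS value_matches;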
Data entity classifications apply at the table (or file) level and depend on data element classifications: because tables contain columns, element classifications are applied at the column level, and the entity classification is derived from them at the table or file level. An entity classification can be designed to consider all, any, or a chosen number of the selected element classifications. For example: if the Full Name, Gender, Date of Birth, Email, or Phone Number element classifications are identified in one or more columns of a table, that table can be classified as a Person entity. In other words, a table is classified as Person when it contains full name, gender, date of birth, email, or phone information.
CLAIRE-generated data classification exists only at the data element level; there is no entity-level equivalent, so CLAIRE can generate classifications only at the column or field level. These classifications are produced automatically by CLAIRE, Informatica's AI engine, which is based on machine-learning algorithms. When the feature is enabled on a catalog source, CLAIRE applies predefined rules to the column or field metadata (mainly the name) and automatically generates classifications for the columns or fields of tables or files. CLAIRE-based classification works only on column metadata. It is useful when you do not know enough about a column's metadata or contents to write an effective rule-based data element classification.
Conformance percentage vs. weighted conformance percentage

When you create a rule-based data element classification, you must specify whether the process should use the conformance percentage or the weighted conformance percentage. A rule that uses column data relies on the column profiling data available in CDGC. As a sample: consider a Gender column with three values, Male, Female, and null/empty. CDGC stores each value's frequency: Male appears 11 times, about 47% of all values, and likewise for Female and for null/empty. This profiling data is what the process consumes for both conformance calculations.
Conformance percentage: the process considers only the distinct values. Male, Female, and null/blank are three distinct values, so each counts as an occurrence of 1, and each value's conformance is 1 divided by the total (1 + 1 + 1 = 3), i.e. 33%. If the data element classification is written to match the Male and Female values, their conformance percentages are summed, giving 66.66% as the final conformance percentage.

Weighted conformance percentage: the process uses value frequencies instead of distinct-value counts. Male appears 11 times, Female 9 times, and blank/null 3 times, so each value's weighted conformance is its frequency divided by the total (11 + 9 + 3 = 23). If the rule matches only the Male and Female keywords, the overall weighted conformance percentage is (11 + 9) / 23 = 86.96%.
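Both calculations can be reproduced with a short, runnable Spark SQL query. The inline VALUES table below stands in for CDGC's stored profiling data, which is an assumption for illustration:

WITH profile AS (
  SELECT * FROM VALUES
    ('Male', 11), ('Female', 9), (CAST(NULL AS STRING), 3)
    AS t(value, frequency)
)
SELECT
  -- Conformance %: every distinct value counts once; matching Male and
  -- Female gives 2 of 3 distinct values (the ~66.66% figure above).
  ROUND(100.0 * COUNT(CASE WHEN value IN ('Male', 'Female') THEN 1 END)
        / COUNT(*), 2) AS conformance_pct,
  -- Weighted conformance %: values are weighted by frequency, so
  -- (11 + 9) / (11 + 9 + 3) = 86.96%.
  ROUND(100.0 * SUM(CASE WHEN value IN ('Male', 'Female') THEN frequency END)
        / SUM(frequency), 2) AS weighted_conformance_pct
FROM profile;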
When should you use each? If a column has many unique values, and the percentage of blank or null values should be ignored, use the conformance percentage. Columns like SSN, name, country, phone number, and credit card number usually hold many unique values, so conformance percentage suits them. If a column has very few unique values, and the percentage of blanks or nulls is important to consider, use the weighted conformance percentage. Examples: gender, a city column limited to one country (the same cities repeat again and again), ethnic group, and flag-type columns. These columns generally hold repeated values, so weighted conformance percentage is the better choice when designing a rule-based data element classification for them.
Prerequisites

1. Metadata extraction is mandatory for any kind of data classification rule to work.
2. Profiling with the "keep signature and values" option is mandatory only when the rule-based classification uses column data. If the rule works only on metadata, profiling is not required.
Example rule-based data element classifications

1. SSN: identify the columns where Social Security Number data resides and tag them as SSN. Rule: the column name contains any of the keywords SSN, SOC SEC, SOCIAL SECURITY NUMBER, or SOCIAL-SECURITY#, OR at least 80% of the column data follows a pattern like three-digit number, hyphen, two-digit number, hyphen, four-digit number (or a variant of this). This rule considers metadata OR data: if either matches, the column is flagged as SSN. With an AND operator, both would have to match.
2. Credit card: identify the columns where credit card data resides and tag them as Credit Card, considering both metadata AND data.
3. USA ITIN: identify USA Individual Tax Identification Number columns and tag them as USA Individual Tax Identification Number (ITIN). Rule: the column name contains any of the listed keywords AND more than 80% of the column data matches a pattern like three-digit number, hyphen, two-digit number, hyphen, four-digit number (or a variant of this). This rule combines the column name with the pattern of the column data.
Example data entity classification: PII

Entity classifications apply at the table or file level. Requirement: identify tables that contain personally identifiable information and tag them as PII. Rule: the table must contain full name, SSN, or email information. To find that information in the table's columns, the process looks for element-level rules such as:

- Full Name: the column name contains a keyword like NAME, FULL NAME, or COMPLETE NAME (metadata only).
- SSN: the same logic as the SSN rule above.
- Email: the column name contains the keyword MAIL, or at least 80% of the column data matches the specified email format.

How does the entity classification get tagged at the table or file level? When you create a data entity classification, you select one or more data element classifications along with an inclusion rule: all, any, or a chosen number of them. The process first identifies and tags columns with the element classifications that are part of the entity classification. Then, depending on the inclusion rule, if all, any, or the chosen number of element classifications are tagged on a table's columns, the table is tagged with the entity classification (see the sketch below).
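A rough Spark SQL sketch of that inclusion logic follows, assuming a hypothetical staging view of the element classifications already inferred per table. The view name, columns, and sample rows are invented for illustration; CDGC performs this internally:

-- Tables tagged PII under the "any" inclusion rule (threshold 1).
WITH inferred_classifications AS (
  SELECT * FROM VALUES
    ('PASSENGER', 'Full Name'), ('PASSENGER', 'Email'), ('PASSENGER', 'SSN'),
    ('PILOT', 'Full Name'), ('PILOT', 'SSN')
    AS t(table_name, classification)
)
SELECT table_name
FROM inferred_classifications
WHERE classification IN ('Full Name', 'Email', 'SSN')
GROUP BY table_name
-- "any" = at least 1 of the selected element classifications;
-- "all" would be >= 3 here, "some" a chosen number in between.
HAVING COUNT(DISTINCT classification) >= 1;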
Demo: rule-based classification

For the demo, an Oracle catalog source is created with metadata extraction and data profiling enabled, and then the data classification capability is enabled. Since this demo covers rule-based classification, select "data classification rules"; CLAIRE-generated classification is covered in the next video, so ignore that option for now.
Next, add the data classifications against which to analyze the data. Per the earlier examples, the goal is to find all columns containing Social Security Numbers, credit card information, and USA Individual Tax Identification Numbers, so add those data element classifications. Click "Add data classification" to list all the data element and data entity classifications available in the org, both out-of-the-box and custom. If they are not visible, navigate back to Explore and select Data Classification from the drop-down; all classifications should appear there. If even the out-of-the-box classifications are missing, click "Import predefined content" and follow the process to import them.
Add the classifications: first SSN, then Credit Card, then US Individual Tax Identification Number; all three are out-of-the-box data element classifications. To also identify tables containing personally identifiable information, add a custom data entity classification called PII, then save the changes.
The custom PII entity classification selects three data element classifications (Full Name, Email, SSN) with inclusion scope "Any" and the number 1, meaning a table containing any of that information is tagged as PII. To require all of the element classifications, select the "All" option; to require, say, two of them, specify the number 2. Since these are data element classifications, whenever a table is tagged as PII, the underlying columns are also tagged with whichever element classifications matched.
Now run the catalog source. Since this is its first execution, run with all three capabilities selected. If metadata extraction and data profiling have already been executed, you can run just data classification and uncheck the other two options.
Results

Once the catalog source executes successfully with the data classification capability, expand Data Classification to see the rule-based classification run. Clicking it shows stats: the total number of data classifications evaluated, the total number of columns evaluated, the number of unique data classifications, the total number of classification associations inferred, and the total number of classification inferences deleted. Here, 11 associations were inferred. Clicking that shows the details: for example, the full name column is classified with the Full Name data classification and belongs to the passenger table, and similarly for the other columns. One entry reads "passenger of type Table is classified as PII": PII is a data entity classification inferred on the passenger table, while the rest are data element classifications.
Back in CDGC, open the catalog source and then the passenger table. The PII data entity classification is inferred on the table; hovering over it shows details such as its type (data entity classification), and since it is in the accepted state its score is always 100. The Contains tab lists all the table's columns with their inferred data element classifications: Full Name on the full name column, Email on the contact column, SSN on the identity document column, and two classifications on the CCN column. Opening the CCN column shows it is inferred with two element classifications, Credit Card and SSN. Classifications can be curated from here; for example, to reject the SSN element classification on the CCN column, click it and it is rejected.
Why was PII inferred on this table? Because per the PII classification rule, the table should contain at least any of email, SSN, or full name information, and the passenger table's columns contain email, SSN, and full name; if any of these is found, the table is classified as PII. The pilot table likewise has the PII entity classification inferred: its columns carry the Full Name and SSN element classifications, which is why it too holds PII data.
Search queries can also be used to find assets with data classification associations. For example, a query for data elements in this catalog source that have data classification associations with a curation status of either accepted or rejected returns the total number of matching columns; here it shows seven. Different variants of the search query can be used to get the required results.
Frequently asked questions

Q: What metadata can be used when defining data element classification rules?
A: Four things: the name of the table, the name of the column, the comment on the table, and the comment on the column.

Q: What profiling data can be used when defining data element classification rules?
A: When a classification rule is written to use profiling data, several attributes from CDGC profiling are available: the count of values profiled for a column, the distinct value count, the inferred data type, the average, the standard deviation, the minimum and maximum values, the null percentage, the most frequent values, the distinct values, the value frequency (how many times a particular value appears in the column), and the frequency percentage. In practice, rule-based data element classifications that use profiling data usually rely on value frequencies: the values themselves, their frequencies, or their frequency percentages.
Some common search queries related to data classification are listed on screen in the video.

Q: Which language is used for defining rule-based data classifications?
A: CDGC uses Spark SQL syntax for defining classification rules (a Spark SQL reference link is provided with the video).
Between rule-based data classification
23:14
and clear generated classification which
23:17
one is recommended
23:19
Rule-based data classification always
23:22
provides better confidence because when
23:25
designing rule-based data classification
23:28
user needs to be aware about the
23:31
metadata or data of the column against
23:35
which he wants to run the rule and so
23:38
user knows what he is trying to evaluate
23:42
This also makes the curation of the
23:44
association easier both technically
23:47
because it can be done in bulk through
23:50
the bulk import process and even
23:53
logically because user will be aware
23:55
about the context of the column or table
23:59
and also about the data classification
24:01
rule against which he's trying to
24:03
evaluate the column or table data When
24:06
should rule-based data classifications
24:08
be used compared to clear generated data
24:11
classifications Rule-based data
24:13
classifications are suitable when the
24:15
user has a clear understanding of the
24:17
technical assets context like about its
24:20
metadata and or data In such cases the
24:24
user can easily define rules to classify
24:27
the data But consider some cases where
24:30
the user doesn't have information about
24:33
what a particular column is related to
24:35
or what kind of data it contains In such
24:39
cases defining rule-based classification
24:41
can be challenging In those scenarios
24:44
clear generated classification can be
24:46
utilized Clear is Informatica's machine
24:49
learning based AI engine and it uses
24:52
internal algorithms to analyze column
24:55
names and generate placeholders for the
24:58
data entity classifications These
25:00
placeholders are listed under generated
25:03
classification section of the asset in
25:06
CDGC The user then has the option to
25:09
either promote or reject these clear
25:12
generated placeholders If the user
25:15
promote a placeholder they can link this
25:18
placeholder to an existing data entity
25:20
classification or they can create a new
25:23
data entity classification and associate
25:26
this placeholder with the created data
25:28
entity classification The content
The content discussed in the video is available on the presenter's website; the link is in the video description.
