This document discusses data classification in Informatica Cloud Data Governance and Catalog (CDGC), detailing its importance in managing risks, ensuring compliance, and enhancing data security. It explains two main types of data classification: rule-based, which uses predefined rules based on metadata and data, and CLAIRE-generated, which uses Informatica's AI engine to automate the process. Additionally, it covers the prerequisites for effective data classification, such as metadata extraction and profiling, and provides examples of how to implement rule-based classifications for identifying sensitive information.
CDGC Classification
"This is Vive In this 0:15 video we will explore data 0:17 classification in Informatica Cloud Data 0:19 Governance and catalog CDGC Let's have a 0:23 quick look at what we will be covering 0:25 in this video 0:34 Data classification is the process of 0:37 identifying and tagging the data into 0:40 relevant categories based on metadata 0:43 and or the content of the 0:46 columns Data classification can be 0:49 designed to work in three ways Using the 0:52 metadata of the table or column using 0:55 the data of the column or using the 0:57 combination of both metadata and data 1:01 Classifying the data helps organization 1:04 manage risks ensure compliance and 1:07 enhance data security For example data 1:10 classification can help to identify 1:13 where personally identifiable 1:15 information or any sensitive information 1:19 such as credit card number or customer 1:21 address is stored within the 1:24 organization Once such sensitive data is 1:27 identified organizations can then take 1:30 appropriate steps to protect this 1:32 sensitive data In CDGC data 1:35 classification can be categorized into 1:37 two segments Rule-based data 1:39 classifications are predefined rules 1:42 which are created using the metadata and 1:45 or data of the columns or fields CDGC 1:48 provides 200 plus out ofthe-box 1:51 rule-based data classification and users 1:54 can also create their own custom 1:56 rule-based data classifications Clear 1:58 generated data classification These are 2:01 automated classification using the 2:03 Informatica's AI engine Clear clear 2:06 generated data classifications works 2:08 only on the metadata of the columns or 2:14 fields Now rulebased data classification 2:17 can be further segregated into two types 2:20 Data element classification and data 2:22 entity 2:24 classification Data element 2:25 classifications are basically these are 2:27 the rules which are applicable at the 2:29 column or field level These are written 2:32 in Spark SQL Data element classification 2:35 rule can be designed to utilize only the 2:38 metadata of the column or field only the 2:41 data in the column or field or it can 2:44 also use combination of both metadata 2:46 and data of the column or field For 2:49 example classify a column as social 2:52 security number if the column name 2:55 contains the keywords like SSN or SOC 2:59 followed by SEC Also the column should 3:03 contain more than 80% of the values in 3:06 the format like three-digit 3:08 number two-digit number hyphen 3:11 four-digit number So this example data 3:14 element classification rule is utilizing 3:17 both metadata that is name of the column 3:20 and data of the column which are content 3:22 of the column Data entity classification 3:25 is basically applicable at the table 3:27 level and it is dependent on data 3:30 element classification because tables 3:33 will be having column So at the column 3:36 level data element classification will 3:38 be applicable and using that data entity 3:40 classification will be derived at the 3:42 table level or at the fine level Data 3:45 entity classification can be designed to 3:48 consider all any or some of the selected 3:51 data element 3:53 classifications For example let's say if 3:57 full name gender date of birth email or 4:01 phone number data element 4:02 classifications are identified in one or 4:06 more columns of a table then that table 4:09 can be classified as person entity So 4:12 basically in this example a table is 4:15 classified as person when it contains 4:18 full name 
Data entity classification applies at the table level, and it depends on data element classification: tables have columns, data element classifications apply at the column level, and from those the data entity classification is derived at the table or file level. A data entity classification can be designed to consider all, any, or some of the selected data element classifications. For example, if the Full Name, Gender, Date of Birth, Email, or Phone Number data element classifications are identified in one or more columns of a table, that table can be classified as a Person entity. So in this example a table is classified as Person when it contains full name, gender, date of birth, email, or phone information.

CLAIRE-generated data classification is data element classification only; there is no CLAIRE-generated data entity classification, which means CLAIRE can generate data classifications only at the column or field level. This classification is generated automatically at the column level by Informatica's AI engine, CLAIRE, which is based on machine learning algorithms. When you enable this feature in a catalog source, CLAIRE uses predefined rules to analyze the column or field metadata, mainly the name of the column or field, and automatically generates data classifications for the columns or fields of the tables or files. CLAIRE-based data classification works only on the metadata of the columns. This feature is useful when you do not know enough about a column's metadata or content to write an effective rule-based data element classification; in such cases, CLAIRE-generated data classification may be the better option.

When you create a rule-based data element classification, you have to specify whether the process should consider the conformance percentage or the weighted conformance percentage. When a rule-based data classification uses the data of the column, it relies on the column profiling data available in CDGC.

As a sample of the column profiling data CDGC stores, consider a column called gender in a table with three values: male, female, and null/empty. CDGC stores each value's frequency: male appears 11 times, which is about 47.8% of all values, and similarly for female and for the null/empty values. This data is consumed by the process for both the conformance percentage and the weighted conformance percentage.

Conformance percentage: the process considers only the unique values (male, female, and null/blank), so each value's occurrence counts as 1. The conformance percentage for each value is 1 divided by the total (1 + 1 + 1 = 3), i.e. 33.33%. If our data element classification is written to match the male and female values, summing their conformance percentages gives 66.66% as the final conformance percentage.

Weighted conformance percentage: here the process considers the value frequency instead of the unique value count. Male appears 11 times, female 9 times, and blank/null 3 times, so each value's weighted conformance percentage is its frequency divided by the total (11 + 9 + 3 = 23). If the rule matches only the male and female keywords, the overall weighted conformance percentage is (11 + 9) / 23 = 86.96%.
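Because the rules themselves are Spark SQL, both calculations are easy to verify with a small self-contained query. The inline table below simply reproduces the gender example's value frequencies; CDGC derives the same numbers internally from its stored profiling data.

    -- Gender example: MALE appears 11 times, FEMALE 9, null/blank 3.
    SELECT
      -- Conformance %: each distinct value counts once, so it is
      -- matching distinct values / total distinct values = 2/3 = 66.66%.
      100.0 * count_if(value IN ('MALE', 'FEMALE')) / count(*)
        AS conformance_pct,
      -- Weighted conformance %: weight each value by its frequency, so
      -- (11 + 9) / (11 + 9 + 3) = 20/23 = 86.96%.
      100.0 * sum(CASE WHEN value IN ('MALE', 'FEMALE') THEN cnt ELSE 0 END)
        / sum(cnt) AS weighted_conformance_pct
    FROM VALUES ('MALE', 11), ('FEMALE', 9), (CAST(NULL AS STRING), 3)
      AS freq(value, cnt);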
Now, in which situations should you use the conformance percentage rather than the weighted conformance percentage? If your column has many unique values and the percentage of blank or null values should be ignored, use the conformance percentage. For example, SSN, name, country, phone number, and credit card columns usually contain many unique values, so the conformance percentage is appropriate. But if your column has very few unique values and the percentage of blank or null values is important to consider, use the weighted conformance percentage. Examples are gender, a city column scoped to one country (since city names repeat again and again), ethnic group, and flag-type columns; these columns generally hold repeated values, so the weighted conformance percentage is the better choice when designing a rule-based data element classification.

For the data classification feature to work, there are a few prerequisites. First, metadata extraction is mandatory for any kind of data classification rule. Second, profiling with the keep-signature-and-values option is mandatory only if you are designing a rule-based data classification that uses the column data; if your rule works only on the metadata, profiling is not required.

Let's take a few examples of rule-based data element classification. First, say the requirement is to identify the columns where Social Security Number data resides in the organization and tag them as SSN. The rule: the column name should contain any of the keywords SSN, SOC SEC, SOCIAL SECURITY NUMBER, or SOCIAL-SECURITY#, or at least 80% of the data in the column should follow a pattern like three-digit number, two-digit number, four-digit number (or another variant of this). This rule considers metadata or data: if either the metadata matches or the data matches, the column is flagged as SSN. If we used the AND operator instead, both would have to match.

Second example: identify the columns where credit card data resides in the organization and tag them as Credit Card; for this we consider both metadata and data. Similarly, in the third rule we identify the USA Individual Tax Identification Number and tag it as USA Individual Tax Identification Number (ITIN): the column name should contain any of the listed keywords, and more than 80% of the column data should match a pattern like three-digit number, hyphen, two-digit number, hyphen, four-digit number (or another variant of this). This rule combines the column name with a pattern on the column data.
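To make the OR-versus-AND distinction concrete, here is a sketch of the two rule shapes, again using the hypothetical column_name and data_conformance inputs rather than CDGC's actual rule-builder fields:

    -- OR form (SSN example): flag the column if EITHER signal matches.
    lower(column_name) RLIKE '(ssn|soc[ _-]?sec|social[ -]?security)'
    OR data_conformance >= 0.80

    -- AND form (credit card example): require BOTH signals, which is
    -- stricter and yields fewer false positives.
    -- lower(column_name) RLIKE '(credit|card|ccn)'
    -- AND data_conformance >= 0.80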
Now let's take one example of data entity classification, which applies at the table or file level. Say we want to identify tables that contain personally identifiable information and tag them as PII. To identify a table as PII, consider a rule like: the table must contain full name, SSN, or email information. To find this information in the table's columns, the process must look for rules such as: classify a column as Full Name when the column name contains a keyword like NAME, FULL NAME, or COMPLETE NAME (so for full name we use only the metadata); for SSN we can use the same logic discussed in the previous example; and for Email, the column name should contain a keyword like MAIL, or at least 80% of the column data should be in the specified email format.

Now you might be wondering how a data entity classification gets tagged at the table or file level. When you create a data entity classification, you select one or more data element classifications along with an inclusion rule that considers all, any, or some of those data element classifications. The process first identifies and tags the columns with the data element classifications that are part of the data entity classification. Then, depending on the inclusion rule, if all, any, or some of those data element classifications are tagged to the columns of a table, that table is tagged with the data entity classification.
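A sketch of how the three inclusion rules differ, written as Spark SQL boolean conditions over hypothetical per-table flags (CDGC evaluates this internally; the flag names are illustrative only):

    -- ANY: tag the table when at least one element classification matched.
    has_full_name OR has_ssn OR has_email

    -- ALL: tag the table only when every selected classification matched.
    -- has_full_name AND has_ssn AND has_email

    -- Some (at least N, here N = 2): count the matches.
    -- (CAST(has_full_name AS INT) + CAST(has_ssn AS INT)
    --  + CAST(has_email AS INT)) >= 2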
For this demo, I have created an Oracle catalog source with metadata extraction and data profiling enabled, and we will now enable the data classification capability. In this demo we are looking at rule-based data classification, so I will select data classification rules; in the next video we will look into CLAIRE-generated data classification, so for now we ignore that option.

Next we add the data classifications against which we want to analyze our data. As discussed in the previous slide, we want to find all the columns that contain Social Security Number, credit card, and USA Individual Tax Identification Number information, so I will add those data element classifications here. Click on Add Data Classification and it lists all the data element and data entity classifications available in your org, both out-of-the-box and custom-created. If you cannot see these data classifications here, navigate back to Explore and select Data Classification from the drop-down; all the data classifications should be visible there. If you cannot see even the out-of-the-box data classifications, click on Import Predefined Content and follow the process to import them.

Let me add the data classifications here: first SSN, then Credit Card, and then US Individual Tax Identification Number. All three are out-of-the-box data element classifications. We also want to identify the tables that contain personally identifiable information, so for that I have created a custom data entity classification called PII. Let me add that here, and then we can save the changes.

Let me also show how this PII data entity classification looks, since it is the custom one. Here I have selected three data element classifications, Full Name, Email, and SSN, and the inclusion scope is "any" with the number one, which means a table containing any one of these pieces of information will be tagged as PII. If you want to require all of these data element classifications, select the "all" option; if you want to require, say, two of them, specify the number two, which means only tables containing at least two of these data element classifications will be tagged as PII. And of course, since these are data element classifications, when a table gets tagged as PII the underlying columns are also tagged with the corresponding data element classifications.

Now we can run this catalog source. Since this is the first time we are executing it, we run with all three capabilities selected; if you have already executed metadata extraction and data profiling, you can run just data classification and uncheck the other two options.

The catalog source executed successfully with the data classification capability. Expand data classification and you will see it was a rule-based data classification execution. Click on it to see stats such as the total number of data classifications evaluated, the total number of columns evaluated, the number of unique data classifications, the total number of data classification associations inferred, and the number of data classification inferences deleted. Here we can see 11 associations were inferred in total. Click on that and it shows the details: for example, the full name column is classified with the Full Name data classification and belongs to the passenger table, and similarly for all the other columns. One entry says the passenger asset, of type table, is classified as PII: PII is the data entity classification inferred on the passenger table, and the rest are data element classifications.

Let's navigate back to CDGC and see the results there. I have opened the catalog source; let me open the passenger table. Here we can see the PII data entity classification inferred on the passenger table. Hovering over it shows details such as its type (data entity classification); since it is in the accepted state, the score will always be 100. Go to Contains and it lists all the columns of this table, with their inferred data element classifications: the Full Name data element classification on the full name column, the Email classification on the contact column, the SSN classification on the identity document column, and two classifications on the CCN column. Click on that column and it shows that CCN is inferred with two data element classifications, Credit Card and SSN. I can curate these data classifications from here.
Let's say I want to reject the SSN data element classification on the CCN column: I click on it and it is rejected.

Why was the PII data classification inferred on this table? Because per the PII data classification rule, the table should contain at least one of email, SSN, or full name information. Going back to the passenger table and its columns, it contains email, SSN, and full name, so the table is classified as PII. Similarly, on the pilot table the PII data entity classification is also inferred: checking its columns, Full Name and SSN are inferred there, which is why this table also carries PII.

We can also use search queries to find the assets that have data classification associations. For example, a search query for data elements in this catalog source that have data classification associations with a curation status of either accepted or rejected shows the total number of such columns; here it shows seven columns with data classification associations. Different variants of the search query can be used to get the required result.

Let's have a quick look at some frequently asked questions.

What metadata can be used for defining data element classification rules? When you define a data element classification rule you can utilize four things: the name of the table, the name of the column, the comment on the table, and the comment on the column.

What profiling data can be used for defining data element classification rules? When the rule is written to use profiling data, you can utilize a number of attributes from CDGC profiling, such as: the count of values profiled for a column, the distinct value count, the inferred data type, the average value, the standard deviation, the minimum and maximum values, the null percentage, the most frequent values, the distinct values, the value frequency (how many times a particular value appears in the column), and the frequency percentage. Usually, when we define rule-based data element classifications using profiling data, we use the value frequency data: the values themselves, their frequencies, or their frequency percentages.
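As a rough illustration of a frequency-based rule, the query below selects columns whose recognized gender values account for at least 80% of the weighted value frequency. The table value_frequency_profile and its columns are invented for this sketch; CDGC stores the equivalent profiling data internally.

    -- Hypothetical profiling table: one row per (column, value) holding
    -- that value's frequency percentage within the column.
    SELECT column_id
    FROM value_frequency_profile
    WHERE upper(value) IN ('M', 'F', 'MALE', 'FEMALE')
    GROUP BY column_id
    HAVING sum(frequency_pct) >= 80.0;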
Some common search queries related to data classification are listed on the screen.

Which language is used for defining rule-based data classifications? CDGC uses Spark SQL syntax for defining classification rules. I have provided a link to the Spark SQL documentation for your reference.

Between rule-based data classification and CLAIRE-generated classification, which one is recommended? Rule-based data classification always provides better confidence, because when designing a rule-based classification the user needs to be aware of the metadata or data of the column against which the rule will run, and so the user knows what they are trying to evaluate. This also makes curation of the associations easier, both technically, because it can be done in bulk through the bulk import process, and logically, because the user is aware of the context of the column or table and of the data classification rule against which it is being evaluated.

When should rule-based data classifications be used compared to CLAIRE-generated data classifications? Rule-based data classifications are suitable when the user has a clear understanding of the technical asset's context, i.e. its metadata and/or data; in such cases the user can easily define rules to classify the data. But consider cases where the user doesn't know what a particular column relates to or what kind of data it contains. In such cases, defining a rule-based classification can be challenging, and CLAIRE-generated classification can be utilized instead. CLAIRE is Informatica's machine-learning-based AI engine; it uses internal algorithms to analyze column names and generate placeholders for data classifications. These placeholders are listed under the Generated Classification section of the asset in CDGC. The user then has the option to either promote or reject these CLAIRE-generated placeholders. If the user promotes a placeholder, they can link it to an existing data classification, or create a new data classification and associate the placeholder with it.

The content discussed in this video is available on my website; you can find the link in the video description below. That's all for today. See you in the next video. Until then, take care.