Lab4 Data Quality
Prepared by:
Bouhtil Rania
El Maksour Imane
School year: 2022-2023
In this exercise, we will use the Data Profiling task to find inaccurate data in the
CustomersDirty view we created in the previous lab, within the
DQS_STAGING_DATA database.
1. Open Visual Studio (or SQL Server Data Tools (SSDT) for older versions). Create
a new SSIS project and solution.
2. Drag the Data Profiling task from the SSIS Toolbox (it should be in the Common
Tasks group) to the control flow working area. Right-click it and select Edit.
3. On the General tab, use the Destination property drop-down list to select New File
Connection.
4. In the File Connection Manager Editor window, change the usage type to Create
File. In the File text box, type the file name ProfilingCustomers.xml.
5. When you are back in the Data Profiling Task Editor, on the General tab, change
the OverwriteDestination property to True so that the package can be executed
multiple times (otherwise you will get an error saying that the destination file
already exists when the package next executes).
6. In the lower-right corner of the Data Profiling Task Editor, on the General tab, click
the Quick Profile button.
7. In the Simple Table Quick Profiling Form dialog box, click the New button to create
a new ADO.NET connection. The Data Profiling task accepts only ADO.NET
connections.
8. Connect to your SQL Server instance by using Windows authentication, and select
the DQS_STAGING_DATA database. Click OK to return to the Simple Table Quick
Profiling Form dialog box.
9. Select the CustomersDirty view in the Table Or View drop-down list. Leave the
first four check boxes selected, as they are by default. Clear the Candidate Key
Profile check box, and select the Column Pattern Profile check box.
10. In the Data Profiling Task Editor window, in the Profile Type list on the right,
select different profiles and check their settings. Change the Column property for the
Column Value Distribution Profile Request from (*) to Occupation (you are going to
profile this column only). Change the ValueDistributionOption property for this
request to All-Values. In addition, change the value for the Column property of the
Column Pattern Profile Request from (*) to EmailAddress. Click OK.
11. Save the project. Execute the package.
12. When the execution finishes, check whether the XML file appeared in the folder
you chose in step 4.
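For intuition about what a quick profile measures, here is a minimal Python sketch. It is not the Data Profiling task itself: it assumes the selected profiles include a per-column null ratio and a length distribution, the sample rows are illustrative, and the helper names `null_ratio` and `length_distribution` are hypothetical.

```python
from collections import Counter

# Illustrative rows standing in for the CustomersDirty view (not lab data).
rows = [
    {"FullName": "Jon Yang", "Occupation": "Professional"},
    {"FullName": "Eugene Huang", "Occupation": None},
    {"FullName": "Ruben Torres", "Occupation": "Manual"},
    {"FullName": "Christy Zhu", "Occupation": "Professional"},
]

def null_ratio(rows, column):
    """Share of rows in which the column is NULL (None in Python)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[column] is None) / len(rows)

def length_distribution(rows, column):
    """Number of non-NULL values per string length."""
    return Counter(len(r[column]) for r in rows if r[column] is not None)

print(null_ratio(rows, "Occupation"))
print(length_distribution(rows, "FullName"))
```

An unexpectedly high null ratio or an unusual value length is a hint that a column deserves a closer look.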
1. Open the Data Profile Viewer, navigate to the ProfilingCustomers.xml file, and open
it. You can now start reviewing the results.
2. On the left, in the Profiles pane, select, for example, the Column Value Distribution
Profiles. In the upper-right pane, select the Occupation column. In the middle-right
window, you should see the distribution for the Occupation attribute. Click the value
that has a very low frequency (the Profesional value). Find the drill-down button in the
upper-right corner of the middle-right window. Click it, and in the lower-right pane,
check the row with this suspicious value.
3. Check the Column Pattern Profiles. For the EmailAddress column, the
Data Profiling task shows you the regular expression patterns found in this column.
These two regular expressions are the same ones you used when you
prepared a DQS knowledge base in the previous labs.
4. Also check the other profiles. When you are done checking, close the Data Profile
Viewer.
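The two checks performed in the viewer can be imitated with a short Python sketch. It only illustrates the idea: the sample values are made up, the helpers `rare_values` and `non_matching` are hypothetical, and `EMAIL_PATTERN` is a simplified assumption, not one of the expressions the Column Pattern Profile reports.

```python
import re
from collections import Counter

# Illustrative values only; they are not taken from the lab's database.
occupations = ["Professional"] * 6 + ["Manual"] * 3 + ["Profesional"]
emails = [
    "jon24@adventure-works.com",
    "eugene10@adventure-works.com",
    "ruben35@@adventure-works",
]

def rare_values(values, max_share=0.1):
    """Return the values whose relative frequency is at most max_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, n in counts.items() if n / total <= max_share)

# Assumed, simplified e-mail shape: local part, one @, dotted domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

def non_matching(values, pattern):
    """Return the values that do not fully match the pattern."""
    return [v for v in values if pattern.fullmatch(v) is None]

print(rare_values(occupations))
print(non_matching(emails, EMAIL_PATTERN))
```

A value that appears far less often than its neighbors, such as the misspelled occupation, is a typical candidate for a typo, and a value that fits none of the discovered patterns is a candidate for a malformed entry.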
Data Cleansing with SSIS
1. Open SSMS, connect to your SQL Server instance, open a new query window,
and change the context to the DQS_STAGING_DATA database.
2. Create a table for clean customer data. Name it CustomersCleanT. Include only
columns for the customer key, full name, and street address. Use the following code.
3. Populate the table with every tenth customer from the DimCustomer table from the
AdventureWorksDW database by using the following query.
4. Create a table with a structure similar to the one for CustomersCleanT and call it
CustomersDirtyT. Add two integer columns to this table called Updated and
CleanCustomerKey. The first one is used by the query that makes the data dirty,
and the second one is populated with the customer key from the clean table
after identity mapping (the process of linking records from an input data source
to the corresponding records in a reference data source).
To create the dirty data, execute the queries in the createDiryData.sql file.
5. Check the dirty data after the changes. Slightly more than 40 percent of the rows
should be updated. Because the updates are random, you get a different number of
rows and different rows updated every time you run the code. You can check the
changes with the following query.
6. Finally, update the row for the customer with a key equal to -11010. Set the
FullName to jacquelyn suarez and StreetAddress to 7800 corrinne ct. This gives you
a row that can be corrected with the DQS Cleansing transformation in the next
practice. Use the following code.
8. Add another new table in the dbo schema and name it CustomersDirtyNoMatchT.
Use the same structure as for the previous table.
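The preparation steps above can be sketched end to end as follows. This is a portable illustration using Python's built-in sqlite3 module, not the lab's T-SQL: the DimCustomer rows are synthetic, the column types are simplified, the negated keys in the dirty table are an assumption suggested by the key -11010, and the random update is a hypothetical stand-in for createDiryData.sql, whose actual logic is not shown here.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")

# Synthetic stand-in for AdventureWorksDW's DimCustomer (not real data).
conn.execute(
    "CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, "
    "FirstName TEXT, LastName TEXT, AddressLine1 TEXT)"
)
conn.executemany(
    "INSERT INTO DimCustomer VALUES (?, ?, ?, ?)",
    [(k, f"First{k}", f"Last{k}", f"{k} Main St") for k in range(11000, 11100)],
)

# Step 2: clean table with customer key, full name, and street address only.
conn.execute(
    "CREATE TABLE CustomersCleanT (CustomerKey INTEGER PRIMARY KEY, "
    "FullName TEXT, StreetAddress TEXT)"
)

# Step 3: keep every tenth customer (keys divisible by 10).
conn.execute(
    "INSERT INTO CustomersCleanT "
    "SELECT CustomerKey, FirstName || ' ' || LastName, AddressLine1 "
    "FROM DimCustomer WHERE CustomerKey % 10 = 0"
)

# Step 4: dirty table with the two extra integer columns.
# Assumption: dirty keys are the negated clean keys.
conn.execute(
    "CREATE TABLE CustomersDirtyT (CustomerKey INTEGER PRIMARY KEY, "
    "FullName TEXT, StreetAddress TEXT, "
    "Updated INTEGER, CleanCustomerKey INTEGER)"
)
conn.execute(
    "INSERT INTO CustomersDirtyT (CustomerKey, FullName, StreetAddress, Updated) "
    "SELECT -CustomerKey, FullName, StreetAddress, 0 FROM CustomersCleanT"
)

# Hypothetical stand-in for createDiryData.sql: with probability 0.4,
# drop the last character of the name and count the change in Updated.
rng = random.Random(42)
keys = [row[0] for row in conn.execute("SELECT CustomerKey FROM CustomersDirtyT")]
for key in keys:
    if rng.random() < 0.4:
        conn.execute(
            "UPDATE CustomersDirtyT "
            "SET FullName = substr(FullName, 1, length(FullName) - 1), "
            "Updated = Updated + 1 WHERE CustomerKey = ?",
            (key,),
        )

# Step 5: compare clean and dirty rows for the customers that were changed.
changed = conn.execute(
    "SELECT c.FullName, d.FullName, d.Updated "
    "FROM CustomersCleanT AS c "
    "JOIN CustomersDirtyT AS d ON c.CustomerKey = -d.CustomerKey "
    "WHERE d.Updated > 0 ORDER BY c.CustomerKey"
).fetchall()
for clean_name, dirty_name, updated in changed:
    print(clean_name, "->", dirty_name, "updates:", updated)

# Step 6: the row intended for the DQS Cleansing transformation.
conn.execute(
    "UPDATE CustomersDirtyT "
    "SET FullName = 'jacquelyn suarez', StreetAddress = '7800 corrinne ct' "
    "WHERE CustomerKey = -11010"
)
conn.commit()
```

In SQL Server the same statements would target the dbo schema of DQS_STAGING_DATA and read from the AdventureWorksDW database; the fixed seed is used here only to make the sketch repeatable.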
Now that our data and tables are prepared, we will create an SSIS flow to clean the
dirty data.
1. Create a new package in your integration project from the first exercise. You can
name the package DQSCleansing.
2. Drag a data flow task to the control flow working area. Click the Data Flow tab to
open the data flow working area.
3. Right-click the Connection Managers folder in Solution Explorer and select New
Connection Manager.
4. Select the OLE DB connection manager type and click Add. In the Configure OLE
DB Connection Manager window, click New.
5. Select the Native OLE DB\SQL Server Native Client 11.0 provider. Provide the name
of your SQL Server instance and authentication information, and select the
DQS_STAGING_DATA database. Click OK. When you are back in the Configure OLE DB
Connection Manager window, click OK.