DWV Notes Units 1 to 5

Data Wrangling and Data Visualization

Course No: DS305


Introductory Class
• Introductions
• Course layout

• Purpose of the course


• How should data be displayed for the data analyst?
• Data exploration: how to present data so that new things can be discovered
• Explaining patterns in the data through visualization
• Presenting what we have understood to other people, with the goal of convincing them
Classroom Guidelines
Course No: DS 305 | Course Title: Data Wrangling and Data Visualization | L: 2, P: 2, U: 2
Aug 2024 – Dec 2024
1. This course is conducted in the first two periods every Wednesday and Thursday
2. Attendance will be given to students who arrive before 9:45 AM
Course Evaluation Scheme
Course Objectives and Outcomes
Objectives:
1. Importance of data pre-processing
2. Exploratory data wrangling techniques
3. Understand visual perception and core skills for visual analysis
4. Visualization for information and dashboard design
5. Introduction to data modelling and analysis
6. Presentation and interpretation of data
7. To get familiar with different visualization techniques using various tools like Python, R, Tableau, etc.
8. Knowledge of visualization software

Outcomes:
1. Implement data checking, wrangling, tidying, and basic management methods.
2. Apply exploratory techniques to identify and describe underlying patterns in data.
3. Identify and avoid common flaws in the presentation of data.
4. Communicate and report upon data effectively.
5. Analyze, visualize, and report on data using multiple programming languages.
6. Recognize how data wrangling is implemented in real-life applications.
Unit 1
Introduction: Data wrangling

Data Wrangling, Data Workflow, Data Dynamics (Data Wrangling Steps), Data Profiling,
Transformation, Data Quality, Team Structure Roles and Responsibilities, and DW Tools
Some Facts about Data
1. What is data? Answer: a collection of facts.
2. Is data abundant or scarce? Answer: abundant.
3. Is it available freely? Answer: no, it takes time and money to collect.
4. Do we understand this data in a raw form? Answer: no, not if the volume of data is large.
5. What is Visualization? Answer: representing data in a graphical way that can be easily understood by the human cognitive system.
6. Do you need Visualization? Answer: yes.

Facts - Few examples
1. Bus Schedule for my stop:
   • 07-Aug-2023 bus reached my stop at 8:00AM
   • On 08-Aug-2023 it is at 8:05AM
   • On 08-08-2023 8:05AM
   • 10-Aug-2023 bus arrived at 17:55AM

2. Students attendance (Date 07-Aug-23), captured in three different sources:

   Source 1:
   Student Name | Roll Number | Attendance
   Ram   | 101 | Y
   Ganga | 102 | N
   John  | 103 | Y

   Source 2:
   Roll Number | Attendance
   101 | Y
   102 | N
   103 | Y

   Source 3 (*Captured only absentees' details):
   Roll Number | Attendance
   102 | N

XML example:
<Student Admissions>
  <Student Name>
    <Student A>
  </Student Name>
  <Program Name>
    <B.Sc.>
  </Program Name>
  <Specialization>
    <DW&DM>
  </Specialization>
  <Offer Date>
    <date1>
  </Offer Date>
</Student Admissions>

Data is available in various formats:
• Unstructured, e.g. text
• Semi-structured, e.g. XML
• Structured, e.g. tables
Purpose of the data
Telecom Industry
It provides mobile services such as voice, internet, SMS etc., to both residential and commercial customers

❖ Potential Pain Areas
  ▪ Customer Churn (customers leaving the company for a competitor)
  ▪ Reduced Profits

❖ Insights/ Knowledge/ Analytics required to resolve Pain Areas
  ▪ Which city/location has high customer churn?
  ▪ What are the top 5 reasons for customer churn?
  ▪ What are the top reasons that account for 95% of customer churn?
  ▪ What is the average data usage under each plan? Use this information to adjust data plans and improve profits.
  ▪ Which are the high-demand plans and what are their customer profiles? Look for opportunities to improve profits.
E-Commerce Industry
An e-commerce company sells its products through an online portal and delivers them to the customer's location, giving the customer the convenience of purchasing without visiting shops.
❖ Potential Pain Areas
  ▪ Decreased Sales
  ▪ Reduced Margins/ Profits
  ▪ Retaining loyal customers
❖ Insights/ Knowledge/ Analytics required to resolve Pain Areas
  ▪ Which products' sales have been declining month on month for the past quarter?
  ▪ What % of customers are loyal, i.e. have purchased products at least 6 times in the last 12 months?
  ▪ What % of products were returned by customers after successful delivery? What are the top 10 reasons? Which are the top 100 such products by their return percentage?
  ▪ What % of orders were cancelled before delivery? What are the top 10 reasons? Which are the top 100 such products by their cancellation percentage?
  ▪ Which are the top 100 products whose sales are increasing month on month, so that they can be promoted further to increase sales?
Purpose of Data Collection
• To extract/ derive value from the collected data
• Value could be categorized into various dimensions
Temporal Dimension: 1. Near-Term Value, 2. Long-Term Value
Delivery Dimension: 1. Direct Value, 2. Indirect Value
• The temporal dimension is about deriving value from data over time.
  1. Near-Term Value is about deriving value from current/existing data, e.g. how many students are absent in today's class; average weekly attendance; Google Maps giving the best possible route based on current traffic.
  2. Long-Term Value is about deriving value from historical data, e.g. comparing student attendance for the entire college with respect to the first, middle, and last periods over the past three years.
• The delivery dimension is about how the value is delivered to the organization.
  1. Direct Value - data provides value to your organization by feeding automated systems, e.g. Amazon recommending products that are suitable for you.
  2. Indirect Value (Human-Mediated Value) - data provides value to the organization by influencing people's decisions through analysis of the data via visualization (reports, graphs), e.g. generating various graphs and statistics that help humans make decisions; Amazon product rating statistics influencing customers to purchase.
Why is deriving value from data difficult? What are the bottlenecks to deriving value from data?

The four main bottlenecks are Time, Data Quality, Data Resources, and Knowledge.

Time:
1. Deriving value out of data is an iterative process; it takes several iterations to get the desired output.
2. It also takes time to gather a sufficient volume of data to improve the accuracy of the value that can be derived from it.

Data Quality:
1. Good quality data will give good value.

Data Resources:
1. Lack of an adequate number of resources working on data compared to the high number of data-value consumers in the organization.

Knowledge:
1. Inadequate skills/knowledge of the (IT) people working on data to meet the expectations of business analysts; this leads to multiple feedback cycles between IT staff and the business.
Data Projects
There is a natural progression of data projects: from near-term answering of known questions, to longer-term value analyses, and finally to production systems that use data in an automated way. Underlying this progression is the movement of data through three main data stages: raw, refined, and production.

Near-term answering of known questions (Raw stage) → Long-term value analysis for humans to make decisions (Refined stage) → Long-term value from automated decision making (Production stage)

• When it comes to delivering production value from your data, there are two critical points to consider.
• First, data can produce insights that are not useful to you and your business. These insights might not be actionable, or their potential impact might be too small to warrant a change in existing processes.
• Second, empower the people who know your business priorities to explore your data; these exploratory analytics efforts should be as efficient and fast as possible.

These two critical points lead to the importance of having data wrangling tools, which speed up data exploration activities for the business and reduce dependency on IT people.
Data Workflow in Data Projects

A minority of data projects will end in the raw or production stages. The majority will end in the refined stage. Projects ending in the refined stage will add indirect value by delivering insights and models that drive better decisions.

Figure 2.2 depicts the natural progression of data projects and the actions that take place at each stage of a data project.
Data Workflow in Data Projects … continued
Raw Data Stage:
1. Ingest Data
• As part of the ingest data action, data is collected from various sources; these sources are often in different formats.
• The collected data is stored in one central location, with or without transforming it into a structured format.
• Schema-on-read ingestion: in this style of data ingestion, data is not transformed into a usable data structure until it is needed for further analysis.
• Schema-on-write ingestion: in this style of data ingestion, data is transformed into a usable data structure while it is collected and stored in the central location. This style of ingestion is used in data warehouse projects. (See the sketch below.)
• Ingesting data triggers two additional actions, both related to the creation of generic and custom metadata.
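The contrast between the two ingestion styles can be sketched in Python. The file format, field names (roll_number, status) and parsing choices below are illustrative assumptions, not part of the course material.

```python
# A minimal sketch contrasting the two ingestion styles described above.
import json
import pandas as pd

# Schema-on-write: transform records into a fixed, typed structure at ingest time.
def ingest_schema_on_write(lines):
    rows = [json.loads(line) for line in lines]
    df = pd.DataFrame(rows)
    df["roll_number"] = df["roll_number"].astype(int)   # enforce the schema now
    df["status"] = df["status"].astype("category")
    return df

# Schema-on-read: store the raw lines untouched; apply a structure only when analysing.
def ingest_schema_on_read(lines):
    return list(lines)                                   # keep raw; parse later as needed

raw = ['{"roll_number": "101", "status": "Y"}', '{"roll_number": "102", "status": "N"}']
print(ingest_schema_on_write(raw))
print(ingest_schema_on_read(raw))
```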
Data Workflow in Data Projects … continued
Raw Data Stage:
2. Describe Data (Generate Generic Meta Data)
• Before ingesting data, it is necessary to understand generic data characteristics such as data types, length, format, etc.; describing these general characteristics is generic metadata.
3. Assess Data Utility (Generate Custom Meta Data)
• This involves assessing data utility (usefulness) in order to streamline the data ingestion process. Not all data sources follow the same generic metadata; there will be some exceptions that go into the custom metadata description.
Data Workflow in Data Projects … continued
Data Source 1 (Date 07-Aug-23):
Student Name | Roll Number | Attendance
Ram   | 101 | Y
Ganga | 102 | N
John  | 103 | Y

Data Source 2 (Date 07-Aug-23):
Roll Number | Attendance
101 | Y
102 | N
103 | Y

Data Source 3 (Date 07-Aug-23):
Roll Number | Attendance
102 | N
*Captured only absentees' details

All three sources are ingested into a central repository.

Generic Meta Data:
1. Three data sources
2. Data is ingested daily
3. There are three fields that are important: date, roll number and attendance
4. The date field is represented as a header field
5. Roll number and attendance are in tabular format
6. First column: roll number, data type integer
7. Second column: attendance, data type char(1), values Y, N

Custom Meta Data:
1. Data Source 1 is a CSV file, one line for each record
2. Data Source 2 is in table format; ignore the first column, the 2nd column is the roll number and the 3rd column is the attendance field
3. Data Source 3 is in table format; it has only the roll numbers of students who are absent
Data Workflow in Data Projects … continued
Refined Data Stage:
1. Design & Refine Data
• In the raw data stage, ingestion involves minimal data transformation—just
enough to comply with the syntactic constraints of the data storage system. By
contrast, the act of designing and preparing “refined” data often involves a
significant amount of transformation. These transformations are often guided
by the range of analyses planned on the data.
Example transformations of the data (see the sketch below):
• Excluding the salaries of top management, whose salaries are significantly higher (outliers) than those of the remaining employees, when finding the average employee salary.
• Converting the salaries of all employees, across branches spread across the world, into a single currency to find the average employee salary.
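A minimal pandas sketch of the two example transformations above. The salary figures, currency codes and conversion rates are invented for illustration.

```python
import pandas as pd

salaries = pd.DataFrame({
    "employee": ["A", "B", "C", "D"],
    "salary":   [50_000, 55_000, 60_000, 900_000],   # D is top management (outlier)
    "currency": ["INR", "USD", "EUR", "USD"],
})

# 1. Exclude top-management outliers before computing the average salary.
regular = salaries[salaries["salary"] < 500_000]
print("average (excluding outliers):", regular["salary"].mean())

# 2. Convert all salaries into a single currency (here: USD) using assumed rates.
rates_to_usd = {"INR": 0.012, "USD": 1.0, "EUR": 1.1}
salaries["salary_usd"] = salaries["salary"] * salaries["currency"].map(rates_to_usd)
print("average in USD:", salaries["salary_usd"].mean())
```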
Data Workflow in Data Projects … continued
Refined Data Stage:
2. Generate Ad-hoc Reports
• In addition to data refinement, action items such as ad-hoc reports/dashboards are generated to derive value from the data. These reports are primarily used to find insights from historical data for decision making.
Examples:
• Weekly/monthly/yearly sales figures to analyse sales trends
• Customer rating before and after a product quality enhancement
3. Prototype Modelling
• The action items that take place as part of prototype modelling aim to predict future events or the correlation effect of some data items on others.
Data Workflow in Data Projects … continued
Production Data Stage:
1. Optimize Data
• After enough maturity is reached in the refined data stage, the insights are translated into directly actionable (optimized) data for decision-making systems, without human intervention.
• Example: an intelligent inventory management system automatically raises purchase orders to refill the inventory, considering all possible scenarios.
2. Regular Reporting
• Unlike ad-hoc reporting, where human involvement is needed, regular reports are generated that provide direct value for humans to make decisions without further analysis.
3. Data Products and Services
• Feed the data directly into production systems/services for making decisions, for speed and accuracy.
Understanding Various Data Quality Issues

Data Quality – Data Inconsistency

Customer Table:
Customer Id | Customer (First Name, Last Name) | Birth Place | City Short Name | Birth Date
1 | John, Millar | California | CA    | 01-JUN-1985
2 | Ram, Koni    | California | Cali. | 06-31-1990

Customer Complaints Table:
Customer Id | Customer First Name | Customer Last Name | Customer Complaint
1 | Millar | John | Product1 Fragile
2 | Ram    | Koni | Product2 is expensive

Inconsistencies in this example: the city short name (CA vs Cali.), the birth date formats (01-JUN-1985 vs 06-31-1990), and the customer first and last names (John Millar vs Millar John).

Data inconsistency means data is not uniformly represented within a data source or across data sources.

During data capture, free-flow text input forms and manual entries are major causes of data inconsistency. Using modern data input controls such as drop-downs and date controls (e.g. separate fields for Cust. First Name, Cust. Last Name, Birth Place, Date of Birth) can prevent these kinds of problems.

This solution prevents future data from becoming inconsistent, but how do we fix historical data?
Data Quality – Data Inconsistency (fixing historical data)

Customer Table:
Cust Id | Customer (First, Last Name) | Birth Place | City Short Name | Birth Date | Date_Modified
1 | John, Millar | California | CA    | 01-JUN-1985 | 01-JUN-2006 13:30:03
2 | Ram, Koni    | California | Cali. | 06-12-1990  | 11-JUL-2015 19:45:50

Customer Complaints Table:
Cust Id | Customer First Name | Customer Last Name | Customer Complaint   | Date-Modified
1 | Millar | John | Product1 Fragile      | 01-JUN-2005 9:30:03
2 | Ram    | Koni | Product2 is expensive | 11-JUL-2016 19:30:53

The customer first and last names are inconsistent between the tables. The John Millar record in the Customer table is more recent (modified in 2006) than the corresponding record in the Customer Complaints table (modified in 2005).

Fixing data inconsistency in historical data: replace inconsistent data based on the latest information available across the data sources (see the sketch below). In the above example, the John Millar record in the Customer table is the latest (year 2006); based on this information, the customer name in the Customer Complaints table is corrected to First Name "John", Last Name "Millar".
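A minimal pandas sketch of this fix: for each customer, the name from whichever table has the most recent Date_Modified overwrites the other table. Column names follow the slide; the data is abbreviated to one illustrative row.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1], "first_name": ["John"], "last_name": ["Millar"],
    "date_modified": pd.to_datetime(["2006-06-01 13:30:03"]),
})
complaints = pd.DataFrame({
    "cust_id": [1], "first_name": ["Millar"], "last_name": ["John"],
    "complaint": ["Product1 Fragile"],
    "date_modified": pd.to_datetime(["2005-06-01 09:30:03"]),
})

merged = complaints.merge(customers, on="cust_id", suffixes=("_cmp", "_cust"))
newer = merged["date_modified_cust"] > merged["date_modified_cmp"]
# Where the customer table is newer, its version of the name wins.
complaints.loc[newer, ["first_name", "last_name"]] = (
    merged.loc[newer, ["first_name_cust", "last_name_cust"]].to_numpy()
)
print(complaints)
```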
Data Quality – Missing Values (Incomplete Data)
Customer Table:
Customer Id | Customer (First Name, Last Name) | Birth Place | City Short Name | Birth Date
1 | John, Millar | California | (missing) | 01-JUN-1985
2 | Ram, Koni    | (missing)  | Cali.     | 06-31-1990

Sensor Data Table:
Sensor # | Date & Time       | Sensor Value
1 | 01-Jan-2023 9:15  | 40
1 | 01-Jan-2023 10:00 | (missing)
1 | 01-Jan-2023 10:00 | 38
1 | 01-Jan-2023 10:00 | 42

During data capture, stop using free-flow text input forms; instead use modern data input controls such as drop-downs and date controls, and additionally make required fields mandatory (e.g. Cust. First Name*, Cust. Last Name*, Birth Place*, Date of Birth*).

This solution prevents missing values in future data by having mandatory fields, but how do we fix historical data?
Data Quality – Missing Values (Incomplete Data)

How do we fix missing data that comes from sensors such as IoT devices? Since sensor data arrives automatically, there is no input form where a field can be made mandatory to prevent missing values.

Sensor Data Table:
Sensor # | Date & Time       | Sensor Value
1 | 01-Jan-2023 9:15  | 40
1 | 01-Jan-2023 10:00 | (missing)
1 | 01-Jan-2023 10:00 | 38
1 | 01-Jan-2023 10:00 | 42

In such scenarios, how do we fix this missing data? Some solutions are (see the sketch below):
1. Use statistical methods (mean, median, mode)
2. Ignore (delete) those records
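A minimal pandas sketch of the two options above for the sensor table: fill the missing reading with a statistic (mean or median), or drop the incomplete record.

```python
import pandas as pd

sensor = pd.DataFrame({
    "sensor": [1, 1, 1, 1],
    "timestamp": ["09:15", "10:00", "10:00", "10:00"],
    "value": [40, None, 38, 42],
})

filled_mean   = sensor.assign(value=sensor["value"].fillna(sensor["value"].mean()))
filled_median = sensor.assign(value=sensor["value"].fillna(sensor["value"].median()))
dropped       = sensor.dropna(subset=["value"])   # simply ignore the incomplete record

print(filled_mean)
print(dropped)
```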
Data Quality – Missing Values (Incomplete Data)

How do we fix missing values (and also inconsistent values) for non-numerical data?
Answer: use company reference data based on unique keys.

Customer Table:
Customer Id | Customer (First Name, Last Name) | Birth Place | City Short Name
1 | John, Millar | California | CA
2 | Ram, Koni    | Houston    | (missing)
3 | Ahmed, H     | California | Cali.

Reference Table:
Id | City       | City Short Name
1  | California | CA
2  | Houston    | HOU

Customer Table after fixing the missing and inconsistent values using the reference data:
Customer Id | Customer (First Name, Last Name) | Birth Place | City Short Name
1 | John, Millar | California | CA
2 | Ram, Koni    | Houston    | HOU
3 | Ahmed, H     | California | CA

NOTE: If there is no reference data and no other source to fix them, we might delete the rows with missing data.

Fill missing values and inconsistent values using reference data.
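A minimal pandas sketch of the reference-data fix above, joining on the city name and letting the reference value win.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "birth_place": ["California", "Houston", "California"],
    "city_short": ["CA", None, "Cali."],          # missing and inconsistent values
})
reference = pd.DataFrame({
    "birth_place": ["California", "Houston"],
    "city_short_ref": ["CA", "HOU"],
})

fixed = customers.merge(reference, on="birth_place", how="left")
fixed["city_short"] = fixed["city_short_ref"]      # the reference data wins
fixed = fixed.drop(columns="city_short_ref")
# With no reference match and no other source, rows could instead be dropped:
# fixed = fixed.dropna(subset=["city_short"])
print(fixed)
```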


Data Quality – Data Inaccuracy

Customer Table:
Cust Id | Customer (First Name, Last Name) | Residence City | City Short Name
1 | John, Millar | Delhi   | DH
2 | Ram, Koni    | Chennai | CH

Sales Table:
Cust Id | Product Id | Product Name | Customer City
1 | 0010   | Samsung AC | Hyderabad
2 | 0X1024 | LG Mixer   | Chennai

The customer city is different in the two tables.

Bank ATM Table:
Cust Id | ATM Id  | ATM Location        | Transaction Type | Transaction Amt | Balance
1 | ATM0023 | Mokila, ICFAI       | Db | 100 | 900
2 | ATM0045 | Mokila, Town Centre | Cr | 50  | 600

Customer Savings Account Table:
Cust Id | Transaction Type | Transaction Amt | Balance
1 | Db | 0  | 1000
2 | Cr | 50 | 600

For customer id 1, the balance amount is different in the Bank ATM table and the Savings Account table.

Begin Transaction
  Update Bank ATM Table: Balance = Balance - 100
  Update Customer Account Table: Balance = Balance - 100
End Transaction

Data inaccuracy occurs when the same data is available in more than one source and a change made in some places is not made in the others. Data inaccuracy can be prevented by using database transactions and updating the data in all sources as one atomic operation (see the sketch below), or by synchronizing the data at the earliest opportunity.

For historical data, inaccuracies are corrected using the latest available information and reference data.
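A minimal sketch of the "one atomic operation" idea, using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from a real banking system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atm_balance (cust_id INTEGER, balance INTEGER)")
conn.execute("CREATE TABLE savings_balance (cust_id INTEGER, balance INTEGER)")
conn.execute("INSERT INTO atm_balance VALUES (1, 1000)")
conn.execute("INSERT INTO savings_balance VALUES (1, 1000)")

try:
    with conn:  # BEGIN ... COMMIT; rolls back automatically if either update fails
        conn.execute("UPDATE atm_balance SET balance = balance - 100 WHERE cust_id = 1")
        conn.execute("UPDATE savings_balance SET balance = balance - 100 WHERE cust_id = 1")
except sqlite3.Error:
    pass  # the transaction was rolled back; both tables stay consistent

print(conn.execute("SELECT balance FROM atm_balance").fetchone(),
      conn.execute("SELECT balance FROM savings_balance").fetchone())
```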
Data Quality - Outliers

Sensor Data Table:
Sensor # | Date & Time       | Sensor Value
1 | 01-Jan-2023 9:15  | 40
1 | 01-Jan-2023 10:15 | 45
1 | 01-Jan-2023 10:30 | 38
1 | 01-Jan-2023 10:35 | 42
1 | 01-Jan-2023 10:40 | 100  (outlier)
1 | 01-Jan-2023 10:45 | 120  (outlier)
1 | 01-Jan-2023 10:50 | 35
1 | 01-Jan-2023 10:55 | 37
1 | 01-Jan-2023 11:00 | 39

Outliers are data values that differ greatly from the other values: they are either very large or very small compared to the rest of the values in the data set. These values significantly impact the insights derived from data analysis.

Real-world causes of outliers:
1. Manual errors
   1. Data entry errors
   2. Mixing wrong data sets
2. Programmatic errors
3. Intentional errors
4. Natural errors
   1. Sensor malfunctions
   2. Exceptional events/facts
Outlier Types
• Global or Univariate Outliers
  • Extreme data points for a single variable
  • Income of citizens in a country: the outliers are the incomes of ultra-rich people
  • Traffic flow at a junction: the outliers are the values measured during exceptional situations such as a curfew, road repair, or a long weekend
• Multivariate or Contextual Outliers
  • Extreme data points in the context of another data point
  • A temperature of 30°C in Delhi in the peak winter month: 30°C on its own is not an outlier, but when we look at 30°C with respect to the "winter" data point it becomes an outlier. These kinds of outliers are contextual outliers.
• Collective Outliers
  • A subset of data points that is completely different from the entire dataset.
Collective Outliers Example
Month | City 1 Sales Data | City 2 Sales Data | City 3 Sales Data
1 | Rs. 50000 | Rs. 90000  | Rs. 51000
2 | Rs. 55000 | Rs. 100000 | Rs. 50000
3 | Rs. 52000 | Rs. 85000  | Rs. 55000
4 | Rs. 57000 | Rs. 99000  | Rs. 59000
5 | Rs. 60000 | Rs. 80000  | Rs. 62000

In this example, City 2's sales (a subset of the entire data for City 1, City 2 and City 3) are extremely high, forming a collective outlier relative to the rest of the cities.
Identifying Outliers
• With small data set it is easier to detect by visualization
• 24, 28, 32, 5, 40, 45, 35, 120, 130
• In this data set it is very easy to identify 5, 120 and 130 outliers either they
are too small or too large compared to the rest of the values
• How do we identify outliers when the data set is very large, or how do we find them programmatically?
• Box Plots
• Z-score method
Box Plot
Box plotting of the data set: 24, 28, 32, 5, 40, 45, 35, 120, 130
Box Plot Method Steps:
1. Sort the data set
2. Identify the Q1 position (Quartile 1 (Q1) is the 25th percentile: 25% of values are lower than this number)
3. Identify the Q3 position (Quartile 3 (Q3) is the 75th percentile: 75% of values are lower than this number)
4. Identify the Inter-Quartile Range (IQR): the difference between the Q3 and Q1 position values
5. Lower Boundary (LB) = (Q1 position value) - (1.5 * IQR) (data values lower than LB are outliers)
6. Higher Boundary (HB) = (Q3 position value) + (1.5 * IQR) (data values higher than HB are outliers)

Values below LB are lower-boundary outliers; values above HB are higher-boundary outliers.

Step 1: sorted data set: 5 24 28 32 35 40 45 120 130
Step 2: Q1 position = (total values in the data set) * 25% = 9 * (1/4) = 2.25, i.e. take the 2nd or 3rd position, but always take the lower position; in our example it is the 2nd position.
Step 3: Q3 position = (total values in the data set) * 75% = 9 * (3/4) = 6.75, i.e. take the 6th or 7th position, but always take the higher position; in our example it is the 7th position.
Step 4: IQR = (7th position value) - (2nd position value) = (45 - 24) = 21
Step 5: Lower Boundary (LB) = (24 - 1.5*21) = -7.5 (values lower than -7.5 are considered outliers; in our example there are no LB outliers)
Step 6: Higher Boundary (HB) = (45 + 1.5*21) = 76.5 (values higher than 76.5 are considered outliers; in our example 120 and 130 are HB outliers)
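A minimal Python sketch of the box-plot (IQR) rule worked through above. The positional quartile rule below follows this slide's steps; library functions such as numpy.percentile interpolate and may give slightly different Q1/Q3 values.

```python
import math

def iqr_outliers(values):
    data = sorted(values)
    n = len(data)
    q1 = data[math.floor(n * 0.25) - 1]   # lower of the two candidate positions (1-based)
    q3 = data[math.ceil(n * 0.75) - 1]    # higher of the two candidate positions (1-based)
    iqr = q3 - q1
    lb = q1 - 1.5 * iqr
    hb = q3 + 1.5 * iqr
    return lb, hb, [x for x in data if x < lb or x > hb]

print(iqr_outliers([24, 28, 32, 5, 40, 45, 35, 120, 130]))
# -> (-7.5, 76.5, [120, 130]), matching the worked example above
```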
Z-Score Method to identify Outliers
Z-score: Indicates how many standard deviations a data point is from the
mean.
Threshold: Typically, a z-score above 3 or below -3 is considered an
outlier.
Formula for Z-score = (x - mean) / (standard deviation), where x is a data value

Data values: 5, 24, 28, 32, 35, 40, 45, 120, 130
Z-scores:  -1.12, -0.66, -0.56, -0.46, -0.39, -0.27, -0.15, 1.68, 1.93

In the above example no z-score value is above 3 or below -3, so there are no outliers in the data set as per the z-score method.
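A minimal Python sketch of the z-score rule above. It uses the population standard deviation, which reproduces the z-scores shown on this slide, and the |z| > 3 threshold stated above.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)        # population standard deviation
    zscores = [(x - mean) / std for x in values]
    outliers = [x for x, z in zip(values, zscores) if abs(z) > threshold]
    return zscores, outliers

zs, out = zscore_outliers([5, 24, 28, 32, 35, 40, 45, 120, 130])
print([round(z, 2) for z in zs])   # -> [-1.12, -0.66, ..., 1.68, 1.93]
print(out)                         # -> [] (no outliers at the 3-sigma threshold)
```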
When to use box plot and z-score methods
• Box Plot method
• Best for: Small to medium-sized datasets, non-parametric data (data that
doesn’t necessarily follow a normal distribution).
• May not be as precise for identifying outliers in normally distributed data
• Z-Score method
• Best for: Large datasets, parametric data (data that follows a normal
distribution).
• It will be influenced by extreme values, which may distort the mean and
standard deviation.
Exercise
Data Set -4, 12, 28, 32, 5, 40, 45, 35, 120, 100, 75
1. For the above data set find LB and HB outliers using Box Plot and Z-
Score methods?
2. Write Python Program to find LB and HB outliers using Box Plot and
Z-Score methods?
Data Noise
• Data noise is a random error in a measured variable; it is not as significant as a data outlier.
• Techniques to fix
  • Smoothing through binning
    • Smoothing by bin mean value
    • Smoothing by nearest bin boundary value
• Reasons to fix
  • Fixing complex datasets makes them more interpretable by simplifying the data representation and highlighting the most important features
  • Identifying meaningful patterns or trends
  • Enhanced visualization
  • Forecasting and predictions
Smoothing Technique through Binning by Bin Means Method
1. Original data: 24, 28, 15, 32, 55, 40, 45, 35, 62
Binning by bin means method:
Step 1: Sort the values: 15, 24, 28, 32, 35, 40, 45, 55, 62
Step 2: Select the bin size; consider 3 here.
Step 3: Partition the data set into equal-frequency bins based on the bin size
  • Bin1 - 15, 24, 28 (bin mean: 22.3)
  • Bin2 - 32, 35, 40 (bin mean: 35.6)
  • Bin3 - 45, 55, 62 (bin mean: 54)
Step 4: Replace each bin value with the corresponding bin mean value
Step 5: Construct the new data set from the new bin values, i.e. 22.3, 22.3, 22.3, 35.6, 35.6, 35.6, 54, 54, 54

Before Smoothing          After Smoothing by Bin Means
Bin1 – 15, 24, 28         Bin1 – 22.3, 22.3, 22.3
Bin2 – 32, 35, 40         Bin2 – 35.6, 35.6, 35.6
Bin3 – 45, 55, 62         Bin3 – 54, 54, 54
Smoothing Technique through Binning by Nearest Bin Boundary Values
1. Original data: 24, 28, 15, 32, 55, 40, 45, 35, 62
Binning by bin boundary values method:
Step 1: Sort the values: 15, 24, 28, 32, 35, 40, 45, 55, 62
Step 2: Select the bin size; consider bin size 3 in this example.
Step 3: Partition the data set into equal-frequency bins based on the bin size
  • Bin1 - 15, 24, 28 (bin boundary values are 15, 28)
  • Bin2 - 32, 35, 40 (bin boundary values are 32, 40)
  • Bin3 - 45, 55, 62 (bin boundary values are 45, 62)
Step 4: Replace each bin value between the boundary values with the nearest/closest boundary value
Step 5: Construct the new data set from the new bin values, i.e. 15, 28, 28, 32, 32, 40, 45, 62, 62

Before Smoothing          After Smoothing by Nearest Bin Boundary Value
Bin1 – 15, 24, 28         Bin1 – 15, 28, 28
Bin2 – 32, 35, 40         Bin2 – 32, 32, 40
Bin3 – 45, 55, 62         Bin3 – 45, 62, 62
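A minimal Python sketch of both equal-frequency binning smoothers described on the previous two slides. It assumes the data length is a multiple of the bin size, as in these examples; bin means are rounded to one decimal, so the second bin mean prints as 35.7 (approximately the 35.6 shown in the worked example), and ties between boundaries go to the lower boundary.

```python
def make_bins(values, bin_size):
    data = sorted(values)
    return [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

def smooth_by_bin_means(values, bin_size):
    return [round(sum(b) / len(b), 1) for b in make_bins(values, bin_size) for _ in b]

def smooth_by_bin_boundaries(values, bin_size):
    out = []
    for b in make_bins(values, bin_size):
        low, high = b[0], b[-1]
        # replace each value with whichever boundary of its bin is closer
        out += [low if (x - low) <= (high - x) else high for x in b]
    return out

data = [24, 28, 15, 32, 55, 40, 45, 35, 62]
print(smooth_by_bin_means(data, 3))       # [22.3, 22.3, 22.3, 35.7, 35.7, 35.7, 54.0, 54.0, 54.0]
print(smooth_by_bin_boundaries(data, 3))  # [15, 28, 28, 32, 32, 40, 45, 62, 62]
```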
How do we measure data noise in the data set?
One simple method is to use the standard deviation or variance: the higher the standard deviation or variance, the higher the data noise.
Original data:
• 15, 24, 28, 32, 35, 40, 45, 55, 62
• Its standard deviation is 14.08
Data after applying binning by bin mean:
• 22.3, 22.3, 22.3, 35.6, 35.6, 35.6, 54, 54, 54
• Its standard deviation is 13
Data after applying binning by nearest bin boundary value:
• 15, 28, 28, 32, 32, 40, 45, 62, 62
• Its standard deviation is 14.93
Exercises
1. Perform data smoothing using Binning method by considering the
mean value on the below data set with bin size 4. Include any
remaining values in the last bin
25, 35, 15, 85, 65, 44, 27, 76, 58, 22
2. Perform data smoothing using Binning method by considering the
nearest boundary value on the below data set with bin size 3.
Include any remaining values in the last bin
55, 35, 15, 65, 65, 24, 57, 76, 56, 45
3. Identify data outliers using Box Plot method on the below data set.
-10, -5, 2, 6, 8, 12, 28, 38, 45, 40, 22
Data Wrangling
• Data wrangling is a generic phrase capturing the range of tasks involved in preparing data for analysis.
• The term "data wrangling" was popularized by the academic team behind Trifacta, a company known for its data preparation tools.
• Joe Hellerstein, Jeffrey Heer, and Sean Kandel founded Trifacta, a software company specializing in data wrangling and exploratory analysis, in 2012.
• Joseph M. Hellerstein is the Jim Gray Professor of Computer Science at UC Berkeley.
• Trifacta's platform is designed to help analysts explore, transform, and enrich raw data into clean and structured formats. It leverages techniques in machine learning, data visualization, human-computer interaction, and parallel processing to enable non-technical users to work with large datasets. The platform includes various products like Wrangler, Wrangler Pro, and Wrangler Enterprise.
• In 2022, Trifacta was acquired by Alteryx and is now known as Alteryx Designer Cloud.
What is Data Wrangling
• Data wrangling enables non-technical users/analysts to work with large datasets and perform data pre-processing tasks by exploring the data through visualizations and low-code environments.
• It typically involves several tasks such as data collection, data cleaning, data structuring, data integration, data transformation and data enrichment.
• Data wrangling is crucial for data scientists and analysts because it ensures the dataset is accurate, complete, and properly structured before applying machine learning models, visualization, or statistical analysis.
• Tools with a low-code environment
  • Alteryx Designer from the Alteryx vendor
  • SageMaker Python SDK from the AWS vendor
  • Tableau from the Salesforce vendor
  • Microsoft Power BI
• Tools with a medium- to high-code environment
  • ggplot2, an open-source data visualization package for the statistical programming language R
  • The Python language

NOTE: many companies offer data wrangling tools bundled with their data analytics tools.
Data Profiling
• Data profiling is the process of examining, analyzing, and summarizing data sets to understand their structure, content, and quality.
• It involves collecting statistical information about the data, which can help identify patterns, anomalies, relationships, and potential issues within the data. Fundamentally, profiling guides the transformations to consider for improving the data quality.
  e.g. df.head(), df.info(), df.describe() are data profiling statements in Python (see the sketch below)
• There are two types of data profiling, as shown in Table 3-2.

Importance of Data Profiling:
• Improves Data Quality: data profiling helps to identify and fix data issues early in the data project; it ensures high-quality data is available for analysis or reporting.
• Facilitates Data Understanding: profiling provides insights into data patterns and distributions, making it easier for data analysts or scientists to understand the dataset before applying advanced analysis techniques.
• Supports Data Integration: it helps detect inconsistencies or misalignments before merging data from multiple sources.
• Aids in Decision-Making: proper profiling gives organizations high-quality data for decision making.
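A minimal pandas sketch of the profiling statements mentioned above, run on a tiny illustrative DataFrame (the column names echo the example table on the next slide).

```python
import pandas as pd

df = pd.DataFrame({
    "lastName": ["Arnault", "Musk", "Bezos"],
    "finalWorth": [211000, 180000, 114000],
    "birthYear": [1949, 1971, 1964],
})

print(df.head())                       # peek at the first rows
df.info()                              # column names, data types, non-null counts
print(df.describe())                   # count, mean, std, min/max, quartiles
print(df["lastName"].value_counts())   # value distribution of a single column
```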
Data Profiling

Example data for profiling:
LastName | FirstName | finalWorth | Birth Year | gdp_country
Arnault | Bernard | 211000 | 1949 | $2,715,518,274,227
Musk    | Elon    | 180000 | 1971 | $21,427,700,000,000
Bezos   | Jeff    | 114000 | 1964 | $21,427,700,000,000
Ellison | Larry   | 107000 | 1944 | $21,427,700,000,000
Buffett | Warren  | 106000 | 1930 | $21,427,700,000,000
Gates   | Bill    | 104000 | 1955 | $21,427,700,000,000

Individual value profiling examines single values:
1. Syntax (data types, etc.)
2. Semantics (context-specific validity)
Set-based profiling examines the set of values in a column as a whole (e.g. the distribution of the repeated gdp_country values above).
Data Wrangling Dynamics (Actions/ Steps)

Access → Profile → Transform → Publish
(Figure: Data Wrangling Basic Steps)

• Access
  • This step involves identifying the right data sources and getting permission to access the data
  • Moving/storing data into a location that is convenient for analysis
• Profile
  • Understanding the content and quality of the data; fundamentally, profiling guides or provides feedback to the transform step
• Transform
  • Manipulating the structure, granularity, accuracy, temporality, and scope of the data to align with the analysis goals
• Publish
  • Making the enriched data available to data analysts, along with information about the sequence of transformation actions performed, profiling statistics and visualization reports. This gives the data analyst confidence in the data quality.
Data Wrangling Dynamics (Actions/ Steps)

Access → Profile → Transform → Publish
(Figure: Data Wrangling Basic Steps)

• Of the 4 basic data wrangling steps shown above, people spend the majority of their time doing profiling and transformations until they reach the required data quality.

Core transformation types


Additional Aspects of Data Wrangling Dynamics
• Data wrangling is a time-consuming and iterative process, so efficiency in doing data wrangling is very important; this becomes even more important as data size increases.
• There are two additional aspects to the dynamics of data wrangling that are vital to finding efficiencies (time, computing power and memory) in data wrangling practice.
• Data wrangling employs two methods to improve its efficiency without compromising on the required data quality:
  • Subsetting data
  • Sampling data
Additional Aspects of Data Wrangling Dynamics
• Subsetting Data
• Purpose: To extract specific rows, columns, or both from a dataset based on certain
criteria. Data subsetting focuses on specific portions of the data.
• Method: You might subset data to focus on a particular group, time period, or set of
variables.
• Example: If you have a dataset of customer transactions, you might subset it to
include only transactions from the last month or only transactions from a specific
region.
Billionaires Statistics Dataset (example rows):
rank | category | personName | country | city | lastName | firstName | finalWorth | birthYear | birthMonth | birthDay
1  | Fashion & Retail      | Bernard Arnault           | France        | Paris       | Arnault   | Bernard | 211000 | 1949 | 3  | 5
2  | Automotive            | Elon Musk                 | United States | Austin      | Musk      | Elon    | 180000 | 1971 | 6  | 28
3  | Technology            | Jeff Bezos                | United States | Medina      | Bezos     | Jeff    | 114000 | 1964 | 1  | 12
4  | Technology            | Larry Ellison             | United States | Lanai       | Ellison   | Larry   | 107000 | 1944 | 8  | 17
5  | Finance & Investments | Warren Buffett            | United States | Omaha       | Buffett   | Warren  | 106000 | 1930 | 8  | 30
6  | Technology            | Bill Gates                | United States | Medina      | Gates     | Bill    | 104000 | 1955 | 10 | 28
7  | Media & Entertainment | Michael Bloomberg         | United States | New York    | Bloomberg | Michael | 94500  | 1942 | 2  | 14
8  | Telecom               | Carlos Slim Helu & family | Mexico        | Mexico City | Slim Helu | Carlos  | 93000  | 1940 | 1  | 28
9  | Diversified           | Mukesh Ambani             | India         | Mumbai      | Ambani    | Mukesh  | 83400  | 1957 | 4  | 19
10 | Technology            | Steve Ballmer             | United States | Hunts Point | Ballmer   | Steve   | 80700  | 1956 | 3  | 24

Example subset: keep only columns 4-9 (country, city, lastName, firstName, finalWorth, birthYear) and only the United States records.
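A minimal pandas sketch of this subset (keep a few columns, keep only the United States records). The file name billionaires.csv and the exact column names are assumptions for illustration.

```python
import pandas as pd

billionaires = pd.read_csv("billionaires.csv")   # assumed columns as in the table above

subset = billionaires.loc[
    billionaires["country"] == "United States",
    ["country", "city", "lastName", "firstName", "finalWorth", "birthYear"],
]
print(subset.head())
```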
Additional Aspects of Data Wrangling Dynamics
• Sampling Data
• Purpose: To select a representative subset of the data, often to make inferences about the
entire dataset or to reduce the size of the data for analysis. Data sampling aims to create a
smaller but representative subset of original data
• Method: Sampling can be random (each data point has an equal chance of being selected) or
systematic (following a specific pattern or rule).
• Example: If you have a large dataset of survey responses, you might sample 10% of the
responses to analyze trends without processing the entire dataset.
Billionaires Statistics Dataset (same table as shown on the previous slide).
Example sample: the original data set sampled by taking one record from each country.
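A minimal pandas sketch of the two sampling ideas above: a 10% random sample, and one record per country. The same assumed billionaires.csv file from the previous sketch is used.

```python
import pandas as pd

billionaires = pd.read_csv("billionaires.csv")

random_sample = billionaires.sample(frac=0.10, random_state=42)   # 10% random sample
one_per_country = billionaires.groupby("country").head(1)         # one record per country

print(random_sample.shape, one_per_country.shape)
```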
Team Structure, Roles and Responsibilities in Data Projects
Raw Data Stage: Ingest Data → Describe Generic Meta Data → Describe Custom Meta Data
Refined Data Stage: Design and Refine Data → Generate Ad-hoc Reports → Build Prototype Models
Optimize Data Stage: Optimize Data → Generate Regular Reports → Build Data Products and Services

(Another representation of Figure 2.2, the data workflow of data projects)

Head of Data Projects
• Responsible for ensuring that sufficient tools and technologies are available for data projects and that they work coherently to benefit the team

Data Architect
• Responsible for data integration, solution design and data security

Data Scientist
• Predominantly works in the Optimize Data Stage; supports data analysts in building prototype models and interacts with business analysts to understand their needs

Data Analyst
• Predominantly works in the Design and Refine Data Stage and in building prototype models; guides the Data Engineer

Data Engineer
• Predominantly works in the Raw Data Stage; responsibilities include data collection, programming to enrich data, and building ad-hoc reports

(Figure: Typical Team Structure, Roles and Responsibilities)
Unit 1 - Questions
1. Mention one industry and provide at least three insights from data that would help resolve that industry's pain areas.
2. What are the two dimensions that can be used to measure the data value? Explain each one
briefly.
3. Explain four bottlenecks that industries face when deriving value from data?
4. Why are data wrangling tools important for extracting production value from your data?
5. Draw a typical workflow for data projects in the industry?
6. Explain stages of data projects?
7. What distinguishes Schema-on-Read from Schema-on-Write data ingestion methods?
8. Why is generating metadata necessary?
9. List at least five common data quality issues faced in most data projects?
10. Give at least three examples to prevent data quality issues?
11. Provide at least three typical data quality issues and explain how to fix them?
12. What is data profiling, and why is it important?
13. Which three data wrangling tools would you recommend?
14. How would you describe the typical team structure in data projects, and what are the various
roles and responsibilities?
Unit 2
Introduction to Data Visualization
Need of Visualization, Block Diagram of Visualization, Visualization Stages.
Reference Books
Information Visualization: Perception for Design, Colin Ware, 2nd edition, Morgan Kaufmann, 2004: Ch. 1
Visualizing Data: Exploring and Explaining Data with the Processing Environment, Ben Fry, O'Reilly, 1st edition, 2008: Ch. 1
What is Visualization?
• Representing data in graphical way that can be easily understood by
the human cognitive system
• Externalization of an internal construct of the mind such as an image,
thought or data in the form of graphical representation (to support
decision making) is called visualization.
What is Human Cognitive System?
• It helps to perceive environment around us, learn from experiences,
anticipate outcomes, and adapt to changing circumstances.
How Human Cognitive system and
Visualization help/ influence each other
How visualization helps the cognitive system:
• Pattern Recognition: Our brains are excellent at recognizing patterns, especially
when data is presented visually.
• Reduced Cognitive Load: Visualizations simplify complex data, reducing the
cognitive effort required to understand it.
• Enhanced Memory Retention: Visual information is often easier to remember
than text. One picture is worth 100 words.
• How the cognitive system helps visualization:
• Our cognitive abilities enable us to interact with visual data, exploring different
dimensions and perspectives. This interaction can lead to deeper insights and
better decision-making
• By understanding Cognitive system, we can design better visual representations
further simplifying our understanding
Advantages of Visualization
• Intuitive Understanding: Visuals are often easier to comprehend than raw
numbers or text. Data visualizations allow people—even those who aren’t
comfortable with math—to quickly grasp patterns and insights.
• Simplifies Complexity: Visualizations simplify complex data by revealing
patterns, trends, and outliers. They help you see the forest for the trees,
making it easier to explore data structures and identify clusters.
• Better Decision-Making: When data is presented visually, decision-makers
can make informed choices more effectively.
• Improved Communication: Sharing data through visualizations ensures
everyone is on the same page. Instead of struggling with raw data,
colleagues can easily interpret and discuss insights from well-designed
visualizations.
Advantages of Visualization … Continued
Option 1:
• Given 5 years of Nifty 50 data, i.e. 5 * 12 * 22 = 1320 data points, as a table of 1320 rows:

S. No | Date       | Nifty 50 Value
1     | 22-08-1999 | 10,848
2     | 23-08-1999 | 10,850
3     | 24-08-1999 | 10,900
...   | 07-10-2021 | 18,780
...   | 10-06-2022 | 15,700
1320  | 21-08-2024 | 24,770

Option 2:
• Given the same 5 years of data in graphical form

Analyse the visualization advantages mentioned on the previous page with respect to these two options.
Visualization is a blend of both science and
art
• Visualization being described as a blend of science and art, reflects its
dual nature, where technical precision and creative expression
intersect to effectively communicate complex information.
• Visualization is both a science—ensuring data is represented
accurately and logically—and an art—engaging the audience and
making the data relatable. When these two elements are balanced,
visualization becomes a powerful tool for both understanding and
communication.
• The art makes the data compelling, while the science ensures it is
trustworthy and actionable.
Block Diagram of Visualization and Visualization Steps

Block diagram of visualization (from data to the viewer):
Data Gathering → Data Transformation → Visual Mapping (graphical tools) → View Manipulation / Data Exploration (by the viewer)

Visualization Steps:
1. Collection and storage of data
2. Preprocessing stage to transform the data into a form that is easier to understand; data reduction to reveal selected aspects/features
3. Mapping the data to a selected visual representation (bar chart, line graph, pie chart, or some other form)
4. Finally, human perception
Unit 2 - Questions
1. What is Visualization?
2. What are the advantages of visualization?
3. Draw block diagram of visualization and explain these steps?
Unit 3
Perceptual Processing
A Model of Perceptual Processing. Data and Image models: Types of Data, Coding Words and Images, The Nature of Language,
Visual and Spoken Language.

Reference Books
Information Visualization: Perception for Design, Colin Ware, 2nd edition, Morgan Kaufmann, 2004: Ch. 1, Ch. 8, Ch. 9
Visualization vs Visual Perception

Environment → Visual Perception → Visualization

The complex external world (data, scenery, graphs, maps, etc.) is perceived visually: we build a mental image of its key features. Visualization then reproduces that mental image in a form that is easier to understand, communicate and use for decision making.

Understanding visual perceptual mechanisms is fundamental to providing visualization designers with sound design principles, i.e. knowing how we see things helps create good design rules for visualization designers.

Visualization is an iterative process with continuous improvement.


Three stage model of visual information processing
Visual Scene

Stage 1: Parallel processing to extract low-level properties of the visual scene.

Stage 1 characteristics:
• Rapid parallel processing of the entire scene by a large array of neurons in the eye and the primary visual cortex
• Bottom-up processing: every dot of the scene is processed and elements are extracted from the visual field
• Extraction of features: orientation, color, texture and movement
• The extracted information either fades away (in less than a second) or goes to short-term memory for further processing, depending on what we attend to
• Stage 1 is the basis for understanding the visual salience (noticeable, important) of elements in the scene.
  Visual salience refers to the distinct perceptual quality that makes certain elements in a visual scene stand out and grab our attention.
Three stage model of visual information processing
Visual Scene

Stage 2: Pattern Perception

Stage 2 characteristics:
• Slower, serial processing
• Top-down attention pulls out patterns from the feature maps (extracted in stage 1)
• Divides the visual field into patterns such as regions, regions of the same color or texture, simple patterns, continuous contours, and patterns of motion
• A small number of patterns (one to three) become "bound" and are held for a second or two
• The brain follows two paths in pattern perception; this is called the two-visual-system theory. One system is for locomotion and action, the second system is for static object identification.
Three stage model of visual information processing
Visual Scene

Stage 3: Visual Working Memory

The highest level of perception is the objects held in visual working memory at this stage.
Stage 3 characteristics:
• Constructs a sequence of visual queries (visual queries are often task-driven and goal-oriented)
• Constructs a few objects from the patterns found in stage 2, as well as from information stored in long-term memory related to the task; these objects provide the answers to the visual queries

Example: if we use a road map to look for a route, the visual query triggers a search for connected red contours (representing major highways) between two visual symbols (representing cities).
Power of Vision

Approximate relative bandwidth of the five senses for perceiving the external environment (according to a Dutch scientist): Sight 1000X, Touch 100X, Hearing & Smell 10X, Taste 1X. Sight/visual media has about 1000 times more bandwidth than the sense of taste.

• An enormous amount of data comes into contact with the eye unconsciously; the eye is very sensitive to recognizing colors, shapes, patterns and their variations - the language of the eye.
• When we combine the language of the eye with the language of the mind (such as numbers, words, and concepts), both languages work together to enhance each other, aiding human perception.
Key learnings from the three-stage model of visual information processing
• Both the eye and the mind are fed an enormous amount of information, consciously and unconsciously, but this information fades away unless the visual details stand out and grab attention.
• The brain follows two paths for visual perception: one for static information and the other for information in motion (animation). Therefore, data visualization can use static elements, animation, or both to grab the audience's attention.
• The mind perceives visual information based on objects we already know (stored in long-term memory). Therefore, the objects we use in visual forms should be familiar to the audience.
Sample Visual Representation

Bar chart: Military Budget ($bn) by country (USA, China, UK, Japan, France, Germany, Saudi Arabia, Russia, India, S. Korea). The USA's budget is 607 $bn; the other nine countries' budgets range from 61 down to 25 $bn.

(A second small figure shows a customer order with Line Item 1 (Pens), Line Item 2 (Pencil) and Line Item 3 (Eraser).)

Conclusions that we can make:
- The USA has a very large military budget, and it is more than the combined military budget of the 9 other big countries.

Country and its budget (entity and its attribute value)


Types of Data
The goal of visualization research is to transform data into a perceptually efficient visual format, so understanding the types of data and choosing the right format for each type is crucial.

1. Entities - objects of interest that we wish to visualize (e.g. people, places, events, etc.)
2. Relationships - structures and patterns that relate entities to one another (e.g. "part-of", "supervisor-subordinate", "parent-sibling", etc.)
3. Attributes - a property of an entity or relationship that cannot be thought of independently (e.g. the color of an apple)
4. Attribute dimensions - an attribute can have one or more dimensions (a person's weight is one-dimensional; a journey has two dimensions: a. distance travelled from the origin, b. direction of travel)
5. Numbers - used to measure the qualities of attributes
   1. Categorical data - classification of data into groups (like fruits into apple and banana groups)
   2. Integer data - like an ordinal class in that it is discrete and ordered; a discrete value is a whole number and it has a natural order
   3. Real number data - represents attribute properties such as intervals (the gap between two values, e.g. the gap between a bus's start time and end time) and ratios (object A is half the size of object B, i.e. 0.5 times)
Types of Data … Continued
6. Uncertainty data (e.g. flipping a coin; fuzzy values like high, low, medium; brightness of a color)
7. Operational data (mathematical operations, merging, inverting, splitting a single entity into several entities, etc.)
8. Meta data - data about data (it describes data entities and attributes, who collected the data and when, the quality of the data, etc.). Metadata serves several critical purposes, including data understanding and interpretation, data discovery and searchability, and data quality and trustworthiness, especially in large-scale projects or when collaborating across multiple teams or systems.

Important aspects of relationships:
❖ Sometimes relationships are provided explicitly
❖ Many times relationships are discovered; discovering relationships is the very purpose of visualization

(Sample figures: a visual of a merge operation - reversing the arrow could represent a split operation; uncertainty depicted as probability.)
Types of Data … Continued

Figure 1. Country budgets (bar chart of Military Budget ($bn) for S. Korea, India, Russia, Saudi Arabia, Germany, France, Japan, UK, USA, China)
Figure 2. Country budgets (an alternative visual representation of the same data)

Questions (for Figure 1):
1. What are the entities?
2. How are we representing them visually?
3. What are the entity attributes?
4. How are we representing them visually?

Answers for Figure 1:
1. Country names
2. Rectangles (bars)
3. The military budget of each country
4. The size of the rectangle

Answers for Figure 2 (same questions): left for the reader to fill in.
Sample Visual Representation

Original data: Country | Military Budget | GDP (for USA, Singapore, India)
Enriched data adds: % of Military Budget to its GDP

Data Types:
1. Entity - Country
2. Attribute 1 - Military Budget
3. Attribute 2 - GDP
4. Enriched value (Attribute 3) - % of Military Budget to country GDP; this enriched data gives a new perspective

Conclusions that we can make:
- The USA is not spending much on the military compared to other countries when budgets are compared relative to their GDPs.

Country and its budget (entity and its attribute value)


Sample Visual Representation

Country and the number of its soldiers.

Conclusions that we can make:
- China has the most soldiers.
- The perspective changes when we look at the same data relative to each country's population in the second chart (blue bars).
Coding Words and Images
❖ Dual Coding Theory, proposed by Allan Paivio in 1971, suggests that our brains process information through two
distinct channels: verbal and non-verbal (visual) and stores in both working and long-term memory.
❖ Paivio called mental representation of visual information as imagens, and verbal information as logogens
❖ Logogens include auditory information, mathematical symbols, natural language and music
❖ Imagens include graphics, abstract and figurative imagery
❖ This theory is particularly relevant in information visualization because it highlights how combining text and visuals
can enhance understanding and memory retention.
Coding Words and Images
Key Points of Dual Coding Theory:
• Two Channels: Information is processed through verbal (words) and non-verbal (images) channels independently
but are linked to form associative connections.
• Enhanced Memory: When information is encoded both verbally and visually, it creates a “double memory trace,”
making it easier to retrieve the information later.
• Cognitive Load Reduction: Using both channels can reduce cognitive load, making it easier to process and
understand complex information.
• Improved Learning: Presenting information in both text and visual formats can improve learning outcomes by
providing multiple pathways for encoding and retrieval.

Application in Information Visualization:


• Charts and Graphs: Combining textual explanations with visual data representations helps users understand
trends and patterns more effectively.
The Nature of Language
Basic purpose of Visualization
• Presenting something that we understood to other people with the goal of convincing them

How do we do this?
• The cognitive processes of interpreting data and explaining data are very different; both should work together for effective understanding and presentation.
• Our goal is to explore different ways that images and words can be used to create a narrative structure, for example by integrating visual and verbal materials in multimedia presentations.
The Nature of Language
Nature of Language:
The “nature of language” refers to the fundamental characteristics and properties that define language as a system
of communication. Here are some key aspects:

• Symbolic: Language uses symbols (words, sounds, gestures) to represent objects, actions, ideas, and feelings.
These symbols are arbitrary, meaning there is no inherent connection between the symbol and what it
represents.
• Rule-Governed: Language operates according to a set of rules, including grammar and syntax, which dictate how
symbols can be combined to create meaningful expressions.
• Dynamic: Language is constantly evolving. New words are created, meanings change, and grammatical structures
can shift over time.
• Cultural: Language is deeply embedded in culture. It reflects and influences cultural norms, values, and practices.
• Innate and Learned: According to theories like Chomsky’s Universal Grammar, humans have an innate capacity
for language, but the specific language we learn is influenced by our environment.
• Ambiguous and Contextual: Words and sentences can have multiple meanings, and context plays a crucial role in
interpreting them
What are the Key takeaways (learnings) from Nature of Language
for data visualization?
• Symbolic: Language uses symbols (words, sounds, gestures) to represent objects, actions, ideas, and feelings. These symbols are arbitrary, meaning there is no inherent connection between the symbol and what it represents.
  Takeaways: Data visualization can include pictures, words, sounds through audio, and gestures through motion (animation).
• Rule-Governed: Language operates according to a set of rules, including grammar and syntax, which dictate how symbols can be combined to create meaningful expressions.
  Takeaways: Visualization too follows a grammar, which may or may not be prescribed but is generally practiced.
• Dynamic: Language is constantly evolving. New words are created, meanings change, and grammatical structures can shift over time.
  Takeaways: Visualization too will evolve, with new kinds of graphical objects.
What are the Key takeaways from Nature of Language for data
visualization? … Continued
• Cultural: Language is deeply embedded in culture. It reflects and influences cultural norms, values, and practices.
Takeaways: Visualization to follow audience cultural aspects for better presentation
• Innate and Learned: According to theories like Chomsky’s Universal Grammar, humans have an innate capacity
for language, but the specific language we learn is influenced by our environment.
Takeaways: Visualization to follow audience cultural aspects as well as their environment
• Ambiguous and Contextual: Words and sentences can have multiple meanings, and context plays a crucial role in
interpreting them
Takeaways: Visualization should not create any ambiguity, so it is necessary to provide the necessary context
through various means (words, audio, animation, brief explanations, etc.)

Refer next slide for examples of data visualization aspects from nature of language.
Examples of data visualization aspects from
nature of language
Figure 1 and Figure 2: bar charts of country population (millions) vs armed forces (million active personnel)
for Russia, United States, China, India and Saudi Arabia. Figure 1 shows bars only; Figure 2 ("Country
Population vs Armed Forces") adds a chart title, value axes and axis legends.

• Figure 1 uses only one symbol (bars); Figure 2 enhances our understanding by using both visual and word
symbols, and we can further enhance it through audio and animation.
• Figure 2 follows certain established visual grammar for depicting the x and y axes, chart title, axis legends, etc.
• Figure 2 provides the cultural aspect, i.e. all the words are in English so that an English-speaking audience can understand.
• Figure 2 eliminates ambiguity by choosing the same color codes for the bar symbols and their respective axes and legends.
Visual and Spoken Language
❖ People interact with one another using words and spoken language far more than with images and diagrams.
❖ Spoken and written language is ubiquitous; it is the most detailed, complete, and commonly used system of
symbols we have. For this reason alone, visual techniques are preferred only when they offer a clear advantage.
❖ That said, images have clear advantages ("a picture is worth a thousand words") for certain kinds of
information, and a combination of images and words will often be best.
❖ A visualization designer has the task of deciding whether to represent information visually, using words, or
both; other related choices involve the selection of static or moving images, and spoken or written text.

Choices available to visual designers to represent data:
• Visual representation: static images or moving images
• Verbal representation: text/words or audio
• Both visual and verbal representations combined

When to use images vs words separately vs in combination?


Guidelines for - When to use images vs words
separately vs in combination?
❖ G1.1 Use methods based on natural language (as opposed to visual
patterns) to express a sequence of steps
Example sequence of steps expressed in natural language:
1. Convert text Source 1 into Table 1
2. Convert XML Source 2 into Table 2
3. Create Table 3 by removing columns 2, 5, 8 from Table 2
4. Merge Table 3 and Table 1 into Table 4
(The accompanying diagram shows the same sequence visually: Source 1 → Table 1; XML Source 2 → Table 2;
Table 2 → Table 3 by removing columns 2, 5, 8; Table 1 + Table 3 → Table 4.)
Guidelines for - When to use images vs words
separately vs in combination?
❖ G1.2 Graphical elements, rather than words, should be used to show structural relationships, such as links
between entities and groups of entities.
(Example diagram: an organization chart with Chancellor → Vice-Chancellor → Business School Director and
Engineering School Director → Division Head 1 and Division Head 2.)
Guidelines for - When to use images vs words
separately vs in combination?
❖ G1.3 Use methods based on natural language (as opposed to visual patterns) to
represent abstract concepts
❖Examples: “Freedom”, “Justice”, “Courage”, etc.
❖ G1.4 To represent complex information, separate out components according to
which medium is most efficient for each display
❖ G1.5 Place explanatory text as close as possible to the related parts of the
diagram and use graphical linking methods (arrows, colors, callouts)
❖ G1.6 When choosing between displaying text or speaking, prefer spoken
information to accompany images.
❖ G1.7 If spoken words are to be integrated with visual information, the relevant
parts of the visualization should be highlighted just before the start of the
accompanying speech segment
❖ G1.8 Use some form of deixis, such as pointing with a hand or an arrow or
timely highlighting to link spoken words and images
Unit 3 - Questions
• Explain 3 stage visual information processing?
• What are the key learnings from 3 stage visual information processing?
• List four choices that are available for visual designer to represent the data?
• List at least 5 guidelines in using images vs text or combination?
• List the entities and attributes depicted in Charts 1, 2 and 3 below and explain how they are
depicted visually
Chart 2
Unit 3 – Questions … continued
Chart 3
Unit 4
Understanding the data quality
Assessing data fit, assessing data integrity, improving data quality, Cleaning, Transforming, and Augmenting data

Reference Books
1. Reference R5 – Carlo Batini, Monica Scannapieco, Data Quality: Concepts, Methodologies and Techniques
(Data-Centric Systems and Applications), Springer, 2006
2. Reference R6 – Jack E. Olson, Data Quality: The Accuracy Dimension (The Morgan Kaufmann Series in Data
Management Systems), Morgan Kaufmann, 2003
Refer slides 23-43 for data quality
issues – Prevention and Fixing
These slides are part of Unit 4 Syllabus
What is the difference between
Data inconsistency and Data
Inaccuracy?
Assessing data fit
A classification of data quality assessment methods from
16th International conference on information quality.
General DQ Problems
❖ Data Completeness
❖ Data Accuracy
❖ Data Currency (Timeliness)
❖ Data Consistency
❖ Duplicate data
❖ Data Outliers
Assessing data fit - Examples of context independent
data errors
• Context-independent incorrect values (also syntax problems): a spelling error, for example, is easy to detect
and correct.
• Inconsistent data values that are semantically the same (e.g. two different product codes used for the same
product): not easy to detect and correct if both product codes are valid for the company. This is context
specific, based on company rules.
Data Assessing Methods
• Column analysis method
• Number of (unique) values
• Missing Values
• Minimum, Maximum, Total and standard deviation, Median and Average for
numerical columns
• Incorrect data formats
• Incorrect values (like unrealistic values)
• Inconsistent values (mixing CA and cali. for city)

Exercise: Which Python commands can be used for column analysis? (A minimal sketch follows below.)
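A minimal pandas sketch of column analysis; the DataFrame, column names and values below are invented
purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical customer data used only to illustrate column analysis
df = pd.DataFrame({
    "city": ["Hyderabad", "Chennai", "CA", "cali.", None],
    "age":  [25, 32, np.nan, 41, 230],   # 230 is an unrealistic value
})

# Number of (unique) values
print(df["city"].nunique(), df["city"].unique())

# Missing values per column
print(df.isna().sum())

# Minimum, maximum, total, standard deviation, median and average for numerical columns
print(df["age"].min(), df["age"].max(), df["age"].sum())
print(df["age"].std(), df["age"].median(), df["age"].mean())

# Spot unrealistic values and inconsistent categories
print(df["age"].describe())
print(df["city"].value_counts(dropna=False))   # exposes 'CA' vs 'cali.' style inconsistencies
```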
Data Assessing Methods
• Cross-domain analysis method
• This technique will be applied to data integration scenarios with several
source systems. It enables the identification of redundant data across tables.
• Cross-domain analysis is done across columns from different tables to identify
the percentage of values within the columns indicating that they might hold
the same data.
Customer Profile Data from CRM Application
Customer Id First Name Last Name DOB City AADHAR Number Phone
1 John Reed 01-Jan-40 Hyderabad 1000324567 67890348

Customer Billing Data from Invoicing Application


Customer Id First Name Last Name AADHAR Number Prev Meter Reading Current Meter Reading Bill Amount Phone
4 John Red 1000324567 100 150 750 67890358
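A possible pandas sketch of cross-domain analysis, computing the percentage of values shared between
columns of two tables; the tiny CRM/billing frames below are illustrative assumptions based on the slide:

```python
import pandas as pd

# Illustrative CRM and billing extracts (column names assumed from the slide)
crm = pd.DataFrame({"Customer Id": [1], "AADHAR Number": [1000324567], "Phone": [67890348]})
billing = pd.DataFrame({"Customer Id": [4], "AADHAR Number": [1000324567], "Phone": [67890358]})

def overlap_pct(col_a: pd.Series, col_b: pd.Series) -> float:
    """Percentage of values in col_a that also appear in col_b."""
    if len(col_a) == 0:
        return 0.0
    return 100.0 * col_a.isin(col_b).mean()

# Cross-domain analysis: compare every column pair across the two tables
for a in crm.columns:
    for b in billing.columns:
        pct = overlap_pct(crm[a], billing[b])
        if pct > 0:
            print(f"CRM.{a} vs Billing.{b}: {pct:.0f}% overlap")
```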
Data Assessing Methods
• Data validation method
• Validation of column values with reference data.
• Reference data examples
• Company defined city, state, country abbreviations
• Pin codes defined by the government
• Semantic profiling method
• Verify column values for pre-defined rules by company/ government etc.,
• Example
• a rule for the columns AGE and LIFE_STAGE could be: IF AGE < 18 THEN
LIFE_STAGE=’CHILD’,
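A minimal pandas sketch combining reference-data validation with the slide's semantic rule; the DataFrame,
reference set and column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical data; the rule comes from the slide: IF AGE < 18 THEN LIFE_STAGE = 'CHILD'
df = pd.DataFrame({
    "AGE":        [10, 25, 16],
    "LIFE_STAGE": ["CHILD", "ADULT", "ADULT"],
    "CITY":       ["HYD", "CHN", "XXX"],
})

# Data validation: column values must come from company-defined reference data
valid_city_codes = {"HYD", "CHN"}
invalid_cities = df[~df["CITY"].isin(valid_city_codes)]

# Semantic profiling: flag rows that violate the predefined rule
rule_violations = df[(df["AGE"] < 18) & (df["LIFE_STAGE"] != "CHILD")]

print(invalid_cities)
print(rule_violations)
```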
How do we say that data is fit for the purpose?
(Assessing data is fit)
Data is assessed and verified for its fitness to the problem by verifying it against predefined threshold values
for each data quality metric.
Example predefined threshold values:
a. Missing values < 90%
b. Accuracy > 99%
After publishing the data, if the data consumers are not happy, the predefined threshold values will be
modified accordingly.

Figure. Data Wrangling Basic Steps: Access → Profile → Transform → Publish
Figure. Data Wrangling Basic Steps with assessing whether data is fit for purpose: Define threshold values for
each data quality type → Access → Profile/assess → Verify predefined threshold values → Transform → Publish
Assessing data integrity
Data integrity refers to the accuracy, consistency, and reliability of data
throughout its (Data) lifecycle. It ensures that data remains unchanged
and uncorrupted during operations such as transfer, storage, and
retrieval.
Here are some key aspects of data integrity:
• Accuracy: Data should be correct and free from errors.
• Consistency: Data should be consistent within and across the data sources
• Completeness: All required data should be present.
• Timeliness: Data should be up-to-date and available when needed.
• Reliability: Data should be trustworthy.

When it comes to data integrity, the reliability of data is most crucial.

Data will be unreliable due to human errors and/or fraudulent activity. Most of the time it is too late for the
data analyst to prevent data corruption/falsification, and it is not easy to correct data corruption once it
happens. However, industry/government follows certain mechanisms to prevent/correct data corruption.
How to enforce data integrity?
• Data integrity enforcement methods prevent human errors in relational databases. (Preventing data errors
is better than cleansing the data after it reaches the data store.)
• “No Duplicates” constraint on a column in the table
• “Foreign Key rules”
• Mandatory Fields
• Column data types, lengths
• Business rules
Foreign key and other constraints on the table ensure that valid data is inserted into the table. Constraints
detect errors (shown in red text on the original slide) during data insertion and prevent such data from being
stored in the table.

PASSPORT TABLE (column constraints): Passport_Id = PK_Key (serial #); First Name = Mandatory;
Middle Name = Optional; Last Name = Mandatory; DOB = Format dd-mm-yyyy; City Code = FK_CITY_TABLE;
Prev. Passport = No Duplicates.
Sample rows:
1 | Ramana | Venkat | NULL | 22-06-1967 | 3 | T0043234
2 | Josula | Ramana |      | 18436      | 1 | T0043234

CITY TABLE (columns): City Code = PK_Key; City Name; Country.
Rows:
1 | Hyderabad | India
2 | Chennai   | India
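Outside a relational database, similar constraint-style checks can be approximated in pandas before loading
data. A minimal sketch, loosely mirroring the slide's example tables (all names and values are illustrative):

```python
import pandas as pd

# Hypothetical data resembling the slide's PASSPORT and CITY tables
passport = pd.DataFrame({
    "passport_id":   [1, 2],
    "first_name":    ["Ramana", "Josula"],
    "last_name":     [None, "Ramana"],
    "dob":           ["22-06-1967", "18436"],
    "city_code":     [3, 1],
    "prev_passport": ["T0043234", "T0043234"],
})
city = pd.DataFrame({"city_code": [1, 2], "city_name": ["Hyderabad", "Chennai"]})

mandatory_violations = passport[passport["last_name"].isna()]                   # mandatory field
bad_dates  = passport[pd.to_datetime(passport["dob"], format="%d-%m-%Y",
                                     errors="coerce").isna()]                   # format dd-mm-yyyy
orphan_fk  = passport[~passport["city_code"].isin(city["city_code"])]           # foreign key rule
duplicates = passport[passport.duplicated(subset="prev_passport", keep=False)]  # no duplicates
```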
How to enforce data integrity?
• Data integrity enforcement methods to prevent fraudulent activities
• User (multi-factor) authentication (allowing the user to log in with a password, OTP or biometrics to
access the data source)
• User authorization (giving the user the right permissions to access data: CRUD permissions, i.e. C-Create,
R-Read, U-Update, D-Delete)
• Data Encryption while data is at rest and during the transmission
• Audit logs to store complete information of when, what and who changed the
data

Simple Audit Log Table


Date and Time User Id Table Name Column Name Previous Data Value Current Data Value
12-Sep-2024 12:00 AM Ramana PASSPORT DOB 22-06-1967 22-06-1966
12-Sep-2024 12:05 AM Ramana PASSPORT Prev. Passport P004323 V004323
Python exercise
Getting started with Python
• Use URL: https://www.python.org/downloads/
• Select windows version and install
• The exercises in this course require the pandas and numpy libraries
• Pandas
• The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis"
and was created by Wes McKinney in 2008.
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• To install Pandas, open a command prompt in your user directory (c:\users\<userid>\)
• Type pip install pandas (this will install both pandas and numpy)
• NumPy is a Python library used for working with arrays. It also has functions for working in the domains of
linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open
source project and you can use it freely. NumPy stands for Numerical Python.
Python representation of missing values
Data Type Python Representation
Datetime pd.NaT
String “”; None
Numerical data np.nan
Python exercises for data quality – Cleaning
Refer slides 23-43 for data quality theory

Write python code for the below data cleansing tasks:-

Missing Values:- (clue: dropna(), replace() api)


1. Drop rows where empty (Null) value is there in any of the columns
2. Drop only rows where empty value is there on specific column
3. Replace empty values in a column with a constant value
4. Replace empty values in a column with mean value of that column
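One possible solution sketch for these tasks, using an invented DataFrame (column names such as 'name',
'city' and 'age' are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing values
df = pd.DataFrame({
    "name": ["Ravi", "Sita", None, "John"],
    "city": ["Hyderabad", None, "Chennai", None],
    "age":  [25, np.nan, 41, 33],
})

df1 = df.dropna()                                        # 1. drop rows with a null in any column
df2 = df.dropna(subset=["city"])                         # 2. drop rows with a null in a specific column
df3 = df.fillna({"city": "Unknown"})                     # 3. replace nulls in a column with a constant
df4 = df.assign(age=df["age"].fillna(df["age"].mean()))  # 4. replace nulls with the column mean
# The slide's clue mentions replace(); df["age"].replace(np.nan, df["age"].mean()) also works.
```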
Python exercises for data quality – Cleaning
Refer slides 23-43 for theory on data quality

Write python code for the below data cleansing tasks:-

Duplicate Rows:- (clue: drop_duplicates(); subset parameter)


1. Drop duplicate rows, keeping first occurrence
2. Drop duplicate rows based on specific columns, keeping first
occurrence
3. Drop all duplicate rows without keeping first occurrence
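A possible solution sketch, again with invented data and column names:

```python
import pandas as pd

# Hypothetical data containing duplicate rows
df = pd.DataFrame({
    "first_name": ["John", "John", "Sita", "Sita"],
    "last_name":  ["Reed", "Reed", "Rao",  "Rao"],
    "city":       ["Hyderabad", "Hyderabad", "Chennai", "Delhi"],
})

d1 = df.drop_duplicates()                                    # 1. drop duplicates, keep first occurrence
d2 = df.drop_duplicates(subset=["first_name", "last_name"])  # 2. based on specific columns, keep first
d3 = df.drop_duplicates(keep=False)                          # 3. drop all duplicate rows entirely
```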
Python exercises for data quality – Cleaning
Refer slides 23-43 for theory on data quality

Write python code for the below data cleansing tasks:-

Inconsistent Dates:- (clue: to_datetime() api from pandas)


1. Create a data file with different date format such as ‘yyyy/mm/dd’, ‘dd-
mm-yyyy’, ‘<3 digit month>/YYYY/dd’ etc., and convert them into single
date format.
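One way to approach this task is to try each expected format in turn with pd.to_datetime and fill the gaps;
the formats and sample values below are assumptions:

```python
import pandas as pd

# Hypothetical column with dates written in mixed formats
s = pd.Series(["2024/01/15", "15-01-2024", "Jan/2024/15"])

# Try each expected format in turn; unparsed values stay NaT and are filled by the next attempt
parsed = pd.to_datetime(s, format="%Y/%m/%d", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(s, format="%d-%m-%Y", errors="coerce"))
parsed = parsed.fillna(pd.to_datetime(s, format="%b/%Y/%d", errors="coerce"))

# Normalise everything to a single date format
print(parsed.dt.strftime("%Y-%m-%d"))
```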

Inconsistent Reference values:- (clue: dataframe.merge(), drop(columns:),


rename(columns:) )
1. Replace column value with correct reference value based on key value.
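A possible sketch using the hinted merge/drop/rename APIs; the key and column names are illustrative
assumptions:

```python
import pandas as pd

# Hypothetical data: 'city_code' is the key; 'city' holds inconsistent spellings
data = pd.DataFrame({"customer_id": [1, 2], "city_code": [1, 2], "city": ["Hyd", "chennai"]})
ref = pd.DataFrame({"city_code": [1, 2], "city_name": ["Hyderabad", "Chennai"]})

# Replace the inconsistent column with the correct reference value based on the key
fixed = (data.merge(ref, on="city_code", how="left")
             .drop(columns=["city"])
             .rename(columns={"city_name": "city"}))
print(fixed)
```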
Python exercises for data quality – Cleaning
Refer slides 23-43 for theory on data quality

Write python code for the below data cleansing tasks:-

Inaccurate values:- (Clue: figure out how to do it)


1. You have customer data in two files with unique key and last
modified date columns, merge two files into one customer data file
keeping latest data out of two files.
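One possible approach: concatenate both files, sort by the last-modified date and keep the last record per
key. The column names ('customer_id', 'last_modified') and sample data are assumptions:

```python
import pandas as pd

# Hypothetical customer extracts sharing a unique key and a last-modified-date column
file_a = pd.DataFrame({
    "customer_id":   [1, 2],
    "city":          ["Hyderabad", "Chennai"],
    "last_modified": pd.to_datetime(["2024-01-10", "2024-03-05"]),
})
file_b = pd.DataFrame({
    "customer_id":   [1, 3],
    "city":          ["Secunderabad", "Delhi"],
    "last_modified": pd.to_datetime(["2024-02-20", "2024-01-01"]),
})

# Keep the most recently modified record for each unique key
merged = (pd.concat([file_a, file_b])
            .sort_values("last_modified")
            .drop_duplicates(subset="customer_id", keep="last")
            .sort_values("customer_id"))
print(merged)
```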
Data quality: Cleaning, Transforming, and Augmenting
data
Data Cleaning / Cleansing:
Data cleansing is about discovering and eliminating or correcting corrupt, incomplete, improperly formatted,
or replicated data within a dataset. The objective is to improve the overall quality of the data for better
results in decision making.
Examples: cleansing of 1. Missing values 2. Inconsistent values 3. Inaccurate values 4. Duplicate values
5. Outliers and noise

Data Transformation:
Bringing the data into a format/structure that is suitable for performing data mining. In general, data
transformation methods are also used for data cleansing. The objective is to convert data into a well-organized
state/format that simplifies the decision-making process.
Examples: 1. Splitting a column 2. Merging columns 3. New attribute construction 4. Scaling data (between 0
and 1) 5. Transforming to normally distributed data (mean = 0, std = 1) (Z-score method)

Data Augmentation:
Data augmentation denotes methods for supplementing so-called incomplete datasets by providing missing
data points in order to increase the dataset's analyzability, to simulate, and to make predictions of the
outcome. Data augmentation aims to increase the volume, quality and diversity of data. In many real-world
application settings it is often not feasible to obtain sufficient data to test decision-making processes and
machine learning models, so data augmentation is required.
Z-Score Method to identify Outliers
Z-score: Indicates how many standard deviations a data point is from the
mean.
Threshold: Typically, a z-score above 3 or below -3 is considered an
outlier.
Formula for Z-score: z = (x – mean) / (standard deviation), where x is a data value.

Data values (x): 5, 24, 28, 32, 35, 40, 45, 120, 130
Z-score values: -1.12, -0.66, -0.56, -0.46, -0.39, -0.27, -0.15, 1.68, 1.93

In the above example no z-score value is above 3 or below -3, so there are no outliers in the data set as per
the z-score method.
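A minimal NumPy sketch reproducing the slide's z-score calculation and outlier check:

```python
import numpy as np

x = np.array([5, 24, 28, 32, 35, 40, 45, 120, 130])
z = (x - x.mean()) / x.std()          # population standard deviation, matching the slide's values
print(np.round(z, 2))                  # [-1.12 -0.66 -0.56 -0.46 -0.39 -0.27 -0.15  1.68  1.93]

outliers = x[np.abs(z) > 3]            # threshold: |z| > 3
print(outliers)                        # empty array: no outliers by the z-score rule
```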
Augmented data vs Synthetic Data
• Augmented Data – Augmented data is generated by copying existing data and transforming those copies to
increase the diversity and amount of data in a given set
• Synthetic Data – Synthetic data is artificial data generated automatically by computer programs to mimic the
statistical properties of real-world data

Figure: original picture on the left; augmented pictures on the right (rotated, zoomed out).
Few More examples of Augmented data vs
Synthetic Data
Image Data Augmentation (Image Recognition Models)
• Flipping: Horizontally or vertically flipping images to create mirror images.
• Rotation: Rotating images by a certain degree to simulate different orientations.
• Scaling: Zooming in or out on images to create variations in size.
• Color Jittering: Randomly changing the brightness, contrast, and saturation of images.
• Cropping: Randomly cropping parts of images to focus on different areas.
Text Data Augmentation (Natural Language Processing – NLP Models)
• Synonym Replacement: Replacing words with their synonyms.
• Random Insertion: Inserting random words into sentences.
• Random Deletion: Deleting random words from sentences.
• Sentence Shuffling: Shuffling the order of sentences in a paragraph
Audio Data Augmentation (Voice Recognition Models)
• Noise Injection: Adding random noise to audio signals.
• Pitch Shifting: Changing the pitch of the audio signal.
• Speed Variation: Changing the speed of the audio playback.
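A minimal NumPy sketch of a few image augmentation operations from the list above, applied to a made-up
array standing in for an image (real pipelines would use an imaging library to load actual pictures):

```python
import numpy as np

# Hypothetical 8x8 grayscale "image" with values in [0, 1]
image = np.random.rand(8, 8)

flipped_h  = np.fliplr(image)                 # flipping: horizontal mirror image
flipped_v  = np.flipud(image)                 # flipping: vertical mirror image
rotated    = np.rot90(image)                  # rotation by 90 degrees
cropped    = image[1:7, 1:7]                  # cropping to focus on a different area
brightened = np.clip(image * 1.2, 0.0, 1.0)   # simple brightness (color) jitter
```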
Few More examples of Augmented data vs
Synthetic Data
Sales Data Augmentation
• Changing the purchase date, time, or slightly altering the quantity sold
• Price Fluctuation Simulation: Introduce variations in product prices to
simulate different market conditions and study their effects on sales volumes.
• Discount Impact Analysis: Generate data reflecting different discount
strategies to analyze their impact on customer purchasing behavior.
• Demographic Variation: Create synthetic customer profiles by varying
demographic attributes like age, income, and location to study their influence
on purchasing patterns.
System
Cognitive System
Digestive System
Education System
Computer System
Monolithic System
Client/Server System
Distributed System
Redundant System
Loosely coupled System
Tightly coupled System

A system comprises several components that interact and work together as a whole to perform its intended
function. If one of the components fails, the system does not function fully (100%).
Unit 5
Understanding Color
Trichromacy Theory, Color Measurement, Application of color in visualization 1,2,3, Representing
Quantity, static and moving pattern, Gestalt Laws, Pattern learning
Reference Books
1. Chapters 4 and 6 from Information Visualization: Perception for Design, Colin Ware, 2nd edition,
Morgan Kaufmann, 2004
Importance of Color
Human Vision is influenced by four parameters of
the objects.
1. Layout of objects in the space
2. Shape of objects
3. Motion of objects
4. Color of the objects
Even though people have color blindness (the inability to distinguish certain colors), their day-to-day life is
not impacted much due to their perception of the other three parameters of the objects.
~10% of the male and ~1% of the female population have color blindness.
The importance of color becomes evident when the other three parameters of the objects are identical in
space; in such cases, color alone distinguishes one object from another.
Figure: people's perception of colors based on the type of color blindness that they have.
Callouts: color-blind people depend on signal layout; color-blind people will pick rotten fruits along with good
ones. Do they eat rotten fruits?
Trichromacy Theory
The Trichromacy Theory [Tri –(three). Chromacy (Color)], explains how
humans perceive color. According to this theory, our eyes have three
types of cone cells (color receptors), each sensitive to different ranges
of wavelengths of light. These cones work together to allow us to see a
wide range of colors by combining the different levels of stimulation
from each type of cones.
• Red cones (L-cones): Sensitive to long wavelengths.
• Green cones (M-cones): Sensitive to medium wavelengths.
• Blue cones (S-cones): Sensitive to short wavelengths.
When light enters the eye, it stimulates these cones in varying
degrees, and the brain interprets these signals to produce the
perception of different colors.
Color vision deficiency (CVD)
• There are six to seven million cone cells in a human eye of which, 64%
are red sensitive, 33% are green sensitive and 3% are blue sensitive.
• The type of CVD depends on the type of faulty or missing cone cell.
Fig 4.2 shows normal cone sensitivity to color wavelengths.
Fig 4.3: the absence of one cone response reduces the three-dimensional response space to a
two-dimensional response space.
Types of cone (color receptor) faults:
1. The sensitivity of the cone cells is shifted towards a shorter or longer wavelength
2. Missing cone cells
Comparison of human sensitivity vs other animals
• Humans are trichromatic
• Dogs are dichromatic
• Birds are tetrachromatic

• As the number of cone/receptor types increases, we are able to see more color shades
Color Measurement
• The human eye can match any color with a mixture of no more than three primary lights (primaries). This is
called human colorimetry.
• An understanding of human colorimetry is essential for anyone who wishes to precisely reproduce colors
they saw on various media, such as printing, digital displays, textiles, and more.
• We can describe color by the following equation
• C ≡ rR + gG + bB
• Where C is the color to reproduce
• R, G, B are primary colors for RED, GREEN and BLUE
• r, g, and b represent amounts of each primary colors
• ≡ is used to denote perceptual match
NOTE: It is not mandatory to have RGB as primary colors. Paint industry uses
RYB as primary pigment colors to make paints. Printing industry (Digital
color Printers) uses CMY - Cyan, Magenta, Yellow as primary colors to do
color printing.
Application Areas of Color in Visualization
1. Color selection interface (method of selecting a color)
2. Color labeling
3. Color Sequences for map coding
4. Color reproduction (Transfer of color from one device to another)

Figures: color labeling; color sequence for continuously varying values (here °C)
Color Specification Interface and Color Spaces
• We use several software applications throughout the day; without them,
we can’t complete our daily tasks.
• Office Productivity applications such as –
• MS-Office (Word, Excel, PowerPoint, etc.,)
• Google Workspace (Google Docs, Sheets, and Slides etc.,)
• Apache OpenOffice (word, spreadsheet, presentations, graphics etc.,)
• Apple iWork (Pages, Numbers, and Keynote etc.,)
• Data Visualization Software
• Drawing applications
• Reporting, Dashboard and Business Intelligence applications
• Computer Aided Design (CAD)
• It is essential for these applications to let users select their own colors and
apply them to their text, charts, and graphs and make their content visually
appealing.
• Color Specification Interface and Color Spaces will help selecting required
color in the productivity and data visualization software
Application 1 : Color Specification Interface and
Color Spaces
There are a number of approaches to this user interface problem. Some of those approaches are:
1. Giving a set of controls to specify a point in a 3-dimensional color space
   a) Slider / input RGB values
   b) 2/3-dimensional color space
2. A set of color names to choose from
   1. Natural Color System (NCS)
3. A palette of pre-defined/custom colors

Fig. 1 Microsoft Paint color interface (user input in hex, input fields for primary colors, slider,
2-dimensional color space, 3-dimensional color space, palette of pre-defined colors, custom color palette)
Some more color interfaces

Power point Standard Color Interface Power point Custom Color Interface
• There is no one color interface that meets all user needs, so software Theme Colors
applications provide multiple options to the color interface to meet
customer needs.
• Eyedropper is to select any color on the page and drop it to the selected
object
• Theme colors are the colors that automatically change when color theme Eyedropper
of the document changes
Demonstration of Theme Colors
1. Power point (Design -> Colors)
2. Excel sheet (Page Layout -> Colors)
3. How to select color filters for color blindness people in windows 11
(Settings -> Accessibility -> Color Filters)
Application1 : Interfaces comparison

Slider / input of primary color values
  Advantages: 1. Easy to input/select primary color values to generate the required color
  Disadvantages: 1. People do not know which combination of color values will give which color
2/3-dimensional color spaces
  Advantages: 1. Very easy for users to see the color in the color space and select the desired color
  Disadvantages: 1. It is very hard to showcase all combinations of colors in the color spaces
Color names
  Advantages: 1. Easy for users to select colors by name
  Disadvantages: 1. It is very difficult to name 256*256*256 combinations of colors 2. It is very hard for
  people to remember colors by their names, except a few popular colors
Predefined color palette
  Advantages: 1. A color palette makes it easy for users to select a small set of colors
  Disadvantages: 1. It is very difficult to represent all possible colors in a color palette

There is no one easy method for selecting colors, so software applications provide multiple options so the user
can choose the one that is best suited to them.
Guidelines for designing color specification
interfaces
• G1.1: Consider laying out the red-green and yellow-blue channel information on a plane. Use a separate
control for specifying the dark-light dimension.
• G1.2: In an interface for designing visualization color schemes, consider providing a method for showing
colors against different backgrounds.
• G1.3: To support the use of easy-to-remember and consistent color codes, consider providing color palettes
for designers.
Figure: geometric color layout for Guideline 1.
Application 2 : Color for Labeling (Nominal Codes)
1. Colors can be extremely effective to
distinguish objects and object categories in
visualization
2. Colors are a better option for labeling than grayscale codes (white, dark grey, light grey, black), because
the number of gray levels that are easily perceivable by humans is very small.
3. A set of 12 colors is recommended for color labeling: red, green, yellow, blue, black, white, pink, cyan,
gray, orange, brown and purple. The first 6 would normally be selected before the 2nd set of six.

Color Labeling (nominal codes):
1. Nominal codes are unique codes that represent objects. These codes can be colors, patterns or
abbreviations.
2. In the charts above, colors are used to represent country names.
3. Nominal codes are not orderable, unlike numbers/alphabets; they must be remembered and recognized.
Perceptual factors (7) to be considered for
color labeling
1. Distinctness
• Degree of perceived difference between two colors
that are placed adjacent
2. Unique Hues
• Red, green, yellow, blue, black and white are natural
choices when small set of color codes are required.
3. Contrast with background
• Background colors can dramatically alter the color appearance of an object, making one color look like
another or giving a weak perception of the object. In order to perceive the object better, place a thin black
or white border around the color-coded object.
(Figure: adding a contrast border to the color dots (b) ensures clarity against all backgrounds; showing
color-coded lines can be problematic (c).)
Perceptual factors to be considered for color
labeling … Continued
4. Color Blindness:
• Because there is a substantial color-blind population, it may be desirable to use colors that can be
distinguished by everyone. The majority of color-blind people can't distinguish colors that differ in the
red-green direction. Almost everyone can distinguish colors that vary in the yellow-blue direction.
Perceptual factors to be considered for color
labeling … Continued
5. Number of colors:
• Although color coding is an excellent method for labeling, only a small number (5-10) of colors can be
rapidly perceived.
6. Field Size:
a) To avoid small-field color blindness (the inability to recognize the color of small fields when the color used
for those fields is less saturated), small color-coded areas should have strong, highly saturated colors for
maximum distinction. When large areas are color coded, the colors should be of low saturation and differ only
slightly from one another.
b) This enables small color-coded areas to be easily perceived against the background or larger areas.

• More brightness makes


the color whitish
• Less saturation makes
the color blackish
Perceptual factors to be considered for color
labeling … Continued
7. Field Size:
b) When highlighting text by changing the color, it is important to maintain
luminance contrast with the background
Application 2: Color for Labeling Guidelines
• G2.1: Consider using red, green, yellow, blue, black and white to
color code small number of symbols
• G2.2: For small color coded symbols, give importance to distinctness
with their background
• G2.3: If colored symbols are isoluminant (same brightness, hard to distinguish) with their background, use a
white or black border around the symbols to bring highly contrasting luminance with the background
• G2.4: To create set of symbol colors that can be distinguished by most
color blind people, ensure variation in the yellow-blue direction
Application 2: Color for Labeling Guidelines
• G2.5: Do not use more than ten colors (optimally 5) for coding symbols if reliable identification is required.
Also use the same color labeling for the categories across multiple visualizations so they are easy to
remember and recognize.
• G2.6: Use low-saturation colors for larger areas
• G2.7: Use high-saturation colors for small areas in the foreground and low-saturation colors for larger areas
in the background
• G2.8: When highlighting text by changing its color, it is important to maintain distinctness (luminance
contrast) with the background
Application 3 : Color sequences for data maps
Color sequences are to represent continuously varying map values.
Examples:
1. Weather maps
2. Astronomical radiation charts
3. Medical and other scientific applications
Application 3 : Color sequences for data maps
• Most of the time we use the physical spectrum of light (the visible spectrum – VIBGYOR) for a color
sequence, but it is not a perceptual sequence, because these colors are nominal indicators, not ordinal.
• Example
1. Give someone gray color chips of equal size but different saturation and ask them
to place them in order
2. Similarly, give colored chips of equal size and ask them to place them in order
• Figure 4.26 shows a picture of ozone concentrations both in gray scale and in a spectrum approximation.
• If our goal is to help the user understand forms in a data set, such as highs and lows, then a gray scale
color sequence is good.
• If our goal is to help the user understand quantification in the visualization, then a spectrum approximation
is good; it gives a lower error in reading the quantities.
Color – Options for Color Sequences
Gray Scale Color Option
Advantages:
1. The gray scale option is good for human perception of an ordered color sequence
2. It is good for showing forms (patterns) in a data set, such as highs/lows, spirals, ridges in weather maps
Disadvantages:
1. Difficult to judge/read quantities (values) based on color
2. Humans have a much higher error rate (17%) in reading quantities compared to spectrum colors (2.5%)
3. The gray scale option supports far fewer steps/intervals for visualization (refer to the picture below)

Spectrum Approximation Option
Advantages:
1. A spectrum approximation gives better perception when judging/reading quantities in the visualization
2. Humans naturally understand certain visualizations for semantic reasons (red for hot, green for cold, colors
in between for intermediate temperatures), so it avoids learning something new
3. It can represent 4 times more steps/intervals compared to gray scale (refer to the picture below)
Disadvantages:
1. The color sequence is confusing, obscuring and actively misleading when used to understand the forms of
the data
Spectrum Approximation - Options for choosing colors
1. Spiral color sequence: a sequence that cycles through a variety of colors, each one lighter than the
previous one; it spirals upward in color space.
Advantages:
• It has monotonicity in luminance (making it easy to perceive forms in the data)
• It reduces contrast-induced errors and enables accurate reading of quantities from a color key

Figures: spiral spectrum; normal circular spectrum
Spectrum Approximation - Options for choosing colors
2. Interval Pseudocolor Sequence:
• This method suggests using a uniform
color space in which equal perceptual
steps correspond to equal metric steps
of the characteristics being displayed.
• It means each unit of step in the
sequence represents an equal change
in magnitude of the characteristics of
the color being displayed.
Example use cases:
1. Contour maps
2. Topographic maps
How to calculate an equal-interval pseudocolor sequence
Example: creating equal perceptual steps in RGB
1. Define the Range: we want to create a colormap from RGB (255, 255, 128) to RGB (105, 1, 0).
2. Choose the Number of Steps: let's use 3 equal perceptual steps.
Steps:
1. Identify the start and end colors:
   1. Start color: RGB = (255, 255, 128)
   2. End color: RGB = (105, 1, 0)
2. Calculate the intervals: {3 equal steps - 1 = 2}
   1. Red interval: (105 - 255) / 2 = -75
   2. Green interval: (1 - 255) / 2 = -127
   3. Blue interval: (0 - 128) / 2 = -64
3. Generate the colors:
   1. Step 1: RGB = (255, 255, 128)
   2. Step 2: RGB = (255 - 75, 255 - 127, 128 - 64) = (180, 128, 64)
   3. Step 3: RGB = (105, 1, 0)
Fig 1. Map with a random color sequence: (255, 255, 128), (242, 168, 47), (105, 1, 0)
Fig 2. Equal-interval color sequence: (255, 255, 128), (180, 128, 64), (105, 1, 0)
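A small Python sketch of the same calculation, linearly interpolating RGB values in equal steps (note that
linear interpolation in RGB only approximates equal perceptual steps):

```python
import numpy as np

def equal_interval_sequence(start, end, steps):
    """Linearly interpolate RGB colors in equal steps from start to end."""
    start, end = np.array(start, dtype=float), np.array(end, dtype=float)
    return [tuple(int(round(v)) for v in start + (end - start) * i / (steps - 1))
            for i in range(steps)]

print(equal_interval_sequence((255, 255, 128), (105, 1, 0), 3))
# [(255, 255, 128), (180, 128, 64), (105, 1, 0)]
```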
Spectrum Approximation Options … Continued
3. Ratio Pseudocolors
A ratio sequence represents data that has a zero point and values both above and below zero.
By using ratio pseudocolors, you can enhance the interpretability of complex data, making it easier to identify
patterns, trends, and anomalies.
Use a neutral color for zero, a red sequence for negative values and a green sequence for positive values, as
shown in the picture on the right side.
• Medical Imaging: MRI scans, ratio images can help
identify areas with abnormal tissue properties.
• Geospatial Data: When visualizing environmental data,
such as vegetation indices or water quality, ratio
pseudocolors can effectively show variations.
• Heat Maps: Ratio pseudocolors are often used in heat
maps to show the intensity of data points.
Spectrum Approximation Options … Continued
4. Sequence for the color blind
• Use a sequence of colors varying in the yellow-blue direction to cover the majority of the population.
5. Bivariate Color Sequence
• A bivariate color sequence in color visualization is a technique used to represent the
relationship between two variables (Fields) on a single map or chart.

Figure: a bivariate color sequence map showing two variables, income and education level, on one single map
Representing Quantities
Figure: three graphs, each containing three objects.
1. Consider that the first, second and third objects in each of the graphs above represent Rice, Vegetables
and Fruits.
2. Which of the three graphs has the fewest fruits?
3. Which of the three graphs has the most vegetables?
Representing Quantities

Figure: three charts of the same data for Meat, Vegetables and Fruits, drawn with different axis scales and
chart styles.

What is the guideline?


Gestalt Laws
1. Figs. a & b show a series of dots. In which figure(s) do you perceive the dots as a series of rows, a series
of columns, or both?
2. In Fig. c, how many clusters (groups) of dots are there?

1. How many clusters of black dots are there?
2. Is the distance between L & L1 the same as the distance between L & L2?
3. Why is L1 not considered part of cluster b?
(L, L1 and L2 are labeled dots in the figure.)

1. How black dots are arranged in fig a & b?


2. Observe fig. b, How are symbols x are arranged
(rows or columns)?
3. Are you seeing rows or columns in fig. c?
4. Are you seeing rows or columns in fig. d?
Gestalt Laws
1. How many groups of objects are there in each of the pictures a, b, c and d?
2. Suppose the lines were not there; how would you group those objects?
3. Do the lines make a big difference in our perception?

1. Of figs. a & b, which is clearer?
2. In which figure is it easier to find connections between nodes? Why?

1. Of the three patterns in this picture, which ones do you notice better?
2. Of the three patterns in this picture, where do your eyes focus more? Why?
Gestalt Laws
1. Do you see a full circle behind the green rectangle? Or do you perceive it some other way? What is that
other way?

1. Do you recognize how many objects there are in this picture?

1. How many groups of red dots are there in this picture?
2. Can we perceive two dots that are closer to each other as one group?
Gestalt Laws
1. How many colored objects are there in the foreground?
Objects of the same color appear to have different colors due to the Cornsweet illusion; when lightness
changes gradually, it creates this illusion.

1. In Fig a, which color is the foreground?
2. In Fig a, how many objects are there?
3. In Fig b, how many objects are there?
4. In Fig c, how many objects are there?

1. How many objects are you seeing in this picture?
2. What are they?
3. Here, the significance of the objects themselves also plays a role in our perception.
Gestalt Laws

Gestalt Laws 1:

Spatial Proximity : Place symbols and glyphs (pictures) representing related information close together.
Gestalt Laws

Gestalt Laws 2:

Similarity: When designing a grid layout of a data set, consider coding rows and/or columns using low-level
channel properties, such as color and texture.
When the proximity between objects is the same, similarity plays a role in perceiving the patterns.
Gestalt Laws

Gestalt Laws 3:

Connectedness: To show relationships between entities, consider linking graphical representations of data
objects using lines or ribbons of color. Connectedness is a stronger grouping cue than proximity or similarity.

NOTE: This was demonstrated by Palmer and Rock (1994), who argued that the Gestalt psychologists
overlooked this principle.
Gestalt Laws

Gestalt Laws 4:

Continuity: While drawing a complex diagram with connectivity between entities, use simple, smooth and
continuous lines rather than lines that contain abrupt changes in direction.
Gestalt Laws

Gestalt Laws 5:

Symmetry: Consider using symmetry to make pattern comparisons easier and to get a stronger sense of
perception.
Gestalt Laws

Gestalt Laws 6:

Closure: 1. There is a perceptual tendency to close contours that have gaps in them (Fig. a in the top picture).
2. A simple line contour is adequate for regions having a simple shape, but use lines, colors, texture and
Cornsweet contours for overlapping regions and complex shapes (Figs. a & b in the bottom picture).
Gestalt Laws

Gestalt Laws 7:

The law of Common Region: When elements are located in the same closed region, we perceive them as
belonging to the same group. Consider putting related information inside a closed contour.
Gestalt Laws
Figure and Ground (background):
• Smaller objects of a pattern tend to be
perceived as objects in the foreground.

Gestalt Laws 8:

Figure and Ground: Use a combination of closure, common region, and layout to ensure that data entities are
represented by graphical patterns that are perceived as figures, not as ground (background).
Pattern Learning
• Pattern extraction is fundamental to extracting meaning from visualization; if so, can we teach/learn to see
patterns better?
• Artists talk about seeing things that the rest of us can't see
• Research found that:
• For detecting some patterns almost no learning is required; our neurons learn this trick in the first few
months of life
• Some learning happens for intermediate complex patterns
• Most learning happens for detecting higher-level complex patterns

In Fig 6.50:
1. Object a has a very simple grating pattern, which does not require any learning
2. Object b, an intermediate complex pattern, requires some learning
3. Object c involves higher-level pattern tasks, such as finding downward-pointing triangles
