DWV Notes Units 1 to 5
Data Wrangling, Data Workflow, Data Dynamics (Data Wrangling Steps), Data Profiling,
Transformation, Data Quality, Team Structure Roles and Responsibilities, and DW Tools
Some Facts about Data
1. What is data? Collection of facts.
2. Is data abundant or scarce? Abundant.
3. Is it available freely? No, it takes time and money to collect.
4. Do we understand this data in a raw form? No, if the volume of data is large.
5. What is Visualization? Representing data in a graphical way that can be easily understood by the human cognitive system.
Data Resources:
1. Too few resources work on data compared to the high number of data value consumers in the organization.
Knowledge:
1. The (IT) people working on data have inadequate skills/knowledge to meet the expectations of business analysts; this causes multiple feedback cycles between IT staff and the business.
Data Projects
There is a natural progression of data projects: from near-term answering of known questions, to longer-term value analyses, and finally to production systems that use data in an automated way. Underlying this progression is the movement of data through three main data stages: raw, refined, and production.
• Raw stage: near-term answering of known questions
• Refined stage: long-term value from analysis for humans to make decisions
• Production stage: long-term value from automated decision making
A minority of data projects will end in the raw or production stages. The
majority will end in the refined stage. Projects ending in the refined stage will
add indirect value by delivering insights and models that drive better decisions.
Figure 2.2 depicts the natural progression of data projects and the actions that take place at each stage of a data project.
Data Workflow in Data Projects … continued
Raw Data Stage:
1. Ingest Data
• As part of the Ingest Data action, data is collected from various sources; these sources often use different formats.
• Collected data is stored at one central location, with or without transforming it into a structured format.
• Schema-on-read ingestion: data is not transformed into a usable data structure until it is needed for further analysis.
• Schema-on-write ingestion: data is transformed into a usable data structure while it is collected and stored in the central location. This style of ingestion is used in data warehouse projects.
• Ingesting data triggers two additional actions, both related to the creation of generic and custom metadata.
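The two ingestion styles can be contrasted with a minimal sketch. It assumes a hypothetical attendance feed of comma-separated text lines; the names RAW_LINES, parse_record, ingest_schema_on_write, ingest_schema_on_read and read_with_schema are illustrative only and do not come from any tool mentioned in these notes.

```python
from datetime import datetime

# Hypothetical raw feed: roll number, attendance flag, date.
RAW_LINES = ["101,Y,07-Aug-23", "102,N,07-Aug-23"]


def parse_record(line):
    """Apply the schema: split the line and convert each field to a typed value."""
    roll, present, day = line.split(",")
    return {
        "roll_number": int(roll),
        "present": present == "Y",
        "date": datetime.strptime(day, "%d-%b-%y").date(),
    }


def ingest_schema_on_write(lines):
    """Transform while storing: the central store holds structured records."""
    return [parse_record(line) for line in lines]


def ingest_schema_on_read(lines):
    """Store as-is: the central store holds the untouched raw lines."""
    return list(lines)


def read_with_schema(store):
    """Schema-on-read applies the structure only when analysis needs it."""
    return [parse_record(line) for line in store]


warehouse = ingest_schema_on_write(RAW_LINES)  # structured at ingestion time
lake = ingest_schema_on_read(RAW_LINES)        # raw until someone reads it
```

Both paths end with the same structured records; the difference is when the parsing cost is paid and whether the raw form is kept.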
2. Describe Data (Generate Generic Meta Data)
• Before ingesting data, it is necessary to understand its generic characteristics, such as the data fields and their types, lengths, and formats. The description of these general characteristics is the generic metadata.
3. Assess Data Utility (Generate Custom Meta Data)
• This involves assessing data utility (usefulness) in order to streamline the data ingestion process. Not all data sources follow the same generic metadata; the exceptions go into the custom metadata description.
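As a rough illustration of generating generic metadata, the sketch below derives the type, maximum length, and non-empty count of each column from a few sample rows. The function name describe_generic_metadata and the sample rows are assumptions made for this example.

```python
def describe_generic_metadata(rows):
    """Return per-column generic metadata: observed types, max length, non-empty count."""
    metadata = {}
    for row in rows:
        for column, value in row.items():
            info = metadata.setdefault(
                column, {"types": set(), "max_length": 0, "non_empty": 0}
            )
            if value is None or value == "":
                continue  # empty cells contribute nothing to the description
            info["types"].add(type(value).__name__)
            info["max_length"] = max(info["max_length"], len(str(value)))
            info["non_empty"] += 1
    return metadata


# Hypothetical sample rows for an attendance data source.
rows = [
    {"student_name": "Ram", "roll_number": 101, "attendance": "Y"},
    {"student_name": "Ganga", "roll_number": 102, "attendance": "N"},
    {"student_name": "John", "roll_number": 103, "attendance": "Y"},
]
meta = describe_generic_metadata(rows)
```

A source that deviates from this description (for example, one that records only absentees) would be noted in the custom metadata.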
Data Workflow in Data Projects … continued
Data Source 1 (Date: 07-Aug-23)
Student Name   Roll Number   Attendance
Ram            101           Y
Ganga          102           N
John           103           Y

Data Source 2 (Date: 07-Aug-23)
Y   101
N   102
Y   103

Data Source 3 (Date: 07-Aug-23)
Roll Number   Attendance
102           N
*Captured only absentees' details
Data inconsistency means data is not uniformly represented within a data source or across data sources.
Examples of inconsistency:
• Short names
• Date formats
• Customer first and last names
Free-flow text input forms and manual entries during data capture are major causes of data inconsistency. Using modern data input controls, such as drop-down and date controls, prevents these kinds of problems.
Customer Data Form with Free Flow Text: Customer Name, Birth Place, Short Name, Date of Birth
Customer Data Form with Input Controls: Cust. First Name, Cust. Last Name, Birth Place, Date of Birth
Sample record from the slide: 3 | Ahmed, H | California | CA
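One common fix for inconsistent date formats is to normalize every value to a single representation. The sketch below is only an illustration: the list of accepted formats in KNOWN_FORMATS and the function name normalize_date are assumptions, and a real project would list the formats actually found in its sources.

```python
from datetime import datetime

# Assumed input formats, e.g. 07-Aug-23, 07/08/2023, 2023-08-07.
KNOWN_FORMATS = ("%d-%b-%y", "%d/%m/%Y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None when no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


normalized = [normalize_date(d) for d in ["07-Aug-23", "07/08/2023", "2023-08-07"]]
```

Values that return None are candidates for manual review rather than silent correction.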
In this example, city 2 sales (a subset of the entire data for city1, city2, city3) are extremely high, making them a collective outlier relative to the rest of the cities.
Identifying Outliers
• With a small data set, outliers are easy to detect by visual inspection
• 24, 28, 32, 5, 40, 45, 35, 120, 130
• In this data set it is very easy to identify 5, 120 and 130 as outliers, because they are either too small or too large compared to the rest of the values
• How do we identify outliers when the data set is very large, or how do we find them programmatically?
• Box Plots
• Z-score method
Box Plot
Box Plotting of Data set 24, 28, 32, 5, 40, 45, 35, 120, 130
Box Plot Method Steps:
1. Sort the data set
2. Identify the Q1 position (Quartile 1 (Q1) is the 25th percentile: 25% of values are lower than this number)
3. Identify the Q3 position (Quartile 3 (Q3) is the 75th percentile: 75% of values are lower than this number)
4. Compute the Inter Quartile Range (IQR), the difference between the Q3 and Q1 position values
5. Lower Boundary (LB) = (Q1 position value) – (1.5 * IQR) (data values lower than LB are outliers)
6. Higher Boundary (HB) = (Q3 position value) + (1.5 * IQR) (data values higher than HB are outliers)
Step 1: Sorted data set: 5 24 28 32 35 40 45 120 130
Step 2: Q1 position = (total values in the data set) * 25% = 9 * (1/4) = 2.25, which lies between the 2nd and 3rd positions; always take the lower position. In our example it is the 2nd position (value 24).
Step 3: Q3 position = (total values in the data set) * 75% = 9 * (3/4) = 6.75, which lies between the 6th and 7th positions; always take the higher position. In our example it is the 7th position (value 45).
Step 4: IQR = (7th position value) – (2nd position value) = (45 – 24) = 21
Step 5: Lower Boundary (LB) = (24 – 1.5*21) = -7.5 (values lower than -7.5 are considered outliers; in our example there are no LB outliers)
Step 6: Higher Boundary (HB) = (45 + 1.5*21) = 76.5 (values higher than 76.5 are considered outliers; in our example 120 and 130 are HB outliers)
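The box plot steps above can be sketched in a few lines of Python. This follows the position rule used in these notes (round the Q1 position down and the Q3 position up); statistical libraries often interpolate quartiles differently, so their boundaries may not match. The function name box_plot_bounds is an assumption for this example.

```python
import math


def box_plot_bounds(values):
    """Return (lower_boundary, higher_boundary) using the notes' position rule."""
    data = sorted(values)                       # Step 1
    n = len(data)
    q1_pos = max(1, math.floor(n * 0.25))       # Step 2: take the lower position
    q3_pos = min(n, math.ceil(n * 0.75))        # Step 3: take the higher position
    q1 = data[q1_pos - 1]                       # positions are 1-based
    q3 = data[q3_pos - 1]
    iqr = q3 - q1                               # Step 4
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr       # Steps 5 and 6


data = [24, 28, 32, 5, 40, 45, 35, 120, 130]
lb, hb = box_plot_bounds(data)
outliers = [x for x in data if x < lb or x > hb]
# lb is -7.5, hb is 76.5, outliers is [120, 130]
```

Note that 5, which looks small to the eye, is not flagged by this method because it is still above the lower boundary.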
Z-Score Method to identify Outliers
Z-score: Indicates how many standard deviations a data point is from the
mean.
Threshold: Typically, a z-score above 3 or below -3 is considered an
outlier.
Formula for Z-Score = (x – mean)/(standard deviation) {x is a data value}
Z-score values for the sorted data set 5, 24, 28, 32, 35, 40, 45, 120, 130: -1.12, -0.66, -0.56, -0.46, -0.39, -0.27, -0.15, 1.68, 1.93
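The Z-score values listed above can be reproduced with the standard library. The sketch assumes the population standard deviation (statistics.pstdev), which is the choice that matches the listed values; using the sample standard deviation would give slightly different scores.

```python
from statistics import mean, pstdev

data = [5, 24, 28, 32, 35, 40, 45, 120, 130]

mu = mean(data)          # 51
sigma = pstdev(data)     # population standard deviation
z_scores = [(x - mu) / sigma for x in data]

# Apply the usual threshold: |z| > 3 marks an outlier.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
```

With this small data set no Z-score exceeds 3 in absolute value, so the Z-score method with a threshold of 3 flags nothing here, whereas the box plot method flags 120 and 130. The two methods do not always agree.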
Fixing data noise
• Techniques to Fix
• Smoothing Technique through Binning
• Smoothing by mean value
• Smoothing by nearest boundary value
• Reasons to fix
• Fixing complex datasets makes them more interpretable by simplifying the
data representation and highlighting the most important features.
• Identify meaningful patterns or trends
• Enhanced visualization
• Forecasting and Predictions
Smoothing Technique through Binning by Bin Means Method
Original data: 24, 28, 15, 32, 55, 40, 45, 35, 62
Binning by Bin Means method:
Step 1: Sort the values
15, 24, 28, 32, 35, 40, 45, 55, 62
Step 2: Select a bin size; consider 3 here.
Step 3: Partition the data set into equal-frequency bins based on the bin size
• Bin1 - 15, 24, 28 (Bin Mean: 22.3)
• Bin2 - 32, 35, 40 (Bin Mean: 35.7)
• Bin3 - 45, 55, 62 (Bin Mean: 54)
Step 4: Replace each bin value with the corresponding bin mean value
Step 5: Construct the new data set using the new bin values, i.e. 22.3, 22.3, 22.3, 35.7, 35.7, 35.7, 54, 54, 54
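A minimal sketch of both smoothing techniques named in these notes, binning by bin means and by nearest boundary value. The function names are assumptions for this example, and the tie-breaking rule in the boundary method (a value exactly in the middle goes to the lower boundary) is also an assumption.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    data = sorted(values)
    return [data[i:i + bin_size] for i in range(0, len(data), bin_size)]


def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of its bin."""
    smoothed = []
    for bin_values in equal_frequency_bins(values, bin_size):
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def smooth_by_bin_boundaries(values, bin_size):
    """Replace every value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for bin_values in equal_frequency_bins(values, bin_size):
        low, high = bin_values[0], bin_values[-1]
        for v in bin_values:
            smoothed.append(low if v - low <= high - v else high)
    return smoothed


data = [24, 28, 15, 32, 55, 40, 45, 35, 62]
by_means = [round(v, 1) for v in smooth_by_bin_means(data, 3)]
by_boundaries = smooth_by_bin_boundaries(data, 3)
```

For the example data, the bin means round to 22.3, 35.7 and 54.0, and the boundary method yields 15, 28, 28, 32, 32, 40, 45, 62, 62.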
NOTE: Many companies offer data wrangling tools bundled with their data analytics tools.
Data Profiling
• Data profiling is the process of examining, analyzing, and summarizing data sets to understand their
structure, content, and quality.
• It involves collecting statistical information about the data, which can help identify patterns,
anomalies, relationships, and potential issues within the data. Fundamentally, profiling guides
which transformations to consider for improving data quality.
e.g. df.head(), df.info(), and df.describe() are data profiling statements in Python (pandas)
• There are two types of data profiling, as shown in Table 3-2.
LastName   FirstName   finalWorth   BirthYear   gdp_country
Arnault    Bernard     211000       1949        $2,715,518,274,227
Musk       Elon        180000       1971        $21,427,700,000,000
Bezos      Jeff        114000       1964        $21,427,700,000,000
Ellison    Larry       107000       1944        $21,427,700,000,000
Buffett    Warren      106000       1930        $21,427,700,000,000
Gates      Bill        104000       1955        $21,427,700,000,000
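To show what a profiling summary contains, the sketch below computes a few statistics for the sample table using only the standard library. It is an illustration rather than a replacement for the pandas calls mentioned in these notes; the function name profile_column and the dictionary layout are assumptions, and gdp_country is stored here as a plain integer without the currency formatting.

```python
from statistics import mean

rows = [
    {"LastName": "Arnault", "finalWorth": 211000, "BirthYear": 1949, "gdp_country": 2715518274227},
    {"LastName": "Musk", "finalWorth": 180000, "BirthYear": 1971, "gdp_country": 21427700000000},
    {"LastName": "Bezos", "finalWorth": 114000, "BirthYear": 1964, "gdp_country": 21427700000000},
    {"LastName": "Ellison", "finalWorth": 107000, "BirthYear": 1944, "gdp_country": 21427700000000},
    {"LastName": "Buffett", "finalWorth": 106000, "BirthYear": 1930, "gdp_country": 21427700000000},
    {"LastName": "Gates", "finalWorth": 104000, "BirthYear": 1955, "gdp_country": 21427700000000},
]


def profile_column(rows, column):
    """Summarize one numeric column: counts, distinct values, range, and mean."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


worth_profile = profile_column(rows, "finalWorth")
gdp_profile = profile_column(rows, "gdp_country")
```

A profile like gdp_profile, with only two distinct values across six rows, hints that the column describes the country rather than the person, which is the kind of observation that guides later transformations.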
Data workflow actions: Describe Generic Meta Data, Describe Custom Meta Data, Generate Ad-hoc Reports, Build Prototype Models, Generate Regular Reports, Build Data Products and Services
Data Projects Head
• Responsible for ensuring that sufficient tools and technologies are available for data projects and that they work coherently to benefit the team
Data Architect
• He is also responsible for data integration, solution design and data security
Data Scientist
• Predominantly works in the Optimize Data Stage, supports data analysts in building prototype models, and interacts with business analysts to understand their needs
Data Analyst
• Predominantly works in the Design and Refine Data Stage, builds prototype models, and guides the Data Engineer
Data Engineer
• Predominantly works in the Raw Data Stage; responsibilities include data collection, programming to enrich data, and building ad-hoc reports
Figure. Typical Team Structure, Roles and Responsibilities
Unit 1 - Questions
1. Mention one industry and its pain areas. Can you provide at least three insights from data that would help
resolve these industry pain areas?
2. What are the two dimensions that can be used to measure the data value? Explain each one
briefly.
3. Explain four bottlenecks that industries face when deriving value from data?
4. Why are data wrangling tools important for extracting production value from your data?
5. Draw a typical workflow for data projects in the industry?
6. Explain stages of data projects?
7. What distinguishes Schema-on-Read from Schema-on-Write data ingestion methods?
8. Why is generating metadata necessary?
9. List at least five common data quality issues faced in most data projects?
10. Give at least three examples to prevent data quality issues?
11. Provide at least three typical data quality issues and explain how to fix them?
12. What is data profiling, and why is it important?
13. Which three data wrangling tools would you recommend?
14. How would you describe the typical team structure in data projects, and what are the various
roles and responsibilities?
Unit 2
Introduction to Data Visualization
Need of Visualization, Block Diagram of Visualization, Visualization Stages.
Reference Books
Information Visualization: Perception for Design, Colin Ware, 2nd edition, Morgan Kaufmann, 2004: Ch1
Visualizing Data: Exploring and Explaining Data with the Processing Environment, Ben Fry, O’Reilly, 1st edition, 2008: Ch1
What is Visualization?
• Representing data in graphical way that can be easily understood by
the human cognitive system
• Externalization of an internal construct of the mind such as an image,
thought or data in the form of graphical representation (to support
decision making) is called visualization.
What is Human Cognitive System?
• It helps to perceive environment around us, learn from experiences,
anticipate outcomes, and adapt to changing circumstances.
How Human Cognitive system and
Visualization help/ influence each other
How Visualization helps the cognitive system:
• Pattern Recognition: Our brains are excellent at recognizing patterns, especially
when data is presented visually.
• Reduced Cognitive Load: Visualizations simplify complex data, reducing the
cognitive effort required to understand it.
• Enhanced Memory Retention: Visual information is often easier to remember
than text. One picture is worth a thousand words.
How the cognitive system helps Visualization:
• Our cognitive abilities enable us to interact with visual data, exploring different
dimensions and perspectives. This interaction can lead to deeper insights and
better decision-making.
• By understanding the cognitive system, we can design better visual representations
that further simplify our understanding.
Advantages of Visualization
• Intuitive Understanding: Visuals are often easier to comprehend than raw
numbers or text. Data visualizations allow people—even those who aren’t
comfortable with math—to quickly grasp patterns and insights.
• Simplifies Complexity: Visualizations simplify complex data by revealing
patterns, trends, and outliers. They help you see the forest for the trees,
making it easier to explore data structures and identify clusters.
• Better Decision-Making: When data is presented visually, decision-makers
can make informed choices more effectively.
• Improved Communication: Sharing data through visualizations ensures
everyone is on the same page. Instead of struggling with raw data,
colleagues can easily interpret and discuss insights from well-designed
visualizations.
Advantages of Visualization … Continued
Option 1:
• Given 5 years of Nifty 50 values as a table (1320 rows of data points)
S. No   Date         Nifty 50 Value
3       24-08-1999   10,900
        07-10-2021   18,780
1320    21-08-2024   24,770
Option 2:
• Given the same 5 years of data
Analyse the visualization advantages mentioned on the previous page with respect to these two options.
Visualization is a blend of both science and
art
• Describing visualization as a blend of science and art reflects its
dual nature, where technical precision and creative expression
intersect to effectively communicate complex information.
• Visualization is both a science—ensuring data is represented
accurately and logically—and an art—engaging the audience and
making the data relatable. When these two elements are balanced,
visualization becomes a powerful tool for both understanding and
communication.
• The art makes the data compelling, while the science ensures it is
trustworthy and actionable.
Block Diagram of Visualization and Visualization Steps
[Figure: block diagram of visualization; recovered labels: Data Gathering, Data Exploration, View Manipulation]
Visualization Steps:
Reference Books
Information Visualization: Perception for Design, Colin Ware, 2nd edition, Morgan Kaufmann, 2004: Ch1, Ch8, Ch9
Visualization vs Visual Perception
Example: If we use a road map to look for a route, the visual query triggers a search for connected red contours
(representing major highways) between two visual symbols (representing cities).
Power of Vision
[Figure: bandwidth of the five senses (including Sight, Touch, and Taste) for perceiving the external environment]
Sight (visual media) has 1000 times more bandwidth than the sense of taste, according to a Dutch scientist.
• An enormous amount of data reaches the eye unconsciously. The eye is very sensitive to colors,
shapes, patterns and their variations; this is the language of the eye.
• When we combine the language of the eye with the language of the mind (such as numbers, words, and concepts),
both languages work together to enhance each other, aiding human perception.
Key learnings from three stage visual
information processing
• Both the eye and the mind are fed an enormous amount of
information, consciously and unconsciously, but this information fades
away unless the visual details stand out and grab attention.
• The brain follows two paths for visual perception: one for static
information and the other for information in motion (animation).
Therefore, data visualization can use static, animated, or both to
grab the audience’s attention.
• The mind perceives visual information based on objects we already
know (stored in long-term memory). Therefore, the objects we use in
visual forms should be familiar to the audience.
Sample Visual Representation
[Bar chart: Military Budget ($bn); y-axis from 0 to 700; bar values 607, 61, 60, 47, 41, 40, 38, 36, 29, 25]
1. Entities – Objects of interest that we wish to visualize (e.g. people, places, events, etc.)
2. Relationships – Structures and patterns that relate entities to one another (e.g. “part-of”, “supervisor-subordinate”, “parent – siblings”, etc.)
3. Attributes – A property of an entity or relationship that cannot be thought of independently, e.g. the color of an apple
4. Attribute dimensions – An attribute can have one or more dimensions (a person's weight is one-dimensional; a journey
has two dimensions: a. distance travelled from the origin, b. direction of travel)
5. Numbers – These are used to measure the quality of attributes
a) Categorical Data – Classification of data into groups (like fruits into apple and banana groups)
b) Integer Data – This is like an ordinal class in that it is discrete and ordered. A discrete value is a whole number and it has
a natural order.
c) Real Number Data – Represents attribute properties such as intervals (the gap between two values, e.g. the gap
between a bus start time and end time) and ratios (Object A is half the size of Object B, i.e. 0.5 times)
Types of Data … Continued
6. Uncertainty data (e.g. flipping a coin, fuzzy values like high, low, medium, brightness of color)
7. Operational Data (Mathematical Operations, Merging, Inverting, Splitting single entity into Several entities etc.,)
8. Meta Data – it is data about data (It describes data entities and attributes, who and when collected data, quality
of data etc.,). Metadata serves several critical purposes, including: Data Understanding and Interpretation, Data
Discovery and Searchability, Data Quality and Trustworthiness,...etc. especially in large-scale projects or when
collaborating across multiple teams or systems.
Important aspect of relationships:
❖ Sometimes relationships provided explicitly
❖ Many times relationships are discovered, discovering relationships is the very purpose of visualization
[Chart: % of Military Budget to its GDP, for Germany, France, Japan, UK, USA, China]
Data Types
1. Entity - Country
2. Attribute 1 – Military Budget
3. Attribute 2 – GDP
4. Enriched value (Attribute 3) - % of Military Budget to country GDP; this enriched data gave a new perspective
How do we do this?
• Cognitive processes i.e. interpreting data and explaining data are very different, both should work together
for effective understanding and presenting.
• Our goal is to explore different ways that images and words can be used to create narrative structure,
example integrating visual and verbal materials in multimedia presentations.
The Nature of Language
Nature of Language:
The “nature of language” refers to the fundamental characteristics and properties that define language as a system
of communication. Here are some key aspects:
• Symbolic: Language uses symbols (words, sounds, gestures) to represent objects, actions, ideas, and feelings.
These symbols are arbitrary, meaning there is no inherent connection between the symbol and what it
represents.
• Rule-Governed: Language operates according to a set of rules, including grammar and syntax, which dictate how
symbols can be combined to create meaningful expressions.
• Dynamic: Language is constantly evolving. New words are created, meanings change, and grammatical structures
can shift over time.
• Cultural: Language is deeply embedded in culture. It reflects and influences cultural norms, values, and practices.
• Innate and Learned: According to theories like Chomsky’s Universal Grammar, humans have an innate capacity
for language, but the specific language we learn is influenced by our environment.
• Ambiguous and Contextual: Words and sentences can have multiple meanings, and context plays a crucial role in
interpreting them
What are the Key takeaways (learnings) from Nature of Language
for data visualization?
• Symbolic: Language uses symbols (words, sounds, gestures) to represent objects, actions, ideas, and
feelings. These symbols are arbitrary, meaning there is no inherent connection between the symbol
and what it represents.
Takeaways: Data Visualization can include pictures, words, sounds through audio, gestures thru
motion (animation)
• Rule-Governed: Language operates according to a set of rules, including grammar and syntax, which
dictate how symbols can be combined to create meaningful expressions.
Takeaways: Visualization too follows a grammar, which may or may not be formally prescribed but is generally
practiced
• Dynamic: Language is constantly evolving. New words are created, meanings change, and grammatical
structures can shift over time.
Takeaways: Visualization too will evolve with new kinds of graphical objects
What are the Key takeaways from Nature of Language for data
visualization? … Continued
• Cultural: Language is deeply embedded in culture. It reflects and influences cultural norms, values, and practices.
Takeaways: Visualization should follow the audience's cultural aspects for better presentation
• Innate and Learned: According to theories like Chomsky’s Universal Grammar, humans have an innate capacity
for language, but the specific language we learn is influenced by our environment.
Takeaways: Visualization should follow the audience's cultural aspects as well as their environment
• Ambiguous and Contextual: Words and sentences can have multiple meanings, and context plays a crucial role in
interpreting them
Takeaways: Visualization shall not create any ambiguity, so it is necessary to provide context through
various means (words, audio, animation, brief explanation, etc.)
Refer next slide for examples of data visualization aspects from nature of language.
Examples of data visualization aspects from
nature of language
[Figure -1 and Figure -2: bar chart examples]
• Figure -1 uses only one kind of symbol (bars). Figure -2 enhances our understanding by using both visual and word
symbols; we can further enhance it through audio and animation.
• Figure -2 follows an established visual grammar of depicting the x and y axes, chart title, axis legends, etc.
• Figure -2 provides the cultural aspect, i.e. all the words are in English so that an English-speaking audience can understand them.
• Figure -2 eliminates ambiguity by using the same color codes for the bar symbols and their respective axes and legends.
Visual and Spoken Language
❖ People interact with one another using words and spoken language more than images and diagrams.
❖ Spoken and written language is ubiquitous; it is the most detailed, complete, and commonly used system of
symbols we have. For this reason alone, visual techniques are preferred only when there is a clear advantage.
❖ That said, images have clear advantages (hence the phrase “a picture is worth a thousand words”) for certain kinds of
information, and a combination of images and words will often be best.
❖ A visualization designer has the task of deciding whether to represent information visually, using words, or both.
Other related choices involve the selection of static or moving images, and spoken or written text.
Guidelines for - When to use images vs words separately vs in combination?
❖ G1.2 Graphical elements, rather than words, should be used to show structural relationships, such as links between entities and groups of entities.
[Figure: organization chart with Chancellor, Vice-Chancellor, Business School Director, Engineering School Director]
Reference Books
1. Reference R5 - Carlo Batini, Monica Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications), Springer, 2006
2. Reference R6 - Jack E. Olson, Data Quality: The Accuracy Dimension (The Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann, 2003
Refer slides 23-43 for data quality
issues – Prevention and Fixing
These slides are part of Unit 4 Syllabus
What is the difference between
Data inconsistency and Data
Inaccuracy?
Assessing data fit
A classification of data quality assessment methods from
16th International conference on information quality.
General DQ Problems
❖ Data Completeness
❖ Data Accuracy
❖ Data Currency (Timeliness)
❖ Data Consistency
❖ Duplicate data
❖ Data Outliers
Assessing data fit - Examples of context independent data errors
• Context independent incorrect values
Steps shown in the figure: Access, Profile/assess, Define threshold values for each data quality type, Verify predefined threshold values, Transform, Publish
Figure. Data Wrangling Basic Steps with assessing whether data is fit for purpose
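The idea of defining threshold values per data quality type and then verifying data against them can be sketched as follows. Everything named here is an assumption for illustration: the two quality measures chosen (completeness and uniqueness), the threshold values in THRESHOLDS, and the sample rows.

```python
# Assumed minimum acceptable scores for each data quality type.
THRESHOLDS = {"completeness": 0.90, "uniqueness": 0.95}


def completeness(rows, column):
    """Share of rows where the column is filled in."""
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, key):
    """Share of rows that carry a distinct key value (1.0 means no duplicates)."""
    keys = [row[key] for row in rows]
    return len(set(keys)) / len(keys)


def assess_fit(rows, column, key, thresholds=THRESHOLDS):
    """Return the measured scores and the names of the checks that fall short."""
    scores = {
        "completeness": completeness(rows, column),
        "uniqueness": uniqueness(rows, key),
    }
    failed = [name for name, score in scores.items() if score < thresholds[name]]
    return scores, failed


# Hypothetical rows: one missing attendance value and one duplicated roll number.
rows = [
    {"roll_number": 101, "attendance": "Y"},
    {"roll_number": 102, "attendance": "N"},
    {"roll_number": 103, "attendance": ""},
    {"roll_number": 103, "attendance": "Y"},
]
scores, failed = assess_fit(rows, "attendance", "roll_number")
```

Data that fails a check would go through a transform step (for example, filling missing values or removing duplicates) before it is published.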
Assessing data integrity
Data integrity refers to the accuracy, consistency, and reliability of data
throughout its (Data) lifecycle. It ensures that data remains unchanged
and uncorrupted during operations such as transfer, storage, and
retrieval.
Here are some key aspects of data integrity:
• Accuracy: Data should be correct and free from errors.
• Consistency: Data should be consistent within and across the data sources
• Completeness: All required data should be present.
• Timeliness: Data should be up-to-date and available when needed.
• Reliability: Data should be trustworthy.
[Figure: original picture on the left; augmented pictures (rotated, zoomed out) on the right]
Few More examples of Augmented data vs
Synthetic Data
Image Data Augmentation (Image Recognition Models)
• Flipping: Horizontally or vertically flipping images to create mirror images.
• Rotation: Rotating images by a certain degree to simulate different orientations.
• Scaling: Zooming in or out on images to create variations in size.
• Color Jittering: Randomly changing the brightness, contrast, and saturation of images.
• Cropping: Randomly cropping parts of images to focus on different areas.
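Flipping and rotation from the list above can be illustrated on a tiny image stored as a grid of pixel values. This is a sketch with plain Python lists; real image pipelines typically use an image or array library instead, and the function names here are assumptions.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom by reversing the row order."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: the bottom row becomes the first column."""
    return [list(column) for column in zip(*image[::-1])]


# A 2 x 3 "image" whose pixel values make the movement easy to follow.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
augmented = [flip_horizontal(image), flip_vertical(image), rotate_90_clockwise(image)]
```

Each augmented copy keeps the same label as the original, which is how one picture yields several training examples.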
Text Data Augmentation (Natural Language Processing – NLP Models)
• Synonym Replacement: Replacing words with their synonyms.
• Random Insertion: Inserting random words into sentences.
• Random Deletion: Deleting random words from sentences.
• Sentence Shuffling: Shuffling the order of sentences in a paragraph
Audio Data Augmentation (Voice Recognition Models)
• Noise Injection: Adding random noise to audio signals.
• Pitch Shifting: Changing the pitch of the audio signal.
• Speed Variation: Changing the speed of the audio playback.
Sales Data Augmentation
• Changing the purchase date, time, or slightly altering the quantity sold
• Price Fluctuation Simulation: Introduce variations in product prices to
simulate different market conditions and study their effects on sales volumes.
• Discount Impact Analysis: Generate data reflecting different discount
strategies to analyze their impact on customer purchasing behavior.
• Demographic Variation: Create synthetic customer profiles by varying
demographic attributes like age, income, and location to study their influence
on purchasing patterns.
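A small sketch of sales data augmentation by slightly altering the quantity sold and simulating price fluctuation. The record layout, the function name augment_sales, and the parameter defaults (quantity change of at most 1, price change of at most 5%) are assumptions chosen for illustration; a fixed seed keeps the result repeatable.

```python
import random


def augment_sales(records, copies=2, max_qty_change=1, price_jitter=0.05, seed=0):
    """Create `copies` varied versions of every record without changing the originals."""
    rng = random.Random(seed)
    augmented = []
    for record in records:
        for _ in range(copies):
            new = dict(record)  # copy so the original record stays untouched
            change = rng.randint(-max_qty_change, max_qty_change)
            new["quantity"] = max(1, record["quantity"] + change)
            factor = 1 + rng.uniform(-price_jitter, price_jitter)
            new["price"] = round(record["price"] * factor, 2)
            augmented.append(new)
    return augmented


# Hypothetical sales records.
sales = [
    {"product": "pen", "quantity": 3, "price": 10.0},
    {"product": "notebook", "quantity": 1, "price": 45.0},
]
augmented = augment_sales(sales, copies=2)
```

The augmented rows stay close to real observations, which is the difference from fully synthetic data generated from scratch.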
System
Cognitive System
Digestive System
Education System
Computer System
Monolithic System
Client/Server System
Distributed System
Redundant System
Loosely coupled System
Tightly coupled System
A system comprises several components that interact and work together as a whole to
perform its intended function. If one of the components fails, the system does not function fully
(100%).
Unit 5
Understanding Color
Trichromacy Theory, Color Measurement, Application of color in visualization 1,2,3, Representing
Quantity, static and moving pattern, Gestalt Laws, Pattern learning
Reference Books
1. Chapters 4 and 6 from Information Visualization: Perception for Design, Colin Ware, 2nd edition,
Morgan Kaufmann, 2004.
Importance of Color
Human vision is influenced by four parameters of objects:
1. Layout of objects in space
2. Shape of objects
3. Motion of objects
4. Color of objects
Even though some people have color blindness (the inability to distinguish certain colors), their day-to-day life
is not impacted much, because they perceive the other three parameters of objects.
~10% of the male and ~1% of the female population have color blindness.
The importance of color becomes evident when the other three parameters of the objects are
identical in space; in such cases, color alone distinguishes one object from another.
[Figure captions: People's perception of colors depends on the type of color blindness they have. People with color blindness depend on signal layout. People with color blindness will pick rotten fruits along with good ones. Do they eat rotten fruits?]
Trichromacy Theory
The Trichromacy Theory [Tri –(three). Chromacy (Color)], explains how
humans perceive color. According to this theory, our eyes have three
types of cone cells (color receptors), each sensitive to different ranges
of wavelengths of light. These cones work together to allow us to see a
wide range of colors by combining the different levels of stimulation
from each type of cones.
• Red cones (L-cones): Sensitive to long wavelengths.
• Green cones (M-cones): Sensitive to medium wavelengths.
• Blue cones (S-cones): Sensitive to short wavelengths.
When light enters the eye, it stimulates these cones in varying
degrees, and the brain interprets these signals to produce the
perception of different colors.
Color vision deficiency (CVD)
• There are six to seven million cone cells in a human eye of which, 64%
are red sensitive, 33% are green sensitive and 3% are blue sensitive.
• The type of CVD depends on the type of faulty or missing cone cell.
Fig 4.2 shows normal cone sensitivity to color wavelength.
Fig 4.3 shows that the absence of one cone response reduces the three-dimensional response
space to a two-dimensional response space.
Types of cone (color receptor) faults:
1. The sensitivity of the cone cells is shifted towards a shorter or longer wavelength
2. Missing cone cells
Comparison of human sensitivity vs other animals
• Humans are trichromatic
• Dogs are dichromatic
• Birds are tetrachromatic
• As the number of cone/receptor types increases, more color shades can be seen
Color Measurement
• The human eye can match any color with a mixture of no more than three
primary lights (primaries). This is called human colorimetry.
• Understanding human colorimetry is essential for anyone who wishes to
precisely reproduce the colors they saw on various media, such as
printing, digital displays, textiles, and more.
• We can describe color by the following equation
• C ≡ rR + gG + bB
• Where C is the color to reproduce
• R, G, B are primary colors for RED, GREEN and BLUE
• r, g, and b represent amounts of each primary colors
• ≡ is used to denote perceptual match
NOTE: It is not mandatory to have RGB as primary colors. Paint industry uses
RYB as primary pigment colors to make paints. Printing industry (Digital
color Printers) uses CMY - Cyan, Magenta, Yellow as primary colors to do
color printing.
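The equation C ≡ rR + gG + bB can be tried out with display-style RGB primaries. This is a simplified sketch: it treats the three primaries as 8-bit RGB vectors and uses the idealized complement rule to move from RGB to CMY, which ignores how real inks and displays behave. The names mix and rgb_to_cmy are assumptions for this example.

```python
# Primaries as 8-bit (red, green, blue) vectors.
R = (255, 0, 0)
G = (0, 255, 0)
B = (0, 0, 255)


def mix(r, g, b):
    """Return rR + gG + bB channel by channel, clamped to the 0..255 range."""
    channels = []
    for i in range(3):
        value = r * R[i] + g * G[i] + b * B[i]
        channels.append(min(255, max(0, round(value))))
    return tuple(channels)


def rgb_to_cmy(rgb):
    """Idealized conversion: each CMY amount is the complement of an RGB channel."""
    return tuple(255 - channel for channel in rgb)


yellow = mix(1, 1, 0)            # full red plus full green
yellow_as_cmy = rgb_to_cmy(yellow)
```

In this idealized model, the yellow obtained from equal red and green light corresponds to yellow ink alone (no cyan, no magenta), which shows how the same color is described with different primaries.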
Application Areas of Color in Visualization
1. Color selection interface (method of selecting a color)
2. Color labeling
3. Color Sequences for map coding
4. Color reproduction (Transfer of color from one device to another)
[Figures: Color Labeling; Color Sequence for continuously varying values (here °C)]
Color Specification Interface and Color Spaces
• We use several software applications throughout the day; without them,
we can’t complete our daily tasks.
• Office Productivity applications such as –
• MS-Office (Word, Excel, PowerPoint, etc.,)
• Google Workspace (Google Docs, Sheets, and Slides etc.,)
• Apache OpenOffice (word, spreadsheet, presentations, graphics etc.,)
• Apple iWork (Pages, Numbers, and Keynote etc.,)
• Data Visualization Software
• Drawing applications
• Reporting, Dashboard and Business Intelligence applications
• Computer Aided Design (CAD)
• It is essential for these applications to let users select their own colors and
apply them to their text, charts, and graphs and make their content visually
appealing.
• Color Specification Interface and Color Spaces will help selecting required
color in the productivity and data visualization software
Application 1 : Color Specification Interface and
Color Spaces
There are a number of approaches to this user interface problem. Some of
those approaches are:
1. Giving a set of controls to specify a point in three-dimensional color space
a) Slider/input RGB values
b) 2/3-dimensional color space
2. A set of color names to choose from
a) Natural Color System (NCS)
3. Palette of pre-defined/custom colors
Fig. 1 Microsoft Paint Color Interface (labels: User Input in Hexa, User Input Fields for primary colors, Slider, 2 Dimensional Color Space, 3 Dimensional Color Space, Color Palette (Pre-defined colors), Custom Color Palette)
Some more color interfaces
[Figures: PowerPoint Standard Color Interface; PowerPoint Custom Color Interface, with Theme Colors and Eyedropper labeled]
• There is no one color interface that meets all user needs, so software
applications provide multiple color interface options to meet
customer needs.
• The eyedropper picks up any color on the page and applies it to the selected
object
• Theme colors are colors that change automatically when the color theme
of the document changes
Demonstration of Theme Colors
1. PowerPoint (Design -> Colors)
2. Excel sheet (Page Layout -> Colors)
3. How to select color filters for people with color blindness in Windows 11
(Settings -> Accessibility -> Color Filters)
Application 1 : Interfaces comparison
1. Slider / input primary color values
Advantages
1. Easy to input/select primary color values to generate the required color
Disadvantages
1. People do not know which combination of color values will give which color
2. 2/3-dimensional color spaces
Advantages
1. Very easy for users to see the color in the color space and select the desired color
Disadvantages
1. It is very hard to showcase all combinations of colors in the color spaces
3. Color names
Advantages
1. Easy for users to select colors by name
Disadvantages
1. It is very difficult to name 256*256*256 combinations of colors
2. It is very hard for people to remember colors by their names, except a few popular colors
4. Predefined color palette
Advantages
1. A color palette makes it easy for users to select from a small set of colors
Disadvantages
1. It is very difficult to represent all possible colors in a color palette
There is no single easy method for selecting colors, so software applications provide multiple options and let the
user choose the one that suits them best.
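The input styles compared above (RGB values, hexadecimal input, and a position in a color space) all describe the same color, so an interface can convert between them. The sketch below uses Python's standard `colorsys` module; the helper names are illustrative, and HSV is used here only as one example of a color space a picker might expose.

```python
import colorsys


def rgb_to_hex(r, g, b):
    """Format 8-bit RGB values as a hexadecimal color code."""
    return "#{:02X}{:02X}{:02X}".format(r, g, b)


def hex_to_rgb(code):
    """Parse a code such as '#FFA500' back into 8-bit RGB values."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hsv_degrees(r, g, b):
    """Return hue in degrees and saturation/value in percent."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s * 100), round(v * 100)


if __name__ == "__main__":
    print(rgb_to_hex(255, 165, 0))
    print(hex_to_rgb("#FFA500"))
    print(rgb_to_hsv_degrees(255, 0, 0))
    # Number of distinct 8-bit RGB combinations (256*256*256):
    print(256 ** 3)
```

The last line makes the naming problem concrete: no list of color names or palette can cover every one of these combinations.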
Guidelines for designing color specification
interfaces
• G1.1: Consider laying out the red-green and
yellow-blue channel information on a plane. Use
separate control for specifying the dark-light
dimension
• G1.2: In an interface for designing visualization
color schemes, consider providing a method for
showing colors against different backgrounds
• G1.3: To support the use of easy-to-remember
and consistent color codes, consider providing
color palettes for designers
[Figure: Geometric color layout for Guideline 1]
Application 2 : Color for Labeling (Nominal Codes)
1. Colors can be extremely effective for distinguishing objects and object
categories in a visualization
2. Colors are a better option for labeling than grayscale codes (white, light
gray, dark gray, black), because the number of gray levels that humans can
easily tell apart is very small.
3. A set of 12 colors is recommended for color labeling: red, green, yellow,
blue, black, white, pink, cyan, gray, orange, brown, and purple. The first six
would normally be selected before the second set of six.
[Figure: Color Labeling, with the 1st set of 6 colors marked]
Color Labeling
1. Nominal codes are unique codes that represent objects. These codes can be
colors, patterns, or abbreviations.
2. In the above charts, colors are used to represent country names
3. Nominal codes are not orderable, unlike numbers or letters of the alphabet.
They must be remembered and recognized.
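The recommendation above can be sketched as a small helper that hands out label colors in the recommended order, so the first six are used before the second six. The function name is hypothetical, and the color list includes gray as the twelfth name, which is an assumption based on the commonly cited version of this list.

```python
# Recommended label colors in priority order (first six, then second six).
# Including "gray" to reach 12 names is an assumption.
RECOMMENDED_LABEL_COLORS = [
    "red", "green", "yellow", "blue", "black", "white",
    "pink", "cyan", "gray", "orange", "brown", "purple",
]


def assign_label_colors(categories):
    """Map each distinct category (a nominal code) to one label color."""
    unique = list(dict.fromkeys(categories))  # keep first-seen order
    if len(unique) > len(RECOMMENDED_LABEL_COLORS):
        raise ValueError("more categories than recommended label colors")
    return dict(zip(unique, RECOMMENDED_LABEL_COLORS))


if __name__ == "__main__":
    print(assign_label_colors(["India", "USA", "India", "Japan"]))
```

Raising an error for more than 12 categories is one possible design choice; it reflects the point that only a small number of colors can serve as reliable labels.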
Perceptual factors (7) to be considered for
color labeling
1. Distinctness
• The degree of perceived difference between two colors
that are placed adjacent to each other
2. Unique Hues
• Red, green, yellow, blue, black, and white are natural
choices when a small set of color codes is required.
3. Contrast with background
• Background colors can dramatically alter the color
appearance of an object, making one color look like
another or giving a weak perception of the object. To
perceive the object better, place a thin black
or white border around the color-coded object.
[Figure: Adding a contrast border to the color dots (b) ensures clarity against all
backgrounds. Showing color-coded lines can be problematic (c).]
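One way to check contrast with the background in code is the luminance contrast ratio defined in the WCAG accessibility guidelines. This metric is not part of these notes; it is used here as an assumed stand-in for "weak perception against the background". The function names and the threshold of 3.0 are illustrative.

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel to linear light (WCAG formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an 8-bit sRGB color, from 0.0 to 1.0."""
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color, background):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lum = sorted((relative_luminance(color), relative_luminance(background)))
    darker, lighter = lum
    return (lighter + 0.05) / (darker + 0.05)


def needs_border(color, background, threshold=3.0):
    """Suggest a thin black or white border when contrast is weak.

    The threshold is an illustrative assumption, not a value from the notes.
    """
    return contrast_ratio(color, background) < threshold


if __name__ == "__main__":
    yellow, white, black = (255, 255, 0), (255, 255, 255), (0, 0, 0)
    print(needs_border(yellow, white))  # yellow symbol on white background
    print(needs_border(black, white))   # black symbol on white background
```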
4. Color Blindness:
• Because there is a substantial color-blind population, it may be desirable to use
colors that can be distinguished by everyone. The majority of color-blind people can't
distinguish colors that differ in the red-green direction. Almost everyone can
distinguish colors that vary in the yellow-blue direction.
5. Number of colors:
• Although color coding is an excellent method for labeling, only a small number (5-10) of colors can be
rapidly perceived.
6. Field Size:
a) Small-field color blindness is the inability to distinguish the colors of small fields when those colors are
weakly saturated. To avoid it, small color-coded areas should have strong, highly saturated colors
for maximum distinction. When large areas are color coded, the colors should be of low saturation and
differ only slightly from one another.
b) This enables small color-coded areas to be easily perceived against the background or against larger areas.
[Fig 2. Equal-interval color sequences]
3. Generate the Colors:
1. Step 1: RGB = (255, 255, 128)
2. Step 2: RGB = (255 - 75, 255 -127, 128 - 64) = (180, 128, 64)
3. Step 3: RGB = (105, 1, 0)
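The three steps above subtract the same interval (75, 127, 64) from the previous color each time. A minimal sketch of that rule follows; the function name is illustrative, and clamping at 0 and 255 is an added safeguard that the notes do not mention.

```python
def equal_interval_sequence(start, interval, count):
    """Generate `count` RGB colors, subtracting `interval` at every step."""
    colors = []
    current = tuple(start)
    for _ in range(count):
        colors.append(current)
        # Subtract the per-channel interval and keep each channel in 0..255.
        current = tuple(max(0, min(255, c - d)) for c, d in zip(current, interval))
    return colors


if __name__ == "__main__":
    for step, rgb in enumerate(equal_interval_sequence((255, 255, 128), (75, 127, 64), 3), 1):
        print("Step", step, "RGB =", rgb)
```

Run with the start color and interval from the notes, this reproduces the three colors listed in Steps 1 to 3.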
Spectrum Approximation Options … Continued
3. Ratio Pseudocolors
A ratio sequence represents data that has a true
zero, with values both above and below zero.
Ratio pseudocolors make such data easier to
interpret, so that patterns, trends, and anomalies
are easier to identify.
Use a neutral color for zero, a red sequence for
negative values, and a green sequence for positive
values, as shown in the picture on the right.
• Medical Imaging: In MRI scans, ratio images can help
identify areas with abnormal tissue properties.
• Geospatial Data: When visualizing environmental data,
such as vegetation indices or water quality, ratio
pseudocolors can effectively show variations.
• Heat Maps: Ratio pseudocolors are often used in heat
maps to show the intensity of data points.
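A ratio pseudocolor mapping of the kind described (neutral at zero, red for negative values, green for positive values) can be sketched with linear interpolation. The function name, the particular end colors, and the use of white as the neutral color are all illustrative assumptions.

```python
def ratio_pseudocolor(value, vmax,
                      neutral=(255, 255, 255),
                      negative=(200, 0, 0),
                      positive=(0, 150, 0)):
    """Map a value in [-vmax, vmax] to an RGB color.

    Zero maps to the neutral color; values fade toward the red end when
    negative and toward the green end when positive.
    """
    if vmax <= 0:
        raise ValueError("vmax must be positive")
    t = max(-1.0, min(1.0, value / vmax))  # clamp to the displayable range
    end = positive if t >= 0 else negative
    weight = abs(t)
    return tuple(round(n + (e - n) * weight) for n, e in zip(neutral, end))


if __name__ == "__main__":
    for v in (-10, -5, 0, 5, 10):
        print(v, ratio_pseudocolor(v, vmax=10))
```

Using the same `vmax` on both sides keeps zero at the neutral midpoint, so equal magnitudes above and below zero get equally strong colors.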
4. Sequence for the color blind
• Use a sequence of colors that varies in the yellow-blue direction, which the majority of the population,
including most color-blind people, can distinguish.
5. Bivariate Color Sequence
• A bivariate color sequence in color visualization is a technique used to represent the
relationship between two variables (Fields) on a single map or chart.
[Figure: bivariate color sequence with the two variables Income and Education]
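A bivariate color sequence can be sketched by blending four corner colors, one for each low/high combination of the two variables. The corner colors below are arbitrary illustrative values, not a published palette, and the function assumes both variables (for example income and education) have already been normalized to the range 0 to 1.

```python
# Corner colors keyed by (x_level, y_level); values are illustrative only.
CORNERS = {
    (0, 0): (232, 232, 232),  # low x, low y
    (1, 0): (200, 90, 90),    # high x, low y
    (0, 1): (90, 120, 200),   # low x, high y
    (1, 1): (70, 30, 80),     # high x, high y
}


def bivariate_color(x, y):
    """Bilinearly blend the corner colors for normalized x and y in [0, 1]."""
    x = max(0.0, min(1.0, x))
    y = max(0.0, min(1.0, y))
    weights = {
        (0, 0): (1 - x) * (1 - y),
        (1, 0): x * (1 - y),
        (0, 1): (1 - x) * y,
        (1, 1): x * y,
    }
    return tuple(
        round(sum(weights[k] * CORNERS[k][i] for k in CORNERS))
        for i in range(3)
    )


if __name__ == "__main__":
    print(bivariate_color(0.0, 0.0))  # both variables low
    print(bivariate_color(1.0, 0.0))  # only x high
    print(bivariate_color(1.0, 1.0))  # both variables high
```

In practice the continuous values are often binned into a small grid of classes, because readers find it hard to decode many blended colors on one map.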
Representing Quantities
1. Assume the first, second, and third objects in each of the three graphs represent Rice,
Vegetables, and Fruits.
2. Which of the three graphs shows the fewest fruits?
3. Which of the three graphs shows the most vegetables?
[Figure: three graphs representing quantities on a 0 to 60 scale; the bar charts label the categories Meat, Vegetables, Fruits]
1. Out of the three patterns in this picture, which ones do you notice best?
2. Out of the three patterns in this picture, where do your eyes focus more? Why?
Gestalt Laws
1. Do you see a full circle behind the green rectangle, or do you see it some other way? If so,
what is that other way?
Gestalt Laws 1:
Spatial Proximity : Place symbols and glyphs (pictures) representing related information close together.
Gestalt Laws 2:
Similarity: When designing a grid layout of a data set, consider coding rows and/or columns using low-level
channel properties, such as color and texture.
When the proximity between objects is the same, similarity determines how the patterns are grouped.
Gestalt Laws 3:
Connectedness: To show relationships between entities, consider linking graphical representations of data
objects using lines or ribbons of color. Connecting objects with lines is a stronger grouping cue than proximity or similarity.
NOTE: Palmer and Rock (1994) argued that connectedness is a fundamental organizing principle and that the
Gestalt psychologists overlooked it.
Gestalt Laws 4:
Continuity: When drawing a complex diagram that shows connections between entities, use simple, smooth, and
continuous lines rather than lines that contain abrupt changes in direction.
Gestalt Laws 5:
Symmetry: Consider using symmetry to make pattern comparisons easier and to create a stronger
perception of the pattern as a whole.
Gestalt Laws 6:
Closure: 1. There is a perceptual tendency to close contours that have gaps in them (Fig. a in the top picture).
2. A simple line contour is adequate for regions with a simple shape, but use lines, colors, texture, and
Cornsweet contours for overlapping regions and complex shapes (Figs. a and b in the bottom picture).
Gestalt Laws 7:
The law of Common Region: When elements are located in the same closed region, we perceive them as
belonging to the same group. Consider putting related information inside a closed contour.
Figure and Ground (background):
• Smaller objects of a pattern tend to be
perceived as objects in the foreground.
Gestalt Laws 8:
Figure and Ground: Use a combination of closure, common region, and layout to ensure that data entities are
represented by graphical patterns that are perceived as figures, not as ground (background).
Pattern Learning
• Pattern extraction is fundamental to extracting meaning from a
visualization. If so, can we teach or learn to see patterns better?
• Artists talk about seeing things that the rest of us can’t see
• Research found that:
• Detecting some patterns requires almost no learning; our neurons learn
this trick in the first few months of life
• Some learning is needed for patterns of intermediate complexity
• Most learning is needed to detect higher-level complex patterns
In Fig 6.50
1. Object a has a very simple grating pattern,
which does not require any learning
2. Object b has a pattern of intermediate complexity,
which requires some learning
3. Object c, higher level pattern tasks, such
as finding downward pointing triangles