09 Plotting and Visualization
3
Introduction
4
Introduction
• Data visualization is the process of
translating data into easily understood
visuals.
• These visuals could be in the form of
graphs, charts, maps, plots, animations
and others.
• Visualization is an important skill for data
professionals for storytelling to
communicate observations effectively and
inform decisions.
5
Introduction
• Visualizations allow us to understand:
• Trends
• Changes
• Gaps
• Relationships
• Directions
6
Common Visualizations
7
Line, Bar and Histogram
• Line Plot: Used to show trends over time or continuous data.
• Plotting temperature over a week
• Bar Chart: Represents categorical data with rectangular bars.
• Number of students in each grade.
• Histogram: Displays the distribution of a dataset.
• Distribution of test scores
8
Bar vs. Histogram
Histogram
• Used for continuous (numerical) data.
• Shows the distribution of data values across intervals (called bins).
• Bars are touching to represent that the intervals are connected.
• X-axis represents ranges of values.
• Y-axis represents frequency (how many values fall in each range).
• Example: Distribution of students’ scores in a test.
Bar Plot
• Used for categorical data.
• Compares different categories or groups.
• Bars are separated to show distinct categories.
• X-axis represents categories (e.g., Grade 9, Grade 10).
• Y-axis shows the value or count for each category.
• Example: Number of students in each grade.
9
Scatter and Pie Chart
• Scatter Plot: Shows the relationship between two numeric variables.
• Height vs. Weight of
• Pie Chart: Shows proportions of a whole as slices of a pie.
• Market share of different smartphone brands.
10
Box Plot and Heatmap
• Box Plot: Displays the distribution of data based on the five-number
summary (minimum, first quartile, median, third quartile, maximum).
• Exam scores showing median and outliers.
• Heatmap: Uses color to represent values in a matrix.
• Example: Correlation between different variables.
11
Data Visualization Process
12
Visualization Process
• The dataset: Build a clean, processed dataset that contains the
information relevant to the story you want to tell.
• The message: Select a single message for each chart. If you want multiple
messages, make multiple charts. Think about your audience and simplify the
details.
• The graph: Each story has a graph. Think about what you want to highlight (trend,
comparison, distribution...).
• The shape and the color: Do not stick to the default values. Choose the colors
that best convey the message (vivid, neutral, gradual, contrasting...) and polish
the graphic to capture attention (add post-production touches, icons,
alternative fonts...).
• Interactive and/or video: Take advantage of the ability to make the graph
interactive or to create a video that shows its evolution (this attracts the
audience and allows a deeper analysis).
https://www.cardinalpath.com/blog/makes-good-visualization
13
Iterative Process of Data Visualization
14
15
Choosing the Right Visualization
16
What do I want to
present, and how do I
make it possible?
17
Visual Cues
18
Visual Cues and Patterns
19
Choosing the Right Visualization
• The type of visualization depends on the data type.
• Data types:
• Categorical data
• Time series
• Spatial data
• Multiple variables
• Distributions
20
Categorical Data
• When the data is straightforward, with one value for each category, we can
use bar and symbol plots.
• Bar graphs must start at the zero axis and extend straight
across or upward to the corresponding value.
• In symbol plots, you can organize squares and circles in any way you
want in two-dimensional space.
21
Categorical Data
• With categorical data we can
use pie and stacked bar plots
22
Categorical Data
• It is common to
normalize the values and
show percentages
• Normalization!
• The total must add up to
100%.
• A normalized bar chart
helps emphasize the
distribution between
categories rather than
magnitude.
23
Subcategories
• Data can have a hierarchical structure which can be important in data
interpretation.
• In this case, we can use treemap or mosaic plots.
24
Treemap
• A treemap displays hierarchical data as nested rectangles, where
the size and color of each rectangle represent quantitative variables.
25
Mosaic
• A mosaic plot visualizes the
relationship between two or
more categorical variables
using a tiled area chart, where
the size of each tile represents
the proportion of
observations in that category
combination.
https://policyviz.com/hmv_post/mosaic-plot-of-the-titanic/
26
Time Series
• When you visualize time series data your goal is to see what has
passed, what is different, and what is the same, and by how much.
27
Time Series Plots
• There are a variety of ways to
see patterns over time.
• Using cues such as length,
direction and position.
• Bar graph: Useful for discrete
points in time
• Line chart: Line makes it easier
to see trends.
• Dot Plot: Shows distinct points,
with a connecting line to show trends.
28
Cycles for Time Series
• There are a lot of things that repeat themselves
on regular intervals like time of day, day of
week and month of the year.
• To show this repetition use Radial and Calendar
plots.
• Radial Plot: arranges time-based data (e.g.,
hours, days, months) in a circular layout to
reveal cyclical patterns or seasonality over time.
• Calendar Plot: Displays time series data over
days in a calendar layout (monthly or yearly),
making it easy to spot daily, weekly, or seasonal
patterns and anomalies.
29
Cycles for Time Series
• Another alternative is to use the
Spiral Plot.
• It maps data along a spiral
shape to highlight periodic
patterns, trends, and
seasonality over extended
periods while saving space and
showing continuity.
30
Spatial Data
• Spatial data, also known as geospatial data, is information about the
location and shape of objects on Earth.
• It includes coordinates (like latitude and longitude) and often additional
attributes describing those objects.
• There is a natural hierarchy to spatial data that allows you to explore at
different granularities
• The most obvious way to explore spatial data is with maps, which place
values within a geographic coordinate system
• Key Concepts:
• Location: Where something is (e.g., a house at specific GPS coordinates).
• Shape: The geometry of objects (point, line, or polygon).
• Attributes: Descriptive information (e.g., a city's name, population, or area).
31
Spatial Data
• Types of Spatial Data:
• Vector data: Represented by:
• Points (e.g., trees, wells)
• Lines (e.g., roads, rivers)
• Polygons (e.g., lakes, city boundaries)
• Raster data:
• Made of a grid of pixels (e.g., satellite images, weather
maps)
• Each pixel has a value (like temperature or elevation)
• Google Maps uses spatial data to show roads, places,
and directions.
• A GIS (Geographic Information System) maps flood
zones, land use, or traffic patterns.
32
Spatial Data
• Bubbles for the airports, sized by the number of outgoing flights.
• Where the busiest airports are
• how busy they are relative to each other.
33
Spatial Data
https://www.americansocceranalysis.com/
34
Multiple Variables
• Multiple variables data refers to datasets that contain more than one
variable (or feature) measured for each observation (or record).
• Each variable represents a different characteristic, and each
observation is a row in the dataset.
• Some visualization methods let you explore multivariate data in one
view.
• This allows you to interpret relationships between variables and
explore trends in individual ones.
35
Multiple Variables
• A scatter plot can encode up to four
variables
• Example:
• X-Axis → usage percentage
• Y-Axis → points per game
• Area → rebounds
• Color → assists
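The four-variable mapping above can be sketched with matplotlib's scatter, where s sets the marker area and c sets the marker color. The player statistics here are randomly generated placeholders, and the scaling factor for the marker area is an arbitrary choice for readability.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic player statistics (made-up values, for illustration only)
rng = np.random.default_rng(0)
n_players = 30
usage_pct = rng.uniform(10, 35, n_players)   # x-axis
points = rng.uniform(5, 30, n_players)       # y-axis
rebounds = rng.uniform(1, 12, n_players)     # marker area
assists = rng.uniform(0, 10, n_players)      # marker color

fig, ax = plt.subplots(figsize=(6, 4))
sc = ax.scatter(usage_pct, points, s=rebounds * 20, c=assists,
                cmap="viridis", alpha=0.7)
ax.set_xlabel("Usage percentage")
ax.set_ylabel("Points per game")
fig.colorbar(sc, ax=ax, label="Assists per game")
```

Area and color are harder to read precisely than position, so the two most important variables usually go on the axes.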
36
Multiple Variables
• You might show four variables
with a scatter plot, but what
about five variables?
• There are views that are more
conducive to comparing
multiple variables at once.
• When there are many
variables, we can instead
represent the Pearson
correlation coefficient of each
pair with ellipses and colors.
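As a rough sketch of the correlation view, the snippet below computes pairwise Pearson coefficients with pandas and draws them as colored cells. Note that it swaps the ellipses for a plain color grid (a heatmap), which is simpler to produce with matplotlib alone; the data and column names are synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(13)
df = pd.DataFrame(rng.standard_normal((200, 5)), columns=list("ABCDE"))  # synthetic data
df["B"] += df["A"]          # induce a positive correlation with A
df["C"] -= 0.5 * df["A"]    # induce a negative correlation with A

corr = df.corr(method="pearson")    # 5 x 5 matrix of pairwise coefficients

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")
```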
37
Plotting Distributions
• Distribution plots are used to
visualize the spread, central
tendency, and shape of data to
understand patterns, variability,
and potential outliers.
• Box plot shows summary statistics
(median, quartiles, and outliers) of
a distribution in a compact form
• Violin plot combines a box plot
with a rotated KDE plot to show
both summary statistics and the
full shape of the distribution.
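A minimal comparison of the two plot types, using matplotlib's own boxplot and violinplot on synthetic samples. The sample parameters are arbitrary; seaborn offers similar functions with more styling options.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Three synthetic samples with different centers and spreads
samples = [rng.normal(loc, scale, 200)
           for loc, scale in [(0, 1), (2, 0.5), (1, 2)]]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
box = ax_box.boxplot(samples)                              # median, quartiles, outliers
ax_box.set_title("Box plot")
violin = ax_violin.violinplot(samples, showmedians=True)   # KDE outline plus median
ax_violin.set_title("Violin plot")
```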
38
Plotting Distributions
39
Plotting with matplotlib
40
Matplotlib: MATLAB-style Scientific
Visualization
• Matplotlib is a Python plotting library which produces
publication quality figures in a variety of hardcopy formats.
• Website: https://matplotlib.org/
• Also, check the tutorial package website:
https://matplotlib.org/tutorials/introductory/pyplot.html
41
A Brief matplotlib API Primer
• To set up Jupyter Notebook,
run %matplotlib notebook
(%matplotlib in IPython).
42
Figures and Subplots
• Plots in matplotlib reside within a
Figure object.
• plt.figure(): creates a new figure object
(the main container for all plot elements such
as subplots, titles, axes, etc.).
• figsize=(4, 3): size of the figure (width, height) in inches.
• You can create one or more subplots
inside a blank figure.
• fig.add_subplot(nrows, ncols, index):
splits the figure into a grid of nrows x
ncols subplots.
• The index (1-based) tells which cell in the
grid to place the subplot.
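The calls described above can be put together as follows. This is a small sketch that only creates the containers; the 2 x 2 grid is an arbitrary choice.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3))   # width and height in inches
ax1 = fig.add_subplot(2, 2, 1)     # 2 x 2 grid, first cell (top-left)
ax2 = fig.add_subplot(2, 2, 2)     # second cell (top-right)
ax3 = fig.add_subplot(2, 2, 3)     # third cell (bottom-left)
```

The fourth cell is simply left empty. plt.subplots(2, 2) is a common shortcut that creates the figure and all four axes in one call.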
43
Figures and Subplots
• ax1.hist(...): Plots a
histogram on ax1 using those
values.
• ax2.scatter(...): Plots a
scatter plot on ax2
• ax3.plot(...): Line plot on ax3.
• plt.plot(…): draws on the last
active subplot, even though
it’s not explicitly connected
to any ax
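A sketch of the axes methods listed above, drawn on random data. It also illustrates that plt.plot targets the current axes, which here is the most recently created subplot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

ax1.hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * rng.standard_normal(30))
ax3.plot(rng.standard_normal(50).cumsum(), "k--")

# plt.plot draws on the current axes, which is still ax3
plt.plot(rng.standard_normal(50).cumsum(), "r-")
```

Calling methods on an explicit axes object (ax1, ax2, ...) is usually less error-prone than relying on the current axes.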
44
Colors, Markers, and Line Styles
• In the plot you can specify line color, style and marker shape
• Line Color:
• Color name, short code, hex code, or RGB tuple.
• Short codes: 'b' (blue), 'g' (green), 'r' (red), 'k' (black), 'y' (yellow), etc.
46
Colors, Markers, and Line Styles
• You can specify color, line style, and marker either as a single format
string or explicitly using the function parameters.
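Both ways of specifying the style can be compared side by side. In this sketch the format string "go--" and the keyword arguments are meant to produce the same styling; the data is random.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(3).standard_normal(30).cumsum()
fig, ax = plt.subplots()

# Format string: color 'g', marker 'o', line style '--'
(line_a,) = ax.plot(data, "go--")
# The same styling spelled out with keyword arguments
(line_b,) = ax.plot(data + 1, color="g", linestyle="--", marker="o")
# Colors can also be given as hex codes or RGB tuples
(line_c,) = ax.plot(data + 2, color="#FF8800")
(line_d,) = ax.plot(data + 3, color=(0.1, 0.2, 0.5))
```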
47
Colors, Markers, and Line Styles
• The drawstyle option in
Matplotlib controls how lines
are drawn between points.
• Options:
• 'default': Straight line
between points
• 'steps': Same as 'steps-pre'
• 'steps-pre': The step occurs at the
start of each interval; the segment
ending at x[i] takes the value y[i]
• 'steps-mid': The step occurs at the
midpoint between x-values
• 'steps-post': The step occurs at the
end of each interval; the segment
starting at x[i] takes the value y[i]
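To see the difference between the options, the same points can be drawn once per draw style. This sketch offsets each curve vertically so they do not overlap; the data values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(6)
y = np.array([1, 3, 2, 5, 4, 6])

fig, ax = plt.subplots()
styles = ["default", "steps-pre", "steps-mid", "steps-post"]
lines = {}
for offset, style in enumerate(styles):
    # Shift each curve upward so the four styles stay visually separate
    (lines[style],) = ax.plot(x, y + 7 * offset, drawstyle=style,
                              marker="o", label=style)
ax.legend(loc="best")
```

The markers show the original data points, which makes it easy to see on which side of each point the step is placed.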
48
Annotation: Titles, Legends, Axes Labels
• Good visuals require good
annotation
• Title, axis labels, tick marks, tick
labels, legend, gridlines, drawn
shapes …
49
Annotation: Ticks and Tick Labels
• You can specify where
the ticks appear on
both axes.
• You can also assign
labels to these ticks.
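A short sketch of tick placement and labelling on a random walk, together with a title, axis labels, a legend, and gridlines. The label text is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(np.random.default_rng(4).standard_normal(1000).cumsum(), label="walk")

ax.set_xticks([0, 250, 500, 750, 1000])                      # tick positions
ax.set_xticklabels(["one", "two", "three", "four", "five"],  # tick labels
                   rotation=30, fontsize="small")
ax.set_title("A random walk")
ax.set_xlabel("Stages")
ax.set_ylabel("Position")
ax.legend(loc="best")
ax.grid(True)
```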
50
Annotation: Ticks and Tick Labels
51
Annotation: Adding Text
52
Annotation: Drawing Shapes
53
Saving Plots to File
• When saving to a file, the
file extension specifies the
image format.
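For instance, the same figure can be written as PNG and PDF just by changing the extension. This sketch writes into a temporary directory; the file names and the dpi value are illustrative choices.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()                    # hypothetical output location
png_path = os.path.join(out_dir, "figure.png")
pdf_path = os.path.join(out_dir, "figure.pdf")

fig.savefig(png_path, dpi=300, bbox_inches="tight")  # raster image at 300 dots per inch
fig.savefig(pdf_path)                                # vector format, inferred from ".pdf"
```

bbox_inches="tight" trims the whitespace around the figure, which is often useful for reports.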
54
matplotlib Configuration
• In matplotlib, rc stands for runtime configuration.
plt.rc('figure', figsize=(10, 10))
font_options = {
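A possible way to apply rc settings is sketched below. The font values are assumptions chosen for illustration (the original dictionary is not shown in full), and plt.rc_context is used so that the global defaults are restored afterwards.

```python
import matplotlib.pyplot as plt

# Settings changed inside this block are undone when the block exits
with plt.rc_context():
    plt.rc("figure", figsize=(10, 10))
    # Illustrative font settings (assumed values, not taken from the slides)
    font_options = {"family": "monospace", "weight": "bold", "size": 8}
    plt.rc("font", **font_options)

    fig, ax = plt.subplots()                       # picks up the new defaults
    configured_size = tuple(fig.get_size_inches())
    configured_font_size = plt.rcParams["font.size"]
    plt.close(fig)
```

Without the rc_context wrapper, plt.rc changes stay in effect for the rest of the session; plt.rcdefaults() resets them.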
56
matplotlib Configuration
57
Plotting with pandas
58
Plotting with pandas
• matplotlib is a low-level tool; you assemble a plot from its
base components.
59
Line Plots with Series
• Series and DataFrame each have a plot method for making some basic
plot types.
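A minimal Series example, using random data and an arbitrary integer index:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.default_rng(5).standard_normal(10).cumsum(),
              index=np.arange(0, 100, 10))
fig, ax = plt.subplots()
s.plot(ax=ax)   # the Series index is used for the x-axis
```

Passing ax is optional; without it, pandas draws on the current matplotlib axes.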
60
Series.plot method arguments
61
Line Plots with DF
• DataFrame plots each of its
columns as a different line on the
same subplot, creating a legend
automatically.
• The default is a line plot.
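The same idea with a DataFrame, sketched on random data with four hypothetical columns:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    np.random.default_rng(6).standard_normal((10, 4)).cumsum(axis=0),
    columns=["A", "B", "C", "D"],
    index=np.arange(0, 100, 10),
)
fig, ax = plt.subplots()
df.plot(ax=ax)   # kind="line" is the default; one line per column
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```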
62
Line Plots with DF
• Select plot type using the kind
parameter:
• df.plot(kind='bar') # Bar chart
• df.plot(kind='hist') # Histogram
• df.plot(kind='box') # Box plot
• df.plot(kind='area') # Area plot
• df.plot(kind='scatter', x='A', y='B') # Scatter plot
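The kind values can be tried out on one small DataFrame. This sketch uses random non-negative data (stacked area plots need values of a single sign) and places each plot type in its own subplot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.default_rng(7).uniform(0, 1, (20, 2)),
                  columns=["A", "B"])

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
kinds = ["line", "bar", "hist", "box", "area"]
for ax, kind in zip(axes.flat, kinds):
    df.plot(kind=kind, ax=ax, title=kind)
# Scatter plots need the x and y columns to be named explicitly
df.plot(kind="scatter", x="A", y="B", ax=axes.flat[5], title="scatter")
titles = [ax.get_title() for ax in axes.flat]
```

Each kind also has an accessor form, for example df.plot.bar() or df.plot.scatter(x="A", y="B").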
63
DataFrame-specific plot arguments
64
DataFrame-specific plot arguments
65
Bar Plots
• plot.bar() and plot.barh() make vertical and horizontal
bar plots, respectively.
66
Bar Plots
• With a DataFrame, bar plots group the values in each row together in a
group of bars, side by side, with one bar for each column.
df
Genus A B C D
one 0.801554 0.094551 0.469551 0.619210
two 0.208189 0.792578 0.648303 0.260912
three 0.642697 0.847883 0.767702 0.856446
four 0.113493 0.083676 0.283905 0.023767
five 0.220087 0.573322 0.800078 0.514133
six 0.929547 0.272519 0.783754 0.007303
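The DataFrame printed above can be rebuilt and plotted as grouped bars. This sketch assumes "Genus" is the name of the columns index, as the printout suggests.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    [[0.801554, 0.094551, 0.469551, 0.619210],
     [0.208189, 0.792578, 0.648303, 0.260912],
     [0.642697, 0.847883, 0.767702, 0.856446],
     [0.113493, 0.083676, 0.283905, 0.023767],
     [0.220087, 0.573322, 0.800078, 0.514133],
     [0.929547, 0.272519, 0.783754, 0.007303]],
    index=["one", "two", "three", "four", "five", "six"],
    columns=pd.Index(["A", "B", "C", "D"], name="Genus"),
)
fig, ax = plt.subplots()
df.plot.bar(ax=ax)   # one group of four bars per row; the legend is titled "Genus"
```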
67
Bar Plots
• We create
stacked bar plots
from a
DataFrame by
passing
stacked=True.
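A small stacked example on made-up values, where each row becomes one stack and each column one segment:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [2.0, 1.0, 0.5], "C": [0.5, 0.5, 1.5]},
    index=["one", "two", "three"],
)
fig, ax = plt.subplots()
df.plot.bar(stacked=True, alpha=0.5, ax=ax)   # segments are drawn on top of each other
```

The total height of each stack equals the row sum, so stacked bars show both the total and its composition.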
68
Bar Plots
• To visualize a Series’s value frequency use:
s.value_counts().plot.bar().
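For example, with a short made-up Series of category labels:

```python
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(["a", "b", "b", "c", "b", "a"])   # made-up categorical values
counts = s.value_counts()                        # sorted by frequency, descending
fig, ax = plt.subplots()
counts.plot.bar(ax=ax)                           # one bar per distinct value
```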
69
Bar Plots
• A crosstab (short for cross-tabulation) is a way to compute a frequency
table of two or more categorical variables.
• Example: Make a stacked bar plot showing the percentage of data points
for each Gender on each Preference. Hint: use crosstab.
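One way to approach this exercise is sketched below. The survey data is hypothetical, and the choice of Preference as rows and Gender as columns is an assumption about what the exercise intends; normalize="index" turns each row into proportions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data; column names follow the exercise statement
survey = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Preference": ["Tea", "Tea", "Coffee", "Coffee",
                   "Tea", "Coffee", "Tea", "Coffee"],
})

# Rows: Preference, columns: Gender; each row is scaled to sum to 100
pct = pd.crosstab(survey["Preference"], survey["Gender"],
                  normalize="index") * 100
fig, ax = plt.subplots()
pct.plot.bar(stacked=True, ax=ax)
ax.set_ylabel("Percent")
```

Because every stack adds up to 100%, the plot emphasizes the distribution within each category rather than the raw counts.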
70
Bar Plots
71
Histograms and Density Plots
• A histogram is a kind of bar plot
that gives a discretized display of
value frequency.
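A minimal histogram of a pandas Series, using synthetic normal data and an arbitrary bin count:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series(np.random.default_rng(8).normal(0, 1, 500))
fig, ax = plt.subplots()
values.plot.hist(bins=25, ax=ax)   # 25 equal-width bins over the data range
```

The bar heights are the counts per bin, so they add up to the number of observations.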
72
Plotting with seaborn
73
Plotting with seaborn
• YouTube Video from Kimberly Fessel
https://youtu.be/vaf4ir8eT38
74
Bar Plots (seaborn)
• Example: Use seaborn to visualize tip
percent on tips dataset using bar plot.
• By default, seaborn bar plots show 95%
confidence intervals as error bars on
each bar (use errorbar=None to remove
them; older seaborn versions use ci=None).
• 95% confidence: If we repeated this
sampling many times, about 95% of the
intervals computed this way would
contain the true mean.
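A sketch of the bar plot on a few made-up rows shaped like the tips dataset (the real data can be loaded with sns.load_dataset("tips"), which needs a network connection). The definition of tip_pct used here, tip divided by the bill before the tip, is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A few made-up rows shaped like the seaborn "tips" dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "day":        ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})
# Tip as a fraction of the bill before the tip (assumed definition)
tips["tip_pct"] = tips["tip"] / (tips["total_bill"] - tips["tip"])

fig, ax = plt.subplots()
# Bar length is the mean tip_pct per day; error bars show the 95% CI of the mean
sns.barplot(data=tips, x="tip_pct", y="day", ax=ax)
```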
76
Histograms and Density Plots
• Seaborn’s displot can plot both a histogram and a continuous
density estimate simultaneously (pass kde=True).
• Example: a bimodal distribution built from two normal distributions.
77
Scatter or Point Plots
• Point plots or scatter plots can be a useful way of examining the
relationship between two one-dimensional data series.
• sns.regplot function in Seaborn is used to create a scatter plot with a
regression line fit to the data.
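A minimal regplot sketch on synthetic data generated around the line y = 2x + 1 (the slope, intercept, and noise level are illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(10)
x = rng.uniform(0, 10, 100)
# Synthetic linear relationship with Gaussian noise
data = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0 + rng.normal(0, 1, 100)})

fig, ax = plt.subplots()
# Scatter points, fitted regression line, and a shaded confidence band
sns.regplot(data=data, x="x", y="y", ax=ax)
```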
78
Scatter or Point Plots
• In exploratory data analysis,
it’s helpful to look at all the
scatter plots among a group of
variables; this is known as a
pairs plot or scatter plot matrix.
79
Catplots
• You can use catplots to visualize data with many categorical variables.
• Example: Compare tip percentage with smoking.
80
Catplots
• Example: Show time in a
different facet.
81
Catplots
• Example: Draw a box plot to
show the median, quartiles, and
outliers.
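The three catplot examples above can be sketched as follows. The data is randomly generated in the shape of the tips dataset, so the plots show the mechanics rather than real patterns; the column names are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data shaped like the "tips" dataset (values are made up)
rng = np.random.default_rng(12)
n = 120
tips = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "tip_pct": rng.uniform(0.05, 0.35, n),
})

# Bars per day, split by smoker status
g_bar = sns.catplot(data=tips, x="day", y="tip_pct", hue="smoker", kind="bar")
# Same plot with one facet (column) per value of time
g_facet = sns.catplot(data=tips, x="day", y="tip_pct", hue="smoker",
                      col="time", kind="bar")
# Box plots: median, quartiles, and outliers per day
g_box = sns.catplot(data=tips, x="tip_pct", y="day", kind="box")
```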
82
Plotting with squarify and plotly
83
Plotting with squarify and plotly
• There are many Python plotting packages that offer advanced
types of plots.
• squarify: a minimalistic library focused only on treemaps
(Matplotlib-based).
• plotly: supports many interactive plots, including hierarchical
plots.
• Treemap
• Sunburst Chart
• Icicle Chart
• Heatmaps
• Choropleth
84
Plotting Treemap Using squarify
85
Plotting Treemap Using squarify
86
Plotting Treemap Using squarify
87
Plotting Treemap Using plotly
88
Plotting Treemap Using plotly
89
Plotting Sunburst Chart with plotly
A sunburst chart is a multilevel pie chart used to
visualize hierarchical data. It shows how categories
are nested within each other, using concentric
circles.
90
Heatmap of Iris Dataset
91
Choropleth of GDP
A choropleth is a type of map where regions are
shaded or colored in proportion to a numerical
value—essentially, it's a way to visualize how a
measurement varies across a geographic area.
92
Advanced Plots
93
Source: http://www.poppyfield.org/
94
Sankey Diagram: https://sankey.csaladen.es/
95
Websites and Tools for Advanced Plots
• Data Visualization Catalogue: https://datavizcatalogue.com/
• Guides you to tools that generate specific plots
• Sunburst Charts: https://www.aculocity.com/labs/sunburst-chart
• Sankey Diagram: https://sankey.csaladen.es/
• Mosaics Plot: https://www.datavis.ca/online/mosaics/about.html
• Dependency Wheel: https://circos.ca/
96
Exercises
97