Data Visualization in R and Python
1. Cover
2. Table of Contents
3. Title Page
4. Copyright
5. Preface
6. Introduction
7. About the Companion Website
8. Part I: Static Graphics with ggplot (R) and Seaborn (Python)
1. 1 Scatterplots and Line Plots
1. 1.1 R: ggplot
2. 1.2 Python: Seaborn
2. 2 Bar Plots
1. 2.1 R: ggplot
2. 2.2 Python: Seaborn
3. 3 Facets
1. 3.1 R: ggplot
2. 3.2 Python: Seaborn
4. 4 Histograms and Kernel Density Plots
1. 4.1 R: ggplot
2. 4.2 Python: Seaborn
5. 5 Diverging Bar Plots and Lollipop Plots
1. 5.1 R: ggplot
2. 5.2 Python: Seaborn
6. 6 Boxplots
1. 6.1 R: ggplot
2. 6.2 Python: Seaborn
7. 7 Violin Plots
1. 7.1 R: ggplot
2. 7.2 Python: Seaborn
8. 8 Overplotting, Jitter, and Sina Plots
1. 8.1 Overplotting
2. 8.2 R: ggplot
3. 8.3 Python: Seaborn
9. 9 Half-Violin Plots
1. 9.1 R: ggplot
2. 9.2 Python: Seaborn
10. 10 Ridgeline Plots
1. 10.1 History of the Ridgeline
2. 10.2 R: ggplot
11. 11 Heatmaps
1. 11.1 R: ggplot
2. 11.2 Python: Seaborn
12. 12 Marginals and Plots Alignment
1. 12.1 R: ggplot
2. 12.2 Python: Seaborn
13. 13 Correlation Graphics and Cluster Maps
1. 13.1 R: ggplot
2. 13.2 Python: Seaborn
3. 13.3 R: ggplot
4. 13.4 Python: Seaborn
9. Part II: Interactive Graphics with Altair
1. 14 Altair Interactive Plots
1. 14.1 Scatterplots
2. 14.2 Line Plots
3. 14.3 Bar Plots
4. 14.4 Bubble Plots
5. 14.5 Heatmaps and Histograms
10. Part III: Web Dashboards
1. 15 Shiny Dashboards
1. 15.1 General Organization
2. 15.2 Second Version: Graphics and Style Options
3. 15.3 Third Version: Tabs, Widgets, and Advanced Themes
4. 15.4 Observe and Reactive
2. 16 Advanced Shiny Dashboards
1. 16.1 First Version: Sidebar, Widgets, Customized Themes, and Reactive/Observe
2. 16.2 Second Version: Tabs, Shinydashboard, and Web Scraping
3. 16.3 Third Version: Altair Graphics
3. 17 Plotly Graphics
1. 17.1 Plotly Graphics
4. 18 Dash Dashboards
1. 18.1 Preliminary Operations: Import and Data Wrangling
2. 18.2 First Dash Dashboard: Base Elements and Layout Organization
3. 18.3 Second Dash Dashboard: Sidebar, Widgets, Themes, and Style Options
4. 18.4 Third Dash Dashboard: Tabs and Web Scraping of HTML Tables
5. 18.5 Fourth Dash Dashboard: Light Theme, Custom CSS Style Sheet, and Interactive Altair Graphics
11. Part IV: Spatial Data and Geographic Maps
1. 19 Geographic Maps with R
1. 19.1 Spatial Data
2. 19.2 Choropleth Maps
3. 19.3 Multiple and Annotated Maps
4. 19.4 Spatial Data (sp) and Simple Features (sf)
5. 19.5 Overlaid Graphical Layers
6. 19.6 Shape Files and GeoJSON Datasets
7. 19.7 Venice: Open Data Cartography and Other Maps
8. 19.8 Thematic Maps with tmap
9. 19.9 Rome’s Accommodations: Intersecting Geometries with Simple Features and tmap
2. 20 Geographic Maps with Python
1. 20.1 New York City: Plotly
2. 20.2 Overlaid Layers
3. 20.3 Geopandas: Base Map, Data Frame, and Overlaid Layers
4. 20.4 Folium
5. 20.5 Altair: Choropleth Map
12. Index
13. End User License Agreement
List of Illustrations
1. Chapter 1
1. Figure 1.1 Output of the ggplot function with x and y aesthetics.
2. Figure 1.2 First ggplot’s scatterplot.
3. Figure 1.3 Scatterplot with color aesthetic.
4. Figure 1.4 Scatterplot with color aesthetic for marital status variable.
5. Figure 1.5 Scatterplot with income as dependent variable and color aesthetic...
6. Figure 1.6 (a/b) Scatterplots with four variables.
7. Figure 1.7 United States’ inflation values 1960–2022.
8. Figure 1.8 Inflation values for a sample of countries.
9. Figure 1.9 Dots colors based on an aesthetic when over a threshold, otherwis...
10. Figure 1.10 Markers colored based on two thresholds and textual labels, US i...
11. Figure 1.11 Temperature measurement in some US cities, minimum temperatures....
12. Figure 1.12 A problematic line plot, groups are not respected.
13. Figure 1.13 Line plot connecting points of same country.
14. Figure 1.14 Line plot with style options.
15. Figure 1.15 Scatterplot of the United States’ GDP time series from the World...
16. Figure 1.16 Scatterplot of the GDP for a sample of countries.
17. Figure 1.17 Scatterplot with markers styled differently for from year 2000 a...
18. Figure 1.18 Temperature measurement in some US cities, maximum temperatures....
19. Figure 1.19 Line plot of GDP variations for a sample of countries.
20. Figure 1.20 Line plot with line style varied according to country.
21. Figure 1.21 Line plot and scatterplot overlapped.
22. Figure 1.22 Line plot with markers automatically added.
2. Chapter 2
1. Figure 2.1 Bar plot with two variables.
2. Figure 2.2 Bar plot with custom color palette, horizontal bar orientation, a...
3. Figure 2.3 Bar plot with ranges of values for PM10 derived from a continuous...
4. Figure 2.4 Bar plot with ordered bars and x ticks rotated.
5. Figure 2.5 Bar plot with three variables and groups of bars.
6. Figure 2.6 Bar plot with month names and the legend moved outside the plot....
7. Figure 2.7 Bar plot with stacked bars.
8. Figure 2.8 Bar plot with ranges of values derived from a continuous variable...
9. Figure 2.9 Bar plots with quantile representation, subplots, and style optio...
3. Chapter 3
1. Figure 3.1 Temperature measurement in some US cities, minimum temperatures, ...
2. Figure 3.2 Facet visualization with bar plots, some facets not readable due ...
3. Figure 3.3 Facet visualization with independent scale on y-axis.
4. Figure 3.4 Facet visualization with bar plots, facets are all well-readable ...
5. Figure 3.5 Temperature measurement in some US cities, maximum temperatures, ...
6. Figure 3.6 Facets and bar plot visualization.
7. Figure 3.7 Incorrect facet visualization (single facet detail).
8. Figure 3.8 Facet visualization with the general method, unbalanced facets.
9. Figure 3.9 Facet visualization with the general method, independent scales....
10. Figure 3.10 Facet visualization with balanced and meaningful bar plots.
4. Chapter 4
1. Figure 4.1 Number of bins equals to 30.
2. Figure 4.2 Bin width equal to 10.
3. Figure 4.3 Facets visualization with histograms.
4. Figure 4.4 Histogram for bivariate analysis with rectangular tiles.
5. Figure 4.5 Histogram for bivariate analysis with hexagonal tiles.
6. Figure 4.6 Histogram for bivariate analysis with facet visualization.
7. Figure 4.7 Kernel density for bivariate analysis with isodensity curves.
8. Figure 4.8 Kernel density for bivariate analysis with color gradient, NYC ma...
9. Figure 4.9 Kernel density for bivariate analysis with color gradient, NYC mi...
10. Figure 4.10 Histogram for univariate analysis, bin width equals 20.
11. Figure 4.11 Histogram for univariate analysis and kernel density, bin width ...
12. Figure 4.12 Histogram for univariate analysis with stacked bars.
13. Figure 4.13 Histogram for bivariate analysis and continuous variables.
14. Figure 4.14 Histogram for bivariate analysis with a categorical variable.
15. Figure 4.15 Histogram for bivariate analysis and facet visualization.
16. Figure 4.16 Histogram with logarithmic scale.
17. Figure 4.17 Histogram with logarithmic scale and symmetric log.
18. Figure 4.18 Histogram with stacked visualization, logarithmic scale, and sym...
19. Figure 4.19 Histogram with stacked visualization, logarithmic scale, and sym...
5. Chapter 5
1. Figure 5.1 Diverging bar plot, yearly wheat production variations for Argent...
2. Figure 5.2 Diverging bar plot with ordered bars and annotation, yearly varia...
3. Figure 5.3 Lollipop plot, yearly wheat production variations for Argentina....
4. Figure 5.4 Lollipop plot ordered by values and annotation, yearly variations...
5. Figure 5.5 Diverging bar plot, yearly wheat production variations for the Un...
6. Figure 5.6 Diverging bar plot, yearly wheat production variations for the Un...
6. Chapter 6
1. Figure 6.1 Boxplot statistics.
2. Figure 6.2 Boxplot, air quality in Milan, 2021.
3. Figure 6.3 Boxplot with three variables, confused result.
4. Figure 6.4 Boxplot with three variables, unbalanced facet visualization.
5. Figure 6.5 Boxplot with three variables, balanced facet visualization.
6. Figure 6.6 Box plot with three variables, the result is confused.
7. Figure 6.7 Boxplot with three variables, facet visualization.
7. Chapter 7
1. Figure 7.1 Violin plot, OECD/Pisa tests, male and female students, Mathemati...
2. Figure 7.2 Density plot, OECD/Pisa tests, male and female students, Mathemat...
3. Figure 7.3 Boxplot, OECD/Pisa tests, male and female students, Mathematics s...
4. Figure 7.4 Violin plot and scatterplot combined and correctly overlapped and...
5. Figure 7.5 Violin plot and boxplot combined and correctly overlapped and dod...
6. Figure 7.6 OECD/Pisa tests, male and female students, Mathematics, Reading, ...
7. Figure 7.7 Violin plot, bike thefts in Berlin, and bike values.
8. Figure 7.8 Violin plot, bike thefts in Berlin for each month of years 2021 a...
9. Figure 7.9 Bar plot, bike thefts in Berlin for each month of years 2021 and ...
10. Figure 7.10 Violin plot, bike thefts in Berlin for bike type and month, year...
8. Chapter 8
1. Figure 8.1 Categorical scatterplot with jitter, OECD/Pisa tests results for ...
2. Figure 8.2 Categorical scatterplot with reduced jitter.
3. Figure 8.3 Categorical scatterplot with increased jitter.
4. Figure 8.4 Violin plot and scatterplot with jitter, OECD/Pisa tests results ...
5. Figure 8.5 Violin plot, boxplot, and scatterplot with jitter, OECD/Pisa test...
6. Figure 8.6 Sina plot, OECD/Pisa tests results for male and female students, ...
7. Figure 8.7 Sina plot and violin plot combined, OECD/Pisa tests results for m...
8. Figure 8.8 Sina plot and boxplot, OECD/Pisa tests results for male and femal...
9. Figure 8.9 Sina plot with stacked groups of data points and color based on l...
10. Figure 8.10 Beeswarm plot, OECD/Pisa test results for male and female studen...
11. Figure 8.11 Comparing overplotting, jitter, sina plot, and beeswarm plot.
12. Figure 8.12 Strip plot, bike thefts in Berlin.
13. Figure 8.13 Swarm plot, men’s and ladies’ bike thefts in Berlin, October 202...
14. Figure 8.14 Sina plot, men’s and ladies’ bike thefts in Berlin in January 20...
9. Chapter 9
1. Figure 9.1 Half-violin plot, custom function, OECD/Pisa test results for mal...
2. Figure 9.2 Half-violin plot, boxplot, and scatterplot with jitter correctly ...
3. Figure 9.3 OECD/Pisa tests, male and female students, Mathematics, Reading, ...
4. Figure 9.4 Left-side half-violin plots, male and female students, Mathematic...
5. Figure 9.5 Raincloud plot, male and female students, Mathematics, Reading, a...
6. Figure 9.6 Violin plot with groups of two subsets of points, bike thefts in ...
7. Figure 9.7 Half-violin plots with sticks.
8. Figure 9.8 Half-violin plots with quartiles.
10. Chapter 10
1. Figure 10.1 “Many consecutive pulses from CP1919,” in Harold Dumont Craft, J...
2. Figure 10.2 Ridgeline plot, OECD-Pisa tests, default alphabetical order base...
3. Figure 10.3 Ridgeline plot, OECD-Pisa tests, custom order based on arithmeti...
4. Figure 10.4 Ridgeline plot, OECD-Pisa tests, custom order based on arithmeti...
5. Figure 10.5 Ridgeline plot, OECD-Pisa tests, custom order based on arithmeti...
11. Chapter 11
1. Figure 11.1 Heatmap, bike thefts in Berlin for months and hours of day.
2. Figure 11.2 Heatmap, bike thefts in Berlin for months and hours and style el...
3. Figure 11.3 Heatmap, number of bike thefts in Berlin for months and hours.
4. Figure 11.4 Heatmap, value of stolen bikes in Berlin for months and hours.
12. Chapter 12
1. Figure 12.1 Marginal with scatterplot and histograms, bike thefts in Berlin ...
2. Figure 12.2 Plots aligned in a vertical grid, marginals, bike thefts in Berl...
3. Figure 12.3 Marginal with scatterplot and rug plots, bike thefts in Berlin (...
4. Figure 12.4 Marginal with categorical scatterplot and rug plot, number of st...
5. Figure 12.5 Subplots, a scatter plot and a boxplot horizontally aligned, sto...
6. Figure 12.6 Subplots, a scatter plot and a boxplot vertically aligned, stole...
7. Figure 12.7 Joint plot with density plots as marginals, stolen bikes in Berl...
8. Figure 12.8 Joint grid with scatterplot and rug plots as marginals, stolen b...
13. Chapter 13
1. Figure 13.1 Cluster map, bike thefts in Berlin (2021–2022), values scaled by...
2. Figure 13.2 Cluster map, bike thefts in Berlin (2021–2022), values scaled by...
3. Figure 13.3 Cluster map, stolen bikes in Berlin (2021–2022), scaled by colum...
4. Figure 13.4 Cluster map, stolen bikes in Berlin (2021–2022), scaled by rows....
5. Figure 13.5 Diagonal correlation heatmap, stolen bikes in Berlin (2021–2022)...
6. Figure 13.6 Diagonal correlation heatmap, stolen bikes in Berlin, correlatio...
7. Figure 13.7 Scatterplot heatmap, stolen bikes in Berlin (2021–2022), correla...
14. Chapter 14
1. Figure 14.1 Altair, scatterplot with color aesthetic and style options.
2. Figure 14.2 Altair, horizontal alignments of plots and differences from assi...
3. Figure 14.3 Altair, facet visualization.
4. Figure 14.4 (a) Dynamic tooltip (example 1). (b) Dynamic tooltip (example 2)...
5. Figure 14.5 (a) Dynamic legend, year 2005. (b) Dynamic legend, year 2010.
6. Figure 14.6 (a) Dynamic zoom, zoom in. (b) Dynamic zoom, zoom out.
7. Figure 14.7 Mouse hover, contextual change of color.
8. Figure 14.8 Drop-down menu.
9. Figure 14.9 Radio buttons.
10. Figure 14.10 (a) Selection with brush and synchronized table (example 1). (b...
11. Figure 14.11 (a) (Left plot) brush selection; (right plot) synchronized plot...
12. Figure 14.12 (a) Plot as interactive legend, all years selected. (b) Plot as...
13. Figure 14.13 Line plots, mean per capita, total expenditure, and total arriv...
14. Figure 14.14 Line plots with mouse hover, Oceania’s line is highlighted (the...
15. Figure 14.15 (a) Line plot with mouse hover and coordinated visualization of...
16. Figure 14.16 Line plot with mouse hover and coordinated visualization in all...
17. Figure 14.17 (Left): Bar plot with segment for the arithmetic mean.
18. Figure 14.18 (Right): Bar plot with horizontal orientation and annotations....
19. Figure 14.19 Diverging bar plots, pirate attacks, yearly and monthly variati...
20. Figure 14.20 Plot with two distinct y-axes and corresponding scales.
21. Figure 14.21 Stacked bar plot, pirate attacks, and countries where they took...
22. Figure 14.22 Bar plot with sorted bars and annotations.
23. Figure 14.23 (a) Synchronized bar plots, default visualization, without sele...
24. Figure 14.24 Bar plots and tables synchronized with slider, homeless in the ...
25. Figure 14.25 (a) Bar plots and slider, homeless in the US States (year 2022)...
26. Figure 14.26 (a) Bubble plot and slider, homeless in the US States (year 202...
27. Figure 14.27 Heatmap with dynamic tooltip, homelessness in the US States (% ...
28. Figure 14.28 Univariate histogram, 100 bins, homeless in the United States (...
29. Figure 14.29 Bivariate histogram, 20 bins, and scatterplot, homeless in the ...
30. Figure 14.30 Bivariate histogram, 20 bins, and rug plot, homeless in the Uni...
15. Part 3
1. Figure 1 Design for Tandem Cart, 1850–74, Gift of William Brewster, 1923, Th...
16. Chapter 15
1. Figure 15.1 (a) Shiny, test MAT, and country AL (Albania) selected. (b) Shin...
2. Figure 15.2 (a) Table and plot, test READ and country KR (Korea) selected. (...
3. Figure 15.3 (a) A table, two plots, and light theme. (b) A table, two plots,...
4. Figure 15.4 (a) Tab MAT, default theme. (b) Tab READ, dark theme. (c) Google...
17. Chapter 16
1. Figure 16.1 (a) Layout with default configuration with years range 2000–2021...
2. Figure 16.2 Excerpt of XML representation of a web-scraped HTML page.
3. Figure 16.3 Selecting the table element through the Chrome’s Inspect Element...
4. Figure 16.4 First data frame obtained through web scraping from an HTML page...
5. Figure 16.5 Second data frame obtained through web scraping from an HTML pag...
6. Figure 16.6 (a) Expeditions tab, default visualization. (b) Summiteers tab, ...
7. Figure 16.7 Static and interactive Altair graphics in a Shiny dashboard.
18. Chapter 17
1. Figure 17.1 Plotly, scatterplot with default dynamic tooltip.
2. Figure 17.2 Plotly, scatterplot with extended dynamic tooltip.
3. Figure 17.3 Plotly, line plot with tooltip.
4. Figure 17.4 Plotly, scatterplot with a histogram and a rug plot as marginals...
5. Figure 17.5 Plotly, facet visualization.
19. Chapter 18
1. Figure 18.1 Dash dashboard with Plotly graphic.
2. Figure 18.2 (a) Slider with default range. (b) Slider with modified range (2...
3. Figure 18.3 (a) Dash, graphic, slider, and data table with interactive featu...
4. Figure 18.4 (a) Color palette selector and centered, resized data table (exa...
5. Figure 18.5 Sidebar and reactive data table, all country checkbox selected. ...
6. Figure 18.6 (a) Dash dashboard, default appearance. (b) Detail of the scatte...
7. Figure 18.7 (a) First tab with a selection of countries from the drop-down m...
8. Figure 18.8 (a) First tab, data table, reactive graphics, and layout. (b) Se...
20. Chapter 19
1. Figure 19.1 World map from package maps.
2. Figure 19.2 Italy’s border map.
3. Figure 19.3 Provinces of Italy.
4. Figure 19.4 Choropleth map with an incoherent association between data and g...
5. Figure 19.5 Regions of Italy.
6. Figure 19.6 Choropleth map with coherent data and geographical areas.
7. Figure 19.7 Choropleth maps, from left to right: ratio of dogs per resident,...
8. Figure 19.8 Annotated map with dots and city names for Milan, Bologna, and R...
9. Figure 19.9 ggplot image transformed into a Plotly HTML object.
10. Figure 19.10 Maps from Natural Earth, Sweden and Denmark’s borders and regio...
11. Figure 19.11 Railroad and land maps from Natural Earth.
12. Figure 19.12 Land and railroad maps of Western Europe.
13. Figure 19.13 Busiest railway stations and railroad network in Western Europe...
14. Figure 19.14 (a/b) Venice, streets, and canals cartographic layers.
15. Figure 19.15 Venice municipality border map.
16. Figure 19.16 Venice, Municipality area, streets, and canals layers.
17. Figure 19.17 Venice, historical insular part, map with overlaid layers.
18. Figure 19.18 (a/b) Venice, ggmap, Stamen Terrain, and Toner tiled web maps....
19. Figure 19.19 Venice, Leaflet base map from OpenStreetMap. (a) Full view. (b)...
20. Figure 19.20 (a/b/c) Venice, Leaflet tile maps from Stamen, Carto, and ESRI....
21. Figure 19.21 Venice, ggmap, tiled web maps with cartographic layers. (a) Ope...
22. Figure 19.22 Venice, Leaflet with Carto Positron tile map, and cartographic ...
23. Figure 19.23 Venice, Leaflet, civic numbers with dynamic popups associated....
24. Figure 19.24 Venice, Leaflet, pedestrian areas.
25. Figure 19.25 Venice, ggplot, markers with annotations.
26. Figure 19.26 (a) Venice, Leaflet, aggregate circular marker and popup, full ...
27. Figure 19.27 (a/b) Rome, tmap, choropleth maps of neighborhoods and district...
28. Figure 19.28 (a) Rome, tmap, historical villas, plot mode (static). (b) Rome...
29. Figure 19.29 (a) Rome, tmap view mode, city center archaeological map with E...
30. Figure 19.30 Rome, accommodations for topographic area, wrong bubble plot.
31. Figure 19.31 (a) Rome, tmap, full map with bubbles centered on centroids and...
32. Figure 19.32 Rome, tmap, quantiles, and custom legend labels.
33. Figure 19.33 Rome, tmap, standard quantile subdivision, and legend labels.
34. Figure 19.34 Rome region tmap, road map with dynamic popups.
35. Figure 19.35 (a) Rome, tmap, Bed and Breakfasts, full map. (b) Rome, tmap, H...
36. Figure 19.36 (a) Rome, tmap, hotels, full map. (b) Rome, tmap, hotels, zoom ...
21. Chapter 20
1. Figure 20.1 NYC, plotly.express, choropleth map of licensed dogs.
2. Figure 20.2 NYC, plotly.express, most popular dog breed for zip code.
3. Figure 20.3 NYC, plotly.express, most popular dog breed for zip code, OpenSt...
4. Figure 20.4 NYC, plotly go, base map, and dog runs layer.
5. Figure 20.5 NYC, plotly go, overlaid layers, Choropleth map, and dog runs, C...
6. Figure 20.6 NYC, plotly.express and geopandas, dog runs, extended tooltip.
7. Figure 20.7 NYC, plotly go and geopandas, dog runs, extended tooltip.
8. Figure 20.8 NYC, plotly go and geopandas, dog breeds and dog runs with disti...
9. Figure 20.9 (a) NYC, plotly go and geopandas, dog breeds, dog run areas, and...
10. Figure 20.10 NYC, Folium, base map with default tiled web map from OpenStree...
11. Figure 20.11 NYC, Folium, markers, popups, and tooltips, Stamen Terrain tile...
12. Figure 20.12 (a/b) NYC, Folium, marker’s popups with HTML iframe and image (...
13. Figure 20.13 NYC, Folium, base map, and GeoJSON layer with FEMA sea level ri...
14. Figure 20.14 NYC, Folium choropleth map, rodent inspections finding rat acti...
15. Figure 20.15 NYC, Folium and geopandas, rodent inspections finding rat activ...
16. Figure 20.16 NYC, Folium heatmap of rodent inspections with rat activity.
17. Figure 20.17 (a/b) Altair, NYC zip code areas, and boroughs.
18. Figure 20.18 Altair, NYC subway stations with popups.
19. Figure 20.19 Altair, choropleth maps for ethnic groups (from left to right: ...
Data Visualization in R and Python
Marco Cremonini
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John
Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not
be used without written permission. All other trademarks are the property of their
respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor
mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their
best efforts in preparing this book, they make no representations or warranties with respect
to the accuracy or completeness of the contents of this book and specifically disclaim any
implied warranties of merchantability or fitness for a particular purpose. No warranty may
be created or extended by sales representatives or written sales materials. The advice and
strategies contained herein may not be suitable for your situation. You should consult with a
professional where appropriate. Further, readers should be aware that websites listed in this
work may have changed or disappeared between when this work was written and when it is
read. Neither the publisher nor authors shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside
the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears
in print may not be available in electronic formats. For more information about Wiley
products, visit our web site at www.wiley.com.
Marco Cremonini
University of Milan
October 8, 2024
Introduction
When you mention data visualization to a person who doesn’t know
it, perhaps adding that it involves data and the results of data
analysis with figures, sometimes even interactive ones, the reaction
you observe is often that the person in front of you looks intrigued
but doesn’t know exactly what it consists of. After all, if we have a
table with data and we want to produce a graph, isn’t it enough to
open the usual application, go to a certain drop-down menu, choose
the stylized figure of the graph you want to create and click? Is there
so much to say to fill an entire book? At that moment, when you
perceive that the interlocutor is thinking of the well-known
spreadsheet product, you may add that those described in the book
are graphic tools completely different from those of office
automation and, to tell the truth, we don’t even stop at the graphics,
even if interactive, but there are also dashboards, namely the latest
evolution of data visualization, when it is transformed into dynamic
web applications, and to obtain dashboards it is not sufficient to click
on menus but you have to go deeper into the inner logic and
mechanisms. It’s then that the expression of the interlocutor is
generally crossed by a shadow of concern and you can play the ace
up your sleeve by saying that in data visualization there are also
maps, geographical maps, sure, those are made from data too: spatial
data and geographical data, and the maps can be produced with the
many available widgets such as zoom, flags, and colored areas; and
we even go beyond simple maps, because there are also
cartographic maps with layers of cartographic quality, such as maps
of Rome, of Venice, of New York, of the most famous and also
not-so-famous cities and places, possibly with very detailed
geographical information.
At that point the interlocutor has likely lost the references she or he
had from the usual experience with office automation products and
doesn’t really know what this data visualization is, only that there
seems to be a lot to say, enough to fill an entire book. If anyone
recognizes themselves in this imaginary interlocutor (imaginary up to
a certain point, to be honest), know that you are in good company.
Good in a literal, not figurative, sense, because data visualization is a
little like the Cinderella of data science: many admire it from a
certain distance, it arrives last in a project, and sometimes it does not
receive the attention it deserves. Yet there are many who, given the
right opportunity to study and practice it, sense that it could be
interesting and enjoyable, and that it could certainly prove useful and
applicable in an infinite number of areas and situations. This
is due to a property that data visualization has and that is absent
in traditional data analysis or code development: it stimulates visual
creativity together with logic. Even statisticians and programmers
use creativity (those who deny it have never really practiced one of
those disciplines), but that is logical creativity. With data visualization,
another dimension of data science, otherwise neglected, comes
into play: the visual language combined with computational logic, the
data represented in an expressive form that is no longer just
logical and formal but also perceptive and sensorial, made of
shapes, colors, and uses and projections of space, and always
accompanied by the meaning that the originator wishes to convey
and that the observers will interpret, often subjectively.
conveys different knowledge and logic for an expressive form that
always has a double soul: computational for the data that feeds it,
visual and sometimes interactive for the language it uses to
communicate with the observer. Data visualization always has a
double nature: it is a key part of data science for its methods,
techniques, and tools, and it is storytelling; whoever produces visual
representations from data tells a story that may take different
guises and may produce different reactions. There is enough to fill
not just a single book.
The text is divided into four parts already mentioned in the previous
introduction. The first part presents the fundamentals of data
visualization with Python and R, the two reference languages and
environments for data science, employed to create static graphs as a
direct result of a previous data wrangling (import, transformation)
and analysis activity. The reference libraries for this first part are
Seaborn for Python and ggplot2 for R. They are both modern,
constantly evolving open-source graphics libraries, each produced by
core developers with the contributions of very large and lively
communities engaged in continuous innovation. Seaborn is the more
recent of the two and partly represents an evolved interface to
Python’s traditional matplotlib graphics library, made more functional
and enriched with features and graph types popular in modern data
visualization. Ggplot2 is the traditional graphics library for R,
unanimously recognized as one of the best ever, in both the
open-source and the proprietary world. Ggplot is full of high-level
features and constantly evolving; it receives contributions from
researchers and developers in various scientific and application
fields. A simply unavoidable tool for anyone approaching data
visualization. The two have different designs: Seaborn is the more
traditional, with a collection of functions and options for the different
types of charts supported, whereas ggplot is organized by overlaying
graphical layers, following an approach that goes by the name of
grammar of graphics, shared by some of the most widespread digital
graphics tools and suitable for developing even unconventional types
of graphics, thanks to the extreme flexibility it allows. This first part
covers about a third of the work.
What I’m trying to say is that data visualization, like data science as
a whole, is not a sectoral discipline requiring a specific background,
such as that of a statistician, computer scientist, engineer, or graphic
designer. That is not necessary at all; in fact, the opposite is needed,
namely that data visualization and data science be as transversal as
possible, studied and used by all those who, for their education and
work interests, in their specific field, from economics to paleontology,
from psychology to molecular biology, find themselves working with
data, whether numerical, textual, or spatial, and find it useful to
obtain high-quality visual representations from those data, perhaps
interactive or structured in dashboards.
Between these two parts and the subsequent third and fourth parts,
there is a gap in terms of what is required and what is learned; for
this reason, in the initial introductory part, the last two parts were
presented as advanced content. It is necessary to have acquired a
good familiarity with the fundamentals, confidence in searching for
information in the documentation of libraries, and the ability to
patiently and methodically manage errors. In other words, you need
to have done a good number of exercises with the fundamental part.
What is Excluded
https://www.wiley.com/go/Cremonini/DataVisualization1e
Codes
Figures
Datasets
Part I
Static Graphics with ggplot (R)
and Seaborn (Python)
Grammar of Graphics
References
Dataset
1.1 R: ggplot
1.1.1 Scatterplot
Let us start with the just-mentioned relation between height and
weight in a sample of people. For this, we can use the dataset
heights, predefined in package modelr, which is part of the
tidyverse collection. For simplicity, we always assume the tidyverse
package is loaded in all R examples. The dataset refers to a sample
of US citizens collected in a 2012 study of the U.S. Bureau of Labor
Statistics. Values are converted into centimeters and kilograms;
readers who prefer the Imperial system can simply omit the two
transformations with the conversion coefficients shown in the code.
library(tidyverse)
df= modelr::heights
df$height_cm= 2.54*df$height
df$weight_kg= 0.45359237*df$weight
df
# A tibble: 7,006 × 10
income height weight age marital sex height_
<int> <dbl> <int> <int> <fct> <fct> <dbl>
1 19000 60 155 53 married female 152.
2 35000 70 156 51 married female 178.
3 105000 65 195 52 married male 165.
4 40000 63 197 54 married female 160.
5 75000 66 190 49 married male 168.
# … with 7,001 more rows
The dataset contains data of 7006 individuals, with information
regarding sex, income, and marital status, in addition to height and
weight.
Now we see the difference between men and women, with men, not
surprisingly, typically taller than women. However, as regards the
causal relation between height and weight, the increasing trend is
less evident if men and women are considered separately, in
particular for women, who apparently exhibit a larger variability, at
least in this sample.
Figure 1.3 Scatterplot with color aesthetic.
Figure 1.5 Scatterplot with income as dependent variable and color aesthetic for sex
variable.
What if we would like to introduce a fourth variable, for example, the
marital status in addition to height, weight, and sex? We have to use
another aesthetic in addition to x, y, and color, for example, the
shape of the markers. We have two possibilities: associate markers’
shape to the marital status ( color=sex, shape=marital ) or the
sex to the shape ( color=marital, shape=sex ). We try both ways
and use package patchwork (https://patchwork.data-imaginist.com/)
to plot the two graphics side by side ( plot1 + plot2 or plot1 |
plot2 ). To have them stacked one over the other, the syntax would
be plot1 / plot2 . Figure 1.6 shows the two alternatives.
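As a sketch, the two alternative plots might be built as follows (the object names and the exact styling are assumptions, not the book’s verbatim code):

plot1= ggplot(df, aes(x= height_cm, y= weight_kg)) +
  geom_point(aes(color= sex, shape= marital))    # sex as color, marital as shape
plot2= ggplot(df, aes(x= height_cm, y= weight_kg)) +
  geom_point(aes(color= marital, shape= sex))    # roles swapped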
library(patchwork)
plot1 / plot2
The result is almost unreadable in both ways. This simply shows that
just adding more aesthetics does not guarantee a better result that
is readable and informative; instead, it easily ends up in a confused
visual representation. These simple initial examples have touched on
some important aspects worth recapitulating.
library(WDI)
# download the World Bank annual inflation indicator (consumer prices, %)
infl = WDI(indicator='FP.CPI.TOTL.ZG')
infl= as_tibble(infl)
# keep the United States only
us_infl= filter(infl, iso2c=='US')
# A tibble: 62 × 5
country iso2c iso3c year FP.CPI.TOTL.ZG
<chr> <chr> <chr> <int> <dbl>
United States US USA 2022 8.00
United States US USA 2021 4.70
United States US USA 2020 1.23
United States US USA 2019 1.81
United States US USA 2018 2.44
United States US USA 2017 2.13
United States US USA 2016 1.26
United States US USA 2015 0.12
United States US USA 2014 1.62
United States US USA 2013 1.46
United States US USA 2012 2.07
United States US USA 2011 3.16
United States US USA 2010 1.64
# …
The time series goes from 1960 to 2022. In this case, the scatterplot
could be produced by associating years to inflation values. We use
the pipe notation and add some style options: a specific marker
( shape ) with a custom line width and internal color ( stroke and
fill ), the marker size ( size ), a certain degree of transparency
( alpha ), custom labels for aesthetics ( labs() ) – either associated
to axes, the legend, or as plot title/subtitle – and a graphic theme
( theme() ). Figure 1.7 shows the result.
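The full code is not reproduced in this excerpt; a minimal sketch using the options just listed (the specific shape, colors, and sizes are assumptions):

us_infl %>%
  ggplot(aes(x= year, y= FP.CPI.TOTL.ZG)) +
  geom_point(shape= 21, stroke= 1, fill= "skyblue3",   # marker style
             size= 3, alpha= 0.7) +                    # size and transparency
  labs(x= "Year", y= "Inflation (%)",
       title= "United States", subtitle= "Inflation 1960-2022") +
  theme_light()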
We can draw again the scatterplot, in this case, without the many
stylistic options but with color as an aesthetic associated to countries
and a color palette from Viridis (Figure 1.8).
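The subset sample_infl is not defined in this excerpt; a plausible definition and the Viridis-colored scatterplot (the choice of countries is an assumption):

sample_infl= filter(infl, iso2c %in% c('US','DE','JP','BR'))  # hypothetical sample

sample_infl %>%
  ggplot(aes(x= year, y= FP.CPI.TOTL.ZG, color= country)) +
  geom_point(size= 2) +
  scale_color_viridis_d() +   # discrete Viridis palette
  labs(x= "Year", y= "Inflation (%)", color= "Country") +
  theme_light()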
Figure 1.7 United States’ inflation values 1960–2022.
color_list= c("black","forestgreen","skyblue3","gold")

sample_infl %>%
  mutate(color = ifelse(year>=2000,
                        as.character(sample_infl$country),
                        NA_character_)) %>%
  ggplot(aes(x= year, y= FP.CPI.TOTL.ZG)) +
  geom_point(aes(color=color), size= 2) +
  scale_color_manual(breaks = unique(sample_infl$country),
                     values = color_list,
                     na.value= "lightgray") +  # color for pre-2000 points (exact value assumed)
  labs(x= "Year", y= 'Inflation (%)', color= "Country") +
  theme_light()
In this example, we define two thresholds for the inflation value and
color the points differently. We also add two horizontal segments
(using function geom_hline() ) to visually represent the thresholds.
Function scale_color_manual() allows assigning colors manually
to the color aesthetic. There exist several variants of scale functions,
the main ones are scale_color_* and scale_fill_* (the star
symbol indicating that several specific functions are available),
respectively, for configuring the aesthetic color or the aesthetic
fill . Moreover, scale functions are also important to configure
axes values and labels. We will use them in other examples. In
addition, we introduce an often very useful package called ggrepel,
which is the best solution when textual annotations should be added
to markers, to show a corresponding value. The problem with textual
annotations in scatterplots is that they easily end up overlapping in a
clutter of labels only partially readable. Package ggrepel
automatically separates them or, at least, makes its best effort to
produce a comprehensible visualization. It has obvious limits: when
markers are too many and too close, there is nothing even ggrepel
can do to place all labels in a suitable way; but if markers are few,
which is the correct situation for showing textual labels, the result is
usually good. Here we use it to add textual labels only for years with
very high inflation (greater than 5%).
For this example, the logic is the following: with function cut() , we
can define three ranges of inflation values, i.e. from −2 to 2, from 2
to 5, and from 5 to infinite; variable val is defined as a list with
key=value pairs as elements, where keys are the values resulting
from function cut() and values are color codes; variable lab has
the different texts to visualize as legend keys.
library(ggrepel)
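The rest of the code is not included in this excerpt; a sketch implementing the logic just described (the threshold values, colors, and legend texts are assumptions):

# named colors keyed on the labels produced by cut()
val= c("(-2,2]"= "forestgreen", "(2,5]"= "gold", "(5,Inf]"= "firebrick")
lab= c("low", "moderate", "high (>5%)")

us_infl %>%
  mutate(range= cut(FP.CPI.TOTL.ZG, breaks= c(-2, 2, 5, Inf))) %>%
  ggplot(aes(x= year, y= FP.CPI.TOTL.ZG)) +
  geom_hline(yintercept= c(2, 5), linetype= "dashed", color= "gray60") +
  geom_point(aes(color= range), size= 2) +
  geom_text_repel(aes(label= ifelse(FP.CPI.TOTL.ZG > 5,
                                    as.character(year), "")), size= 3) +
  scale_color_manual(values= val, labels= lab, name= "Inflation") +
  labs(x= "Year", y= "Inflation (%)") +
  theme_light()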
The time series provided with this set of data in several cases cover
many decades; for this example, years from 2010 to 2022 have been
selected. Some data-wrangling operations are needed to prepare the
data frame: first, because each data series refers to a single
measurement station, and there could be more than one station for
each city; and second, because the series are recorded as separate
CSV (comma-separated values) datasets. We have chosen data
collected from airport measurement stations and, after reading each
dataset, a column specifying the city has been added; then the
separate data frames have been combined into a single one. The
resulting data frame has been transformed into long form to have
both minimum and maximum temperatures in a single column.
library(vroom)  # fast CSV reader
c1= vroom('datasets/CarnegieMU/7890488/USW00014839.csv')
c2= vroom('datasets/CarnegieMU/7890488/USW00023044.csv')
c3= vroom('datasets/CarnegieMU/7890488/USW00094728.csv')
c4= vroom('datasets/CarnegieMU/7890488/USW00023183.csv')
c5= vroom('datasets/CarnegieMU/7890488/USW00013874.csv')
c6= vroom('datasets/CarnegieMU/7890488/USW00094012.csv')
c1$city= 'Milwaukee'
c2$city= 'El Paso'
c3$city= 'New York'
c4$city= 'Phoenix'
c5$city= 'Atlanta'
c6$city= "Havre (MT)"
cities= bind_rows(c1,c2,c3,c4,c5,c6)
Years from 2010 to 2022 are selected, then the graphic has been
produced. Ticks on axes x and y have been customized according to
dates and temperatures; axes and legend values also have been
minimally tweaked (functions scale_x_date() and
scale_y_continuous() for axes’ ticks, functions theme() and
guides() for axes and legend values). The color palette is set with
scale_color_wsj() that imitates the typical color scale of The
Wall Street Journal.
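A sketch of such a plot, assuming the combined data frame cities has columns date, tmin, and city (the actual column names depend on the CSV headers):

library(ggthemes)   # provides scale_color_wsj()
library(lubridate)  # for year()

cities %>%
  filter(year(date) >= 2010) %>%
  ggplot(aes(x= date, y= tmin, color= city)) +
  geom_point(size= 0.5, alpha= 0.5) +
  scale_x_date(date_breaks= "2 years", date_labels= "%Y") +  # custom x ticks
  scale_y_continuous(n.breaks= 8) +                          # custom y ticks
  scale_color_wsj() +
  labs(x= "", y= "Min. temperature", color= "City") +
  guides(color= guide_legend(override.aes= list(size= 3))) + # bigger legend keys
  theme_light()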
Figure 1.11 shows the result for minimum temperatures. The shape
of the multitude of scatterplot markers provides intuitive information
about the seasonal temperature variation, which is qualitatively
similar for all cities. The color aesthetic, set with city names, offers
specific information about cities, although not completely clearly,
due to overlapping markers. The hottest city, i.e. Phoenix, and the
coldest, i.e. Havre, are fairly well recognizable in their most extreme
temperatures, but details are muddled for temperatures in the
middle range. We will see in a future chapter how to approach a
case like this to produce a clearer visualization; for now, it is
important to learn that scatterplots are extremely flexible and
adaptable to many cases, and creativity could and should be
exercised.
The line plot is a scatterplot variant that connects with a line the
data points belonging to the same group, meaning that they share
the same value of a certain variable (e.g., they refer to the same
city). The same data points may or may not also be visualized with
markers. Let us consider a first example, which will result in an
incoherent graphic, but will be useful to understand the main
characteristic of line plots, which is the definition of homogeneous
groups of data points. We use the previous example with countries
and inflation values and add a new layer representing the line plot
with function geom_line() . Figure 1.12 shows the result, which is
problematic.
The result is clearly incoherent because the line just connects data
points in sequential order, which has no meaning at all. What we
would have wanted, instead, was to connect data points belonging
to the same country, this way resulting in different lines, one for
each country. We should use attribute group , which represents a
new aesthetic associated to the data frame variable with country
names. With attribute group , we specify the variable whose unique
values define the homogeneous group of points to connect with a
line. In the example, we set group=country , meaning that points
should be logically grouped for same country, and points belonging
to the same country should be connected with a line. Figure 1.13
shows the correct line plot.
Figure 1.12 A problematic line plot, groups are not respected.
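A minimal sketch of the corrected plot of Figure 1.13, assuming sample_infl from earlier:

sample_infl %>%
  ggplot(aes(x= year, y= FP.CPI.TOTL.ZG,
             group= country, color= country)) +  # group defines which points to connect
  geom_point(size= 1) +
  geom_line() +
  labs(x= "Year", y= "Inflation (%)", color= "Country")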
The readability is still poor, but now the line plot is coherent having
one line for each country. We could improve it by removing the
scatterplot markers, using linetype as an aesthetic in addition to
color so that lines are different for the different countries, and by
tuning other style options such as line color and line width. The
result has a better look and is more readable (Figure 1.14).
color_list= c("gold","skyblue3","forestgreen","black")
TIP
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
gdp = pd.read_csv('datasets/world_bank/API_NY.GDP.MKTP.KD.ZG_DS2_en_csv_v2_5358346.csv',
                  skiprows=4)
   Country Name  Country Code  1960  1961  …  2019  …
…  …             …             …     …     …  …     …
1.2.1 Scatterplot
plt.figure(figsize = (8,5))
plt.rcParams.update({'font.size': 16})
sns.set(style='whitegrid', font_scale=0.9)
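The scatterplot call itself is not shown in this excerpt; a minimal sketch, assuming sample_gdp is a long-form subset with columns Year, GDP, and Country Name (as used later in this section):

g = sns.scatterplot(data=sample_gdp, x="Year", y="GDP",
                    hue="Country Name", s=25, palette="viridis")
plt.xlabel("Year")
plt.ylabel("GDP (%)")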
To replicate the example seen with ggplot, coloring data points
based on a threshold value, Seaborn does not offer many options
other than creating two distinct subsets of data points and drawing
two overlapped scatterplots. In this case, we use point size and
transparency to differentiate data points above or below the
threshold. Only one legend is shown, as the second would be a
duplication (Figure 1.17).
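A sketch of this two-subset technique (the threshold at year 2000 and the style values are assumptions; Year is assumed numeric):

below = sample_gdp[sample_gdp["Year"] < 2000]   # de-emphasized subset
over  = sample_gdp[sample_gdp["Year"] >= 2000]  # highlighted subset

sns.scatterplot(data=below, x="Year", y="GDP", color="lightgray",
                s=15, alpha=0.4, legend=False)
sns.scatterplot(data=over, x="Year", y="GDP", hue="Country Name",
                s=40, alpha=0.9)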
# Milwaukee
c1=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
# El Paso
c2=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
# New York
c3=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
# Phoenix
c4=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
# Atlanta
c5=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
# Havre, Montana
c6=pd.read_csv('datasets/Carnegie_Mellon_Univ/7890488
c1["city"] = "Milwaukee"
c2["city"] = "El Paso"
c3["city"] = "New York"
c4["city"] = "Phoenix"
c5["city"] = "Atlanta"
c6["city"] = "Havre (MN)"
The line plot, as we already know, follows the same logic as the
scatterplot, with the additional requirement that groups of points
should be correctly managed. Seaborn automatically manages
homogeneous groups of data points, and just a few attributes need
to be adjusted with respect to the scatterplot; for example,
linewidth is used to change the line width rather than s for
marker size (Figure 1.19).
sns.set(style='white')
sns.lineplot(data=sample_gdp, x="Year", y="GDP",
hue='Country Name', linewidth=1,
palette= 'viridis')
plt.legend(loc='lower center')
plt.xlabel("")
plt.ylabel('GDP (%)')
Dataset
Air Quality Report year 2021 (transl. Report qualità aria 2021), Open
Data Municipality of Milan, Italy
(https://dati.comune.milano.it/dataset/ds413-rilevazione-qualita-
aria-2021).
2.1 R: ggplot
A bar plot (or bar chart) is the reference type of graphic when
categorical variables are handled: each category has a value
associated, and a bar is drawn to represent it. Values could depend
on another variable, for example, a statistic, or could represent the
number of observations that fall in each category. Let us consider a
first example using data about the air quality of the city of Milan,
Italy, which is a heavily polluted city. It is a time series where, for
each day of the period, quantities of some pollutants are measured.
The variable pollutant is categorical, and we want to graphically
represent the variations of pollutant levels during the time period.
Column names have been translated into English.
df=read_csv2("datasets/Milan_municipality/
qaria_datoariagiornostazione_2021.csv")
df=rename(df, c(station_id=stazione_id, date=data,
pollutant=inquinante, value=valore))
head(df)
# A tibble: 6 × 4
station_id date pollutant value
<dbl> <date> <chr> <chr>
1 1 2021-12-31 NO2 <NA>
2 2 2021-12-31 C6H6 2
3 2 2021-12-31 NO2 54
4 2 2021-12-31 O3 2
5 2 2021-12-31 PM10 50
6 2 2021-12-31 PM25 32
df$value=as.numeric(df$value)
df%>%filter(!is.na(value)) -> df1
With the first bar plot, we want to show, for each pollutant, the total
value over the whole period; an aggregation operation is needed.
df1%>%group_by(pollutant) %>%
summarize(total=sum(value)) -> df1_grp
# A tibble: 7 × 2
pollutant total
<chr> <dbl>
1 C6H6 774.
2 CO_8h 644.
3 NO2 75839
4 O3 34720
5 PM10 26993
6 PM25 9267
7 SO2 1029
With this aggregated data frame, the bar plot can be created,
adding a few style options, like a color palette. Color Brewer
(https://r-graph-gallery.com/38-rcolorbrewers-palettes.html)
provides a number of predefined palettes for R and is a common
choice, although not a very original one.
TIP
The ggplot function for bar plots is geom_bar() and a key attribute
is stat (statistic). By default, the stat attribute has value count,
meaning that the bar plot requires a single categorical variable as
the independent one (x-axis), and values of the y-axis are calculated
as the number of observations falling in each category. In our case
study, it would count the number of measurements for each
pollutant. When, instead, a bar plot with two variables is needed,
one for the categorical values and the second for values associated
to each category (in our example, the total quantity of pollutants
during the period), the attribute stat should be explicitly set to
value identity ( stat='identity' ). Another important attribute is
position , which controls the visualization of groups of bars: within
each group, bars can be placed side by side ( position='dodge' )
or one on top of the other ( position='stack' ), the latter being
the default. The next example shows a simple bar plot with two
variables, pollutant names on the x-axis and their quantities on the
y-axis, therefore stat='identity' is specified. Figure 2.1 shows
the result.
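A sketch of such a bar plot, using the aggregated data frame df1_grp (the Brewer palette name is an assumption):

df1_grp %>%
  ggplot(aes(x=pollutant, y=total)) +
  geom_bar(aes(fill=pollutant), stat="identity",
           show.legend= FALSE) +          # legend redundant: bars already labeled on x
  scale_fill_brewer(palette= "Set2") +
  labs(x="Pollutant", y="Quantity") +
  theme_light()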
library(ggthemes)
Figure 2.2 Bar plot with custom color palette, horizontal bar orientation, and ordered bars.
cols=c("C6H6"="#bda50b", "CO_8h"="#a1034a",
"NO2"="#295eb3", "O3"="#94770f", "PM10"="#471870",
"PM25"="#94420f", "SO2"="#356604")
df1_grp %>%
ggplot(aes(x=reorder(pollutant, total), y=total)) +
geom_bar(aes(fill=pollutant), stat="identity",
alpha=0.8, show.legend = FALSE) +
scale_fill_manual(values = cols)+
labs(x="Pollutant", y="Quantity")+
coord_flip()+
theme_light()
Figure 2.3 Bar plot with ranges of values for PM10 derived from a continuous variable.
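The construction of df1_PM10 is not shown in this excerpt; a plausible derivation with cut() (the bin edges and labels are assumptions):

df1 %>%
  filter(pollutant == "PM10") %>%
  mutate(range= cut(value,
                    breaks= c(0, 20, 30, 40, 50, 60, 80, Inf),
                    labels= c('<20','20-30','30-40','40-50',
                              '50-60','60-80','>80'))) -> df1_PM10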
df1_PM10 %>%
ggplot(aes(x=range)) +
geom_bar(aes(fill=range), show.legend = FALSE) +
scale_fill_tableau(palette = "Miller Stone")+
labs(x="Value ranges: PM10", y="Number of days")+
theme_minimal()
We replicate for Seaborn the examples seen with ggplot. First, data
should be prepared for plotting.
df=pd.read_csv("datasets/Milan_municipality/qaria_datoariagiornostazione_2021.csv")
df.columns=['station_id', 'date', 'pollutant', 'value']
df["date"]=pd.to_datetime(df["date"], format="%Y-%m-%d")
df=df[~df.isna().any(axis=1)]
df_grp=df.groupby(["pollutant"])[["value"]].sum()
df_grp.reset_index(inplace=True)
Now that we have the total quantity for each pollutant, we can start
with a simple bar plot using function sns.barplot() , to which we
add a few options: attribute order to order bars, which has a
peculiar syntax following this general template:
order=df.sort_values("variable_y", ascending=False).variable_x ,
meaning that variable_x is the variable whose bars should be
ordered, and variable_y the variable whose values define the
ordering criterion, ascending or descending. As a last option, we
rotate the labels on the ticks of axis x by 45° to improve readability
and set the axes labels (Figure 2.4).
Figure 2.4 Bar plot with ordered bars and x ticks rotated.
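The barplot call for this figure might look like the following (the palette is an assumption):

g=sns.barplot(data=df_grp, x="pollutant", y="value",
              order=df_grp.sort_values("value",
                                       ascending=False).pollutant,
              palette="magma")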
plt.xticks(rotation=45)
plt.xlabel("Pollutant")
plt.ylabel('Quantity')
plt.tight_layout()
plt.xticks(rotation=30)
plt.xlabel("Month")
plt.ylabel('Quantity')
plt.tight_layout()
The bar plot is correct although the style could be improved. For
example, we could use month names and move the legend outside
the plot. First, column month should be transformed into datetime
type. Then, we can use method dt.month_name() to obtain month
names. For the legend, to move it outside the plot, the specific
function sns.move_legend() has attribute bbox_to_anchor ;
style options in common with the previous graphic have been
omitted (Figure 2.6).
TIP
df_grp2["month"]=pd.to_datetime(df_grp2["month"], for
g=sns.barplot(df_grp2, x=df_grp2["month"].dt.month_na
y="value", hue="pollutant", palette='ma
plt.xticks(rotation=30)
plt.xlabel("")
plt.ylabel('Quantity')
sns.move_legend(g, "upper left", bbox_to_anchor=(1, 1
Let us consider a variant by using a color palette
( sns.color_palette() ) and with stacked bars rather than
dodged; for this, attribute dodge must be set to False
( dodge=False ). Figure 2.7 shows the result; style options in
common with previous plots have been omitted.
pal=sns.color_palette("magma")
g=sns.barplot(data=df_grp2,
              x=df_grp2["month"].dt.month_name(), y="value",
              hue="pollutant", dodge=False, palette=pal)
df_NO2= df[df.pollutant=="NO2"]
df_NO2['range'] = pd.cut(x=df_NO2['value'],
    bins=[0, 30, 40, 50, 60, 70, 80, 100, 120, 140, 200],
    labels=['<30','30-40','40-50','50-60','60-70','70-80',
            '80-100','100-120','120-140','>140'])

station_id  date  pollutant  value  range
…           …     …          …      …
Figure 2.8 Bar plot with ranges of values derived from a continuous variable.
2.2.3 Visualizing Subplots
df_NO2= df[df.pollutant=="NO2"]
df_NO2['es1'] = pd.qcut(df_NO2['value'], q=4)
df_NO2['es2'] = pd.qcut(x=df_NO2['value'],
q=[0, .25, .5, .75, 1])
Figure 2.9 Bar plots with quantile representation, subplots, and style options.
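The subplot setup is missing from this excerpt; a sketch that draws the two quantile columns side by side (the use of countplot here is an assumption):

f, ax = plt.subplots(1, 2, figsize=(8, 4))
sns.countplot(data=df_NO2, x='es1', ax=ax[0])  # quartiles from q=4
sns.countplot(data=df_NO2, x='es2', ax=ax[1])  # quartiles from explicit quantile list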
ax[0].set(title="ES 1: q=4")
ax[1].set(title="ES 2: q=[0, .25, .5, .75, 1]")
ax[0].xaxis.set_tick_params(labelsize=7)
ax[1].xaxis.set_tick_params(labelsize=7)
f.tight_layout()
TIP
Dataset
3.1 R: ggplot
3.1.1 Case 1: Temperature
WARNING
# A tibble: 4,416 × 5
station_id date pollutant value month
<dbl> <date> <chr> <dbl> <dbl>
1 2 2021-12-31 C6H6 2 12
2 2 2021-12-31 NO2 54 12
3 2 2021-12-31 O3 2 12
4 2 2021-12-31 PM10 50 12
5 2 2021-12-31 PM25 32 12
# … with 4,411 more rows
df2%>%group_by(month, pollutant)%>%
summarize(total=sum(value)) -> df2_grp
# A tibble: 84 × 3
# Groups: month [12]
month pollutant total
<dbl> <chr> <dbl>
1 1 C6H6 100
2 1 CO_8h 74
3 1 NO2 7106
4 1 O3 1119
5 1 PM10 2493
6 1 PM25 910
7 1 SO2 95.5
# … with 77 more rows
Now we can produce bar plots with facets and some style options.
We specify month names by replacing month numbers with names.
For this, we use function scale_x_discrete() . There exist similar
functions for the y-axis and for continuous values (i.e.,
scale_y_discrete() , scale_x_continuous() ,
scale_y_continuous() ). In this case, showing the legend would be
redundant; we omit it with option show.legend=FALSE . Figure 3.2
shows the result.
…
facet_wrap(vars(pollutant), ncol= 3, scales= "free_y") +
…
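For context, the full plotting call around this excerpt might resemble the following (the month.abb labels and the theme are assumptions):

df2_grp %>%
  ggplot(aes(x= factor(month), y= total)) +
  geom_bar(aes(fill= pollutant), stat= "identity",
           show.legend= FALSE) +
  scale_x_discrete(labels= month.abb) +   # month numbers replaced by names
  facet_wrap(vars(pollutant), ncol= 3, scales= "free_y") +
  labs(x= "", y= "Quantity") +
  theme_light()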
# A tibble: 239 × 4
# Groups: station_id, month [60]
station_id month pollutant total
<dbl> <dbl> <chr> <dbl>
1 2 1 C6H6 25.5
2 2 1 NO2 1410
3 2 1 O3 614
4 2 1 PM10 657
5 2 1 PM25 508
6 2 1 SO2 95.5
# … with 233 more rows
Figure 3.3 Facet visualization with independent scale on y-axis.
We can use month, total, and pollutant variables for bar plots, and
station_id for facets. The style is customized with custom colors. The
result shown in Figure 3.4 looks aesthetically pleasant and
informative, with no risk of ambiguity, unlike the previous case.
Figure 3.4 Facet visualization with bar plots, facets are all well-readable and balanced.
For line plots, the only differences with respect to scatterplots are
kind="line" and linewidth to set the line width. The following
code produces the line plot corresponding to the previous scatterplot.
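A sketch of that line plot, assuming the cities data frame from Chapter 1 with columns date, tmax, and city (names assumed):

g = sns.relplot(data=cities, x="date", y="tmax", hue="city",
                col="city", col_wrap=3, height=2.5,
                kind="line", linewidth=0.7)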
Figure 3.5 Temperature measurement in some US cities, maximum temperatures, facet
visualization.
We replicate the example seen before with data about the air quality
and pollutants in Milan. A few common data-wrangling operations
are needed to prepare the data frame.
    month      pollutant  total
3   April      O3         3405.0
⋯   ⋯          ⋯          ⋯
80  September  O3         4686.0
81  September  PM10       2027.0
Let us first use months as the facet variable. The result of Figure 3.6
is correct overall, with the exception of the scale on axis y that is
suitable for certain pollutants only (e.g., bars for C6H6, CO_8h, and
SO2 are always practically invisible).
sns.set_theme(style="white",font_scale=0.9)
g.set_axis_labels("Pollutants",'Quantity')
g.tick_params(axis='x', rotation=45)
Let us see a variant that replicates the example seen with ggplot. In
this case, we want to have pollutants as facets, months on the
x-axis, and bars colored for each pollutant using attribute hue .
Figure 3.7 shows the detail of just one facet for clarity (i.e., for
pollutant NO2); the other ones are similar. The result is not visually
correct in this case because Seaborn plots the bars as if they were
grouped side by side, which is the reason why they appear so thin
and difficult to recognize. The month order is also incorrect when
names are used.
Figure 3.6 Facets and bar plot visualization.
Figure 3.7 Incorrect facet visualization (single facet detail).
sns.set_theme(style="white",font_scale=0.7)
g.tick_params(axis='x', rotation=90)
g.tight_layout()
g=sns.FacetGrid(<general elements>)
g.map(<specific graphic type and attributes>)
We can reproduce the previous example to obtain a correct
visualization. We also fix the wrong month name order by defining a
list with month names correctly ordered; with it, we configure
attribute order of function map() . Figure 3.8 shows the facet
visualization.
Figure 3.8 Facet visualization with the general method, unbalanced facets.
list=["January", "February", "March", "April", "May",
"June", "July", "August", "September", "October
"November", "December"]
g.tick_params(axis='x', rotation=90)
g.set_axis_labels("",'Quantity')
g.tight_layout()
Technically, the graphic is now correct. Still, the facets are not
homogeneous, due to the different scales of the pollutants. We can
correct it, similarly to what we did with ggplot, by making scales on
the y-axis independent. To do that, function FacetGrid() has
attributes sharex and sharey , which, if True , use a shared scale
for all facets on axis x or axis y, respectively, and independent
scales if False . In our case, we want independent scales on axis y
( sharey=False ) and a common scale on axis x ( sharex=True ).
In Figure 3.9 the modified facet visualization is shown.
Figure 3.9 Facet visualization with the general method, independent scales.
…
g = sns.FacetGrid(df_grp1, col='pollutant', hue='pollutant',
                  col_wrap=3, height=2, sharex=True, sharey=False)
…
df_grp2.reset_index(inplace=True)
df_grp2= df_grp2.rename(columns={"date":"month", "value":"total"})

    station_id  month  pollutant  total
2   2           April  O3         1796.0
⋯   ⋯           ⋯      ⋯          ⋯

g2.set_axis_labels("Quantity", "")
g2.tight_layout()
4
Histograms and Kernel Density Plots
A histogram is a traditional type of graphic based on a continuous
variable. For the values of this variable, it defines a certain number
of ranges called bins and counts the number of observations falling
in each bin. Visually, it is schematic and typically aesthetically
simple, but it may provide useful information about data. For this
reason, it is often used as an analysis tool, not just in presentations,
in order to study general characteristics of data, such as anomalous
distributions. It is important to remember that histograms are most
useful when several combinations of bin width or bin number are
tested.
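As a minimal illustration of the two parameterizations (the data frame and variable names here are hypothetical):

# number of bins fixed; the width is derived from the data
ggplot(df, aes(x= value)) + geom_histogram(bins= 30)
# bin width fixed; the number of bins is derived from the data
ggplot(df, aes(x= value)) + geom_histogram(binwidth= 10)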
Dataset
4.1 R: ggplot
# Bin width: 5
TIP
For these types of graphics, a good choice of colors and style options
is important, since the aesthetic impact can be very effective.
yearsNY=c("1870","1920","1970","2000","2010","2021")
yearsNY=c("1940","1970", "2000","2021")
Finally, for curious readers, we also show the results with minimum
temperatures, still in New York and for the same years as the
previous plot (Figure 4.9).
Figure 4.7 Kernel density for bivariate analysis with isodensity curves.
Figure 4.8 Kernel density for bivariate analysis with color gradient, NYC maximum
temperatures.
Figure 4.9 Kernel density for bivariate analysis with color gradient, NYC minimum
temperatures.
Data for this section are from the Open Data of Bologna Municipality,
Italy; they contain the list of Bed and Breakfasts (BnB) present in
town.
df=pd.read_csv("datasets/comune_bologna/bologna-rilev
sep=";")
    id      neigh.  price  number_of_reviews
0   209692  Navile  32     22
1   229114  Navile  80     49
…   …       …       …      …
plt.xlabel("Number of Reviews")
plt.ylabel("Number of BnB (count)")
plt.title("binwidth=20")
plt.tight_layout()
Figure 4.10 Histogram for univariate analysis, bin width equals 20.
g=sns.histplot(data=df, x="number_of_reviews",
binwidth=40, fill=False, kde=True)
We can try with a third variable for neighborhoods and a stacked
layout (attribute multiple='stack' ); we also omit the most expensive
BnBs to limit the price range. Unfortunately, the result shown in
Figure 4.12 is not clear because bars for BnBs with a high number of
reviews are almost invisible. We will improve it later in the chapter.
pal=sns.color_palette("cubehelix")
g=sns.histplot(data=df[df.price<750],
x="number_of_reviews", hue='neighbourh
bins=20, multiple="stack", palette=pal
plt.xlabel("Number of Reviews")
plt.ylabel("Number of BnB (count)")
plt.title("bins=20")
plt.tight_layout()
Figure 4.11 Histogram for univariate analysis and kernel density, bin width equals 40.
g=sns.histplot(data=df[df.price<750],
x="number_of_reviews", y='price',
bins=50, discrete=(False, False),
cbar=True, cbar_kws=dict(shrink=.75))
plt.xlabel("Number of Reviews")
plt.ylabel("Price")
plt.title("bins=50")
Figure 4.12 Histogram for univariate analysis with stacked bars.
pal=sns.color_palette("crest")
g=sns.displot(data=df[df.price<750],
x="number_of_reviews",
y='price', height=2.3,
kind='hist', col='neighbourhood',
hue='neighbourhood', col_wrap=3,
bins=10, discrete=(False, False), palet
cbar=True, cbar_kws=dict(shrink=.75),
The result is the classical logarithmic graphic that makes the tail of a
distribution more visible; in this case, it emphasizes the bars associated
with BnBs with a high number of reviews, which were almost invisible
with a linear scale.
We can try using bins=100 and apply the logarithmic scale to axis
x, with the number of reviews, to see the result. The problem is that
this time we have many data points corresponding to value zero
(i.e., BnBs with no reviews), which would correspond to log(0)=-inf
and an inevitable visualization error if we simply set
log_scale=True in function histplot() . This is a common
problem of logarithmic scales that has a clever solution in matplotlib
called Symmetric log, or symlog for short. A symlog turns the
logarithmic scale into a linear scale for a tiny range of values around
zero, this way avoiding the case of log(0) and allowing for a
meaningful visualization. The result (see Figure 4.17) shows the
presence of a considerable number of BnBs with no reviews and
also permits visualizing the tail of the distribution.
g=sns.histplot(df, x="number_of_reviews",
binwidth=2, fill=False)
plt.xscale('symlog')
plt.xlim(0,900)
plt.xlabel("Number of Reviews ")
plt.ylabel("Number of BnBs (count)")
plt.title("binwidth=2")
plt.yscale('symlog')
g.legend_.set_title('Neighborhoods')
plt.xlabel("Number of Reviews ")
plt.ylabel("Number of BnBs (count)")
plt.title("binwidth=20")
Figure 4.18 Histogram with stacked visualization, logarithmic scale, and symmetric log
(bin width equals 20).
Figure 4.19 Histogram with stacked visualization, logarithmic scale, and symmetric log
(bin width equals 5).
5
Diverging Bar Plots and Lollipop Plots
This chapter presents two peculiar types of graphics, diverging bar
plots and lollipop plots. The second has an efficient implementation
in ggplot but not in Seaborn, which, lacking specific support, forces
us to delve into matplotlib complications. It is of course possible
that future versions of Seaborn (v. 0.12.2 is the one used for this
book) will provide a native implementation of lollipop plots.
Dataset
(https://www.oecd.org/termsandconditions/)
5.1 R: ggplot
For our example, we consider a new dataset from OECD with a time
series representing the production of agricultural goods for a set of
countries. The information we are interested in is the country name,
the year of production, and the quantity of a certain commodity.
Being interested in visualizing both negative and positive values, we
could derive yearly differences in production with a simple
procedure. For the analysis, we choose a particular product, wheat,
and calculate yearly differences as the difference between values of
two consecutive years.
oecd %>%
filter((Variable=='Production') & (Commodity=="W
select(LOCATION, Country, TIME, Time, Value) ->
df$DIFF=0
num_country= length(unique(oecd$Country))
num_year= length(unique(oecd$TIME))
k=0
for (j in 1:num_country) {
for (i in 2:num_year) {
$DIFF[i+k]=df$Value[i+k] - df$Value[i-1+k]
}
k=k+41
}
df2$LAG=NULL
In this case, ordering bars according to the values would not be the
most appropriate solution, because the year order is the most useful
information to maintain. We can consider a variant where ordering
the bars is useful instead: we take the whole set of countries and a
particular year (i.e., year 2000). We also add the indication of the
actual value at the top of each bar by means of function
geom_text() . Figure 5.2 shows the result.
Figure 5.1 Diverging bar plot, yearly wheat production variations for Argentina.
Figure 5.3 Lollipop plot, yearly wheat production variations for Argentina.
With the next example, we reproduce the second diverging bar plot
seen before, this time by using a lollipop plot instead of a bar plot
(Figure 5.4).
5.2 Python: Seaborn
We replicate with Python and Seaborn the first diverging bar plot
seen with ggplot, starting from the necessary data wrangling
operations for preparing the data frame.
oecd=pd.read_csv("../datasets/OECD/OECD-FAO_Agricultu
oecd1=oecd[(oecd.Variable=='Production')&(oecd.Commodity=='Wheat')]\
          [['LOCATION', 'Country', 'TIME', 'Value']]
… … … … …
Figure 5.4 Lollipop plot ordered by values and annotation, yearly variations in wheat
production for year 2000 with respect to year 1999.
oecd1= oecd1.sort_values(by=['LOCATION','TIME'])
oecd1['LAG']=oecd1.groupby('LOCATION')[['Value']].shift(1)
LOCATION Country TIME Value LAG
… … … … … …
oecd1['DIFF']=oecd1.Value-oecd1.LAG
oecd1.drop('LAG', axis=1, inplace=True)
LOCATION Country TIME Value DIFF
… … … … … …
With these simple operations, the data for all countries are almost
ready; we just have to remember that all rows corresponding to year
1990 should be removed, because they contain inconsistent
production differences (there is no previous year to subtract). Then,
we can select the country for which we want to plot the data, this
time the United States, and plot.
oecd1=oecd1[oecd1.TIME!=1990]
usa=oecd1[oecd1.Country=='United States']
To plot the diverging bar plot, we start using the normal Seaborn
function sns.barplot() , in this example with years (variable
TIME) on the x-axis and production differences (variable DIFF) on
the y-axis; a few style options are also added. However, this is not
sufficient to obtain a reasonable diverging bar plot, because the color
scale will not be as we usually want it in this case, namely diverging
for positive and negative values. Here comes the tricky part: for that
seemingly obvious feature there is no support from Seaborn, and we
should turn to matplotlib, which forces us to manually color
each bar.
sns.set_theme(style="white", font_scale=0.7)
div_colors = plt.cm.bwr(divnorm(heights))
# Style options
plt.xticks(rotation=90)
ax.set_ylabel("Wheat production (yearly variations, t
ax.set_xlabel("")
plt.title("United States: OECD-FAO Agricultural Outlo
plt.tight_layout()
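The excerpt above leaves heights and divnorm undefined. A minimal sketch of the missing step follows, assuming usa is the data frame prepared above (columns TIME and DIFF, with both positive and negative differences, since TwoSlopeNorm needs vmin < 0 < vmax); this is one way to do it, not necessarily the book's exact listing.
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import seaborn as sns

heights= usa.DIFF.to_numpy()                 # yearly differences
divnorm= mcolors.TwoSlopeNorm(vmin=heights.min(), vcenter=0,
                              vmax=heights.max())
div_colors= plt.cm.bwr(divnorm(heights))     # one RGBA color per bar
ax= sns.barplot(data=usa, x='TIME', y='DIFF')
for bar, color in zip(ax.patches, div_colors):
    bar.set_color(color)                     # manually color each bar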
Figure 5.5 Diverging bar plot, yearly wheat production variations for the United States,
vertical bar orientation.
The plot, however, is likely more readable if bars are horizontal and
years are on the y-axis. That seems trivial: switching attribute x with
y, or using attribute orient of function sns.barplot() ,
should be sufficient. Unfortunately, even this simple variation has
hidden subtleties.
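As a sketch of one workable horizontal variant (not the book's exact listing), the axes can be swapped and the orientation stated explicitly, reusing usa and div_colors from the sketch above:
ax= sns.barplot(data=usa, x='DIFF', y='TIME', orient='h')
for bar, color in zip(ax.patches, div_colors):
    bar.set_color(color)                 # same manual coloring as before
ax.set_xlabel("Wheat production (yearly variations)")
ax.set_ylabel("")
plt.tight_layout()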
6
Boxplots
Dataset
In this section, we use the dataset Report qualità aria 2021 (transl.
Air Quality Report year 2021), Municipality of Milan Open Data,
already introduced before.
6.1 R: ggplot
For boxplots, similar to what we have done with bar plots, we use
pollutant as the categorical variable, but this time, rather than
aggregating to calculate total quantities, we use all data points to
obtain statistics for the boxplot. Let us try with the simplest
configuration of function geom_boxplot() . Figure 6.2 shows the
result.
6.2 Python: Seaborn
NOTE
month_list=['Janvier','Février','Mars','Avril','Mai',
            'Juin','Juillet','Août','Septembre','Octobre',
            'Novembre','Décembre']
df["Month"]=df['date'].dt.month_name(locale='fr_FR')
df.Month=pd.Categorical(
df.Month,
categories = month_list,
ordered = True)
Figure 6.6 Box plot with three variables, the result is confusing.
plt.xticks(rotation=30)
plt.xlabel("Month (French)")
plt.ylabel("Value")
plt.legend(title="Pollutant")
sns.move_legend(g, "upper left", bbox_to_anchor=(1, 1))
Similar to the R case, even now the visualization is unclear, with too
many graphical elements put together and not well recognizable.
Facets would be better and separating months into facets is likely a
good choice, as we did before (Figure 6.7).
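A minimal sketch of such a facet layout uses sns.catplot() ; the data frame df and its column names pollutant and value are assumptions from the data wrangling above, and kind='box' draws a boxplot in each facet.
g= sns.catplot(data=df, x='pollutant', y='value', kind='box',
               col='Month', col_wrap=4, height=2)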
sns.set_theme(style="whitegrid", font_scale=0.8)
g.set_xticklabels(rotation=90)
g.set_axis_labels("Pollutant","Value")
…
Figure 6.7 Boxplot with three variables, facet visualization.
7
Violin Plots
A violin plot is a boxplot variant initially introduced to add the
missing information about the actual distribution of data points.
Rather than the fixed rectangular boxplot shape, the violin plot
adapts its shape to the density of data points for each value of the
continuous variable: the shape is larger where data points are more
abundant and thinner where they are scarce. This often produces a
shape vaguely reminiscent of a violin, hence the name. The
drawback of the violin plot with respect to the boxplot is that it is
less precise in representing the descriptive statistics about quantiles
of the distribution.
Dataset
(https://www.oecd.org/termsandconditions/).
7.1 R: ggplot
Let us start with a simple example and elaborate on it. We first use
the OECD Skills Survey for Pisa tests; values refer to the
average results, and students are divided by gender, male and
female. The dataset is in Microsoft Excel format, so it needs package
readxl to be read, and it has been slightly modified with respect to the
original one (i.e., year values have been copied in all cells, the
header simplified, and a new column Test added with MAT for
Mathematics, READ for Reading, and SCI for Scientific skills).
library(readxl)
Mat=read_excel("datasets/Eurostat/
IDEExcelExport-Mar122024-0516PM.xlsx",
sheet ='Report 1', range='B12:F227', tr
Mat$Female = round(Mat$Female, 0)
Mat$Male = round(Mat$Male, 0)
library(ggthemes)
We can produce the density plot by setting test results (column Avg)
on the x-axis, and areas filled with different colors for gender
(column Sex). The y-axis will be automatically associated to the data
point density. For better readability, facets are configured based on
years. Finally, to have an orientation similar to a typical violin plot,
we flip the axes ( coord_flip() ).
The violin plot and the density plot are scaled differently but,
comparing the information they provide, we can immediately
recognize that it is the same.
These are the basis for understanding how to use violin plots. But,
as said before, violin plots are particularly effective when combined
with other graphic types, to produce ingenious representations. Let
us see the first two combinations.
7.1.1 Violin Plot and Scatterplot
Finally, we could also read the Pisa test results for Reading and Scientific
skills, bind the rows of the three data frames together, repeat the
transformation into long form, and plot the facets by means of
variable Test. Two details are worth noting. The first is that, for the
boxplot, the dots representing outliers are redundant, since the
violin's tails convey the same information; they could be omitted with
attribute outlier.shape=NA . The second is that here we use
function facet_grid() , not facet_wrap() , for the facet
visualization. This is only for aesthetic purposes: facet_grid() is
made for facets created from the combination of two variables'
values, while we have just one; however, using our single variable for
rows, we can have facet titles beside each facet instead of on top of
them.
Rd=read_excel("datasets/Eurostat/IDEExcelExport-Mar12
sheet = 'Report 2', range = 'B12:F227',
Rd$Female = round(Rd$Female, 0)
Rd$Male = round(Rd$Male, 0)
Sci=read_excel("datasets/Eurostat/IDEExcelExport-Mar1
sheet = 'Report 3', range = 'B12:F227',
Sci$Female = round(Sci$Female, 0)
Sci$Male = round(Sci$Male, 0)
7.2 Python: Seaborn
bikes= pd.read_excel(
    "datasets/Berlin_open_data/Fahrraddiebstahl_12_2022_EN.xlsx")
From Figure 7.7, we see that most bikes stolen are in the range of
tens to hundreds of euros, while just a few are particularly expensive
(thousands of euros). Let us try some variations.
sns.violinplot(data=bikes,
x= bikes["START_DATE"].dt.month)
Figure 7.9 Bar plot, bike thefts in Berlin for each month of years 2021 and 2022.
bikes2= bikes.groupby(bikes["START_DATE"].dt.month).\
DAMAGES.count().reset_index()
    START_DATE  DAMAGES
0            1     2201
1            2     2140
2            3     3083
3            4     3074
4            5     3877
5            6     4167
6            7     3995
7            8     4387
8            9     4494
9           10     4550
10          11     3642
11          12     1559
Figure 7.10 Violin plot, bike thefts in Berlin for bike type and month, years 2021 and 2022.
We can now consider bike types for axis y and use attribute
scale='count' , which scales violin dimensions with respect to the
number of observations. Attribute cut=0 restricts the shape of the
violin plot only to values actually present in the data. This may sound
bizarre; how could a plot show nonexistent data? Actually, it is what
the Seaborn violin plot would do in this case without attribute
cut=0 : violin tails, purely for aesthetic reasons, would be extended
beyond the minimum or maximum data point. In this case, we would
have seen a tail going into the negative range of the x-axis, clearly
impossible since x is the number of thefts. Function despine()
removes the visualization of the Cartesian axes, which might
sometimes be aesthetically redundant (Figure 7.10).
g= sns.violinplot(data= bikes,
                  x= bikes["START_DATE"].dt.month,
                  y= "TYPE_OF_BICYCLE",
                  scale= 'count', cut=0,
                  palette= "cubehelix")
sns.despine(left=True, bottom=True)
8
Overplotting, Jitter, and Sina Plots
Dataset
In this chapter, we make use again of data from the OECD Skills
Survey, OECD 2022 (The Organisation for Economic Co-operation
and Development), and from Bicycle thefts in Berlin (trans.
Fahrraddiebstahl in Berlin) from the Municipality of Berlin, Germany,
Berlin Open Data, previously introduced.
8.1 Overplotting
8.2 R: ggplot
colorList= c('#1252b8','#fa866b')
This way, we have controlled the horizontal jitter effect. In the same
way, we could control the vertical jitter. We try two more cases by
varying attribute width . First, we reduce it to width=0.1 (Figure
8.2), then increase it to width=0.3 (Figure 8.3). The different
visual effects are evident.
TIP
colorList= c('gold','forestgreen')
colorList= c('gold','forestgreen')
library(ggforce)
colorList= c('#1252b8','#fa866b')
Figure 8.7 Sina plot and violin plot combined, OECD/Pisa tests results for male and female
students, Mathematics skills.
A violin plot and a sina plot could also be combined, for a visual
effect with better-defined shapes (Figure 8.7); style options common
with previous graphics have been omitted.
colorList= c('#1252b8','#fa866b')
Let us try the sina plot with a boxplot. The combination could be
effective when data points are appropriate to this visualization, even
without a violin plot (Figure 8.8).
colorList= c('#1252b8','#fa866b')
ggplot(MatL, aes(x=as.factor(Year), y=Avg))+
  geom_boxplot(aes(fill=Sex), alpha=0.7, outlier.shape=NA)+
  geom_sina(aes(color=Sex), shape=1)+
  scale_color_manual(values = colorList) +
  scale_fill_manual(values = colorList) +
  labs(…
Figure 8.9 Sina plot with stacked groups of data points and color based on logical
condition.
Figure 8.9 shows the resulting plot.
library(ggbeeswarm)
colorList= c('#1252b8','#fa866b')
Which one to choose? In general, both the sina plot and the
beeswarm plot convey additional information about the
distribution of data points with respect to traditional jitter for
categorical scatterplots. It could be observed that the sina plot
maintains a more realistic representation of the density of the data
points, closely resembling a violin plot, while the beeswarm plot
prefers a more stylized shape. However, choosing between the sina
plot and the beeswarm plot is largely a matter of subjective
preference, either aesthetic or of communication style.
8.3 Python: Seaborn
g= sns.stripplot(data= bikes,
                 x= bikes["START_DATE"].dt.month,
                 y= "DAMAGES",
                 alpha=0.7, size=1.5, jitter=0.3, palette='flare')
plt.xlabel("Month")
plt.ylabel("Bicycle Value ")
plt.title("Berlin: bicycle thefts")
bikes_ml=bikes[((bikes["TYPE_OF_BICYCLE"]=="men's bike") |
                (bikes["TYPE_OF_BICYCLE"]=="ladies bike")) &
               (bikes["START_DATE"].dt.month==10) &
               (bikes["START_DATE"].dt.year==2022)]
g=sns.swarmplot(data=bikes_ml,
                x="TYPE_OF_BICYCLE",
                y="START_HOUR",
                size=2.5,
                palette={"men's bike": "skyblue",
                         "ladies bike": "darkred"})
plt.xlabel("")
plt.ylabel("Hour of day")
plt.title("Berlin: bicycle thefts (October 2022)")
plt.tight_layout()
Figure 8.13 Swarm plot, men’s and ladies’ bike thefts in Berlin, October 2022.
The sina plot does not exist as a native graphic type in Seaborn (up to
version 0.12.2, at least), but custom implementations have been
proposed and could be considered for use. An excellent one has
been realized by Matthew Parker and is available from his GitHub
repository; a Jupyter notebook provides the usage instructions
(https://github.com/mparker2/seaborn_sinaplot).
bikes_ml= bikes[
    ((bikes["TYPE_OF_BICYCLE"]=="men's bike") |
     (bikes["TYPE_OF_BICYCLE"]=="ladies bike")) &
    (bikes["DEED TIME_START_DATE"].dt.month==1)]

g= sinaplot(data= bikes_ml,
            x= bikes_ml["DEED TIME_START_DATE"].dt.year,
            y= "DAMAGES", hue= "TYPE_OF_BICYCLE",
            palette= sns.color_palette(['forestgreen','skyblue']),
            s=2, violin=False)
Figure 8.14 Sina plot, men’s and ladies’ bike thefts in Berlin in January 2021–2022.
9
Half-Violin Plots
The name half-violin plot could sound like an oddity, one of those
bizarre artifacts that sometimes data scientists and graphic designers
create for amusement, but it would be a mistake to consider it that
way. Instead, it is a relevant variant of the violin plot that is
particularly well-suited to be combined in different fashions to
convey a good deal of information in an intuitive and aesthetically
pleasant form.
Dataset
In this chapter, we make use again of data from the OECD Skills
Survey, OECD 2022 (The Organisation for Economic Co-operation
and Development), and from Bicycle thefts in Berlin (transl.
Fahrraddiebstahl in Berlin) from the Municipality of Berlin, Germany,
Berlin Open Data, previously introduced.
9.1 R: ggplot
colorList= c('#3c77a3','#b1cc29')
Figure 9.1 Half-violin plot, custom function, OECD/Pisa test results for male and female
students, Mathematics skills.
We can replicate some examples seen with violin plots by using the
half-violin graphic type. We choose the most complete, tuning
attribute width to correctly place the internal boxplot. Once again,
the result, shown in Figure 9.2, is aesthetically pleasant and conveys
information with a compact and original layout.
colorList= c('#3c77a3','#b1cc29')
ggplot(pisa, aes(x= as.factor(Year), y= Value,
fill= Sex)) +
geom_split_violin(alpha=0.4) +
geom_boxplot(position= position_dodge(width=0.2),
alpha=0.5, size=0.4, width= 0.2,
outlier.shape= NA) +
geom_point(aes(group= Sex),
           position= position_jitterdodge(jitter.width=0.1,
                                          jitter.height=0,
                                          dodge.width=0.9),
           alpha=0.5, size=0.7, shape=1) +
scale_fill_manual(values= colorList) +
labs(x="", y="Test Results", fill="Gender:",
title="OCSE/Pisa test: Mathematics Skills"
)+
theme_hc()+
theme(legend.position = "bottom")+
theme(plot.margin = unit(c(1,1,1,1),"cm"))
Figure 9.2 Half-violin plot, boxplot, and scatterplot with jitter correctly aligned and dodged,
OECD-Pisa tests.
colorList= c('#3c77a3','#b1cc29')
ggplot(pisaMRS, aes(x=as.factor(Year), y=Avg, fill=Sex))+
  geom_split_violin(alpha=0.7)+
  geom_boxplot(position=position_dodge(width=0.2),
               alpha=0.5, size=0.4, width= 0.2, outlier.shape=NA)+
  geom_point(aes(group=Sex),
             position=position_jitterdodge(jitter.width=0.1,
                                           jitter.height=0,
                                           dodge.width=0.9),
             alpha=0.5, size=0.7, shape=1)+
  facet_grid(rows = vars(Test))+
  scale_fill_manual(values = colorList) +
  labs(
    x="", y="Test Results", fill="Gender:",
    title= 'OECD/Pisa test: Mathematics, Reading, and Scientific Skills'
  )+
…
Figure 9.3 OECD/Pisa tests, male and female students, Mathematics, Reading, and
Scientific skills.
library(gghalves)
colorList= c('#3c77a3','#b1cc29')
Figure 9.4 Left-side half-violin plots, male and female students, Mathematics skills.
With this as the basis, the raincloud plot can be produced. Some
care should be taken to correctly place the three graphics: the
half-violin plot, the boxplot, and the dot plot. In particular,
attribute position=position_nudge() is needed to overcome the
default placement; attribute stackratio of geom_dotplot()
modifies the distance of aligned markers; and attribute binaxis
defines the axis used to align markers (axis x is the default, we need
to specify axis y). The adoption of facet_grid() instead of
facet_wrap() has just an aesthetic reason: that way, we have
facet titles vertically on the side rather than on top. In the
example, we use a single variable for facets, associated to rows of
the grid with attribute rows=vars() ; with attribute switch="y"
facet titles are shown on the right side. As a last detail, by resizing
the plot with attributes width and height of function ggsave() ,
which saves the last plot to file, we improve the excessive vertical
closeness of graphics of the original plot, otherwise not easy to tune.
library(gghalves)
colorList= c('#3c77a3','#b1cc29')
facet_grid(rows=vars(Test))+
coord_flip()+
labs(
x="", y="Test Results", fill="Gender:",
title="OCSE/Pisa tests 2006-2022"
)+
theme_hc()+
theme(legend.position = "bottom")+
theme(plot.margin = unit(c(1,1,1,1),"cm"))+
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 10),
legend.text = element_text(size = 10),
legend.title = element_text(size = 10))
The result is smart and imaginative, and the origin of the name (i.e.,
clouds of data with falling raindrops) should now be manifest. It is,
however, also effective in conveying information in a compact form.
Several hints about data from the Pisa tests emerge quite evidently.
9.2 Python: Seaborn
In this Python section, we make use again of data about bike thefts
in Berlin, as we did in the violin plot section. We already know that
half-violin plots are well-suited in case of groups of markers where
the variable has two values. For this reason, we select just two bike
types and plot the corresponding violin plots for each month. Figure
9.6 shows the result.
g= sns.violinplot(data= bikes_ml,
                  x=bikes_ml["START_DATE"].dt.month,
                  y= "DAMAGES", hue= "TYPE_OF_BICYCLE",
                  palette={"men's bike": '#3c77a3',
                           "ladies bike": '#b1cc29'},
                  linewidth=0.7)
g.legend_.set_title('Bike types')
plt.xlabel('Month')
plt.ylabel('Bicycle Value')
plt.show()
Figure 9.6 Violin plot with groups of two subsets of points, bike thefts in Berlin.
This is an ideal case for a half-violin plot, having a single violin
composed of the two halves instead of two dodged violins.
Seaborn supports it natively with attribute split=True of function
sns.violinplot() . To show the result more clearly, we select just
one month (i.e., January). With attribute hue_order , we can set
a specific order for the values of the variable used for groups and
associated to attribute hue . We also add a visual effect with
attribute inner='stick' , which shows the data distribution as lines,
while directive sns.despine(left=True,bottom=True) removes
the external border (see Figure 9.7). By specifying inner='quart' ,
the quartiles of the distribution (Q1, median, and Q3) are shown
(see Figure 9.8).
data= bikes_ml[bikes_ml["START_DATE"].dt.month == 1]
g=sns.violinplot(data= data,
                 x= bikes_ml["START_DATE"].dt.year,
                 y= "DAMAGES", hue= "TYPE_OF_BICYCLE",
                 hue_order= ["men's bike", "ladies bike"],
                 palette={"men's bike": '#3c77a3',
                          "ladies bike": '#b1cc29'},
                 linewidth=0.1,
                 split= True, inner= 'stick')
sns.despine(left=True,bottom=True)
g.legend_.set_title('Bike types')
plt.xlabel('')
plt.ylabel('Bicycle Value')
plt.show()
Figure 9.7 Half-violin plots with sticks.
10
Ridgeline Plots
Dataset
In this chapter, we make use again of data from the OECD Skills
Survey, OECD 2022 (The Organisation for Economic Co-operation
and Development) previously introduced.
10.2 R: ggplot
They are similar because they are results of the same observation
repeated in regional contexts that may differ: climate conditions
for temperatures; socioeconomic, political, organizational, and
cultural aspects for Pisa tests.
The main difference is that while temperatures
are measured on a given scale, Pisa test results do not have an
implicit scale for ordering them. Different metrics are possible;
one should be chosen and its values derived from the data.
MatL%>%
ggplot(aes(x=Avg, y=Country))+
geom_density_ridges(aes(fill=Country), scale=2, rel
scale_fill_viridis(discrete= TRUE, option= "viridis
labs(
x="Test results", y="",
title="OECD/Pisa test: Mathematics skills"
)+
theme_clean() +
theme(panel.grid.major.y = element_blank(),
legend.position = 'none')+
theme(axis.text.x =
element_text(size = 8, hjust = .75))
Figure 10.2 Ridgeline plot, OECD-Pisa tests, default alphabetical order based on country
names, Mathematics skills.
df1_high %>%
group_by(Country) %>%
summarize(Mean= mean(Value, na.rm= TRUE)) %>%
arrange(desc(Mean)) -> df1_sort
list1 = as.list(df1_sort$Country)
STEP 2a: In the data frame with Pisa test results (df1_high), country
names of column Country are transformed into factor type.
STEP 2b: With function fct_relevel() , each value of column
Country (now as factor) is associated, through its attribute level,
to the corresponding position of list1. For example, Korea is in
first position based on mean values of Pisa tests, therefore all
rows related to Korea are associated to factor level 1 and so on
for all countries.
STEP 3: Now we can sort the data frame based on the Country
column, obtaining the ordering based on the factor levels.
df1_high %>%
mutate(Country= factor(Country)) %>%
mutate(Country = fct_relevel(Country,list1)) %>%
arrange(Country) -> df_high_factor
We can now produce the ridgeline plot again as we did before; style
directives are omitted for brevity (Figure 10.3).
df_high_factor %>%
ggplot(aes(x= Value, y= Country)) +
  geom_density_ridges(aes(fill= Country),
                      scale= 2, rel_min_height=0.005)+
  scale_fill_viridis(discrete=TRUE, option="viridis")
The result is much better than the previous one. Now it is very
evident how the results of the Pisa tests differ across the set of
countries. Moreover, the color gradient is more meaningful this way,
highlighting the overall trend.
We can now replicate the same example for Reading and Scientific
skills, just changing the initial data frame, as we already did
for other types of graphics.
theme_light() +
theme(panel.grid.major.x = element_blank(),
panel.border = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position = 'none')
The data frame is SciL, derived from reading the original dataset for
Scientific skills and transforming it into long form.
Colors and line thickness:
geom_density_ridges(fill="black", color="white", size=0.5,
                    scale=1.5, rel_min_height=0.005)
Figure 10.3 Ridgeline plot, OECD-Pisa tests, custom order based on arithmetic mean of
test results, Mathematics skills.
Figure 10.4 Ridgeline plot, OECD-Pisa tests, custom order based on arithmetic mean of
test results, Reading skills.
Figure 10.5 Ridgeline plot, OECD-Pisa tests, custom order based on arithmetic mean of
test results, Scientific skills, a tribute to pulsar CP1919 and Joy Division.
The excerpt of code shows only the more relevant differences with
respect to previous plots.
…
geom_density_ridges(fill="black", color="white", si
scale = 1.5, rel_min_height = 0
labs(…)+
theme_clean() +
theme(panel.grid.major.y= element_blank(),
legend.position= 'none')+
theme(axis.text.x= element_text(size=8, vjust=6, hj
theme(panel.background= element_rect(fill= "black")
plot.background= element_rect(color='white')
axis.text.x= element_text(vjust= 1.5))
11
Heatmaps
Heatmaps are a type of graphic that is usually easy to produce and
can be aesthetically pleasant and effective at conveying information in
an intuitive way. In practice, what a heatmap shows is a color-based
representation of a data frame in rectangular form, with two
categorical variables associated to the sides of the heatmap
(corresponding to the Cartesian axes), and a third variable, either
continuous or categorical, whose values are converted into a color
scale. The idea is that, through the color representation, an observer
can easily and intuitively grasp the values of the third variable
corresponding to the two variables on the axes. The information
conveyed by a heatmap is largely qualitative: the color scale usually
has quantitative values but, especially with a continuous gradient,
the exact value associated to a certain hue is difficult to determine,
so what an observer gets is often a broad approximation of the real
value. Therefore, with respect to the corresponding data frame, a
heatmap is certainly less precise, but it gains in simplicity for an
observer to get the informational content. In addition, heatmaps,
being colorful and regular in structure, are well-adapted to be used
in creative ways and combined with different graphical elements.
Dataset
11.1 R: ggplot
We have not yet used the dataset of bike thefts in Berlin with R, so it
is worth reminding that, as previously discussed, this case study has
some subtleties to consider when the translated version, from
German to English, is used. Problems could arise due to incoherent
date formats deriving from intrinsic limitations of automatic
translation tools, which suggests caution when dealing with dates.
The Additional Online Material, in the section dedicated to violin
plots (Chapter 7), provides the details of this case and all Python
data-wrangling operations to correctly set up the data frame for
visualization. The same Additional Online Material, in the section
dedicated to this chapter on heatmaps, summarizes the same
operations for R. Those data-wrangling operations do not present
any particular difficulty; however, the subtleties and the logic should
be clear in order to fully grasp their meaning.
Here we start with the modified English dataset correctly set up with
coherent dates. We read it and adjust some column names to
work more swiftly with them. Then, we aggregate bike values and the
number of bikes stolen with respect to the month and hour of the
theft.
df= read_csv("datasets/Berlin_open_data/
Fahrraddiebstahl_12_2022_EN_MOD.csv")
# A tibble: 288 × 4
# Groups: MONTH [12]
MONTH START_HOUR TOT_DMG NUM
<ord> <dbl> <dbl> <int>
1 January 0 28696 29
2 January 1 7746 13
3 January 2 8255 11
4 January 3 8328 11
5 January 4 6073 6
# … with 283 more rows
class(bikesR$MONTH)
[1] "ordered" "factor"
min_lim= min(bikesR$NUM)
max_lim= max(bikesR$NUM)
11.2 Python: Seaborn
We still use the same dataset of bike thefts in Berlin with data frame
bikes from previous chapters. We aggregate the data frame to
obtain the value and number of bikes stolen for each month and
hour of day. First, we rename some columns for simplicity.
bikes.columns= ['DATE','START_DATE','START_HOUR',
                'END_DATE','END_HOUR','LOR','DAMAGES',
                'EXPERIMENT','TYPE_OF_BICYCLE',
                'OFFENSE', 'DETECTION']
bikes2= bikes.groupby([bikes['DATE'].dt.month_name(),
                       'START_HOUR'])\
        ['DAMAGES'].agg(TOT_DMG= 'sum', NUM= 'count').\
        reset_index()
Now we want to correctly sort the new data frame bikes2 with
respect to month names. We need to employ the known technique
based on an external list. Here we show a tiny variant, deriving the
month list instead of manually writing it.
monthList= pd.date_range(
start='2022-01-01',
end='2022-12-01', freq='MS')
monthName= monthList.map(lambda x: x.month_name()).to_list()
bikes2.DATE= pd.Categorical(bikes2.DATE,
categories= monthName, ordered= True)
For the example, we transform data frame bikes2 into wide form by
using column START_HOUR for new column names and the number
of bikes stolen for values.
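The pivot itself is not shown in this excerpt; a minimal sketch, assuming bikes2 has the columns DATE, START_HOUR, and NUM produced above:
# wide form: one row per month, one column per hour, counts as values
bikes_wide= bikes2.pivot(index='DATE', columns='START_HOUR',
                         values='NUM').sort_index()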
START_HOUR   0   1   2   3  …
DATE
January     29  13  11  11  …   69  65  5…
February    28  10   3   5  …   76  64  4…
March       50  18  15  10  …  108  95  5…
⋯
December    23  14   9   4  …   72  46  3…
Now that we have the data frame in rectangular form, the Seaborn
heatmap is very easy to produce with function sns.heatmap() ; we
just need to select a color palette as we wish. A few style options
have been applied (Figure 11.3).
Figure 11.3 Heatmap, number of bike thefts in Berlin for months and hours.
sns.set_theme(style="white")
g.xaxis.set_tick_params(labelsize=8, rotation=30)
plt.xlabel("Hour of day")
plt.ylabel("")
plt.title("Bicycle thefts in Berlin: number of thefts
plt.tight_layout()
We can repeat it, this time by using bikes value for the wide form
transformation (Figure 11.4).
sns.set_theme(style="white")
g= sns.heatmap(bikes_wide, cmap="cubehelix")
g.xaxis.set_tick_params(labelsize=8, rotation=30)
plt.xlabel("Hour of day")
plt.ylabel("")
plt.title("Bicycle thefts in Berlin: bikes value (202
plt.tight_layout()
Figure 11.4 Heatmap, value of stolen bikes in Berlin for months and hours.
12
Marginals and Plots Alignment
So-called marginals are a family of graphics made by the
combination of different plots with a main one presented in the
central position and one or two others associated to the x and y
axes. For example, we may have a scatterplot as the main graphic
and histograms, density plots, or boxplots associated to the axes.
Several other variants are possible.
Dataset
12.1 R: ggplot
The dataset read and the change of column names are the same as
already shown before and are omitted here. We aggregate values by
year, month, and bike type, calculating bike values and the number
of stolen bikes.
bikesR= group_by(df,
year(DATE),
month(DATE, label=TRUE, abbr=FALSE),
TYPE_OF_BICYCLE) %>%
summarize(TOT_DMG= sum(DAMAGES), NUM =n()) %>%
rename(YEAR = 1, MONTH_CREATED = 2)
12.1.1 Marginal
library(ggExtra)
library(ggthemes)
ggMarginal(p,
type= "histogram",
fill= "lightblue",
xparams= list(bins=20))
12.1.2 Plots Alignment
In order to see the marginal variants with boxplots and density plots,
we introduce a new possibility to define the layout of the result that
would permit to align different plots in different ways. Several
solutions exist for that feature, with different degrees of difficulty.
Previously, we already saw an example by using packet patchwork,
which is the easiest, but unfortunately does not support graphical
objects produced with ggMarginal and cannot be fine-tuned. We
present one of the most flexible solutions for plot alignment provided
by package gridExtra. With gridExtra, it is possible to create complex
layouts with different graphical objects and images. Here, we use it
in a simple way, just to vertically align three plots: the one created in
p1 and two variants. The main function is grid.arrange() , which
lets specify the number of rows (attribute nrow ) and columns
(attribute ncol ) of the grid. The creation of ggplot object p is
identical to the previous example and is omitted. Figure 12.2 shows
the result.
Figure 12.1 Marginal with scatterplot and histograms, bike thefts in Berlin (2021–2022).
library(gridExtra)
... -> p
Figure 12.2 Plots aligned in a vertical grid, marginals, bike thefts in Berlin (2021–2022).
12.1.3 Rug Plot
… -> p
Figure 12.4 Marginal with categorical scatterplot and rug plot, number of stolen bikes in
Berlin for hours and types of bikes (2021–2022).
12.2 Python: Seaborn
bikes2= bikes.groupby([bikes['DATE'].dt.month_name(),
                       'TYPE_OF_BICYCLE'])['DAMAGES'].\
        agg(TOT_DMG= 'sum', NUM= 'count').\
        reset_index()
sns.set_theme(style="white")
Figure 12.5 Subplots, a scatter plot and a boxplot horizontally aligned, stolen bikes in
Berlin (2021/2022).
# Style elements
ax[0].set(
xlim=(0, 2200), ylim=(0, 2.5e+06),
xlabel='Number of bikes (month)',
ylabel='Value (month)',
)
ax[0].legend()
ax[1].set(
    xlabel='Number of bikes (month)',
    ylabel=''
)
ax[1].yaxis.set_label_position("right")
ax[1].yaxis.tick_right()
# Despine subplots
for a in ax.flat:
    sns.despine(bottom=False, left=False, ax=a)
f.tight_layout()
Figure 12.6 Subplots, a scatter plot and a boxplot vertically aligned, stolen bikes in Berlin
(2021–2022).
for a in ax.flat:
    sns.despine(bottom=False, left=True, top=True, right=True, ax=a)
sns.set_theme(style="ticks")
g.ax_joint.legend_._visible= False
g.fig.legend(bbox_to_anchor= (1.0, 1.0), loc=1)
plt.ylabel("Bike values")
plt.xlabel("Number of stolen bikes")
12.2.3 Marginals: Joint Grid
The Joint grid is the extended version of the Joint plot, which
specifies explicitly the configuration. The logic is similar to what we
have seen for facets, whose general approach combines functions
FacetGrid() with map() , the first to define general attributes
and the facet grid, the second to associate to facets a specific
graphic type.
Figure 12.7 Joint plot with density plots as marginals, stolen bikes in Berlin (2021–2022).
JointGrid() defines the grid for the main plot and the two
marginals, and possibly additional graphical elements associated
to variables.
plot_joint() defines the type for the main plot and optional
elements.
plot_marginals() defines the type for marginals and optional
elements.
g= sns.JointGrid(data= bikes2,
y= "TOT_DMG", x= "NUM",
hue= "TYPE_OF_BICYCLE",
space=0, ratio=5)
# Main graphic
g.plot_joint(sns.scatterplot, s=80, alpha=.6,
legend=True, palette= 'inferno')
# Marginals
g.plot_marginals(sns.rugplot, height=1,
color="teal", alpha=.8)
g.ax_joint.legend_._visible= False
g.fig.legend(bbox_to_anchor=(1.0, 1.0), loc=1)
13
Correlation Graphics and Cluster
Maps
Correlation graphics are a family of graphics aimed at showing the
possible statistical correlation between variables. With respect to the
case studies discussed in previous sections, for instance, we may
want to know what the correlation is between the hour of day or
the month and bike thefts in Berlin. From the statistical correlation
index, it is then possible to analyze the possible cause–effect
relationship between two variables. For example, is it true that thefts
happen more frequently at certain hours of the day or in certain
months? Intuitively we might be tempted to answer positively, but
intuition often fails us when correlation is investigated, and it is not
rare to end up mistaking pure chance for causality, or imagining a
direct correlation between two events when instead they are
correlated with a third one (e.g., seasonal phenomena), somehow
hidden or ignored.
Dataset
13.1 R: ggplot
We start with a graphic type that often goes under the name of
cluster map and represents an extension of traditional heatmaps,
enhancing them with graphical elements derived from clusterization
methods, which are statistical methods aimed at grouping
observations based on similarity or correlation metrics. The goal is to
recognize which observations are more similar with respect to a
statistical criterion, and to divide the sample into clusters of
observations that are more alike each other than they are to all
others. The information provided is that observations in the
same cluster have something in common, depending on the
specific clusterization metric employed, more than what they have in
common with observations not belonging to the cluster.
We use again data frame bikes and, this time, we need to bring it
into wide form. We choose column START_HOUR for the new column
names and the number of thefts as values. We also add prefix h to
hours to avoid backticks in column names.
MONTH h0 h1 h2 h3 h4 h5 h6 …
1 January 29 13 11 11 6 16 35 …
2 February 28 10 3 5 8 22 34 …
3 March 50 18 15 10 11 18 38 …
4 April 47 21 14 9 10 15 39 …
5 May 58 23 29 14 8 16 42 …
6 June 79 51 24 13 16 32 61 …
7 July 88 47 37 15 12 23 57 …
8 August 72 51 19 19 24 37 75 …
9 September 88 42 29 18 21 34 62 …
10 October 67 51 29 19 18 27 73 …
11 November 54 25 15 14 23 23 71 …
12 December 23 14 9 4 8 8 20 …
bikes_matrix= as.matrix(bikes_matrix)
stats::heatmap(bikes_matrix,
scale= 'row', margins= c(2,0))
The result is not just a simple heatmap as seen before but carries
statistical information about clusters of observations. The color scale
communicates variations in the number of thefts (dark is the
highest, light is the lowest), but it is the graphical element on the
axes that informs us about clusters and how columns, in this case,
since we have scaled by row, have been reordered. Hours have been
reordered according to their similarity in terms of thefts along the
whole year: for example, from 16:00 to 19:00 (i.e., columns h16-h19)
they are similar, and similarly between 00:00 and 06:00 (i.e., columns
h0-h6); the graphic on top of the cluster map shows the details.
That type of graphic is called a dendrogram and shows clusters at
different levels, with lower levels representing the more similar
clusters. So, for instance, looking at the lowest level of the
dendrogram on the top side, hours 19:00 and 20:00 are very similar,
and so are 16:00 and 17:00. Moving to the upper level, we see that
the two clusters 19:00-20:00 and 16:00-17:00 form a cluster
together, meaning they are similar but somehow less similar than
the clusters considered individually. Moving up again, we discover
that the combination of clusters 19:00-20:00/16:00-17:00 is
similar to 18:00, but yet somehow less so than the two separated.
This is how a dendrogram is read: bottom-up.
The color scale now shows relative variations of bike thefts among
months (dark is the highest, light is the lowest), with dendrograms
having the same meaning as described in the previous example. In
this case, scaling by hours (i.e., columns), differences among months
look less marked than those among hours of the day shown in Figure
13.1; however, winter months have visibly fewer thefts, which then
rise in spring, while summer and autumn do not exhibit large
variability. Not truly surprising as a conclusion, but that is statistics:
often it is necessary to state what is common sense, but in a
methodologically sound way.
Figure 13.2 Cluster map, bike thefts in Berlin (2021–2022), values scaled by columns.
13.2 Python: Seaborn
import scipy
sns.set_theme(color_codes=True)
With the second example, we scale by row, thus the color gradient
will show relative variation of bike thefts among hours,
independently from months (Figure 13.4).
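The clustermap calls themselves are elided in this excerpt. A minimal sketch for the two scalings follows, assuming the wide data frame bikes_wide built in the previous chapter; sns.clustermap() computes the dendrograms itself (hence the scipy import), and the cmap is a free choice.
# scale by column (z_score=1): relative variation among months
g1= sns.clustermap(bikes_wide, z_score=1, cmap="mako")
# scale by row (z_score=0): relative variation among hours
g2= sns.clustermap(bikes_wide, z_score=0, cmap="mako")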
13.3 R: ggplot
A positive value of the correlation index means that the two series
of values (i.e., two columns) are directly correlated (or positively
correlated), namely they tend to increase or decrease together.
A negative value means that the two series are inversely
correlated (or negatively correlated), namely when one increases
the other tends to decrease, and vice versa.
A correlation index is a value in the range [−1, +1]: when the
value is close to +1 or −1, the correlation, positive
or negative, is strong, while for values in the middle of the range,
hence close to 0, the correlation is weak.
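These properties are easy to check numerically; a toy example with NumPy (not from the book's data):
import numpy as np

a= [1, 2, 3, 4]
np.corrcoef(a, [2, 4, 6, 8])[0, 1]   #  1.0 -> perfect direct correlation
np.corrcoef(a, [8, 6, 4, 2])[0, 1]   # -1.0 -> perfect inverse correlation
np.corrcoef(a, [3, 1, 4, 2])[0, 1]   #  0.0 -> no linear correlation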
13.4 Python: Seaborn
      0   1   2   3   4   5   6  …   17  18
0    29  13  11  11   6  16  35  …  168  20
1    28  10   3   5   8  22  34  …  168  21
2    50  18  15  10  11  18  38  …  240  35
3    47  21  14   9  10  15  39  …  230  32
4    58  23  29  14   8  16  42  …  277  42
5    79  51  24  13  16  32  61  …  281  44
6    88  47  37  15  12  23  57  …  250  41
7    72  51  19  19  24  37  75  …  317  42
8    88  42  29  18  21  34  62  …  327  47
9    67  51  29  19  18  27  73  …  319  45
10   54  25  15  14  23  23  71  …  335  42
11   23  14   9   4   8   8  20  …  144  16
With the data frame correctly configured, we can create the
correlation matrix. Correlation is among columns, therefore for N
columns, the result will be a matrix N × N; here we have 24 hours,
and it results in a 24 × 24 correlation matrix. The function is the
standard corr() .
corrHour= bikes_corr.corr()
0 1 2 21 22
… … … … … … …
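The call producing the heatmap is elided in this excerpt. A common way to obtain the diagonal layout of Figure 13.6 is to mask the upper triangle of the correlation matrix; a minimal sketch, assuming corrHour from above (the cmap is a free choice):
import numpy as np

mask= np.triu(np.ones_like(corrHour, dtype=bool))  # hide upper triangle
g= sns.heatmap(corrHour, mask=mask, cmap="vlag",
               vmin=-1, vmax=1, square=True,
               cbar_kws=dict(shrink=.75))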
sns.set(style="white", font_scale=0.7)
g.yaxis.set_tick_params(labelsize=8, rotation='auto')
g.set(xlabel='Hour of day', ylabel='Hour of day')
0 1 2 9 10
… … … … … … …
Figure 13.6 Diagonal correlation heatmap, stolen bikes in Berlin, correlation among
months.
corrHour_Long= corrHour.stack().\
reset_index(name= "correlation")
     level_0  level_1  correlation
0          0        0     1.000000
1          0        1     0.571533
2          0        2     0.512906
3          0        3     0.446323
4          0        4     0.279243
…          …        …            …
571       23       19     0.555608
572       23       20     0.682778
573       23       21     0.731613
574       23       22     0.811389
575       23       23     1.000000
# Invert y axis
g.ax.invert_yaxis()
# Style options
g.set(xlabel="", ylabel="", aspect="equal")
g.despine(left=True, bottom=True)
g.ax.xaxis.set_ticks(np.arange(0, 23, 1))
g.ax.yaxis.set_ticks(np.arange(0, 23, 1))
g.ax.margins(.02)
# Set border in legend keys like in scatterplot marke
for artist in g.legend.legendHandles:
artist.set_edgecolor(".1")
g.tight_layout()
Figure 13.7 Scatterplot heatmap, stolen bikes in Berlin (2021–2022), correlation between
hours (it is suggested to look at the colored version of this figure from the Additional Online
Material for an optimal view of the many hues).
Part II
Interactive Graphics with Altair
With Altair, a Python-based graphical library, we enter the realm
of interactive graphics, with graphics that take the form of HTML or
JSON objects (other formats are available). We will still see some
static graphics, similar to those presented in Part I of the book,
because we need them as building blocks for interactive ones;
however, the main interest now is not specifically in them, but in the
logic and mechanisms supporting the interactivity of those visual
objects through actions performed by the observer. Hence, graphics
become responsive to the user's choices: they dynamically adapt
to the user's inputs, which may take different forms, like mouse
clicks and hovering, or gestures on the touchpad/touchscreen.
14
Altair Interactive Plots
Dataset
Standard country or area codes for statistical use (M49) from the
Statistics Division of the United Nations. Official denominations,
codes, and information of countries
(https://unstats.un.org/unsd/methodology/m49/overview/).
Source: Benden, P., Feng, A., Howell, C., & Dalla Riva, G. V. (2021).
"Crime at Sea: A Global Database of Maritime Pirate Attacks (1993–
2020)". Journal of Open Humanities Data, 7, 19. DOI:
https://doi.org/10.5334/johd.39
14.1 Scatterplots
import numpy as np
import pandas as pd
import altair as alt
df= pd.read_csv("datasets/
UN/SYB65_176_202209_Tourist-Visitors Arrival and E
thousands=',')
… … … … …
df1.columns= ['Country','Year','Expenditure','Arrivals']
df1["Per_capita_Exp(x1000)"]= (df1.Expenditure/df1.Arrivals)
Country  Year  Expenditure  Arrivals  Per_capita_Exp(x1000)
…        …     …            …         …
p1 = alt.Chart(df2)
p1.mark_circle(size=80, opacity=0.7)
alt.Chart(df2).mark_circle(size=80, opacity=0.7)
alt.Chart(df2).mark_circle(size=80, opacity=0.7).encode(
    x= alt.X('Arrivals:Q',
             axis= alt.Axis(title='Arrivals (thousands)')),
    y= alt.Y('Expenditure',
             type= 'quantitative',
             axis= alt.Axis(title='Expenditure (millions $)'),
             scale= alt.Scale(padding=1)),
Finally, we add a third variable (Year) associated to the color
aesthetic (attribute color and function Color() ); to specify a
color palette we use again function Scale() ; the legend is placed
on top of the graphic with attribute legend and function
Legend() . With knowledge of ggplot acquired in Part 1 of this
book, the Altair syntax should look familiar. Figure 14.1 shows the
Altair scatterplot for this example.
Figure 14.1 Altair, scatterplot with color aesthetic and style options.
color= alt.Color('Year:O',
                 scale= alt.Scale(scheme='viridis'),
                 legend= alt.Legend(title="Years", orient="top"))
Let us delve into some details by saving the previous plot as a JSON
file and looking at its content. The following excerpt of code is the
beginning of the JSON data structure. JSON follows the Python
dictionary specification: keys mark and type with value circle can
be seen, corresponding to Altair function mark_circle() , followed
by the local attributes opacity and size, then encoding, and so on. It is
the JSON equivalent of the Altair script.
"mark": {
"type": "circle",
"opacity": 0.7,
"size": 80
},
"encoding": {
"color": {
"field": "Year",
"legend": {
"orient": "top",
"title": "Years"
},
"datasets": {
"data-afce4904be12f430c4cee42cfa3e79c6": [
{
"Country": "Albania",
"Year": 2010,
"Expenditure": 1778,
"Arrivals": 2191,
"Per_capita_Exp(x1000)": 0.812
},
{
"Country": "Albania",
"Year": 2018,
"Expenditure": 2306,
"Arrivals": 5340,
"Per_capita_Exp(x1000)": 0.432
},
…
This is the full data frame used for plotting the graphic which, as
said before, when accessed locally, is stored within the Altair object.
The same happens if we produce an interactive graphic in HTML
format: if the data are read locally, the file contains the full data
frame. This should convince everyone that having a limit on the size
of data to be accessed locally is a wise choice; it is configurable at
will, but one should be aware of the possible consequences.
).properties(
width=150,
height=150
)
alt.hconcat(
    base.encode(color='Year:Q').properties(title='quantitative'),
    base.encode(color='Year:O').properties(title='ordinal'),
    base.encode(color='Year:N').properties(title='nominal')
)
14.1.2 Facets
Figure 14.2 Altair, horizontal alignments of plots and differences from assigning different
data types to variable Year.
alt.Chart(df2
).mark_point(
size=40,
opacity=0.5
).encode(
x= 'Arrivals:Q',
y= 'Expenditure:Q'
).properties(
width=150,
height=150
).facet(
facet= 'Year:O',
columns=3 )
# Tooltip specification
tooltip= ['Country:N','Per_capita_Exp(x1000)']
)
Figure 14.4 (a) Dynamic tooltip (example 1). (b) Dynamic tooltip (example 2).
14.1.3.2 Interactive Legend
For the examples, we use the new methods from Altair 5, which
has deprecated some previous methods. Specifically,
alt.selection_multi() and alt.selection_single()
have been superseded by alt.selection_point() ;
alt.selection(type='interval') is to be replaced by
alt.selection_interval() ; and add_selection() by
add_params() .
The older versions still work but, being deprecated, they will stop
being supported in future releases of Altair. However, since many
examples of Altair scripts that can be found are based on the
older functions, it is worth knowing that they can easily be
adapted to the new syntax.
The two screenshots, Figure 14.5a and Figure 14.5b show how the
transparency of different markers changes by changing the legend
selection.
alt.Chart(df2).mark_point(…
).add_params(
selection
# Dynamic zoom
).interactive()
hover= alt.selection_point(on='mouseover',
                           nearest=True, empty=False)
alt.Chart(df2).mark_point(…
For this example, in the tooltip, the year is also present. The same
could be done for visualization by facets.
Drop-down menus and radio buttons are other two typical elements
of interactive interfaces that could be added to an Altair graphic as
well. The first example has a drop-down menu with a list of years to
select. The logic now should be familiar because it is similar to what
we have seen previously, only the specific functions and methods
change.
input_dropdown= alt.binding_select(options=[1995,2005, …,
                                            2018,2019, …],
                                   name='Year: ')
For the selection, it uses attribute fields to specify the data frame
column with the data points to select; it corresponds to the same
column used for the definition of the drop-down menu (i.e., Year),
and it is connected to the variable representing the drop-down menu
with attribute bind .
selection= alt.selection_point(fields=['Year'],
                               bind= input_dropdown)
# Actions
change_color= alt.condition(selection,
alt.Color('Year:N', legend=None),
alt.value('lightgray'))
change_opacity= alt.condition(selection,
alt.value(1.0), alt.value(0.3))
# Graphic
alt.Chart(df2).mark_point(…
color= change_color,
opacity= change_opacity
).add_params( selection )
With radio buttons we proceed the same way, the only difference is
the initial definition, now of radio buttons with function
alt.binding_radio() . Figure 14.9 shows the result.
input_dropdown= alt.binding_radio(options=[1995,2005, …,
                                           2018,2019, …],
                                  name='Year: ')
brush= alt.selection_interval()
STEP 4. Finally, we have object plot for the graphic and data for
the table; what is still missing is their visualization. We want them
side-by-side, so again function hconcat() . With
resolve_legend() the legend position can be corrected, but this
is just a tiny detail.
Here is the full script and two screenshots in Figure 14.10a and
Figure 14.10b.
# Scatterplot
plot= alt.Chart(df2).mark_circle(size=80, opacity=0.7).encode(
    x= alt.X('Arrivals:Q',
             axis= alt.Axis(title='Arrivals (thousands)')),
    y= alt.Y('Expenditure',
             type= 'quantitative',
             axis= alt.Axis(title='Expenditure (millions $)')),
    color= alt.Color('Year:O',
                     scale= alt.Scale(scheme='viridis'),
                     legend= alt.Legend(title="Years",
                                        orient="top"))
).add_params(brush)
# Table definition
ranked_text= alt.Chart(df2).mark_text().encode(
y= alt.Y('row_number:O',axis=None)
).transform_window(
row_number= 'row_number()'
).transform_filter(
brush
).transform_window(
rank= 'rank(row_number)'
).transform_filter(
alt.datum.rank < 15
)
# Encoding of columns
country= ranked_text.encode(text= 'Country:N'
).properties(width=150, title='Country')
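# arrivals and expenditure are encoded analogously to country
# (their definitions are elided in this excerpt)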
# Table visualization
data= alt.hconcat(country, arrivals, expenditure)
# Graphic and table visualization
alt.hconcat(plot, data
).resolve_legend( color="independent" )
Figure 14.10 (a) Selection with brush and synchronized table (example 1). (b) Selection
with brush and synchronized table (example 2).
# Brush definition
brush= alt.selection_interval()
).properties(width=300,height=300
).add_params(brush)
plot2= plot.encode(
    y=alt.Y('Per_capita_Exp(x1000)',
            axis= alt.Axis(title='Per capita Expenditure (thousands $)')))
alt.hconcat(plot1, plot2)
selection= alt.selection_point(fields=['Year'])
change_color= alt.condition(selection,
                            alt.Color('Year:O', legend=None,
                                      scale= alt.Scale(scheme='plasma')),
                            alt.value('lightgray'))
Now we need to define the main graphic and a second one acting
and looking like a legend. The main graphic is still our scatterplot,
with the color aesthetic associated to the selection. The graphic
mimicking a legend, instead, can be defined with a rectangular shape
using mark_rect() and only axis y, with no x (technically, it is a
heatmap with a single column). Axis y will be associated to column
Year and to the condition for changing colors. The selection is also
associated to this graphic, to reconfigure its colors too. The result is
very similar to an actual legend and allows for multiple selections
(usually by holding the Shift key). This way, we may select all
combinations of years. The full script is presented, and screenshots
are shown in Figure 14.12a and Figure 14.12b.
# Main graphic
plot= alt.Chart(df2).mark_point(size=80, opacity=0.7).encode(
    x= alt.X('Arrivals',
             axis= alt.Axis(title='Arrivals (thousands)')),
    y= alt.Y('Expenditure',
             axis= alt.Axis(title='Expenditure (millions $)')),
    color= change_color,
    tooltip= ['Country:N','Per_capita_Exp(x1000)']
)
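The definition of the legend-like chart is elided from the excerpt; a sketch consistent with the description above ( mark_rect() , only axis y, colors driven by the same condition, and the selection attached to it; the axis title is illustrative):
legend= alt.Chart(df2).mark_rect().encode(
    y= alt.Y('Year:O', axis= alt.Axis(title='Select year')),
    color= change_color
).add_params(selection)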
# Visualization
plot | legend
To be noted how the two plots have been horizontally aligned. The
notation plot1 | plot2 corresponds to
hconcat(plot1,plot2) , whereas plot1 & plot2 corresponds
to vconcat(plot1,plot2) for vertical alignment.
Figure 14.12 (a) Plot as interactive legend, all years selected. (b) Plot as interactive
legend, only years 1995, 2010 and 2020 selected and the scatterplot reconfigured.
14.2 Line Plots
We see now line plots in Altair and the peculiar interactive actions
that could be introduced.
Global Code  Region Name  Sub-region Name  Country or Area  M49 Code  ISO alpha Code
…            …            …                …                …         …
Region Name  Year  Expenditure  Arrivals  Per_capita_Exp(x1000)
…            …     …            …         …
alt.Chart(df1_ext).mark_line().encode(…
    y= alt.Y(field='Per_capita_Exp(x1000)',
             aggregate='mean',
             type='quantitative',
             axis=alt.Axis(title='Mean Per_capita Expenditure (thousands $)')),
…
If we wish to show both the mean per capita expenditure and the
total expenditure ( aggregate=’sum’ ), the possibility to define
them directly into Altair would be handy. The following script
presents them both together with the total of arrivals. Figure 14.13
shows the plots aligned.
plot1= alt.Chart(df1_ext).mark_line().encode(
    x= alt.X('Year:O', axis= alt.Axis(title='Year')),
    y= alt.Y(field='Per_capita_Exp(x1000)', aggregate='mean',
             type='quantitative',
             axis= alt.Axis(title='Mean Per_capita Expenditure (thousands $)')),
    color= alt.Color('Region Name:N',
                     scale= alt.Scale(scheme='magma'),
                     legend= alt.Legend(title="Regions", orient="right"))
).properties(width=200, height=250)

plot2 = alt.Chart(df1_ext).mark_line().encode(
    x= alt.X('Year:O', axis=alt.Axis(title='Year')),
    y= alt.Y(field='Expenditure', aggregate='sum',
             type='quantitative',
             axis= alt.Axis(title='Total Expenditure (millions $)')),
    color= alt.Color('Region Name:N',
                     scale= alt.Scale(scheme='magma'))
).properties(width=200, height=250)

plot3= alt.Chart(df1_ext).mark_line().encode(
    x= alt.X('Year:O', axis= alt.Axis(title='Year')),
    y= alt.Y(field='Arrivals', aggregate='sum', type='quantitative',
             axis= alt.Axis(title='Total Arrivals (thousands)')),
    color= alt.Color('Region Name:N',
                     scale =alt.Scale(scheme='magma'))
).properties(width=200, height=250)
Figure 14.13 Line plots, mean per capita, total expenditure, and total arrivals.
14.2.2 Interactive Graphics
The overlapping of the two graphics is done with the plus symbol +
(e.g., lines + points1 ). Therefore, with lines + points1 ,
Altair first draws the lines, then overlays the points on them. This is
the reason to specify the size in the line plot: it is created first. The
opposite would be necessary if we reversed the order (i.e.,
points1 + lines ). We do the same for the second graphic with
lines + points0 ; finally, the two plots are aligned horizontally.
The full script follows, and Figure 14.14 shows the result.
highlight= alt.selection_point(on='mouseover',
               fields=['Region Name'], nearest=True)
base= alt.Chart(df2_ext).encode(
    x= alt.X('Year:O', axis= alt.Axis(title='Year')),
    y= alt.Y('Per_capita_Exp(x1000):Q',
             axis= alt.Axis(title='Mean Per_capita Expenditure (thousands)')),
    color= alt.Color('Region Name:N',
                     scale= alt.Scale(scheme='magma'),
                     legend= alt.Legend(title="Regions",
                                        orient="right")))
points0= base.mark_point().encode(
    opacity= alt.value(0)
).add_params( highlight )

# Line plot
lines= base.mark_line().encode(
    size= alt.condition(~highlight, alt.value(1), alt.value(3))
).properties( width=300, height=300 )

# Selection criteria
selection= alt.selection_point(nearest=True,
               on='mouseover',
               fields=['Year'], empty=False)
# Line plot
line= alt.Chart(df2_ext).mark_line().encode(
    x= alt.X('Year:O', axis= alt.Axis(title='Year')),
    y= alt.Y('Per_capita_Exp(x1000):Q',
             axis= alt.Axis(title='Mean Per_capita Expenditure (thousands)')),
    color= alt.Color('Region Name:N',
                     scale= alt.Scale(scheme='magma'),
                     legend= alt.Legend(title="Regions",
                                        orient="right")))

points0= alt.Chart(df2_ext).mark_point().encode(
    x='Year:O',
    opacity= alt.value(0),
).add_params( selection )

rules= alt.Chart(df2_ext).mark_rule(color='gray').encode(
    x='Year:O'
).transform_filter( selection )
alt.layer(
line, points0, points1, rules, text
).properties( width=300, height=500 )
The visual effect of this solution may vary from case to case. In
particular, visualizing the textual values is effective when they are
sufficiently separated to be clearly read. On the contrary, if the
lines of the line plot are too close to one another, the textual
labels will overlap, becoming practically unreadable, and the overall
effect will look confused. Figure 14.15a and Figure 14.15b show two
screenshots for x coordinates that let the textual labels be read
sufficiently well; that would not be the case for years where lines
are very close to each other.
Figure 14.15 (a) Line plot with mouse hover and coordinated visualization of all values
and the vertical segment for the corresponding year (example with year 2019). (b) Same
for year 2018.
After scatterplots and line plots, we consider bar plots, the typical
graphic type for categorical variables. As before, we start from the
static definition, followed by the interactive components. We will see
some of the main aspects; for the full list, we refer the reader to
the official Altair documentation. As data, we will use the dataset
Crime at Sea: A Global Database of Maritime Pirate Attacks (1993–2020).
df= pd.read_csv("datasets/Pirate_Attacks/pirate_attac
Figure 14.16 Line plot with mouse hover and coordinated visualization in all facets for the
corresponding year (example with year 2010).
df1= df.groupby(['Year',"Month"])[['date']].\
count().reset_index().\
rename(columns= {"date": "Attacks"})
     Year  Month  Attacks
0    1993      1       11
1    1993      2       13
2    1993      3       10
3    1993      4       13
4    1993      5        9
…       …      …        …
331  2020      8        8
332  2020      9        8
333  2020     10       18
334  2020     11       24
335  2020     12       16
df2= df1.groupby('Year')[['Attacks']].sum().reset_index()

plot= alt.Chart(df2).mark_bar(fill='lightblue').encode(
    x='Year:O',
    y= alt.Y('Attacks:Q',
             axis= alt.Axis(title='Number of Pirate Attacks')))
Figure 14.17 (Left): Bar plot with segment for the arithmetic mean.
As a second basic example, we use the original data frame df1 and
the native Altair aggregation features ( aggregate='sum' ); then we
plot it horizontally by exchanging the axes definitions and add the
actual value at the end of each bar using function mark_text() ,
associated with the bar definition. Attribute text has the sum of
monthly attacks as its value. Figure 14.18 shows the result.
bars= alt.Chart(df1).mark_bar(fill='teal').encode(
    y= 'Year:O',
    x= alt.X(field='Attacks',
             aggregate='sum', type='quantitative',
             axis= alt.Axis(title='Number of Pirate Attacks')))
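The annotation layer is not shown in this excerpt; a minimal sketch,
assuming the aggregate shorthand for the text channel, could be:

# Textual values at the end of each bar (sketch)
text= bars.mark_text(align='left', dx=3, baseline='middle').encode(
    text= alt.Text('sum(Attacks):Q'))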
(bars + text).properties(height=450)
Figure 14.18 (Right): Bar plot with horizontal orientation and annotations.
df1['lag'] = df1['Attacks'].shift(1)
df1['diff']= df1['lag']-df1['Attacks']
plot1= alt.Chart(df1).mark_bar().encode(
    x= alt.X('Date:T', axis= alt.Axis(title=None)),
    y= alt.Y('diff:Q',
             axis= alt.Axis(title='Difference in Number of Pirate Attacks')),
    color= alt.condition(alt.datum.diff>= 0,
                         alt.value("black"), alt.value("orange"))
).properties(height=200, width=800, title='Monthly variation')

plot2= alt.Chart(temp).mark_bar().encode(
    x= alt.X('Year:O', axis= alt.Axis(title=None,
                                      labels=False, ticks=True)),
    y= alt.Y('diff:Q',
             axis= alt.Axis(title='Difference in Number of Pirate Attacks')),
    color= alt.condition(alt.datum.diff>= 0,
                         alt.value("black"), alt.value("orange"))
).properties(height=200, width=800, title='Yearly variation')
# Vertical alignment
plot2 & plot1
trade= pd.read_csv("datasets/UN/HBS2022_5.1Fig1.csv")
Figure 14.19 Diverging bar plots, pirate attacks, yearly and monthly variations.
trade= trade.iloc[0:26,:]
trade['Goods loaded']= pd.to_numeric(trade['Goods loaded'])
trade['Category']= pd.to_numeric(trade['Category'])
trade.columns= ['Year','Goods_loaded']
    Year  Goods_loaded
0   1996         4.758
1   1997         4.953
2   1998         5.631
3   1999         5.683
4   2000         5.984
5   2001         6.020
…      …             …
20  2016        10.247
21  2017        10.714
22  2018        11.019
23  2019        11.071
24  2020        10.645
25  2021        10.985
We are ready for the visualization. First, the bar plot: we aggregate
data by year and define the plot (variable barplot). For goods
loaded, instead, we use two graphics for purely aesthetic reasons: a
line plot (variable line, function mark_line() ) and an area plot
(variable area, function mark_area() ), sketched after the bar plot
below.
# Aggregation
df2= df1.groupby('Year')[['Attacks']].sum().reset_index()

barplot= alt.Chart(df2).mark_bar(color='gray').encode(
    x= 'Year:O',
    y= alt.Y('Attacks:Q',
             axis= alt.Axis(title='Number of pirate attacks')))
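The line and area definitions are not reproduced in the excerpt; a
minimal sketch on the trade data frame (the color and axis title are
assumptions) could be:

# Goods loaded: line plot plus semi-transparent area (sketch)
line= alt.Chart(trade).mark_line(color='darkred').encode(
    x= 'Year:O',
    y= alt.Y('Goods_loaded:Q',
             axis= alt.Axis(title='Goods loaded')))
area= line.mark_area(opacity=0.3, color='darkred')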
alt.Chart(df6).mark_bar().encode(
    x='Year:O',
    y= alt.Y('Attacks:Q',
             axis= alt.Axis(title='Number of pirate attacks')),
    color= alt.Color('country_name:N',
                     scale= alt.Scale(scheme='plasma'),
                     legend= alt.Legend(title="Countries",
                                        orient="right")))
Figure 14.21 Stacked bar plot, pirate attacks, and countries where they took place.
As a final feature for static bar plots, we see how to sort bars with
respect to a quantitative variable. The dataset is still that of pirate
attacks. We need a logical condition to select a subset of values
(function transform_filter ), this time based on the number of
attacks ( alt.datum.Attacks > 50 ). We want the bars, each one
referring to a country, sorted by number of attacks. Attribute
sort='-x' will be specified for axis y, meaning that countries (i.e.,
values of the y-axis) should be sorted in decreasing order with
respect to the number of attacks (i.e., values of the x-axis). We also
add the textual value of the number of attacks at the end of each
bar, as we have seen in a previous example, by using function
mark_text() . Data frame df5 is the result of some common
transformations presented in the Additional Online Material (Figure
14.22).
# Aggregation
data= df5.groupby('country_name')[['Attacks']].\
agg('sum').reset_index()
# Bar plot
plot= alt.Chart(data).mark_bar(
).encode(
    y= alt.Y('country_name:N',
             sort='-x',
             axis= alt.Axis(title='')),
    x= alt.X('Attacks:Q',
             axis= alt.Axis(title=
                 'Number of pirate attacks (1993-2020)'))
).transform_filter(
    'datum.Attacks> 50')
# Textual values
text= plot.mark_text(
align='left', dx=3,
baseline='middle'
).encode( text='Attacks:Q')
plot + text
Figure 14.22 Bar plot with sorted bars and annotations.
data= df5.groupby('country_name')[['Attacks']].\
          agg('sum').reset_index()

selection= alt.selection_point(fields=['country_name'])

change_color1= alt.condition(selection,
                   alt.value('teal'), alt.value('lightgray'))
change_color2= alt.condition(selection,
                   alt.Color('country_name:N'),
                   alt.value('lightgray'))

bar_ordered= alt.Chart(data).mark_bar().encode(
    y= alt.Y('country_name:N', sort='-x',
             axis= alt.Axis(title='')),
    x= alt.X('Attacks:Q',
             axis= alt.Axis(title='Number of pirate attacks')),
    color= change_color1,
).transform_filter('datum.Attacks> 40'
).add_params( selection )

bar_stacked= alt.Chart(df5).mark_bar().encode(
    y= alt.Y('Year:O',
             axis= alt.Axis(title=None)),
    x= alt.X('Attacks:Q',
             axis= alt.Axis(title='Number of pirate attacks')),
    color= change_color2,
).transform_filter('datum.Attacks> 10'
).add_params( selection )

bar_ordered | bar_stacked
Figure 14.23 (a) Synchronized bar plots, default visualization, without selection. (b)
Synchronized bar plots with multiple selections of countries.
14.3.2.2 Bar Plot with Slider
Let us consider a first simple example of a bar plot with a slider. A
slider is a graphical element for selecting a range of values, often
without exact precision if the minimum step is not small, but it is
anyway a popular and handy widget for interactively selecting and
changing ranges. We need to define the slider object as associated
to a range with function alt.binding_range() , by setting the
minimum and maximum values of the scale and the step of the
slider (i.e., the minimum increment associated to a movement of the
slider), as sketched below. After that, we define a base plot (object
base) to be used to instantiate the final bar plots. It will be just an
Altair Chart associated to the data and to the slider selection over a
range of years. Next, the definition of the color scale is added.
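For instance, a slider over the years of the homeless dataset might
be defined as follows (range and initial value are assumptions; with
Altair 5 the initial value is set with value= , while older versions
used init= ):

# Slider widget bound to a point selection on Year (sketch)
slider= alt.binding_range(min=2015, max=2022, step=1, name='Year: ')
select_year= alt.selection_point(fields=['Year'], bind=slider,
                                 value=[{'Year': 2022}])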
base= alt.Chart(df_hl1).add_params(
select_year
).transform_filter(
select_year
).properties(
width=250
)
# Color scale
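The color scale definition is elided above; a minimal placeholder
could be, for instance:

color_scale= alt.Scale(scheme='tableau10')   # hypothetical categorical scheme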
With these elements, we can define the four final plots that will
respond to the interactive selection operated through the slider. For
all of them, category Overall Homeless is omitted from the data
( alt.datum.Category != 'Overall Homeless' ) because, being the
total of all values, it would be too large with respect to the scale
of the individual categories, producing a bad visual effect. Individual
plots are derived from the base plot: plots left and right are bar
plots (function mark_bar() ), while middle1 and middle2 are textual
tables (function mark_text() ). Bars in the left plot are sorted with
function alt.SortOrder() . With the slider, a range of years is
selected, and both the bar plots and the tables are automatically
updated. Finally, the four plots are aligned together. Figure 14.24
shows the result with a certain range of years selected.
left= base.transform_filter(
alt.datum.Category != 'Overall Homeless'
).encode(
y= alt.Y('Category:N', axis=None),
x= alt.X('Value:Q', title='Population',
sort=alt.SortOrder('descending')),
color= alt.Color('Category:N', scale=color_scale,
legend=None)
).mark_bar().properties(height=150,title='Gender')
right= base.transform_filter(
    alt.datum.Category == 'Overall Homeless'
).encode(
    y= alt.Y('Category:N', axis=None),
    x= alt.X('Value:Q', title='Population'),
    color= alt.Color('Category:N', scale=color_scale,
                     legend=None)
).mark_bar().properties(height=50, title='Overall Homeless')
middle1= base.transform_filter(
alt.datum.Category != 'Overall Homeless'
).encode(
y= alt.Y('Category:N', axis=None),
text= alt.Text('Category:N'),
).mark_text().properties(height=150,width=100)
middle2= base.transform_filter(
alt.datum.Category == 'Overall Homeless'
).encode(
y= alt.Y('Category:N', axis=None),
text= alt.Text('Category:N'),
).mark_text().properties(height=50,width=100)
Figure 14.24 Bar plots and tables synchronized with slider, homeless in the United States,
year 2022.
# Base plot
base= alt.Chart(df_hl2).add_params( select_year
).transform_filter( select_year
).properties( height=250 )
middle= base.encode(
x= alt.X('State:N', axis=None),
text= alt.Text('State:N'),
).mark_text().properties( height=20 )
# Plot alignment
…
select_year= alt.selection_point(name='Year',
                 fields=['Year'], bind=slider, init={'Year': 2022})
# Base plot
plot= base.mark_circle().encode(
    y= alt.Y('Category:N', title=None),
    x= alt.X('State:N', title=None),
    size= alt.Size('Value:Q',
                   scale= alt.Scale(domain=(100, 20000, …)),
                   legend= alt.Legend(title='Population', orient="top")),
    color= alt.Color('Category:N',
                     scale= alt.Scale(scheme="darkblue"), legend=None))

plot.properties(title='Homeless 2015-2022, US states and insular territories')
Figure 14.26 (a) Bubble plot and slider, homeless in the US States (year 2022). (b) Bubble
plot and slider, homeless in the US States (year 2021).
14.5.1.1 Heatmaps
14.5.1.2 Histograms
base= alt.Chart(df_h).mark_bar(opacity=0.8).encode(
    x= alt.X('Value:Q', bin= alt.Bin(maxbins=100)),
    y= alt.Y('count():Q'),
).properties(title='Homeless people 2015-2022, US states and insular territories')
base= alt.Chart(df_h).mark_rect().encode(
    x= alt.X('Value:Q', bin= alt.Bin(maxbins=20)),
    y= alt.Y('Time:N'),
    color= alt.Color('count()',
                     scale= alt.Scale(scheme='purpleblue'),
                     legend= alt.Legend(title='Number of points')))
# Overlaid scatterplot
scatter1= alt.Chart(df_h).mark_circle(size=5, color='black'
).encode(
    x= alt.X('Value:Q', title='Values (binned)'),
    y= alt.Y('Time:N', title=None))

scatter2= alt.Chart(df_h).mark_tick(size=15, color='black'
).encode(
    x= alt.X('Value:Q', title='Values (binned)'),
    y= alt.Y('Time:N', title=None))

alt.hconcat(plot1, plot2).properties(
    title='Homeless people 2015-2022, US states and insular territories')
Figure 14.29 Bivariate histogram, 20 bins, and scatterplot, homeless in the United States
(% variation).
Figure 14.30 Bivariate histogram, 20 bins, and rug plot, homeless in the United States (%
variation).
Part III
Web Dashboards
A web dashboard represents the conclusion of a journey into data
visualization projects: the final step of a pipeline that started with
static graphics and moved to interactive ones, which are already
web-oriented.
For our purposes, the most important aspect to learn is the concept
of reactive logic, which is the basis for understanding the functioning
principle of all dashboards, regardless of the specific technology or
tool. Reactive logic is the theoretical ground for learning to
programmatically define reactive events, the core components of
dashboards: the implementation of the logic that intercepts
client-side user interactions with the graphical interface and reacts
to them server-side through the functionalities that have been
defined, by adapting the visual content, modifying the data, and,
most important of all, maintaining the overall consistency of the
information presented. This has to be guaranteed for all users
possibly interacting, simultaneously or not, with the dashboard; each
of them has to always see coherent information, resulting from
her/his own actions.
Both Shiny and Plotly/Dash are rich in functionality and highly
configurable; in this book we could only cover the main features,
those necessary to learn the reactive logic and how to configure the
layout. In addition, all the examples that will be presented could
have been realized in several alternative ways, equally effective and
possibly better. The goal is not to show the best way to produce a
certain case study but to demonstrate the possibilities and inspire
other applications. We will proceed incrementally, step-by-step,
always starting with a simple, rudimentary dashboard and enriching it
with new elements, whether interactive, aesthetic, or related to the
layout.
Another goal is to foster the creativity that dashboards make it
possible to exercise. It is somewhat disappointing that many real
dashboards look too similar to one another, all seemingly derived
from the same few templates. For some applications that is perfectly
fine: there is no need for creativity, just efficient functionality
presented rationally. But that is not always the case; there are
plenty of occasions where creativity would make a remarkable
difference, and it should be exercised, since it does not come for
granted or just as a gift of nature. Last, it should be conceded
that dashboards have made a long journey from their inception to
our days (Part 3, Figure 1).
Figure 1 Design for Tandem Cart, 1850–74, Gift of William Brewster, 1923, The Met, New
York, NY.
Dataset
fluidRow(
column( )
),
With this construct, we specify a row in the virtual grid of variable
size ( fluidRow() ) and, within that row, a column ( column() ). The
column can be configured with a certain width expressed as a number
of columns of the virtual grid: for example, column(6, …) specifies
a width equal to 6 virtual columns, or 50% of the page width, the
virtual grid having 12 columns; column(3, …) corresponds to 25% of
the page width, and so on. This also means that, on a single row,
several columns could be defined, each one possibly with a relative
size, corresponding to graphical elements aligned horizontally.
Several rows, instead, are visualized vertically aligned.
We can now start with a first simple example, just focusing on the
user interface with no server-side actions. First, we import R libraries
tidyverse and shiny and read the dataset with Pisa test results for
low-achieving students from Eurostat. It is in compressed form but
both functions read_csv() and vroom() (in this case package
vroom is necessary) are able to read it directly and extract the CSV
dataset. For ease of comprehension, we replace string EF461,
indicating mathematics tests, with MAT and obtain the list of
countries and tests (i.e., reading comprehension and scientific
knowledge are the other two tests, respectively indicated with READ
and SCI in the following).
library(tidyverse)
library(shiny)

pisa= read_csv("datasets/Eurostat/educ_outc_pisa__custom_4942428_linear.csv.gz")

choice_test= unique(pisa$field)
choice_geo= unique(pisa$geo)
# Title
titlePanel("PISA test: Low achieving 15-year-olds in
            reading, mathematics or science by sex")
Let us consider the two tables. The input elements are the drop-down
menus which, when modified, will communicate the new values to
use for reconfiguring the tables. Shiny defines such elements as
reactive objects, meaning that they can trigger reactive actions in
the server logic, so they have to be monitored for any change.
Function reactive() (and the similar eventReactive() ) is the
main one for the definition of a reactive action in the server logic. In
our case, the reactive action has to be executed if and only if the
corresponding reactive object changes, meaning a new selection is
made through the drop-down menus. We start with the first drop-
down menu, that of the PISA tests. We have defined it with
selectInput("test", "TEST", choices=choice_test) , where
the first attribute is the identifier (attribute inputId ), the second is
the title to be visualized, and the third is the list of values, here
stored in variable choice_test.
These are just the filtering operations; we still have to define them
as reactive actions. For this we need to enclose each one of them
into function reactive() .
Let us pause for a moment. You may have noticed something strange:
why are variables with data written with parentheses (i.e., selected1()
and selected2())? Functions have parentheses, not variables, so why is
that? Here lies a fundamental difference between a dashboard and a
normal R script. In a normal R script, variables are just R objects;
in a Shiny dashboard, there are variables that are common R
objects, but there are also variables that are something different:
reactive objects. Here, selected1() and selected2() are reactive
objects because they depend on input elements and the associated
reactive actions. Specifically, they are of type reactiveExpr, meaning
that technically they are functions and therefore should be written
with parentheses. From this example comes an important rule for
Shiny dashboards: all reactive objects are functions, not simple R
objects.
# Tables rendering
# First table
output$table1 <- renderTable(
selected1() %>%
group_by(geo, sex) %>%
summarize(Mean= mean(OBS_VALUE, na.rm=TRUE))
)
# Second table
output$table2 <- renderTable(
selected2() %>% select(4,5,7,8,9)
)
}
Putting together the user interface and the server logic parts, we can
run the complete Shiny dashboard of our first example. It is still the
bare minimum for a dashboard, but nevertheless it is a fully
functioning and complete dashboard with all fundamental parts.
From this one, we will move on adding elements and complicating
the interface and the logic. Figure 15.1a and Figure 15.1b show two
screenshots with different selections from drop-down menus and
corresponding tables.
fluidRow(
    column(6, selectInput("test", "Test",
                          choices = choice_test, selected='READ')
    )
),
fluidRow(
    column(6, selectInput("country", "Country",
                          choices = choice_geo, selected="IT")
    )
),
fluidRow(
column(4, tableOutput("table1")),
column(4, plotOutput("pisa_MF"))
)
)
Figure 15.2a and Figure 15.2b show two screenshots for different
selections with the corresponding table and plot.
shinythemes::themeSelector(),
More relevant are the changes to be made to the server logic. Let us
start by creating the second plot; then we will deal with style
options. The ridgeline plot, unlike the first line plot, does not
depend on the country selection because it shows all countries. We
have to change the data selection, meaning we create a new reactive
action and reactive object. We also omit missing values and total
values, in order to keep only data for male and female students.
Figure 15.2 (a) Table and plot, test READ and country KR (Korea) selected. (b) Table and
plot, test MAT and country KR selected.
Error in `.getReactiveEnvironment()$currentContext()`:
! Operation not allowed without an active reactive context.
• You tried to do something that can only be done from inside
  a reactive consumer.
list1= reactive(as.list(df1_sort()$geo))
Finally, the last step is sorting with respect to the external list,
which consists of taking the data (reactive object selected2()),
converting the column with country names (geo) into factor type, and
associating its categories (levels) to the sorted list list1. As should
be clear by now, this operation requires a reactive context, being
dependent on the two reactive objects selected2() and list1().
df_elev_factor= reactive(
    selected2() %>%
        mutate(geo= factor(geo)) %>%
        mutate(geo= fct_relevel(geo, list1())) %>%
        arrange(geo)
)
theme(
  panel.background= element_rect(fill='transparent'),
  plot.background= element_rect(fill='transparent', color=NA),
  panel.grid.major= element_line(color='lightgray'),
  panel.grid.minor= element_line(color='lightgray'),
  legend.background= element_rect(fill='lightgray'),
  legend.box.background= element_rect(fill='transparent'),
  axis.text= element_text(size= rel(1.3), color='gray'),
  axis.title= element_text(size= rel(1.3), color='gray')
)
renderPlot({
…
}, bg="transparent")
Let us delve into the technical details. The first element we consider
is thematic automatic styling, which is defined outside the user
interface.
thematic_shiny(font='auto')
With this, thematic functionalities are activated, and graphics are
adapted to the selected theme. Specific fonts or font families, for
instance from Google Fonts, could be indicated, or the choice is left
to the tool with font=’auto’ .
In the user interface, the theme definition is set with bslib function
bs_theme() . We do not specify a certain theme because we want
to use the selector. On the contrary, a theme could be indicated,
attributes are the Bootswatch version (attribute version , the
current one at the time of writing is 5) and the theme’s name
(attribute bootswatch ).
theme = bslib::bs_theme(),
Next, in the user interface, we want to include the widget for
multiple selection. There is a simple version with the standard Shiny
element selectInput() already seen in the previous versions of
the example, which supports attribute multiple=TRUE , making it
possible to select several values from the drop-down menu.
However, widget multiInput() provides a richer layout. The first
attributes are the same as those of function selectInput() ;
specific to it are attribute selected , configured with the elements
selected by default, and attribute options , a list with the
activation of the search functionality and the labels for selected and
nonselected values.
multiInput(
inputId= "country", label= "Countries :",
choices= unique(pisa$COUNTRY),
selected= "United States of America",
width= '100%',
options= list(
enable_search= FALSE,
non_selected_header= "List:",
selected_header= "Selected:"
)
)
tabsetPanel(
    id= "tabs",
tabPanel(
id="IdTab1",
fluidRow(…)
),
tabPanel(
id="IdTab2",
fluidRow(…)
),
…
)
For beginners, this good practice might look like an additional level
of complexity, but it is actually the opposite: with a little effort,
the code becomes much more readable, clear, and manageable. It is a
little effort well spent. The excerpts that follow use a custom
function plot_tabs defined according to this schema.
For the server logic, we could activate the panel Theme customizer
with function bs_themer() to insert the theme selector and make
tests by changing the dashboard’s theme.
observeEvent(input$tabs, {
    … actions for all tab pages
})

observeEvent(input$tabs, {
    if(input$tabs == "tab1"){
        … tab1 actions
    } else if(input$tabs == "tab2"){
        … tab2 actions
    } else {
        … tab3 actions
    }
})
if(input$tabs == "MAT") {
    # tab MAT
    …
    plot and table rendering
    …
} else if (input$tabs == "READ") {
    # tab READ
    …
    plot and table rendering
    …
} else {
    # tab SCI
    …
    plot and table rendering
    …
}
})
Let us introduce the new elements for this second Shiny dashboard.
Customized bslib theme. Preconfigured Bootswatch themes (or those
of shinythemes) are useful and well-made but generic and unoriginal.
In a dashboard project, a certain degree of customization is often
appreciated, not only operational but also aesthetic, instead of just
applying an ordinary graphical theme. For this, a graphic project and
manual customizations are required. In our case, we will present
some examples by manually customizing some elements of the
layout by means of bslib, which supports configuration of the HTML
page and CSS style sheets. We will also make use of Google Fonts
and personalized colors chosen with a color picker (we suggest
modifying the choices of the example and testing different
outcomes).
Sidebar. The first new dashboard element of the user interface is the
sidebar, which could be defined according to the following schema
with functions sidebarLayout() and sidebarPanel() , with
parameter width to set the sidebar width:
sidebarLayout(
    sidebarPanel(
        …
        widgets, text, graphical elements
        …,
        width = …
    ),
    …
)
All these widgets are configured similarly to the ones already seen:
the first attribute, id, is the identifier, needed in the server logic
to handle input or output from/to that widget; then come a title or
text to visualize and some specific attributes, like the minimum and
maximum values for the slider, a list of choices for the checkboxes,
and so on. The same applies to other widgets not presented in this
book.
Main panel. The new user interface element main panel (function
mainPanel() ) defines the page space except the sidebar (and
other similar elements like the navbar, the panel on top of the page
typically used for navigation menus, which we will not use). In the
main panel, we define the usual layout of the user interface, with
rows ( fluidRow() ) and columns ( column() ).
Error in input$selectall:
Can't access reactive value 'selectall' outside of reactive consumer.
Do you need to wrap inside reactive() or observe()?
Now, we can try clicking several times on the button and look at the
console to check the outcome:
ALL RESULTS: 0
RESULTS: Success Success (Claimed)
ALL RESULTS: 1
RESULTS: Accident Attempt Rumored Bad Conditions Bad
Weather Did not Climb Illness, AMS Lack of Supplies Lack
of Time Not to Reach BC Other Route Difficulty Success
Success (Subpeak, ForeSmt) Unknown
ALL RESULTS: 2
RESULTS:
ALL RESULTS: 3
RESULTS: Accident Attempt Rumored Bad Conditions Bad
Weather Did not Climb Illness, AMS Lack of Supplies Lack
of Time Not to Reach BC Other Route Difficulty Success
Success (Subpeak, ForeSmt) Unknown
ALL RESULTS: 4
RESULTS:
…
observeEvent(input$selectall, {
# selectall equals to 0
if(input$selectall == 0) return(NULL)
This piece of code executes the rendering of the data table (function
DT::renderDT() ) as an output element of the user interface. That
function requires the data to render to be in HTML format, and this is
the task of function DT::datatable() , which takes tabular textual
data (a matrix or a data frame) and transforms them into an HTML
table. The data frame is provided by the reactive object
table_data() , which we have created with a filter()
instruction and the original data frame. The resulting HTML table
should be formatted with function DT::formatStyle() , reducing
font size and centering the text, for example. Here comes the subtle
problem. Function formatStyle() has as first attribute table,
which requires an HTML table, the one created with datatable()
and passed with the pipe; with attribute columns, the names of the
columns to be formatted are specified. We want to format all
columns; how can we specify that? The trivial solution is to
explicitly list them all: it works, but it is not a general solution. We
want to specify it so that all columns are automatically formatted.
It does not sound difficult but, instead, it is not as easy as it
looks. To understand the problem clearly, a toy example will help.
First, what we need is to obtain all column names from the table
created with datatable() . As a toy example, we can try using just
datatable() with data frame him, the original one produced by
reading the dataset, and the common R function colnames() .
Then, we test two simple operations: first we format a single column
(i.e., Year) just coloring red its values; second we try the same with
all columns, by using colnames() to obtain the list of names,
expecting to see all columns values colored red.
> colnames(him)
[1] "Year" "Season" "Host" "Nationalities" "Leader (
[7] "Result" "Smtrs" "Dead" "Exped ID" "Nation"
# Tests:
# 1) Just column Year is formatted by coloring red its values
> datatable(data= him) %>% formatStyle(columns="Year",
                                       color='red')
# 2) Same but for all columns by using colnames(.)
> datatable(data= him) %>% formatStyle(columns=colnames(.),
                                       color='red')
The result is that, with the first test, we correctly obtain the values
of column Year colored red. But with the second, using the normal
dot notation from magrittr to specify where to place the data passed
through the pipe, no value is colored red: the formatting has not
been applied to any column. Something is wrong. The problem, as
said, is subtle, and it has to do with the fact that the object
produced by datatable() is not a normal R data frame, so the
traditional dot notation with the pipe does not work. A particular
syntax is needed: .$x$data , meaning that from the object passed
by the pipe (the dot notation), which is an HTML table, we access
its component $x and, within it, the data used by datatable()
( $data ). Writing formatStyle(columns= colnames(.$x$data),
color='red') is certainly not crystal clear as syntax, but it is
correct, and by using it we have all values colored red in our toy
example.
dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)
shinyApp(ui, server)
At any rate, for our purposes and examples, we are not interested in
systematically web scraping a large amount of data, or proprietary
data whose owner forbids web scraping attempts; we limit our
attention to the most innocuous and simple of online data: HTML
tables from static pages. These are easy and problem-free to
retrieve (dynamic content generated by JavaScript is more difficult),
and you can try it without fear of triggering angry reactions.
We wish to retrieve data from two HTML tables, one present in the
Wikipedia page “List of people who died climbing Mount Everest”
(https://en.wikipedia.org/wiki/List_of_people_who_died_climbing_M
ount_Everest), regarding fatal accidents that happened during
expeditions; the other from The Himalayan Database, by selecting
the Peak Ascents Report with Mount Everest code (Peak ID: EVER)
(https://www.himalayandatabase.com/scripts/peaksmtr.php), which
provides the full list of Everest expedition members (at the time of
writing, 11 341 members).
Let us consider the basic logic for retrieving those data by means of
R functionalities and inserting them into the Shiny dashboard. First,
we need package rvest, included into tidyverse. The general idea is
that we read the HTML page corresponding to a certain URL, then
from the page source, we retrieve the table we are interested in and
transform it into a data frame. Let us consider the corresponding
code for the Wikipedia table:
library(rvest)
The first instruction simply assigns to a variable (url) the URL of the
page with the table we wish to read. Then, with function
read_html() we read the page’s HTML source; the first attribute
( x ) should be a local path or a URL. The result, assigned to variable
webpage, is in XML format (a structured format often used for web
content). Figure 16.2 shows part of the XML file visualized with the
RStudio viewer, the main tag <html> is on top and includes tag
<head> , the HTML page’s header, and tag <body> , with page
content.
On the far left of the menu bar, the one with items like Elements,
Console, and Sources, there is a little icon with an arrow and a
square, by clicking on it, it will turn blue, meaning that we can select
page elements just by hovering the mouse on each one of them.
When the mouse hovers on an element, it will be highlighted, and a
tooltip will show its properties. In the panel at the bottom (Elements
menu selected) the corresponding source code is visualized showing
HTML tags and elements.
Now the tricky step. What we need to do is to select the HTML table
element and to do it, a certain amount of patience is required
because you will likely end up selecting many other elements before
catching the table (try to hover on the table border, that would be
easier). At that point, you will see the whole table (and only the
table) highlighted, and the corresponding tooltip will give you the
required information (Figure 16.3 shows exactly that tooltip). In our
case, it states that the CSS selector is
table.wikitable.sortable.jquery-tablesorter . We are
almost done, now we should try selecting with the R code. We try
executing on the console (or in a script as well, of course) function
html_elements() with that selector and look at the result.
data <- html_elements(webpage,
           'table.wikitable.sortable.jquery-tablesorter')
> data
{xml_nodeset (0)}
The nodeset is empty: class jquery-tablesorter is added dynamically
by JavaScript when the page is rendered, so it is not present in the
raw page source read by read_html() . Retrying with the simpler
selector table.wikitable , the result now has something and,
looking at it, we easily recognize that it is the table (we see the
table tag, the tbody tag with column names Name, Date, and so on).
So, we have the table, and the same result would have been obtained
with just table.wikitable as the selector from the start.
One way or another, we put the HTML table into variable data and
just one more step is left before obtaining a data frame. Function
html_table() provides the tabular data, then with common R
functions bind_rows() and as_tibble() the data frame
corresponding to the original table is ready.
data <- html_table(data, header= TRUE)
dead_him <- data %>%
bind_rows() %>%
as_tibble()
Figure 16.4 First data frame obtained through web scraping from an HTML page.
This time we look for the right CSS selector by starting with just
table as a selector. The following is the result we obtain.
data2
{xml_nodeset (5)}
[1] <table width="100%" border="0" cellspacing="0"
cellpadding="0"><tbody>\n<tr>\n<td bgcolor="#2F …
[2] <table width="100%" border="0" cellspacing="0"><t
<td width="15"></td> <td …
[3] <table width="100%" height="79%" border="0" cellp
cellspacing="0"><tbody><tr>\n<td …
[4] <table width="100%" height="100%" border="0" cell
<tbody><tr>\n<td valign="top">\n< …
[5] <table id="Peaks" border="1"><tbody>\n<tr>\n
<th style="width: 40px" align="left"><small> Peak
>
library(tidyverse)
py_bin <- reticulate::conda_list() %>%
filter(name == "r-reticulate") %>%
pull(python)
Sys.setenv(RETICULATE_PYTHON = py_bin)
library(reticulate)
Examples of Shiny dashboards with Altair graphics are quite rare,
and often they are outdated and no longer working, being based on
Altair version 3; that has been superseded by version 4, which
deprecated some functions previously required, now replaced by the
original ones from package vegawidget, the package for which altair
acts as an interface.
We also recommend not to proceed with the third version of our full
dashboard of Himalayan expeditions before having tested the
functioning with this simplified dashboard.
library(shiny)
library(reticulate)
library(vegawidget)
library(altair)

# User Interface
ui <- fluidPage( … )

# Server logic
server <- function(input, output) {
    # Rendering vegawidget
    output$test_altair <- vegawidget::renderVegawidget( … )
}

# Run App
shinyApp(ui= ui, server= server)
For this reason, the user interface requires just one output element,
defined by vegawidget::vegawidgetOutput() ; climb_altair is the
identifier, and the graphic is placed in the same fluidRow, above the
table, as we did with the previous ggplot graphics in other tabs.

tabPanel("SUMMITERS",
    fluidRow(
        br(),
        vegawidget::vegawidgetOutput("climb_altair"),
        p(), hr(),
        column(12, DTOutput("climb"))
    )
),
Now the server logic starts with graphical rendering. The output
element is referred as output$climb_altair and for the
rendering we use function vegawidget::renderVegawidget() . To
create the Altair graphics, we define a custom function called
plot_climb() with three attributes that correspond to the
different data frames required by the graphic types that we will
produce. Those attributes are data_climb1(), data_climb2(), and
data_climb3(), and the corresponding data frames have been
prepared with common data wrangling operations.
vegawidget::renderVegawidget(plot_climb(
data_climb1(),
data_climb2(),
data_climb3()
)
) -> output$climb_altair
We consider now the data frames. They are derived from climb_him,
the one with data read through web scraping from the web page
saved locally from The Himalayan Database.
We start with data_climb1. This requires a simple data aggregation
with rows grouped for nationality (column Citizenship). The number
of members for each nationality is counted and values are stored in
the new column Num_summit. For simplicity, nationalities with less
than 100 members are omitted. These data will be used to produce
a bar plot having on top of each bar the numerical value.
reactive(climb_him %>%
           separate(`Yr/Seas`,
                    into= c('Year','Season'),
                    sep= ' ') %>%
           mutate(Year= as.integer(Year)) %>%
           group_by(Name, Citizenship) %>%
           summarize(Num_summit= n()) %>%
           filter(Num_summit>= 10) %>%
           arrange(desc(Num_summit))) -> data_climb2
Finally, the third data frame data_climb3(), this time clearly more
elaborate than the others. As before, we separate Year from Season
and convert the data type; then we eliminate rows without a valid
value in column Time because they refer to seemingly spurious
entries always duplicated that would create problems in following
operations.
Let us see the details. In a grouped data frame, to obtain, for each
group, the rows with the highest values of a certain column, the
expression filter(rank(desc(column))<=num) can be used, with
num indicating the number of top rows we want. Conversely, to
obtain, for each group, the rows with the lowest values of a certain
column, the form filter(rank(column)<=num) should be used. The
key is that, both being filter operations, the rank expressions are
actually logical conditions, so they can be combined with logical
operators. In our case, we want the disjunction (i.e., OR) and
retrieve just one row for each condition, so as to obtain, for each
person, the expedition where he or she was the oldest OR the one
where he or she was the youngest:
filter(rank(Age)<=1 | rank(desc(Age))<=1) .
In other words, we obtain, for each person, the age at the first and
at the last expedition.
reactive(climb_him %>%
separate(`Yr/Seas`,
into= c('Year','Season'),
sep= ' ') %>%
mutate(Year=as.integer(Year))%>%
filter(Time!="") %>%
group_by(Name) %>%
mutate(Num_summit=n()) %>%
filter(rank(Age)<2 |
rank(desc(Age))<2) %>%
arrange(desc(Num_summit),Name) %>%
select(2,3,7,8,9,10,13) %>%
distinct()
) -> data_climb3
Bar plots Number of expeditions/Name: these are two distinct bar
plots, bar_plot2 and bar_plot3, for Nepalese Sherpas and non-
Nepalese climbers. They use the same data frame but a different
selection condition, respectively:
$transform_filter("datum.Citizenship == 'Nepal'") and
$transform_filter("datum.Citizenship != 'Nepal'") .
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
We read the United Nations dataset with data about tourists’ arrivals
and expenditures.
df= pd.read_csv("datasets/UN/SYB65_176_202209_Tourist-Visitors Arrival and …",
                thousands=',')

df1.columns= ['Country','Year','Expenditure','Arrivals']
df1["Per_capita_Exp(x1000)"]= (df1.Expenditure/df1.Arrivals).round(3)
df2= df1[~((df1.Expenditure.isna()) | (df1.Arrivals.isna()))]
17.1.1 Scatterplot
df2["Year"]= df2["Year"].astype(str)

scatter2= px.scatter(df2, x="Arrivals", y="Expenditure",
              color="Year", size='Per_capita_Exp(x1000)',
              size_max=60, hover_data=['Country'],
              color_discrete_sequence=px.colors.qualitative.…)
scatter2.show()
Figure 17.1 Plotly, scatterplot with default dynamic tooltip.
For the line plot, the Plotly function is px.line() and, again, it has
the usual attributes (see Figure 17.3).
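As a sketch (the aggregation to yearly means is an assumption, not
the book's exact example):

# Minimal Plotly line plot: mean per-capita expenditure by year
df_line= df2.groupby("Year", as_index=False)["Per_capita_Exp(x1000)"].mean()
line= px.line(df_line, x="Year", y="Per_capita_Exp(x1000)",
              title="Mean per-capita expenditure")
line.show()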
17.1.3 Marginals
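In Plotly Express, marginal plots are added with attributes
marginal_x and marginal_y of px.scatter() ; a minimal sketch on
df2, with the marginal types of Figure 17.4:

# Scatterplot with a histogram and a rug plot as marginals
scatter3= px.scatter(df2, x="Arrivals", y="Expenditure",
                     marginal_x="histogram", marginal_y="rug",
                     hover_data=['Country'])
scatter3.show()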
17.1.4 Facets
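Facets are obtained with attributes facet_col or facet_row ; a
sketch faceting by Year (the choice of faceting column is an
assumption):

# Facet visualization, one panel per year
scatter4= px.scatter(df2, x="Arrivals", y="Expenditure",
                     facet_col="Year", facet_col_wrap=4)
scatter4.show()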
Figure 17.4 Plotly, scatterplot with a histogram and a rug plot as marginals.
Figure 17.5 Plotly, facet visualization.
18
Dash Dashboards
For developing dashboards with the Dash framework, it is
recommended to use a Python Integrated Development Environment
(IDE). Support for Jupyter Notebook and JupyterLab exists, but
there are some differences and, in general, a Python IDE will serve
you much better in this case.
The following code should come before all excerpts of code that will
be presented in this chapter. For brevity, these instructions will not
be repeated each time, but they are required to run the examples.
# Plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
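The Dash-related imports are not reproduced in the excerpt above; a
typical preamble for the examples in this chapter (assuming Dash 2.x
and the dash-bootstrap-components package) would be:

# Dash imports assumed by the examples that follow
from dash import Dash, dcc, html, dash_table, Input, Output
import dash_bootstrap_components as dbc
import pandas as pd
import numpy as np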
Here is the excerpt of code for data import of the United Nations’
dataset and the data-wrangling operations required to prepare the
data for visualization.
df1= df.pivot(index=['Country','Year'],
              columns='Series',
              values='Value').reset_index()
df1.columns= ['Country', 'Year', 'Expenditure','Arrivals']
df1["Per_capita_Exp(x1000)"]= (df1.Expenditure/df1.Arrivals).round(3)
df2= df1[~((df1.Expenditure.isna()) | (df1.Arrivals.isna()))]

min_arr= df2.Arrivals.min()
max_arr= df2.Arrivals.max()
country_list= df2.Country.unique()
The result shown in Figure 18.1 does not look impressive, to say the
least: it is practically the same as the simple Plotly graphic.
Nevertheless, the important part is under the hood, because this is
not just a graphic but a full web application and a Dash dashboard.
We will improve it considerably in the remainder of the chapter.
app = Dash(__name__)
app.layout = html.Div([
html.H4('Simple scatterplot'),
dcc.Graph(
id= "graph",
figure= scatter2)
])
if __name__ == '__main__':
app.run_server(host='127.0.0.1', port=8051)
18.2.2 Themes and Widgets
The main attributes of the range slider are:
min , max , and step , for the minimum and maximum values and
the minimum increment when the slider is moved;
value , representing the values shown by default.
Figure 18.1 Dash dashboard with Plotly graphic.
On top of the slider, we may want to add a text, like a title. We can
do this with html.P() (again, the Dash translation of HTML tag
<p> ). All these elements are vertically aligned in the page layout.
app.layout= html.Div([
html.H3('Scatterplot + Slider',
style={
'textAlign': 'center',
'color': 'teal'
}),
dcc.Graph(id="scatter"),
html.P("Tourist arrivals:"),
dcc.RangeSlider(
id='slider',
min= min_arr, max= max_arr, step=5000,
value= [min_arr, max_arr]
)
])
To recap, the logical flow to manage a Dash reactive event is: the
input element is changed, which activates the corresponding callback
( @app.callback() ). The associated custom function (e.g.,
update_scatterplot() ) is executed and a result is produced; for
example, the graphic is recreated (e.g., px.scatter() ) or a table
is recalculated. The result is stored in a variable (e.g., fig in the
example) that is returned, and the dashboard is updated. The
following excerpt of code shows the details of the example. Figure
18.2a and Figure 18.2b show two screenshots of the dashboard, the
first with default slider values and the second after having changed
the slider input.
# Callback definition
# Output: type 'figure', id 'scatter'
# Input: type 'value', id 'slider'
@app.callback(
    Output("scatter", "figure"),
    Input("slider", "value"))
def update_scatterplot(slider_range):
    low, high = slider_range
    mask = (df2['Arrivals']>= low) & (df2['Arrivals']<= high)
    fig = px.scatter(df2[mask],
              x="Arrivals", y="Expenditure", color="Year",
              size='Per_capita_Exp(x1000)', size_max=60,
              hover_data=['Country']
          )
    return fig
if __name__ == '__main__':
app.run_server(port=8051)
dash_table.DataTable(
data=df2.to_dict('records'),
columns=[{'id': c, 'name': c} for c in df2.column
filter_action="native",
sort_action="native",
sort_mode="multi",
column_selectable="single",
row_selectable="multi",
row_deletable=True,
selected_columns=[],
selected_rows=[],
page_action="native",
page_current=0,
page_size=10,
style_as_list_view=True,
style_table={'margin-top': '48px', 'overflowX': '
style_cell={'textAlign': 'left', 'fontSize': 14,
'font-family': 'sans-serif'},
style_data={'backgroundColor': 'white'},
style_data_conditional=[
{
'if': {'row_index': 'odd'},
'backgroundColor': 'rgb(220, 220, 220)',
}
],
style_header={
'backgroundColor': 'teal',
'color': 'white',
'fontWeight': 'bold'
}
)
18.2.5 Color Palette Selector and Data Table Layout
Organization
colorscales= px.colors.named_colorscales()
html.Div([
html.H4('Interactive color scale'),
html.P("Select your palette:"),
dcc.Dropdown(
id= 'dropdown',
options= colorscales,
value= 'viridis'
),
]),
This is for the layout definition; now comes the corresponding
reactive action that applies the selected color palette to the graphic.
We need to define a callback and the associated custom function. The
callback should associate the input from the drop-down menu to the
output represented by the scatterplot graphic. A callback that takes
an input and associates the output to the scatterplot already exists:
the one defined for the slider. We do not need to create a new one;
the existing one can be extended with an additional input
(i.e., Input("dropdown", "value") ), and the corresponding
update_scatterplot() custom function modified to handle the two
inputs: the one from the slider (identifier slider) and the one from
the drop-down menu (identifier dropdown). If the logic is clear, the
code is easy to rewrite. Function update_scatterplot() now has
two parameters: slider_range, with the values from the slider, and
scale, with the selected color palette. With scale, we can just add
attribute color_continuous_scale to the scatterplot to have the
graphic produced with the selected palette; a sketch follows the
callback below.
# Callback
# input slider and dropdown, both of type value
# output scatter of type figure
@app.callback(
Output("scatter", "figure"),
Input("slider", "value"),
Input("dropdown", "value"))
# Custom function
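Following the logic just described, the modified custom function
might look like this (a sketch; it assumes column Year is numeric
here, so that a continuous color scale applies):

def update_scatterplot(slider_range, scale):
    low, high = slider_range
    mask = (df2['Arrivals'] >= low) & (df2['Arrivals'] <= high)
    fig = px.scatter(df2[mask],
              x="Arrivals", y="Expenditure", color="Year",
              size='Per_capita_Exp(x1000)', size_max=60,
              hover_data=['Country'],
              color_continuous_scale=scale)   # palette from the dropdown
    return fig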
dbc.Row(
[
dbc.Col(
html.Div(),
xs=12, sm=12, md=3, lg=3, xl=3,
),
dbc.Col(
html.Div(
dash_table.DataTable(
data= df2.to_dict('records'),
columns= [{'id': c, 'name': c} for
df2.columns],
… )
),
xs=12, sm=12, md=3, lg=3, xl=6,
),
dbc.Col(
html.Div(),
xs=12, sm=12, md=3, lg=3, xl=3
)
], className="g-0" # This removes space between col
)
])
Figure 18.4 (a) Color palette selector and centered, resized data table (example 1). (b)
Color palette selector and centered, resized data table (example 2).
The first new step is to add the sidebar with some widgets.
# SIDEBAR
# MAIN PAGE
# First row
content_first_row= dbc.Row(
    [
        dbc.Col(…),
        dbc.Col(…)
    ])

# Second row
content_second_row= dbc.Row(
    [
        dbc.Col(…),
        dbc.Col(…)
    ])
content= html.Div(
[
content_first_row,
content_second_row,
],
style= CONTENT_STYLE)
sidebar = html.Div(
[
html.H4('Controls', style= TEXT_STYLE),
html.Hr(),
html.P('Countries:', style= TEXT_STYLE),
dcc.Dropdown(id= "dropdown",
options= country_list,
value= ['Italy'],
multi= True),
html.Br(),
dcc.Checklist(id= "checklist",
              options=[{'label': 'All countries',
                        'value': 'AllC'}],
              value=['AllC']
)
],
style=SIDEBAR_STYLE,
)
For managing the checkbox, we have to modify the callback and the
associated custom function; it is an adaptation of the logical
conditions selecting the rows from the data frame.
The logic is: if the checkbox is selected, then all countries should be
included, meaning that no row selection is required and the choices
from the drop-down menu should be ignored; otherwise, if the
checkbox is not selected, then only the rows corresponding to the
countries selected through the drop-down menu should be presented.
For the slider, only rows corresponding to countries with tourist
arrivals in the selected range will be presented.
This is for the first callback; a second one is now needed because
we also want the data table to be reactive and reconfigure itself
based on input selection from the drop-down menu for country
selection and the All countries checkbox. The output should be of
type data . The logic behind the reactive event associated to the
data table is equivalent to the one for the scatterplot and depends
on the same two inputs. The callback associated to the data table
should have its corresponding custom function ( update_table() )
for calculating the table values and the rendering. The following
excerpt of code presents the solution: Figure 18.5a and Figure 18.5b
are two screenshots showing the result, with the All countries option
or a list of countries selected.
@app.callback(
    Output("scatter", "figure"),
    Input("slider", "value"),
    Input("dropdown", "value"),
    Input("checklist", "value"))
# Signature and row selection reconstructed from the logic described above
def update_scatterplot(slider_range, dropdown_selection, checkbox_value):
    low, high = slider_range
    mask = (df2['Arrivals'] >= low) & (df2['Arrivals'] <= high)
    if not checkbox_value:
        mask = mask & df2['Country'].isin(dropdown_selection)
    fig= px.scatter(df2[mask],
             x= "Arrivals", y= "Expenditure", color= "Year",
             size= 'Per_capita_Exp(x1000)', size_max=60,
             hover_data= ['Country'],
             color_continuous_scale= 'geyser')
    return fig
@app.callback(
Output("datatable1", "data"),
Input("dropdown", "value"),
Input("checklist", "value"))
A first novelty of this version is the external CSS style sheet, whose
reference is stated at the beginning of the script. The one referred
to is a widely used CSS style sheet, also mentioned in the official
Dash documentation; many others are available, as well as the
possibility of customizing a CSS of your own. Technically, to link an
external CSS, attribute external_stylesheets of function Dash()
should be used. With the same attribute, we can also select the
theme, in this case the dark theme SLATE from Bootswatch. With
load_figure_template('slate') , the theme is loaded and ready
to be applied.
dbc_css= "https://cdn.jsdelivr.net/gh/AnnMarieW/
dash-bootstrap-templates@V1.0.2/dbc.min.css"
load_figure_template("slate")
html.Hr(),
html.P('Axis:', style=TEXT_STYLE),
dcc.Markdown("_X Axis:_"),
dcc.RadioItems(list(df2.columns), 'Arrivals',
               id='radio_X', inputStyle= RADIO_STYLE),
html.Br(),
dcc.Markdown("_Y Axis:_"),
dcc.RadioItems(list(df2.columns), 'Expenditure',
               id='radio_Y', inputStyle= RADIO_STYLE)
# Callback
@app.callback(
Output("scatter", "figure"),
Input("slider", "value"),
Input("dropdown", "value"),
Input("checklist", "value"),
Input("radio_X", "value"),
Input("radio_Y", "value")
)
# Custom function (signature reconstructed from the inputs above)
def update_scatterplot(slider_range, dropdown_selection,
                       checkbox_value, radio_X, radio_Y):
    …
    fig= px.scatter(df2[mask],
             x= radio_X, y= radio_Y, color="Year",
             size='Per_capita_Exp(x1000)', size_max=60,
             hover_data=['Country'],
             color_continuous_scale= px.colors.sequential.…)
    fig.update_layout(plot_bgcolor='rgba(0, 0, 0, 0)',
                      paper_bgcolor='rgba(0, 0, 0, 0)')
    return fig
18.3.4 Bar Plot
content_first_row= dbc.Row(
[
dbc.Col([
dcc.Graph(id= "scatter"),
…
]),
dbc.Col([
dcc.Graph(id= "bar")
], width=4)
], className="g-0")
# Callback
@app.callback(
Output("bar", "figure"),
Input("slider", "value"),
Input("dropdown", "value"),
Input("checklist", "value")
)
df2["Year"]= df2["Year"].astype(str)
df2['Country']= df2.Country.str.replace(
'United States of America', 'US
fig.update_layout(plot_bgcolor='rgba(0, 0, 0, 0)',
paper_bgcolor='rgba(0, 0, 0, 0)')
return fig
18.3.5 Container
if __name__ == '__main__':
app.run_server(port='8051')
Figure 18.6a shows the default appearance of the dashboard with all
elements and the dark theme. Figure 18.6b presents the details of
the scatterplot reconfigured according to the selection of radio
buttons (Per capita expense on axis y) and the dynamic tooltip.
Figure 18.6c shows the scatterplot further reconfigured with years
on axis y and the bar plot adapted according to the selection on the
legend (years 2010 and 2018 selected).
To introduce tabs, this time we start from the end. The final result
we have to achieve, in order to assemble tabs in a correct Dash
layout, is an organization similar to what is shown in the following
excerpt of code. The final Container combines the sidebar and the
tabs objects, meaning that tabs are not part of the sidebar and
include the main content; this is the first thing to know.
Moving backward, we should define the main tab context. Function
dcc.Tabs() specifies the general multi-page layout, while single
tabs are defined with function dcc.Tab() . For each tab, the layout
is better specified by defining variables (e.g., content_tab1 and
content_tab2), for the same reasons we previously divided the
content into a sidebar object, a first row, a second row, and so on.
Such an organization is orderly and clear; it helps to reduce
complexity and to ease the readability and maintenance of the code.
It also helps in associating different graphical styles to tabs, for
example, to differentiate the selected one from the others.
Figure 18.6 (a) Dash dashboard, default appearance. (b) Detail of the scatterplot
reconfigured by changing variable on axis y. (c) Scatterplot reconfigured with another
variable on axis x and bar plot adapted to selection on the dynamic legend.
tabs= dcc.Tabs([
    dcc.Tab(label='Countries', children=[
        content_tab1
    ], style= TAB_STYLE, selected_style= TAB_SELECTED_STYLE),
    dcc.Tab(label='Cities', children=[
        content_tab2
    ], style= TAB_STYLE, selected_style= TAB_SELECTED_STYLE)
])
# First row
content_first_row= dbc.Row(
    [
        dbc.Col([
            …
        ]),
        dbc.Col([
            …
        ])
    ])

# Second row
content_second_row= dbc.Row(
    [
        dbc.Col(
            …
        )
    ])
# HTML div
content_tab1= html.Div(
[
content_first_row_tab1,
html.Hr(),
content_second_row_tab1,
],
style= CONTENT_STYLE
)
Web scraping in Python is very easy, at least for basic cases like
collecting an HTML table from a static page. The main function is
offered by pandas and is pd.read_html() ; the attribute to specify
is a URL. For example, we read the HTML tables contained in the
Wikipedia page List of cities by international visitors
(https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors
), as sketched below.
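The actual call, not shown in the excerpt, might look like this:

# Read all HTML tables from the page; pd.read_html returns a list
url= ("https://en.wikipedia.org/wiki/"
      "List_of_cities_by_international_visitors")
dfs= pd.read_html(url)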
The list dfs contains the result, and dfs[0] is the data frame
corresponding to the table. Values of column Growth in arrivals
(Euromonitor) carry the symbol %, which should be removed to
transform them into numeric type. Furthermore, the symbol used as
the negative sign is not the minus sign but actually a dash, so it
should be replaced with the correct symbol; otherwise, the value is
not recognized as a negative number in the type transformation. A
sketch of this cleaning follows.
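A sketch of the cleaning (assuming the page uses a Unicode dash as
the negative sign):

col= 'Growth in arrivals (Euromonitor)'
dfs[0][col]= (dfs[0][col].astype(str)
                 .str.replace('%', '', regex=False)
                 .str.replace('−', '-', regex=False)  # Unicode minus -> hyphen
                 .astype(float))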
# First row
content_first_row_tab2= dbc.Row(
[
dbc.Col([
html.P("Top 20 cities for growth in arrivals
dcc.Graph(id="bar2"),
], width=6),
dbc.Col([
html.P("Top 20 cities for arrivals (2018)"),
dcc.Graph(id="bar3")
], width=6)
], className="g-0")
# Second row
content_second_row_tab2= dbc.Row(
[
dbc.Col(
html.Div(
dash_table.DataTable(
data=dfs[0].to_dict('records'),
id="datatable2",
…
)
)
]
)
# Tab's content
content_tab2= html.Div(
[
content_first_row_tab2,
html.Hr(),
content_second_row_tab2,
],
style= CONTENT_STYLE
)
Bar plot (id=bar2): with this bar plot, we want to show cities in
order of growth in tourist arrivals (column Growth in arrivals
(Euromonitor)). Countries are selected either from the list of the
drop-down menu or through the All countries checkbox. The logic is
that, if the checkbox is selected, then all countries are considered
and we show the first 20 cities in decreasing order of growth in
tourist arrivals; if the checkbox is not selected, the cities plotted in
the bar plot are those of the countries selected with the drop-down
menu. We also want another graphical effect: bars should be colored
differently depending on whether they represent a positive or a
negative increment; for this reason, we create the new column Color
with a textual value. We add the dynamic tooltip with attribute
hover_data . Finally, attribute barmode='relative' of function
px.bar() indicates to draw the bar plot relative to the value zero,
meaning that bars with positive and negative values take opposite
directions. What we produce is a diverging bar plot, and Plotly
supports it natively. For the sake of precision, option
barmode='relative' is not strictly necessary, being the default;
we show it for clarity. Other possible values are overlay, where bars
of the same group are drawn over one another, and group, to have
bars of the same group beside each other.
Bar plot (id=bar3): the second bar plot differs from the first in the
sorting criterion applied when the All countries checkbox is selected.
In this case, countries are sorted according to Euromonitor's ranking
(column Rank (Euromonitor)) and the first 20 by rank are visualized.
In the Plotly bar plot, we add attribute color_discrete_map to
associate different colors with the values Negative and Positive of
column Color.
# Callback for bar2 (decorator and function signature assumed,
# analogous to the bar3 example below)
@app.callback(
    Output("bar2", "figure"),
    Input("dropdown", "value"),
    Input("checklist", "value")
)
def update_bar2(dropdown_selection, checkbox_value):
    temp1= dfs[0].copy(deep=True)
    temp1["Color"]= np.where(
        temp1["Growth in arrivals (Euromonitor)"] < 0,
        'Negative', 'Positive')
    if checkbox_value:
        data= temp1.sort_values(by='Growth in arrivals (Euromonitor)',
                                ascending=False).head(20)
    else:
        mask= temp1['Country / Territory'].isin(dropdown_selection)
        data= temp1[mask].sort_values(
            by='Growth in arrivals (Euromonitor)', ascending=False)
    # Bar plot
    fig= px.bar(data, x="Growth in arrivals (Euromonitor)",
                y="City", barmode='relative',
                orientation='h', color="Color",
                hover_data={'Color': False, 'City': False,
                            "Country / Territory": True,
                            "Arrivals 2018 (Euromonitor)": True},
                labels={"City": ""}
                )
    fig.update_layout(showlegend=False,
                      plot_bgcolor='rgba(0, 0, 0, 0)',
                      paper_bgcolor='rgba(0, 0, 0, 0)')
    return fig
@app.callback(
    Output("bar3", "figure"),
    Input("dropdown", "value"),
    Input("checklist", "value")
)
def update_bar3(dropdown_selection, checkbox_value):
    temp2= dfs[0].copy(deep=True)
    temp2["Color"]= np.where(
        temp2["Growth in arrivals (Euromonitor)"] < 0,
        'Negative', 'Positive')
    if checkbox_value:
        data= temp2.sort_values(by='Rank (Euromonitor)',
                                ascending=True).head(20)
    else:
        mask= temp2['Country / Territory'].isin(dropdown_selection)
        data= temp2[mask].sort_values(
            by='Growth in arrivals (Euromonitor)', ascending=False)
    # Bar plot
    # … px.bar() call analogous to bar2, with color_discrete_map
    # associating colors to the values of column Color …
    fig.update_layout(barmode='relative', showlegend=False,
                      plot_bgcolor='rgba(0, 0, 0, 0)',
                      paper_bgcolor='rgba(0, 0, 0, 0)')
    return fig
dbc_css= "https://cdn.jsdelivr.net/gh/AnnMarieW/\
dash-bootstrap-templates@V1.0.2/dbc.min.css"
app= Dash(__name__,
          external_stylesheets= [dbc.themes.UNITED, dbc_css])
load_figure_template("united")
The association with the CSS style sheet (which should be placed in
the same directory as the dashboard's Python file) is managed by
Dash Core Components objects; in our case, the specific tab page
created with dcc.Tab(). Attribute style, specified in previous versions
of the dashboard, is no longer needed (in the code it has been
commented out, for clarity) and is replaced with references to CSS
classes, such as className='custom-tabs' and
selected_className='custom-tab--selected', with custom-tabs and
custom-tab--selected being the names of directives defined in the
external CSS tabs.css. The following excerpt of code shows these
references.
tabs= dcc.Tabs(
    parent_className='custom-tabs',
    className='custom-tabs-container',
    children=[
        # First tab
        dcc.Tab(label='Countries',
                className='custom-tabs',
                selected_className='custom-tab--selected',
                children=[content_tab1]
        ),
        # Second tab
        # Replaced: style=TAB_STYLE, selected_style=TAB_SELECTED_STYLE
        dcc.Tab(
            label='Cities',
            className='custom-tabs',
            selected_className='custom-tab--selected',
            children=[content_tab2],
        ),
        # Third tab
        # Replaced: style=TAB_STYLE, selected_style=TAB_SELECTED_STYLE
        dcc.Tab(
            label='Altair charts',
            className='custom-tabs',
            selected_className='custom-tab--selected',
            children=[content_tab3],
        )
    ])
From the previous excerpt of code, you have probably noticed that a
third tab, titled Altair charts, has been defined. It is similar to the
others and refers to the local variable content_tab3 for its layout,
which is presented next.
content_first_row_tab3= dbc.Row(
[
dbc.Col([
html.P("Altair interactive graphics
(interactive legend example)"),
html.Iframe(
id= 'altair1',
width="900",
height="1500"
)
])
]
)
content_tab3= html.Div(
[
content_first_row_tab3,
],
style= CONTENT_STYLE
)
The callback is the trickiest part. Let us start with the definition of
input and output parameters. We want Altair graphics to be reactive,
as we did with Plotly graphics; otherwise, they would just be static
HTML objects included in an iframe. For the example, we chose to
make the Altair graphics react to changes in the already defined
drop-down menu and All countries checkbox, so by changing the
selection of those input elements in the sidebar, the Altair graphics
are recreated.
With the data frame prepared for visualization, the Altair graphics
can be defined: a bar plot and a scatterplot. They are both
interactive by means of the dynamic legend of the scatterplot, which
allows for the selection of countries. The selection modifies the colors
of markers and bars, highlighting those corresponding to the selected
countries and turning transparent those of non-selected countries (in
Part II, we have seen the same example with an Altair scatterplot).
In the bar plot, we want a different coloring for bars associated with
positive or negative values. Finally, the two charts are vertically
aligned, and the background is made transparent. Another difficulty
was sizing the iframe, which is delicate and requires some tests
before finding a correct setting. As we were saying at the beginning
of this section, integrating Altair graphics into Dash requires patience
and several tries: the more elaborate the layout, the more delicate
the placing and sizing of the iframe (the layout of this example is
simple). However, Altair is worth the effort because it can produce an
excellent outcome.
# Callback (function signature assumed; opacity values, axis title,
# color scheme, and legend title are reasonable completions of the
# truncated original)
@app.callback(
    Output('altair1', 'srcDoc'),
    Input("dropdown", "value"),
    Input("checklist", "value")
)
def update_altair(dropdown_selection, checkbox_value):
    temp3= dfs[0].copy(deep=True)
    temp3= temp3.rename(columns={"Country / Territory": "Country"})
    temp3= temp3.groupby('Country')\
                [['Arrivals 2018 (Euromonitor)',
                  'Arrivals 2016 (Mastercard)',
                  'Income (billions $) (Mastercard)']].\
                agg('sum').reset_index()
    temp3['Diff_Arr_percent']= \
        100*(temp3['Arrivals 2018 (Euromonitor)'] -
             temp3['Arrivals 2016 (Mastercard)'])/ \
        temp3['Arrivals 2016 (Mastercard)']
    # Data selection
    if checkbox_value:
        data= temp3.sort_values(by='Income (billions $) (Mastercard)',
                                ascending=False).head(20)
    else:
        mask= temp3['Country'].isin(dropdown_selection)
        data= temp3[mask].sort_values(
            by='Income (billions $) (Mastercard)', ascending=False)
    # ALTAIR CHARTS
    selection= alt.selection_point(fields=['Country'],
                                   bind='legend')
    # Opacity values assumed: full for selected, faint for non-selected
    change_opacity= alt.condition(selection, alt.value(1.0),
                                  alt.value(0.2))
    # Bar plot
    bar_alt= alt.Chart(data).mark_bar().encode(
        y= alt.Y('Country:O', axis=alt.Axis(title='')),
        x= alt.X('Diff_Arr_percent:Q',
                 axis= alt.Axis(title='Difference in arrivals (%)')),
        color= alt.condition(alt.datum.Diff_Arr_percent >= 0,
                             alt.value("#325ea8"),
                             alt.value("#ad0a72"),
                             ),
        opacity= change_opacity,
        tooltip=['Arrivals 2018 (Euromonitor)',
                 'Arrivals 2016 (Mastercard)',
                 'Income (billions $) (Mastercard)']
    ).properties(title='Percent Difference in arrivals')
    # Scatterplot
    scatter_alt= alt.Chart(data).mark_circle(size=200).encode(
        y= alt.Y('Arrivals 2018 (Euromonitor)',
                 type='quantitative',
                 axis=alt.Axis(title='Arrivals')),
        x= alt.X('Income (billions $) (Mastercard)',
                 type='quantitative',
                 scale= alt.Scale(domain=[0, 60])),
        color= alt.Color('Country:O',
                         scale= alt.Scale(scheme='category20'),
                         legend= alt.Legend(title="Country",
                                            orient="right")),
        opacity= change_opacity,
        tooltip=['Country', 'Arrivals 2018 (Euromonitor)',
                 'Income (billions $) (Mastercard)']
    ).add_params(selection
    ).properties(title='Income and arrivals 2018')
    # Vertical alignment and transparent background
    chart= (bar_alt & scatter_alt).configure(background='transparent')
    chart.save('iframes/altair_chart.html')
    return chart.to_html()
The complete code for this dashboard version, together with the
external CSS tabs.css, is available in the Additional Online Material -
Fourth Dashboard: Interactive Altair graphics, custom CSS, and light
theme. Figure 18.8a shows the first tab with the personalized theme
and the scatterplot and bar plot reconfigured according to the slider
selection. Figure 18.8b presents the second tab with the two bar
plots. Figure 18.8c shows the third tab with the default configuration
of the Altair plots; Figure 18.8d shows the same tab with the Altair
plots reconfigured based on a subset of countries.
Figure 18.8 (a) First tab, data table, reactive graphics, and layout. (b) Second tab, bar
plots, and data table from web scraping. (c) Third tab, interactive Altair graphics, and
default configuration. (d) Third tab, country selection, and reconfigured Altair graphics.
Part IV
Spatial Data and Geographic Maps
The visualization of spatial data and geographical maps is a broad
and relatively recent area of data visualization which, in some
respects, comes close to and sometimes partially overlaps traditional
cartography and the geographical maps produced with Geographical
Information Systems (GISs). In this last part of the book, we
introduce the main techniques available in the R and Python
environments, while cartographic techniques and GISs remain out of
scope, being a technical and scientific sector clearly distinct from
data visualization and data science, with its own peculiarities, skills,
and practices.
(https://www.salute.gov.it/anagcaninapublic_new/AdapterHTTP).
(https://www.salute.gov.it/portale/p5_0.jsp?lingua=italiano&id=50,
http://creativecommons.org/licenses/by/3.0/it/legalcode)
(http://dati.istat.it/Index.aspx?DataSetCode=DCIS_POPRES1)
(https://www.dati.gov.it/content/italian-open-data-license-v20)
(https://dati.comune.roma.it/catalog/dataset/d386)
(https://dati.comune.roma.it/catalog/dataset/suar2023)
Copyright: Creative Commons Attribution License (cc-by)
(https://opendefinition.org/licenses/cc-by/)
As usual, let us start from the basics with some simple examples.
With these, we will produce some rudimentary maps, useful for
learning the logic and principles of data visualization with spatial
data and geographic maps.
For the first example, we use the R package maps, which contains
some maps, not particularly up to date but handy for a start.
library(tidyverse)
library(lubridate)
library(maps)
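The object map1 inspected below is not created in this excerpt; a
minimal sketch, assuming the usual idiom of package maps with the
world database:

# Sketch (assumed): world map object, computed without plotting
map1= map('world', plot=FALSE)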
str(map1)
List of 4
$ x : num [1:10671] -69.9 -70.1 -70.1 -69.9 NA …
$ y : num [1:10671] 12.5 12.5 12.6 12.5 NA …
$ range: num [1:4] -180 190.3 -85.2 83.6
$ names: chr [1:1627] "Aruba" "Afghanistan" "Angola"
Figure 19.2 shows the generated map. This time it is Italy, with the
scale and axes, whose values are expressed as degrees of longitude
East and latitude North. As before, object map1 is a list, and the key
names now has eight values. We can look at them: they correspond
to Italy and its major islands. As will become clear in the following,
there is a technical reason for not mapping the country as a single
whole but with its main islands kept separate, which has to do with
the technique employed to represent planar surfaces as spatial data.
For a hint about the reason, the reader could try another country, for
example, the United States (i.e., region='US'). They will find that
also in that case there is one name for the United States,
representing the continental region south of Canada, and several for
Hawaii, which is an archipelago, but also a distinct name (actually
more than one) for Alaska, which is not an island but a territory
geographically disconnected from the other US states on the
continent. The logic should be clear: a geographical region can be
represented with spatial data as a unique object only if there is
territorial continuity, not if there are disconnected parts. In that
case, each disconnected part, to be mapped with spatial data, has to
be represented individually; hence, the major islands and
geographically disconnected regions are mapped separately from the
main portion of a country's territory.
Figure 19.2 Italy’s border map.
map1$names
[1] "Italy:Isola di Pantelleria" "Italy:Sicily"
"Italy:Sant'Antonio" "Italy:Forio"
[5] "Italy:Asinara" "Italy:Sardinia"
"Italy:Isola d'Elba" "Italy"
head(italy)
# A tibble: 6 × 6
long lat group order region subregion
<dbl> <dbl> <dbl> <int> <chr> <chr>
1 11.8 46.5 1 1 Bolzano-Bozen <NA>
2 11.8 46.5 1 2 Bolzano-Bozen <NA>
3 11.7 46.5 1 3 Bolzano-Bozen <NA>
4 11.7 46.5 1 4 Bolzano-Bozen <NA>
5 11.7 46.5 1 5 Bolzano-Bozen <NA>
6 11.6 46.5 1 6 Bolzano-Bozen <NA>
After the conversion into a tibble (i.e. a data frame type), we see
that the first two columns represent longitude and latitude. We also
see that there is information associated with each row, like the
specific region (column region) and, possibly, a subregion. The
excerpt of code shows rows about the Italian province of
Bolzano–Bozen, a northern area at the border with Austria. Note that
there are multiple rows for this province; we can verify the number
of rows associated with each Italian province, as the following sketch
shows.
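A minimal sketch of that count (dplyr verbs assumed):

# Number of rows for each region (here, Italian provinces)
italy %>%
  group_by(region) %>%
  count()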
# A tibble: 95 × 2
# Groups: region [95]
region n
<chr> <int>
1 Agrigento 146
2 Alessandria 105
3 Ancona 68
4 Aosta 110
5 Arezzo 105
# … with 90 more rows
From this, we learn that each province, meaning a certain
geographical region (the same would hold for states or counties in
the US), has a different number of rows associated with it, each row
holding a pair of longitude and latitude coordinates. What is the
meaning of those rows and coordinates? Those coordinates refer to
the specific way planar surfaces, for example, geographic areas, are
represented in such maps: through the juxtaposition of small
polygonal elements that approximate the real shape of a geographic
area. Those polygonal elements are not visible in the map, but they
exist and correspond to the single rows of the data. This explains
why different areas (e.g. Italian provinces) are represented with a
different number of rows: it depends on the number of polygons
used to approximate the real shape and border of each area. There
exist other ways to represent geographic elements, other than with
polygons, depending on their type; if they are not planar surfaces,
they can be represented with points or lines. We will see examples.
We can plot the map that corresponds to data frame italy with
ggplot and function geom_polygon(). Columns long and lat will be
associated with the Cartesian axes x and y, while the group aesthetic
will be assigned to column group. Function geom_polygon() supports
style options like color and linewidth for the borders, as well as a fill
color for the areas. Graphical theme theme_void is the common
choice for maps, being devoid of graphical elements like grids, axes,
and so on. Figure 19.3 shows the corresponding map, produced with
a sketch like the following.
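A minimal sketch of the plot just described; the fill and border colors
are assumptions:

ggplot(italy, aes(x= long, y= lat, group= group)) +
  geom_polygon(fill= "ghostwhite", color= "gray40",
               linewidth= 0.2) +
  theme_void()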
The reader could replicate this example with any other country,
provided it is present in the map package.
Figure 19.3 Provinces of Italy.
What we have seen so far is the basis for working with spatial data
and geographic maps. Now, we want to create our first choropleth
map. The logic is that we have data about something (e.g.
population data) related to territorial areas at a certain granularity
(e.g. country, state, county, region, or province), and we need a
map with the corresponding areas as spatial data. Or, vice versa, we
have a map representing certain areas, and we need corresponding
data for a phenomenon of interest. Given the two elements, data
and map, the areas are colored to represent data values according
to a certain color scale. One of the main reasons for the diffusion of
choropleth maps is that both maps at different granularities and data
about territorial areas have become more available in recent years;
another is that they are eye-catching, easy to understand, and easy
to produce.
The color scale used in choropleth maps follows the same rules as in
traditional graphs: when a continuous value has to be represented, a
continuous palette is normally used; when, instead, discrete values
are represented, the color palette is discrete (sequential or
categorical). Widely popular examples use choropleth maps to
represent electoral results, with areas taking the color of the winning
coalition or party, as well as income levels, crime rates, ethnic
majorities, and so on; the possible examples are almost infinite.
If the logic is clear, we can run the first example. As data, we use
the Excel dataset extracted from the Italian Registry of Domestic
Animals regarding registered dogs and a dataset about the resident
population from the Italian National Institute of Statistics (ISTAT).
… … … … …
A correct choropleth map should use map and data at the same
granularity, either both at region level or both at province level. We
fix this by looking for a map of Italy at the regional level, which is
very likely to be found freely available.
19.2.1 Eurostat – GISCO: giscoR
NOTE
library(sf)
library(giscoR)

gisco_get_nuts(
  year= 2021, resolution= 20,
  nuts_level= 2, country= "Italy") %>%
  select(NUTS_ID, NAME_LATN) -> nuts2_IT

ggplot() +
  geom_sf(data= nuts2_IT) +
  theme_void()
At first sight, the sf data type may look unfamiliar, but it is actually
an R data frame, so we can handle it with common operations, such
as executing a normal inner join between the data frame nuts2_IT,
for the geographic data, and dogs. The join key should be the region
names, which, in nuts2_IT, correspond to column NAME_LATN. Then,
we can produce the choropleth map again with function geom_sf(),
filling regions with the color scale corresponding to the ratio between
dogs and residents. With the other attributes, we color the
borderlines white and set the line width. A little tweak is needed to
align the name of an Italian region between the two data frames.
Figure 19.6 shows the result, which now conveys coherent and
unambiguous information.
Figure 19.6 Choropleth map with coherent data and geographical areas.
nuts2_IT$NAME_LATN = str_replace_all(nuts2_IT$NAME_LATN,
  "Provincia Autonoma di Trento", "Trentino-Alto Adige")
nuts2_IT$NAME_LATN = str_replace_all(nuts2_IT$NAME_LATN,
  "Provincia Autonoma di Bolzano/Bozen", "Trentino-Alto Adige")
… -> p1
… -> p3
(p1 | p2 | p3)
Figure 19.7 Choropleth maps, from left to right: ratio of dogs per resident, region
population, and number of dogs registered in each region.
The ggplot function annotate() does what we are looking for. Its
first attribute, geom, specifies the type of annotation; in our
example, it will be point for the dots and text for the city names.
Attributes x and y specify the longitude and latitude of the
annotation; then style options follow.
p2 +
  annotate(geom="point", x=12.496, y=41.903, color="darkred") +
  annotate(geom="text", x=11.95, y=41.903, label="Rome",
           size=3, color="darkred") +
  annotate(geom="point", x=9.190, y=45.464, color="darkred") +
  annotate(geom="text", x=9.190, y=45.65, label="Milan",
           size=3, color="darkred") +
  annotate(geom="point", x=11.342, y=44.495, color="gold") +
  annotate(geom="text", x=11.6, y=44.7, label="Bologna",
           size=3, color="gold") +
  coord_sf(default_crs = sf::st_crs(4326)) +
  theme(text= element_text(size=12),
        legend.position= 'top',
        legend.key.width= unit(1.5, 'cm'),
        legend.key.height= unit(0.5, 'cm'),
        legend.text= element_text(size=8, angle=0, vjust=0.5),
        legend.title= element_text(size=8))
Figure 19.8 Annotated map with dots and city names for Milan, Bologna, and Rome.
library(plotly)
ggplotly(p1)
library(sf)
library(sp)
library(rnaturalearth)
sp::plot(ne_countries(country= c("sweden","denmark"),
scale= "medium"))
sp::plot(ne_states(country= c("sweden","denmark")))
sp::plot(ne_coastline(scale= "medium"))
Figure 19.10 Maps from Natural Earth, Sweden and Denmark’s borders and regions,
coastline world map.
We now delve into the details of the sp and sf formats. The three
maps just created are in sp format, the default format returned by
rnaturalearth functions. If we try to visualize them using ggplot and
geom_sf(), an error is raised: `stat_sf()` requires the following
missing aesthetics: geometry. The error message is interesting. It
tells us that in the data, namely in the sp format, the required
variable geometry is missing. We have already seen that variable in
a previous example with the sf format; it contains, for each area, the
list of coordinates of the geometry and the polygons for planar
surfaces. So, what does that error message mean? That the sp
format has no polygons? We can check directly with function str(),
as shown by the following excerpt of code.
First of all, we note that, like the sf format, the sp format is also
based on an R data frame, therefore usable by ggplot, just not
recognized by function geom_sf(); function geom_polygon(), for
example, would have handled it. Then, we see a list of
variables/columns and values. We recognize country codes in the
two-letter alpha-2 ISO standard (such as SE for Sweden), names of
geographic areas (e.g. Norrbotten), Swedish postal codes, and finally
latitude and longitude coordinates. Are those polygon coordinates?
No: there is a single pair of latitude and longitude coordinates for
each area, so they just identify a specific geographic point, not
multiple polygons. What are those coordinates? They represent the
single point conventionally used to identify an area, called the
centroid of the area, which corresponds to the geographic center of
a planar surface.
names(ne_states(country= c("sweden","denmark"),
                returnclass="sf"))
We try them with the two objects sw_dk1 (format sp) and sw_dk2
(format sf).
# From sp to sf
sw_dk_sf <- sf::st_as_sf(sw_dk1)
# From sf to sp
sw_dk_sp <- sf::as_Spatial(sw_dk2)
str(sw_dk_sp)
Formal class 'SpatialPolygonsDataFrame' [package "sp"
..@ data :'data.frame': 26 obs. of 121 varia
…
Let us start with the maps that Natural Earth makes available. The
list can be read in the package documentation (https://cran.r-
project.org/web/packages/rnaturalearth/vignettes/rnaturalearth.html
). Two of them interest us: railroads and land. Not all scales are
available; railroads has only scale 10. We use the sf format.
They are both world maps; for these, there is no option to select a
certain region, which instead should be cropped by specifying the
coordinates, again with the sf function coord_sf(). In the following
code, Western Europe is selected through coord_sf(): CRS 4326 is
indicated as the reference coordinate system, and the coordinates
are defined with attributes xlim and ylim, which set the boundaries
of a rectangle limiting the area of interest. Then, the ggplot graphic
is produced by overlaying the railroad map on the land map. Figure
19.12 shows the result. This map is the basis for working toward our
final result; now we have to overlay the other graphical elements
referring to the busiest railway stations.
Figure 19.11 Railroad and land maps from Natural Earth.
ggplot() +
geom_sf(data= land) +
geom_sf(data= rail) +
coord_sf(
default_crs= sf::st_crs(4326),
xlim= c(-10,20),
ylim= c(35,60)
)
library(rvest)
library(ggrepel)
rail <- ne_download(scale=10, type="railroads",
category="cultural", returnclass="sf")
land <- ne_download(scale=50, type="land",
category="physical", returnclass="sf")
'London Waterloo'),
# Ggplot graphic
ggplot() +
geom_sf(data= land, fill="ghostwhite") +
geom_sf(data= rail, lwd=0.1) +
  geom_jitter(data= head(busiest_rail_geo,15),
              aes(x= Lon, y= Lat, size= Sum, fill= Country),
              color='black', alpha=0.6, shape=21,
              width=0.5) +   # jitter width value assumed
  geom_label_repel(data= head(busiest_rail_geo,15),
                   aes(x= Lon, y= Lat, label= `Railway station`),
                   size=2.0, alpha =0.85, na.rm = TRUE,
                   box.padding = unit(0.75, "lines"))+
  scale_size_binned(range= c(3,25), n.breaks=5,
                    nice.breaks= TRUE)+
  labs(size="Passengers\n(Mil per year)",
       title="Busiest Railway Stations in Western Europe")+
  coord_sf(default_crs= sf::st_crs(4326),
           xlim= c(-10,20),
           ylim= c(35,60)) -> p1
# Style options
p1 +
scale_fill_brewer(palette= "Dark2")+
guides(fill= "none") +
theme_void() +
theme(legend.position= 'right',
legend.text= element_text(size=8, vjust=0.5),
legend.title= element_text(size=8),
        title= element_text(family= "Helvetica",
                            size=10,   # size value assumed
                            color= "darkred"))
Figure 19.13 Busiest railway stations and railroad network in Western Europe.
19.6 Shape Files and GeoJSON Datasets
When the interest in working with maps and geographic data grows,
it is inevitable to meet cartographic data and geodatasets, since they
are now often made available as open data by municipalities and
other public or private subjects. This brings us closer to the world of
traditional cartography and to the best systems in this sector, which
have a long tradition and a well-earned reputation for quality. These
systems, however, are typically not open-source (at least the best of
them), and they require specialized skills for handling complex
projects, skills that are only partially shared with data science and
data visualization.
library(tidyverse)
library(lubridate)
library(sf)
library(sp)
library(geojsonsf)
waterways= st_read('datasets/Venice/Strato01_Viabilita_Trasporti/Tema0103_AltroTrasporto/EL_ACQ.shp')
sea= st_read('datasets/Venice/Tema0402_AcqueMarine/CS…')
streets= st_read('datasets/Venice/Strato01_Viabilita_Trasporti/Tema0101_Strade/AC_PED.shp')
canals= st_read('datasets/Venice/Tema0404_ReticoloIdr…/CAN_LAG.shp')
bridges= st_read('datasets/Venice/Strato02_Immobili_A…/Tema0203_OpereInfrastruttureTrasporto/PONTE.shp')
terrain= st_read('datasets/Venice/Strato05_Orografia/Tema0503_FormeTerreno/SCARPT.shp')
green= st_read('datasets/Venice/Strato06_Vegetazione/Tema0604_VerdeUrbano/AR_VRD.shp')
civicNo= st_read('datasets/Venice/Strato03_GestioneVi…/Tema0301_ToponimiNumeriCivici/CIVICO.shp')
We have read the shape files, let us look at the content of one of
those R objects, for example waterways, with the content of
EL_ACQ.shp.
ggplot() +
geom_sf(data= streets.crs, color= "black", lwd=0.1)
theme_void() -> plot1
ggplot() +
geom_sf(data= canals.crs, fill= "lightblue") +
theme_void() -> plot2
So, now we have sf objects from the cartographic shape files, whose
coordinates are expressed according to the Monte Mario reference
system, and this map, with coordinates expressed according to the
WGS 84 reference system. These objects cannot be layered one on
top of the other, because the coordinates would not be aligned (you
can try; they will not match).
Now we can stack these layers, including the map, one on top of the
other. Figure 19.16 is realized by overlaying the map, the streets
layer, and the canals layer.
ggplot() +
geom_sf(data= ve_map.crs, fill= "ghostwhite") +
geom_sf(data= streets.crs, color= "gray", lwd=0.1) +
geom_sf(data= canals.crs, fill= "lightblue") +
theme_void()
STEP 1. First, we figure out the coordinates of the two points,
xmin, ymin and xmax, ymax, in the familiar longitude and
latitude degrees. We can easily find them by looking at online
maps that provide the geographical coordinates of selected
locations; otherwise, we can use the map from the GeoJSON file,
which is expressed in WGS 84 coordinates, cropping it with
function coord_sf() until the desired area is produced.
Figure 19.17 Venice, historical insular part, map with overlaid layers.
ggplot() +
geom_sf(data=ve_map.crs, fill= "ghostwhite") +
geom_sf(data=canals.crs, fill= "skyblue2") +
geom_sf(data=waterways.crs, color= "skyblue2") +
geom_sf(data=sea.crs, color= "skyblue4") +
geom_sf(data=bridges.crs, fill= "tomato3") +
geom_sf(data=streets.crs, color= "gray", lwd=0.1) +
coord_sf(default_crs = sf::st_crs(3004),
xlim = c(2308690,2316697),
ylim = c(5030945,5036255)) +
theme_void() -> plot2
The name might sound unfamiliar, but everybody knows them: they
are the base maps that we look at when we use an online map
service like Google Maps, OpenStreetMap, and the like, namely those
maps that offer a zoom feature, usually controlled by a gesture on
the touchpad or touchscreen, and that let us place markers to set a
position, among other interactive features. The same tile maps are
used for data visualization with the tools we are examining. There is
no technical limit to their usage; there is, however, a commercial
limit, because an increasing number of tile map providers have
transformed a service they used to offer freely into a paid
subscription, the most renowned example being Google Maps.
With commercial providers, an API key is required, which is a
particular code to specify for downloading the map. The way to
obtain an API key depends on the legal terms of the specific
commercial service. Nevertheless, a few tile map providers have kept
a free option, among them Stamen (http://maps.stamen.com/),
OpenStreetMap (https://wiki.openstreetmap.org/wiki/Tiles), and, in a
limited way, Carto (https://carto.com/blog/getting-to-know-positron-
and-dark-matter). Google Maps offers the possibility to use tile maps
freely up to a certain monthly threshold, but even in that case it
requires obtaining an API key with a formal contractual subscription.
A comment that could be made on this evolution into the commercial
realm is that, on the one hand, the possibility to freely experiment
with tile maps has drastically shrunk; on the other, though, this is
likely a signal that the professionalization and diffusion of geographic
data visualization is now a fact and is growing.
NOTE
library(ggmap)
Stamen's free tile maps are not very informative; they serve
aesthetic purposes only, as a base for other informative layers
stacked on top of them. A comment on package ggmap is that it
offers good functionality but, unfortunately, it suffers from the lack
of support for Google Maps and OpenStreetMap. It is worth a
mention and a try, anyway.
Leaflet has many features, which make it a complete tool for the
visualization of interactive geographic maps, not just a library with
some useful functions. Therefore, Leaflet is certainly a solution to
consider very seriously. A more detailed overview of Leaflet's
functionality will be presented in the final Python chapter; however,
all the examples shown here for R are fully replicable in Python too,
just by adapting the code, with the specific functions being by all
means identical, because in both environments, R and Python, what
is used is a wrapper around the same JavaScript library.
library(leaflet)
mapL <- leaflet() %>%
addTiles() %>%
fitBounds(lng1= 12.30, lat1= 45.40,
lng2= 12.40, lat2= 45.45) %>%
setView(12.3359,45.4380, zoom=14)
mapL
Other tiled web maps are available, although the actual availability
depends on the particular selected area (http://leaflet-
extras.github.io/leaflet-providers/preview/index.html). To use them,
package leaflet.providers is required. In the example, we add
Stamen's Toner map, Carto's Positron map, and ESRI's WorldImagery
map. Figure 19.20a, Figure 19.20b, and Figure 19.20c show the
corresponding base maps.
library(leaflet.providers)
What we have seen so far are examples with packages ggmap and
Leaflet just showing base maps, which for Leaflet could be enriched
with graphical elements offered by the package. This is not
sufficient, though, because we are working with topographic layers
(i.e., cartographic shape files and GeoJSON datasets) for which we
have produced the corresponding sf objects, and we want to add
them to the base map. Let us see how to do that.
Figure 19.19 Venice, Leaflet base map from OpenStreetMap. (a) Full view. (b) Zoom in.
Figure 19.20 (a/b/c) Venice, Leaflet tile maps from Stamen, Carto, and ESRI.
This case would not create any particular problem were it not for the
complication of coordinate systems with different CRSs, as in our
case study. Examples available in the documentation are typically
presented under the assumption that all layers have the same CRS
(usually WGS 84), which removes any obstacle. However, reality is
always more complicated than didactic examples and, as the adage
says, the devil is in the details. Having layers with different CRSs
(i.e., WGS 84 and Monte Mario), we have two main options:
Both options have pros and cons; let us start with the first one.
ggmap(basemap) +
geom_sf(data= canals.4326, fill= "skyblue2",
inherit.aes= FALSE) +
geom_sf(data= streets.4326, color= "gray", lwd=0.1,
inherit.aes= FALSE) +
geom_sf(data= waterways.4326, color= "skyblue2",
inherit.aes= FALSE) +
geom_sf(data= bridges.4326, fill= "tomato3",
inherit.aes= FALSE) +
geom_sf(data= green.4326, size=0.05, alpha=0.01,
color= "forestgreen", inherit.aes= FALSE) +
geom_sf(data= civicNo.4326, color= "darkred",
inherit.aes= FALSE) +
coord_sf(default_crs= sf::st_crs(4326),
xlim= c(12.30,12.40),
ylim= c(45.40,45.45)) +
theme_void()
With this, the raster base map object has been made compliant with
function st_transform(), and the coordinates have been converted
to the same CRS as the cartographic layers; we can now visualize
the resulting map. The code is the same presented for Option 1,
except for the instruction using function coord_sf().
ggmap(map) +
...
coord_sf(default_crs = sf::st_crs(3004),
xlim = c(2308690, 2316697),
ylim = c(5030945, 5036255)) +
theme_void()
The maps produced by the two solutions are identical, except for a
tiny misalignment of the base map with respect to the cartographic
layers, introduced by the empirical custom solution; this error could
be corrected with a more precise tuning of the bounding box
parameters, a confirmation that empirical methods should be
adopted only when standard methods are not available. Figure
19.21a and Figure 19.21b show two versions of the resulting map
with different tiled web maps, OpenStreetMap in the first case and
Stamen Toner in the second; the green areas are now visible.
We see two examples. With the first one, we replicate the layered
map just produced with ggmap. The syntax is intuitive, and the
result is in HTML format, so we save it with function save_html() of
package htmltools. Being an HTML object, it offers native features
like zooming, activated with gestures or by clicking on the buttons
with + and − symbols. Figure 19.22a and Figure 19.22b show two
screenshots of the resulting HTML map, respectively the full Venice
map and a detail obtained by zooming in on Ponte di Rialto (Rialto
Bridge) and Piazza San Marco (St. Mark's Square). The tile map is
Carto Positron.
htmltools::save_html(mapL, "Leaflet1.html")
htmltools::save_html(mapL2, "Leaflet2.html")
Figure 19.23 Venice, Leaflet, civic numbers with dynamic popups associated.
Let us see another example, this time with the cartographic layer
representing pedestrian areas from sf object streets.crs, from which
we omit missing values. We proceed in the same way as in the
previous case, with the result shown in Figure 19.24.
ggplot() +
  geom_sf(data= na.omit(streets.crs),
          aes(color= AC_PED_ZON,
              fill= AC_PED_ZON), lwd=0.3) +
  labs(color="Pedestrian Zone", fill="Pedestrian Zone") +
  coord_sf(default_crs= sf::st_crs(3004),
           xlim= c(2308690, 2316697),
           ylim= c(5030945, 5036255)) +
  scale_fill_tableau(palette="Color Blind",
                     labels = pedestrianType,
                     direction=1) +   # direction value assumed
  scale_color_tableau(palette="Color Blind",
                      labels = pedestrianType,
                      direction=1) +  # direction value assumed
  theme_void()
Figure 19.24 Venice, Leaflet, pedestrian areas.
The pin marker comes from a free icon made by Freepik from
www.flaticon.com, used by ggplot function geom_image() from
package ggimage. Textual annotations, instead, are produced with
function geom_label_repel() of package ggrepel, which we have
already used in a previous example. As the base map, we use a
cartographic layer from the Venice Municipality, and the data are
simply created with a little custom data frame of a few points of
interest. The result, shown in Figure 19.25, is still aesthetically
simple, but as a concept it is once again interesting and could inspire
many applications and variants.
Figure 19.25 Venice, ggplot, markers with annotations.
library(ggimage)
icon= "./pin.png"
data= data.frame(
  name= c("Guggenheim Museum, Dorsoduro 701-704",
          "Ca d'Oro, Cannaregio 3932",
          "Ca' Foscari University, Dorsoduro 3246",
          "Cinema Palace, Lungomare Guglielmo Marconi"),
  lon= c(12.3315, 12.33413, 12.3264, 12.36719),
  lat= c(45.4308, 45.44065, 45.4345, 45.40579))
ggplot() +
  geom_sf(data= streets.4326, color= "cornsilk3", lwd=0.1) +
  coord_sf(default_crs = sf::st_crs(4326),
           xlim= c(12.30, 12.40), ylim= c(45.40, 45.45)) +
  theme_void() -> plotX2
plotX2 +
  ggrepel::geom_label_repel(data= data,
    aes(x= lon, y= lat, label= name),
    size=2.5, alpha=0.7, na.rm=TRUE,
    box.padding= unit(0.75, "lines")) +
  ggimage::geom_image(data= data,
    aes(x= lon, y= lat, image= icon),
    size=0.05)
NOTE
library(tidyverse)
library(tmap)
library(sf)
library(sfheaders)
# NEIGHBORHOODS
tm_shape(data1) +
tmap_options(max.categories=35) +
tm_polygons("quartiere",
title='Neighborhoods')+
tm_layout(legend.position= c("right", "top"),
title='Rome Neighborhoods',
title.position= c('left', 'top'),
legend.width=100)
# DISTRICTS
tm_shape(data2) +
tmap_options(max.categories=35) +
tm_polygons("quartiere",
title='Districts')+
tm_layout(legend.position= c("right", "top"),
title='Rome Districts',
title.position= c('left', 'top'),
legend.width=100)
The difference between the two modes is substantial: in plot mode
(static), the map is generated as an image file and typically
visualized in the RStudio Plots tab, as customary for ggplot graphics;
in view mode (interactive), the map is produced as a leaflet object,
therefore as an HTML file. The two previous choropleth maps were
generated in plot mode.
We read new data in addition to toponymy areas and convert them
into WGS 84 coordinates (CRS 4326):
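The reading and conversion code is not shown in this excerpt; a
minimal sketch, with a hypothetical file path:

# Sketch (path hypothetical): read the archeological sites layer
# and convert it to WGS 84
archeo= st_read('datasets/Rome/archeo_sites.shp') %>%
  st_transform(crs= 4326)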
tmap_mode('view')
map_center +
tm_shape(archeo) +
tm_borders(col= "skyblue3", alpha=0.5) +
tm_compass(position= c("left", "bottom"), size=2) +
tm_scale_bar(position= c("left", "bottom"), width=0.15) +  # width value assumed
tm_layout(title= 'City center: archeological sites'
bg.color= "ghostwhite")
Figure 19.29 (a) Rome, tmap view mode, city center archaeological map with ESRI tiled
web map. (b) Rome, tmap view mode, zoom in on the Colosseum area with OpenStreetMap
tiled web map.
Quartiere Num
Esquilino 1026
Prati 895
Monti 890
Aurelio 880
Trionfale 769
Trastevere 639
… …
19.9.1 Centroids and Active Geometry
This is the tricky detail that lets us analyze the important concept of
centroid, already mentioned before, and an important extension to
geometries.
tmap_mode("plot")
tm_shape(roma) +
tm_polygons(col="ghostwhite", lwd=1.0)
tm_shape(int_result) +
tm_bubbles(size= 'n', col= "red") +
tm_layout(frame= FALSE,
legend.width=0.3,
legend.position= c('right','bottom'),
)
The result in Figure 19.30 might look nice at first sight (we see little
bubbles on the map), but unfortunately it is plain wrong. Pay close
attention to what the picture is showing. It is not showing what we
were expecting: bubbles are spread all over the areas, and there is
no single bubble for each area representing the number of
accommodations in it. In this image, the bubbles spread over an
area correspond to all the accommodations, resized proportionally to
the total number of accommodations in that area. This is not what a
bubble plot is supposed to look like; this is a mess.
What was the problem? The problem is subtle: the geometry of sf
object int_result is not the one we need, because it has no single
representative point for each area; instead, for each area there is a
list of many points, as many as the accommodations in the area. We
need a second geometry with just one representative point for each
area. That point is the centroid.
The steps we should take are the following ones:
1. Calculate the centroid for each topographic area.
2. Create a second geometry with the centroids only.
3. Reconfigure the sf object int_result so that the second geometry
is used for the map visualization.
We have taken a step further toward the solution of our problem and
the correct bubble plot. Now we know how to pick the single
representative point for each area: the point whose coordinates are
written in the centroid geometry of the sf object.
Here is the final step. We know that just one geometry can be
active, and it is column geometry, not centroid; we should change it.
For this, function st_geometry() helps again, and it is easy: we just
need to instruct it that the new active geometry should be column
centroid. The following sketch shows the centroid column turned into
the active geometry; the corresponding metainformation in the sf
object confirms the change.
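A minimal sketch of the three steps, with object and column names
assumed from the text:

# 1.-2. Compute the centroids and store them in a second
#       geometry column
int_result1= int_result
int_result1$centroid= st_centroid(st_geometry(int_result1))
# 3. Turn the centroid column into the active geometry
st_geometry(int_result1)= "centroid"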
map_roma +
  tm_shape(int_result1) +
  tm_bubbles(size = 'n', scale= 0.5, col = "red",
             popup.vars=c("Neighborhood: "="quartiere",
                          "No. accommodations: "="n")) +
  tm_layout(frame = FALSE,
            title= 'Accommodations',
            title.position = c('center', 'top'),
            ) -> tmap2
tmap2
1. Use again the original sf object int_result, the one with the single
geometry; centroids are not useful for this example.
Figure 19.31 (a) Rome, tmap, full map with bubbles centered on centroids and popups
associated to topographic areas. (b) Rome, tmap, detail zooming in with popups
associated to bubbles.
1. Sort and group areas by decile and, for each group, extract the
first and the last row, which correspond to the minimum and
maximum values for that decile.
2. Transform from sf object to a traditional R data frame with
sfheaders::sf_to_df(data, fill = TRUE); otherwise, some
operations cannot be executed.
3. Omit missing values and select only the necessary columns: the
decile index and the minimum and maximum values.
4. Create the textual labels for the legend and save the data frame
(cat2).
With this small data frame of decile indexes and textual labels for
the legend, we can modify the data frame by adding a new column
CAT with the textual labels; then, with another new column dec2,
defined as a factor type (i.e. categorical), we set the factor levels
corresponding to the decile indexes and the factor labels as the
legend labels. It is somewhat tricky, but it is a useful data-wrangling
exercise to tune that little detail of the legend; a sketch follows.
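A minimal sketch of the described wrangling; the structure of cat2
(columns decile and label) and the data frame name are
assumptions:

# Sketch (assumed): attach legend labels and an ordered factor
data$CAT= cat2$label[match(data$decile, cat2$decile)]
data$dec2= factor(data$decile,
                  levels= cat2$decile,
                  labels= cat2$label)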
len1= length(categories$n)
tmap_mode("view")
map_roma <- tm_shape(roma_bb2) +
  # tm_basemap("Stamen.TerrainBackground") +
  # tm_basemap("OpenTopoMap") +
  # tm_basemap("CartoDB.DarkMatterNoLabels") +
  tm_basemap("Stamen.TonerLite") +
  tm_polygons(col="dec2", palette='-cividis',
              alpha=0.6,   # alpha value assumed
              colorNA=NULL, title='Accommodations',
              popup.vars=c("Neighborhood: "="quartiere",
                           "No.accomodations: "="n",
                           "Decile: "="decile",
                           "Range: "="CAT"))  # popup variable assumed
map_roma
Figure 19.32 Rome, tmap, quantiles, and custom legend labels.
Figure 19.33 Rome, tmap, standard quantile subdivision, and legend labels.
Figure 19.35a shows a screenshot of the full map with Bed and
Breakfasts, while Figure 19.36a and Figure 19.36b show two
screenshots for the hotels: the full map and a zoom in.
tmap_mode("view")
map_roma <- tm_shape(roma_bb2) +
tm_fill(fill= NULL, colorNA= NULL, alpha=0.0,
interactive= FALSE,
popup.vars=c("Neighborhood: "="quartiere.x"
"No.accomodations: "="n")) +
tm_shape(BnB) +
"category"="categoria"))+
tm_layout(frame = FALSE,
title= 'Accommodations: Bed and Breakfast
title.position = c('center', 'top')
map_roma
saveWidget(tmap_Roma, file="...")
Figure 19.34 Rome region tmap, road map with dynamic popups.
Figure 19.35 (a) Rome, tmap, Bed and Breakfasts, full map. (b) Rome, tmap, Hotels, full
map. (c) Rome, tmap, Hotels, zoom in.
Figure 19.36 (a) Rome, tmap, hotels, full map. (b) Rome, tmap, hotels, zoom in.
20
Geographic Maps with Python
Dataset/Geodataset
As the source for the base map, we choose the territorial division of
areas into Zip Codes; it is a geodataset in GeoJSON format and the
graphical library we will use is Plotly. The following excerpt of code is
the usual list of Python libraries to import and the general load
operation to access the dataset’s content.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import plotly.express as px
import plotly.graph_objects as go
import json
nycgeo= json.load(open('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson'))
nycgeo['features'][0]
{'type': 'Feature',
'id': 0,
'properties': {'OBJECTID': 1,
'postalCode': '11372',
'PO_NAME': 'Jackson Heights',
'STATE': 'NY',
'borough': 'Queens',
'ST_FIPS': '36',
'CTY_FIPS': '081',
'BLDGpostalCode': 0,
'Shape_Leng': 20624.6923165,
'Shape_Area': 20163283.8744,
'@id': 'http://nyc.pediacities.com/Resource/PostalC
'geometry': {'type': 'Polygon',
'coordinates': [[[-73.86942457284175, 40.7491568709
[-73.89507143240856, 40.74646547081214],
[-73.89618737867819, 40.74850942518086],
...
[-73.87207046513889, 40.75386200705204],
[-73.86942457284175, 40.74915687096787]]]}}
features:
|____ id:
|____ properties:
|____ <attributes list as key:value>
|____ geometry:
|____ coordinates:
Now, we need the data for the first layer to overlay over the base
map. We choose the dataset Dogs Licensing in NYC; it is a standard
CSV dataset.
dogs= pd.read_csv('datasets/NYC_opendata/
NYC_Dog_Licensing_Dataset.csv')
0  Paige  F  2014  …
1  Yogi   M  2010  …
2  Ali    M  2014  …
3  Queen  F  2013  …
4  Lola   F  2009  …
…  …      …  …     …
dogs= dogs[~dogs.ZipCode.isna()]
dogs.ZipCode= dogs.ZipCode.astype('int64')
dogs_zipcount=
dogs.groupby(['ZipCode']).size().\
reset_index(name='counts').\
sort_values(by='counts', ascending=False)
dogs_zipcount
ZipCode counts
⋯ ⋯ ⋯
400 11274 1
395 11242 1
1 121 1
352 11108 1
783 99508 1
a1= dogs.groupby(['ZipCode','BreedName']).size().\
reset_index(name= 'counts').\
sort_values(by= 'counts', ascending=False)
dogs_maxbreed= a1.groupby('ZipCode').head(1)
dogs_maxbreed
Figure 20.2 NYC, plotly.express, most popular dog breed for zip code.
With the previous data aggregation, the breed that most often
appears as the most popular for a zip code is actually the unknown
breed, which is probably not particularly meaningful information to
convey. To improve the result, we can omit dogs of unknown breed
and slightly modify the previous code, as sketched below.
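A minimal sketch of the filtering step; the exact label used in the
dataset for unknown breeds is an assumption:

# Sketch (label assumed): drop rows whose breed is unknown
dog_breeds= dogs[dogs.BreedName != 'Unknown']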
a1= dog_breeds.groupby(['ZipCode','BreedName']).size(
reset_index(name='counts').\
sort_values(by='counts', ascending= False)
dogs_maxbreed= a1.groupby('ZipCode').head(1)
dogs_maxbreed
ZipCode BreedName counts
20.1.1.2 Mapbox
fig.update_layout(mapbox_style= "open-street-map",
                  mapbox_zoom=9,
                  mapbox_center= {"lat": 40.7831, "lon": -73.9712})
fig.update_layout(margin= {"r":0,"t":0,"l":0,"b":0},
                  width=600, height=600)
fig.show()
The resulting map is like that of the previous Figure 20.3; however,
this way the tooltip content can be carefully customized. We refer
the interested reader to the Plotly documentation for an overview of
the many possibilities (https://plotly.com/r/hover-text-and-
formatting/).
20.1.3 GeoJSON Polygon, Multipolygon, and Missing
id Element
dogRuns = json.load(open("datasets/NTC_opendata/
NYC Parks Dog Runs.geoj
dogRuns['features'][0]
{'type': 'Feature',
'properties': {'zipcode': '10038',
'name': 'Fishbridge Garden Dog Run',
'system': 'M291-DOGAREA0043',
...
'seating': None,
'councildis': '1',
'borough': 'M'},
'geometry': {'type': 'MultiPolygon',
'coordinates': [[[[-74.0016459156729, 40.7093268047
[-74.00098833771662, 40.70879507039175],
[-74.00099960334362, 40.70878952584789],
...
[-74.00167338737218, 40.709306008827475],
[-74.0016459156729, 40.70932680472401]]]]}}
for i in range(len(dogRuns["features"])):
dogRuns["features"][i]["id"]= str(i)
Let us start by simply adding the dog runs as the single layer over
the base map using plotly.express. The main difference with respect
to previous examples is the configuration of attribute locations,
which before was assigned the data frame column of zip codes to
produce the choropleth map. In general, this attribute should have
the list of values to use as keys to create associations with the
corresponding GeoJSON elements, which by default are those of the
id element, or could be specified with attribute featureidkey.
If the logic is clear, the only complication is that the GeoJSON has a
dictionary data organization; hence, we cannot just specify the name
of the element (i.e. id) as we would for a data frame column, but
the values must be explicitly read, for example with an iteration.
This is why in the following code the value of attribute locations is
a list comprehension ([f["id"] for f in dogRuns["features"]]),
which reads all the id elements nested in the main element
features.
It may look a little complicated, but that is the price for dealing with
dictionary structures; once the logic is clear, the rest is
technicalities.
fig.update_layout(margin= {"r":0,"t":0,"l":0,"b":0},
width=600, height=600,
hoverlabel= dict(
bgcolor="white", font_size=16,
font_family= "Rockwell")
)
The same could be done with plotly go with syntax adapted as the
only difference. Figure 20.4 shows the map with the dog runs
overlaid to a base map.
figure= go.Figure(
    data= [go.Choroplethmapbox(
        geojson= dogRuns,
        locations= [f["id"] for f in dogRuns["features"]],
        …
    )])
Now we want to overlay the two layers: the choropleth map with the
zip codes and tooltips, and the dog runs. We start with the
choropleth map and save the resulting object in variable fig.
Here, there is a new function that we need to overlay a new layer:
add_trace() of plotly go. Using function add_trace() is the more
general and recommended technique to stack graphical layers on a
Plotly map, which makes plotly go the preferred Plotly module if
these types of maps are to be produced. The same is also possible
with plotly.express, but it is not as easy, and there is basically no
reason to prefer it. The resulting map is shown in Figure 20.5; the
tiled web map is Carto Positron.
fig= px.choropleth_mapbox(dogs_maxbreed,
        geojson= nycgeo,
        locations= 'ZipCode', color= 'counts',
        featureidkey= "properties.postalCode",
        color_continuous_scale= "Cividis",
        hover_data= ["ZipCode","BreedName","counts"],
        labels= {'BreedName':'Breed',
                 'counts':'Number of dogs'},
        mapbox_style= 'carto-positron',
        zoom=13, opacity=0.4,
        center= {"lat": 40.7831, "lon": -73.9712},
        width=600, height=600
)
fig.add_trace(go.Choroplethmapbox(
    geojson= dogRuns,
    locations= [f["id"] for f in dogRuns["features"]],
    z= [1]*len(dogRuns["features"]),
    marker= dict(opacity=0.9, line=dict(color="red", width=2)),
    colorscale= [[0, "red"], [1, "red"]],
    showscale=False,
    ))
fig.update_layout(margin= {"r":0,"t":0,"l":0,"b":0})
Figure 20.5 NYC, plotly go, overlaid layers, Choropleth map, and dog runs, Carto Positron
tiled web map.
20.3 Geopandas: Base Map, Data Frame,
and Overlaid Layers
import geopandas as gpd   # import assumed, as in the Folium section

nyc_gpd= gpd.read_file('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson')

   postalCode  PO_NAME  STATE  borough  geometry
   …           …        …      …        …
We read also the GeoJSON dataset of dog runs with geopandas and
reproduce the map seen before.
dogruns_gpd= gpd.read_file('datasets/NYC_opendata/'
    'NYC Parks Dog Runs.geojson')
fig= px.choropleth_mapbox(dogruns_gpd,
        geojson= dogruns_gpd.geometry,
        locations= dogruns_gpd.id,
        mapbox_style= 'open-street-map',
        hover_name= 'name',
        hover_data= {'id':False, "zipcode":True,
                     "borough":True, "precinct":True},
        labels= {'zipcode':'<i>Zip Code</i>',
                 'borough':'<i>Borough</i>',
                 'precinct':'<i>Precinct</i>'},
        center= {"lat": 40.7831, "lon": -73.9712},
        zoom=14, opacity=1.0,
        width=600, height=600
)
fig.update_layout(margin= {"r":0,"t":0,"l":0,"b":0},
hoverlabel=dict(
bgcolor="white",
font_size=16,
font_family="Rockwell")
)
Figure 20.6 NYC, plotly.express and geopandas, dog runs, extended tooltip.
By using plotly go, two details need special attention. The first is
that function go.Choroplethmapbox still requires a dictionary for
the geojson attribute, not a data frame, therefore assigning it with
dogruns_gpd.geometry as we did for plotly.express produces an
error. The data frame column should be transformed with
eval(dogruns_gpd.geometry.to_json()) , which returns it as
dictionary type. The second important detail to take care of is that
attribute locations by default refers to element id of the
GeoJSON, but when this element is absent, an alternative solution
that works is to refer to the implicit index of the GeoDataFrame (a
reference for this workaround is
https://gis.stackexchange.com/questions/424860/problem-plotting-
geometries-in-choropleth-map-using-plotly/436649#436649).
figure= go.Figure(
    data= [go.Choroplethmapbox(
        geojson= eval(dogruns_gpd.geometry.to_json()),
        locations= dogruns_gpd.index,
        z= [1]*len(dogruns_gpd),
        marker= dict(opacity=.8,
                     line= dict(color="blue", width=2)),
        hovertext= tooltip,
        colorscale= [[0, "red"], [1, "red"]],
        showscale= False
    )],
    layout= go.Layout(
        margin= dict(b=0, t=0, r=0, l=0),
        width=600, height=600,
        mapbox= dict(
            style= "carto-positron",
            zoom=14,
            center_lat= 40.7831,
            center_lon= -73.9712,
        )))
Figure 20.7 NYC, plotly go and geopandas, dog runs, extended tooltip.
20.3.2 Overlaid Layers: Dog Breeds, Dog Runs, and
Parks Drinking Fountains
We start with the dog breeds for the choropleth map and dog runs
area.
nycdogs_gpd= nycdogs_gpd.set_index('OBJECTID')
# First tooltip
fig= go.Figure()
fig.add_trace(
go.Choroplethmapbox(
geojson= eval(nycdogs_gpd.geometry.to_json()),
locations= nycdogs_gpd.index,
z= nycdogs_gpd['counts'],
colorscale= "bluered", zmin=0, zmax=600,
marker_opacity=0.8, marker_line_width=1,
hovertext= tooltip1,
hoverinfo= 'text'
))
fig.add_trace(
go.Choroplethmapbox(
geojson= eval(dogruns_gpd.geometry.to_json()),
locations= dogruns_gpd.index,
z= [1]*len(dogruns_gpd),
marker= dict(opacity=.8,
line= dict(color="blue", width=2)),
hovertext= tooltip2,
hoverinfo= 'text',
colorscale= [[0, "red"], [1, "red"]],
showscale= False
))
fig.update_layout(mapbox_style="open-street-map", map
mapbox_center= {"lat": 40.7831,
"lon": -73.9712},
margin= {"r":0,"t":0,"l":0,"b":0},
autosize= False, width=600, height=
fountains_gpd= gpd.read_file('datasets/NYC_opendata/'
    'NYC Parks Drinking Fountains.geojson')
fountains_gpd['lon']= fountains_gpd.geometry.x
fountains_gpd['lat']= fountains_gpd.geometry.y
   fountain_ty  signname                          borough  descr
1  C            Robert Moses Playground           M        C, In Playground
2  D            Chelsea Park                      M        D, Under Tree, Near Ballfield
3  D            John V. Lindsay East River Park   M        D, Just Outside Playground, Near Ballfield
With the two columns for longitude and latitude, we can configure a
third tooltip for the drinking fountains (tooltip3) and add the new
layer composed of points. Figure 20.9a, Figure 20.9b, Figure 20.9c,
and Figure 20.9d show screenshots of the final result: the full map, a
detail with the tooltip of a drinking fountain, a detail with the tooltip
of a dog run area, and a detail with the tooltip for the most popular
dog breeds in zip codes.
tooltip3= '<b>Sign Name: </b>' + fountains_gpd['signname'] + \
          '<br>' + '<br>' \
          + '<b>Position: </b>' + fountains_gpd['position']
fig= go.Figure()
fig.add_trace(
go.Choroplethmapbox(
geojson= eval(nycdogs_gpd.geometry.to_json()),
locations= nycdogs_gpd.index,
z= nycdogs_gpd['counts'],
colorscale= "grays", zmin=0, zmax=600,
marker_opacity=0.8, marker_line_width=1,
hovertext= tooltip1,
hoverinfo= 'text'
))
fig.add_trace(
go.Scattermapbox(
lat= fountains_gpd.lat,
lon= fountains_gpd.lon,
mode='markers',
marker= go.scattermapbox.Marker(
size=5,
color= 'orangered',
opacity=0.7
),
hovertext= tooltip3,
        hoverlabel= dict(bgcolor=
            ['gray','#00FF00','rgb(252,141,89)']),
        hoverinfo= 'text'
))
fig.add_trace(
go.Choroplethmapbox(
geojson= eval(dogruns_gpd.geometry.to_json()),
locations= dogruns_gpd.index,
z= [1]*len(dogruns_gpd),
        marker= dict(opacity=.8,
                     line=dict(color="blue", width=2)),
hovertext= tooltip2,
hoverinfo= 'text',
colorscale= [[0, "red"], [1, "red"]],
showscale= False
))
fig.update_layout(mapbox_style= "open-street-map",
                  mapbox_zoom=9,
                  mapbox_center= {"lat": 40.7831, "lon": -73.9712},
                  margin= {"r":0,"t":0,"l":0,"b":0},
                  autosize= False, width=1000, height=700)
Figure 20.9 (a) NYC, plotly go and geopandas, dog breeds, dog run areas, and parks
drinking fountains, full map. (b) NYC, zoom in on a tooltip for a drinking fountain. (c)
NYC, zoom in on a tooltip for a dog run area. (d) NYC, zoom in on a tooltip for the most
popular dog breed in a zip code.
20.4 Folium
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import json
import geopandas as gpd
import folium
Figure 20.11 shows the resulting Folium map, this time with Stamen
Terrain tiled web map.
folium.Marker([40.7116, -74.0132],
    popup= "<i>The World Trade Center and the National \
September 11th Memorial and Museum</i>",
    tooltip= "Ground Zero",
    icon= folium.Icon(icon="building",
                      prefix='fa', color='black')
).add_to(map1)
folium.Marker([40.6892, -74.0445],
popup= "<b>The Statue of Liberty is a gift from
people of France</b>",
tooltip= "<b>Statue of Liberty</b>",
icon= folium.Icon(color="lightblue",
icon='ship', prefix='fa')
).add_to(map1)
folium.Marker([40.7813, -73.9740],
    popup= "<b>200 Central Park West, New York, NY \
10024</b>" + "<br>" + \
           "Open Hours: 10AM-5.30PM",
    tooltip= "<b>American Museum of Natural History</b>",
    icon= folium.Icon(icon="institution", prefix='fa')
).add_to(map1)
folium.Marker([40.7580, -73.9855],
    tooltip= "<b>Times Square</b>",
    icon= folium.Icon(icon="square", prefix='fa',
                      color='red')
).add_to(map1)
Figure 20.11 NYC, Folium, markers, popups, and tooltips, Stamen Terrain tiled web map.
import branca

# html1 and html2 are HTML strings (defined earlier, not shown here)
# combining text and an image for the two popups.
iframe1= branca.element.IFrame(html=html1,
                               width=250, height=350)
popup1= folium.Popup(iframe1,
                     max_width=250, max_height=1000)
iframe2= branca.element.IFrame(html=html2,
                               width=300, height=400)
popup2= folium.Popup(iframe2,
                     max_width=350, max_height=1000)
#map2.save('./image/map2.html')
Figure 20.12 (a/b) NYC, Folium, marker’s popups with HTML iframe and image (Redd F /
Unsplash.com & Willian Justen de Vasconcellos / Unsplash.com).
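Each popup built this way is then passed to a marker through the popup
argument; a sketch, where the coordinates, tooltip text, and map2
target are illustrative assumptions:
# Attach the iframe-based popup to a marker (illustrative values).
folium.Marker([40.7061, -73.9969], popup=popup1,
              tooltip="Photo spot").add_to(map2)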
With the next example, we add to a base map a layer with elements
from the Sea Level Rise Maps (2050s 500-year Floodplain) GeoJSON
dataset, whose data represent estimates made by FEMA. For the
details about the style options, we refer the reader to the official
Leaflet documentation (https://leafletjs.com/). Figure 20.13 shows
the resulting Folium map.
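The style returned by style_function below is a plain dictionary of
Leaflet path options; a minimal sketch (the specific colors and values
are assumptions):
# Leaflet path options for the floodplain layer (illustrative values).
style= {'fillColor': 'steelblue', 'color': 'navy',
        'weight': 1, 'fillOpacity': 0.4}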
seaRise= json.load(open('datasets/FEMA/'
    'Sea Level Rise Maps (2050s 500-year Floodplain).geojson'))
folium.GeoJson(seaRise, name="geojson",
               style_function= lambda x: style
).add_to(map1)
Figure 20.13 NYC, Folium, base map, and GeoJSON layer with FEMA sea level rise
estimates.
20.4.4 Choropleth Maps
nyc_zip= json.load(open('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson'))
From all rodent inspections, we select only those that revealed the
presence of rodents (i.e. Rat Activity).
rats= pd.read_csv('datasets/NYC_opendata/Rodent_Inspection.csv')
rats1= rats[(rats.RESULT == 'Rat Activity')]
inspTot= rats1.groupby('ZIP_CODE')[['RESULT']].count().\
         reset_index().sort_values(by= 'RESULT', ascending=False)
inspTot.ZIP_CODE= inspTot.ZIP_CODE.astype('int64')
ZIP_CODE RESULT
74 10457 15093
75 10458 13691
⋯ ⋯ ⋯
map1= folium.Map(
location=[40.7831, -73.9712],
zoom_start=10, width=500,
height=500)
folium.Choropleth(
geo_data= nyc_zip,
name= "choropleth",
data= inspTot,
columns= ["ZIP_CODE","RESULT"],
key_on= "feature.properties.postalCode",
fill_color="Grays",
fill_opacity=0.6,
line_opacity=0.2,
legend_name="Rat presence",
).add_to(map1)
folium.LayerControl().add_to(map1)
Figure 20.14 NYC, Folium choropleth map, rodent inspections finding rat activity.
20.4.5 Geopandas
nyc_gpd= gpd.read_file('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson',
    driver='GeoJSON')
nyc_gpd.postalCode= nyc_gpd.postalCode.astype('int64')
popup= GeoJsonPopup(
    fields= ["ZIP_CODE", "borough", "PO_NAME", "RESULT"],
    aliases= ["Zip Code: ", "Borough: ", "Neighborhood: ",
              "Num. Inspections: "],
    localize= True, labels= True,
)
# fig1 is the folium.Choropleth object created as before (not shown)
fig1.geojson.add_child(popup)
folium.TileLayer('cartodbpositron').add_to(map1)
Figure 20.15 NYC, Folium and geopandas, rodent inspections finding rat activity.
NOTE
We should prepare the data frame because the inspections are too
many and would produce an unclear result. We select just those made
in year 2022 and omit rows whose date is missing or whose
coordinates are missing or equal to zero.
rats1['INSPECTION_DATE']= \
    pd.to_datetime(rats1['INSPECTION_DATE'])
rats1= rats1[~rats1['INSPECTION_DATE'].isna()]
rats1['YEAR']= \
    rats1['INSPECTION_DATE'].dt.year.astype('Int64')
rats2022= rats1[rats1.YEAR == 2022]
rats2022= rats2022[
    ~rats2022.LATITUDE.isna() & ~rats2022.LONGITUDE.isna() &
    (rats2022.LATITUDE != 0) & (rats2022.LONGITUDE != 0)]
The data frame is now ready. We can create the Folium map with the
function folium.Map(), setting the global elements. Then we subset
the data frame by extracting just the two columns for latitude and
longitude (e.g., ratsHeat= rats2022[['LATITUDE','LONGITUDE']]).
This new data frame is the data for the function plugins.HeatMap(),
which creates the Folium heatmap; the heatmap is finally added to
the map as a new layer with the method add_to(). Style options are
also specified for transparency and blur. Figure 20.16 shows the
result, which, as noted, does not look like a heatmap as usually
intended, but rather like a kernel density plot.
# Base map (center and zoom repeat those of the previous NYC maps)
map4= folium.Map(location=[40.7831, -73.9712], zoom_start=10)
ratsHeat= rats2022[['LATITUDE','LONGITUDE']]
plugins.HeatMap(ratsHeat,
                min_opacity=0.2, blur=11,
).add_to(map4)
Figure 20.16 NYC, Folium heatmap of rodent inspections with rat activity.
20.5 Altair: Choropleth Map
import numpy as np
import pandas as pd
import altair as alt
import geopandas as gpd
import json
20.5.1 GeoJSON Maps
nycgeo= json.load(open('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson'))
data_obj_geojson= alt.Data(values=nycgeo,
    format=alt.DataFormat(property="features"))
plot_a= alt.Chart(data_obj_geojson).mark_geoshape(
fill='ghostwhite', stroke='skyblue')
plot_b= alt.Chart(data_obj_geojson).mark_geoshape(
).encode(color='properties.borough:N')
stations= gpd.read_file('datasets/NYC_opendata/
Subway Stations.geojson')
stations['lon']= stations.geometry.x
stations['lat']= stations.geometry.y
To create the plot, the syntax is the same as seen in Part II for
Altair graphics, with the novelty of the function mark_geoshape()
and some geographical attributes like latitude and longitude.
Figure 20.18 shows the NYC map with points corresponding to subway
stations.
basemap= alt.Chart(data_obj_geojson).mark_geoshape(
fill= 'ghostwhite',
stroke= 'skyblue')
points= alt.Chart(stations).mark_circle(
size=10,
color= 'darkred'
).encode(
longitude= 'lon:Q',
latitude= 'lat:Q',
tooltip= ['name','line','notes'])
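The figure itself is obtained by layering the two charts; a minimal
sketch, where the width and height values are assumptions:
# Overlay the station points on the base map and set the canvas size.
(basemap + points).properties(width=500, height=500)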
residents= pd.read_csv('datasets/NYC_opendata/'
    'Demographic_Statistics_By_Zip_Code.csv')
Now we read the GeoJSON dataset with geopandas and join it with
the data frame of ethnic groups just derived.
etnicGroups= etnicGroups[~etnicGroups.value.isna()]
etnicGroups.value= 100*etnicGroups.value
nyc_gpd= gpd.read_file('datasets/NYC_opendata/'
    'nyc-zip-code-tabulation-areas-polygons.geojson')
nyc_gpd.postalCode= nyc_gpd.postalCode.astype('Int64')
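The join then matches zip codes between the two tables; a sketch,
where the key column name in etnicGroups (here zipcode) is an
assumption:
# Join geometries with the ethnic-group percentages by zip code.
nyc_merged= nyc_gpd.merge(etnicGroups, left_on='postalCode',
                          right_on='zipcode')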
Figure 20.19 Altair, choropleth maps for ethnic groups (from left to right: Hispanic Latino,
Asian Non-Hispanic, White Non-Hispanic, and Black Non-Hispanic).
Index
%% 300–301, 470
%in% 10, 65, 67, 71, 169
%Y-%m-%d 24, 34, 235
a
accommodations 392, 460–479
active geometry 462–466
aesthetic 1, 5–8, 10, 12, 13, 16, 18, 19, 65, 89, 113, 115, 120,
129, 166, 168, 199–202, 205, 208, 210, 220, 223, 228, 242, 248,
253, 257, 283, 296, 399, 404, 416, 430, 431, 460, 487
aggregation 30, 226, 238, 240, 244, 246, 276, 323, 460, 462,
485, 487
Agricultural Outlook 83, 86, 89, 91, 95, 97
alpha 10, 16, 21–22, 27, 33, 51, 112, 114–116, 125–126, 128,
133, 139–142, 144, 166, 168–169, 171, 173, 175, 399, 418, 438,
447, 457, 470, 473, 479
alpha2 ISO 410
alt.binding_select() 212–214
alt.Chart() 198, 199, 202, 203, 205, 208, 210, 214, 216,
220, 223, 226, 228, 229, 231, 232, 234, 237–240, 244, 246, 248,
251, 256, 257, 260, 262, 263, 322, 382, 523, 525, 527
alt$Chart() 321, 322
alt.Color() 199, 205, 214, 216, 223, 226, 228, 229, 231,
244, 248, 252, 253, 256, 260, 263, 382, 527
alt.condition() 208, 210
alt.Data() 523
alt.DataFormat() 523
alt.datum 216, 240, 246, 252, 253, 256, 382
alt.hconcat() 201, 527
alt.layer() 232
alt.Legend() 199, 216, 220, 226, 229, 231, 246, 256, 257,
262, 263, 382
alt.Scale() 199, 205, 208, 216, 220, 223, 226, 228, 229,
231, 246, 252, 253, 256, 257, 260, 263, 382
alt.selection_interval() 207, 214, 220
alt.selection_point() 207, 248
alt.SortOrder() 252
alt.value() 199, 208, 210, 214, 220, 223, 228, 229, 231,
232, 240, 248, 382
alt.vconcat() 201
Altair 1, 19, 193–265, 295, 315–327, 377–387, 511, 522–528
altair 197, 260, 320–323, 380, 382, 523, 527
altair::install_altair() 319
American Museum of Natural History 511–514
angle 16, 46, 49, 89, 102, 104, 112, 404, 408
annotate() 406
annotated map 404–408
API 20, 400, 408, 430, 508
API key 430, 508
app.layout 341, 342, 344, 356, 365, 369, 379
app.run_server() 342, 345, 365
area plot 242, 244
arrange() 167
artist 189, 191
ascending 3, 34, 35, 130, 364, 373, 374, 381, 483, 485, 487,
516, 526
as.character() 12, 418
as_cmap 187–189
as.factor() 46, 48, 86, 89, 101, 102, 104, 111–116, 122,
125–129, 131, 138–142, 279
as.list() 151, 282
as.matrix() 179
as.numeric() 418, 461
assign() 12, 239, 307, 319, 491, 496
as_tibble() 310, 313, 315, 395, 417, 470
astype() 20, 239, 331, 364, 483, 489, 498, 500, 516, 519,
521, 527
as_vegaspec() 320, 321, 326
Atlanta 15, 23, 24
attribute 3, 5, 9, 12, 18, 21, 25, 27, 31, 34–37, 39–40, 47, 49–
50, 54–56, 59, 63, 71–73, 86, 94, 95, 105, 113–115, 120, 122–
125, 128, 130, 133, 135, 139, 141, 142, 145, 151, 152, 166–168,
172–174, 179–180, 182, 197–200, 202–203, 205, 207–208, 210,
212–213, 216, 226, 232, 238, 242, 246, 262, 273, 275, 277, 287,
297–298, 302, 310, 323, 331, 334, 342, 345, 348, 351–352, 357,
361, 370, 372, 378–380, 393, 396, 399, 403, 406, 411, 413, 416,
427, 433, 447, 450, 460, 472, 473, 482, 484, 486–487, 489–492,
495–496, 498, 500, 508, 509, 517, 519, 523, 525
automatic styling 286
ax 40, 41, 94–97, 135, 163, 170–173, 176, 182, 184, 189–191
ax_heatmap 182
ax.invert_yaxis() 189, 190
axis x 31, 35, 47, 50, 52, 54–56, 78, 84, 142, 166, 168, 199,
262, 334, 361, 366
axis y 46, 47, 52, 56, 76, 120, 142, 166, 189, 199, 215, 220,
223, 246, 334, 361, 365, 366
axis.text 16, 46, 49, 61, 64, 86, 89, 91, 102, 104, 111, 112,
116, 123, 144, 149, 156, 159, 160, 169, 279, 283
axis.text.x 16, 46, 49, 61, 64, 86, 89, 91, 102, 104, 112,
149, 156
axis.title 64, 111, 112, 116, 123, 144, 159, 160, 279, 283
ax.marg_x.set_xlim() 173, 176
ax.marg_y.set_xlim() 173, 176
ax.margins() 191
ax_joint.legend_._visible 173, 176
azure3 13
b
bar chart 3, 29
bar plot 3, 29–58, 83–97, 100–101, 104, 119, 235–256, 262,
320, 321, 323, 326, 363–365, 371–375, 380, 382, 387
base map 416, 430, 431, 433–440, 450, 457, 473, 481, 483,
484, 491–493, 495–511, 514, 515
bbox_to_anchor 36, 37, 106, 173, 176
beeswarm plot 129–132
Berlin 110, 117–121, 133–137, 144, 145, 157–160, 162, 163,
165, 167–172, 174, 175, 177, 178, 180, 181, 183, 186, 188, 190,
418
bin 39, 59, 60, 65, 71–73, 79, 262, 263, 320
Binaxis 142, 144
bind 15, 115, 116, 140, 207, 208, 210, 212–214, 251, 253, 310,
313, 381, 417
bind_rows() 313, 315, 417
Bing 309–311
binwidth 59–61, 63–65, 71–73, 78, 79, 144
bivariate analysis 63–66, 68–70, 73–77
black 12, 18, 94, 96, 147, 156, 166, 240, 253, 256, 263, 283,
307, 418, 424, 510, 514, 526–528
Bologna 59, 71, 407
Bootstrap 286, 340, 342, 352, 361, 364, 365, 377, 509
Bootswatch 280, 286, 287, 296, 303, 342, 361, 365, 377
bounding box 402, 421, 425–427, 439, 440, 450, 453, 454, 457,
462, 465
boxen plot 51
boxplot 49, 51, 99–107, 109, 112–117, 125, 126, 128, 137, 139,
141, 142, 165–167, 171–172
branca 513, 514
branca.element.Iframe 513, 514
breakpoint 352
breaks 12, 13, 16, 33, 45, 64, 67, 71, 159, 169, 418
bridges 420, 421, 425, 427, 429, 437, 438, 440
brush 214–221
bslib 286, 287, 296
bslib::bs_theme() 287
bs_themer() 289
bubble plot 257–260, 416, 460, 462–464, 466
button 207, 214, 220, 270, 290, 297–301, 303, 304, 361–363,
365, 440
c
cache 275
CALLBACK 344–345, 351, 355, 357, 360, 362–364, 373, 379–
381
canals 424, 425, 427, 429, 437, 438, 440
Carnegie Mellon University 4, 15, 23, 44, 59
Cartesian axes 3, 5, 120, 157, 168, 189, 199, 260, 361, 396
Carto 430, 435, 440, 442, 492–494, 498
cartography 389, 419–448, 457
cat() 299
categorical scatterplot 49, 89, 122–124, 131, 169, 170,
categorical variable 3, 29, 31, 33, 49, 51, 67, 74, 76, 84, 99,
100, 118, 122, 157, 168, 198, 235, 244, 282, 325, 466
causal 6, 7
causation 177, 240
cause-effect 3, 177
cbar 73–75, 187, 188
celsius 15
centroid 410–411, 462–467
char_kws 73
chart 3, 29, 198, 199, 202, 203, 205, 208, 210, 214, 216, 220,
223, 226, 228, 229, 231, 234, 237–240, 244, 246, 248, 251, 256,
257, 260, 262, 263, 321, 322, 326, 378–382, 523, 525, 527
chart.show() 208
checkbox 272, 297–301, 308, 355–364, 372–374, 379, 381
checkboxGroupInput() 298
checkboxInput() 297
choropleth map 389, 392, 397–406, 408, 419, 450–452, 466,
484–494, 496, 498, 500, 501, 507, 515–519, 522–528
Chrome 310, 311, 313
cividis 46, 181, 404, 470, 488, 489, 494
class() 393
Cleveland plot 325–327
climate 15, 149
cluster 177–191
cluster map 177–191
clusterization 178
cmap 163, 182, 187–189
colnames() 302
color 6–8, 10–14, 16–19, 21, 37, 40, 44, 45, 60, 61, 63, 72, 74,
89, 91, 94, 95, 97, 122, 126, 128, 129, 131, 135, 139, 152, 156,
160, 166, 169, 175, 182, 187, 189, 199, 201, 202, 205, 208, 210,
214, 216, 219, 220, 223, 226, 228, 229, 231, 232, 238, 240, 244,
248, 251–253, 256, 260, 263, 279, 283, 288, 289, 302, 308, 331,
334, 345, 348, 351, 352, 360, 363, 364, 372–374, 382, 397, 404,
406, 407, 418, 419, 424, 427, 429, 438, 440, 443, 445, 446, 448,
457, 484–488, 492–494, 498, 502, 505, 506, 510, 511, 514, 517,
519, 523, 525, 527
color blind 111, 112, 445
Color Brewer 30, 166
color palette 10, 16, 30, 32, 37, 40, 50, 72, 74, 86, 94, 135,
145, 162, 166, 187, 189, 198, 199, 202, 248, 260, 303, 331, 348–
355, 357, 398, 416
Color Picker 296
color scale 16, 64, 67, 94, 157, 159, 178, 180, 181, 248, 251–
253, 256, 260, 331, 351, 398, 399, 403, 404, 484, 489, 492, 517,
523
color_discrete_map() 372
Colosseum 457, 458
cols 15, 33, 110, 116, 178
column() 271–272, 288, 297
column_to_rownames() 179
col_wrap 50–52, 54–56, 58, 75, 106, 344
conda 319, 320, 495
configure_mark() 199
consumer prices 4, 9
container 94–97, 288, 364–365, 369, 378, 379
contextual change of color 210–212
continuous variable 3, 33–34, 37–39, 59, 71, 75, 83, 99, 109,
184, 196, 257, 262, 466
coord_cartesian() 168
coord_flip() 31, 33, 49, 86, 89, 91, 112, 144
coordinate reference system 406
coordinates transformation 427
coordinate system 1, 413, 421, 422, 426, 435
coord_sf() 406, 413, 427, 428, 439
copy(deep=True) 373, 374, 381
copyright 110, 195, 196, 269, 295, 391, 392, 481
corr() 185, 187
correlation 3, 177–191, 242
correlation analysis 177, 184
correlation index 177, 184, 187, 189
correlation matrix 184–189
correlation metric 178
cosmo 280
count plot 51
county level 392
CRAN 138, 141, 319
Creative Commons 4, 29, 59, 195, 269, 391, 392
CRS 406, 407, 414, 418, 421–429, 436–440, 444–447, 450, 452,
461, 462, 465
CRS 4326 406, 407, 414, 418, 426, 429, 439, 446, 447, 457, 461
CSS 286, 295, 296, 308, 310, 312, 314, 339, 356, 361, 377–
387
CSS inline 356
CSS selector 310, 312, 314
CSV 15, 273, 308, 460, 483
cubehelix 52, 55, 58, 72, 106, 119, 120, 163, 171
custom function 134, 135, 137–144, 277, 288, 297, 300, 322,
323, 325, 344, 345, 351, 357, 360, 362–364, 367, 374, 381, 438
custom legend 466–472
custom palette 31
custom solution 178, 268, 439, 440
cut() 13, 33, 39
d
dark theme 280, 283, 284, 291, 360–361, 365
darkred 14, 64, 86, 89, 129, 134, 253, 406, 407, 419, 438, 443,
525
Dash 268, 329, 339–387
Dash() 342, 361
Dash Bootstrap Container 364
Dash Core Components 342, 378
Dash data table 345, 348
dashboard 205, 267–269, 271–327, 329, 334, 339–387
dashboardBody() 307
dashboardHeader() 307
dashboardPage() 307
dashboardSidebar() 307
dash_bootstrap_components 340
dash_bootstrap_templates 340
dash.table.DataTable() 345
data frame 5–6, 15, 18, 20–21, 23–24, 30, 43–44, 52, 84, 91,
115, 117, 122, 138, 149, 151–152, 157–162, 173, 178–179, 182,
184–185, 187, 189, 197–198, 201, 205, 212–213, 215–216, 223,
225–226, 228, 235, 238–239, 242, 244, 246, 248, 251, 253, 257,
260, 289, 297–298, 300–302, 309–310, 313–315, 321–326, 330,
345, 357, 361, 363, 370, 380, 395–396, 398–399, 403, 410–413,
416, 418, 425, 438, 440, 442, 446–447, 450, 460–461, 469–470,
473, 483–484, 486, 489, 491, 495–507, 516, 518–519, 521, 527
data import 267, 308, 341
data wrangling 15, 20, 23, 52, 91, 110, 117, 158, 225, 235,
242, 251, 275, 323, 330, 340–341, 363, 370, 380, 381, 398, 414,
415, 460, 469, 483, 525
dataset 4–5, 15, 20, 23, 29–30, 34, 44, 59, 63, 71, 83, 84, 91,
100, 109–110, 115–117, 121, 133, 137, 140, 148, 149, 152, 157–
160, 165, 170, 178, 195–197, 201, 225, 235, 240, 242, 244, 246,
251, 257, 269, 273–274, 295, 302, 308, 314, 329–330, 340, 341,
391–392, 398, 400, 408, 415–416, 419, 421, 425, 433, 450–452,
460–461, 481–484, 490, 495–496, 500, 503, 514–516, 518, 523,
525, 527
datum 215, 216, 240, 246, 248, 251–253, 260, 262, 322, 326,
382, 422, 527
datum.rank() 215
dbc.Col() 352, 363, 365
dbc.Container() 364
dbc.min.css 361, 377
dbc.Row() 352, 363, 365
dbc.themes 342, 344, 361, 377
dcc.Checklist() 356
dcc.Dropdown() 348, 356
dcc.Graph() 342
dcc.Markdown() 362
dcc.RadioItems() 361
dcc.RangeSlider() 342
dcc.Tab() 369, 378
dcc.Tabs() 369
default_crs 369
dendrogram 180, 181
Denmark 409–412
density plot 49, 59–81, 111, 112, 138, 147, 149, 165, 168, 173,
174, 521
dependencies 319, 339, 449, 450, 495
dependent variable 3, 8, 130
deployment 130, 267, 298
desc() 151, 282, 324, 325, 461, 469
df.columns 34
df.pivot() 197, 330, 341
diagonal correlation heatmap 186–189
dict 73–75, 170, 173, 345, 348, 352, 360, 371, 395, 482, 492,
494, 497, 498, 502, 506
dictionary 200, 345, 419, 482, 489, 491, 492, 495, 496, 498,
516, 517, 519, 523
direction 19, 84, 95, 152, 181, 239, 248, 372, 445
directly correlated 184
discrete 46, 48, 73–75, 101, 102, 149, 152, 160, 202, 331, 334,
364, 372, 374, 398
disjunction 325
districts 450–452, 457
diverging bar plot 83–97, 239–241, 253, 371, 372
diverging color palette 86, 189
divnorm() 94, 96
dodged 31, 35, 37, 114, 115, 128, 139, 145, 244
dodged bar plot 244
dodge.width 124, 125, 130, 140, 141
dog runs 481, 490–494, 496, 497, 499–507
Domestic Animals 391, 398
dot notation 302, 325, 341
dot plot 141, 142
double scale 240–244
downgrade 319
dplyr 469
drinking fountains 481, 500–507
drop() 93, 185, 212–214, 272–275, 277, 280, 287, 297, 348,
351, 352, 356–358, 363, 371, 372, 374, 375, 379
drop-down menu 212–214, 272–275, 277, 280, 287, 297, 351,
352, 356–358, 363, 372, 374, 375, 379
DT::datatable() 301
DT::formatStyle() 302
dt.month 35, 36, 52, 57, 105, 119, 120, 133, 135, 144, 161,
170, 235
dt.month_name() 36, 37, 52, 57, 105, 161, 170
DTOutput() 323
DT::renderDT() 301
dt.year 24, 117, 133, 135, 145, 235, 521
dynamic legend 209, 364, 366, 380
dynamic popup 440, 444, 447, 466, 473, 474, 519
dynamic tooltip 205–206, 210, 231, 261, 326, 331–333, 365,
372, 408, 433, 485–487, 489, 496
dynamic zoom 208–211, 219
e
edgecolor 94, 96, 190, 191
element id 480, 484, 491, 498, 500
element_blank() 61, 63, 64, 149, 152, 156, 160
element_text() 16, 46, 49, 61, 64, 86, 89, 91, 102, 104,
111, 112, 116, 123, 144, 149, 156, 159, 160, 169, 279, 283, 404,
408, 419
El Paso 15, 24
enable_search 287
encode() 199
encoding 199–201, 205, 208, 210, 213, 216, 219
EPSG 3004 422
error 3, 47, 78, 99, 117, 121, 274, 282, 296, 298, 299, 319,
410, 411, 422, 440, 450, 498
ESRI Shapefile 421
European Union 401
Eurostat 110, 115, 116, 269, 273, 400–404
eventReactive() 275
Excel 110, 115–117, 398
excerpt of code 85, 89, 156, 200, 235, 273, 282, 301, 341, 345,
352, 357, 369, 378, 379, 396, 410, 437, 460, 465, 470, 473, 482,
489
external CSS 361, 377–379, 382
external_stylesheet 342, 344, 361, 377
extreme event 99
f
facet 1, 43–58, 61–63, 65–67, 71, 74, 77, 102–104, 106–107,
112, 115–116, 133, 140–142, 144, 173–174, 202–204, 210, 212,
234–236, 289, 294, 334
facet() 203, 235
facet_col 334
facet_col_wrap 334
facet_grid() 44, 115, 142
facet_wrap() 44, 47, 61, 63, 115, 142
factor() 46, 48, 63, 86, 89, 101, 102, 104, 111–116, 122,
125–129, 131, 138–142, 148, 151, 152, 159, 279, 282, 283, 469,
470
Fahrenheit 15, 23
fct_relevel() 151–152, 283
featureidkey 484–486, 488–491, 493, 517
FEMA 514, 515
fields 207, 208, 210, 213, 214, 223, 229, 231, 248, 251, 253,
257, 381, 401, 421, 427, 464, 465, 467, 519
fig.legend() 173, 176
figsize 21, 24, 38, 40, 50, 170, 173, 182, 484, 486
fig.update_geos() 485, 487
fill 10–11, 13, 17–18, 31, 33–34, 46, 48, 60–61, 63–65, 67, 71–
72, 78, 86, 89, 100–102, 104, 111–116, 125–126, 128, 138–142,
144, 149, 152, 156, 159–160, 166–167, 228–229, 237, 239, 283,
289, 397, 399, 404, 416, 418–419, 425, 427, 429, 438, 440, 445,
450, 469, 473, 479, 517, 519, 523, 525
filter() 232, 260, 302, 324, 425
filter(rank() ) 324
Firefox 311
fitBound() 433, 440, 443, 447, 485, 487
flaticon 445
FLATLY 342, 344
flickering 300
fluidPage() 271, 273
fluidRow() 271, 272, 288, 297
folium 507–522
folium.Choropleth() 516, 519
folium.Circle() 511
folium.CircleMarker() 511
folium.GeoJson() 514
folium heatmap 520–522
folium.Icon() 508
folium.LayerControl() 517
folium.Map() 508, 521
folium.Marker() 508
folium plugin 519, 520
folium.Popup() 514
font.size 21, 24, 38
font_scale 21, 52, 54, 58, 94, 96, 106, 187
FontAwesome 508, 509
for() 84, 470
for-cycle 491
forestgreen 12, 14, 18, 125, 135, 438, 440
format 34, 36, 37, 52, 83, 110, 161, 189, 194, 200–201, 235,
239, 301, 302, 310, 320, 322, 345, 362, 377, 382, 393, 395, 401,
410, 412, 413, 419, 420, 425, 438, 440, 450, 481, 495, 508, 512,
513, 516, 519, 523
FP.CPI.TOTL.ZG 4, 9–11, 13, 14, 17, 18
France 10, 21, 392, 510
free_x 47
free_y 47
Freepik 445
freezeReactiveValue() 301
fr_FR 105
function
actionButton() 297
add_child() 519
add_params() 207–208, 210, 214, 216, 220, 223, 229,
231, 248, 251, 256, 257, 382
add_to() 517
add_trace() 493, 501
addCircleMarkers() 447
addPolygons() 440
addPolylines() 440
addTiles() 433
aes() 5, 6
all_of() 158
alt.Axis() 199, 202, 205, 208, 216, 220, 223, 226, 228–
229, 231, 238–240, 244, 246, 248, 251, 382
alt.Bin() 212, 262
alt.binding_radio() 214
alt.binding_range() 251, 253, 257
@app.callback() 344
g
GDP 4, 20–22, 25–27, 84
GeoDataFrame 496, 498, 500, 503, 518, 519, 522
geodataset 391–392, 419, 460, 461, 481, 490, 503
geographic coordinates 401, 415, 416, 418, 421, 450, 495
geographic maps 329, 389–479, 481–528
GeoJSON 419, 425, 428, 433, 450–452, 481–484, 486, 487,
489–496, 498, 500–504, 506, 514–520, 522, 523, 525, 527
geojson_sf() 425
GeoJsonPopup() 519
geojsonsf 419, 425
geom_bar() 31, 33, 34, 46, 48, 86, 89
geom_beeswarm() 130, 131
geom_bin2d() 63–65
geom_boxplot() 100–102, 104, 113, 115, 116, 125, 128,
139, 141, 142
geom_density() 63, 67, 71, 111, 112, 138, 149, 152, 156
geom_density_ridges() 149, 152, 156
geom_density2d() 63, 67, 71
geom_density2d_filled() 63, 67, 71
geom_dotplot() 141, 142, 144
geom_half_violin() 141, 142
geom_hex() 63–65
geom_histogram() 59–61, 63
geom_hline() 13, 14
geom_image() 445, 447
geom_jitter() 122–124, 416, 418
geom_label_repel() 13, 14, 416, 418, 445, 447
geom_line() 17–19, 279, 288
geom_point() 5–8, 10, 11, 13, 14, 16–18, 44, 89, 91, 113,
114, 122, 125, 139, 141, 166, 169, 289, 416
geom_polygon() 396, 397, 399, 410
geom_raster() 159
geom_rect() 159
geom_rug() 168, 169
geom_segment() 89, 91
geom_sf() 401, 403, 404, 410, 411, 413, 414, 418, 424, 425,
427, 429, 438, 445, 446
geom_sina() 126, 128, 129
geom_split_violin() 138–141
geom_text() 86, 89, 91
geom_tile() 159, 160
geometry 401, 402, 410–412, 421, 425, 427, 440, 450, 453,
461–466, 482, 490, 495, 496, 498, 500–504, 506, 519, 523, 525
geometry.to_json() 498, 501, 502, 504, 506
geopandas 495–508, 518–520, 523–528
Germany 10, 110, 117, 121, 137, 157, 165, 178
gesture 193, 208, 219, 272, 430, 440
get_height() 94, 95
get_stamenmap() 430, 431, 438
get_width() 95, 96
ggbeeswarm 130, 131
ggExtra 165, 166
ggforce 126
gghalves 141, 142
ggimages 445–447
ggmap 430–433, 435–441, 448
ggmap() 431, 437, 439
ggMarginal() 165–167
ggplot 1, 4–19, 21–22, 29–34, 43–49, 51–52, 56, 59–71, 83–
91, 100–105, 110–117, 122–131, 138–144, 147–160, 165–170,
173, 178–182, 184, 196, 198–199, 257, 277, 283, 286, 288, 294,
297, 322–323, 396, 399, 408–414, 433, 445–446, 452, 460, 521
ggplot() 5–8, 10–11, 13–14, 17–18, 31, 33–34, 44, 46, 48,
86, 89, 91, 100–102, 104, 111–116, 122, 125–129, 131, 138–142,
149, 152, 159–160, 166, 169, 397, 399, 413
ggplotly() 408
ggrepel 13–14, 416, 445, 447
ggridges 148, 149
ggsave() 142, 144
ggthemes 31, 111, 166
ghostwhite 397, 418, 427, 429, 457, 463, 466, 523, 525
GIS 390, 420, 421
GISCO 400–404, 408
gisco_get_nuts() 401
giscoR 400–404, 408
go.Choroplethmapbox() 489, 492, 494, 498, 500–504, 506
go.Figure() 489, 492, 498, 501, 504
gold 10, 12, 14, 18, 65, 125, 252, 253, 407
Google Font 286, 287, 291, 294, 296
Google Maps 430
go.Scattermapbox() 503, 505
gpd.read_file() 495, 496, 503, 518, 525, 527
grammar of graphics 1, 5, 8, 19, 27, 194, 199, 232, 329, 433,
448
graphical layers 6, 413–419, 440, 493, 500
green areas 420, 427, 437
grey50 14
grid.arrange() 167
gridExtra 166, 167
grid of plots 43
gridspec_kw 170, 173
group 17–19, 30, 31, 35, 113, 214, 219, 244, 282, 323, 324,
372, 396, 469, 527
groupby() 34, 36, 52, 57, 93, 119, 161, 170, 225, 235, 237,
240, 244, 246, 248, 381, 483, 485, 487, 516
group_by() 30, 46, 47, 85, 151, 158, 165, 276, 282, 323–325,
396, 461, 469
groups of bars 31, 35, 36
groups of points 11, 25, 113, 128
guide_legend() 16, 160
guides() 3, 16, 419
h
half-violin plot 137–156
Havre 15–17, 23–25
head() 30, 308, 310, 364, 373, 374, 381, 395, 418, 485, 487
header 110, 287, 307, 308, 310, 313, 315, 342, 348, 415, 417
heatmap 157, 159–163, 178–182, 185–191, 223, 260, 261,
263, 520–522
height_ratios 172, 173
hexadecimal RGB 32
highlighted line 228–230
Himalayan Database 295, 303, 309, 314–316, 323, 326
histogram 49, 59–81, 165–167, 173, 260–265, 334, 336
horizontal orientation 31, 238
hover_data 331, 334, 345, 352, 360, 363, 364, 373, 374,
486–488, 494, 497
hoverinfo 500, 502, 503, 505–507
hovertemplate 500
hovertext 489, 498, 500, 502, 505–507
HTML 193, 196, 200, 201, 205, 272, 296, 301–303, 322, 331,
339–342, 344, 351, 356, 362, 377–379, 408, 409, 440, 452, 454,
457, 489, 500, 508, 511
HTML5 380
HTML table 301, 302, 308–315, 365–376, 414, 417
html.Br() 356, 362
html.Div() 341, 344, 351, 352, 355, 356, 365, 370–372, 379
html_element() 310, 312, 314, 417
html.H4() 342
html.Hr() 356, 362, 370, 372
html.Iframe() 379
html.P() 344, 351, 356, 362, 371, 379
html_nodes() 310
html_table() 310, 313, 315, 417
htmltools 440, 443, 448
http service 274
hue 21, 22, 24, 25, 27, 35–37, 50, 51, 54–56, 58, 64, 65, 72,
75, 79, 94, 106, 135, 144, 145, 157, 171, 173, 175, 190, 207, 210
hue_order 24, 145
i
identifier 273, 275, 276, 289, 297, 298, 322, 342, 344, 351,
356, 361–363, 379, 496
identity 31, 33, 46, 48, 86, 89, 128, 129
id_vars 20, 24, 526
ifelse() 12, 89, 91
iframe 340, 377, 379–382, 511–514
implicit index 498
import 19, 20, 94, 135, 182, 197, 260, 267, 272, 308, 322, 330,
340, 341, 482, 495, 508, 511, 519, 521, 523, 527
independent variable 3
index 36, 52, 57, 83, 119, 161, 163, 170, 177, 184, 185, 189,
195, 197, 225, 235, 237, 240, 244, 246, 286, 288, 295, 319, 330,
341, 345, 348, 381, 391, 433, 469, 483, 485, 487, 495, 498, 500,
502, 505–507, 516
inflation variations 9
inherit.aes 6, 438
inline 186, 189, 301, 308, 339, 356, 361, 362, 377, 378, 380,
416
inner_join() 399, 404
inplace 34, 36, 52, 57, 93, 161, 185, 500
Input() 351, 360, 362, 364, 373, 381
input element 272–277, 289, 290, 298–300, 345, 348, 362,
363, 379
inputId 275, 287
inputStyle 362
installation 130, 319–320, 339, 449, 450, 495
interactive() 210
interactive graphic 193–194, 201, 204–224, 228–235, 247–256,
260, 377, 379
interactive legend 207–208, 210, 220, 224, 379
intersecting geometries 460–479
inversely correlated 184
isin() 360, 363, 364, 373, 374, 381
isna() 34, 52, 198, 330, 341, 360, 483, 521, 526
is.na() 30, 45, 279, 282, 461, 469, 470
iso2c 9, 10
ISTAT 391, 398
Italy 29, 59, 71, 278, 304, 356, 392–401, 422, 425, 427
j
JavaScript 205, 286, 309, 431
jitter 121–136, 139, 140, 416
jitter.height 124, 125, 140, 141
jitter.width 124, 125, 140, 141
join 398, 399, 403, 404, 418, 469, 472, 484, 489, 491, 500,
516–519, 527
joint grid 173–176
joint plot 173, 174
JSON 193, 194, 196, 200–201, 322, 419, 523
json.load() 482, 490, 514, 515, 523
Jupyter 134, 205, 339, 340
jupyter-dash 339, 340
JupyterDash() 340
k
kde 63, 72, 74
kernel density 49, 59–81, 521
kind 49–52, 54, 74, 106, 123, 168, 173, 268, 298, 309, 445,
484
Korea 151, 281
l
label 13, 14, 63–65, 89, 91, 158, 165, 172, 287, 308, 356, 369,
378, 406, 407, 416, 418, 445, 447, 486
labelsize 41, 163, 187
labs() 10, 11, 13, 14, 16–18, 32–34, 45, 46, 48, 61, 63, 64,
86, 89, 91, 102, 104, 111, 112, 116, 122, 125, 126, 128, 129,
131, 139–141, 144, 149, 156, 159, 160, 166, 169, 279, 289, 399,
404, 418, 445
lag() 85, 93, 239, 420, 421
latency 274
latitude 394–396, 406, 410, 411, 416, 422, 426, 428, 429, 460,
461, 495, 503, 504, 508, 521, 523, 525
layout 35, 36, 41, 54, 55, 58, 71, 72, 89, 95, 97, 117, 130, 134,
139, 141, 148, 163, 166, 167, 172, 191, 244, 267–269, 271, 272,
286–288, 295–297, 303, 304, 307, 308, 321, 341–357, 362–365,
369, 371, 373, 374, 379–381, 383, 416, 450, 451, 455, 457, 463,
466, 479, 485, 487–489, 492, 495, 497, 498, 500, 503, 507
leaflet 431–435, 440–445, 447, 448, 451, 452, 457, 470, 473,
479, 508, 514, 522
leaflet() 431–435, 440, 442–445, 447, 448, 451, 452, 457,
470, 473, 479, 508, 514, 522
leaflet.js 431
leaflet.providers 433
legend.legendHandles 189, 191
legend.position 61, 64, 104, 111, 112, 116, 122, 139, 140,
144, 149, 152, 156, 166, 169, 404, 408, 419, 451, 463
legend_set_title() 79, 145, 146
length() 84, 112, 168–170, 219, 470
levels 1, 29, 63, 67, 151, 180, 208, 398, 401, 469, 470, 492
library() 1, 4, 8, 9, 13, 14, 19, 20, 31, 40, 94, 110, 111,
126, 130, 131, 135, 141, 142, 147, 149, 166, 167, 179, 193, 194,
196, 205, 273, 307, 309, 315, 320, 322, 329, 339, 392, 401, 404,
408, 416, 419, 431, 433, 446, 450, 481, 507, 513
lightblue 60, 61, 63, 64, 129, 166, 167, 220, 237, 425, 427, 510
light theme 16, 21, 86, 89, 131, 283, 284, 361, 377, 382
linear correlation analysis 184
line plot 3–28, 43, 49, 50, 147–156, 196, 225–236, 242, 244,
277, 280, 282, 283, 325, 326, 334, 335
linetype 14, 18, 19, 288, 289
linewidth 14, 18, 25, 27, 50, 51, 91, 94, 96, 106, 145, 168, 187,
188, 279, 288, 396, 397, 404, 416
list 13, 54, 71, 96, 105, 122, 138, 151, 159, 161, 205, 207, 212,
273, 275, 287, 313, 323, 340–341, 351, 356, 361–362, 370, 393,
395, 419, 486
list() 16, 151, 166, 167, 282, 287, 301, 361, 362
load_figure_template() 340, 361, 377
loc[] 20, 22, 25
locale 105
localhost 274
logarithmic scale 75–81
logical condition 12, 128, 129, 198, 208, 219, 228, 240, 246,
289, 298, 300, 322, 324, 325, 357, 373, 374, 527
log_scale 78
lollipop plot 83–97
long form 15, 20, 21, 23, 24, 115, 116, 122, 140, 149, 152,
159, 189, 260, 330, 525
longitude 394–396, 401, 406, 410, 411, 416, 422, 426, 428,
429, 433, 460, 461, 495, 503, 504, 508, 521, 523, 525
low achieving 269, 273, 277, 279, 289
lubridate 16, 44, 45, 60, 61, 63–65, 67, 71, 105, 392, 419
lxml 340
m
MacOS 311, 449, 450
magma 37, 50, 51, 220, 226, 228, 229, 231
magrittr 302
main diagonal 186, 187
main island 393, 395, 399
main panel 297
mainPanel() 321
mako 35–37, 163
mamba 319
Manhattan 508
Manhattan plot 123
map() 54–55, 58, 161, 508, 511, 514, 517, 519, 521
map.axes() 393, 394
mapbound 431
mapbox 487–489, 492–494, 496–498, 503, 507
mapbox_style 487–489, 492, 494, 497, 503, 507
map_data() 395
maps 177–191, 208, 329, 389–479, 481–528
map.scale() 393, 394
maptype 431
marginals 165–168, 170, 172–176, 334, 336
marginal_x 334
marginal_y 334
margins 180, 181, 191
mark_area() 242, 244
mark_bar() 237, 239, 240, 244, 248, 251–253, 256, 262,
321, 382
mark_circle() 198–200, 202, 208, 214, 216, 220, 228, 257,
263, 382, 525
mark_geoshape() 523, 525, 527
mark_line() 225, 226, 228, 229, 231, 234, 242, 244
mark_point() 203, 205, 210, 214, 228, 229, 231, 232
mark_rect() 223, 260, 263
mark_rule() 232, 238
mark_text() 215, 216, 232, 238, 239, 246, 252, 253, 256
mark_tick() 263
marker 3, 10, 17, 25, 198, 205, 210, 231, 232, 257, 334, 445,
447–448, 489, 492, 494, 498, 502, 505–506, 508–511, 514
markerClusterOption() 447, 448
markers with annotations 446
mask 187, 188, 194, 345, 352, 360, 363, 364, 373, 374, 381
Mathematics 109–116, 122–130, 138–143, 149, 150, 269, 273
Matplotlib 19–21, 40, 78, 94, 170, 173, 482, 508
matrix 178–180, 182, 184–189, 301
maxwidth 128, 129
mean() 4, 151, 154, 155, 187, 225–227, 237, 238, 276, 282,
377, 410, 422
median 99, 101, 145, 151
melt() 20, 24, 526
Mercator 393
metadata 195, 406, 422, 450
Milan 29, 30, 34, 44, 45, 52, 59, 100, 101, 407, 418
Milwaukee 15, 23–25, 63
missing value 12, 30, 179, 198, 280, 330, 444, 460, 461, 469,
483, 521
modelr 4
Monte Mario 422, 426–429, 436, 438–440, 451
month() 45, 63–65, 67, 71, 101, 102, 104, 105, 158, 165
Mount Everest 295, 309, 310
mouse hover 205, 210, 212, 228–233, 236, 311
mouse hovering 210–212
mouseover 210, 229, 231
Mozilla 380, 511
mpl.colors.TwoSlopeNorm() 94, 96
multiInput() 286, 287, 297
multilinestring 286, 287, 297
multi-page dashboard 272, 303, 307, 365
multi-page layout 286, 369
multiple selection 220, 223, 248, 249, 286, 287, 297, 355–360
multipolygon 401–403, 425, 427, 490–491, 495, 503
multi-polygon 440, 460
mutate() 12, 45, 63, 85, 101, 102, 104, 152, 283, 324, 325,
404, 470
n
NA_character 13
names_from 178
names_prefix 178
names_to 15, 110, 116
Natural Earth 408–410, 413, 414
na.value 12, 13
Navbar 272, 297
n.breaks 64, 67, 71, 418
ncol 45–48, 63, 65, 67, 71, 102, 104, 148, 167, 289
ne_coastline() 409
ne_countries() 409
ne_download() 408, 413, 416, 417
ne_states() 409–412
negatively correlated 184
neighborhoods 72, 74, 79, 450–452, 454
New York 15, 23–25, 63–71, 392, 481–491, 510
nice.breaks 418
nominal 202, 203
non_selected_header 287
normal distribution 184
normality test 177, 184
np.arange() 191
np.array() 94, 96
np.ones_like() 187, 188
np.triu() 187, 188
ntile() 469, 472
NumPy 19, 20, 197, 330, 340, 482, 508, 523
NUTS geometries 401
o
observation 3, 29, 31, 33, 35, 38, 59, 63–65, 99, 117, 120,
147–149, 178, 180, 412
observe() 166, 289–295, 297–299, 301, 457, 482
observeEvent() 289, 290, 297, 298, 301
OECD 83, 84, 109–113, 116, 121, 126–128, 130, 137, 138, 140,
148, 149, 269, 280
OECD-FAO 84, 86, 89, 91, 95, 97
opacity 199, 200, 202, 203, 205, 208, 214, 216, 220, 223, 228,
229, 231, 232, 244, 262, 334, 382, 448, 487–489, 492, 494, 497,
498, 502, 505, 506, 514, 517, 519, 521
OpenPolis 425
OpenStreetMap 430, 431, 433, 434, 440, 441, 458, 487, 488,
496, 508, 509
OpenTopoMap 470, 473
OR 325
ordinal 198, 202, 203
outlier 99, 101, 112, 115, 116, 125
outlier.shape 115, 116, 125, 128, 139, 141
output() 5, 271–277, 279, 280, 289, 297, 298, 301, 320–323,
340, 344, 345, 351, 357, 360, 362–364, 373, 379–381
output element 272–276, 277, 289, 301, 322, 323, 363, 379
outputId 275–277
overplotting 121–136
override.aes 16
p
padding 199, 314, 352, 416, 418, 447
paired 166
Palace 427, 446, 447
Pandas 19, 20, 37, 197, 226, 330, 340, 370, 482, 484, 495,
500, 508, 523
panel.background 156, 283
panel.grid.major.y 61, 63, 64, 149, 156
patchwork 8, 40, 166, 404
pd.Categorical() 96, 105, 161
pd.concat() 24
pd.cut() 37–39
pd.date_range() 161
pd.pivot_table() 161, 163
pd.qcut() 39–41
pd.read_csv() 20, 23, 34, 71, 91, 197, 225, 235, 242, 330,
341, 483, 515, 525
pd.read_excel() 117
pd.read_html() 370
pd.to_datetime() 24, 34, 37, 52, 235, 239, 521
Pearson correlation matrix 184
pedestrian areas 444, 445
phenomenon 3, 15, 99, 389, 398
Phoenix 15, 17, 24, 25, 63
PHP 309, 314
pin marker 445
pio.renderers.default 340
pip 319, 320, 495
pipe 10, 302, 322
pirate attacks 195, 235, 237–241, 243–251
Pisa test 111, 112, 115, 116, 122, 124, 127–130, 138–142,
149, 151, 273, 274, 277, 283, 294
pivot_longer() 15, 110, 116
pivot_wider() 178
planar surface 395, 396, 401, 410, 411, 421, 426, 450, 490
plasma 67, 101, 102, 104, 152, 223, 246
plot 6, 8, 10, 43, 47, 65, 67, 166, 216, 224, 238, 243, 280–281,
288, 322, 364, 404, 450, 451–452, 454, 455, 460, 462–466, 489,
521, 525
plot() 401, 409
plot_joint() 174, 175
plot_marginals() 174, 175
plot.margin 112, 122, 131, 139, 140, 144, 168, 169
plot mode 451, 452, 454, 455
plotly 329, 331, 334–336, 339, 340–345, 357, 364, 372, 379–
380, 408–409, 481–501, 503, 504, 507–508, 515, 517–519, 522,
523
plotly.express 329, 330, 340, 363, 482, 484–489, 491, 493,
496–498, 508
plotly.graph_objects 329
plotly.io 330, 340
plotOutput() 277, 279, 280
plots alignment 165–176
plt.cm.bwr() 94
plt.cm.PiYG() 96
plt.figure() 21, 24, 38, 50, 484, 486
plt.legend() 21, 22, 25, 106
plt.rcParams.update() 21, 24, 38
plt.subplots() 40, 170, 173
plt.tight_layout() 35, 36, 71, 72, 95, 97, 117, 134, 163
plt.xlabel() 24, 25, 35–37, 39, 71–73, 76, 79, 106, 117,
133, 134, 145, 146, 163, 173
plt.xlim() 79
plt.xscale() 78
plt.xticks() 35–37, 39, 95, 106
plt.ylabel() 24, 25, 35–37, 39, 71–73, 76, 79, 106, 117,
133, 134, 145, 146, 163, 173
plt.yscale() 76, 79
plugins 520, 521
plugins.HeatMap() 521
PNG/JPG 200
point plot 51
pollutant 29–31, 33–39, 45–48, 52, 54–58, 100–104, 106
polygon 396, 399, 410–412, 439, 482, 490–491, 493, 496
polygonal elements 396
popup 205, 442, 443, 447, 448, 466, 470, 472, 473, 479, 509–
511, 514, 519, 520
position 21, 31, 36, 61, 64, 104, 111–116, 121–125, 128–130,
139–142, 144, 149, 151, 152, 156, 165, 166, 169, 172, 216, 232,
404, 406, 408, 419, 430, 433, 451, 463, 466, 470, 473, 479, 504
position_dodge() 113–116, 122, 125, 139, 141
position_jitterdodge() 124, 125, 140, 141
position_nudge() 142, 144
positively correlated 184
Positron 430, 433, 440, 442, 443, 492–494, 498
possession 253, 260, 393
pretty_breaks() 169
print() 83, 110, 177, 193
properties() 199, 202, 203, 216, 219, 220, 226, 228, 229,
232, 238–240, 251–253, 256, 257, 260, 262, 263, 312, 326, 382,
482–486, 488–490, 493, 517, 519, 523, 525, 527
province 396–400
pull() 320
pulsar 147, 152, 155
purrr 392, 393
px.bar() 363, 364, 372–374
px.choropleth() 484, 486, 487, 492, 493, 496
px.choropleth_mapbox() 487, 492, 493, 496
px.colors.named.colorscales() 351
px.colors.qualitative 331, 334
px.colors.sequential 331, 363, 364
px.line() 334
px.scatter() 331, 334, 345, 352, 360, 363
py_run_string() 322
PyCharm 339
Pyplot 19, 20, 50, 94, 482, 508
Python 1, 19–28, 34–41, 49–58, 71–81, 91–97, 105–107, 117–
120, 131–136, 144–148, 160–163, 170–176, 182–191, 193, 200,
239, 268, 296, 309, 315, 319–320, 322, 329, 331, 339, 341, 344,
356, 370, 377–378, 389–390, 393, 419, 431, 449, 481–528
Python IDE 339
Python interpreter 320
q
quantile 39, 40, 109, 466–472
quantitative 157, 198, 199, 202, 216, 226, 228, 239, 246, 257,
382
query() 144
r
R 1, 4–19, 21–22, 23, 29–34, 35, 37, 44–49, 51–52, 56, 57, 59–
71, 83–91, 100–105, 106, 110–117, 122–131, 137, 138–144,
148–160, 165–170, 178–182, 184, 268, 272, 276, 299–300, 302,
309, 312–313, 315, 319, 322, 325, 329, 377, 389–479, 495, 522
radio buttons 207, 212–215, 220, 361–363, 365
railroad network 413, 417
railway station 413–418
raincloud plot 141–144
Ralli Car
ranges of values 33–34, 37–39, 469
r-color-palettes 31
reactive() 275–276, 279, 282, 289, 295, 298–299, 323–325
reactive action 274–276, 280, 290, 297, 344, 351, 357, 363,
372
reactive context 282, 289, 290, 298, 299, 322, 344
reactive event 268, 277, 282, 283, 289, 295, 296, 299, 342,
344–345, 357, 362, 371–376
reactiveExpr 276
reactive logic 268
reactive object 275, 276, 280, 282, 302
read_csv() 20, 23, 30, 34, 71, 91, 158, 197, 225, 235, 242,
273, 330, 341, 483, 515, 525
read_csv2() 30
read_excel() 110, 115–117, 398
read_html() 310, 314, 370, 417
read_xlsx() 398
reading 15, 44, 109, 110, 115, 116, 140, 141, 143, 149, 152,
154, 196, 269, 273, 282, 302, 322, 420–422, 460, 517, 518, 520
readxl 110
recalculation 289, 297, 298
rectangular form 157, 159, 161, 162, 179, 182, 184
region 149, 225, 226, 228, 229, 231, 235, 389, 392, 393, 395,
396, 398–405, 409, 410, 413, 474
remove_rownames() 179
rename() 30, 36, 52, 57, 158, 161, 165, 185, 235, 381
rename_axis() 185
renderDataTable() 286
rendering 118, 274–276, 286, 289, 290, 301, 320–323, 339,
356, 357, 378
renderPlot() 277, 279, 283, 288
renderTable() 275–277, 279
reorder() 31, 33, 89, 91, 180
repulsive textual annotation 13–14
reset_index() 34, 36, 52, 57, 119, 161, 170, 185, 189, 197,
225, 235, 237, 240, 244, 246, 248, 330, 341, 381, 483, 485, 487,
500, 516
resolve_legend() 216, 219
resolve_scale() 242, 244
reticulate 319–320, 322
reticulate::conda_list() 320
RETICULATE_PYTHON 319, 320
Rialto Bridge 433, 440, 442
ridgeline plot 147–156, 280, 282, 283
rnaturalearth 408, 410, 411, 413
road map 413, 415, 473, 474
rocket 22, 31
rodent 481, 515, 516, 518, 520, 522
Rome 273, 310, 311, 313, 391, 392, 406, 407, 450–453, 455,
457, 458, 460–479, 481
round() 60, 89, 91, 110, 115, 116, 197, 225, 242, 330, 341
Rprofile 319
r-reticulate 319, 320
RStudio 274, 286, 299, 308, 310, 320, 452
rug plot 168–170, 174, 175, 263, 265, 334, 336
run app 271, 274, 279, 321
rvest 309, 310, 313, 415, 416
s
save_html() 440, 443, 448
saveWidget() 479
scale by column 181, 182
scale by row 180, 182
scale_color_brewer() 166
scale_color_manual() 12–14, 18, 122, 126, 128, 129, 131,
139
scale_color_wsj() 16, 45
scale_fill_gradient() 65, 86, 89, 160
scale_fill_gradient2() 86, 89
scale_fill_manual() 33, 125, 126, 128, 140, 141, 144
scale_fill_tableau() 34, 111–116, 445
scale_fill_viridis_b() 11, 17, 18, 31, 46, 48, 67, 71,
101, 102, 104, 149, 152, 399, 404
scale_fill_viridis_d() 11, 17, 18, 31, 46, 48, 67, 71,
101, 102, 104
scale function 13
scale_size_binned() 418
scale_x_date() 16, 45
scale_x_discrete() 46, 48, 101, 102
scale_y_continuous() 16, 45, 67, 71, 169
scale_y_discrete() 46, 160
scaled 112, 179–181, 183
scales 1, 3, 5, 33, 47, 55, 56, 75, 78, 84, 94, 101, 102, 120,
169, 171, 189, 208, 210, 219, 242–244, 331, 351, 413
scatterplot 3–28, 29, 33, 44, 49–50, 84, 89, 113–114, 116,
121–126, 131, 132, 140, 141, 165–167, 169–171, 175, 188–191,
196–224, 225, 228–229, 231–232, 235, 248, 257, 263, 264, 280,
283, 325, 331–333, 336, 341–342, 351, 357, 361–363, 365–366,
380, 387, 416, 460, 462
scatterplot heatmap 188–191
science 19, 96, 177, 267–269, 273, 309, 319, 389, 419
SciPy 182
scrolling 208, 272
Sea Level Rise 481, 514, 515
Seaborn 1, 19–28, 29, 34–41, 49–58, 71–81, 83, 91–107, 117–
120, 130, 131–136, 144–148, 159, 160–163, 170–176, 178, 182–
191, 196, 257, 329, 334
Seaborn Objects 1, 19
second geometry 463
select() 24, 37, 84, 86, 93, 133, 135, 144, 145, 151, 162,
166, 207, 212–214, 219, 223, 231, 246, 247, 251, 253, 256, 257,
273, 277, 279, 287, 297, 298, 311, 312, 320, 323–325, 357, 361,
363, 401, 409, 413, 420, 461, 469, 515, 521
selected_header 287
selectInput() 273–275, 277, 279, 287, 297
selection 207–208, 210, 212–214, 223, 228–229, 231–232,
248, 249, 251–253, 257, 272, 275–277, 280, 282, 286–287, 289,
297, 314, 324, 326, 328, 355–366, 371, 373–375, 379–382, 387,
473, 527
selection with brush 214–221
selector 280, 283, 287, 289, 310, 312–315, 348–355, 357
SelectorGadget 313
server logic 271, 273–277, 279, 280, 288, 289, 297, 299, 321–
323, 325
session 271, 274, 276, 279, 298, 301
set() 21, 25, 40, 135, 171, 187, 189
set_axis_label() 50, 52, 55, 58, 106, 176
set_edgecolor() 191
set_facecolor() 95, 97
set_title() 21, 75, 79, 145, 146
setView() 433, 440, 443, 448, 473, 479
sf 400–401, 403, 406, 408–413, 419, 420–423, 425–427, 429,
433–445, 447, 448, 450–454, 460–466, 469, 495, 522
sf::as_Spatial() 412
sf::st_as_sf() 412, 447
sf_centroid() 463
sf_geometry() 463
sfheaders 450, 469
sfheaders::sf_to_df() 469
shape 8, 10, 15, 16, 44, 109, 111, 115, 116, 118, 119, 125,
126, 128, 131, 135, 139–142, 166, 203, 223, 289, 326, 389, 392,
418–422, 429, 433, 450–452, 455, 457, 463, 466, 470, 472, 473,
479, 482, 490, 511
shape file 389, 392, 419–422, 425, 426, 433, 452, 457
sharex 56
sharey 56
shift() 93, 239
Shiny 268, 271–327, 342, 344, 377
shiny 272–273, 320
shinyApp() 271, 274, 279, 307, 321
shinydashboard 303–321
shinythemes 280, 286, 296
shinyWidgets 286
show.legend 32–34, 44, 46, 86, 89, 91
sidebar 272, 295–303, 307, 308, 320, 321, 355–365, 369, 371,
379
sidebarLayout() 296, 321
sidebarMenu() 307
sidebarPanel() 296, 321
similarity 178, 180, 181
Simple Features 400, 408–413, 460–479
sina plot 121–136, 138
sinaplot() 134, 135
SITAR 392, 452
size 10, 11, 13–14, 16–18, 21–22, 24, 25, 38, 44, 46, 49, 61,
64, 73, 86, 89, 91, 111–112, 114–116, 118, 123, 125–126, 129,
131, 133, 139–142, 144, 149, 156, 159–160, 166, 169, 174, 190,
196, 198–203, 205, 208, 216, 220, 223, 228–230, 244, 257, 263,
272, 277, 279, 283, 289, 326, 331, 334, 345, 348, 352, 357, 360,
363, 379, 382, 399, 404, 406–408, 416, 418–419, 438, 447, 450–
451, 457, 463, 466, 483, 485, 487, 492, 498, 505, 508, 525
size_norm 190
Skills Survey 109, 110, 121, 137, 148, 149
skiprows 20
skyblue3 10, 12, 18, 440, 457
slate 361
slice_max() 324
slice_min() 324
slider 251–254, 257, 258, 272, 297, 342, 344–346, 349, 351,
352, 357, 360, 362–364, 387
sns.barplot() 34–38, 55, 58, 94–96, 119
sns.boxplot() 105, 106, 171
sns.catplot() 49, 52, 54, 106
sns.clustermap() 182
sns.color_palette() 37, 40, 72, 74, 135
sns.countplot() 38, 40
sns.despine() 120, 145, 146, 172, 173
sns.displot() 49, 74
sns.diverging_palette() 187–189
sns.FacetGrid() 54–56, 58
sns.heatmap() 162, 163, 186–188
sns.histplot() 71–74, 76, 78, 79
sns.JointGrid() 175
sns.jointplot() 173
sns.lineplot() 25, 27
sns.move_legend() 24, 36, 37, 106
sns.relplot() 49–51, 190
sns.rugplot() 174, 175
sns.scatterplot() 21, 22, 24, 27, 171, 175
sns.set() 21, 25, 52, 54, 58, 94, 96, 106, 163, 170, 173,
182, 187, 189
sns.set_theme() 52, 54, 58, 94, 96, 106, 163, 170, 173, 182
sns.stripplot() 133
sns.swarmplot() 133
sns.violinplot() 117, 120, 144, 145
solar 365
sort_values() 34, 35, 93, 161, 334, 364, 373, 374, 381, 483,
485, 487, 516, 526
sorted bars 246–247
sp 408–413, 419, 448, 465, 468
sp::plot() 409
spatial data 389–398, 401, 408–413, 419, 460, 462, 508, 522
SpatialPolygonsDataFrame 410, 413
srcDoc 380, 381
St. Mark’s Square 440, 442, 443
st_as_sfc() 439, 453, 454, 457
st_bbox() 439, 453, 454, 457
st_crs() 406, 407, 414, 418, 429, 438, 439, 445, 446, 461
st_geometry() 401, 464, 465
st_intersection() 460, 461
st_read() 420, 421
st_set_crs() 422–425
st_transform() 427, 437–439, 451, 452
Stack Overflow 138
stack() 72, 79, 189, 427, 493, 500
stacked 8, 31, 37, 38, 40, 43, 57, 72, 74, 79–81, 128, 129, 244,
247, 248, 251, 363, 372, 420, 431
stacked bar plot 43, 244–248
stackratio 142, 144
Stamen 430–433, 435, 440, 441, 470, 509–511
standard_scale 182
standardized 179
stat 31, 33, 46, 48, 86, 89, 238, 410
state level 196, 392
static graphic 1, 193, 196–200, 205, 225, 231, 235–239, 267,
326, 329
statistics 1, 4, 60, 63, 99–101, 109, 177, 182, 195, 253, 391,
398, 481, 525
stats 83, 178–181, 195
stats::heatmap() 178–181
Statue of Liberty 510
str() 410
strip plot 51, 122, 131–134
stroke 10, 256, 326, 448, 523, 525
style option 10, 18, 19, 24, 25, 30, 36, 37, 40, 46, 63–65, 94,
95, 124, 126, 127, 140, 162, 168, 190, 198, 200, 242, 280, 281,
283, 284, 286, 303, 326, 352, 355–365, 371, 396, 406, 416, 418,
451, 484, 487, 488, 501, 508, 514, 517, 521
subplot 170, 172
Subway stations 481, 523–528
sum() 30, 34, 36, 46, 47, 52, 57, 158, 161, 165, 170, 226, 228,
237–240, 244, 246, 248, 381, 418
summarize() 30, 46, 47, 151, 158, 165, 276, 282, 323, 324
Superfishel Stone 116
SVG 200
swarm plot 51, 131, 133, 134
Sweden 409–412
switch 142
symlog 78, 79
symmetric log 78–81
synchronized bar plots 247–251
sys.path.append() 135
Sys.setenv() 320
t
tab 274, 288–291, 303, 307, 315, 316, 320, 322, 326, 339, 365,
369, 371, 374, 375, 377–379, 382, 383, 387, 452
tab.css 377, 382
tableOutput() 273, 274, 276, 279
tabPanel() 287, 288, 303, 323
tabsetPanel() 286–288, 303
tag 310, 312, 313, 342, 344, 362
teal 175, 239, 244, 248, 344, 348
temperature 4, 15–17, 23–25, 44–45, 50, 51, 59–70, 148, 149
terrain 421, 425, 427, 430–432, 470, 509–511
Thematic 286, 287, 294, 448–460
thematic map 448–460
thematic_shiny() 286
theme() 16, 286, 294
theme_bw() 160, 279
theme_clean() 31, 46, 49, 61, 63, 149, 156, 159
theme_hc() 104, 111, 112, 116, 122, 139, 140, 144, 166, 169
theme_light() 10, 11, 13, 14, 16–18, 33, 45, 64, 86, 89,
100, 102, 131, 152
theme_minimal() 34, 61, 91, 289
theme_void() 396, 397, 399, 401, 404, 419, 424, 425, 427,
429, 431, 438, 439, 445, 446
themeSelector() 280
threshold 12–14, 22, 33, 196, 197, 219, 430
tibble 4, 10, 30, 45, 46, 48, 158, 310, 313, 315, 321, 395, 396,
417, 462, 465, 470
tick_params() 41, 50, 52, 54, 55, 163, 187
tidyverse 4, 169, 272, 273, 309, 313, 320, 392, 415, 419, 450,
469
tile map 430, 431, 433, 435, 440, 442, 447, 488
tiled web map 430, 432–435, 440, 441, 450, 457, 458, 487, 493,
494, 496, 498, 508–510
time series 10, 15, 21, 29, 43, 84, 86, 117, 240
timereg 41
Times Square 511
titlePanel() 271, 273, 277, 321
tm_basemap() 470, 473
tm_borders() 450, 455, 457, 473
tm_dots() 473, 479
tm_fill() 450, 479
tm_layout() 450, 455, 457, 463, 466, 479
tm_polygons() 451, 455, 463, 466, 470, 472, 473
tm_shape() 450, 451, 455, 457, 463, 466, 470, 472, 473, 479
tmap 448–461, 463, 465–467, 469, 470, 473–475, 477, 479
tmap_mode('plot') 451, 455, 463
tmap_mode('view') 451, 455, 457
tmap_options() 451, 457
to_dict() 345
to_list() 161
tolist() 96
Toner 431–433, 440, 470
TopoJSON 425
transform_filter() 215, 216, 232, 246, 248, 251–253,
256, 257, 260, 262, 322, 326, 527
transform_window() 215, 216
transpose 187
u
unique() 341
united 377
United Kingdom 10, 22
United Nations 195, 197, 329, 330, 341
United States 10, 21, 93, 95–97, 251, 252, 262, 264, 265, 287,
322, 364, 371, 395
univariate analysis 60–63, 71–74
Unsplash 511, 512, 514
update_traces() 334
update_layout() 363, 364, 373, 374, 485, 487–489, 492,
495, 497, 503, 507
updateCheckboxGroupInput() 298, 301
URL 83, 196, 309, 310, 314, 370, 380
US Department of Housing and Urban Development 196, 251,
256, 260
usability 272, 275
user interface 268, 271–274, 276, 277, 280, 286–288, 296, 297,
301, 307, 321, 322
v
value_name 20, 24
values 3–6, 10–18, 23, 25, 29–31, 33–35, 37–39, 43–44, 46,
48, 59–60, 63, 67, 78, 84–86, 89, 93–94, 96, 105, 115, 118, 120,
122, 125–126, 128–129, 131, 134, 138–141, 144–145, 149, 151,
157–159, 161, 163, 165–166, 172–173, 177–182, 184, 186–187,
197–198, 202, 207, 212, 214–215, 219, 223, 226, 232–233, 239–
240, 246, 251–253, 257, 260, 262–263, 275–277, 280, 282, 287,
295, 297–302, 307, 322–326, 330, 331, 334, 341–342, 345, 351–
352, 361–364, 370, 372–374, 380, 393–394, 398, 406, 410, 422,
427–428, 447, 450, 460–461, 466, 469, 483–485, 487, 491, 496,
500, 516, 521, 523, 526
values_from 178
values_to 15, 110, 116
var_name 20, 24
variable 3, 5–9, 13, 17–18, 21, 25, 29, 31–39, 43–44, 47–52,
54, 57, 59–60, 63, 67, 71–76, 83–84, 91, 94–95, 99–107, 109,
115, 117–118, 122, 130, 133, 138, 142, 144–145, 149, 157, 166,
168, 170, 174, 177, 184, 196, 198–199, 201–203, 205, 207–208,
213, 215–216, 220, 228, 235, 242, 244, 246, 257, 260, 262–263,
271–273, 275–276, 282, 307, 310, 313, 319, 325, 331, 334, 341–
342, 345, 351, 356, 361–362, 366, 369, 379, 401, 404, 408, 410–
413, 421, 431, 442, 462, 466, 489, 493, 495, 523, 527
vars() 142
Vega 194, 319, 320, 326
Vega-Lite 194, 319
Vegawidget 319–323
vegawidget::renderVegawidget() 320, 321, 323
vegawidget::vegawidgetOutput() 320, 322, 323
Venice 392, 420–448, 450, 481
view mode 451, 452, 454, 455, 458, 466
villas 452, 454–456
violin plot 51, 109–115, 117–128, 131, 137–142, 144, 145
Viridis 10, 11, 17, 18, 25, 27, 31, 38, 40, 46, 48, 67, 71, 101,
102, 104, 149, 152, 169, 199, 205, 208, 216, 289, 331, 351, 399,
404, 472, 485, 486
virtual column 134, 272
virtual environment 319, 320, 495
virtual grid 272
visual communication 31, 109
vjust 16, 61, 89, 91, 156, 404, 408, 419
vroom 273
vroom() 15, 273
w
waiver() 64, 67, 71
waterways 421, 425, 427, 429, 438
WDI 9
web browser 274, 310, 339, 378, 380
web scraping 303–321, 323, 365–376, 383, 415, 417
WGS 426, 427, 435, 436, 439, 440, 462, 465
whisker 99
wide form 159, 161, 163, 178, 184, 197
widget 207, 251, 283, 286, 287, 295, 297–301, 342, 348, 361,
371, 445, 466
width_ratios 170
Wikipedia 309, 310, 313, 315, 370–372, 374, 375, 414, 417
Windows 311, 449
World Bank 4, 9, 20, 21
World Geodetic System 406
world map 392, 393, 410, 413, 433
WorldImagery 433
The World Trade Center 509, 511, 512, 514
wrapper 431
x
X() 239
xaxis.set_tick_params() 41, 163
xend 89, 91
xlim 79, 171, 173, 176, 413, 414, 418, 429, 439, 445, 446
xmax 421, 425, 427, 428, 439, 453, 454, 457, 462, 465
xmin 421, 422, 425, 427, 428, 439, 453, 454, 457, 462, 465
XML 310, 312–314, 340
xparams 166, 167
y
Y() 226, 228
yaxis.set_label_position() 172
yaxis.tick_right() 172
year() 16, 44, 60–61, 63, 65, 67, 71, 165
yend 89, 91
yintercept 14
ylim 171, 173, 176, 413, 414, 418, 427, 429, 439, 445, 446
ymax 421, 425, 427, 428, 439, 453, 454, 457, 462, 465
ymin 421, 422, 425, 427, 428, 439, 453, 454, 457, 462, 465
z
zip codes 481, 483, 484, 491, 493, 504, 515, 523
zip() 95, 97
WILEY END USER LICENSE
AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.