Covid 19

The document discusses visualizing and predicting COVID-19 data. It describes the goals of understanding misconceptions, interpreting visualizations, and predicting cases and deaths. It outlines the data sources and processing, defines key terms, and discusses visualization tools and techniques like aggregation. Visuals were created in Plotly to examine worldwide trends, compare countries and US states, and perform time series forecasting.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/348404372

Visualizing and Predicting with COVID-19

Presentation · January 2021


DOI: 10.13140/RG.2.2.26570.59840


1 author:

Erika Diaz
Rowan University

All content following this page was uploaded by Erika Diaz on 12 January 2021.



Visualizing And Forecasting with Covid-19
Erika Diaz
Overview
● Introduction
○ Goal
○ Why COVID-19?
○ Data
■ Source(s)
■ Processing
■ Terminologies
● Visualization
○ Tools
○ Aggregation Terminologies
○ Worldwide
○ United States vs Other Countries
○ U.S. States
● Time Series Forecast
○ Algorithms
○ Predicting USA Cases
○ Predicting NJ Cases and Deaths
● Conclusion
Introduction
● Ever heard smart-sounding terms such as “quantum physics” used as a common phrase in movies when talking about something highly intelligent or even impossible?
● Quantum physics in Transformers, an eigenvalue and an inverted Möbius strip used to create a time machine in Avengers: Endgame, etc.

● Not a lot of viewers know what these words mean!

● This is a technique used to make someone sound credible.
● Data is being used the same way...
● Dr. Anthony Breitzman gave a presentation on bad and good visualizations of the coronavirus data (https://www.researchgate.net/publication/346784309_Data_Science_in_the_Time_of_COVID-19)
● Some of them were created to support personal opinions, such as TESTING CAUSES POSITIVE CASES TO RISE!
● Testing indicates whether one is positive or not, but getting tested does not give you the virus.
● I have been tested twice every week since May, and I have never been COVID positive.
Goal
● To study COVID-19 data to avoid believing misconceptions that the media and those in power try to impose.
● Learn how to interpret different visualizations
● Assess how closely cases and deaths can be predicted

How?

● Use visualization tools and time series prediction algorithms


Why Should We Care About COVID-19?
● There are a lot of misconceptions about SARS-CoV-2
● “It is not lethal!”
○ This is true, but it has a 20% higher mortality rate than the flu!
○ “The issue isn’t the lethality as much as the overall impact of the outbreak. While these other diseases like H1N1, MERS, SARS, may be more lethal, the combination of reproductive factor (R0), receptivity in the population (susceptibility), and immunity may make them much more manageable.”
● “Only older adults and people with preexisting conditions are at risk of infections and complications”
○ That is true, but no one is safe from complications and death.
○ COVID-19 is new to the population. So unless you’re an animal, you’re not immune!
○ Ever heard of adults in their 30s or 40s having a stroke due to COVID?
Why Should We Care About COVID-19?
● A lot of people lost their lives.
● In New Jersey alone, as of December 9, there have been 17,608 deaths attributed to the disease.
Why Should We Care About COVID-19?
● A lot of people lost their jobs.
● “The unemployment rate peaked at an unprecedented level, not seen since data collection started in 1948, in April 2020 (14.7%) before declining to a still-elevated level in November (6.7%).”
● “In April, every state and the District of Columbia reached unemployment rates
greater than their highest unemployment rates during the Great Recession.”
Data Sources
● Worldwide Data
○ Our World in Data (www.ourworldindata.org)
■ By Oxford University
■ Collects from the European Centre for Disease Prevention and Control (ECDC) & JHU
■ Used by Harvard, Stanford, Cambridge, the NYT, Wall Street, Nature Journal
■ Data from 12/31/2019-11/08/2020
● United States Data
○ CDC (www.cdc.gov)
○ The Covid Tracking Project (www.covidtracking.com)
■ Collects from local/state public health authorities
■ Used by Johns Hopkins University
■ Data from 1/22/2020-11/08/2020
Data Cleaning
● Worldwide Data - for Visualization
○ Remove columns
○ Make date format uniform
○ Fill nulls with 0s
○ Change data types
● State Data - for Prediction
○ Remove columns that are not needed
○ Merge CDC dataset and Covid Tracking Project dataset
■ Covid Project has NYC and New York State -- combine
■ Cases, Deaths from CDC, Recovered and Testing from Covid Project
■ Left outer join
■ Join on state and date column--make date format uniform
■ Fill nulls with 0s
■ Change data types
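The state-data steps above can be sketched with pandas. This is only an illustration under stated assumptions: the two small DataFrames, their column names, and the state codes are made up for the example and are not the actual CDC or Covid Tracking Project schemas.

```python
import pandas as pd

# Hypothetical miniature versions of the two sources (column names are assumptions).
cdc = pd.DataFrame({
    "state": ["NJ", "NJ", "NY"],
    "date": ["11/07/2020", "11/08/2020", "11/08/2020"],
    "cases": [3200, 2000, 3400],
    "deaths": [10, 12, 18],
})
ctp = pd.DataFrame({
    "state": ["NJ", "NY", "NYC"],
    "date": ["2020-11-08", "2020-11-08", "2020-11-08"],
    "recovered": [500.0, 700.0, 300.0],
    "tests": [40000.0, 90000.0, 60000.0],
})

# Make the date format uniform before joining.
cdc["date"] = pd.to_datetime(cdc["date"], format="%m/%d/%Y")
ctp["date"] = pd.to_datetime(ctp["date"], format="%Y-%m-%d")

# Combine NYC with New York State by relabelling and summing per state and date.
ctp["state"] = ctp["state"].replace({"NYC": "NY"})
ctp = ctp.groupby(["state", "date"], as_index=False)[["recovered", "tests"]].sum()

# Left outer join: every CDC row is kept, matched on state and date.
merged = cdc.merge(ctp, on=["state", "date"], how="left")

# Fill nulls with 0s, then change data types.
merged = merged.fillna({"recovered": 0, "tests": 0}).astype({"recovered": int, "tests": int})
print(merged)
```

A left join is used so that a day missing from the second source still appears, with its recovered and testing values filled with 0.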
Why am I using different sets of data?
● Worldwide
○ Does not have U.S. state data
○ Shows us where our country stands against the world
○ Already a collection of datasets
● United States State Data
○ Focuses on U.S. states and territories
○ More reliable reporting?
○ RAW DATA!
Data Terminologies
● Confirmed Cases / Cases
○ Confirmed + probable new cases
○ Some countries do not report probable cases
○ Some U.S. states do not report probable cases because they follow older criteria for diagnosis
● Criteria for Covid Diagnosis
○ April 15 version
■ fever (measured or subjective), chills, rigors, myalgia, headache, sore throat, new olfactory
and taste disorder(s)
○ August 5 Version
■ Same as the former version, but now also includes nausea or vomiting, diarrhea, fatigue, congestion, runny nose, cough, shortness of breath, pneumonia or acute respiratory distress syndrome
Data Terminologies (continued)
● Confirmed Deaths / Deaths
○ New Fatalities with confirmed or probable COVID diagnosis
● Tests/Testing
○ Testing which includes rapid test, antigen, PCR test, and antibody test
● ICU Patients
○ # of COVID-19 patients in intensive care units (ICUs) on a given day
● Hospital Patients
○ # of COVID-19 patients in hospital on a given day
● Reproductive Rate (see next slides)
● Stringency Index (see next slides)
Reproductive Rate, R0
● Real-time estimate of the effective reproduction rate (R) of COVID-19
○ The average number of people each person with a disease goes on to infect
○ If R is > 1.0, the virus will spread quickly. When R is < 1.0, the virus will stop spreading
○ Range from 0-6
● Not used by a lot of countries when implementing policies
○ One reason is that it does not take ‘superspreaders’ into account
■ They pass on the disease many more times than average
■ They flock to crowded, indoor events where the virus spreads more easily
○ Represents only an average across a region
■ Misses clusters
○ No uniform algorithm
■ Epidemiologists each have their own approach to combining and using sources of data to work
out R, relying on their own statistical models to look at trends in presumed infections
● When countries consider when to reopen schools and offices, a key question is not
only R, but what the actual number of infected people walking around is.
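To make the R > 1 versus R < 1 threshold concrete, here is a deliberately simplified sketch. It assumes every infected person infects exactly R others on average in each generation and ignores immunity, superspreaders and clusters; the function name and the starting count are made up for illustration.

```python
def infections_by_generation(initial_cases, r, generations):
    """Expected new infections per generation if each case infects r others on average."""
    return [initial_cases * r ** g for g in range(generations + 1)]

growing = infections_by_generation(100, 1.5, 4)    # R > 1: each generation is larger
shrinking = infections_by_generation(100, 0.5, 4)  # R < 1: each generation is smaller

print(growing)
print(shrinking)
```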
Stringency Index
● Government’s policy responses to COVID
● Based on 18 indicators
● The data from indicators is aggregated into a set of four common indices,
reporting a number between 1 and 100 to reflect the level of government action
on the topics in question
● Indicators include school closures, workplace closures, public events and gathering restrictions, public transport and travel control, testing policy, contact tracing, face coverings, stay at home orders, public info campaigns, vaccines, etc.
Stringency Index: Four Indices
● Government response index
○ records how the government response has varied over all indicators in the database, becoming stronger or weaker over the course of the outbreak
● Containment and health index
○ combines ‘lockdown’ restrictions and closures with measures such as testing policy and contact tracing, short term investment in healthcare, as well as investments in vaccines
● Economic support index
○ measures such as income support and debt relief
● Original stringency index
○ records the strictness of ‘lockdown style’ policies that primarily restrict people’s behaviour
Visualization
Visualization Tools
● All visuals used (line graphs, bar graphs, worldwide and country maps, and time series prediction graphs) were created using Plotly
● Plotly is a visualization package that can be used in both R and Python
● Plotly is interactive!
○ Loses the interactive format when imported to slides/ppt.
○ I tried taking advantage of this feature but I couldn’t make it work.
○ Apparently, Microsoft had a Plotly add-in before, but it no longer exists!
Aggregation Terminologies
● Total
○ Cumulative, by day, counted since the first case was recorded
● Total per 1,000,000 or 100,000 people
○ Cumulative, by day, using the rate per 1,000,000 or 100,000 people
○ Worldwide Cases/Deaths uses 1,000,000
○ Worldwide Testing uses 100,000
○ Normalizes population data
● Weekly/Smoothed
○ Instead of daily, data was aggregated over 7 days.
○ Provides a smoother looking graph, “Smoothing”
○ Done to account for weekends, when little testing or discharging is done
● Weekly Averaged
○ Data aggregated weekly is averaged
Aggregation Terminologies (continued)
● I am presenting data in these various ways since a lot of sources use different ways
to represent data
● All data are reported as of November 8, 2020.
● Can’t drop null values since we do not have a lot of data
Visualizing Worldwide
Cases per Continent
Deaths per Continent
Worldwide Weekly Increase
This is the number of new cases reported every week, worldwide.

Recall that the dataset started recording on 12/31.

The data was aggregated weekly starting on Monday, 12/30, to account for 12/31. Every week starts on a Monday.
Worldwide Weekly Average
For each week, all new cases are added. They are then divided by 7, the number of days in a week, hence ‘averaged’.
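The weekly sum and weekly average described on these slides can be sketched in plain Python. The daily counts below are made up, and the helper names are hypothetical; weeks start on Monday, as in the slides.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(day):
    """Return the Monday of the week that contains `day`."""
    return day - timedelta(days=day.weekday())

def weekly_totals(daily):
    """Add up all new cases that fall in the same Monday-to-Sunday week."""
    totals = defaultdict(int)
    for day, new_cases in daily.items():
        totals[week_start(day)] += new_cases
    return dict(sorted(totals.items()))

def weekly_averages(daily):
    """Divide each weekly total by 7 (days), even if the week is only partly observed."""
    return {monday: total / 7 for monday, total in weekly_totals(daily).items()}

# Made-up daily counts starting on 12/31/2019, the first recorded date.
daily_new_cases = {date(2019, 12, 31) + timedelta(days=i): 10 * (i + 1) for i in range(9)}

print(weekly_totals(daily_new_cases))
print(weekly_averages(daily_new_cases))
```

The first bucket is labelled 12/30/2019, the Monday of the week containing 12/31, which mirrors how the slides handle the first date.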
Worldwide Weekly Increase Per 1M
The per-1M rate is obtained by dividing the number of new cases by the population in that specific week and multiplying the result by 1M. The same method applies for 100,000.

For ICU and Hospital, the total number of deaths at that week is used instead.
Distribution of Cases
For the U.S., the cases-per-million rate is 29,791.172.

The current population is 328.2M.

So 29,791.172 x 328.2 = 9,777,462.65.

That is the total number of cases in the U.S.
Distribution of Deaths
For Brazil, the deaths-per-million rate is 763.4054.

The current population is 209.5M.

So 763.4054 x 209.5 ≈ 159,933.

That is the total number of deaths in Brazil.
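The per-million arithmetic on these slides can be checked with a few lines of Python. The function names are made up for illustration; the rates and populations are the ones quoted on the slides.

```python
def per_million(count, population):
    """Rate per 1,000,000 people: count divided by population, times 1M."""
    return count / population * 1_000_000

def total_from_rate(rate_per_million, population_in_millions):
    """Invert the rate: per-million rate times the population expressed in millions."""
    return rate_per_million * population_in_millions

us_total_cases = total_from_rate(29791.172, 328.2)
brazil_total_deaths = total_from_rate(763.4054, 209.5)

# Round trip: converting the total back gives the original rate.
us_rate = per_million(us_total_cases, 328_200_000)

print(round(us_total_cases), round(brazil_total_deaths), round(us_rate, 3))
```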
Distribution of Testing
This shows the total number of tests given per country.

U.S. has the highest number of cases. Next is India, and then Russia.

This shows how much testing is implemented.
Reproductive Rate per Country
Shows the highest infection rate a country ever had.

Remember, R is the average number of people each person with a disease goes on to infect.

It is not the # of infected people walking around.
Stringency Index per Country
This is the strictest a country has ever been at one point.

As we can see, the U.S. never implemented an index of 100.
Visualizing Countries
Top Countries With Covid Cases
Top Countries With Covid Deaths
Cases per Country
Deaths per Country
United States Versus Other Countries
● Since not all countries report ICU and hospitalization data, let’s compare the U.S. case numbers with those of a few countries, including China!
Comparing New Cases per 1M
Who has flattened the curve?

Can we see second waves?
Comparing Stringency Indices
How strict have countries been over time?

It seems that the U.S. has stayed the same while Spain became lenient.
United States Vs. Italy
● Since not all countries report ICU and hospitalization data, let’s compare the U.S. with one of the hard-hit countries that reports ICU and hospitalization data
Italy
Every day a person is in the hospital, he or she is recorded.

This shows that at a specific period, particularly at the beginning of the pandemic, almost everyone with COVID was or stayed in the hospital.

We can attribute this to isolation purposes.
United States
The number of new cases seems to follow the total number of hospitalizations after about a week or two.
Visualizing The United States
Weekly Moving Average of Cases
This is the averaged U.S. cases in a week.

Recall that the first record in the dataset started on 01/22/2020.

To account for this date, the data was aggregated weekly, and the week started on 01/20, which is a Monday.
Weekly Moving Average of Deaths
We see two sharp spikes in deaths…

From the last slide, we saw two spikes in cases as well.
U.S. Weekly Moving Average
This is proof that higher testing does not lead to a higher number of cases!
Distribution of Cases
California, Texas and Florida now have the highest total number of cases, even though the first hotspots were in New Jersey and New York!
Distribution of Deaths
Despite the lower number of cases in New York now, the state still has the highest number of deaths recorded.
Distribution of Testing
Weekly Moving Average of Some States
New York and New Jersey had spikes in the beginning, being the hotspots, but over time, the states have been doing well compared to other states like Florida!
Weekly Moving Average in NJ
We can see that the spike in the number of cases was followed by a spike in deaths.
Time Series Forecast
Time Series Prediction
● What is Time Series?
○ series of data points over time
● Components of a Time Series
○ trend - the increasing or decreasing values in a series
○ seasonality - the repeating short-term cycles in the series
○ random white noise - random variation in the series
● What is Prediction or Forecasting?
○ Forecasting is simply the process of using past data values to make educated predictions on future data values
● Time Series Prediction
○ Using a series of past data points over time to make educated predictions on future values
Time Series Prediction
● We want our time series to be stationary
● Stationary means having a constant mean and variance across the time series.
● A time series needs to be stationary in order for these models to make good predictions

● So far we’ve seen that COVID data is not stationary, but we’ll do our best to make predictions
● Will be using a few algorithms to predict cases in the U.S. only.
● The best algorithm will be used to predict cases and deaths in New Jersey
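As an informal illustration of “constant mean and variance”, the sketch below compares the mean and variance of the first and second halves of a series. This is only a rough heuristic, not a formal stationarity test, and both example series are made up.

```python
from statistics import mean, pvariance

def half_stats(series):
    """Return (mean, variance) for the first half and for the second half of a series."""
    middle = len(series) // 2
    first, second = series[:middle], series[middle:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))

steady = [50, 52, 48, 50, 50, 52, 48, 50]   # fluctuates around a fixed level
rising = [100 + 20 * t for t in range(8)]   # has a clear upward trend

print(half_stats(steady))   # both halves look alike
print(half_stats(rising))   # the mean shifts between halves, so not stationary
```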
Time Series Algorithms
● ARIMA Models
○ Auto Regressive (AR)
○ Moving Average (MA)
○ Autoregressive Integrated Moving Average (ARIMA)
○ Seasonal Autoregressive Integrated Moving Average (SARIMA)
● Holt Models
○ Holt Linear Model
○ Holt-Winters
● Machine Learning Models
○ Linear Regression
○ Polynomial Regression
○ Support Vector Machine
ARIMA Models
● Auto Regressive (AR) Integrated (I) Moving Average (MA)
● AR Model
○ uses observations from previous time steps as input to predict the value at the next step
● MA Model
○ models the next observation as the series mean plus a weighted sum of past forecast errors, not as a plain average of past observations
● Auto ARIMA
○ Combines the two models above and also takes into account the non-seasonal differencing needed for stationarity.
○ Makes non-stationary data stationary by removing trends
● SARIMA
○ Extension of ARIMA
○ Adjusts a non-stationary time series by removing trend and seasonality.
● In Python, ‘from pmdarima import auto_arima’
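To show how the “AR” and “I” parts fit together, here is a from-scratch sketch of the simplest case: difference the series once to remove the trend, fit an AR(1) model to the differences by least squares, then add the forecast differences back. It is not pmdarima's implementation (auto_arima also searches over model orders), and the series and function names are made up for illustration.

```python
def difference(series):
    """First difference: removes a linear trend (the 'I' in ARIMA)."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] (the 'AR' part)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev, mean_next = sum(x_prev) / n, sum(x_next) / n
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(x_prev, x_next))
    var = sum((p - mean_prev) ** 2 for p in x_prev)
    phi = cov / var
    return mean_next - phi * mean_prev, phi

def forecast_differenced_ar1(series, steps):
    """Forecast by predicting the next differences and adding them back to the last value."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    last_value, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        last_value += last_diff
        forecasts.append(last_value)
    return forecasts

# Made-up cumulative series whose daily increases shrink over time.
series = [100, 110, 117, 122.5, 127.25, 131.625]
print(forecast_differenced_ar1(series, 2))
```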
Holt Models
● Exponential Smoothing
○ weighted averages of past observations, with the weights decaying exponentially as the
observations get older.
○ In other words, the more recent the observation, the higher the associated weight.
● Holt Linear Model
○ Builds upon simple exponential smoothing (SES), which is a method suitable for forecasting data with no clear trend or seasonal pattern.
○ Extends SES by allowing the forecasting of data with a trend
● Holt-Winters Model
○ Extends the linear model by capturing seasonality
● In Python, ‘from statsmodels.tsa.api import Holt, SimpleExpSmoothing, ExponentialSmoothing’
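The recursions behind simple exponential smoothing and Holt's linear method are short enough to write out by hand. The sketch below is a plain-Python illustration with fixed, made-up smoothing parameters; it is not the statsmodels implementation, which also estimates the parameters from the data.

```python
def simple_exp_smoothing(series, alpha):
    """One-step-ahead SES forecast: recent observations get exponentially more weight."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def holt_linear(series, alpha, beta, steps):
    """Holt's linear method: SES extended with a trend term."""
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, steps + 1)]

print(simple_exp_smoothing([10, 20, 30], alpha=0.5))               # lags behind the rising data
print(holt_linear([10, 20, 30, 40], alpha=0.5, beta=0.5, steps=2))  # continues the trend
```

On steadily rising data the SES forecast stays below the latest observation, while the Holt forecast keeps climbing, which is the reason for adding the trend term.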
Machine Learning Models
● Regression is a predictive modelling technique which investigates the relationship between a dependent and an independent variable
● Linear Regression
○ Uses a linear relationship to predict the average value of Y for a given value of X using a straight line, the regression line.
● Polynomial Regression
○ Fits a polynomial curve to data that is correlated but does not look linear
○ Reduces errors that would otherwise be produced by a linear regression line
● In Python, both are available in scikit-learn
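A minimal scikit-learn sketch of the difference between the two fits, using made-up curved data. The degree and the data are assumptions chosen for illustration, not the settings used for the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data that curves upward: cases = 5 + 2 * day^2.
days = np.arange(10).reshape(-1, 1)
cases = 5 + 2 * days.ravel() ** 2

linear = LinearRegression().fit(days, cases)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(days, cases)

def fit_error(model):
    """Root mean squared error of a fitted model on the training data."""
    return float(np.sqrt(np.mean((model.predict(days) - cases) ** 2)))

print(fit_error(linear), fit_error(quadratic))
print(quadratic.predict([[10]]))  # extrapolate one day ahead
```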
Machine Learning Models (continued)
● Support Vector Machine Model Regressor
○ SVM: tries to find a line/hyperplane that separates classes. Then it classifies a new point depending on whether it lies on the positive or negative side of the hyperplane
○ SVR uses the same principle as SVM, but for regression
○ Acknowledges the presence of non-linearity in the data
○ Linear regression uses a regression line; SVR uses a hyperplane
○ Support vectors: data points on either side of the hyperplane that are closest to the hyperplane
■ used to plot the boundary line
○ SVR tries to fit the best line within a threshold value, the distance between the hyperplane and the boundary line
■ Unlike ordinary regression models, it does not try to minimize every error between the real and predicted values; errors inside the threshold are ignored
● In Python, available in scikit-learn
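A minimal scikit-learn sketch of support vector regression on made-up data. The kernel, `C` and `epsilon` values are assumptions picked for illustration; `epsilon` plays the role of the threshold described above.

```python
import numpy as np
from sklearn.svm import SVR

# Made-up, perfectly linear data: cases = 50 + 3 * day.
days = np.arange(20).reshape(-1, 1)
cases = 50 + 3 * days.ravel()

# Errors smaller than epsilon contribute no loss; C controls how strongly larger errors are penalized.
model = SVR(kernel="linear", C=100.0, epsilon=1.0).fit(days, cases)
predictions = model.predict(days)

print(np.max(np.abs(predictions - cases)))  # largest training error
print(len(model.support_))                  # number of support vectors
```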
Forecasting U.S. Data
Before we proceed...
U.S. state data on cases and deaths were aggregated weekly using the mean (weekly average).

This was used for the time series prediction in both the U.S. and NJ.

A must for time series since COVID data is volatile!


ARIMA: AR
ARIMA: MA
ARIMA: SARIMA
ARIMA: Auto ARIMA
Holt Linear
Holt Winter
Linear Regression
Polynomial Regression
SVM Regressor
Prediction Outcomes: RMSE
Prediction Outcomes: # of Cases
New Jersey Prediction
● Used SARIMA, Holt-Winters, Auto ARIMA, AR and MA models to predict NJ cases and deaths
● With Cases
○ SARIMA, RMSE of 182.125
○ Holt-Winters, RMSE of 472.407
○ Auto ARIMA, RMSE of 944.003
○ AR Model, RMSE of 991.912
○ MA Model, RMSE of 1312.273
● With Deaths
○ Auto ARIMA, SARIMA and AR are tied, RMSE of 5.983
○ MA Model, RMSE of 8.3708
○ Holt-Winters, RMSE of 34.591
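RMSE (root mean squared error) is the score used to rank the models: the square root of the average squared difference between actual and predicted values, where lower is better. The sketch below computes it on made-up numbers and then picks the lowest score from the NJ case results listed above.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up example values.
print(rmse([100, 200, 300], [110, 190, 300]))

# RMSE values for NJ cases as reported on the slide.
nj_case_rmse = {
    "SARIMA": 182.125,
    "Holt-Winters": 472.407,
    "Auto ARIMA": 944.003,
    "AR": 991.912,
    "MA": 1312.273,
}
best_model = min(nj_case_rmse, key=nj_case_rmse.get)
print(best_model)
```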
NJ Case And Death Prediction Outcome
NJ Best Case Prediction
NJ Best Death Predictions
Conclusion on Data
● Cannot trust data without knowing what it means!
● Cannot trust data without knowing how to derive it!
● Hard to gauge accuracy of data since countries have different ways of reporting
data!
● Even U.S. states have different ways of reporting data!
● Smoothing or averaging per week makes data easier to understand
● Using rates per population gives us a better idea of how we’re doing as a country compared to others!
● If you don’t believe it, see it for yourself and do visualizations!
Conclusion on Visualization
● Not everything graphed is true!
● COVID is unpredictable!
● It seems that the U.S. as a whole has flattened the curve.
● New Jersey and New York flattened the curve, considering how these states were the main hotspots
● Higher testing does not lead to more positive cases!
● We really won’t know how we’re doing, or what we’re doing wrong or well, unless we look at other countries.
● Plotly is a very helpful tool due to its interactivity
Conclusion on Time Series
● COVID is very volatile
● COVID does not have a trend, so it is hard to predict what the actual numbers are going to be in real time.
● We can also attribute the difficulty of prediction to false reporting
● I wanted to use Facebook’s Prophet to predict data, but it killed my kernel every time!
● ARIMA Models and Holt Winters are good enough models
Conclusion on COVID-19
● Death rates are low. Let’s keep it that way!
● According to the NYT, 38% of COVID deaths have occurred in long-term care settings.
● It is my job to practice non-maleficence since I work in a nursing home!
● Don’t be the person who takes up medical resources that someone else needs
more than you do.
● Stay safe!
Sources
● Idea on the project
○ https://towardsdatascience.com/14-data-science-projects-to-do-during-your-14-day-quarantine-8bd60d1e55e1
● Data
○ covidtracking.com
○ cdc.gov
○ Ourworldindata.org
○ https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker
○ https://www.nature.com/articles/d41586-020-02009-w
● Statistics in this presentation
○ https://elemental.medium.com/why-we-should-care-commonly-asked-questions-and-answers-about-covid-19-6b166f1876e9
○ https://www.nytimes.com/interactive/2020/us/coronavirus-nursing-homes.html
Sources
● Time Series Algorithms
○ https://machinelearningmastery.com/exponential-smoothing-for-time-series-forecasting-in-python/
○ https://towardsdatascience.com/holt-winters-exponential-smoothing-d703072c0572
○ https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
○ https://alkaline-ml.com/pmdarima/quickstart.html
○ https://towardsdatascience.com/comparing-the-performance-of-forecasting-models-holt-winters-vs-arima-e226af99205f
○ https://otexts.com/fpp2/non-seasonal-arima.html
○ https://bookdown.org/JakeEsprabens/431-Time-Series/introduction-to-time-series.html#what-is-a-time-series
● Visualization
○ www.plotly.com
Sources
● Kaggle Kernels I Followed
● https://www.kaggle.com/fedi1996/covid-19-analysis-visualization-and-comparaisons/notebook
● A lot of stackoverflow
