Data Analysis and Visualization
Objectives
**Part 1: Clean and Analyze Data**
Scenario/Background
In this lab, you will learn how to use the pandas library to perform the preliminary steps that are
needed before performing any data analysis. These include removing missing values, changing the
format of data, and performing preliminary statistical analysis. After cleaning the data, you will
use matplotlib for data exploration and visualization.
Required Resources
1 PC with Internet access
Raspberry Pi version 2 or higher
Python libraries: datetime, csv, subprocess, pandas, numpy, matplotlib
Datafiles: rpi_data_compact.csv
Load data from the file rpi_data_compact.csv. This file contains measurements of internet
speed, acquired during the lab Internet_Speed_Data_Acquisition. In particular, the focus is on three
quantities: ping time (ms), download speed (Mbit/s), and upload speed (Mbit/s).
# Code Cell 1
import pandas as pd
import numpy as np
In [ ]:
# Code Cell 2
# Import data from csv file, and visualize the first rows
#df_compact =
#df_compact.?()
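One way to fill in the cell above is sketched below. So that the sketch runs anywhere, it reads a tiny inline stand-in for the lab file instead of `rpi_data_compact.csv` (the stand-in rows and the `Unnamed: 0` column name are assumptions; in the lab you would pass the filename directly to `pd.read_csv`):

```python
import io
import pandas as pd

# Tiny stand-in for rpi_data_compact.csv; the real file has the same column
# layout, including an extra index column produced when it was saved
csv_text = """Unnamed: 0,Date,Time,Ping (ms),Download (Mbit/s),Upload (Mbit/s)
0,27/10/2016,14:10:00,20.1,91.2,14.3
1,27/10/2016,14:11:00,19.8,88.7,14.1
"""

# In the lab: df_compact = pd.read_csv('rpi_data_compact.csv')
df_compact = pd.read_csv(io.StringIO(csv_text))

# head() displays the first rows (5 by default)
print(df_compact.head())
```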
As you may have noticed, the dataframe df_compact has an extra column. Use the
command df.drop to remove this column. Look at the Internet_Speed_Data_Acquisition lab for
help.
In [ ]:
# Code Cell 3
# Remove extra index columns
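A minimal sketch of `df.drop` on a synthetic frame; the extra column name `'Unnamed: 0'` is an assumption (it is what `read_csv` typically produces for a saved index), so check your own column names first:

```python
import pandas as pd

# Synthetic frame with the extra index column that read_csv often produces
df_compact = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Ping (ms)': [20.1, 19.8],
})

# axis=1 drops a column; axis=0 would drop a row
df_compact = df_compact.drop('Unnamed: 0', axis=1)
print(df_compact.columns.tolist())  # ['Ping (ms)']
```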
c) Remove NaNs.
A common issue that affects data quality is the presence of NaN values. These can make data
analysis functions abruptly terminate the calculation, throw an error, or produce incorrect results.
Typically, NaN values represent a piece of information that should be contained in, but is missing
from, the dataset. In this example, NaN values in df_compact may represent measurements where
the Internet connection dropped, or queries from the Raspberry Pi that the Speedtest.net server failed
to answer. The code below finds out how many NaN values are contained in our dataframe.
First, go to http://pandas.pydata.org/pandas-docs/version/0.14.1/index.html and search
for isnull in the search box.
The documentation of the isnull function is a little difficult to understand. Fortunately, the Jupyter
interactive interface allows us to call this function and quickly examine its output.
In [ ]:
# Code Cell 4
NaNs_in_df = df_compact.isnull()
print(type(NaNs_in_df))
NaNs_in_df.head()
The outcome of the isnull function is a new DataFrame that contains True or False depending on
whether the corresponding element of the original DataFrame is NaN. Applying the function sum to
this DataFrame automatically converts each True into 1 and each False into 0.
In [ ]:
# Code Cell 5
NaNs_per_column = NaNs_in_df.sum()
print(type(NaNs_per_column))
NaNs_per_column.head()
The resulting NaNs_per_column is a pandas Series object, which can be thought of as a single
column of a DataFrame (a DataFrame is actually a dict of Series, where the keys are the column
names). A Series object offers almost all of the functionality of a DataFrame. Use
the sum function on the NaNs_per_column Series and display the outcome.
In [ ]:
# Code Cell 6
NaNs_total = NaNs_per_column.sum()
NaNs_total
It is possible to chain all of these instructions into one line, as follows:
In [ ]:
# Code Cell 7
df_compact.isnull().sum().sum()
Compute the number of missing values as a percentage of all the elements in the dataframe (round
the result to the second decimal
using numpy.round: https://docs.scipy.org/doc/numpy/reference/generated/numpy.round_.html).
Then use the pandas function dropna to remove the NaN values from df_compact.
Use the pandas function dropna to remove NaN values from df_compact.
In [ ]:
# Code Cell 8
NaNs_pct = np.round(df_compact.isnull().sum().sum() /
                    float(len(df_compact) * len(df_compact.columns)) * 100, decimals=2)
print('The DataFrame contains {} NaNs, equal to {}% of the measurements'.format(NaNs_total, NaNs_pct))
The function dropna, when called with only default parameters, removes every row of the DataFrame
in which any value is NaN.
In [ ]:
# Code Cell 9
# Remove NaN values
df_compact_clean = df_compact.dropna()
Compare the length of the values before and after using dropna. Do you notice something odd?
Why?
In [ ]:
# Code Cell 10
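One way to carry out the comparison is sketched below on a small synthetic frame (the values are made up). Note that the row count can shrink by less than the total NaN count, because dropna removes whole rows and several NaNs may sit in the same row:

```python
import numpy as np
import pandas as pd

# Synthetic frame: two NaNs in one row, one NaN in another
df_compact = pd.DataFrame({
    'Ping (ms)': [20.1, np.nan, 19.5, 21.0],
    'Download (Mbit/s)': [91.2, np.nan, np.nan, 89.9],
})

df_compact_clean = df_compact.dropna()

# 3 NaN values, but only 2 rows removed
print(df_compact.isnull().sum().sum())          # 3
print(len(df_compact), len(df_compact_clean))   # 4 2
```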
The columns for ping, upload, and download in the dataframe df_compact contain numerical
values. It is therefore reasonable to expect that they all share one datatype, for example float64.
This, however, is not the case, as can be verified using dtypes:
In [ ]:
# Code Cell 11
# Ping and Download are not floats
df_compact.dtypes
Use the Python function float() to convert a string value into a float.
In [ ]:
# Code Cell 12
str_val = '10.56'
float_val = float(str_val)
print(str_val, type(str_val), float_val, type(float_val))
Now convert all the values of the columns 'Ping (ms)' and 'Download (Mbit/s)' into float. Hint:
use apply and lambda. For help, look at the Internet_Speed_Data_Acquisition lab.
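A minimal sketch of the apply/lambda pattern on a synthetic frame (the values are made up): apply calls the lambda on every element of the Series and returns a new Series, which is stored in a new column.

```python
import pandas as pd

# Synthetic frame where the numeric columns were read as strings
df_compact = pd.DataFrame({
    'Ping (ms)': ['20.1', '19.8'],
    'Download (Mbit/s)': ['91.2', '88.7'],
})

# Convert each string element to float, storing the result in new columns
df_compact['Ping (ms) float'] = df_compact['Ping (ms)'].apply(lambda v: float(v))
df_compact['Download (Mbit/s) float'] = df_compact['Download (Mbit/s)'].apply(lambda v: float(v))

print(df_compact.dtypes)
```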
In [ ]:
# Code Cell 14
# this disables a notebook warning that is not relevant for our use case
pd.options.mode.chained_assignment = None
# Code Cell 15
# Remove the original 'Ping (ms)' and 'Download (Mbit/s)' columns
# Rename the new 'Ping (ms) float' and 'Download (Mbit/s) float' columns
# to 'Ping (ms)' and 'Download (Mbit/s)'
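The drop-and-rename step can be sketched as follows on a one-row synthetic frame (the values are made up); the same pattern applies to the download column:

```python
import pandas as pd

# Synthetic frame holding both the old string column and the new float column
df = pd.DataFrame({
    'Ping (ms)': ['20.1'],
    'Ping (ms) float': [20.1],
})

# Drop the old string column, then rename the float column to the original name
df = df.drop('Ping (ms)', axis=1)
df = df.rename(columns={'Ping (ms) float': 'Ping (ms)'})
print(df.dtypes)
```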
In [ ]:
# Code Cell 16
df_compact_clean.head()
Before saving the DataFrame, it makes sense to reposition Upload as the last column. This can be
achieved using the reindex function.
In [ ]:
# Code Cell 17
df_compact_clean = df_compact_clean.reindex(columns=['Date', 'Time', 'Ping (ms)',
                                                     'Download (Mbit/s)', 'Upload (Mbit/s)'])
df_compact_clean.head()
Now that the dataset is finally clean, store it in a csv file under a new name, and give the DataFrame a shorter variable name.
In [ ]:
# Code Cell 18
# Let's save the new cleaned dataframe to a csv
df_compact_clean.to_csv('./rpi_data_processed.csv', index=False)
df_clean = df_compact_clean
New data requires not only cleaning, but also a good deal of getting used to. When you start a data
analysis project, it is worthwhile to invest effort in exploring and calculating some basic statistical
properties of the data. This entails computing averages, standard deviations, and correlations.
a) Calculate mean and standard deviation using Pandas.
The mean and the standard deviation of all the columns of a DataFrame can be computed
using mean() and std() . Look for them in the pandas library documentation and apply them to
the df_clean DataFrame
Quote the results as quantity = mean ± standard_deviation. Do not forget to include the
units of measurement associated with each quantity.
In [ ]:
# Code Cell 19
# Compute mean and std for all the columns of df_compact
# SOLUTION:
# means = ...
# stands = ...
# Print the mean value ± the standard deviation, including measuring units
print('Average ping time: {} ± {} ms'.format(stats_ping[0],stats_ping[1]))
print('Average download speed: {} ± {} Mbit/s'.format(*stats_download))
print('Average upload speed: {} ± {} Mbit/s'.format(*stats_upload))
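One way to fill in the solution cell above is sketched below on a synthetic stand-in for df_clean (the numbers are made up; the `stats_*` pairing matches the print statements above):

```python
import pandas as pd

# Synthetic stand-in for df_clean, numeric columns only
df_clean = pd.DataFrame({
    'Ping (ms)': [20.0, 22.0, 21.0],
    'Download (Mbit/s)': [90.0, 92.0, 91.0],
    'Upload (Mbit/s)': [14.0, 16.0, 15.0],
})

# mean() and std() return one value per column as a Series
means = df_clean.mean()
stands = df_clean.std()

# Pair mean and standard deviation per quantity
stats_ping = (means['Ping (ms)'], stands['Ping (ms)'])
stats_download = (means['Download (Mbit/s)'], stands['Download (Mbit/s)'])
stats_upload = (means['Upload (Mbit/s)'], stands['Upload (Mbit/s)'])

# Quote each quantity as mean ± standard deviation, with units
print('Average ping time: {:.2f} ± {:.2f} ms'.format(*stats_ping))
print('Average download speed: {:.2f} ± {:.2f} Mbit/s'.format(*stats_download))
print('Average upload speed: {:.2f} ± {:.2f} Mbit/s'.format(*stats_upload))
```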
Search in the pandas library documentation for how to compute the minimum and the maximum
values for all the columns in the DataFrame.
In [ ]:
# Code Cell 23
# Compute max and min for all the columns of df_compact
mins = df_clean.min()
maxs = df_clean.max()
Execute the following line of code. Notice how much time the pandas library can save you with even
a single line of code!
In [ ]:
# Code Cell 24
df_clean.describe()
Using the pandas idxmin and idxmax functions (named argmin and argmax in older pandas
versions), find the dates and times corresponding to the longest and shortest ping time, the lowest
and highest download speed, and the lowest and highest upload speed.
In [ ]:
# Code Cell 25
# Find the min and max ping time
argmin_ping = df_clean['Ping (ms)'].idxmin()  # idxmin returns the row label, usable with .loc
argmax_ping = df_clean['Ping (ms)'].idxmax()
# Code Cell 26
# Create a small DataFrame and access its rows using iloc
# Code Cell 28
#Print the corresponding Date and Time
# print('Ping measure reached minimum on {} at {}'.format(df_clean.loc[...],
#                                                         df_clean.loc[...]))
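A sketch of the whole lookup on a synthetic stand-in for df_clean (the rows are made up). Note that after dropna the row labels may have gaps, which is why idxmin/idxmax (which return labels) pair correctly with .loc, while positional argmin/argmax would not:

```python
import pandas as pd

# Synthetic stand-in for df_clean
df_clean = pd.DataFrame({
    'Date': ['27/10/2016', '27/10/2016', '28/10/2016'],
    'Time': ['14:10:00', '14:11:00', '09:00:00'],
    'Ping (ms)': [20.1, 18.7, 25.3],
})

# idxmin/idxmax return the row label of the extreme value
argmin_ping = df_clean['Ping (ms)'].idxmin()
argmax_ping = df_clean['Ping (ms)'].idxmax()

# .loc[label, column] retrieves the corresponding Date and Time
print('Ping measure reached minimum on {} at {}'.format(
    df_clean.loc[argmin_ping, 'Date'], df_clean.loc[argmin_ping, 'Time']))
print('Ping measure reached maximum on {} at {}'.format(
    df_clean.loc[argmax_ping, 'Date'], df_clean.loc[argmax_ping, 'Time']))
```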
e) Create a correlation matrix.
It is useful to analyze if the speed of download tends to increase and decrease together with the
speed of upload. The reasoning behind this would be that network usage and technical issues ought
to affect download and upload equally.
In this scenario, download and upload speeds would be called positively correlated. This means that
faster download and upload would typically occur together. This would refer to the general trend, but
instances of fast download with slow upload would still be possible.
On the other hand, you may infer that a higher download speed implies a lower upload speed, and
vice-versa. In this case, the argument would be that the internet line can only support a limited
amount of information being exchanged. Download and upload would then compete, and keep each
other 'in check'.
In this scenario, download and upload speeds would be called negatively correlated. This means
that download would typically be faster when upload is slower, and vice-versa. As before, this would
refer to a trend, so that simultaneous good download and upload speeds would still be possible.
To complete the picture, the time of ping may be positively or negatively correlated with either
upload or download. It is then natural to think of a table, where each quantity is compared with all
others. Such tables are well-known mathematical objects, and are dubbed correlation matrices.
Use the pandas function corr to derive the correlation matrix of ping, upload, and download. Store
the result in a variable called df_corr.
In [ ]:
# Code Cell 29
# Are these variables correlated?
df_corr = df_clean.corr()
df_corr
In [ ]:
# Code Cell 30
corr = df_corr.values
print('Correlation coefficient between ping and download: {}'.format(corr[0, 1]))
print('Correlation coefficient between ping and upload: {}'.format(corr[0, 2]))
print('Correlation coefficient between upload and download: {}'.format(corr[2, 1]))
These numbers answer the questions about the 'relationship' between ping, download, and upload.
Perfect positive correlation yields a value of +1, whereas perfect negative correlation yields a value
of -1. Here, the correlations between download and ping, and between download and upload, are
close to zero, and the correlation between upload and ping is small. This leads to the
conclusion that the three quantities are, in fact, mutually uncorrelated.
Python has a comprehensive library for making plots, called Matplotlib. As an additional learning
resource, it is certainly worth taking a look at the official Matplotlib documentation, and in particular
at the numerous examples.
a) Import Matplotlib.
In [ ]:
# Code Cell 31
import matplotlib.pyplot as plt
# The following allows your Jupyter notebook to create plots inside a cell
%matplotlib inline
Based on what you learned in the SF_Crime_Data lab, generate a plot containing three lines: 1. ping
(ms) as a function of time, 2. upload (Mbit/s) as a function of time, and 3. download (Mbit/s) as a
function of time. Use the legend() function to add a legend to your graph, but do not worry about
labelling the axes. We will work out how to do that in a later task.
In [ ]:
# Code Cell 32
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))
# Create x-axis
t = pd.to_datetime(df_clean['Time'])
# Insert legend
ax.legend()
plt.show()
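One way to complete the cell above is sketched below on a synthetic stand-in for df_clean (the rows are made up; the `Agg` backend line is only there so the sketch also runs outside Jupyter and is unnecessary in the notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for df_clean
df_clean = pd.DataFrame({
    'Time': ['14:10:00', '14:11:00', '14:12:00'],
    'Ping (ms)': [20.1, 18.7, 25.3],
    'Download (Mbit/s)': [91.2, 88.7, 90.1],
    'Upload (Mbit/s)': [14.3, 14.1, 15.0],
})

# Initialise figure and x-axis
fig, ax = plt.subplots(figsize=(10, 5))
t = pd.to_datetime(df_clean['Time'])

# One line per quantity; the label argument feeds the legend
ax.plot(t, df_clean['Ping (ms)'], label='Ping (ms)')
ax.plot(t, df_clean['Download (Mbit/s)'], label='Download (Mbit/s)')
ax.plot(t, df_clean['Upload (Mbit/s)'], label='Upload (Mbit/s)')

ax.legend()
# In the notebook, plt.show() displays the figure here
```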
Because ping measurements include large and abrupt variations, they are perhaps better visualized
using dots. Within the command ax.plot(...) for displaying ping data, specify that these
measurements are represented as dots. (Most of the code here can be recycled from the previous
task.)
In [ ]:
# Code Cell 33
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))
#ax.plot(...
#ax.plot(...
#ax.plot(...
# Insert legend
# Code Cell 35
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))
# Insert legend
ax.legend()
Use the 'fivethirtyeight' style context to make the previous graph more visually appealing.
To do this, wrap your plotting code in a with statement, placed before the calls to
the Matplotlib functions.
In [ ]:
# Code Cell 36
# Use a style context
#with ...
# Initialise figure
#fig, ax =
f) Create a Histogram.
A histogram is a graphical representation of the frequency of the values of numerical data. Examine
the code below. An additional level of complexity is the use of subplots to display the histograms
side-by-side.
In [ ]:
# Code Cell 37
with plt.style.context('fivethirtyeight'):
    nbins = 100
    # Initialize figure
    fig, ax = plt.subplots(2, 2, figsize=(10, 10))
    ax[0][0].hist(df_clean['Ping (ms)'], nbins)
    ax[0][0].set_xlabel('Ping (ms)', fontsize=16)
    ax[0][0].tick_params(labelsize=14)
    ax[0][1].hist(df_clean['Upload (Mbit/s)'], nbins)
    ax[0][1].set_xlabel('Upload (Mbit/s)', fontsize=16)
    ax[0][1].tick_params(labelsize=14)
    ax[1][0].hist(df_clean['Download (Mbit/s)'], nbins)
    ax[1][0].set_xlabel('Download (Mbit/s)', fontsize=16)
    ax[1][0].tick_params(labelsize=14)
    ax[1][1].set_visible(False)
Scenario / Background
In statistics, linear regression is a way to model the relationship between a dependent variable y and
an independent variable x. For simple linear regression, this relationship takes the form y = mx + b,
where m is the slope and b is the y-intercept.
In this lab, you will analyze district sales data and perform a simple linear regression to predict
annual net sales based on the number of stores in the district.
Required Resources
1 PC with Internet access
Python libraries: pandas, numpy, scipy, and matplotlib
Datafiles: stores-dist.csv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In [ ]:
# Code Cell 1
In [ ]:
# Code Cell 3
# The district column has no relevance at this time, so it can be dropped.
salesDist = salesDist.rename(columns={'annual net sales': 'sales',
                                      'number of stores in district': 'stores'})
sales = salesDist.drop('district', axis=1)
salesDist.head()
sales.head()
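The correlation coefficient can be computed the same way as in Part 1. A minimal sketch with made-up numbers standing in for the stores-dist.csv data (the real values come from the file; only the renamed column names are taken from the lab):

```python
import pandas as pd

# Synthetic stand-in for the renamed sales data
sales = pd.DataFrame({
    'sales': [231.0, 156.0, 10.0, 519.0, 437.0],
    'stores': [12, 13, 16, 2, 6],
})

# corr() builds the full correlation matrix for the numeric columns
print(sales.corr())
```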
From the correlation coefficient data, what type of correlation did you observe between annual net
sales and the number of stores in the district?
<font color = 'gray'>Type your answer here.</font>
# Create a scatter plot: Number of Stores in the District vs. Annual Net Sales
plt.plot(x, y, 'o', markersize=15)
Step 1: Calculate the slope and y-intercept of the linear regression line.
In [ ]:
# Code Cell 8
# Use numpy polyfit for linear regression to fit the data
# Generate the slope of the line (m)
# Generate the y-intercept (b)
m, b = np.polyfit(x,y,1)
print ('The slope of line is {:.2f}.'.format(m))
print ('The y-intercept is {:.2f}.'.format(b))
print('The best fit simple linear regression line is {:.2f}x + {:.2f}.'.format(m, b))
Step 3: Overlay the regression line and the centroid point on the plot.
In [ ]:
# Code Cell 10
# Create the plot inline
%matplotlib inline
# Create legend
plt.legend(loc = 'upper right', fontsize = 20)
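The overlay can be sketched as follows, with made-up x/y stand-ins for the stores and sales arrays (the `Agg` backend line is only there so the sketch also runs outside Jupyter). A useful property: the regression line always passes through the centroid (mean(x), mean(y)).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for stores (x) and annual net sales (y)
x = np.array([2.0, 6.0, 12.0, 13.0, 16.0])
y = np.array([519.0, 437.0, 231.0, 156.0, 10.0])

# Slope and y-intercept from a degree-1 polynomial fit
m, b = np.polyfit(x, y, 1)

fig = plt.figure(figsize=(10, 5))
plt.plot(x, y, 'o', markersize=15, label='data')
# Regression line evaluated at the observed x values
plt.plot(x, m * x + b, '-', label='regression line')
# Centroid point (mean of x, mean of y)
plt.plot(x.mean(), y.mean(), '*', markersize=20, label='centroid')
plt.legend(loc='upper right', fontsize=20)
# In the notebook, plt.show() displays the figure here
```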
Step 4: Prediction
Using the linear regression line, you can predict the annual net sales based on the number of stores
in the district.
In [ ]:
# Code Cell 11
# Function to predict the net sales from the regression line
def predict(query):
    if query >= 1:
        return m * query + b
    else:
        print("You must have at least 1 store in the district to predict the annual net sales.")
In [ ]:
# Code Cell 12
# Enter the number of stores in the function to generate the net sales prediction.
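A sketch of how the function is called, using made-up slope and intercept values (in the lab, m and b come from np.polyfit on the real data, so your predicted value will differ):

```python
# Made-up coefficients standing in for the np.polyfit results
m, b = -30.0, 600.0

def predict(query):
    if query >= 1:
        return m * query + b
    print("You must have at least 1 store in the district to predict the annual net sales.")

print(predict(4))   # -30*4 + 600 = 480.0
print(predict(0))   # prints the warning and returns None
```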
What is the predicted net sales if there are 4 stores in the district?
<font color = 'gray'>Type your answer here.</font>