Data Analysis and Visualization

This document outlines a lab focused on cleaning, analyzing, and visualizing internet speed data using Python libraries such as pandas and matplotlib. It includes detailed steps for data cleaning, statistical analysis, and creating visualizations. The lab emphasizes the importance of data quality and provides instructions for handling missing values, changing data formats, and generating correlation matrices.


Lab - Internet Meter Visualization

Objectives
 **Part 1: Clean and Analyze Data**

 **Part 2: Visualize Data**

Scenario/Background
In this lab, you will learn how to use the pandas library to perform the preliminary steps that are
needed before performing any data analysis. This includes removing missing values, changing the
format of data, and performing preliminary statistical analysis. After cleaning the data, you will
use matplotlib for data exploration and visualization.

Required Resources
 1 PC with Internet access
 Raspberry Pi version 2 or higher
 Python libraries: datetime, csv, subprocess, pandas, numpy
 Datafiles: data_long.csv

Part 1: Clean and Analyze Data


It is sometimes said that machine learning practitioners spend 80% of their time cleaning data. While
this claim is possibly an exaggeration, it is true that data cleaning is a fundamental task, necessary
to ensure the success of a machine learning project. As in many other lines of work, the law of
'garbage in, garbage out' does apply. It is simply not possible to do meaningful data analysis on dirty
data.

Step 1: Clean the data.

Load data from the file rpi_data_compact.csv. This file contains measurements of internet
speed, acquired during the lab Internet_Speed_Data_Acquisition. In particular, the focus is on three
quantities: ping time (ms), download speed (Mbit/s), and upload speed (Mbit/s).

a) Read data from rpi_data_compact.csv.


Read the ./Data/rpi_data_compact.csv file into a Pandas dataframe called df_compact and
inspect the first three rows of this dataframe using df.head.
In [ ]:

# Code Cell 1
import pandas as pd
import numpy as np
In [ ]:

# Code Cell 2
# Import data from csv file, and visualize the first rows
#df_compact =
#df_compact.?()
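One possible way to complete the cell above, assuming the file sits in ./Data/ as in the instructions (a sketch, not the only valid answer):

# Read the csv file into a DataFrame and show the first three rows
df_compact = pd.read_csv('./Data/rpi_data_compact.csv')
df_compact.head(3)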

b) Remove unwanted columns.

As you may have noticed, the dataframe df_compact has an extra column. Use the
command df.drop to remove this column. Look at the Internet_Speed_Data_Acquisition lab for
help.
In [ ]:

# Code Cell 3
# Remove extra index columns
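A possible solution, assuming the stray column was written as 'Unnamed: 0' by an earlier to_csv call (the column name is an assumption; check df_compact.head() first):

# Drop the assumed 'Unnamed: 0' index column along axis 1
df_compact = df_compact.drop('Unnamed: 0', axis=1)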

c) Remove NaNs.

A common issue that affects data quality is the presence of NaN values. These can make data
analysis functions abruptly terminate the calculation, throw an error, or produce incorrect results.
Typically, NaN values represent a piece of information that should be contained in, but is missing
from, the dataset. In this example, NaN values in df_compact may represent measurements where
the Internet connection dropped, or queries from the Raspberry Pi that the Speedtest.net server failed
to answer. The code below finds out how many NaN values are contained in our dataframe.
First, go to http://pandas.pydata.org/pandas-docs/version/0.14.1/index.html and search
for isnull in the search box.
The documentation of the isnull function is a little difficult to understand. Fortunately, the Jupyter
interactive interface allows us to call this function and quickly examine its output.
In [ ]:

# Code Cell 4
NaNs_in_df = df_compact.isnull()
print(type(NaNs_in_df))
NaNs_in_df.head()
The outcome of the isnull function is a new DataFrame that contains True or False depending on
whether the corresponding element of the original DataFrame is NaN. Applying the function sum to
this DataFrame will automatically convert True into 1 and False into 0.
In [ ]:

# Code Cell 5
NaNs_per_column = NaNs_in_df.sum()
print(type(NaNs_per_column))
NaNs_per_column.head()
The resulting NaNs_per_column is a pandas Series object, which can be thought of as a single
column of a DataFrame (a DataFrame is actually a dict of Series, where the keys are the column
names). A Series object offers almost all of the functionality of a DataFrame. Use
the sum function on the NaNs_per_column Series and display the outcome.
In [ ]:

# Code Cell 6
NaNs_total = NaNs_per_column.sum()
NaNs_total
It is possible to chain all of these instructions in a single line, as follows:
In [ ]:

# Code Cell 7
df_compact.isnull().sum().sum()
Compute the number of missing values as a percentage of all the elements in the dataframe (round
the result to the second decimal using
numpy.round: https://docs.scipy.org/doc/numpy/reference/generated/numpy.round_.html).
Use the pandas function dropna to remove NaN values from df_compact.
In [ ]:

# Code Cell 8
NaNs_pct = np.round(df_compact.isnull().sum().sum() / float(len(df_compact) * len(df_compact.columns)) * 100, decimals=2)
print('The DataFrame contains {} NaNs, equal to {}% of the measurements'.format(NaNs_total, NaNs_pct))
The function dropna, if called with only default parameters, removes every row of a DataFrame in
which any value is NaN.
In [ ]:

# Code Cell 9
# Remove NaN values
df_compact_clean = df_compact.dropna()
Compare the length of the values before and after using dropna. Do you notice something odd?
Why?
In [ ]:

# Code Cell 10
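A minimal sketch for this check:

# Compare the number of rows before and after dropna
print('Before: {} rows'.format(len(df_compact)))
print('After: {} rows'.format(len(df_compact_clean)))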

d) Change values format.

The columns for ping, upload, and download in the dataframe df_compact contain numerical
values. It is therefore reasonable to expect that they all share one datatype, for example float64.
This, however, is not the case, as can be verified using dtypes:
In [ ]:

# Code Cell 11
# Ping and Download are not floats
df_compact.dtypes
Use the Python function float() to convert a string value into float format.
In [ ]:

# Code Cell 12
str_val = '10.56'
float_val = float(str_val)
print(str_val, type(str_val), float_val, type(float_val))
Now convert all the values of the columns 'Ping (ms)' and 'Download (Mbit/s)' into float. Hint:
use apply and lambda. For help, look at the Internet_Speed_Data_Acquisition lab.
In [ ]:
# Code Cell 14
# this disables a notebook warning that is not relevant for our use case
pd.options.mode.chained_assignment = None

# Convert Ping and Download to float


#df_compact_clean['Ping (ms)_float'] = ...
#df_compact_clean['Download (Mbit/s)_float'] = ...

#Check that the types have been successfully converted


# ...
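A possible completion using apply and a lambda, as hinted (a sketch; the new column names follow the commented placeholders):

# Convert each value to float, storing the result in new columns
df_compact_clean['Ping (ms)_float'] = df_compact_clean['Ping (ms)'].apply(lambda v: float(v))
df_compact_clean['Download (Mbit/s)_float'] = df_compact_clean['Download (Mbit/s)'].apply(lambda v: float(v))

# Check that the types have been successfully converted
df_compact_clean.dtypes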
Now, remove the original Ping (ms) and Download (Mbit/s) columns, and rename the
new Ping (ms)_float and Download (Mbit/s)_float to Ping (ms) and Download
(Mbit/s). Use df.drop and df.rename, like in the Internet_Speed_Data_Acquisition lab.
In [ ]:

# Code Cell 15
# Remove the original 'Ping (ms)' and 'Download (Mbit/s)' columns

# Rename the new 'Ping (ms)_float' and 'Download (Mbit/s)_float' columns to 'Ping (ms)' and 'Download (Mbit/s)'
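One way to fill in this cell, sketched after the pattern of the Internet_Speed_Data_Acquisition lab:

# Drop the original columns along axis 1, then rename the float versions
df_compact_clean = df_compact_clean.drop(['Ping (ms)', 'Download (Mbit/s)'], axis=1)
df_compact_clean = df_compact_clean.rename(columns={'Ping (ms)_float': 'Ping (ms)', 'Download (Mbit/s)_float': 'Download (Mbit/s)'})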
In [ ]:

# Code Cell 16
df_compact_clean.head()
Before saving the DataFrame, it makes sense to reposition Upload as the last column. This can be
achieved using the reindex function.
In [ ]:

# Code Cell 17
df_compact_clean = df_compact_clean.reindex(columns=['Date', 'Time', 'Ping (ms)', 'Download (Mbit/s)', 'Upload (Mbit/s)'])
df_compact_clean.head()
Now that the dataset is finally clean, store it in a csv file under a new name.
In [ ]:

# Code Cell 18
# Let's save the new cleaned dataframe to a csv
df_compact_clean.to_csv('./rpi_data_processed.csv', index=False)

df_clean = df_compact_clean

Step 2: Basic statistics.

New data requires not only cleaning, but also a good deal of getting used to. When you start a data
analysis project, it is worthwhile to invest effort in exploring and calculating some basic statistical
properties of the data. This entails computing averages, standard deviations, and correlations.
a) Calculate mean and standard deviation using Pandas.

The mean and the standard deviation of all the columns of a DataFrame can be computed
using mean() and std() . Look for them in the pandas library documentation and apply them to
the df_clean DataFrame

Quote the results as quantity = mean ± standard_deviation. Do not forget to include the
units of measurement associated with each quantity.
In [ ]:

# Code Cell 19
# Compute mean and std for all the columns of df_compact
# SOLUTION:
# means = ...
# stands = ...

# Place mean and std for each column in a tuple


stats_ping = (means['Ping (ms)'], stands['Ping (ms)'])
stats_download = (means['Download (Mbit/s)'], stands['Download (Mbit/s)'])
stats_upload = (means['Upload (Mbit/s)'], stands['Upload (Mbit/s)'])

# Print the mean value ± the standard deviation, including measuring units
print('Average ping time: {} ± {} ms'.format(stats_ping[0],stats_ping[1]))
print('Average download speed: {} ± {} Mbit/s'.format(*stats_download))
print('Average upload speed: {} ± {} Mbit/s'.format(*stats_upload))
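A possible way to fill in the blanks above (a sketch; on recent pandas versions, passing numeric_only=True skips the non-numeric Date and Time columns):

# Compute per-column mean and standard deviation of the numeric columns
means = df_clean.mean(numeric_only=True)
stands = df_clean.std(numeric_only=True)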

b) Calculate min and max values using Pandas.

Search in the pandas library documentation for how to compute the minimum and the maximum
values for all the columns in the DataFrame.
In [ ]:

# Code Cell 23
# Compute max and min for all the columns of df_compact
mins = df_clean.min()
maxs = df_clean.max()

# Place min and max for each column in a tuple


mima_ping = (mins['Ping (ms)'], maxs['Ping (ms)'])
mima_download = (mins['Download (Mbit/s)'], maxs['Download (Mbit/s)'])
mima_upload = (mins['Upload (Mbit/s)'], maxs['Upload (Mbit/s)'])

# Print the min and max values, including measuring units


print('Min ping time: {} ms. Max ping time: {} ms'.format(*mima_ping))
print('Min download speed: {} Mbit/s. Max download speed: {} Mbit/s'.format(*mima_download))
print('Min upload speed: {} Mbit/s. Max upload speed: {} Mbit/s'.format(*mima_upload))

c) Use the pandas describe function.

Execute the following line of code. Notice how much time the pandas library can save you with even
a single line of code!
In [ ]:

# Code Cell 24
df_clean.describe()

d) Use argmin, argmax and iloc.


Let's assume you would like to have a computer script that automatically emails status reports to
your internet provider. The reports would have to include the date and time corresponding to the
minimum internet speed. The temporal information would allow the internet provider to accurately
identify the cause behind the slow connection you observed.

Using the pandas argmin and argmax functions, find dates and times corresponding to the longest
and shortest ping time, the lowest and highest download speed, and the lowest and highest upload
speed.
In [ ]:

# Code Cell 25
# Find the min and max ping time
argmin_ping = df_clean['Ping (ms)'].argmin()
argmax_ping = df_clean['Ping (ms)'].argmax()

# Find the min and max download speed


argmin_download = df_clean['Download (Mbit/s)'].argmin()
argmax_download = df_clean['Download (Mbit/s)'].argmax()

# Find the min and max upload speed


argmin_upload = df_clean['Upload (Mbit/s)'].argmin()
argmax_upload = df_clean['Upload (Mbit/s)'].argmax()
The argmin and argmax functions return an index relative to the DataFrame rows. To access a
specific row using this index, use iloc.
In [ ]:

# Code Cell 26
# Create a small DataFrame and access its rows using iloc

# A pandas DataFrame can be initialized by passing a dict as a parameter to the constructor pd.DataFrame().
# The keys represent the columns, the values the rows.
df = pd.DataFrame({'field_1': [0,1], 'field_2': [0,2]})
df.head()
In [ ]:
# Code Cell 27
# To access field_1 of the second row (positional index 1) using iloc
df.iloc[1]['field_1']
Use the indices computed using argmax and argmin in combination with iloc to display
the Date and the Time of the maximum/minimum Ping, Download, and Upload.
In [ ]:

# Code Cell 28
#Print the corresponding Date and Time
#print('Ping measure reached minimum on {} at {}'.format(df_clean.loc[...,
# df_clean.loc[...))

#print('Download measure reached minimum on {} at {}'.format(...

#print('Upload measure reached minimum on ...

#print('Ping measure reached maximum on ...

#print('Download measure reached maximum on ...

#print('Upload measure reached maximum on ...
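A possible completion (a sketch; it assumes the indices returned by argmin/argmax are positional, so that iloc applies):

# Print the Date and Time of each extreme measurement
print('Ping measure reached minimum on {} at {}'.format(df_clean.iloc[argmin_ping]['Date'], df_clean.iloc[argmin_ping]['Time']))
print('Download measure reached minimum on {} at {}'.format(df_clean.iloc[argmin_download]['Date'], df_clean.iloc[argmin_download]['Time']))
print('Upload measure reached minimum on {} at {}'.format(df_clean.iloc[argmin_upload]['Date'], df_clean.iloc[argmin_upload]['Time']))
print('Ping measure reached maximum on {} at {}'.format(df_clean.iloc[argmax_ping]['Date'], df_clean.iloc[argmax_ping]['Time']))
print('Download measure reached maximum on {} at {}'.format(df_clean.iloc[argmax_download]['Date'], df_clean.iloc[argmax_download]['Time']))
print('Upload measure reached maximum on {} at {}'.format(df_clean.iloc[argmax_upload]['Date'], df_clean.iloc[argmax_upload]['Time']))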

e) Create a correlation.

It is useful to analyze whether the download speed tends to increase and decrease together with the
upload speed. The reasoning behind this would be that network usage and technical issues ought
to affect download and upload equally.

In this scenario, download and upload speeds would be called positively correlated. This means that
faster download and upload would typically occur together. This would refer to the general trend, but
instances of fast download with slow upload would still be possible.

On the other hand, you may infer that a higher download speed implies a lower upload speed, and
vice-versa. In this case, the argument would be that the internet line can only support a limited
amount of information being exchanged. Download and upload would then compete, and keep each
other 'in check'.

In this scenario, download and upload speeds would be called negatively correlated. This means
that download would typically be faster when upload is slower, and vice-versa. As before, this would
refer to a trend, so that simultaneous good download and upload speeds would still be possible.

To complete the picture, the time of ping may be positively or negatively correlated with either
upload or download. It is then natural to think of a table, where each quantity is compared with all
others. Such tables are well-known mathematical objects, and are dubbed correlation matrices.

Use the pandas function corr to derive the correlation matrix of ping, upload, and download. Store
the result in a variable called df_corr.
In [ ]:
# Code Cell 29
# Are these variables correlated?
df_corr = df_clean.corr()
df_corr
In [ ]:

# Code Cell 30
corr = df_corr.values
print('Correlation coefficient between ping and download: {}'.format(corr[0, 1]))
print('Correlation coefficient between ping and upload: {}'.format(corr[0, 2]))
print('Correlation coefficient between upload and download: {}'.format(corr[2, 1]))
These numbers answer the questions on the 'relationship' between ping, download, and upload.
Perfect positive correlation yields a value of +1, whereas perfect negative correlation yields a value
of -1. Yet, the correlations between download and ping, and between download and upload are
close to zero. Moreover, the correlation between upload and ping is small. This leads to the
conclusion that the three quantities are, in fact, mutually uncorrelated.

Part 2: Visualize Data


As the proverb goes, 'A picture is worth a thousand words'. Informative, meaningful, and intuitive
graphs play a crucial role in the exploration of data. Plots are useful in the initial stages of a project,
and well beyond that. Graphs are a great way to present the results and conclusions of your work
to an audience.

Python has a comprehensive library for making plots, called Matplotlib. As an additional learning
resource, it is certainly worth taking a look at the official Matplotlib documentation, and in particular
at the numerous examples.

Step 1: Create a first visualization of the Internet Speed Data.

Visualize the content of the df_clean DataFrame.

a) Import Matplotlib.
In [ ]:

# Code Cell 31
import matplotlib.pyplot as plt
# The following allows your Jupyter notebook to create plots inside a cell
%matplotlib inline

b) Plot Internet speed stats.

Based on what you learned in the SF_Crime_Data lab, generate a plot containing three lines: 1. ping
(ms) as a function of time, 2. upload (Mbit/s) as a function of time, and 3. download (Mbit/s) as a
function of time. Use the legend() function to add a legend to your graph, but do not worry about
labelling the axes. We will work out how to do that in a later task.
In [ ]:
# Code Cell 32
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))

# Create x-axis
t = pd.to_datetime(df_clean['Time'])

# Plot three curves of different colors


ax.plot(t, df_clean['Ping (ms)'], label='Ping (ms)')
#ax.plot(...
#ax.plot(...

# Insert legend
ax.legend()
plt.show()
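The two commented calls could be completed along these lines (a sketch; the lines belong before ax.legend() in the cell above):

ax.plot(t, df_clean['Upload (Mbit/s)'], label='Upload (Mbit/s)')
ax.plot(t, df_clean['Download (Mbit/s)'], label='Download (Mbit/s)')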

c) Change the linestyle.

Because ping measurements include large and abrupt variations, they are perhaps better visualized
using dots. Within the command ax.plot(...) for displaying ping data, specify that these
measurements are represented as dots. (Most of the code here can be recycled from the previous
task.)
In [ ]:

# Code Cell 33
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))

# Plot three curves. Ping data is visualized using dots
t = pd.to_datetime(df_clean['Time'])

#ax.plot(...
#ax.plot(...
#ax.plot(...

# Insert legend
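A possible completion of the cell, using the 'o' format string so that ping is drawn as dots (a sketch):

# Ping as dots, upload and download as lines
ax.plot(t, df_clean['Ping (ms)'], 'o', label='Ping (ms)')
ax.plot(t, df_clean['Upload (Mbit/s)'], label='Upload (Mbit/s)')
ax.plot(t, df_clean['Download (Mbit/s)'], label='Download (Mbit/s)')

# Insert legend
ax.legend()
plt.show()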

d) Add axis labels.


A plot without axis labels, and perhaps a title, is difficult to understand, as one cannot know for sure
which quantities are being depicted. Make the above graph compliant with standard scientific
practice by adding axis labels and a title. Specify a fontsize of about 16, so that title and labels are
printed nice and clear.
In [ ]:

# Code Cell 35
# Initialise figure
fig, ax = plt.subplots(figsize=(10, 5))

# Plot three curves


t = pd.to_datetime(df_clean['Time'])
ax.plot(t, df_clean['Ping (ms)'], 'o', label='Ping (ms)')
ax.plot(t, df_clean['Upload (Mbit/s)'], label='Upload (Mbit/s)')
ax.plot(t, df_clean['Download (Mbit/s)'], label='Download (Mbit/s)')

# Insert legend
ax.legend()

# Add axis labels and title


#ax.set_xlabel(...
#ax.set_ylabel(...
#ax.set_title(...

# Change tick size


ax.tick_params(labelsize=14)
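Possible labels and a title, to go where the commented placeholders sit (a sketch; the exact wording is an assumption, any descriptive text will do):

ax.set_xlabel('Time', fontsize=16)
ax.set_ylabel('Ping (ms) / Speed (Mbit/s)', fontsize=16)
ax.set_title('Internet speed measurements', fontsize=16)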

e) Change the plot theme.

Use the 'fivethirtyeight' style context to make the previous graph more visually appealing.
To do this, wrap your code in a with statement before calling the Matplotlib functions.
In [ ]:

# Code Cell 36
# Use a style context
#with ...

# Initialise figure
#fig, ax =

# Plot ping as a function of time

# Add axis labels and title

# Change tick size
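A sketch of the whole cell under the 'fivethirtyeight' style context (the labels and title carried over from the previous task are assumptions):

with plt.style.context('fivethirtyeight'):
    # Initialise figure
    fig, ax = plt.subplots(figsize=(10, 5))
    # Plot ping as a function of time
    ax.plot(t, df_clean['Ping (ms)'], 'o', label='Ping (ms)')
    # Add axis labels and title
    ax.set_xlabel('Time', fontsize=16)
    ax.set_ylabel('Ping (ms)', fontsize=16)
    ax.set_title('Ping time', fontsize=16)
    # Change tick size
    ax.tick_params(labelsize=14)
    ax.legend()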

f) Create a Histogram.

A histogram is a graphical representation of the frequency of the values of numerical data. Examine
the code below. An additional level of complexity is the use of subplots to display the histograms
side-by-side.
In [ ]:
# Code Cell 37
with plt.style.context('fivethirtyeight'):
    nbins = 100
    # Initialize figure
    fig, ax = plt.subplots(2, 2, figsize=(10, 10))
    ax[0][0].hist(df_clean['Ping (ms)'], nbins)
    ax[0][0].set_xlabel('Ping (ms)', fontsize=16)
    ax[0][0].tick_params(labelsize=14)
    ax[0][1].hist(df_clean['Upload (Mbit/s)'], nbins)
    ax[0][1].set_xlabel('Upload (Mbit/s)', fontsize=16)
    ax[0][1].tick_params(labelsize=14)
    ax[1][0].hist(df_clean['Download (Mbit/s)'], nbins)
    ax[1][0].set_xlabel('Download (Mbit/s)', fontsize=16)
    ax[1][0].tick_params(labelsize=14)
    ax[1][1].set_visible(False)

Lab - Simple Linear Regression in Python


Objectives
In this lab, you will become familiar with the concepts of simple linear regression and working with
the provided data to make a prediction.
 **Part 1: Import the Libraries and Data**
 **Part 2: Plot the Data**
 **Part 3: Perform Simple Linear Regression**

Scenario / Background
In statistics, linear regression is a way to model a relationship between a dependent variable y and
an independent variable x.

In this lab, you will analyze district sales data and perform a simple linear regression to predict
annual net sales based on the number of stores in the district.

Required Resources
 1 PC with Internet access
 Python libraries: pandas, numpy, scipy, and matplotlib
 Datafiles: stores-dist.csv

Part 1: Import the Libraries and Data


In this part, you will import the libraries and the data from the file stores-dist.csv.

Step 1: Import the libraries.


In this step, you will import the following libraries:

 matplotlib.pyplot as plt
 numpy as np
 pandas as pd

In [ ]:
# Code Cell 1
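A possible solution, importing the three libraries listed above:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd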

Step 2: Import the data.


In this step, you will import the data from the file stores-dist.csv and verify that the file was
imported correctly.
In [ ]:
# Code Cell 2

# Import the file, stores-dist.csv


salesDist = pd.read_csv('./Data/stores-dist.csv')

# Verify the imported data


salesDist.head()
The column headings annual net sales and number of stores in district are
renamed to simplify data processing:

 annual net sales to sales


 number of stores in district to stores

In [ ]:
# Code Cell 3
# Rename the columns to shorter names
salesDist = salesDist.rename(columns={'annual net sales':'sales','number of stores in district':'stores'})
salesDist.head()

Part 2: Plot the Data


Step 1: Determine the correlation.
In this step, you will investigate the correlation of the data prior to regression analysis. You will also
drop any unrelated columns as necessary.
In [ ]:
# Code Cell 4
# Check correlation of data prior to doing the analysis
# # Hint: check lab 3.1.5.5
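A possible solution (a sketch; corr computes the pairwise correlation matrix, as in lab 3.1.5.5):

salesDist.corr()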
From the correlation coefficient, it appears that the district column has a low correlation to
both the annual net sales and the number of stores in the district. The district column is
therefore not necessary for the regression analysis and can be dropped from the
dataframe.
In [ ]:
# Code Cell 5
# The district column has no relevance at this time, so it can be dropped.
#sales = salesDist.drop(...)

sales.head()
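One way to drop the column, matching the commented placeholder (a sketch; this line belongs in place of the placeholder, before sales.head()):

sales = salesDist.drop('district', axis=1)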
From the correlation coefficient data, what type of correlation did you observe between annual net
sales and number of stores in the district?
<font color = 'gray'>Type your answer here.</font>

Step 2: Create the plot.


In this step, you will create a plot to visualize the data. You will also assign stores as the
independent variable x and sales as the dependent variable y.
In [ ]:
# Code Cell 6
# dependent variable for y axis
y = sales['sales']
# independent variable for x axis
x = sales.stores
In [ ]:
# Code Cell 7
# Display the plot inline
%matplotlib inline

# Increase the size of the plot


plt.figure(figsize=(20,10))

# Create a scatter plot: Number of Stores in the District vs. Annual Net Sales
plt.plot(x,y, 'o', markersize = 15)

# Add axis labels and increase the font size


plt.ylabel('Annual Net Sales', fontsize = 30)
plt.xlabel('Number of Stores in the District', fontsize = 30)

# Increase the font size on the ticks on the x and y axis


plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Display the scatter plot


plt.show()

Part 3: Perform Simple Linear Regression


In this part, you will use numpy to generate a regression line for the analyzed data. You will also
calculate the centroid for this dataset. The centroid is the point whose coordinates are the means of
the x and y values of the dataset. The generated simple linear regression line must also pass
through the centroid.

Step 1: Calculate the slope and y-intercept of the linear regression line.
In [ ]:
# Code Cell 8
# Use numpy polyfit for linear regression to fit the data
# Generate the slope of the line (m)
# Generate the y-intercept (b)
m, b = np.polyfit(x,y,1)
print ('The slope of line is {:.2f}.'.format(m))
print ('The y-intercept is {:.2f}.'.format(b))
print ('The best fit simple linear regression line is {:.2f}x + {:.2f}.'.format(m,b))

Step 2: Calculate the centroid.


The centroid of the dataset is calculated by using the mean function.
In [ ]:
# Code Cell 9
# y coordinate for centroid
y_mean = y.mean()
# x coordinate for centroid
x_mean = x.mean()
print ('The centroid for this dataset is x = {:.2f} and y = {:.2f}.'.format(x_mean, y_mean))

Step 3: Overlay the regression line and the centroid point on the plot.
In [ ]:
# Code Cell 10
# Create the plot inline
%matplotlib inline

# Enlarge the plot size


plt.figure(figsize=(20,10))

# Plot the scatter plot of the data set


plt.plot(x,y, 'o', markersize = 14, label = "Annual Net Sales")

# Plot the centroid point


plt.plot(x_mean,y_mean, '*', markersize = 30, color = "r")

# Plot the linear regression line


plt.plot(x, m*x + b, '-', label = 'Simple Linear Regression Line', linewidth = 4)
# Create the x and y axis labels
plt.ylabel('Annual Net Sales', fontsize = 30)
plt.xlabel('Number of Stores in District', fontsize = 30)

# Enlarge x and y tick marks


plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Point out the centroid point in the plot


plt.annotate('Centroid', xy=(x_mean-0.1, y_mean-5), xytext=(x_mean-3, y_mean-20), arrowprops=dict(facecolor='black', shrink=0.05), fontsize = 30)

# Create legend
plt.legend(loc = 'upper right', fontsize = 20)

Step 4: Prediction
Using the linear regression line, you can predict the annual net sales based on the number of stores
in the district.
In [ ]:
# Code Cell 11
# Function to predict the net sales from the regression line
def predict(query):
    if query >= 1:
        predict = m * query + b
        return predict
    else:
        print("You must have at least 1 store in the district to predict the annual net sales.")
In [ ]:
# Code Cell 12
# Enter the number of stores in the function to generate the net sales prediction.
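A possible call (a sketch):

# Predicted annual net sales for a district with 4 stores
predict(4)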
What is the predicted net sales if there are 4 stores in the district?
<font color = 'gray'>Type your answer here.</font>
