Ds Capstone Template Coursera


William Andreas

July 11, 2022


Outline

• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

Executive Summary
• Summary of methodologies
• Data Collection with Web Scraping
• Data Collection via API
• Data Wrangling
• EDA with SQL
• EDA with data visualization
• Build interactive maps with Folium.
• Build interactive dashboards with Plotly Dash.
• Predictive analysis using classification machine learning model

• Summary of all results
• EDA
• Interactive Analytics
• Predictive Analysis

Introduction
• Project background and context
• SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other
providers charge upward of 165 million dollars each. Much of the savings comes from SpaceX's ability to
reuse the first stage.

• Problems we want to answer
• Which elements determine whether the rocket will land successfully?
• How do different elements interact to affect the likelihood of a successful landing?
• Which operational requirements must be met to ensure a successful landing program?

Section 1

Methodology
Executive Summary
• Data collection methodology:
• SpaceX REST API.
• Web scraping from SpaceX’s Wikipedia page.
• Perform data wrangling:
• Impute missing values, encode categorical data, and keep only the relevant columns.
• Perform exploratory data analysis (EDA) using visualization and SQL.
• Perform interactive visual analytics using Folium and Plotly Dash.
• Perform predictive analysis using classification models:
• Build several models (SVM, classification trees, kNN, and logistic regression).
• Find the best hyperparameters for each model.
• Find the method that performs best on the test data.
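
The wrangling step listed above (imputing missing values and encoding categorical data) can be sketched in plain Python. This is only an illustration: the records, the column names `PayloadMass` and `Orbit`, and the choice of mean imputation are assumptions, not details taken from the notebook.

```python
from statistics import mean

# Hypothetical launch records; None marks a missing payload mass.
launches = [
    {"PayloadMass": 500.0, "Orbit": "LEO"},
    {"PayloadMass": None, "Orbit": "GTO"},
    {"PayloadMass": 3500.0, "Orbit": "ISS"},
    {"PayloadMass": 2000.0, "Orbit": "GTO"},
]


def impute_mean(rows, column):
    """Return copies of rows with missing values in `column` replaced by the column mean."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Return copies of rows with `column` replaced by 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


clean = one_hot(impute_mean(launches, "PayloadMass"), "Orbit")
```

In a pandas workflow the same two steps would typically be done with the library's own fill and dummy-encoding helpers; the pure-Python version is used here only so the sketch runs without extra dependencies.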


Data Collection
How the data sets were collected:
Data was gathered with a GET request to the SpaceX API.
Next, we used the .json() method to decode the response content as JSON and the pandas json_normalize()
function to convert it into a pandas DataFrame.
The data was then cleaned: missing values were checked for and filled in as appropriate.
Additional launch data was collected by web scraping SpaceX’s page on Wikipedia using BeautifulSoup.
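
To illustrate what the JSON-to-table step does, here is a minimal sketch that flattens nested JSON records into single-level rows, which is roughly the transformation json_normalize() performs. It uses only the standard library, and the response body and its field names are hypothetical, not the real API schema.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining nested keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical response body; a real API response carries many more fields.
response_text = (
    '[{"flight_number": 1, "rocket": {"name": "Falcon 9"}, "payload_mass": null}]'
)
rows = [flatten(item) for item in json.loads(response_text)]
```

Each resulting row is a flat dictionary (for example with a `rocket.name` key), and missing values arrive as `None`, which is where the later cleaning step comes in.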

Data Collection – SpaceX API

• https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-spacex-data-collection-api.ipynb

Data Collection - Scraping
• https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter_labs_webscraping.ipynb

Data Wrangling
• https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-spacex-Data%20wrangling.ipynb

EDA with Data Visualization
Scatter plots to visualize the:
1. Relationship between Flight Number and Launch Site.
2. Relationship between Payload and Launch Site.
3. Relationship between Flight Number and Orbit type.
4. Relationship between Payload and Orbit type.
Bar chart to visualize the:
5. Success rate of each orbit type.
Line chart to visualize the:
6. Yearly launch success trend.

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-eda-dataviz.ipynb
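
The bar and line charts above both plot a success rate per group (per orbit type, or per year). A minimal sketch of that aggregation follows; the (orbit, class) pairs are hypothetical, with class 1 meaning a successful landing and 0 otherwise.

```python
from collections import defaultdict

# Hypothetical (orbit, class) pairs; class is 1 for a successful landing, 0 otherwise.
records = [("LEO", 1), ("LEO", 0), ("GTO", 0), ("GTO", 1), ("GTO", 0), ("SSO", 1)]


def success_rate_by_group(pairs):
    """Return {group: mean of the 0/1 outcomes} for (group, outcome) pairs."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, launches]
    for group, outcome in pairs:
        totals[group][0] += outcome
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


rates = success_rate_by_group(records)
```

Passing (year, class) pairs instead of (orbit, class) pairs gives the values for the yearly trend line.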
EDA with SQL
Display the names of the unique launch sites in the space mission.

Display 5 records where launch sites begin with the string 'CCA'.

Display the total payload mass carried by boosters launched by NASA (CRS).

Display the average payload mass carried by booster version F9 v1.1.

List the date when the first successful landing outcome on a ground pad was achieved.

List the names of the boosters which have success on a drone ship and a payload mass greater than 4000 but less than 6000.

List the total number of successful and failed mission outcomes.

List the names of the booster versions which have carried the maximum payload mass, using a subquery.

List the records which display the month names, failed landing outcomes on a drone ship, booster versions, and launch sites for the months in
year 2015.

Rank the count of successful landing outcomes between the dates 04-06-2010 and 20-03-2017 in descending order.

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-eda-sql-coursera_sqllite.ipynb
Build an Interactive Map with Folium
Marked all launch sites on a map, using map objects such as circles and color-labeled
markers to show the successful and failed launches for each site.
Drew lines showing the distance from a launch site to the nearest coastline, city,
railway, and highway.

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/lab_jupyter_launch_site_location.ipynb
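
The distances drawn on the map can be computed from latitude and longitude with the haversine formula. The following is a sketch under the assumption that a great-circle distance on a spherical Earth is accurate enough; the function name and the radius constant are illustrative choices, not taken from the notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of longitude along the equator, as a sanity check.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Calling the function with a launch site's coordinates and those of the nearest coastline point, railway, highway, or city gives the label to attach to each line on the map.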

Build a Dashboard with Plotly Dash
Built two plots for the dashboard:
1. Pie charts displaying the overall number of launches by site and each site's
success rate.
2. Scatter plot displaying the relationship between payload mass (kg) and launch outcome for
the different booster versions.

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/spacex_dash_app.py

Predictive Analysis (Classification)
Load the data, standardize it using StandardScaler, and split it
into training and test sets.
Using GridSearchCV, build several machine learning
models and tune their hyperparameters.
Using the best hyperparameters for each
model, calculate and display its accuracy on the test data.
The most effective classification model was identified by
comparing the test accuracy of the models.

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/SpaceX_Machine_Learning_Prediction_Part_5.ipynb
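
The workflow above (standardize, split, tune by cross-validated grid search, then score on held-out data) can be sketched without scikit-learn. The code below is a plain-Python stand-in for what StandardScaler and GridSearchCV do, shown for a kNN classifier only. The data, the grid of k values, and the fold layout are all hypothetical, and the scaler here is fitted on the training rows only, which is a design choice of this sketch rather than a statement about the notebook.

```python
from statistics import mean, pstdev


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    cols = list(zip(*train))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]

    return scale(train), scale(test)


def knn_predict(X_train, y_train, x, k):
    """Predict the majority class (0 or 1) among the k nearest training points."""
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    votes = [y_train[i] for i in order[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0


def accuracy(X_train, y_train, X_eval, y_eval, k):
    """Fraction of evaluation points that kNN classifies correctly."""
    hits = sum(knn_predict(X_train, y_train, x, k) == y for x, y in zip(X_eval, y_eval))
    return hits / len(y_eval)


def grid_search_k(X, y, grid, folds=3):
    """Return the k in `grid` with the best mean cross-validated accuracy."""

    def cv_score(k):
        scores = []
        for f in range(folds):
            val = [i for i in range(len(X)) if i % folds == f]
            tr = [i for i in range(len(X)) if i % folds != f]
            scores.append(accuracy([X[i] for i in tr], [y[i] for i in tr],
                                   [X[i] for i in val], [y[i] for i in val], k))
        return mean(scores)

    return max(grid, key=cv_score)


# Hypothetical two-feature data: class 0 clusters near (1, 1), class 1 near (9, 9).
X = [[1, 1], [9, 9], [1, 2], [9, 8], [2, 1], [8, 9], [2, 2], [8, 8],
     [1.5, 1.5], [8.5, 8.5], [1, 1.5], [9, 8.5]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

X_train, y_train, X_test, y_test = X[:8], y[:8], X[8:], y[8:]
X_train, X_test = standardize(X_train, X_test)
best_k = grid_search_k(X_train, y_train, grid=[1, 3, 5])
test_accuracy = accuracy(X_train, y_train, X_test, y_test, best_k)
```

Repeating the same tune-then-score loop for each model family and comparing the resulting test accuracies is the comparison the slide describes.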
Results
• Exploratory data analysis results
1. CCAFS LC-40 has a success rate of 60%, while KSC LC-39A and VAFB SLC-4E each have
a success rate of 77%.
2. KSC LC-39A had the highest launch success rate of all the sites.
3. At the VAFB SLC-4E launch site, no rockets were launched with a heavy payload
mass (greater than 10000).
4. The ES-L1, GEO, HEO, and SSO orbits have the best success rates.
5. In LEO, success appears related to the number of flights; in GTO, on the other hand, there
seems to be no relationship between flight number and success.
6. The success rate kept increasing from 2013 to 2020.

Results
Interactive analytics demo in screenshots

Results
• Exploratory data analysis results
• Interactive analytics demo in screenshots
• Predictive analysis results
• Logistic regression is the best model based on its test accuracy score of 0.833333,
and it is the fastest in running time. The difference between its accuracy scores on
the training and test data is also the smallest among the models, suggesting that the model is
stable.

Section 2
Flight Number vs. Launch Site

• Launches from CCAFS SLC-40 are significantly more numerous than launches
from other sites.

Payload vs. Launch Site

• At the VAFB SLC-4E launch site, no rockets were launched with a heavy payload mass (greater than 10000).

Success Rate vs. Orbit Type

• The ES-L1, GEO, HEO, and SSO orbits have the best success rates.

Flight Number vs. Orbit Type

• In LEO, success appears related to the number of flights; in GTO, on the other hand, there seems to be no relationship between
flight number and success.

Payload vs. Orbit Type

• With heavy payloads, the successful landing rate is higher for
Polar, LEO, and ISS orbits.
• For GTO, however, we cannot distinguish this well, because both successful and
unsuccessful landings occur across the payload range.

Launch Success Yearly Trend

• The success rate kept increasing from 2013 to 2020.

All Launch Site Names

Use the DISTINCT keyword on the Launch_Site column to show only the unique values of the
column.

Launch Site Names Begin with 'CCA'

Use the wildcard pattern LIKE 'CCA%' on the launch site column to filter the data, and
limit the result to the first 5 rows.

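The DISTINCT and LIKE queries described above can be tried against an in-memory SQLite database. This is a sketch only: the table name `SPACEXTBL`, the column names, and every row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SPACEXTBL (Launch_Site TEXT, Payload TEXT)")
# Hypothetical rows, just enough to exercise both queries.
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?)",
    [
        ("CCAFS LC-40", "payload A"),
        ("VAFB SLC-4E", "payload B"),
        ("CCAFS LC-40", "payload C"),
        ("KSC LC-39A", "payload D"),
        ("CCAFS SLC-40", "payload E"),
    ],
)

# Unique launch sites.
sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT Launch_Site FROM SPACEXTBL ORDER BY Launch_Site")]

# At most 5 rows whose launch site starts with 'CCA'.
cca_rows = conn.execute(
    "SELECT Launch_Site, Payload FROM SPACEXTBL "
    "WHERE Launch_Site LIKE 'CCA%' LIMIT 5").fetchall()
```

The `%` wildcard matches any run of characters, so `'CCA%'` keeps every site whose name starts with CCA.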
Total Payload Mass

Sum the payload mass carried, filtering on the payload column with the
wildcard pattern LIKE '%CRS%'.

Average Payload Mass by F9 v1.1

Average the payload mass, filtering for rows where the booster version is
F9 v1.1.

First Successful Ground Landing Date

First we list the distinct values of landing outcome, then we filter for
landing outcome = success on the ground pad and take min(date). Alternatively, we can
limit the result to the first row, since the earliest record is on the
top row. The first successful ground pad landing was on December 22, 2015.

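One caveat with min(date): if the dates are stored as DD-MM-YYYY text (an assumption here, suggested by the substr() offsets used elsewhere in this deck), MIN() compares strings rather than calendar dates. The sketch below rebuilds an ISO date first. The table, the outcome labels, and all rows other than the December 22, 2015 date are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SPACEXTBL (Date TEXT, Landing_Outcome TEXT)")
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?)",
    [
        ("22-12-2015", "Success (ground pad)"),
        ("08-04-2016", "Success (drone ship)"),
        ("01-05-2017", "Success (ground pad)"),
    ],
)

# Rebuild an ISO date (YYYY-MM-DD) so that MIN() orders chronologically.
iso_date = "substr(Date, 7, 4) || '-' || substr(Date, 4, 2) || '-' || substr(Date, 1, 2)"
first_landing = conn.execute(
    f"SELECT MIN({iso_date}) FROM SPACEXTBL "
    "WHERE Landing_Outcome = 'Success (ground pad)'"
).fetchone()[0]

# Plain MIN(Date) compares the text left to right, so the day of month dominates.
naive = conn.execute(
    "SELECT MIN(Date) FROM SPACEXTBL WHERE Landing_Outcome = 'Success (ground pad)'"
).fetchone()[0]
```

With these rows the ISO version returns the 2015 date, while the naive comparison picks the 2017 row because its text starts with "01".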
Successful Drone Ship Landing with Payload between 4000 and 6000

Select the distinct booster versions, filtering for rows
where the payload mass is between 4000 and 6000 and the landing outcome =
success on the drone ship.

Total Number of Successful and Failure Mission Outcomes

List the distinct values of mission outcome, then count the rows
for each distinct value.

Boosters Carried Maximum Payload

Select the booster versions whose payload mass equals the
maximum payload mass. We filter the data with a subquery on the maximum
of the payload column, and additionally order the result by
booster version.

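A minimal sketch of the subquery approach described above, again on an in-memory SQLite table. The table name, column names, booster labels, and masses are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SPACEXTBL (Booster_Version TEXT, PAYLOAD_MASS__KG_ INTEGER)"
)
# Hypothetical boosters; two of them share the maximum payload mass.
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?)",
    [("booster-B", 15600), ("booster-C", 3000), ("booster-A", 15600)],
)

# The inner SELECT finds the maximum; the outer SELECT keeps every row that matches it.
heaviest = [row[0] for row in conn.execute(
    "SELECT Booster_Version FROM SPACEXTBL "
    "WHERE PAYLOAD_MASS__KG_ = (SELECT MAX(PAYLOAD_MASS__KG_) FROM SPACEXTBL) "
    "ORDER BY Booster_Version")]
```

Using a subquery rather than ORDER BY ... LIMIT 1 keeps all boosters that tie for the maximum.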
2015 Launch Records

Because SQLite does not support month names, we use substr(Date, 4, 2) as
month to get the month and substr(Date, 7, 4) = '2015' for the year. We also select the booster
version and launch site where landing outcome = failure on drone ship, and use the
wildcard pattern LIKE '2015%' on the year column.

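The substr() extraction above can be sketched as follows, assuming the Date column holds DD-MM-YYYY text. The table, the outcome labels, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SPACEXTBL "
    "(Date TEXT, Booster_Version TEXT, Launch_Site TEXT, Landing_Outcome TEXT)"
)
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?, ?, ?)",
    [
        ("10-01-2015", "booster-A", "CCAFS LC-40", "Failure (drone ship)"),
        ("22-12-2015", "booster-B", "CCAFS LC-40", "Success (ground pad)"),
        ("04-03-2016", "booster-C", "CCAFS LC-40", "Failure (drone ship)"),
    ],
)

# substr() is 1-indexed: characters 4-5 are the month, 7-10 the year.
query = """
SELECT substr(Date, 4, 2) AS month, Booster_Version, Launch_Site
FROM SPACEXTBL
WHERE substr(Date, 7, 4) = '2015'
  AND Landing_Outcome = 'Failure (drone ship)'
"""
failures_2015 = conn.execute(query).fetchall()
```

The month comes back as a two-digit string such as "01"; mapping it to a month name would have to happen outside the query.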
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20

Filter the dates between '04-06-2010' and '20-03-2017', then count the
landing outcomes, keeping only the successful ones by using a
wildcard pattern. Group by landing outcome and order the result by the count of landing
outcomes in descending order.

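A sketch of the ranking query follows. It converts the dates to ISO form before applying BETWEEN, because DD-MM-YYYY strings do not sort chronologically; that conversion is an assumption of this sketch, not necessarily what the notebook does. The table, the labels, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SPACEXTBL (Date TEXT, Landing_Outcome TEXT)")
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?)",
    [
        ("22-12-2015", "Success (ground pad)"),
        ("08-04-2016", "Success (drone ship)"),
        ("14-08-2016", "Success (drone ship)"),
        ("10-01-2015", "Failure (drone ship)"),
        ("24-08-2017", "Success (drone ship)"),  # outside the date window
    ],
)

# Rebuild YYYY-MM-DD so that BETWEEN compares in chronological order.
iso_date = "substr(Date, 7, 4) || '-' || substr(Date, 4, 2) || '-' || substr(Date, 1, 2)"
ranking = conn.execute(
    f"SELECT Landing_Outcome, COUNT(*) AS n FROM SPACEXTBL "
    f"WHERE {iso_date} BETWEEN '2010-06-04' AND '2017-03-20' "
    "AND Landing_Outcome LIKE 'Success%' "
    "GROUP BY Landing_Outcome ORDER BY n DESC"
).fetchall()
```

The `LIKE 'Success%'` filter keeps every successful outcome regardless of landing surface, and GROUP BY then counts each one separately.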
Section 3
All Launch Sites Location

It can be seen from the map that the launch sites are always near the US coasts.

Color-labeled launch outcomes on the map
• Florida • California

Green markers indicate successful launches; red markers indicate failed ones.

Proximity of a selected launch site to landmarks
VAFB SLC-4E to coastline = 1.37 km (close enough)
VAFB SLC-4E to railway = 1.27 km (close enough)
VAFB SLC-4E to highway = 12.42 km (far enough)
VAFB SLC-4E to city = 13.99 km (far enough)

Section 4
Launch success count for all sites

It can be seen from the graph that the KSC LC-39A launch site has the best success
rate, while CCAFS SLC-40 has the worst.

Launch site with highest launch success ratio

The KSC LC-39A launch site achieved a 79% success rate.

Payload vs. Launch Outcome scatter plot for all sites

Left: payloads of less than 5000 kg
Right: payloads of more than 5000 kg

It can be concluded that heavier payloads (in this case more than 5000 kg) have a lower
success rate.

Section 5
Classification Accuracy

• The logistic regression and SVM models have the highest classification accuracy,
with a score of 0.833333.

Confusion Matrix

• The model can distinguish successful landings from unsuccessful ones. The problem is
the 3 false positives: launches that the model predicted would land but that in
fact did not.

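To make the false-positive count concrete, here is a small sketch that tallies a binary confusion matrix and the resulting accuracy. The 18 labels are an assumption: they were chosen only to be consistent with the reported accuracy of 0.833333 and 3 false positives, since the actual test-set size and matrix are not stated in the text.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means 'landed'."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted to land, but did not
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted not to land, but did
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical test set: 12 actual landings and 6 actual failures.
y_true = [1] * 12 + [0] * 6
y_pred = [1] * 12 + [1] * 3 + [0] * 3

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
```

With these assumed labels all errors are false positives, so the model never misses a real landing but is overly optimistic about failed ones.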
Conclusions
The KSC LC-39A launch site has the best success rate, while CCAFS SLC-40 has the worst.
The KSC LC-39A launch site achieved a 79% success rate.
It can be concluded that heavier payloads (in this case more than 5000 kg) have a lower
success rate.
The logistic regression and SVM models have the highest classification accuracy,
with a score of 0.833333.
The model can distinguish successful landings from unsuccessful ones. The problem is the
3 false positives: launches that the model predicted would land but that in fact
did not.

Appendix

https://github.com/W1lly-Wonka/IBM-Data-Science-Certification
