Ds Capstone Template Coursera
• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix
Executive Summary
• Summary of methodologies
• Data Collection with Web Scraping
• Data Collection via API
• Data Wrangling
• EDA with SQL
• EDA with data visualization
• Build interactive maps with Folium.
• Build interactive dashboards with Plotly Dash.
• Predictive analysis using classification machine learning models
• Predictive Analysis
Introduction
• Project background and context
• SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars, while other
providers charge upward of 165 million dollars each. Much of the savings comes from SpaceX being able to reuse
the first stage.
Section 1
Methodology
Executive Summary
• Data collection methodology:
• SpaceX Rest API.
• Web Scraping from SpaceX’s Wikipedia page.
• Perform data wrangling:
• Impute missing values, encode categorical data, and keep only the relevant columns of data.
• Perform exploratory data analysis (EDA) using visualization and SQL.
• Perform interactive visual analytics using Folium and Plotly Dash.
• Perform predictive analysis using classification models:
• Build several models (SVM, Classification Trees, kNN, and Logistic Regression).
• Find the best hyperparameters for each model.
Data Collection – SpaceX API
• https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-spacex-data-collection-api.ipynb
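As a rough illustration of the API collection step, the sketch below flattens raw launch records into flat rows. The endpoint URL and the field names (`flight_number`, `date_utc`, `rocket`, `launchpad`, `payloads`, `cores`) are assumptions based on the public SpaceX v4 launches API, not details taken from the notebook, and `flatten_launch` is a hypothetical helper.

```python
import json
from urllib.request import urlopen

# Assumed endpoint; the notebook may use a different URL or a static copy.
SPACEX_URL = "https://api.spacexdata.com/v4/launches/past"


def fetch_launches(url=SPACEX_URL):
    """Download the raw launch records as a list of dicts (needs network)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def flatten_launch(launch):
    """Keep a hypothetical selection of fields as one flat row."""
    cores = launch.get("cores") or [{}]
    first_core = cores[0]
    return {
        "flight_number": launch.get("flight_number"),
        "date": (launch.get("date_utc") or "")[:10],  # keep YYYY-MM-DD only
        "rocket_id": launch.get("rocket"),
        "launchpad_id": launch.get("launchpad"),
        "n_payloads": len(launch.get("payloads") or []),
        "landing_success": first_core.get("landing_success"),
    }


# Hand-written sample record (illustrative values, not real API output).
sample = [
    {
        "flight_number": 6,
        "date_utc": "2010-06-04T18:45:00.000Z",
        "rocket": "rocket-id-placeholder",
        "launchpad": "launchpad-id-placeholder",
        "payloads": ["payload-id-placeholder"],
        "cores": [{"landing_success": None}],
    }
]
rows = [flatten_launch(record) for record in sample]
```

To use live data instead of the sample, `fetch_launches()` would replace `sample`, assuming the endpoint is reachable.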
Data Collection - Scraping
• https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter_labs_webscraping.ipynb
Data Wrangling
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-spacex-Data%20wrangling.ipynb
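The wrangling steps named in the methodology (imputing missing values, encoding categorical data, keeping only relevant columns) can be sketched in plain Python as follows. The column names and records are hypothetical, and mean imputation is an assumption; the notebook most likely does the equivalent with pandas.

```python
from statistics import mean

# Toy records; "customer" stands in for a column that is not kept.
records = [
    {"payload_mass": 500.0, "orbit": "LEO", "customer": "A"},
    {"payload_mass": None, "orbit": "GTO", "customer": "B"},
    {"payload_mass": 3500.0, "orbit": "LEO", "customer": "C"},
]

# 1. Impute missing payload masses with the mean of the observed values.
observed = [r["payload_mass"] for r in records if r["payload_mass"] is not None]
fill_value = mean(observed)

# 2. One-hot encode the categorical "orbit" column.
orbits = sorted({r["orbit"] for r in records})

# 3. Build feature rows that contain only the relevant columns.
features = []
for r in records:
    mass = r["payload_mass"] if r["payload_mass"] is not None else fill_value
    row = {"payload_mass": mass}
    for orbit in orbits:
        row[f"orbit_{orbit}"] = 1 if r["orbit"] == orbit else 0
    features.append(row)
```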
EDA with Data Visualization
Scatter plots to visualize the:
1. Relationship between Flight Number and Launch Site.
2. Relationship between Payload and Launch Site.
3. Relationship between Flight Number and Orbit type.
4. Relationship between Payload and Orbit type.
Bar chart to visualize the:
5. Success rate of each orbit type.
Line chart to visualize the:
6. Yearly trend in launch success.
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-eda-dataviz.ipynb
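The bar chart and line chart above both plot a grouped success rate. A minimal sketch of that aggregation, using made-up launches and a hypothetical `success_rate_by` helper, might look like this (the plotting itself is left out):

```python
from collections import defaultdict

# Toy rows: (orbit, year, landed) with landed = 1 for success, 0 for failure.
launches = [
    ("LEO", 2015, 1),
    ("LEO", 2015, 0),
    ("GTO", 2016, 0),
    ("SSO", 2016, 1),
    ("GTO", 2017, 1),
]


def success_rate_by(rows, key_index):
    """Return {group: share of successful launches}, grouped by one field."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, launches]
    for row in rows:
        group = row[key_index]
        totals[group][0] += row[2]
        totals[group][1] += 1
    return {g: s / n for g, (s, n) in sorted(totals.items())}


by_orbit = success_rate_by(launches, 0)  # data behind the bar chart
by_year = success_rate_by(launches, 1)   # data behind the yearly trend line
```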
EDA with SQL
Display the names of the unique launch sites in the space mission.
Display 5 records where launch sites begin with the string 'CCA'.
Display the total payload mass carried by boosters launched by NASA (CRS).
List the date when the first successful landing outcome on a ground pad was achieved.
List the names of the boosters which landed successfully on a drone ship and have a payload mass greater than 4000 but less than 6000.
List the names of the booster_versions which have carried the maximum payload mass. Use a subquery.
List the records which display the month names, failure landing_outcomes on a drone ship, booster versions, and launch_site for the months in
year 2015.
Rank the count of successful landing_outcomes between the dates 04-06-2010 and 20-03-2017 in descending order.
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/jupyter-labs-eda-sql-coursera_sqllite.ipynb
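The 2015 query in the list above needs month names, which SQLite does not produce directly. One possible approach, sketched here with a hypothetical `launches` table, ISO-formatted dates, and placeholder booster names, is to extract the month number with `strftime` and map it to a name in Python:

```python
import calendar
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches (date TEXT, landing_outcome TEXT, "
    "booster_version TEXT, launch_site TEXT)"
)
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?, ?)",
    [
        # Toy rows with placeholder booster names.
        ("2015-01-10", "Failure (drone ship)", "B-A", "CCAFS LC-40"),
        ("2015-04-14", "Failure (drone ship)", "B-B", "CCAFS LC-40"),
        ("2015-12-22", "Success (ground pad)", "B-C", "CCAFS LC-40"),
        ("2016-03-04", "Failure (drone ship)", "B-D", "CCAFS LC-40"),
    ],
)

rows = conn.execute(
    """
    SELECT strftime('%m', date) AS month, landing_outcome,
           booster_version, launch_site
    FROM launches
    WHERE strftime('%Y', date) = '2015'
      AND landing_outcome = 'Failure (drone ship)'
    ORDER BY date
    """
).fetchall()

# Convert '01'..'12' into month names on the Python side.
records = [(calendar.month_name[int(m)], *rest) for m, *rest in rows]
```

This relies on the dates being stored as YYYY-MM-DD text; a DD-MM-YYYY column would need `substr` instead of `strftime`.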
Build an Interactive Map with Folium
Mark all launch sites on a map, using map objects such as circles and color-labeled
markers to show the successful and failed launches for each site.
Draw lines to show the distance from a launch site to the nearest coastline, city,
railway, and highway.
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/lab_jupyter_launch_site_location.ipynb
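The distance lines on the map need a distance between two latitude/longitude points. A common way to compute it is the haversine formula, sketched below; whether the notebook uses exactly this formula is an assumption, and `haversine_km` is a hypothetical name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Sanity check: one degree of longitude at the equator is about 111.19 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Passing a launch site's coordinates and those of the nearest coastline point would give the figure shown next to each line on the map.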
Build a Dashboard with Plotly Dash
Built two plots for the dashboard:
1. Pie charts displaying the number of launches by site and each site's
success rate.
2. A scatter plot displaying the relationship between Payload Mass (kg) and Outcome for several
booster versions.
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/spacex_dash_app.py
Predictive Analysis (Classification)
Load the data, transform it using StandardScaler, and split it
into train and test data.
Using GridSearchCV, build several machine learning
models and tune their respective hyperparameters.
Using the best hyperparameters for each
model, calculate and display its accuracy on the test data.
Identify the most effective classification model by
comparing the test accuracy scores of the models.
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification/blob/main/SpaceX_Machine_Learning_Prediction_Part_5.ipynb
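To make the first step concrete, here is a pure-Python stand-in for the scaling and splitting. It is not the scikit-learn code from the notebook: `standardize` mimics what StandardScaler computes (zero mean, unit population standard deviation per column), `train_test_split` is a simplified hypothetical version, and the test size and seed are assumed values. The hyperparameter search is not covered.

```python
import random
from statistics import fmean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit population std deviation."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant columns stay at 0
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def train_test_split(X, y, test_size=0.2, seed=2):
    """Shuffle indices reproducibly, then hold out a share for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return (
        [X[i] for i in train_idx],
        [X[i] for i in test_idx],
        [y[i] for i in train_idx],
        [y[i] for i in test_idx],
    )


# Toy feature matrix and labels (illustrative only).
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [5.0, 10.0]]
y = [0, 1, 0, 1, 1]
X_scaled = standardize(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
```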
Results
• Exploratory data analysis results
1. CCAFS LC-40 has a success rate of 60%, while KSC LC-39A and VAFB SLC-4E each have
a success rate of 77%.
2. KSC LC-39A had the highest launch success rate of all the sites.
3. For the VAFB-SLC launch site, no rockets were launched with a heavy payload
mass (greater than 10000 kg).
4. Orbits ES-L1, GEO, HEO, and SSO have the best success rates.
5. In LEO, success appears related to the number of flights; in GTO, on the other hand, there
seems to be no relationship between flight number and success.
6. The success rate kept increasing from 2013 until 2020.
Results
Interactive analytics demo in screenshots
Results
• Exploratory data analysis results
• Interactive analytics demo in screenshots
• Predictive analysis results
• Logistic Regression is the best model based on its test accuracy score of
0.833333, and it is the fastest to run. The difference between its accuracy scores on
the train and test data is also the smallest among the models, suggesting that the model is
stable.
Section 2
Flight Number vs. Launch Site
• The number of launches from CCAFS SLC 40 is significantly higher than the number of launches
from other sites.
Payload vs. Launch Site
• For the VAFB-SLC launch site, no rockets were launched with a heavy payload mass (greater than 10000 kg).
Success Rate vs. Orbit Type
• Orbits ES-L1, GEO, HEO, and SSO have the best success rates.
Flight Number vs. Orbit Type
• In LEO, success appears related to the number of flights; in GTO, on the other hand, there seems to be no relationship between
flight number and success.
Payload vs. Orbit Type
• With heavy payloads, the successful landing rate is higher for
Polar, LEO, and ISS orbits.
• For GTO, however, we cannot distinguish this well, because successful and
unsuccessful landings are mixed together.
Launch Success Yearly Trend
Use the DISTINCT keyword on the Launch_Site column to show only the unique values of the
column.
Launch Site Names Begin with 'CCA'
Use the wildcard pattern LIKE 'CCA%' on the launch site column to filter the data, and
limit the result to the first 5 rows.
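The DISTINCT and LIKE filters can be tried out on a small in-memory SQLite table. The table name `launches` and its single column are assumptions made for this sketch; the course database uses its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE launches (launch_site TEXT)")
sites = (
    ["CCAFS LC-40"] * 4
    + ["VAFB SLC-4E", "KSC LC-39A"]
    + ["CCAFS SLC-40"] * 3
)
conn.executemany("INSERT INTO launches VALUES (?)", [(s,) for s in sites])

# Unique launch sites only.
unique_sites = conn.execute(
    "SELECT DISTINCT launch_site FROM launches"
).fetchall()

# '%' matches any run of characters, so this keeps sites starting with CCA;
# LIMIT 5 then caps the result at five rows.
cca_rows = conn.execute(
    "SELECT launch_site FROM launches WHERE launch_site LIKE 'CCA%' LIMIT 5"
).fetchall()
```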
Total Payload Mass
First sum the payload mass carried, then filter the payload column by
using the wildcard pattern LIKE '%CRS%'.
Average Payload Mass by F9 v1.1
First average the payload mass, then filter on the condition that the booster version is
F9 v1.1.
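The two aggregates (total payload for CRS launches and average payload for F9 v1.1) follow the same pattern: an aggregate function plus a WHERE filter. In this sketch the table, the column names, and the numbers are all hypothetical, and filtering on a `customer` column is an assumption based on the task wording "launched by NASA (CRS)".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches (booster_version TEXT, customer TEXT, "
    "payload_mass_kg REAL)"
)
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?)",
    [
        ("F9 v1.0", "NASA (CRS)", 500.0),
        ("F9 v1.1", "NASA (CRS)", 2000.0),
        ("F9 v1.1", "SES", 3200.0),
    ],
)

# Total payload mass for rows whose customer contains 'CRS'.
(total_crs,) = conn.execute(
    "SELECT SUM(payload_mass_kg) FROM launches WHERE customer LIKE '%CRS%'"
).fetchone()

# Average payload mass for one exact booster version.
(avg_v11,) = conn.execute(
    "SELECT AVG(payload_mass_kg) FROM launches "
    "WHERE booster_version = 'F9 v1.1'"
).fetchone()
```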
First Successful Ground Landing Date
First, inspect the distinct values of the landing outcome column. Then filter for rows where
the landing outcome is a success on the ground pad and take
min(date); alternatively, limit the result to the first row, since that record is in the
top row. The first successful ground pad landing was on December 22, 2015.
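A minimal sketch of the MIN(date) approach, assuming a hypothetical `launches` table with dates stored as YYYY-MM-DD text (that format makes text ordering match chronological ordering) and toy rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE launches (date TEXT, landing_outcome TEXT)")
conn.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [
        ("2015-01-10", "Failure (drone ship)"),
        ("2016-07-18", "Success (ground pad)"),
        ("2015-12-22", "Success (ground pad)"),
        ("2016-04-08", "Success (drone ship)"),
    ],
)

# MIN works on ISO-formatted text dates because they sort chronologically.
(first_ground_pad,) = conn.execute(
    "SELECT MIN(date) FROM launches "
    "WHERE landing_outcome = 'Success (ground pad)'"
).fetchone()
```

Using MIN avoids relying on row order, which the "first row" shortcut depends on.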
Successful Drone Ship Landing with Payload between 4000 and 6000
First, select the distinct values of booster version, then filter on the condition
that the payload mass is between 4000 and 6000 and the landing outcome is a
success on the drone ship.
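The filter can be sketched as below, with a hypothetical table and placeholder booster names. The task asks for a mass greater than 4000 but less than 6000, so the sketch uses strict comparisons; SQL's BETWEEN would also include the two endpoints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches (booster_version TEXT, landing_outcome TEXT, "
    "payload_mass_kg REAL)"
)
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?)",
    [
        ("B-A", "Success (drone ship)", 4700.0),
        ("B-B", "Success (drone ship)", 3100.0),   # too light
        ("B-C", "Success (ground pad)", 5300.0),   # wrong landing outcome
        ("B-D", "Success (drone ship)", 6000.0),   # excluded by strict '<'
    ],
)

boosters = conn.execute(
    """
    SELECT DISTINCT booster_version
    FROM launches
    WHERE landing_outcome = 'Success (drone ship)'
      AND payload_mass_kg > 4000
      AND payload_mass_kg < 6000
    """
).fetchall()
```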
Total Number of Successful and Failure Mission Outcomes
First, find the distinct values of mission outcome, then count the rows
for each distinct value.
Boosters Carried Maximum Payload
Select the booster versions whose payload mass equals the
maximum payload mass. Filter the data with a subquery on the maximum of the
payload column, and order the result by
booster version.
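As an illustration of the subquery, the sketch below compares each row's payload mass against the table-wide maximum. Table name, column names, booster names, and masses are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches (booster_version TEXT, payload_mass_kg REAL)"
)
conn.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [("B-B", 15600.0), ("B-C", 9600.0), ("B-A", 15600.0)],
)

# The inner SELECT yields one value (the maximum); the outer query keeps
# every booster that reaches it, so ties are all returned.
max_boosters = conn.execute(
    """
    SELECT booster_version
    FROM launches
    WHERE payload_mass_kg = (SELECT MAX(payload_mass_kg) FROM launches)
    ORDER BY booster_version
    """
).fetchall()
```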
2015 Launch Records
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20
First, filter for dates between '04-06-2010' AND '20-03-2017', then count the
landing outcomes, keeping only the successful ones by using the
wildcard method. Group by landing outcome and order the table by the count of landing
outcomes in descending order.
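The ranking query can be sketched as follows. The table and rows are hypothetical, and the dates are stored as YYYY-MM-DD text here: BETWEEN on text only matches chronological order in that format, so a DD-MM-YYYY column would need converting first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE launches (date TEXT, landing_outcome TEXT)")
conn.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [
        ("2016-04-08", "Success (drone ship)"),
        ("2016-05-06", "Success (drone ship)"),
        ("2016-08-14", "Success (drone ship)"),
        ("2015-12-22", "Success (ground pad)"),
        ("2016-07-18", "Success (ground pad)"),
        ("2015-01-10", "Failure (drone ship)"),   # not a success
        ("2017-06-03", "Success (ground pad)"),   # outside the date range
    ],
)

ranking = conn.execute(
    """
    SELECT landing_outcome, COUNT(*) AS n
    FROM launches
    WHERE date BETWEEN '2010-06-04' AND '2017-03-20'
      AND landing_outcome LIKE 'Success%'
    GROUP BY landing_outcome
    ORDER BY n DESC
    """
).fetchall()
```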
Section 3
All Launch Sites Location
It can be seen from the map that the launch sites are always near the US coasts.
Color-labeled launch outcomes on the map
• Florida • California
The green color suggests that the launch is successful, and the red indicates otherwise.
Selected launch site to its proximity to landmarks
VAFB SLC-4E to coastline = 1.37 km (close enough)
VAFB SLC-4E to railways = 1.27 km (close enough)
Section 4
Launch success count for all sites
It can be seen from the graph that the KSC LC-39A launch site has the best success
rate while the worst is CCAFS SLC-40.
Launch site with highest launch success ratio
It can be concluded that the heavier the payload (in this case more than 5000 kg), the lower
the success rate.
Section 5
Classification Accuracy
• The Logistic Regression and SVM models have the highest classification accuracy of all,
with a score of 0.833333.
Confusion Matrix
• The model can distinguish successful landings from unsuccessful ones. The problem is
that there are 3 false positives: launches that were predicted to land but in
fact did not.
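To show how the false-positive count and the accuracy score relate, here is a small sketch that tallies a confusion matrix by hand. The labels are illustrative only: a test set of 18 launches with 12 actual landings is an assumption, chosen so that the counts reproduce the reported 3 false positives and 0.833333 accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'landed'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted landing, did not land
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Assumed labels: 12 real landings, 6 failures, 3 failures misclassified.
y_true = [1] * 12 + [0] * 6
y_pred = [1] * 12 + [1] * 3 + [0] * 3

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
```

Under these assumed labels, 15 of 18 predictions are correct, which rounds to 0.833333.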
Conclusions
KSC LC-39A launch site has the best success rate while the worst is CCAFS SLC-40.
KSC LC-39A launch site was able to achieve a 79% success rate.
It can be concluded that the heavier the payload (in this case more than 5000 kg), the lower
the success rate.
The Logistic Regression and SVM models have the highest classification accuracy of all,
with a score of 0.833333.
The model can distinguish successful landings from unsuccessful ones. The problem is that
there are 3 false positives: launches that were predicted to land but in fact
did not.
Appendix
https://github.com/W1lly-Wonka/IBM-Data-Science-Certification