
Ds Capstone Presentation

The document outlines a data science capstone project to predict if SpaceX will reuse the first stage of Falcon 9 rockets using machine learning models. It collects data through SpaceX's API and web scraping Wikipedia, then performs exploratory data analysis with visualization and SQL before building classification models to predict first stage landing outcomes. Key steps include data wrangling, interactive visualizations with Folium and Plotly Dash, and evaluating models to determine the best for binary classification. The goal is to help determine the cost of launches by predicting first stage reuse.

Uploaded by

Danish Nazrin

Data Science Capstone Project

Evgeny Zorin
29.08.2021

Outline

• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

Executive Summary
Summary of methodologies
-  Data collection
-  Data wrangling
-  Exploratory Data Analysis with Data Visualization
-  Exploratory Data Analysis with SQL
-  Building an interactive map with Folium
-  Building a Dashboard with Plotly Dash
-  Predictive analysis (Classification)

Summary of all results


-  Exploratory Data Analysis results
-  Interactive analytics demo in screenshots
-  Predictive analysis results

Introduction
Project background and context
SpaceX is the most successful company of the commercial space
age, making space travel affordable. The company advertises Falcon
9 rocket launches on its website at a cost of 62 million dollars;
other providers charge upward of 165 million dollars each. Much of
the savings comes from SpaceX being able to reuse the first stage.
Therefore, if we can determine whether the first stage will land, we
can determine the cost of a launch. Based on public information and
machine learning models, we are going to predict whether SpaceX will
reuse the first stage.

Questions to be answered
- How do variables such as payload mass, launch site, number of
flights, and orbits affect the success of the first stage landing?
- Does the rate of successful landings increase over the years?
- What is the best algorithm that can be used for binary classification
in this case?
Methodology
Data collection methodology:
- Using the SpaceX REST API
- Using web scraping from Wikipedia

Performed data wrangling


- Filtering the data
- Dealing with missing values
- Using One Hot Encoding to prepare the data for binary classification

Performed exploratory data analysis (EDA) using visualization and SQL

Performed interactive visual analytics using Folium and Plotly Dash

Performed predictive analysis using classification models
- Building, tuning and evaluation of classification models to ensure the best
results


Methodology
Data collection
The data collection process combined API requests to the SpaceX REST API with
web scraping of a table in SpaceX’s Wikipedia entry.
Both data collection methods were needed to get complete information about the
launches for a more detailed analysis.

Data columns obtained from the SpaceX REST API:
FlightNumber, Date, BoosterVersion, PayloadMass, Orbit, LaunchSite,
Outcome, Flights, GridFins, Reused, Legs, LandingPad, Block, ReusedCount,
Serial, Longitude, Latitude

Data columns obtained by web scraping Wikipedia:
Flight No., Launch site, Payload, PayloadMass, Orbit, Customer, Launch
outcome, Version Booster, Booster landing, Date, Time

Data collection – SpaceX API


Steps, in order:
- Requesting rocket launch data from SpaceX API
- Decoding the response content using .json() and turning it into a dataframe using .json_normalize()
- Requesting needed information about the launches from SpaceX API by applying custom functions
- Constructing data we have obtained into a dictionary
- Creating a dataframe from the dictionary
- Filtering the dataframe to only include Falcon 9 launches
- Replacing missing values of Payload Mass column with calculated .mean() for this column
- Exporting the data to CSV

GitHub URL: Data Collection API
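
The collection steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the notebook code: the sample records, the field names (`rocket_name`, `payload_mass_kg`) and the helper `fill_missing_with_mean` are hypothetical, and the standard library stands in for the pandas calls (`.json_normalize()`, `.mean()`) named on the slide.

```python
import csv
import io
import json
from statistics import mean

# Hypothetical response body; the real SpaceX API returns many more fields.
raw = '''[
  {"flight_number": 1, "rocket_name": "Falcon 1", "payload_mass_kg": 20.0},
  {"flight_number": 2, "rocket_name": "Falcon 9", "payload_mass_kg": null},
  {"flight_number": 3, "rocket_name": "Falcon 9", "payload_mass_kg": 500.0},
  {"flight_number": 4, "rocket_name": "Falcon 9", "payload_mass_kg": 700.0}
]'''


def fill_missing_with_mean(rows, column):
    """Replace None in `column` with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    column_mean = mean(known)
    return [
        {**r, column: column_mean if r[column] is None else r[column]}
        for r in rows
    ]


launches = json.loads(raw)  # decode the response content
falcon9 = [r for r in launches if r["rocket_name"] == "Falcon 9"]  # Falcon 9 only
falcon9 = fill_missing_with_mean(falcon9, "payload_mass_kg")  # impute payload mass

# Export to CSV (an in-memory buffer is used here instead of a file on disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(falcon9[0].keys()))
writer.writeheader()
writer.writerows(falcon9)
csv_text = buffer.getvalue()
```

Filtering before imputing, as on the slide, means the mean is computed from Falcon 9 payloads only and is not skewed by other rockets.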

Data collection – Web scraping


Steps, in order:
- Requesting Falcon 9 launch data from Wikipedia
- Creating a BeautifulSoup object from the HTML response
- Extracting all column names from the HTML table header
- Collecting the data by parsing HTML tables
- Constructing data we have obtained into a dictionary
- Creating a dataframe from the dictionary
- Exporting the data to CSV

GitHub URL: Data Collection with Web Scraping
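
The table-parsing idea can be illustrated with a small sketch. It deliberately swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra packages; the class `TableParser` and the HTML fragment are hypothetical stand-ins for the project's code and the Wikipedia page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect header cells (<th>) and data rows (<td>) from an HTML table."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._cell_tag = None  # "th" or "td" while inside a cell
        self._text = []
        self._row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._cell_tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            value = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(value)
            else:
                self._row.append(value)
            self._cell_tag = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []


# Hypothetical fragment standing in for the Wikipedia launch table.
html = """
<table>
  <tr><th>Flight No.</th><th>Launch site</th><th>Payload</th></tr>
  <tr><td>1</td><td>CCAFS</td><td>Dragon Spacecraft Qualification Unit</td></tr>
  <tr><td>2</td><td>CCAFS</td><td>Dragon</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Column-oriented dictionary: one list of values per column name.
launch_dict = {
    name: [row[i] for row in parser.rows]
    for i, name in enumerate(parser.headers)
}
```

A dictionary shaped like `launch_dict` can be passed directly to a dataframe constructor, which matches the "dictionary, then dataframe" order of the steps above.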

Data wrangling
In the data set, there are several different cases where the booster did not
land successfully. Sometimes a landing was attempted but failed due to an
accident. The landing outcomes are encoded as follows:
- True Ocean: the booster landed successfully in a specific region of the ocean
- False Ocean: the booster failed to land in a specific region of the ocean
- True RTLS: the booster landed successfully on a ground pad
- False RTLS: the booster failed to land on a ground pad
- True ASDS: the booster landed successfully on a drone ship
- False ASDS: the booster failed to land on a drone ship
We mainly convert those outcomes into Training Labels, where “1” means the
booster landed successfully and “0” means it was unsuccessful.

Steps, in order:
- Perform exploratory data analysis and determine Training Labels
- Calculate the number of launches on each site
- Calculate the number and occurrence of each orbit
- Calculate the number and occurrence of mission outcome per orbit type
- Create a landing outcome label from the Outcome column
- Export the data to CSV

GitHub URL: Data Wrangling
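
A minimal sketch of the labelling step, assuming the outcome strings listed above. The sample list and the helper `landing_class` are hypothetical, and treating every outcome that does not start with "True" as a failure is an assumption of this sketch.

```python
from collections import Counter

# Outcome strings as described on the slide; the sample list itself is made up.
outcomes = [
    "True ASDS", "False Ocean", "True RTLS",
    "False ASDS", "True Ocean", "False RTLS",
]


def landing_class(outcome: str) -> int:
    """Return 1 for a successful landing and 0 otherwise (assumed rule)."""
    return 1 if outcome.startswith("True") else 0


labels = [landing_class(o) for o in outcomes]   # training labels
outcome_counts = Counter(outcomes)              # occurrences of each outcome
success_rate = sum(labels) / len(labels)        # share of successful landings
```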


EDA with data visualization
Charts were plotted:
Flight Number vs. Payload Mass, Flight Number vs. Launch Site, Payload Mass
vs. Launch Site, Orbit Type vs. Success Rate, Flight Number vs. Orbit Type,
Payload Mass vs Orbit Type and Success Rate Yearly Trend

Scatter plots show the relationship between variables. If a relationship exists,
the variables could be used in a machine learning model.
Bar charts show comparisons among discrete categories. The goal is to show the
relationship between the specific categories being compared and a measured
value.
Line charts show trends in data over time (time series).

GitHub URL: EDA with Data Visualization

EDA with SQL


Performed SQL queries:
• Displaying the names of the unique launch sites in the space mission
• Displaying 5 records where launch sites begin with the string 'CCA'
• Displaying the total payload mass carried by boosters launched by NASA (CRS)
• Displaying average payload mass carried by booster version F9 v1.1
• Listing the date when the first successful landing outcome on a ground pad was achieved
• Listing the names of the boosters which landed successfully on a drone ship and have payload mass greater than
4000 but less than 6000
• Listing the total number of successful and failed mission outcomes
• Listing the names of the booster versions which have carried the maximum payload mass
• Listing the failed landing outcomes on a drone ship, their booster versions and launch site names for the months in
year 2015
• Ranking the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the date
2010-06-04 and 2017-03-20 in descending order

GitHub URL: EDA with SQL
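
The first four queries in the list can be sketched as follows. The slide does not state the database engine or schema, so this uses an in-memory SQLite database only because it ships with Python; the table name `launches`, its columns and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches (launch_site TEXT, customer TEXT, "
    "booster_version TEXT, payload_mass_kg REAL)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?, ?)",
    [
        ("CCAFS LC-40", "SpaceX", "F9 v1.0", 0.0),
        ("CCAFS LC-40", "NASA (CRS)", "F9 v1.0", 500.0),
        ("VAFB SLC-4E", "MDA", "F9 v1.1", 500.0),
        ("CCAFS LC-40", "NASA (CRS)", "F9 v1.1", 2300.0),
        ("KSC LC-39A", "NASA (CRS)", "F9 FT", 2500.0),
    ],
)

# Names of the unique launch sites.
unique_sites = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT launch_site FROM launches ORDER BY launch_site"
    )
]

# Up to 5 records where the launch site begins with 'CCA'.
cca_rows = conn.execute(
    "SELECT * FROM launches WHERE launch_site LIKE 'CCA%' LIMIT 5"
).fetchall()

# Total payload mass carried for NASA (CRS).
nasa_total = conn.execute(
    "SELECT SUM(payload_mass_kg) FROM launches WHERE customer = 'NASA (CRS)'"
).fetchone()[0]

# Average payload mass carried by booster version F9 v1.1.
avg_v11 = conn.execute(
    "SELECT AVG(payload_mass_kg) FROM launches WHERE booster_version = 'F9 v1.1'"
).fetchone()[0]

conn.close()
```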


Build an interactive map with Folium


Markers of all Launch Sites:
- Added Marker with Circle, Popup Label and Text Label of NASA Johnson Space Center using
its latitude and longitude coordinates as a start location.
- Added Markers with Circle, Popup Label and Text Label of all Launch Sites using their latitude
and longitude coordinates to show their geographical locations and proximity to Equator and
coasts.

Coloured Markers of the launch outcomes for each Launch Site:


- Added coloured Markers of success (Green) and failed (Red) launches using Marker Cluster to
identify which launch sites have relatively high success rates.

Distances between a Launch Site and its proximities:
- Added coloured Lines to show distances between the Launch Site KSC LC-39A (as an
example) and its proximities like Railway, Highway, Coastline and Closest City.

GitHub URL: Interactive Visual Analytics with Folium
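
Distances like the ones drawn on the map are typically computed from latitude and longitude with the haversine formula. The slide does not say which formula was used, so the sketch below is an assumption; the function name `haversine_km`, the Earth radius value and the coordinates are illustrative and not taken from the project.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Placeholder coordinates (latitude, longitude), not the project's values.
site = (28.57, -80.65)
point = (28.57, -80.57)
distance = haversine_km(*site, *point)
```

The resulting distance can then be attached to a line or marker label on the map.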

Build a Dashboard with Plotly Dash


Launch Sites Dropdown List:
- Added a dropdown list to enable Launch Site selection.

Pie Chart showing Success Launches (All Sites/Certain Site):
- Added a pie chart to show the total successful launches count for all sites and the
Success vs. Failed counts for the site, if a specific Launch Site was selected.

Slider of Payload Mass Range:


- Added a slider to select Payload range.

Scatter Chart of Payload Mass vs. Success Rate for the different Booster Versions:
- Added a scatter chart to show the correlation between Payload and Launch Success.

GitHub URL: SpaceX Dash App
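
The data selection behind the dropdown, pie chart, slider and scatter chart can be sketched without Dash itself. The functions `pie_values` and `scatter_points`, the "ALL" sentinel and the records below are hypothetical; in the real app this kind of logic would sit inside the callbacks that feed the charts.

```python
# Hypothetical launch records: (launch_site, payload_mass_kg, class)
records = [
    ("KSC LC-39A", 3000, 1),
    ("KSC LC-39A", 5300, 1),
    ("KSC LC-39A", 6000, 0),
    ("CCAFS LC-40", 2500, 1),
    ("CCAFS LC-40", 4000, 0),
    ("VAFB SLC-4E", 9600, 1),
]


def pie_values(records, site="ALL"):
    """Successes per site for all sites, or Success/Failed counts for one site."""
    if site == "ALL":
        totals = {}
        for launch_site, _, outcome in records:
            totals[launch_site] = totals.get(launch_site, 0) + outcome
        return totals
    selected = [outcome for launch_site, _, outcome in records if launch_site == site]
    return {"Success": sum(selected), "Failed": len(selected) - sum(selected)}


def scatter_points(records, payload_range, site="ALL"):
    """(payload, class) pairs within the slider range for the selected site."""
    low, high = payload_range
    return [
        (payload, outcome)
        for launch_site, payload, outcome in records
        if low <= payload <= high and site in ("ALL", launch_site)
    ]
```

For example, `pie_values(records, "KSC LC-39A")` gives the two slices for a single site, and `scatter_points(records, (2000, 5500))` gives the points for a chosen payload range across all sites.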



Predictive analysis (Classification)

Steps, in order:
- Creating a NumPy array from the column “Class” in data
- Standardizing the data with StandardScaler, then fitting and transforming it
- Splitting the data into training and testing sets with train_test_split function
- Creating a GridSearchCV object with cv = 10 to find the best parameters
- Applying GridSearchCV on LogReg, SVM, Decision Tree, and KNN models
- Calculating the accuracy on the test data using the method .score() for all models
- Examining the confusion matrix for all models
- Finding the method that performs best by examining the Jaccard_score and F1_score metrics

GitHub URL: Machine Learning Prediction
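
The evaluation metrics named in the last steps can be worked through by hand. The sketch below covers only the scoring part, not the grid search, and computes accuracy, the Jaccard score and the F1 score for the positive class from confusion-matrix counts in plain Python; the label vectors and function names are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Accuracy, Jaccard and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denominator = tp + fp + fn
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Jaccard = TP / (TP + FP + FN); defined as 0.0 here when there are no positives.
        "jaccard": tp / denominator if denominator else 0.0,
        # F1 = 2TP / (2TP + FP + FN)
        "f1": 2 * tp / (2 * tp + fp + fn) if denominator else 0.0,
    }


# Made-up labels: one false positive, no false negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]
result = scores(y_true, y_pred)
```

Because Jaccard and F1 both ignore true negatives, they can rank models differently from plain accuracy when the classes are imbalanced.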


Results

• Exploratory data analysis results


• Interactive analytics demo in screenshots
• Predictive analysis results

EDA with Visualization


Flight Number vs. Launch Site

Explanation:
• The earliest flights all failed while the latest flights all succeeded.
• The CCAFS SLC 40 launch site has about half of all launches.
• VAFB SLC 4E and KSC LC 39A have higher success rates.
• It can be assumed that each new launch has a higher chance of success.

Payload vs. Launch Site

Explanation:
• For every launch site the higher the payload mass, the higher the success
rate.
• Most of the launches with payload mass over 7000 kg were successful.
• KSC LC 39A has a 100% success rate for payload mass under 5500 kg too.

Success rate vs. Orbit type


Explanation:
• Orbits with 100% success rate:
- ES-L1, GEO, HEO, SSO
• Orbits with 0% success rate:
- SO
• Orbits with success rate between 50% and 85%:
- GTO, ISS, LEO, MEO, PO

Flight Number vs. Orbit type

Explanation:
• In the LEO orbit, success appears related to the number of flights;
in the GTO orbit, on the other hand, there seems to be no relationship
between flight number and success.
Payload Mass vs. Orbit type

Explanation:
• Heavy payloads have a negative influence on GTO orbits and positive
on GTO and Polar LEO (ISS) orbits.
Launch success yearly trend

Explanation:
• The success rate kept increasing from 2013 to 2020.

EDA with SQL


All launch site names

Explanation:
• Displaying the names of the unique launch sites in the space mission.

Launch site names begin with `CCA`

Explanation:
• Displaying 5 records where launch sites begin with the string 'CCA'.

Total payload mass

Explanation:
• Displaying the total payload mass carried by boosters launched by
NASA (CRS).

Average payload mass by F9 v1.1

Explanation:
• Displaying average payload mass carried by booster version F9 v1.1.

First successful ground landing date

Explanation:
• Listing the date when the first successful landing outcome on a ground
pad was achieved.
Successful drone ship landing with payload
between 4000 and 6000

Explanation:
• Listing the names of the boosters which have success in drone ship
and have payload mass greater than 4000 but less than 6000.

Total number of successful and failure mission outcomes

Explanation:
• Listing the total number of successful and failure mission outcomes.

Boosters carried maximum payload

Explanation:
• Listing the names of the booster versions which have carried the maximum
payload mass.

2015 launch records

Explanation:
• Listing the failed landing outcomes in drone ship, their booster
versions and launch site names for the months in year 2015.

Rank success count between 2010-06-04 and 2017-03-20

Explanation:
• Ranking the count of landing outcomes (such as Failure (drone ship) or Success
(ground pad)) between the date 2010-06-04 and 2017-03-20 in descending order.

Interactive map with Folium


All launch sites’ location markers on a global map
Explanation:
• Most of the launch sites are in proximity to the Equator. The ground moves
faster at the Equator than anywhere else on the surface of the Earth: anything
on the surface at the Equator is already moving at about 1670 km/hour. A
spacecraft launched from the Equator keeps that speed as it goes up into space
because of inertia, which helps it reach the speed needed to stay in orbit.
• All launch sites are in very close proximity to the coast. Launching rockets
towards the ocean minimises the risk of debris dropping or exploding near
people.
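
The 1670 km/hour figure follows from the Earth's circumference divided by the length of a day. The short sketch below reproduces it and shows how the speed drops with latitude; it treats the Earth as a sphere with the WGS 84 equatorial radius and uses a 24-hour day, both simplifying assumptions, and the 28.5° latitude is only an illustrative value.

```python
from math import cos, pi, radians

EQUATORIAL_RADIUS_KM = 6378.137  # WGS 84 equatorial radius
DAY_HOURS = 24.0                 # solar day; a sidereal day gives a slightly higher speed


def surface_speed_kmh(latitude_deg):
    """Eastward speed of a point on the surface due to the Earth's rotation."""
    circumference = 2 * pi * EQUATORIAL_RADIUS_KM * cos(radians(latitude_deg))
    return circumference / DAY_HOURS


equator_speed = surface_speed_kmh(0.0)   # roughly 1670 km/h
speed_at_28_5 = surface_speed_kmh(28.5)  # lower, scaled by cos(latitude)
```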

Colour-labeled launch records on the map


Explanation:
• From the colour-labeled markers
we should be able to easily
identify which launch sites have
relatively high success rates.
- Green Marker = Successful
Launch
- Red Marker = Failed Launch
• Launch Site KSC LC-39A has a
very high Success Rate.

Distance from the launch site KSC LC-39A to its proximities
Explanation:
• From the visual analysis of the launch site KSC LC-39A we can clearly see that
it is:
- relatively close to a railway (15.23 km)
- relatively close to a highway (20.28 km)
- relatively close to the coastline (14.99 km)
• The launch site KSC LC-39A is also relatively close to its closest city,
Titusville (16.32 km).
• A failed rocket, with its high speed, can cover distances like 15-20 km in a
few seconds. It could be potentially dangerous to populated areas.

Build a Dashboard with Plotly Dash
Launch success count for all sites

Explanation:
• The chart clearly shows that, of all the sites, KSC LC-39A has the most
successful launches.

Launch site with highest launch success ratio

Explanation:
• KSC LC-39A has the highest launch success rate (76.9%) with 10 successful and
only 3 failed landings.

Payload Mass vs. Launch Outcome for all sites

Explanation:
• The charts show that payloads between 2000 and 5500 kg have the highest
success rate.

Predictive analysis (Classification)

Classification Accuracy
[Table: Scores and Accuracy of the Test Set]
Explanation:
• Based on the scores of the Test Set, we cannot confirm which method
performs best.
• The identical Test Set scores may be due to the small test sample size (18
samples). Therefore, we tested all methods on the whole Dataset.
[Table: Scores and Accuracy of the Entire Data Set]
• The scores on the whole Dataset confirm that the best model is the
Decision Tree Model. This model has not only higher scores, but also
the highest accuracy.

Confusion Matrix
Explanation:
• Examining the confusion matrix, we see
that logistic regression can distinguish
between the different classes. We see
that the major problem is false positives.

Conclusion
• The Decision Tree Model is the best algorithm for this dataset.
• Launches with a low payload mass show better results than launches with a
larger payload mass.
• Most of the launch sites are in proximity to the Equator and all the sites
are in very close proximity to the coast.
• The success rate of launches increases over the years.
• KSC LC-39A has the highest launch success rate of all the sites.
• Orbits ES-L1, GEO, HEO and SSO have a 100% success rate.

Appendix

Special Thanks to:


Instructors
Coursera
IBM
