A
Mini Project Report
Bachelor of Technology
in
Computer Science and Engineering
Submitted by
SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)
Certificate
This is to certify that the Mini Project report work entitled "EARTHQUAKE MAGNITUDE
PREDICTION" is a bonafide work carried out by the team consisting of SRAVANI.K
bearing Roll no.16SS1A0524, HARIKA.D bearing Roll no.17SS5A0502, SAI SUNANDA.P
bearing Roll no. 17SS5A0506, in partial fulfillment of the requirements of the degree of
BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING
discipline to Jawaharlal Nehru Technological University, Hyderabad during the academic
year 2019-2020.
The results embodied in this report have not been submitted to any other University or
Institution for the award of any degree or diploma.
EXTERNAL EXAMINER
Declaration
SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)
Acknowledgement
We wish to take this opportunity to express our deep gratitude to all those who
helped, encouraged and motivated us and extended their cooperation in various ways
during our Mini Project report work. It is our pleasure to acknowledge the help of
all those individuals who were responsible for ensuring the successful completion of the
Mini Project report work.
We express our sincere gratitude to Dr. B. BALU NAIK, Principal of JNTUHCES
for his support during the course period.
We sincerely thank Dr. VENKATESHWAR REDDY, Vice Principal of JNTUHCES
for his kind help and cooperation.
We are thankful to Sri JOSHI SHRIPAD, Associate Professor and Head of the Department
of Computer Science and Engineering of JNTUHCES for his effective suggestions
during the course period.
Finally, we express our gratitude with great admiration and respect to our teaching
and non-teaching staff for their moral support and encouragement throughout the
course.
SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)
Abstract
Natural hazards like earthquakes are mostly the consequence of seismic waves spreading
beneath the surface of the earth. Earthquakes are dangerous precisely because they are
erratic, striking without warning, triggering fires and tsunamis and leading to the deaths
of countless individuals. If researchers could warn people weeks or months ahead of time
about seismic disturbances, evacuation and other preparations could be made to save
innumerable lives. Early identification and future earthquake prediction can be achieved
using machine learning models. Seismic stations continuously gather data without requiring
an event to occur, and the gathered data can be used to distinguish earthquake-prone
regions from non-earthquake-prone regions. Machine learning methods can be used to
analyse continuous time-series data in order to detect earthquakes effectively. The
pre-existing linear models applied to earthquake problems have failed to achieve a
significant amount of efficiency and generate overheads with respect to pre-processing.
The proposed work predicts earthquake magnitude using machine learning algorithms
(a random forest regressor).
List of Figures
Contents
Certificate i
Declaration ii
Acknowledgement iii
Abstract iv
Page
1 INTRODUCTION 1
1.1 Concept of an Earthquake . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Occurrence of Earthquakes . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Indian Geological Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Classification of Earthquake . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 LITERATURE SURVEY 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Literature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Earthquake Damages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 ANALYSIS 8
3.1 Environments Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Jupyter notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.2 The NoteBook Dashboard . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Overview of the Notebook UI . . . . . . . . . . . . . . . . . . . . 10
3.1.4 Running code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.5 Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.6 Packages installed with Anaconda . . . . . . . . . . . . . . . . . . 11
3.1.7 Using Python in Anaconda . . . . . . . . . . . . . . . . . . . . . . 11
4 DATASETS 12
4.1 Train and test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 EDA 15
5.1 Basemap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.2.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3 Types of Supervised Machine Learning Algorithms . . . . . . . . . . . . 22
6.3.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.4 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.5 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.6 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 23
6.4 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.1 Decision tree learning . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.3 Bagging to Random Forests . . . . . . . . . . . . . . . . . . . . . 25
6.4.4 Extremely Randomized Trees . . . . . . . . . . . . . . . . . . . . 25
6.4.5 Advantages of using Random Forest . . . . . . . . . . . . . . . . . 26
6.4.6 Disadvantages of using Random Forest . . . . . . . . . . . . . . . 26
6.5 Grid Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.6 Packages and Libraries Used . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.2 PANDAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.3 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.4 read_xl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.5 Train_test_split . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.6 nan_to_num . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.7 Workbook() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.8 Mktime() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.9 Timetuple() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7 IMPLEMENTATION CODE 32
8 OUTPUT 35
CONCLUSION 44
REFERENCES 45
Chapter 1
INTRODUCTION
Movement of the seismic plates under the surface of the earth, which supports life in many
forms, causes earthquakes, which are a natural hazard. Seismometers, which are used to
record the motion of these plates, are installed at various locations on the planet. These
instruments detect vertical motion of the plates and record it on a scale. The earth's
surface, formally called the crust, is divided into seven large tectonic plates. These larger
plates are further divided into several small sub-plates, which are observed to move apart
continuously.
There are different types of seismic plate motion; these can be stated as divergence,
convergence and transformation of plate boundaries. When the plates distance themselves
from each other, new boundaries are introduced. In the phenomenon of convergence,
plates of different densities tend to approach each other, giving rise to new geographical
structures. When the plates slide past each other, this type of motion is called
transformation. Divergence, convergence and transformation are together known as faults.
A fault in any geological region causes stress. When the accumulated stress is large,
it is released by the earth in the form of earthquakes and sometimes volcanic eruptions
(stress along with heat). Apart from faults, other causes of earthquakes include volcanic
eruptions, nuclear activities and mine blasts. The point of origin of an earthquake is
known as the focus. Earthquakes are recorded by a modern form of geophone called a
seismometer. These geophones are sensitive to even the small energy patterns that they
record. They work most efficiently when they are installed in groups and operate as a
cluster; a cluster of geophones can be deployed to increase the accuracy of the measured
seismic values.
Geophones are mainly used for two purposes: firstly, they increase accuracy by reducing
noise; and secondly, they record vertical displacements and ignore horizontal seismic
vibrations. Horizontally moving seismic waves, also called ground rolls, are considered
noise caused as a side effect of seismic energy patterns. Vertically propagating waves
strike the seismometers installed in a group almost simultaneously and are recorded. All
the vertical waves that hit the seismometers at the same time are recorded by the cluster,
and all others that hit with some delay are ignored. The vertically propagating waves can
be summed, and in the end this generates time-series data for recording.
Figure 1.1: Stages involved in earthquake analysis.
1.3 Indian Geological Setting
The tectonic framework of northern India is dominated by two main features:
1. The stable continental craton of peninsular India; and
2. The collision zone where India and Asia converge along the Himalayan plate boundary zone.
It is an established fact that the Indian plate is moving north relative to Asia at a rate of
20±3 mm/yr. The Asian plate is also moving northward, but at about half the rate of the
Indian plate. The difference in relative plate velocities produces an intercontinental
collision that is forming the Himalayan Mountains. Most of this convergence is
accommodated in a zone of deformation 50 km wide along the southern edge of the Tibetan
Plateau. However, several millimetres per year of convergence are accommodated by
distributed localized zones of deformation within the Indian plate, as demonstrated by the
earthquakes of Bhuj (Mw 7.7) in 2001, Anjar (M 6.0) in 1956, and Kachchh (M 7.5-8) in 1819.
The table below establishes the relation between the Modified Mercalli Intensity scale
and Richter-scale magnitudes.
Such studies are important for forecasting future earthquakes and assessing seismic
hazard, especially for developing countries like India, where destruction and deaths due
to earthquakes are several orders higher because of shoddy construction. The risk has
increased several times in the developing countries due to increasing population, while it
has decreased in the developed countries through the adoption of sound engineering
practices. For the Himalayas, hazard mitigation has become essential as several large
dams and developmental activities are coming up. For peninsular India it is also
necessary, as devastating earthquakes have occurred there and even small earthquakes
cause panic among the large population.
Chapter 2
LITERATURE SURVEY
2.1 Introduction
Occurrences of natural disasters are not avoidable, but their after-effects on human
livelihood and nature can be minimized to a considerable extent if one knows about them
in advance. Forecasted disasters can be mitigated with due preparedness and planned
disaster mitigation. Various types of natural disasters are forecast using different
technologies and methodologies. Disasters such as droughts can be predicted months in
advance; heavy rains, storms, floods and landslides can be predicted a few days before
they occur. The use of remote sensing from satellites, Geographic Information Systems
(GIS)/Global Positioning Systems (GPS) and advances in computer technology have made
near-accurate prediction of most natural disasters possible. But short- or medium-term
earthquake prediction has not yet become possible. What is possible is forecasting, or
long-term prediction or assessment of earthquake potential, based on seismicity patterns.
The current chapter elaborately discusses studies and research done on disaster-related
topics, with more detail on earthquake-related concepts such as earthquakes, prediction
of earthquakes and damage assessment, as the present work is focused on earthquake
damage assessment.
Researchers attempt to forecast disasters and set up warning systems so that appropriate
action can be taken before a disaster actually strikes. Various tools are available to
assess damages caused by disasters other than earthquakes, to help disaster management
(DM) authorities with preparedness and action plans. Some of the research works are
discussed below. Mr. Parminder Singh Bhogal suggested depth-area-duration (DAD) curves
for the likely extreme rainfall and the discharges thereof to mitigate likely flood
damages in a region. DAD curves can also be used for delineating the area that will be
under flood, based on rainfall and catchment data. Pre- and post-damage assessments for
landslides, floods, cyclones or droughts have been carried out in the past. Earthquakes,
which are a recurring disaster (about 55 percent of the Indian continent is prone to
earthquakes), do not give any warning. Even though many scientists and researchers have
put forth earthquake forecasting tools, which are described in the later part of this
chapter, the fact is that none has proved to be a foolproof system to rely on completely.
Three Greek scientists, Prof. Varotsos, Alexopoulos and Nomikos, proposed the VAN method
(the initials of the scientists form the acronym designating the method) for predicting
earthquakes. The VAN method consists of continuously recording telluric currents using a
network of stations that covers a particular region. Chinese seismologists invented a
seismoscope that indicates the relative intensity and direction of tremors. They
succeeded in making their admirable prediction of the 1975 Liaoning earthquake and could
save hundreds of thousands of people from death or injury. Yet the same seismoscope was
unable to predict the disaster that occurred the next year in the same region, destroying
the city of Tangshan, killing hundreds of thousands of people and injuring a number that
has never been revealed. Some more earthquake forecasting theories are discussed later.
Post-earthquake damage assessment is carried out by various organisations such as the
State Disaster Management Centres of India, the Earthquake Engineering Research Institute
(EERI), California, and the Building Materials and Technology Promotion Council (BMTPC),
New Delhi.
These damage assessments are carried out after the earthquake, and post-disaster
management strategies are planned based on their reports. This takes a fairly long
duration before any appropriate action can be administered to combat the disaster. The
BMTPC has published the "Vulnerability Atlas of India", which gives state-wise hazard maps
and district-wise risk tables, but it does not quantify the damages likely to occur due
to a disaster. Very little or no work has been done to forecast the quantum of damages
due to an earthquake, which would help the administration to deploy the requisite
resources for disaster management. M. Fischinger and P. Kante, and S. K. Agrawal,
A. Chourasia and Parashar, elaborate on seismic vulnerability assessments of buildings.
These vulnerability assessments are restricted to the structure under study and, again,
do not specify damages of the region
or the model base. As seen from the literature listed above, research has been done in the
areas of disaster prediction and forecasting of damages for disasters other than
earthquakes. The present work is an effort to develop a model to forecast an estimate of
the damages which may occur due to an earthquake; this will be of help to the
administration in organising disaster management.
Chapter 3
ANALYSIS
Software requirements
• Python 3.6
• Jupyter notebook
• Anaconda
Components
The Jupyter notebook combines three components:
- The notebook web application
- Kernels
- Notebook documents
• Kernels:
Separate processes started by the notebook web application that run the user's code in a
given language and return output back to the notebook web application. The kernel also
handles things like computations for interactive widgets, tab completion and introspection.
•Notebook Documents:
Notebook documents contain the inputs and outputs of an interactive session as well as
narrative text that accompanies the code but is not meant for execution. Rich output
generated by running code, including HTML, images, video and plots, is embedded in
the notebook, which makes it a complete and self-contained record of a computation.
When you run the notebook web application on your computer, notebook doc-
uments are just files on your local filesystem with a “.ipynb” extension. This allows you
to use familiar workflows for organizing your notebooks into folders and sharing them
with others.
A notebook consists of a linear sequence of cells. There are four basic cell types:
• Code cells: Input and output of live code that is run in the kernel
• Markdown cells: Narrative text with embedded LaTeX equations
• Heading cells: Six levels of hierarchical organization and formatting
• Raw cells: Unformatted text that is included, without modification, when notebooks are
converted to different formats using nbconvert
The top of the notebook list displays clickable breadcrumbs of the current directory. By
clicking on these breadcrumbs or on sub-directories in the notebook list, you can navigate
your file system.
To create a new notebook, click on the "New" button at the top of the list and select a
kernel from the dropdown (as seen below). Which kernels are listed depends on what is
installed on the server; some of the kernels in the screenshot below may not be available
to you. Notebooks and files can be uploaded to the current directory by dragging a
notebook file onto the notebook list or by using the "click here" text above the list.
However, each notebook is associated with a single kernel. This notebook is associated
with the IPython kernel, therefore runs Python code.
3.1.5 Anaconda
Anaconda is a package manager, an environment manager, a Python distribution, and a
collection of over 1,000 open source packages. It is free and easy to install, and it
offers free community support. Additionally, Anaconda can create custom environments
that mix and match different Python versions (2.6, 2.7, 3.3 or 3.4) and other packages
into isolated environments, and easily switch between them using conda, a multi-platform
package manager for Python and other languages.
Chapter 4
DATASETS
When we are building a mathematical model to predict the future, we must split the dataset
into a “Training Dataset” and a “Testing Dataset”. For example, if we are building a
machine learning model, the model is going to learn the relationships in the data first.
The model “learns” the mathematical relationship in the data using the “Training Dataset”.
In order to verify whether the model is valid, we have to test the model with data that
are different from the “Training Dataset”. Therefore, we check the model using the
“Testing Dataset”.
4.1 Train and test data
Training and test data are common for supervised learning algorithms. Given a dataset, it
is split into a training set and a test set. In the real world we have all kinds of data,
such as financial data or customer data, and an algorithm should make new predictions
based on new data. You can simulate this by splitting the dataset into training and test
data. The data is usually split around 80%-20% between the training and testing stages.
Under supervised learning, we split a dataset into training data and test data in Python ML.
pip install pandas
pip install sklearn
We use pandas to import the dataset and sklearn to perform the splitting. You can
import these packages as:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris
Following is the process of creating train and test sets in Python ML. So, let's take a
dataset first.
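As a small sketch of this step, using the Iris dataset from the imports above purely as a
stand-in for the earthquake data used later:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small built-in dataset as a stand-in for the earthquake data.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing (the 80%-20% split mentioned above).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)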
Chapter 5
EDA
5.1 Basemap
Basemap is a great tool for creating maps using Python in a simple way. It is a matplotlib
extension, so it has all of matplotlib's features for creating data visualizations, and it
adds geographical projections and some datasets for plotting coastlines, countries, and so
on, directly from the library.
Any map created with the Basemap library must start with the creation of a Basemap
instance:
mpl_toolkits.basemap.Basemap(llcrnrlon=None, llcrnrlat=None, urcrnrlon=None,
urcrnrlat=None, llcrnrx=None, llcrnry=None, urcrnrx=None, urcrnry=None, width=None,
height=None, projection='cyl', resolution='c', area_thresh=None, rsphere=6370997.0,
ellps=None, lat_ts=None, lat_1=None, lat_2=None, lat_0=None, lon_0=None, lon_1=None,
lon_2=None, o_lon_p=None, o_lat_p=None, k_0=None, no_rot=False, suppress_ticks=True,
satellite_height=35786000, boundinglat=None, fix_aspect=True, anchor='C', celestial=False,
round=False, epsg=None, ax=None)
The class constructor has many possible arguments, and all of them are optional:
• resolution: The resolution of the included coastlines, lakes, and so on. The options are
c (crude, the default), l (low), i (intermediate), h (high) and f (full). The None option
is a good one if a shapefile will be used instead of the included files, since no data
must be loaded and performance rises a lot.
• area_thresh: The threshold below which no coastline or lake will be drawn. Defaults are
10000, 1000, 100, 10 and 1 for resolutions c, l, i, h and f respectively.
• rsphere: Radius of the sphere to be used in the projections. Default is 6370997 metres.
If a sequence is given, its elements are taken as the semi-major and semi-minor axes of
an ellipsoid.
• ellps: An ellipsoid name, such as 'WGS84'. The allowed values are defined at
pyproj.pj_ellps.
• suppress_ticks: Suppress automatic drawing of axis ticks and labels in map projection
coordinates.
• fix_aspect: Fix the aspect ratio of the plot to match the aspect ratio of the map
projection region (default True).
• anchor: The place in the plot where the map is anchored. Default is C, which means the
map is centred. Allowed values are C, SW, S, SE, E, NE, N, NW and W.
• celestial: Use the astronomical conventions for longitude (i.e. negative longitudes to
the east of 0). Default False. Implies resolution=None.
• ax: Set the default axes instance.
Another option is setting the bounding box, but using the projected units (fig 5.3).
Finally, the last option is to set the bounding box by giving the centre point in
geographical coordinates and the width and height of the domain in the projection units
(fig 5.4).
Figure 5.2: Geographical Coordinates
1. The first two lines include the Basemap library and matplotlib. Both are necessary.
2. The map is created using the Basemap class, which has many options. Without passing
any option, the map has the Plate Carrée projection centred at longitude and latitude 0.
3. After setting up the map, we can draw what we want. In this case, the coastlines layer,
which comes with the library, is drawn using the method drawcoastlines().
4. Finally, the map has to be shown or saved. The methods from matplotlib are used: in
this example, plt.show() opens a window to explore the result, while
plt.savefig('file_name') would save the map into an image.
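A minimal script matching these four steps would look like the following sketch:

from mpl_toolkits.basemap import Basemap   # step 1: import Basemap ...
import matplotlib.pyplot as plt            # ... and matplotlib

map = Basemap()        # step 2: default Plate Carree projection centred at lon/lat 0
map.drawcoastlines()   # step 3: draw the coastline layer bundled with the library
plt.show()             # step 4: show the result (or plt.savefig('map.png') to save it)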
Projection:
The projection argument sets the map projection to be used (fig 5.6):
Figure 5.4: Setting the Bounding Box
map = Basemap(projection='cyl')
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral', lake_color='aqua')
map.drawcoastlines()
plt.show()
The default value is cyl, or Cylindrical Equidistant projection (fig 5.7), also known as
the Equirectangular projection or Plate Carrée.
Many projections require extra arguments:
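For example, a Mercator map needs the corner coordinates of the region as extra arguments
(and optionally lat_ts); a minimal sketch:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Mercator needs the corners of the region and cannot include the poles.
map = Basemap(projection='merc',
              llcrnrlat=-80, urcrnrlat=80,
              llcrnrlon=-180, urcrnrlon=180,
              lat_ts=20, resolution='c')
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral', lake_color='aqua')
map.drawcoastlines()
plt.show()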
Figure 5.6: Mercator Projection
Chapter 6
A machine is said to be learning from past experience (data fed in) with respect to some
class of tasks if its performance on a given task improves with the experience.
Figure 6.2: Labelled data set
6.2 Types of Supervised Learning
6.2.1 Classification
It is a supervised learning task where the output has defined labels (discrete values).
For example, in Figure 6.2 (A) above, the output "Purchased" has defined labels, i.e. 0
or 1: 1 means the customer will purchase and 0 means the customer will not purchase.
The goal here is to predict discrete values belonging to a particular class and to
evaluate them on the basis of accuracy. Classification can be either binary or
multi-class. In binary classification the model predicts either 0 or 1 (yes or no),
whereas in multi-class classification the model predicts one of more than two classes.
Example: Gmail classifies mail into more than one class, such as Social, Promotions,
Updates and Forums.
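As an illustrative sketch (using the Iris dataset, not the earthquake data), a multi-class
classifier can be trained and evaluated on accuracy as follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # three classes, so this is multi-class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)    # predicts one discrete label per sample
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))   # evaluated on accuracy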
6.2.2 Regression
It is a supervised learning task where the output has a continuous value. In Figure 6.2
(B) above, the output "Wind Speed" does not have discrete values but is continuous within
a particular range. The goal here is to predict a value as close to the actual output
value as the model can, and evaluation is done by calculating an error value. The smaller
the error, the greater the accuracy of the regression model.
Example: You can use regression to predict the house price from training data. The input
variables will be locality, size of the house, and so on.
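A small sketch of this evaluation-by-error idea, with made-up house-price data purely for
illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up training data: [size in sq. ft, number of rooms] -> price (in lakhs).
X = np.array([[500, 1], [750, 2], [1000, 2], [1200, 3], [1500, 3], [2000, 4]])
y = np.array([20.0, 30.0, 38.0, 45.0, 55.0, 72.0])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# The smaller the error, the better the regression model.
print(mean_absolute_error(y, y_pred))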
Strengths: The outputs have a probabilistic interpretation, and the algorithm can be
regularized to avoid overfitting.
Weaknesses: Logistic regression may underperform when there are multiple or non-linear
decision boundaries. The method is not flexible, so it does not capture more complex
relationships.
6.3.2 Logistic Regression
Logistic regression is a method used to estimate discrete values from a given set of
independent variables. It predicts the probability of occurrence of an event by fitting
the data to a logit function (see the sketch below); hence the name logistic regression.
As it predicts a probability, its output value lies between 0 and 1. A few related
algorithms are described in the sections that follow.
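The logit (sigmoid) function referred to here squashes any real-valued score into a
probability between 0 and 1; a tiny illustration:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # approximately [0.018, 0.5, 0.982]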
6.3.3 Classification
Classification means grouping the output into a class. If the algorithm tries to label the
input into two distinct classes, it is called binary classification; selecting between
more than two classes is referred to as multi-class classification.
Example: determining whether or not someone will default on a loan.
Strengths: classification trees perform very well in practice.
Weaknesses: unconstrained, individual trees are prone to overfitting.
• k-nearest neighbor algorithm
• Neural Networks (Multilayer perceptron)
• Similarity learning
Decision trees that are grown very deep tend to learn highly irregular patterns: they
overfit their training sets, i.e. they have low bias but very high variance. Random
forests are a way of averaging multiple deep decision trees, trained on different parts
of the same training set, with the goal of reducing the variance. This comes at the
expense of a small increase in bias and some loss of interpretability, but generally
greatly boosts the performance of the final model.
6.4.2 Bagging
The training algorithm for random forests applies the general technique of bootstrap
aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with
responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with
replacement of the training set and fits trees to these samples.
For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a classification or regression tree fb on Xb, Yb.
After training, predictions for an unseen sample x' can be made by averaging the
predictions from all the individual regression trees on x' (that is, as (1/B) times the
sum of fb(x') over b = 1, ..., B),
or by taking the majority vote in the case of classification trees. This bootstrapping
procedure leads to better model performance because it decreases the variance of the
model, without increasing the bias. This means that while the predictions of a single tree
are highly sensitive to noise in its training set, the average of many trees is not, as long
as the trees are not correlated. Simply training many trees on a single training set would
give strongly correlated trees (or even the same tree many times, if the training algorithm
is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them
different training sets. Additionally, an estimate of the uncertainty of the prediction can
be made as the standard deviation of the predictions from all the individual regression
trees on x’.
The number of samples / trees, B, is a free parameter. Typically, a few hundred
to several thousand trees are used, depending on the size and nature of the training set.
An optimal number of trees B can be found using cross-validation, or by observing the
out-of-bag error: the mean prediction error on each training sample xi, using only the
trees that did not have xi in their bootstrap sample. The training and test error tend to
level off after some number of trees have been fit.
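A brief sketch of bagging with scikit-learn's random forest, using a built-in dataset for
illustration; the out-of-bag score reported here is the error estimate described above:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

# B = 200 bootstrap samples / trees; oob_score computes the out-of-bag estimate
# from the samples each tree did not see during training.
forest = RandomForestRegressor(n_estimators=200, bootstrap=True,
                               oob_score=True, random_state=42)
forest.fit(X, y)

print(forest.oob_score_)   # out-of-bag R^2, a validation estimate at no extra cost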
Adding one further step of randomization yields extremely randomized trees: instead of
computing the locally optimal cut-point for the feature under consideration (based on,
e.g., information gain or the Gini impurity), a random cut-point is selected. This value
is selected from a uniform distribution within the feature's empirical range (in the
tree's training set). Then, of all the randomly generated splits, the split that yields
the highest score is chosen to split the node. Similar to ordinary random forests, the
number of randomly selected features to be considered at each node can be specified.
Default values for this parameter are the square root of p for classification and p for
regression, where p is the number of features in the model.
Advantages of using random forest include:
1. This algorithm is very stable. Even if a new data point is introduced in the dataset
the overall algorithm is not affected much since new data may impact one tree, but it is
very hard for it to impact all the trees.
2. The random forest algorithm works well when you have both categorical and numerical
features.
3. The random forest algorithm also works well when the data has missing values or has
not been scaled well.
Choosing k:
Some people simply go with recommendations based on past studies of the data type; others
use grid search. Grid search is best able to determine which value of k (for example, the
number of neighbours in k-NN) is optimal for your data.
Working:
First you need to build a grid. This is essentially a set of possible values your
hyper-parameter can take; for our case we can use [1, 2, 3, ..., 10]. Then you train your
k-NN model for each value in the grid: first 1-NN, then 2-NN, and so on. For each
iteration you get a performance score which tells you how well your algorithm performed
using that value of the hyper-parameter. After you have gone through the entire grid you
select the value that gave the best performance. Does this not go against the principle
of not using test data? It would, which is why grid search is often combined with
cross-validation, so that the test data is kept completely separate until we are truly
satisfied with our results and are ready to test. n-fold cross-validation takes a training
set and separates it into n parts; it then trains on n−1 folds and tests on the fold which
was left out. For each value in the grid, the algorithm is retrained n times, once for
each fold being left out. The performance across the folds is then averaged, and that is
the achieved performance for that hyper-parameter value. The selected hyper-parameter
value is the one which achieves the highest average performance across the n folds. Once
you are satisfied with your algorithm, you can then test it on the testing set; if you go
straight to the testing set, you risk overfitting.
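A minimal sketch of this grid-plus-cross-validation procedure, using scikit-learn's
GridSearchCV with a k-NN classifier and the grid [1, ..., 10] from the example above:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The grid of candidate hyper-parameter values: k = 1, 2, ..., 10.
param_grid = {'n_neighbors': list(range(1, 11))}

# 5-fold cross-validation: each k is trained 5 times and its scores are averaged.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # the k with the best average CV score
print(search.score(X_test, y_test))   # only now do we touch the test set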
6.6.2 PANDAS
Pandas is an open-source library that allows you to perform data manipulation in Python.
The Pandas library is built on top of NumPy, meaning Pandas needs NumPy to operate. Pandas
provides an easy way to create, manipulate and wrangle data, and it is also an elegant
solution for time-series data.
• It includes a powerful time series tool to work with
In a nutshell, Pandas is a useful library in data analysis. It can be used to perform data
manipulation and analysis. Pandas provide powerful and easy-to-use data structures, as
well as the means to quickly perform operations on these structures.
6.6.3 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib
is a multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to
huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like
line, bar, scatter, histogram etc.
matplotlib.pyplot is a collection of command-style functions that make matplotlib work
like MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a
figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it
keeps track of things like the current figure and plotting area, and the plotting functions
are directed to the current axes (please note that "axes" here and in most places in the
documentation refers to the axes part of a figure and not the strict mathematical term
for more than one axis).
6.6.4 read_xl
The readxl package makes it easy to get data out of Excel and into R. Compared to
many of the existing packages (e.g. gdata, xlsx, xlsReadWrite) readxl has no external
dependencies, so it’s easy to install and use on all operating systems. It is designed to
work with tabular data.
readxl supports both the legacy .xls format and the modern xml-based .xlsx
format. The libxls C library is used to support .xls, which abstracts away many of the
complexities of the underlying binary format. To parse .xlsx, we use the RapidXML C++
library.
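In the present work the same task, reading an Excel sheet into a data frame, is done in
Python with pandas' read_excel (see Chapter 7); a minimal sketch, assuming a file named
Book4.xlsx as used there:

import pandas as pd

# Reads the first worksheet of the Excel file into a DataFrame
# (pandas relies on an Excel engine such as openpyxl under the hood).
data = pd.read_excel("Book4.xlsx")
print(data.tail())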
6.6.5 Train_test_split
sklearn.model_selection.train_test_split(*arrays, **options) splits arrays or matrices
into random train and test subsets. It is a quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a single call for
splitting (and optionally subsampling) data in a one-liner.
train_size : float, int, or None, (default=None) If float, should be between
0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None, the value is automatically
set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must
be None.
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
Returns: splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix.
Else, output type is the same as the input type.
6.6.6 nan_to_num
The numpy.nan_to_num() function is used when we want to replace nan (Not a Number) with
zero and inf with finite numbers in an array. It replaces positive infinity with a very
large number and negative infinity with a very small (very negative) number.
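For example:

import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 2.5])
b = np.nan_to_num(a)   # nan -> 0.0, +inf -> largest float, -inf -> most negative float
print(b)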
6.6.7 Workbook()
The Workbook class is the main class exposed by the XlsxWriter module and it is the
only class that you will need to instantiate directly. The Workbook class represents the
entire spreadsheet as you see it in Excel and internally it represents the Excel file as it is
written on disk.
Constructor:
Workbook(filename[, options]) creates a new XlsxWriter Workbook object, where filename is
the name of the new Excel file to create and options is an optional dictionary of workbook
options.
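A minimal XlsxWriter sketch (note that the implementation in Chapter 7 creates its results
workbook with openpyxl instead):

import xlsxwriter

workbook = xlsxwriter.Workbook('demo.xlsx')   # represents the whole .xlsx file on disk
worksheet = workbook.add_worksheet()          # add one sheet to the workbook
worksheet.write(0, 0, 'magnitude')            # write a cell (row 0, column 0)
workbook.close()                              # must be called to actually write the file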
6.6.8 Mktime()
The Python time method mktime() is the inverse function of localtime(). Its argument is a
struct_time or full 9-tuple, and it returns a floating-point number, for compatibility
with time(). If the input value cannot be represented as a valid time, either
OverflowError or ValueError will be raised.
Syntax: time.mktime(t)
Parameters:
• t − This is the struct_time or full 9-tuple.
Return Value:
• This method returns a floating point number, for compatibility with time().
6.6.9 Timetuple()
The timetuple() method of datetime.date instances returns an object of type time.struct_time.
The struct_time is a named tuple object (A named tuple object has attributes that can
be accessed by an index or by name).
The struct_time object has attributes for representing both date and time
fields along with a flag to indicate whether Daylight Saving Time is active.
The named tuple returned by the timetuple() function will have its year, month
and day fields set as per the date object and fields corresponding to the hour, minutes,
seconds will be set to zero.
Strptime()
The Python time method strptime() parses a string representing a time according to a
format. The return value is a struct_time, as returned by gmtime() or localtime(). The
format parameter uses the same directives as those used by strftime(); it defaults to
"%a %b %d %H:%M:%S %Y", which matches the formatting returned by ctime().
If the string cannot be parsed according to the format, or if it has excess data after
parsing, ValueError is raised.
Syntax: Following is the syntax for the strptime() method:
time.strptime(string[, format])
Parameters:
• string − This is the time in string format which would be parsed based on
the given format.
• format − This is the directive which would be used to parse the given string.
• %H - hour, using a 24-hour clock (00 to 23)
• %I - hour, using a 12-hour clock (01 to 12)
• %j - day of the year (001 to 366)
• %m - month (01 to 12)
• %M - minute
• %n - newline character
• %p - either am or pm according to the given time value
• %r - time in a.m. and p.m. notation
• %R - time in 24 hour notation
• %S - second
• %t - tab character
• %T - current time, equal to %H:%M:%S
• %u - weekday as a number (1 to 7), Monday=1. Warning: In Sun Solaris Sunday=1
• %U - week number of the current year, starting with the first Sunday as the first day
of the first week
• %V - The ISO 8601 week number of the current year (01 to 53), where week 1 is the
first week that has at least 4 days in the current year, and with Monday as the first day
of the week
• %W - week number of the current year, starting with the first Monday as the first day
of the first week
• %w - day of the week as a decimal, Sunday=0
• %x - preferred date representation without the time
• %X - preferred time representation without the date
• %y - year without a century (range 00 to 99)
• %Y - year including the century
• %Z or %z - time zone name or abbreviation
• %% - a literal % character
Return Value:
This return value is struct_time as returned by gmtime() or localtime().
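Putting strptime(), timetuple() and mktime() together gives the date-to-timestamp
conversion used in Chapter 7; for example (the date string is just an illustration):

import time
import datetime

date_string = "26/01/2001"   # example input in d/m/Y format

# strptime parses the string, timetuple() gives a struct_time
# (time-of-day fields set to zero), and mktime converts it to a Unix timestamp.
parsed = datetime.datetime.strptime(date_string, "%d/%m/%Y")
timestamp = time.mktime(parsed.timetuple())
print(timestamp)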
Chapter 7
IMPLEMENTATION CODE
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# Load the earthquake catalogue and keep the columns of interest.
data = pd.read_excel("Book4.xlsx")
data.tail()
data = data[['time', 'date', 'timestamp', 'latitude', 'longitude', 'place', 'depth', 'mag']]
data.tail()
final_data = data.drop(['date', 'time'], axis=1)
#final_data = final_data[final_data.time != 'ValueError']
final_data.head()

# Plot all affected areas on a Mercator map covering India.
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='merc', llcrnrlat=8., urcrnrlat=37., llcrnrlon=68.,
            urcrnrlon=97., lat_0=54.5, lon_0=-4.36, resolution='c')
longitudes = data["longitude"].tolist()
latitudes = data["latitude"].tolist()
x, y = m(longitudes, latitudes)
fig = plt.figure(figsize=(12, 10))
plt.title("All affected areas")
m.plot(x, y, "o", markersize=2, color='red')
m.drawcoastlines()
m.fillcontinents(color='skyblue', lake_color='aqua')
m.drawmapboundary()
m.drawcountries()
plt.show()
data.loc[0:]

# Features are timestamp, latitude and longitude; targets are magnitude and depth.
from sklearn.model_selection import train_test_split
X = final_data[['timestamp', 'latitude', 'longitude']]
y = final_data[['mag', 'depth']].astype('float32')
X[:] = np.nan_to_num(X)
y[:] = np.nan_to_num(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Fit a random forest regressor and predict magnitude and depth for the test set.
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
predicted = y_pred.tolist()

# Write the test-set predictions to an Excel sheet.
import openpyxl
wb = openpyxl.Workbook()
Sheet_name = wb.sheetnames
wb.save(filename='results.xlsx')
resdf = pd.read_excel('results.xlsx')
resdf['Timestamp'] = X_test['timestamp'].tolist()
resdf['Longitude'] = X_test['longitude'].tolist()
resdf['Latitude'] = X_test['latitude'].tolist()
places = []
for instance in resdf.itertuples():
    for row in data.itertuples():
        if instance.Longitude == row.longitude and instance.Latitude == row.latitude:
            places.append(row.place)
del places[-1]
resdf['Place'] = places
resdf['y_Predicted(mag,depth)'] = predicted
resdf.head(100)
reg.score(X_test, y_test)

# Tune the number of trees with grid search.
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators': [10, 20, 50, 100, 200, 500]}
grid_obj = GridSearchCV(reg, parameters)
grid_fit = grid_obj.fit(X_train, y_train)
best_fit = grid_fit.best_estimator_
t = best_fit.predict(X_test)
print(t)
best_fit.score(X_test, y_test)

# Predict magnitude and depth for a user-supplied date and location.
import time
import datetime
date = input('please enter date in d m y format')
ts = time.mktime(datetime.datetime.strptime(date, "%d/%m/%Y").timetuple())
print(ts)
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)
latitude = input('lat')
longitude = input('long')
X_test1 = [ts, float(latitude), float(longitude)]  # convert the text inputs to numbers
predicted = reg.predict([X_test1])
for row in data.itertuples():
    if float(longitude) == row.longitude and float(latitude) == row.latitude:
        print("Place: " + row.place)
print(predicted)
Chapter 8
OUTPUT
[Output screenshots 1-9 from the original report are omitted here.]
CONCLUSION
Earthquakes are hard to understand and are dangerous to live through. Many people may
never have experienced an earthquake or may never experience one; whatever the situation,
everyone should be prepared and know how to deal with one. Predictions can be made, but
the evidence behind them matters. Forecasting of earthquakes rests on knowledge of past
earthquakes on a specific fault. It can thus be observed that, by using the proposed
algorithmic model for earthquake prediction, proper methods can be implemented for
deploying warnings and preparing for earthquakes. The proposed algorithmic model
efficiently performs data analysis using machine learning and can be used for deriving
insights related to earthquakes.
REFERENCES