
EARTHQUAKE MAGNITUDE PREDICTION

A
Mini Project Report

Submitted in partial fulfillment of


the requirements for the award of the degree of

Bachelor of Technology
in
Computer Science and Engineering

Submitted by

SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)

Under the guidance of


Dr.B.V. RamNaresh Yadav

Department of Computer Science and Engineering


JNTUH College of Engineering, Sultanpur
Sultanpur(V), Pulkal(M), Sangareddy district, Telangana –502273
November 2019
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY
HYDERABAD COLLEGE OF ENGINEERING SULTANPUR
Sultanpur (V), Pulkal (M), Sangareddy - 502273, Telangana.

Department of Computer Science and Engineering

Certificate

This is to certify that the Mini Project report work entitled "EARTHQUAKE MAGNITUDE
PREDICTION" is a bonafide work carried out by the team consisting of SRAVANI.K
bearing Roll No. 16SS1A0524, HARIKA.D bearing Roll No. 17SS5A0502, and SAI SUNANDA.P
bearing Roll No. 17SS5A0506, in partial fulfillment of the requirements of the degree of
BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING
discipline to Jawaharlal Nehru Technological University, Hyderabad during the academic
year 2019-2020.

The results embodied in this report have not been submitted to any other University or
Institution for the award of any degree or diploma.

Project Guide Head of the Department


Dr.B.V.RamNaresh Yadav JOSHI SHRIPAD
Associate Professor Associate Professor

EXTERNAL EXAMINER

i
Declaration

We hereby declare that the Mini project entitled "EARTHQUAKE MAGNITUDE


PREDICTION" is the work carried out by the team consisting of SRAVANI.K bearing
Roll No. 16SS1A0524, HARIKA.D bearing Roll No. 17SS5A0502, and SAI SUNANDA.P
bearing Roll No. 17SS5A0506, and is submitted in partial fulfilment of the requirements
for the award of the degree of Bachelor of Technology in Computer Science and Engineering
from Jawaharlal Nehru Technological University Hyderabad College of Engineering
Sultanpur. The results embodied in this project have not been submitted to any other
university or institution for the award of any degree or diploma.

SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)

ii
Acknowledgement

We wish to take this opportunity to express our deep gratitude to all those who
helped, encouraged, motivated and have extended their cooperation in various ways
during our Mini project report work. It is our pleasure to acknowledge the help of
all those individuals who were responsible for foreseeing the successful completion of
Mini project report work.
We express our sincere gratitude to Dr. B. BALU NAIK, Principal of JNTUHCES
for his support during the course period.
We sincerely thank Dr. VENKATESHWAR REDDY, Vice Principal of JNTUHCES
for his kind help and cooperation.
We are thankful to Sri JOSHI SHRIPAD, Associate Professor and Head of the Depart-
ment of Computer Science and Engineering of JNTUHCES for his effective suggestions
during the course period.
Finally, we express our gratitude with great admiration and respect to our teaching
and non-teaching staff for their moral support and encouragement throughout the
course.

SRAVANI.K (16SS1A0524)
HARIKA.D (17SS5A0502)
SAI SUNANDA.P (17SS5A0506)

iii
Abstract

Natural hazards like earthquakes are mostly the consequence of seismic waves propagating
beneath the surface of the earth. Earthquakes are dangerous precisely because they are
erratic, striking without warning, triggering fires and tsunamis and leading to the deaths
of countless individuals. If researchers could warn people weeks or months ahead of time
about seismic disturbances, evacuation and other preparations could be made to save
countless lives. Early identification and future earthquake prediction can be achieved
using machine learning models. Seismic stations continuously gather data without the
necessity of the occurrence of an event. The gathered data can be used to distinguish
earthquake-prone and non-earthquake-prone regions. Machine learning methods can be used
for analyzing continuous time series data in order to detect earthquakes effectively. The
pre-existing linear models applied to earthquake problems have failed to achieve a
significant amount of efficiency and generate overheads with respect to pre-processing.
The proposed work addresses predicting earthquake magnitude using machine learning
algorithms (specifically, a random forest regressor).

iv
List of Figures

1.1 stages involved in earthquake analysis. . . . . . . . . . . . . . . . . . . . 2

3.1 Jupyter Dashboard showing file tab. . . . . . . . . . . . . . . . . . . . . . 9


3.2 Jupyter new menu. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.1 Data Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


4.2 data splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 Split Train and Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.4 data.head() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

5.1 Technology Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


5.2 Geographical Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.3 Projection units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Setting the Bounding Box . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.5 Simple Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.6 Mercator Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.7 Cylindrical Equidistant projection . . . . . . . . . . . . . . . . . . . . . . 19

6.1 Types of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


6.2 Labelled data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.3 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

8.1 Read the DataSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


8.2 Attribute selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.3 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.4 Total DataSet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.5 Split_train_test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.6 Predicted TestData . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.7 Cross Validation of DataSet . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.8 Conversion of TimeStamp . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.9 Predicted Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

v
Contents

Certificate i

Declaration ii

Acknowledgement iii

Abstract iv
Page

1 INTRODUCTION 1
1.1 Concept of an Earthquake . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Occurrence of Earthquakes . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Indian Geological Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Classification of Earthquake . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 LITERATURE SURVEY 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Literature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Earthquake Damages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 ANALYSIS 8
3.1 Environments Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Jupyter notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.2 The NoteBook Dashboard . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Overview of the Notebook UI . . . . . . . . . . . . . . . . . . . . 10
3.1.4 Running code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.5 Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.6 Packages installed with Anaconda . . . . . . . . . . . . . . . . . . 11
3.1.7 Using Python in Anaconda . . . . . . . . . . . . . . . . . . . . . . 11

4 DATASETS 12
4.1 Train and test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 EDA 15
5.1 Basemap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

6 ALGORITHMS AND FUNCTIONS USED 20


6.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.2 Types of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

vi
6.2.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3 Types of Supervised Machine Learning Algorithms . . . . . . . . . . . . 22
6.3.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.4 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.5 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.3.6 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 23
6.4 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.1 Decision tree learning . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.4.3 Bagging to Random Forests . . . . . . . . . . . . . . . . . . . . . 25
6.4.4 Extremely Randomized Trees . . . . . . . . . . . . . . . . . . . . 25
6.4.5 Advantages of using Random Forest . . . . . . . . . . . . . . . . . 26
6.4.6 Disadvantages of using Random Forest . . . . . . . . . . . . . . . 26
6.5 Grid Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.6 Packages and Libraries Used . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.1 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.2 PANDAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.6.3 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.4 read_xl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.5 Train_test_split . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6.6 nan_to_num . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.7 Workbook() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.8 Mktime() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.6.9 Timetuple() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7 IMPLEMENTATION CODE 32

8 OUTPUT 35

CONCLUSION 44

REFERENCES 45

vii
Chapter 1

INTRODUCTION

Movement of tectonic plates beneath the surface of the earth, which supports life in many
forms, causes earthquakes, a natural hazard. Seismometers, which are used to record the
motion of these plates, are installed at various locations on the planet. These
instruments detect vertical motion of the plates and record it on a scale. The earth's
surface, formally called the crust, is divided into seven large tectonic plates. These larger
plates are further divided into several smaller sub-plates, which are observed to move
apart continuously.
Plate boundaries interact in several ways, namely divergence, convergence and
transformation. When the plates move away from each other, new boundaries are introduced.
In the phenomenon of convergence, plates of different densities approach each other, giving
rise to new geographical structures. When plates slide past each other, this type of motion
is called transformation. Divergence, convergence and transformation together give rise to
what are known as faults.
A fault in any geological region causes stress. When the accumulated stress is large,
the earth releases it in the form of earthquakes and sometimes volcanic eruptions (stress
along with heat). Apart from faults, other causes of earthquakes include volcanic
eruptions, nuclear activities and mine blasts. The point of origin of an earthquake
is known as the focus. Earthquakes are recorded by a modern form of geophone called a
seismometer. These geophones are sensitive enough to record even small energy patterns.
They work most efficiently when they are installed in groups and operated as a cluster.
A cluster of geophones can be deployed to increase the accuracy of seismic measurements.
Geophones are mainly used for two purposes: first, they increase accuracy
by reducing noise; and second, they record vertical displacements while ignoring
horizontal seismic vibrations. Horizontally moving seismic waves, also called ground
rolls, are considered noise caused as a side effect of seismic energy patterns.
Vertically propagating waves strike the seismometers installed in a group almost
simultaneously and are recorded. All the vertical waves that hit the seismometers at the
same time are recorded by the cluster, and all others that arrive with some delay are
ignored. The sum of the vertically propagating waves can be calculated and, in the end,
generates time series data for recording.

1
Figure 1.1: stages involved in earthquake analysis.

1.1 Concept of an Earthquake


An ‘earthquake’ is a tectonic movement that releases built-up energy when two earth
strata move relative to each other along a fault. In the geological structuring
of the earth, various distinct layers are formed that have fractures (faults) and relative
inclinations. In certain portions of the earth's crust there appears to be continuous strain
movement between earth strata. When the stresses developed by such strain exceed the
strength of the crustal material, a slip occurs between the two portions of the crust
and energy is released, causing an earthquake.
The types of waves produced on account of the release of energy depend somewhat
on the location of the origin (earthquake focus), but in all cases energy is transmitted
from the point of origin to other points by waves. From the focus, waves radiate in all
directions. At the epicenter (the point lying on the earth's surface vertically above the
focus) the shaking is most intense. Seismic waves are of three types: (i) P waves, (ii) S
waves, and (iii) L or surface waves.
Surface waves are responsible for most of the ground shaking and hence for most of the damage.

1.2 Occurrence of Earthquakes


The history of earthquakes shows that they occur in peculiar patterns year after year,
principally in three large zones of the earth. The world's greatest earthquake belt, the
circum-Pacific seismic belt, is found along the rim of the Pacific Ocean, where about 81
percent of the world's largest earthquakes occur. The belt extends from Chile, northward
along the South American coast through Central America, Mexico, the west coast of the
United States, and the southern part of Alaska, through the Aleutian Islands to Japan, the
Philippine Islands, New Guinea, the island groups of the southwest Pacific, and to New
Zealand. This earthquake belt was responsible for 70,000 deaths in Peru in May 1970.
This is a region of young, growing mountains and deep ocean trenches with invariably
parallel mountain chains. The second important belt, the Alpide, extends from Java to
Sumatra through the Himalayas, the Mediterranean, and out into the Atlantic. This belt
accounts for about 17 percent of the world's largest earthquakes, including some of
the most destructive, such as the Iran shock that took 11,000 lives in August 1968 and
the Turkey tremors in March 1970 and May 1971 that each killed over 1,000. All were
near magnitude 7 on the Richter scale. The third prominent belt follows the submerged
Mid-Atlantic Ridge.

2
1.3 Indian Geological Setting
The tectonic framework of northern India is dominated by two main features:
1. The stable continental craton of peninsular India; and
2. The collision zone where India and Asia converge along the Himalaya plate boundary zone.
It is an established fact that the Indian plate is moving north relative to Asia at a rate
of 20±3 mm/yr. The Asian plate is also moving northward, but at about half the rate of the
Indian plate. The difference in relative plate velocities produces an intercontinental
collision that is forming the Himalayan mountains. Most of this convergence is accommodated
in a zone of deformation 50 km wide along the southern edge of the Tibetan Plateau.
However, several millimeters per year of convergence are accommodated by distributed,
localized zones of deformation within the Indian plate, as demonstrated by the earthquakes
of Bhuj (Mw 7.7) in 2001, Anjar (M 6.0) in 1956, and Kachchh (M 7.5-8) in 1819.

1.4 Classification of Earthquake


In order to assess the damage due to an earthquake, the intensity or magnitude of the
earthquake that occurred and the corresponding damages are standardised by the BIS and
in various literatures, as given in the following paragraphs. The severity of shaking of
an earthquake, as felt or observed through damage, is described as Intensity at a certain
place on an arbitrary scale. The first such scale, devised by Rossi and Forel (1885), had
ten divisions; the later Mercalli scale (1904) has twelve divisions (modified by Neumann
in 1931). The table below shows the Modified Mercalli (MM) Intensity Scale, which presents
a qualitative description of the shaking experienced at a place. Intensity naturally
decreases with distance from the epicenter.

Modified Mercalli Intensity Scale (Abridged)

The intensity scale does not describe the size of an earthquake in absolute terms. For
this purpose, Richter suggested that the magnitude of an earthquake be standardized as
the logarithm to base 10 of the maximum amplitude A of the ground motion, recorded in
microns at a distance of 100 km from the epicenter on a Wood-Anderson type torsion
seismograph having damping equal to 80 percent of critical, a natural period of 0.8 second
and a magnification of 2800, i.e. M = log10(A). Different magnitude scales have been
developed by seismologists based on the amplitudes of the different waves propagated
during an earthquake. The various magnitude scales suggested are: the Richter scale, also
called the local magnitude scale (ML), the body wave magnitude scale (mb), the surface
wave magnitude scale (Ms) and the moment magnitude scale (Mw). Richter also established a
relationship between the strain energy E released by an earthquake and its magnitude M as
follows: log10(E) = 11.4 + 1.5M. The energy released in earthquakes of different
magnitudes is presented in the table below. A further table establishes the relation
between the Modified Mercalli Intensity scale and magnitudes on the Richter scale.
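As a quick worked illustration, the two formulas above can be evaluated directly. The sketch below is only illustrative: the constant 11.4 and the energy units are taken as quoted in the text, and the amplitude is assumed to be measured under the standard Wood-Anderson conditions described above.

import math

def richter_magnitude(amplitude_microns):
    # M = log10(A), with A the maximum trace amplitude in microns
    # recorded at 100 km on a Wood-Anderson torsion seismograph.
    return math.log10(amplitude_microns)

def released_energy(magnitude):
    # log10(E) = 11.4 + 1.5 * M, as quoted above.
    return 10 ** (11.4 + 1.5 * magnitude)

print(richter_magnitude(10000))                     # amplitude of 10,000 microns -> M = 4.0
print(released_energy(7.0) / released_energy(6.0))  # ~31.6: one magnitude unit ~ 31.6x energy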

Earthquakes can also be grouped based on the duration of shocks and the radius of the
affected region, as given in the table below. Hence, a good knowledge of seismicity is
essential for forecasting future earthquakes and assessing seismic hazard, especially for
developing countries like India, where destruction and deaths due to earthquakes are
several orders higher because of shoddy construction. The risk has increased several times
in developing countries due to increasing population, while it has decreased in developed
countries through the adoption of sound engineering practices. For the Himalayas, hazard
mitigation has become essential as several large dams and developmental activities are
coming up. For peninsular India it is also necessary, as devastating earthquakes have
occurred there and even small earthquakes cause panic due to the large population.

4
Chapter 2

LITERATURE SURVEY

2.1 Introduction
Even though occurrences of natural disasters are not avoidable, their after-effects on
human livelihood and nature can be minimized to a considerable extent if one knows about
them in advance. Forecasted disasters can be mitigated with due preparedness and planning
for disaster mitigation. Various types of natural disasters are forecast using different
technologies and methodologies. Disasters like droughts can be predicted months in
advance. Disasters like heavy rains, storms, floods and landslides can be predicted a few
days before they occur. The use of remote sensing from satellites, Geographic Information
Systems (GIS)/Global Positioning Systems (GPS) and advances in computer technology have
made near-accurate prediction of most natural disasters possible. But short- or
medium-term earthquake prediction has not yet become possible. What is possible is
forecasting, or long-term prediction, or assessment of earthquake potential based on
seismicity patterns. The current chapter discusses studies and research done on
disaster-related topics in detail, with more detail on earthquake-related concepts such as
earthquakes, prediction of earthquakes and damage assessment, as the present work is
focused on earthquake damage assessment.

2.2 Literature Analysis


Many scientists and researchers have worked on disasters and related areas to mitigate,
combat and/or minimise the effect of disasters. The research work done in the area of
disasters is critically reviewed in the following paragraphs. Plenty of work has been done
in developing models and methods to predict disasters other than earthquakes, so as to set
up early warning systems that reduce their devastating effects. Ashwagosha Ganju suggested
an integrated method combining a snow cover model with remote automatic weather stations
and observations for avalanche forecasting, which provides an objective basis for the
assessment of snow-cover stability and eventually the degree of avalanche danger at
regional and local levels. Statistical models have also been developed to forecast
avalanches.
Brun E. et al. formulated a numerical model to simulate snow cover for forecasting
avalanches. Rajesh Rai et al. used an artificial neural network approach to predict
landslides. Satellite remote sensing technology is a powerful tool for mapping floods and
also provides valuable forecasts of drought in affected regions. These works forecast
disasters and set up warning systems so that appropriate action can be taken before the
disaster actually strikes. Various tools are available to assess damages caused by
disasters other than earthquakes, to help disaster management (DM) authorities with
preparedness and action plans. Some of these research works are discussed below.
Parminder Singh Bhogal suggested depth-area-duration (DAD) curves for the likely extreme
rainfall and the discharges thereof, to mitigate the likely flood damages in a region. DAD
curves can also be used for delineating the area likely to be under flood based on
rainfall and catchment data. Pre- and post-damage assessments due to landslides, floods,
cyclones and droughts have been carried out in the past. Earthquakes, which are a
recurring disaster to which about 55 percent of the Indian continent is prone, do not give
any warning. Even though many scientists and researchers have put forth earthquake
forecasting tools, which are described in the later part of this chapter, the fact is that
none has proved to be a foolproof system that can be relied on completely. Three Greek
scientists, Prof. Varotsos, Alexopoulos and Nomikos, proposed the VAN method (the initials
of the scientists form the acronym designating the method) for predicting earthquakes. The
VAN method consists of continuously recording telluric currents using a network of
stations which cover a particular region. Chinese seismologists invented a seismoscope
that indicates the relative intensity and the direction of tremors. They succeeded in
making their admirable prediction of the 1975 Liaoning earthquake and could save hundreds
of thousands of people from death or injury. Yet the same seismoscope was unable to
predict the disaster that occurred the next year in the same region, destroying the city
of Tangshan, killing almost a million people and injuring a number that has never been
revealed. Some more earthquake forecasting theories are discussed later. Post-earthquake
damage assessment is carried out by various organisations such as the State Disaster
Management Centres of India, the Earthquake Engineering Research Institute (EERI),
California, and the Building Materials and Technology Promotion Council (BMTPC), New
Delhi.
These damage assessments are carried out after the earthquake, and post-disaster
management strategies are planned based on their reports. This takes a fairly long time
before any appropriate action can be administered to combat the disaster. BMTPC has
published the "Vulnerability Atlas of India", which gives state-wise hazard maps and
district-wise risk tables, but it does not quantify the damages likely to occur due to a
disaster. Very little or no work has been done to forecast the quantum of damages due to
an earthquake, which would help the administration deploy the requisite resources for
disaster management. M. Fischinger and P. Kante, and Agrawal S.K., Chourasia A. and
Parashar, elaborate on seismic vulnerability assessments of buildings. These vulnerability
assessments are restricted to the structure under study and, again, do not specify damages
for the region.

2.3 Earthquake Damages


Among natural calamities, earthquakes are the most destructive in terms of loss of life
and destruction of property. They often occur without any warning, which makes them the
most feared and unpredictable natural phenomena. On average, two earthquakes of magnitude
eight are reported to occur globally every year. Many destructive earthquakes have
occurred in India in the recent past, causing damages worth crores of rupees and claiming
many thousands of human lives. More than 650 earthquakes of magnitude greater than 5.0
have been reported in India since 1890. Within India, Gujarat has seen the largest number
of major earthquakes; hence the Gujarat (Bhuj) earthquake is chosen as a case study for
the model base. As seen from the literature listed above, research on the prediction of
disasters and the forecasting of damages has been done for disasters other than
earthquakes. The present work is an effort to develop a model to forecast an estimate of
the damages which may occur due to an earthquake; this will help the administration in
organising disaster management.

7
Chapter 3

ANALYSIS

Software requirements
• Python 3.6
• Jupyter notebook
• Anaconda

3.1 Environments Used


3.1.1 Jupyter notebook
The Jupyter notebook is an interactive computing environment that enables users to
author notebook documents that include:
- Live code
- Interactive widgets
- Plots
- Narrative text
- Equations
- Images
- Videos
These documents provide a complete and self-contained record of a computation that can
be converted to various formats and shared with others using e-mail, Dropbox, version
control systems (like git/GitHub) or nbviewer.jupyter.org.

Components
The Jupyter notebook combines three components:
- The notebook web application
- Kernels
- Notebook documents

• The Notebook Web Application:

An interactive web application for writing and running code interactively and authoring
notebook documents.

• Kernels:

Separate processes started by the notebook web application that run users' code in a given
language and return the output back to the notebook web application. The kernel also
handles things like computations for interactive widgets, tab completion and introspection.

• Notebook Documents:
Notebook documents contain the inputs and outputs of an interactive session as well as
narrative text that accompanies the code but is not meant for execution. Rich output
generated by running code, including HTML, images, video, and plots, is embedded in
the notebook, which makes it a complete and self-contained record of a computation.

When you run the notebook web application on your computer, notebook documents are just
files on your local filesystem with a ".ipynb" extension. This allows you to use familiar
workflows for organizing your notebooks into folders and sharing them with others.
A notebook consists of a linear sequence of cells. There are four basic cell types:
• Code cells: Input and output of live code that is run in the kernel
• Markdown cells: Narrative text with embedded LaTeX equations
• Heading cells: Six levels of hierarchical organization and formatting
• Raw cells: Unformatted text that is included, without modification, when notebooks are
converted to different formats using nbconvert

3.1.2 The NoteBook Dashboard


When you first start the notebook server, your browser will open to the notebook
dashboard. The dashboard serves as the homepage for the notebook. Its main purpose is to
display the notebooks and files in the current directory. For example, here is a
screenshot of the dashboard page for the examples directory in the Jupyter repository:

Figure 3.1: Jupyter Dashboard showing file tab.

The top of the notebook list displays clickable breadcrumbs of the current directory. By
clicking on these breadcrumbs or on sub-directories in the notebook list, you can navigate
your file system.
To create a new notebook, click on the "New" button at the top of the list and select a
kernel from the dropdown (as seen below). Which kernels are listed depends on what is
installed on the server. Some of the kernels in the screenshot below may not be available
to you. Notebooks and files can be uploaded to the current directory by dragging a
notebook file onto the notebook list or by using the "click here" text above the list.

Figure 3.2: Jupyter new menu.

3.1.3 Overview of the Notebook UI


If you create a new notebook or open an existing one, you will be taken to the notebook
user interface (UI). This UI allows you to run code and author notebook documents
interactively. The notebook UI has the following main areas:
o Menu
o Toolbar
o Notebook area and cells
The notebook has an interactive tour of these elements that can be started in the
“Help:User Interface Tour” menu item.

3.1.4 Running code


First and foremost, the Jupyter Notebook is an interactive environment for writing and
running code. The notebook is capable of running code in a wide range of languages.
However, each notebook is associated with a single kernel. This notebook is associated
with the IPython kernel and therefore runs Python code.

3.1.5 Anaconda
Anaconda is a package manager, an environment manager, a Python distribution, and a
collection of over 1000 open source packages. It is free and easy to install, and it
offers free community support. Additionally, Anaconda can create custom environments that
mix and match different Python versions (2.6, 2.7, 3.3 or 3.4) and other packages into
isolated environments, and it can easily switch between them using conda, its
multi-platform package manager for Python and other languages.

3.1.6 Packages installed with Anaconda


You may want to check out what packages are installed with Anaconda. Navigate to the
terminal or command line and type conda list to quickly display a list of all the packages
in your default Anaconda environment. Alternatively, the Continuum Analytics website
has a list of packages available in the latest release of the Anaconda installer.

3.1.7 Using Python in Anaconda


Many people write Python code using a text editor like Emacs or Vim. Others prefer to
use an IDE like Spyder, Wing IDE, PyCharm or Python Tools for Visual Studio. Spyder
is a great free IDE that is included with Anaconda. To start Spyder, type the name spyder
in a terminal or at the command prompt.
The Python 2.7 version of Anaconda also includes a graphical launcher application that
enables you to start IPython Notebook, IPython QtConsole and Spyder with a single click.
On Mac, double-click the Launcher.app found in your anaconda directory (or wherever you
installed Anaconda). On Windows, you'll find Launcher in your Start Menu. The Start Menu
also has an Anaconda Command Prompt that, regardless of system and install settings, will
launch the Python interpreter installed via Anaconda. This is particularly useful for
troubleshooting if you have multiple Python installations on your system.

11
Chapter 4

DATASETS

When we are building a mathematical model to predict the future, we must split the dataset
into a "Training Dataset" and a "Testing Dataset". For example, if we are building a
machine learning model, the model is going to learn the relationships in the data first.
The model "learns" the mathematical relationships in the data using the "Training Dataset".
In order to verify whether the model is valid, we have to test the model with data that
are different from the "Training Dataset". Therefore, we check the model using the
"Testing Dataset".

• The idea is like this:

If we have 1000 observations, then we train our model using 75 percent, or 750, of the
observations.
After the model is built, we check the model using the testing set, which is the remaining
25 percent, or 250 observations.
• The results, or accuracies, on the training set and the testing set should be similar.
• "Overfitting" might occur when the model learns too much from the training set and fails
to predict the testing set results.
• We are going to use the model selection utilities from scikit-learn.
• train_test_split can create the four variables xtrain, xtest, ytrain and ytest at once.
• It splits the data randomly. If you want the split to be reproducible, you can set the
random state.

Example: splitting a dataset into a training set and a testing set.

Say your data has 5 columns.
Columns 0 to 3 are the independent variables (X).
The last column (column 4) is the dependent variable (y).
This is how you create the training set and testing set:

# import the dataset
import pandas as pd
dataset = pd.read_csv('dataset.csv').values

# split into independent variables and dependent variable
X = dataset[:, 0:4]
y = dataset[:, 4]

# split into training set and testing set
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)

12
4.1 Train and test data
Training and test data are common for supervised learning algorithms. Given a dataset,
it is split into a training set and a test set. In machine learning, this applies to
supervised learning algorithms. In the real world we have all kinds of data, like
financial data or customer data. An algorithm should make new predictions based on new
data. You can simulate this by splitting the dataset into training and test data.

Figure 4.1: Data Evolution

As we work with datasets, a machine learning algorithm works in two stages. We usually
split the data around 20%-80% between the testing and training stages. Under supervised
learning, we split a dataset into training data and test data in Python ML.

Figure 4.2: data splitting

Prerequisites for Train and Test Data

We will need the following Python libraries: pandas and sklearn.

We can install these with pip:

1. pip install pandas
2. pip install sklearn
We use pandas to import the dataset and sklearn to perform the splitting. You can
import these packages as:
1. >>> import pandas as pd
2. >>> from sklearn.model_selection import train_test_split
3. >>> from sklearn.datasets import load_iris

How to split the train and test set in Python machine learning? The following is the
process of creating train and test sets in Python ML. So, let's take a dataset first.

Figure 4.3: Split Train and Test Set

Loading the Dataset

Let's load the forestfires dataset using pandas.
1. >>> data = pd.read_csv('forestfires.csv')
2. >>> data.head()

Figure 4.4: data.head()
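To complete the walkthrough, a minimal split of this dataset could look like the sketch below. This is only an illustrative sketch: it assumes the forestfires.csv file loaded above and treats the 'area' column as a hypothetical target; substitute the actual target column of your dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('forestfires.csv')

# 'area' is assumed here to be the target column; adjust for your dataset.
X = data.drop(columns=['area'])
y = data['area']

# Hold out 20% of the rows for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)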

14
Chapter 5

EDA

In statistics, exploratory data analysis (EDA) is an approach to analyzing datasets to
summarize their main characteristics, often with visual methods. A statistical model can
be used or not, but primarily EDA is for seeing what the data can tell us beyond the
formal modeling or hypothesis testing task. Exploratory data analysis was promoted
by John Tukey to encourage statisticians to explore the data, and possibly formulate
hypotheses that could lead to new data collection and experiments.
Technology Architecture:

Figure 5.1: Technology Architecture

5.1 Basemap
Basemap is a great tool for creating maps using Python in a simple way. It's a matplotlib
extension, so it has all of matplotlib's features for creating data visualizations, and it
adds the geographical projections and some datasets needed to plot coastlines, countries,
and so on, directly from the library.
Any map created with the Basemap library must start with the creation of a
Basemap instance:

15
mpl_toolkits.basemap.Basemap(llcrnrlon=None, llcrnrlat=None, urcrnrlon=None,
urcrnrlat=None, llcrnrx=None, llcrnry=None, urcrnrx=None, urcrnry=None, width=None,
height=None, projection=’cyl’, resolution=’c’, area_thresh=None, rsphere=6370997.0,
ellps=None, lat_ts=None, lat_1=None, lat_2=None, lat_0=None, lon_0=None, lon_1=None,
lon_2=None, o_lon_p=None, o_lat_p=None, k_0=None, no_rot=False, suppress_ticks=True,
satellite_height=35786000, boundinglat=None, fix_aspect=True, anchor=’C’, celestial=False,
round=False, epsg=None, ax=None)

The class constructor has many possible arguments, and all of them are optional:
• Resolution: The resolution of the included coastlines, lakes, and so on. The options
are c (crude, the default), l (low), i (intermediate), h (high) and f (full). None is a
good option if a shapefile will be used instead of the included files, since no data must
be loaded and performance rises a lot.
• Area_thresh: The threshold under which no coastline or lake will be drawn. Defaults are
10000, 1000, 100, 10 and 1 for resolutions c, l, i, h and f.
• Rsphere: Radius of the sphere to be used in the projections. Default is 6370997 meters.
If a sequence is given, the first two elements are taken as the radii of the ellipsoid.
• Ellps: An ellipsoid name, such as 'WGS84'. The allowed values are defined at
pyproj.pj_ellps.
• Suppress_ticks: Suppress automatic drawing of axis ticks and labels in map projection
coordinates.
• Fix_aspect: Fix the aspect ratio of the plot to match the aspect ratio of the map
projection region (default True).
• Anchor: The place in the plot where the map is anchored. Default is C, which means the
map is centered. Allowed values are C, SW, S, SE, E, NE, N, NW, and W.
• Celestial: Use the astronomical conventions for longitude (i.e. negative longitudes to
the east of 0). Default False. Implies resolution=None.
• Ax: Set the default axes instance.

Passing the bounding box

The following arguments are used to set the extent of the map. To see some examples
and explanations about setting the bounding box, take a look at the Extension section.
The first way to set the extent is by defining the map bounding box in geographical
coordinates (fig 5.2).

Another option is setting the bounding box, but using the projected units (fig 5.3).

Finally, the last option is to set the bounding box by giving the center point in
geographical coordinates, and the width and height of the domain in the projection units
(fig 5.4).

16
Figure 5.2: Geographical Coordinates

Figure 5.3: Projection units

Drawing the first map

Let's create the simplest map (fig 5.5):

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

map = Basemap()
map.drawcoastlines()
plt.savefig('test.png')  # save before show, so the saved figure is not blank
plt.show()

1. The first two lines import the Basemap library and matplotlib. Both are necessary.
2. The map is created using the Basemap class, which has many options. Without passing
any option, the map has the Plate Carrée projection centered at longitude and latitude = 0.
3. After setting up the map, we can draw what we want. In this case, the coastlines layer,
which comes with the library, is drawn using the method drawcoastlines().
4. Finally, the map has to be shown or saved. The methods from matplotlib are used. In
this example, plt.show() opens a window to explore the result; plt.savefig('file_name')
saves the map into an image.

Figure 5.4: Setting the Bounding Box

Figure 5.5: Simple Projection

Projection:
The projection argument sets the map projection to be used (fig 5.6):

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

map = Basemap(projection='cyl')
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral', lake_color='aqua')
map.drawcoastlines()
plt.show()

The default value is cyl, or the Cylindrical Equidistant projection (fig 5.7), also known
as the Equirectangular projection or Plate Carrée.
Many projections require extra arguments:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

map = Basemap(projection='aeqd', lon_0=10, lat_0=50)
map.drawmapboundary(fill_color='aqua')
map.fillcontinents(color='coral', lake_color='aqua')
map.drawcoastlines()
plt.show()

Figure 5.6: Mercator Projection

Figure 5.7: Cylindrical Equidistant projection

19
Chapter 6

ALGORITHMS AND FUNCTIONS


USED

A machine is said to be learning from past experience (data fed in) with respect to
some class of tasks if its performance on a given task improves with that experience.

Figure 6.1: Types of Learning

6.1 Supervised Learning


Supervised learning is when the model is trained on a labelled dataset. A labelled
dataset is one which has both input and output parameters. In this type of learning,
both the training and validation datasets are labelled, as shown in the figures below (fig 6.2).

• Figure A: It is a dataset of a shopping store which is useful in predicting whether a
customer will purchase a particular product under consideration or not, based on his/her
gender, age and salary.
Input: Gender, Age, Salary
Output: Purchased, i.e. 0 or 1; 1 means the customer will purchase and 0 means the
customer won't purchase it.

• Figure B: It is a meteorological dataset which serves the purpose of predicting wind
speed based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed

20
Figure 6.2: Labelled data set

Training the system


While training the model, data is usually split in the ratio of 80:20, i.e. 80% as
training data and the rest as testing data. In the training data, we feed the input as
well as the output for that 80% of the data. The model learns from the training data only.
We use different machine learning algorithms (discussed in detail in the following
sections) to build our model. By learning, we mean that the model builds some logic of its
own. Once the model is ready, it can be tested. At the time of testing, input is fed from
the remaining 20% of the data, which the model has never seen before; the model predicts
some value and we compare it with the actual output to calculate the accuracy.

Figure 6.3: Supervised Learning

21
6.2 Types of Supervised Learning
6.2.1 Classification
It is a supervised learning task where the output has defined labels (discrete values).
For example, in Fig 6.2 (Figure A) above, the output Purchased has defined labels, i.e. 0 or 1:
1 means the customer will purchase and
0 means the customer won't purchase.

The goal here is to predict discrete values belonging to a particular class and evaluate
them on the basis of accuracy. Classification can be either binary or multi-class. In
binary classification, the model predicts either 0 or 1 (yes or no), but in the case of
multi-class classification, the model predicts more than one class.
Example: Gmail classifies mail into more than one class, like Social, Promotions,
Updates and Forums.
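As a small illustration of binary classification, the sketch below trains a logistic regression classifier on a tiny made-up dataset in the spirit of Figure A; the values, encodings and the choice of classifier are assumptions for demonstration only.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A tiny, made-up dataset (Gender, Age, Salary -> Purchased).
df = pd.DataFrame({
    'Gender': [0, 1, 0, 1, 1, 0],   # 0 = female, 1 = male (arbitrary encoding)
    'Age': [22, 35, 47, 52, 29, 41],
    'Salary': [18000, 42000, 65000, 80000, 30000, 52000],
    'Purchased': [0, 0, 1, 1, 0, 1],
})

X = df[['Gender', 'Age', 'Salary']]
y = df['Purchased']

# Scale the features, then fit a binary classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Predicted class (0 or 1) for a new customer.
print(clf.predict(pd.DataFrame([[1, 45, 70000]], columns=['Gender', 'Age', 'Salary'])))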

6.2.2 Regression
It is a supervised learning task where the output has a continuous value. For example, in
Figure B above, the output Wind Speed does not have any discrete value but is continuous
within a particular range. The goal here is to predict a value as close to the actual
output value as our model can, and evaluation is then done by calculating the error value.
The smaller the error, the greater the accuracy of our regression model.

Example of Supervised Learning Algorithms:


• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest

6.3 Types of Supervised Machine Learning Algorithms


6.3.1 Regression
Regression techniques predict a single output value using training data.

Example: You can use regression to predict the house price from training data. The
input variables will be locality, size of the house, etc.
Strengths: Outputs always have a probabilistic interpretation, and the algorithm can be
regularized to avoid overfitting.
Weaknesses: Logistic regression may underperform when there are multiple or non-linear
decision boundaries. This method is not flexible, so it does not capture more complex
relationships.

22
6.3.2 Logistic Regression
Logistic regression is a method used to estimate discrete values based on a given set of
independent variables. It helps you predict the probability of occurrence of an event by
fitting data to a logit function, which is why it is known as logistic regression. As it
predicts a probability, its output value lies between 0 and 1. Here are a few types of
regression algorithms.

6.3.3 Classification
Classification means to group the output into a class. If the algorithm tries to label
input into two distinct classes, it is called binary classification. Selecting between
more than two classes is referred to as multiclass classification.
Example: Determining whether or not someone will be a defaulter on a loan.
Strengths: Classification trees perform very well in practice.
Weaknesses: Unconstrained, individual trees are prone to overfitting.

Here are a few types of classification algorithms.

6.3.4 Naive Bayes Classifiers


The Naïve Bayesian model (NBN) is easy to build and very useful for large datasets. This
method is composed of directed acyclic graphs with one parent and several children. It
assumes independence among child nodes separated from their parent.

6.3.5 Decision Trees


Decision trees classify instances by sorting them based on feature values. In this
method, each node represents a feature of the instance to be classified, and every branch
represents a value which the node can assume. It is a widely used technique for
classification. In this method, the classifier is a tree, which is known as a decision
tree. Decision trees can also be used to estimate real values (cost of purchasing a car,
number of calls, total monthly sales, etc.).
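As a small, hedged illustration of this idea (nodes test feature values, branches correspond to ranges of those values), the sketch below fits a shallow decision tree to the standard iris dataset and prints its structure; the dataset and depth are chosen purely for demonstration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree: each node tests one feature value, each branch a range of it.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(export_text(tree))    # the learned tree is easy to inspect
print(tree.predict(X[:3]))  # predicted classes for the first three samples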

6.3.6 Support Vector Machine


The support vector machine (SVM) is a type of learning algorithm developed in the 1990s.
The method is based on results from statistical learning theory introduced by Vapnik.
SVMs are also closely connected to kernel functions, which are a central concept for most
learning tasks. The kernel framework and SVMs are used in a variety of fields, including
multimedia information retrieval, bioinformatics, and pattern recognition.
The most widely used learning algorithms in supervised learning are:
• Support Vector Machines
• linear regression
• logistic regression
• naive Bayes
• linear discriminant analysis
• decision trees

• k-nearest neighbor algorithm
• Neural Networks (Multilayer perceptron)
• Similarity learning

6.4 Random forest


Random forest is a type of supervised machine learning algorithm based on ensemble
learning. Ensemble learning is a type of learning where you join different types of
algorithms, or the same algorithm multiple times, to form a more powerful prediction
model. The random forest algorithm combines multiple algorithms of the same type, i.e.
multiple decision trees, resulting in a forest of trees, hence the name "Random Forest".
The random forest algorithm can be used for both regression and classification tasks.

How the Random Forest Algorithm Works


The following are the basic steps involved in performing the random forest algorithm (a
minimal code sketch follows the list):

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. In the case of a regression problem, for a new record, each tree in the forest predicts
a value for Y (the output). The final value can be calculated by taking the average of all
the values predicted by all the trees in the forest. In the case of a classification
problem, each tree in the forest predicts the category to which the new record belongs,
and the new record is finally assigned to the category that wins the majority vote.
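A minimal sketch of these steps using scikit-learn's RandomForestRegressor is shown below. The synthetic data is only a stand-in for the project's earthquake dataset; n_estimators plays the role of the number of trees chosen in step 3.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic features/target standing in for the real earthquake data.
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 3))
y = 4.0 + 2.0 * X[:, 0] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 trees, each built on a bootstrap sample of the training records (steps 1-3).
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For regression, the forest's prediction is the average over all trees (step 4).
print(model.predict(X_test[:5]))
print(model.score(X_test, y_test))  # R^2 on the held-out test set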

6.4.1 Decision tree learning


Decision trees are a popular method for various machine learning tasks. Tree learning
"come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for
data mining", say Hastie et al., "because it is invariant under scaling and various other
transformations of feature values, is robust to inclusion of irrelevant features, and
produces inspectable models. However, they are seldom accurate".
In particular, trees that are grown very deep tend to learn highly irregular patterns:
they overfit their training sets, i.e. they have low bias, but very high variance. Random
forests are a way of averaging multiple deep decision trees, trained on different parts of
the same training set, with the goal of reducing the variance. This comes at the expense
of a small increase in the bias and some loss of interpretability, but generally greatly
boosts the performance of the final model.

6.4.2 Bagging
The training algorithm for random forests applies the general technique of bootstrap
aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with
responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with
replacement of the training set and fits trees to these samples:
For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a classification or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging the predictions
from all the individual regression trees on x', or by taking the majority vote in the case
of classification trees. This bootstrapping procedure leads to better model performance
because it decreases the variance of the model without increasing the bias. This means
that while the predictions of a single tree are highly sensitive to noise in its training
set, the average of many trees is not, as long as the trees are not correlated. Simply
training many trees on a single training set would give strongly correlated trees (or even
the same tree many times, if the training algorithm is deterministic); bootstrap sampling
is a way of de-correlating the trees by showing them different training sets.
Additionally, an estimate of the uncertainty of the prediction can be made as the standard
deviation of the predictions from all the individual regression trees on x'.
The number of samples/trees, B, is a free parameter. Typically, a few hundred to several
thousand trees are used, depending on the size and nature of the training set. An optimal
number of trees B can be found using cross-validation, or by observing the out-of-bag
error: the mean prediction error on each training sample xi, using only the trees that did
not have xi in their bootstrap sample. The training and test error tend to level off after
some number of trees have been fit.
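The bagging loop above can be written out directly. The following is a rough sketch of the idea (sample with replacement, fit one tree per sample, average the predictions), not the internal implementation of any particular library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

B = 50  # number of bootstrap samples / trees
trees = []
for b in range(B):
    # Step 1: sample n training examples with replacement.
    idx = rng.randint(0, len(X), size=len(X))
    # Step 2: fit a regression tree f_b on the bootstrap sample (Xb, Yb).
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Predictions for unseen points are averaged over all B trees.
x_new = np.array([[0.5], [1.5]])
print(np.mean([t.predict(x_new) for t in trees], axis=0))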

6.4.3 Bagging to Random Forests


The above procedure describes the original bagging algorithm for trees. Random forests
differ in only one way from this general scheme: they use a modified tree learning
algorithm that selects, at each candidate split in the learning process, a random subset
of the features. This process is sometimes called "feature bagging". The reason for doing
this is the correlation of the trees in an ordinary bootstrap sample: if one or a few
features are very strong predictors for the response variable (target output), these
features will be selected in many of the B trees, causing them to become correlated. An
analysis of how bagging and random subspace projection contribute to accuracy gains under
different conditions is given by Ho. Typically, for a classification problem with p
features, √p (rounded down) features are used in each split. For regression problems the
inventors recommend p/3 (rounded down), with a minimum node size of 5, as the default. In
practice the best values for these parameters will depend on the problem, and they should
be treated as tuning parameters.

6.4.4 Extremely Randomized Trees


Adding one further step of randomization yields extremely randomized trees, or
ExtraTrees. While similar to ordinary random forests in that they are an ensemble of
individual trees, there are two main differences: first, each tree is trained using the
whole learning sample (rather than a bootstrap sample), and second, the top-down splitting
in the tree learner is randomized. Instead of computing the locally optimal cut-point for
each feature under consideration (based on, e.g., information gain or the Gini impurity),
a random cut-point is selected. This value is selected from a uniform distribution within
the feature's empirical range (in the tree's training set). Then, of all the randomly
generated splits, the split that yields the highest score is chosen to split the node.
Similar to ordinary random forests, the number of randomly selected features to be
considered at each node can be specified. Default values for this parameter are √p for
classification and p for regression, where p is the number of features in the model.
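In scikit-learn this variant is available as ExtraTreesRegressor/ExtraTreesClassifier, with the same interface as the random forest classes. The comparison below on synthetic data is only an illustrative sketch, not an endorsement of either model for the earthquake data.

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data for a side-by-side comparison.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

for Model in (RandomForestRegressor, ExtraTreesRegressor):
    scores = cross_val_score(Model(n_estimators=100, random_state=0), X, y, cv=5)
    print(Model.__name__, scores.mean())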

6.4.5 Advantages of using Random Forest


As with any algorithm, there are advantages and disadvantages to using it.
The random forest algorithm is not biased, since there are multiple trees and each tree is
trained on a subset of the data. Basically, the random forest algorithm relies on the
power of "the crowd"; therefore the overall bias of the algorithm is reduced.

1. This algorithm is very stable. Even if a new data point is introduced in the dataset,
the overall algorithm is not affected much, since new data may impact one tree, but it is
very hard for it to impact all the trees.
2. The random forest algorithm works well when you have both categorical and numerical
features.
3. The random forest algorithm also works well when data has missing values or has not
been scaled well (although feature scaling is performed in this work just for the purpose
of demonstration).

6.4.6 Disadvantages of using Random Forest


1. A major disadvantage of random forests lies in their complexity. They require much
more computational resources, owing to the large number of decision trees joined together.
2. Due to their complexity, they require much more time to train than other comparable
algorithms.

6.5 Grid Search


This technique is used to find the optimal parameters to use with an algorithm. These are
NOT the weights or the model; those are learned using the data. To avoid confusion, we
distinguish between the two kinds of parameters by calling the former hyper-parameters.
Hyper-parameters are like the k in k-nearest neighbors (k-NN). k-NN requires the user to
select how many neighbors to consider when calculating the distance. The algorithm then
tunes a parameter, a threshold, to see whether a novel example falls within the learned
distribution; this is done with the data.

Choosing k:
Some people simply go with recommendations based on past studies of the data type.
Others use grid search. This method is best able to determine which k is optimal for
your data.

26
Working:
First, a grid is built: a set of possible values the hyper-parameter can take, for example
[1, 2, 3, ..., 10] for k. A k-NN model is then trained for each value in the grid: first 1-NN,
then 2-NN, and so on. Each run produces a performance score that indicates how well
the algorithm performed with that hyper-parameter value, and once the entire grid has
been evaluated, the value that gave the best performance is selected. Doing this tuning
on the test data would violate the principle of keeping test data untouched, which is why
grid search is usually combined with cross-validation, so that the test data stays completely
separate until the final evaluation. In n-fold cross-validation the training set is divided
into n parts; the model is trained on n−1 folds and evaluated on the fold that was left
out. For each value in the grid the algorithm is therefore retrained n times, once per
left-out fold, and the performance over the n folds is averaged to give the score for that
hyper-parameter value. The selected hyper-parameter value is the one with the highest
average performance across the folds. Only once the algorithm is finalized is it evaluated
on the testing set; going straight to the testing set risks overfitting to it.
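
The sketch below illustrates this combination of grid search and cross-validation for the k-NN example above, using scikit-learn's GridSearchCV on a small synthetic dataset (the dataset and parameter values are assumptions for illustration only):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data standing in for the real training set
    X_train, y_train = make_classification(n_samples=300, n_features=5, random_state=42)

    # The grid of candidate k values; each is evaluated with 5-fold cross-validation
    param_grid = {"n_neighbors": list(range(1, 11))}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)

    print(search.best_params_)  # the k with the highest average score across folds
    print(search.best_score_)   # that average cross-validated score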

6.6 Packages and Libraries Used


6.6.1 NumPy
NumPy is a Python package whose name stands for ‘Numerical Python’. It is the core
library for scientific computing: it contains a powerful n-dimensional array object and
provides tools for integrating C, C++ and other languages. It is also useful for linear
algebra, random number generation and more, and a NumPy array can be used as an
efficient multi-dimensional container for generic data.

NumPy Array: A NumPy array is a powerful N-dimensional array object organized in
rows and columns. NumPy arrays can be initialized from nested Python lists, and their
elements can be accessed by index.
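
A minimal sketch of creating and indexing a NumPy array:

    import numpy as np

    # A 2-D array initialized from nested Python lists
    a = np.array([[1, 2, 3], [4, 5, 6]])
    print(a.shape)   # (2, 3): two rows, three columns
    print(a[1, 2])   # element in row 1, column 2 -> 6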

6.6.2 PANDAS
Pandas is an open-source library that allows you to perform data manipulation in Python.
The Pandas library is built on top of NumPy, meaning Pandas needs NumPy to operate.
Pandas provides an easy way to create, manipulate and wrangle data, and it is also an
elegant solution for time series data.

Why use Pandas?


Pandas has following advantages:

• Easily handles missing data


• It uses the Series structure for one-dimensional data and the DataFrame structure
for two-dimensional, tabular data
• It provides an efficient way to slice the data
• It provides a flexible way to merge, concatenate or reshape the data

• It includes powerful tools for working with time series

In a nutshell, Pandas is a useful library in data analysis. It can be used to perform data
manipulation and analysis. Pandas provide powerful and easy-to-use data structures, as
well as the means to quickly perform operations on these structures.
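
A minimal sketch of these features (the column names are hypothetical, chosen only to echo the earthquake data):

    import numpy as np
    import pandas as pd

    # A small DataFrame with a missing magnitude value
    df = pd.DataFrame({"mag": [4.5, np.nan, 5.1], "depth": [10.0, 33.0, 15.0]})

    print(df["mag"])            # a one-dimensional Series
    print(df.fillna(0))         # missing data handled easily
    print(df[df["mag"] > 4.6])  # slicing rows by a condition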

6.6.3 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib
is a multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it gives us visual access to huge
amounts of data in an easily digestible form. Matplotlib supports several plot types, such
as line, bar, scatter and histogram plots.
matplotlib.pyplot is a collection of command style functions that make mat-
plotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it
keeps track of things like the current figure and plotting area, and the plotting functions
are directed to the current axes (please note that "axes" here and in most places in the
documentation refers to the axes part of a figure and not the strict mathematical term
for more than one axis).
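
A minimal pyplot sketch (the data plotted here is arbitrary and only illustrates the state-machine interface described above):

    import matplotlib.pyplot as plt

    # pyplot keeps track of the current figure and axes between calls
    x = [1, 2, 3, 4]
    y = [1, 4, 9, 16]
    plt.figure()
    plt.plot(x, y, "o-")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title("A minimal pyplot example")
    plt.show()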

6.6.4 read_xl
The readxl package makes it easy to get data out of Excel and into R. Compared to
many of the existing packages (e.g. gdata, xlsx, xlsReadWrite) readxl has no external
dependencies, so it’s easy to install and use on all operating systems. It is designed to
work with tabular data.
readxl supports both the legacy .xls format and the modern xml-based .xlsx
format. The libxls C library is used to support .xls, which abstracts away many of the
complexities of the underlying binary format. To parse .xlsx, we use the RapidXML C++
library.
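
Note that the Python implementation in Chapter 7 reads the Excel dataset with pandas rather than readxl; a minimal sketch of that call (assuming Book4.xlsx is present in the working directory):

    import pandas as pd

    # Read the earthquake dataset from the Excel workbook used in Chapter 7
    data = pd.read_excel("Book4.xlsx")
    print(data.head())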

6.6.5 Train_test_split
sklearn.model_selection.train_test_split(*arrays, **options) splits arrays or matrices into
random train and test subsets. It is a quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and applies the result to the input data in a single call,
splitting (and optionally subsampling) the data in one line.

Parameters:

• arrays – sequence of indexables with the same length / shape[0]. Allowed inputs are
lists, NumPy arrays, SciPy sparse matrices or pandas DataFrames.
• test_size – float, int or None, optional (default=None). If float, it should be between
0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If
int, it represents the absolute number of test samples. If None, the value is set to the
complement of the train size; if train_size is also None, it is set to 0.25.
• train_size – float, int or None (default=None). If float, it should be between 0.0 and
1.0 and represents the proportion of the dataset to include in the train split. If int, it
represents the absolute number of train samples. If None, the value is automatically set
to the complement of the test size.
• random_state – int, RandomState instance or None, optional (default=None). If int,
random_state is the seed used by the random number generator; if a RandomState
instance, it is used as the random number generator; if None, the generator is the
RandomState instance used by np.random.
• shuffle – boolean, optional (default=True). Whether or not to shuffle the data before
splitting. If shuffle=False then stratify must be None.
• stratify – array-like or None (default=None). If not None, the data is split in a
stratified fashion, using this as the class labels.

Returns:
• splitting – list of length 2 * len(arrays) containing the train-test split of the inputs.
New in version 0.16: if the input is sparse, the output will be a scipy.sparse.csr_matrix;
otherwise, the output type is the same as the input type.
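
A minimal usage sketch (the arrays here are arbitrary placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
    y = np.arange(10)

    # Hold out 20% of the samples for testing, with a fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)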

6.6.6 nan_to_num
The numpy.nan_to_num() function is used when we want to replace nan (Not a Number)
with zero and inf with finite numbers in an array. It replaces positive infinity with a very
large number and negative infinity with a very large negative number.
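
A minimal sketch:

    import numpy as np

    a = np.array([np.nan, np.inf, -np.inf, 2.5])
    # nan -> 0.0, inf -> a very large positive number, -inf -> a very large negative number
    print(np.nan_to_num(a))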

6.6.7 Workbook()
The Workbook class is the main class exposed by the XlsxWriter module and it is the
only class that you will need to instantiate directly. The Workbook class represents the
entire spreadsheet as you see it in Excel and internally it represents the Excel file as it is
written on disk.
Constructor:
Workbook(filename[, options]) creates a new XlsxWriter Workbook object.
Parameters:

• filename (string) – the name of the new Excel file to create.

• options (dict) – optional workbook parameters.

Return type:
• A Workbook object. The Workbook() constructor is used to create a new Excel
workbook with the given filename.
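
A minimal sketch of the XlsxWriter Workbook class (assuming the xlsxwriter package is installed; the filename is an arbitrary example). The implementation in Chapter 7 creates its results file with openpyxl's Workbook class, which plays the same role.

    import xlsxwriter

    # Create a workbook, add a worksheet, write one cell and save the file to disk
    workbook = xlsxwriter.Workbook("demo.xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.write("A1", "Hello")
    workbook.close()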

6.6.8 Mktime()
The Python time method mktime() is the inverse of localtime(). Its argument is a
struct_time or full 9-tuple and it returns a floating point number, for compatibility with
time(). If the input value cannot be represented as a valid time, either OverflowError or
ValueError is raised.

Syntax: time.mktime(t)
Parameters:
• t − This is the struct_time or full 9-tuple.
Return Value:
• This method returns a floating point number, for compatibility with time().

6.6.9 Timetuple()
The timetuple() method of datetime.date instances returns an object of type time.struct_time.
The struct_time is a named tuple object (A named tuple object has attributes that can
be accessed by an index or by name).
The struct_time object has attributes for representing both date and time
fields along with a flag to indicate whether Daylight Saving Time is active.
The named tuple returned by the timetuple() function will have its year, month
and day fields set as per the date object and fields corresponding to the hour, minutes,
seconds will be set to zero.

Strptime()
The Python time method strptime() parses a string representing a time according to a
format. The return value is a struct_time as returned by gmtime() or localtime(). The
format parameter uses the same directives as those used by strftime(); it defaults to
"%a %b %d %H:%M:%S %Y", which matches the formatting returned by ctime().
If the string cannot be parsed according to the format, or if it has excess data after
parsing, ValueError is raised.
Syntax: Following is the syntax for the strptime() method −
time.strptime(string[, format])
Parameters:
• string − This is the time in string format which would be parsed based on
the given format.
• format − This is the directive which would be used to parse the given string.

The following directives can be embedded in the format string − Directive


• %a - abbreviated weekday name
• %A - full weekday name
• %b - abbreviated month name
• %B - full month name
• %c - preferred date and time representation
• %C - century number (the year divided by 100, range 00 to 99)
• %d - day of the month (01 to 31)
• %D - same as %m/%d/%y
• %e - day of the month (1 to 31)
• %g - like %G, but without the century
• %G - 4-digit year corresponding to the ISO week number (see %V).
• %h - same as %b

• %H - hour, using a 24-hour clock (00 to 23)
• %I - hour, using a 12-hour clock (01 to 12)
• %j - day of the year (001 to 366)
• %m - month (01 to 12)
• %M - minute
• %n - newline character
• %p - either am or pm according to the given time value
• %r - time in a.m. and p.m. notation
• %R - time in 24 hour notation
• %S - second
• %t - tab character
• %T - current time, equal to %H:%M:%S
• %u - weekday as a number (1 to 7), Monday=1. Warning: In Sun Solaris Sunday=1
• %U - week number of the current year, starting with the first Sunday as the first day
of the first week
• %V - The ISO 8601 week number of the current year (01 to 53), where week 1 is the
first week that has at least 4 days in the current year, and with Monday as the first day
of the week
• %W - week number of the current year, starting with the first Monday as the first day
of the first week
• %w - day of the week as a decimal, Sunday=0
• %x - preferred date representation without the time
• %X - preferred time representation without the date
• %y - year without a century (range 00 to 99)
• %Y - year including the century
• %Z or %z - time zone name or abbreviation
• %% - a literal % character
Return Value:
This return value is struct_time as returned by gmtime() or localtime().
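
A minimal sketch combining strptime(), timetuple() and mktime() in the way the implementation code uses them (the date string is an arbitrary example):

    import datetime
    import time

    # Parse a date string, convert it to a struct_time, then to a Unix timestamp
    date_string = "01/01/2000"
    parsed = datetime.datetime.strptime(date_string, "%d/%m/%Y")
    ts = time.mktime(parsed.timetuple())
    print(ts)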

31
Chapter 7

IMPLEMENTATION CODE

CODE :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# Read the earthquake dataset from the Excel workbook
data = pd.read_excel("Book4.xlsx")
data.tail()

# Keep only the attributes that are relevant to the prediction task
data = data[['time', 'date', 'timestamp', 'latitude', 'longitude', 'place', 'depth', 'mag']]
data.tail()

# Drop the raw date and time columns; the numeric timestamp is used as a feature
final_data = data.drop(['date', 'time'], axis=1)
# final_data = final_data[final_data.time != 'ValueError']
final_data.head()

# Plot all affected areas on a map
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='merc', llcrnrlat=8., urcrnrlat=37., llcrnrlon=68.,
            urcrnrlon=97., lat_0=54.5, lon_0=-4.36, resolution='c')
longitudes = data["longitude"].tolist()
latitudes = data["latitude"].tolist()
x, y = m(longitudes, latitudes)
fig = plt.figure(figsize=(12, 10))
plt.title("All affected areas")
m.plot(x, y, "o", markersize=2, color='red')
m.drawcoastlines()
m.fillcontinents(color='skyblue', lake_color='aqua')
m.drawmapboundary()
m.drawcountries()
plt.show()

data.loc[0:]

# Build the feature matrix X and the targets y, replacing NaN values with zero
from sklearn.model_selection import train_test_split
X = final_data[['timestamp', 'latitude', 'longitude']]
y = final_data[['mag', 'depth']].astype('float32')
X[:] = np.nan_to_num(X)
y[:] = np.nan_to_num(y)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Train a random forest regressor and predict magnitude and depth on the test set
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
predicted = y_pred.tolist()

# Create an Excel file to hold the results
import openpyxl
wb = openpyxl.Workbook()
sheet_names = wb.sheetnames
wb.save(filename='results.xlsx')

resdf = pd.read_excel('results.xlsx')
resdf['Timestamp'] = X_test['timestamp'].tolist()
resdf['Longitude'] = X_test['longitude'].tolist()
resdf['Latitude'] = X_test['latitude'].tolist()

# Look up the place name of each test instance by matching its coordinates
places = []
for instance in resdf.itertuples():
    for row in data.itertuples():
        if instance.Longitude == row.longitude and instance.Latitude == row.latitude:
            places.append(row.place)
del places[-1]  # drop the extra matched entry so the list aligns with the test rows
resdf['Place'] = places
resdf['y_Predicted(mag,depth)'] = predicted
resdf.head(100)

reg.score(X_test, y_test)

# Tune the number of trees with grid search
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators': [10, 20, 50, 100, 200, 500]}
grid_obj = GridSearchCV(reg, parameters)
grid_fit = grid_obj.fit(X_train, y_train)
best_fit = grid_fit.best_estimator_
t = best_fit.predict(X_test)
print(t)
best_fit.score(X_test, y_test)

# Predict magnitude and depth for a user-supplied date and location
import time
import datetime
date = input('please enter date in dd/mm/yyyy format: ')
ts = time.mktime(datetime.datetime.strptime(date, "%d/%m/%Y").timetuple())
print(ts)

reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)
latitude = input('lat: ')
longitude = input('long: ')
X_test1 = [ts, float(latitude), float(longitude)]
predicted = reg.predict([X_test1])
for row in data.itertuples():
    if float(longitude) == row.longitude and float(latitude) == row.latitude:
        print("Place: " + row.place)
print(predicted)

34
Chapter 8

OUTPUT

Figure 8.1: Read the DataSet

Figure 8.2: Attribute selection

Figure 8.3: Plotting

Figure 8.4: Total DataSet

Figure 8.5: Split_train_test

Figure 8.6: Predicted TestData

Figure 8.7: Cross Validation of DataSet

Figure 8.8: Conversion of TimeStamp

Figure 8.9: Predicted Output
CONCLUSION

Earthquakes are hard to understand and dangerous to live through. Many people may
never have experienced an earthquake, and may never experience one; even so, everyone
should be prepared and know how to deal with one. Predictions can be made, but they
must be supported by evidence, and forecasting of earthquakes relies on knowledge of
past earthquakes on a specific fault. It can therefore be observed that, by using the
algorithmic model presented here for earthquake prediction, proper methods can be
implemented for issuing warnings and preparing for earthquakes. The proposed model
efficiently performs data analysis using machine learning and can be used to gain insights
related to earthquakes.

44
REFERENCES

[1] Adeli H, Panakkat A. A probabilistic neural network for earthquake magnitude
prediction. Neural Networks. 2009;22(7):1018–24. doi:10.1016/j.neunet.2009.05.003
[2] Panakkat A, Adeli H. Neural network models for earthquake magnitude prediction
using multiple seismicity indicators. International Journal of Neural Systems.
2007;17(01):13–33. doi:10.1142/S0129065707000890
[3] https://www.researchgate.net/publication/307951466_Earthquake_magnitude_prediction_in_Hindukush_region_using_machine_learning_techniques
[4] https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861
[5] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

45
