Geospatial-Temporal Analysis and Classification of Criminal Data in Manila
Maria Jeseca C. Baculo, Charlie S. Marzan, Remedios de Dios Bulos, Conrado Ruiz
De La Salle University
Manila City, Philippines
e-mail: maria_jeseca_baculo@dlsu.edu.ph, charlie_marzan@dlsu.edu.ph, remedios.bulos@dlsu.edu.ph, cons.ruizjr@delasalle.ph
V. METHODOLOGY
Manila is one of the cities of Metro Manila, the capital region of the Philippines. It is composed of sixteen districts: Binondo, Ermita, Intramuros, Malate, Paco, Pandacan, Port Area, Quiapo, Sampaloc, San Andres, San Miguel, San Nicolas, Santa Ana, Santa Cruz, Santa Mesa, and Tondo. It is the most densely populated city, with 71,263 persons per square kilometer in 2015, and ranked second to Quezon City in the number of crimes from 2010 to 2015. With around 60,000 establishments operating in the city, it is one of the major centers of commerce, ranking third in the 2016 Cities and Municipalities Competitiveness Index of the Philippines. Fig. 1 displays the map of Manila.

Figure 1. Study area
A. Data Preprocessing
Data preprocessing is imperative in data mining. This process cleans and transforms the raw crime data gathered from the MPD into a representation appropriate for crime analysis. The following data preprocessing techniques were performed on the data set.
1) Data cleaning
Data cleaning was performed on the instances in the data set that had missing or inconsistent values for crime location. Missing values were filled in manually, and some instances were ignored because they belonged to cities outside Manila. During geocoding of the addresses, some locations produced noise because the coordinates returned by the geocoding function lay outside the study area. Some districts in Manila were split and some were merged between 2012 and 2016. In these cases, the old street and district names were changed to the new ones. All key
The resulting coordinates were then divided into two attributes named Longitude and Latitude. These were used to plot the crime incidents as points on the map. The coordinates also played an important role in identifying inconsistencies in the Location attribute. The WGS 1984 Web Mercator coordinate system in ArcMap 10 was used to plot the coordinates of these crimes in Manila.
3) Data reduction
Extracting patterns from multiple attributes becomes challenging when the attributes take a large number of distinct values. Data reduction helps remove irrelevant attributes by selecting the attributes and values that best represent the data to be mined.
Originally, the incident reports provided by the MPD had nine attributes, which included the nature of the case, date, time, location, location type, identification of suspects and victims/complainants, status and facts of the case, investigator,
and station concerned. Among these attributes, five were selected: the type of crime, date and time of occurrence, location, and location type. The number of instances was also reduced, since some of the crime types given are irrelevant to the area being studied.
4) Data transformation
Transforming data entails converting values to forms appropriate for mining. In this paper, the values of the attributes Weather, Location, and Time were transformed to reduce the number of distinct values they take. This helped narrow down the circumstances leading to the prediction of possible crime locations.
In the data set, the resulting weather attribute had 9 distinct values after data integration. Using a concept hierarchy, it was reduced to 2 values (Yes/No) and the attribute was renamed Rain. The same technique was applied to location: instead of hundreds of distinct values from the streets of Manila, these were reduced to the 16 districts to which the streets belong. Time was likewise reduced to four-hour intervals, resulting in 6 distinct values. Table II shows the attributes generated after data preprocessing.

TABLE II. DATASET ATTRIBUTES
Attribute Name  Description                                                     Data Type
Crime_Category  Type of crime                                                   Nominal
District        The location where the crime happened                           Nominal
Location_Type   The nature of the location where the crime happened             Nominal
Year            The year when the crime happened                                Nominal
DayNominal      The day of the week when the crime happened                     Nominal
Longitude       The east-west coordinate of the location of the crime           Real
Latitude        The north-south coordinate of the location of the crime         Real
IsHoliday       Specifies whether the date of occurrence falls on a holiday     Nominal
Rain            Specifies whether it rained during the commission of the crime  Nominal

three temporal perspectives: time of the day, day of the week, and quarters of the year.
C. Predictive Classification
Classification models are used to predict categorical class labels. In this paper, five classification algorithms were implemented to predict possible crime locations along with other variables that contribute to the occurrence of crime. The tool used to implement these algorithms was WEKA (Waikato Environment for Knowledge Analysis) version 3.9, a software package that contains visualization tools and algorithms useful for predictive modeling and data analysis. The attributes included in the dataset were District, LocationType, Time, IsHoliday, and Rain.
VI. RESULTS AND DISCUSSION
A. Kernel Density Estimation Maps
Fig. 2 illustrates the distribution of crimes over a day. It shows that the highest crime rate occurs from 8:01 PM to 12:00 MN, followed by the incidents from midnight to 4:00 AM. On the other hand, the lowest crime rate occurs between 4:01 AM and 8:00 AM.

Figure 2. Distribution of crimes over a day

To identify the spatial characteristics of the incidents, Fig. 3 maps the incidents in four-hour intervals, where the blue areas represent the highest density of incidents and the yellow-green areas signify the lowest.
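The reduction of Time to six four-hour intervals, and of the weather values to the two-valued Rain attribute, can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the bin labels follow the interval boundaries quoted in the text (for example 8:01 PM to 12:00 MN), while the raw weather labels in `RAINY_LABELS` are hypothetical, because the paper does not list its nine weather values.

```python
from datetime import time

# Six four-hour bins. Each bin ends on the hour, following the interval
# boundaries quoted in the text (e.g. 8:01 PM to 12:00 MN).
TIME_BINS = [
    "12:01 AM-4:00 AM",
    "4:01 AM-8:00 AM",
    "8:01 AM-12:00 NN",
    "12:01 PM-4:00 PM",
    "4:01 PM-8:00 PM",
    "8:01 PM-12:00 MN",
]

# Hypothetical raw weather labels; the paper does not list its nine values.
RAINY_LABELS = {"light rain", "heavy rain", "thunderstorm", "drizzle"}


def time_bin(t: time) -> str:
    """Map a time of day (minute resolution) to its four-hour interval."""
    minutes = t.hour * 60 + t.minute
    # Shift by one minute so that 4:00 AM falls in the bin ending at 4:00 AM,
    # and wrap so that exactly midnight falls in the bin ending at 12:00 MN.
    return TIME_BINS[((minutes - 1) % 1440) // 240]


def rain_flag(weather: str) -> str:
    """Collapse a raw weather label to the two-valued Rain attribute."""
    return "Yes" if weather.strip().lower() in RAINY_LABELS else "No"


print(time_bin(time(20, 1)))
print(time_bin(time(4, 0)))
print(rain_flag("Thunderstorm"))
```

Counting incidents per returned label would give a time-of-day distribution of the kind summarized in Figure 2.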
Fig. 6 shows that the majority of these crimes happen in the months of July to September (3rd quarter), with the highest-density clusters found in the districts of Tondo, San Nicolas, Port Area, Ermita, and Malate (Fig. 7). These clusters are also prevalent in the three remaining KDE maps and attained the highest density in the time of the day and day of the week analyses.
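For readers unfamiliar with kernel density estimation, the idea behind these maps can be sketched in a few lines. The example below is an assumption-laden illustration rather than the procedure used in the study: it uses a Gaussian kernel with a hand-picked bandwidth on a small square grid, whereas ArcGIS applies its own kernel function and search-radius defaults, which this sketch does not reproduce. The incident coordinates are synthetic.

```python
import math


def gaussian_kde_grid(points, x_range, y_range, cells, bandwidth):
    """Evaluate a 2-D Gaussian kernel density estimate at grid cell centres.

    points    -- list of (x, y) incident coordinates in projected units
    x_range   -- (min, max) extent of the grid along x
    y_range   -- (min, max) extent of the grid along y
    cells     -- number of cells along each axis
    bandwidth -- kernel standard deviation, in the same units as the points
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    dx = (x_max - x_min) / cells
    dy = (y_max - y_min) / cells
    # Normalising constant of a 2-D Gaussian, averaged over all points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for row in range(cells):
        cy = y_min + (row + 0.5) * dy
        densities = []
        for col in range(cells):
            cx = x_min + (col + 0.5) * dx
            total = sum(
                math.exp(-((cx - px) ** 2 + (cy - py) ** 2) / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            densities.append(norm * total)
        grid.append(densities)
    return grid


def hottest_cell(grid):
    """Return (row, col) of the grid cell with the highest density."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[r]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


# Synthetic incidents: a tight cluster near (1.5, 1.5) and one outlier.
incidents = [(1.4, 1.5), (1.6, 1.4), (1.5, 1.7), (4.5, 4.5)]
grid = gaussian_kde_grid(incidents, (0.0, 5.0), (0.0, 5.0), cells=5, bandwidth=0.5)
print(hottest_cell(grid))
```

A larger bandwidth smooths neighbouring clusters into one broad hotspot, while a smaller one isolates individual incidents, so the choice of bandwidth affects which districts stand out.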
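The classification results in this section are reported as the percentage of correctly classified instances together with the kappa statistic, precision, and ROC area. As a reminder of why kappa is read alongside accuracy, the following sketch computes both from a confusion matrix. The two matrices are invented for illustration and are not taken from the paper's experiments.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i that the
    classifier assigned to class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: the fraction of correctly classified instances.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / total ** 2
    if expected == 1.0:
        # Kappa is undefined when chance agreement is perfect; this sketch
        # returns 0.0 in that case.
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)


# Invented two-class matrices, not results from the paper.
balanced = [[40, 10], [20, 30]]
majority_only = [[90, 0], [10, 0]]

for name, matrix in (("balanced", balanced), ("majority_only", majority_only)):
    acc, kappa = accuracy_and_kappa(matrix)
    print(f"{name}: accuracy={acc:.2f} kappa={kappa:.2f}")
```

The second matrix describes a classifier that always predicts the majority class: it is correct 90% of the time, yet its kappa is zero because that level of agreement is what chance alone would produce. A low kappa is read the same way in the results.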
The area under the ROC curve has values ranging from 0.5 to 1. A perfect test has an ROC Area of 1 and a useless test has 0.5.
The results illustrate the performance of the selected classifiers on the carnapping, gun shooting, physical injury, and robbery/theft datasets. These datasets were selected because of their high prediction accuracy, from which patterns can be extracted. For murder and sexual assault, the models' predictions were close to random, as indicated by their low kappa statistics.
1) Prediction: K-Fold validation
Ten-fold cross-validation was used: the data set is divided into 10 folds and the test is run 10 times, each time holding out one fold for testing and training on the remaining nine. Out of the six experimental datasets, the crimes concerning carnapping, robbery, and physical injury produced similar results.
It can be seen in Table III that in these datasets, the J48 and Random Forest classifiers achieved the highest percentage of correctly classified instances and the highest performance measures. With computed ROC Area values well above 0.5 (0.891 to 0.958), both classifiers perform relatively well over all possible thresholds. On the other hand, the Naïve Bayes classifier provided the highest accuracy, precision, and kappa statistic in the Gun Shooting dataset. However, in terms of test performance, the model generated using the Random Forest classifier got the highest ROC Area.

TABLE III.
Classifier      Correctly Classified Instances  Kappa   Precision  ROC Area
Carnapping
BayesNet        65.37%                          .5348   .660       .833
Naïve Bayes     65.47%                          .5358   .661       .833
J48             73.03%                          .6342   .730       .891
Random Forest   73.03%                          .6344   .730       .896
Decision Stump  35.85%                          .0983   .248       .583
Gun Shooting
BayesNet        77.41%                          .5077   .772       .775
Naïve Bayes     77.78%                          .5146   .775       .775
J48             73.84%                          .4316   .734       .755
Random Forest   76.34%                          .4944   .761       .784
Decision Stump  77.06%                          .4897   .771       .684
Robbery/Theft
BayesNet        70.18%                          .5994   .715       .888
Naïve Bayes     70.41%                          .6026   .718       .888
J48             77.51%                          .6936   .776       .911
Random Forest   77.51%                          .6958   .775       .919
Decision Stump  51.24%                          .2595   .357       .654
Physical Injury
BayesNet        73.03%                          .4942   .729       .819
Naïve Bayes     73.03%                          .4942   .729       .821
J48             86.31%                          .7617   .863       .958
Random Forest   85.68%                          .7509   .856       .956
Decision Stump  69.09%                          .385    .544       .650

TABLE IV. ACCURACY AND PERFORMANCE MEASURES ON PERCENTAGE SPLIT
Classifier      Train Data (%)  Correctly Classified Instances  Kappa   Precision  ROC Area
Physical Injury
BayesNet        80%             76.04%                          .5777   .765       .877
                70%             77.24%                          .5897   .784       .878
Naïve Bayes     80%             76.04%                          .5777   .765       .896
                70%             77.24%                          .5897   .784       .874
J48             80%             89.58%                          .8321   .909       .977
                70%             88.28%                          .81     .893       .963
Random Forest   80%             89.58%                          .8313   .907       .974
                70%             88.28%                          .8089   .890       .962
Decision Stump  80%             71.88%                          .4767   .574       .749
                70%             69.66%                          .4262   .550       .714

VII. CONCLUSION
The heat maps generated by ArcGIS 10 using kernel density estimation showed the areas in Manila with the highest number of criminal activities in the time of day, day of the week, and quarter of the year analyses. Criminal activity peaks from around 8:00 PM to 4:00 AM. Also, most of the incidents happen during weekends, and over the year the third quarter (July-September) has the most criminal activity based on historical data. The hotspots identified lie in the districts of Tondo, Port Area, San Nicolas, Ermita, and Malate. The identification of these hotspots may be used as input by the PNP in allocating its resources at a given time and location.
We then applied the BayesNet, Naïve Bayes, J48, Decision Stump, and Random Forest classifiers to six datasets categorized by the type of crime. Results showed that, in general, the Random Forest classifier outperformed the other algorithms in both the 10-fold cross validation and percentage split methods. However, its accuracy and performance values were close to the results generated using the J48 classifier. Both algorithms reached the highest percentage of correctly classified instances, 89.58%, but J48 got better kappa statistic, precision, and ROC area values. Using the models generated by these classification algorithms, law enforcement agencies may be able to predict the location (class label) and possible factors (attributes) that can affect the occurrence of crimes.
FUTURE WORK
As a future extension of this research, we plan to apply more classification and clustering methods to criminal data sets and expand the study area to the other cities in Metro Manila. We also intend to add other attributes necessary for profiling and identifying possible suspects based on historical data.
ACKNOWLEDGMENT
Our deepest, profound, and sincere gratitude to the Heavenly Father for lending us His spirit of wisdom and for providing us strength to finish this worthwhile undertaking. Appreciation is also due to our mentor, Ms. Julie Ann Salido, for her helpful insights and ideas vital to the conduct of this research. Ms. Baculo and Mr. Marzan acknowledge the Commission on Higher Education, in collaboration with the De La Salle University (DLSU), for funding support through the Commission on Higher Education K-12 Transition (CHED K-12) Program.