Big Data Analytics For Gold Price Forecasting Based On Decision Tree Algorithm and Support Vector Regression (SVR)
All content following this page was uploaded by Dr. Vadivu G on 19 May 2015.
Abstract: This paper develops a model for forecasting gold prices based on economic factors such as inflation, currency price movements and others. Investors put their money into gold because gold plays an important role as a stabilizing influence on investment portfolios. Given the increasing demand for gold in India, it is necessary to develop a model that reflects the structure and pattern of the gold market and forecasts the movement of the gold price. The approaches considered here for understanding gold prices are support vector regression and the decision tree model. The experimental results show which of these two algorithms performs better.
Keywords: R, RHadoop, SVR (Support Vector Regression), Decision tree, Gold price.
1. Introduction

Essentially, there are two types of market: the equity market and the commodity market. An equity market is an aggregation of the producers and consumers of stocks, while the commodity market trades in primary rather than manufactured products. The commodity market has two types of commodity: soft commodities, which include wheat, coffee, cocoa and sugar, and hard commodities, which include gold, rubber and oil.

In India, gold (a hard commodity) plays a crucial role in the market. Of all the metals, gold is the most popular as an investment, and investors buy gold to diversify risk. Like other markets, the gold market is subject to speculation, especially through the use of futures contracts and derivatives. Gold trades predominantly as a function of sentiment, and its price is less affected by the laws of supply and demand. Gold is storable, and many people invest their money in the gold market. For these investors, forecasting the value (price) of gold is very important. There are many methods for predicting and forecasting the price, such as linear regression, logistic regression, decision trees and support vector regression. In this paper we describe two forecasting methods: the decision tree method and the support vector regression method. Both methods predict the value on the basis of the factors of the gold data. Forecasting is basically used to check the trend and to earn money. The price of gold also depends on supply and demand, just like other goods.

Big data analytics is the process of gathering data from different sources, managing or organizing the data, and applying analytics to large datasets to find patterns and check tendencies in the data. Big data can be of any type: structured, semi-structured or unstructured, or a mixture of the three. Big data analytics is useful for finding correlations, customer preferences and various trends in the data. Here, the main purpose of big data analytics is to check the timing for entering and exiting the market when investing money. Various tools are available for analytics, such as BI tools, statistical tools and data visualization tools. However, most of these tools cannot support large amounts of data, and those that do take a long time to process or analyse it. Big data analytics is used to perform data mining, data forecasting, data prediction, etc. For forecasting the gold price there are four or five factors, such as open price, close price, lowest price, highest price and value of the gold. From these factors, the percentage change of the price can be found.

One of the reasons the gold price changes is external effects such as social problems, economic policies, and environmental and political conditions. A general assumption made in such cases is that the historical data incorporate all such behavior. As a result, the historical data is the major input to the prediction process. Under this hypothesis, the external effects are modeled as noise and the phenomena are considered accidental.

Before starting the analysis, the analyst performs data cleaning or data pre-processing. The analyst removes the
Volume 4 Issue 3, March 2015
www.ijsr.net
Paper ID: SUB152560 2026
Licensed Under Creative Commons Attribution CC BY
International Journal of Science and Research (IJSR)
ISSN (Online): 2319-7064
Index Copernicus Value (2013): 6.14 | Impact Factor (2013): 4.438
NULL values, duplicate data and ambiguous data from the datasets. R can connect to a MySQL database to manipulate or clean the data using the RMySQL package, and all the database files can be accessed through R. R can also connect to Hadoop to access HDFS (Hadoop Distributed File System) files and perform analysis on those datasets. From the Linux terminal, the analyst can access all the platforms, such as R, the database and Hadoop. Hadoop supports additional software packages (ecosystem tools) for analysis purposes; here the Apache Sqoop ecosystem tool is used to provide the connection between databases and Hadoop. Big data can be analyzed with many software tools commonly used as part of advanced analytics disciplines such as data mining, statistical analysis, predictive analytics and text analytics. Many BI tools support analytics and visualization techniques, but relational databases cannot support unstructured data, and traditional data warehouses may not be able to handle sets of big data that need to be updated continually.

One of the best frameworks that supports large datasets is Hadoop. Apache Hadoop is an open source framework, written in Java, that supports large amounts of data with the MapReduce technique and Hadoop ecosystem tools (additional software packages) such as Apache Pig, Apache Hive and Apache Sqoop. The main part of Apache Hadoop is HDFS (Hadoop Distributed File System).

[Figure: Real-time data on the performance of gold price.]

Many organizations that gather, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as YARN, MapReduce, Spark, Hive, Pig and HBase, as well as NoSQL databases such as MongoDB. Together these technologies support the processing of large data sets across clustered systems. HDFS is the storage part of Hadoop; to process the data, Hadoop uses the MapReduce technique. A main goal of Hadoop is data locality, which is to use whole servers in a large cluster, each with its own internal disk drive. To provide higher performance, MapReduce assigns the total workload to these servers, which then carry out the data analysis.

2. R

R is statistical and data analysis software used to analyse data and to apply predictive modeling with data visualization. R is used for many graphical and statistical methods, such as time series analysis, classification, clustering, classical statistical tests, and linear and non-linear modeling, through the use of its libraries. R interfaces with languages such as C, C++ and Python, which can directly manipulate R objects. For any specific function, area or language, users can load packages into R, which makes it highly extensible. With different packages, R provides better connectivity, better analytics and better visualization. For graphical representation (visualization), R offers many plots, such as histograms, box plots, pie charts, line graphs and bubble charts; for analysing gold price fluctuation, the most valuable is the line graph. R supports visualization in both 2D and 3D. Visualization in R can express the results and the mining process, and it allows the user to find the exact problem after understanding the data deeply and, after analysing the data values, to recognize which algorithm is best for the analysis.

R has some important features that facilitate data
manipulation and calculation. It also includes:
- A facility for storing and handling data
- A suite of operators for calculations on arrays, in particular matrices
- A large, integrated, coherent collection of intermediate tools for data analysis
- Graphical facilities for data analysis that display the result either on softcopy or on hardcopy
- A simple and very effective language, which includes conditionals, loops, user-defined recursive functions, and input and output facilities

Support vector machines are based on statistical learning theory and are a very powerful and useful tool for regression and classification tasks. Support vector regression is basically used for time series problems and regression problems. A good example of time series data is gold price data.

The basic concept of support vector regression starts from a given dataset $\{(x_i, y_i)\}_{i=1}^{P}$, where $x_i \in X$, $y_i \in \mathbb{R}$, and $P$ is the size of the training data. $X$ denotes the input space.
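As a rough illustration of how support vector regression scores a candidate fit on such a dataset, the sketch below uses the standard epsilon-insensitive loss described in Smola and Schölkopf's tutorial on support vector regression. The paper does not state its loss or parameter values, so the linear model, the names `w`, `b`, `C` and `eps`, and the use of Python (the paper's own tooling is R) are all assumptions made for illustration.

```python
# Minimal sketch of the standard (linear) SVR objective.
# All names and values are illustrative assumptions, not taken from the paper.

def linear_predict(w, b, x):
    """Linear regression function f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero while the error stays inside the eps-tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def svr_objective(w, b, data, C, eps):
    """0.5 * ||w||^2 + C * (sum of eps-insensitive losses over the data)."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    total_loss = sum(
        eps_insensitive_loss(y, linear_predict(w, b, x), eps) for x, y in data
    )
    return regulariser + C * total_loss
```

Training an SVR amounts to choosing `w` and `b` (and, with kernels, their nonlinear counterparts) so that this objective is as small as possible; `C` trades flatness of the function against tolerance for errors larger than `eps`.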
1) Training phase
Stage 1: Read the randomly selected training dataset from the local repository.
Stage 2: Apply the windowing operator to transform the data into a generic dataset. This step converts the last row of a window within the time series into a label or target variable; the last variable is treated as the label.
Stage 3: Apply the cross validation process (CVP) to the labels produced by that operator in order to feed them as inputs into the support vector regression model.
Stage 4: Select the type of kernel and the special parameters of support vector regression.
Stage 5: Apply the model to the dataset and observe the performance or accuracy.
Stage 6: If the accuracy is good, go to Stage 7; otherwise go to Stage 4.
Stage 7: Exit from the training phase and apply the trained model to the testing dataset.

2) Testing phase
Stage 1: Read the randomly selected testing dataset from the local repository.
Stage 2: Apply the trained model to the out-of-sample dataset for gold price prediction.
Stage 3: Produce the predicted gold price trends.

5. Decision Tree

The decision tree is a visual form that has a root node and leaf nodes. The leaf nodes contain the results. There are two types of node in a decision tree: inner nodes and terminal nodes. Basically, two types of decision tree can be drawn for the gold [...]

The tree is read starting from the root node and going down step by step to a terminal node to interpret the result. The decision tree is a strong approach for predicting the gold value, and it gave the best result when predicting the gold price. For each node, we calculate the EMV (expected monetary value) and place it in the node to indicate that it is the expected value calculated over all branches emanating from that node. The decision tree has four key advantages:
- It implicitly performs feature selection or variable screening.
- It requires relatively little effort from users for data preparation and prediction.
- Nonlinear relationships between parameters do not affect its performance.
- It is easy to interpret and explain to executives.

After performing all the experiments on the gold price datasets, we compute three error measures: MSE (mean squared error), MAD (mean absolute deviation) and MAPE (mean absolute percentage error).
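The windowing step of the training phase and the three error measures (MSE, MAD and MAPE) can be sketched together as follows. This is a minimal illustration in Python (the paper's own tooling is R), not the authors' implementation: the window size, the sample closing prices and the naive "last value" predictor that stands in for a trained model are all assumptions, and MAD is taken here in its usual forecasting sense of the mean absolute error.

```python
# Sketch: windowing transform plus MSE, MAD and MAPE.
# Window size, prices and the naive predictor are illustrative assumptions.

def make_windows(series, window):
    """Turn a price series into (features, label) pairs.

    Each example covers `window` consecutive values: the last value of the
    window is the label, the values before it are the features.
    """
    if window < 2 or window > len(series):
        raise ValueError("window must be between 2 and len(series)")
    examples = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        examples.append((chunk[:-1], chunk[-1]))
    return examples


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mad(actual, predicted):
    """Mean absolute deviation of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


closes = [100.0, 102.0, 101.0, 105.0, 110.0]  # hypothetical closing prices
examples = make_windows(closes, window=3)
actual = [label for _, label in examples]
# Stand-in for a trained model: predict the most recent value in the window.
predicted = [features[-1] for features, _ in examples]
print(mse(actual, predicted), mad(actual, predicted), mape(actual, predicted))
```

In a real experiment, `predicted` would come from the trained decision tree or SVR model applied to the out-of-sample windows, and the same three functions would be used to compare the two models.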
6. Conclusion
The gold data contains five factors: open value, close value, low value, high value and volume. Gold provides an effective and useful means of diversifying a portfolio. The way to achieve success with gold is to know your goals and risk profile before jumping in. The volatility of gold can be harnessed to accumulate wealth, but left unchecked, it can also lead to ruin. Based on these attributes, we have predicted the result with both methods. Decision trees are best for feature selection, and SVR is best for large datasets. However, SVR has a drawback: it takes a long time to train on the dataset. The decision tree takes less time to process the data, and it has a lower mean squared error than SVR.