Analysis Course HW2
Question 2.1
A situation from everyday life where a classification model could be applied is deciding whether an individual
should buy or sell a stock based on a predicted closing price. Predictors for this response could include the
stock's high, low, close, and volume over a range of time. Financial data from a source like Yahoo Finance could
be used to train, validate, and test the model.
Question 2.2
Libraries Needed
library(kernlab)
library(kknn)
library(rsample)
library(caret)
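The call that reads in the data is not shown above; a minimal sketch, assuming the credit card data set is read into ccdata (the file name here is a placeholder):
ccdata <- read.table("credit_card_data.txt", header = FALSE) #placeholder file name
head(ccdata)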
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
tail(ccdata)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 649 1 40.58 3.290 3.50 0 1 0 0 400 0 0
## 650 1 21.08 10.085 1.25 0 1 0 1 260 0 0
## 651 0 22.67 0.750 2.00 0 0 2 0 200 394 0
## 652 0 25.25 13.500 2.00 0 0 1 0 200 1 0
## 653 1 17.92 0.205 0.04 0 1 0 1 280 750 0
## 654 1 35.00 3.375 8.29 0 1 0 0 0 0 0
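The loop that produced accuracy_vector does not appear above; a sketch, mirroring the C-search loop used later in Question 3.1b and assuming candidate C values of 10^-3 through 10^3:
Cloop <- 10^(-3:3) #assumed range of candidate C values
accuracy_vector <- vector("numeric")
for (lambda in Cloop) {
  model <- ksvm(as.matrix(ccdata[,1:10]), #train a linear SVM on all 10 predictors
                as.factor(ccdata[,11]),
                type = "C-svc",
                kernel = "vanilladot",
                C = lambda,
                scaled = TRUE)
  prediction <- predict(model, ccdata[,1:10])
  accuracy_vector <- c(accuracy_vector, sum(prediction == ccdata[,11]) / nrow(ccdata))
}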
plot(accuracy_vector)
max(accuracy_vector)
## [1] 0.8639144
which.max(accuracy_vector)
## [1] 4
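The coefficients below come from the model trained at the chosen C; a sketch of that step, assuming C = 100 per the discussion further down:
model <- ksvm(as.matrix(ccdata[,1:10]), as.factor(ccdata[,11]),
              type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)
a <- colSums(model@xmatrix[[1]] * model@coef[[1]]) #hyperplane coefficients
a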
## V1 V2 V3 V4 V5
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## V6 V7 V8 V9 V10
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
a0 <- -model@b
a0
## [1] 0.08158492
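The 0/1 vector printed below is the model's predicted class for every row of ccdata; a sketch of the call that produces it:
prediction <- predict(model, ccdata[,1:10])
prediction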
## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
Observing the accuracy vector shows that ~86.39% is both the maximum and the most frequently occurring accuracy, so we will use C = 10^2. It should be noted that the accuracy is essentially unaffected until C becomes significantly small.
accuracy_vector
With a and a0, the equation for our classifier can be expressed as follows:
equation <- paste("0 =", a[1], "* V1 +", a[2], "* V2 +", a[3], "* V3 +", a[4], "* V4 +",
                  a[5], "* V5 +", a[6], "* V6 +", a[7], "* V7 +", a[8], "* V8 +",
                  a[9], "* V9 +", a[10], "* V10 +", a0)
equation
knn_pred <- rep(0, nrow(ccdata)) #vector the size of our dataset, to be filled with the model's 0/1 predictions
knn_acc_vector <- vector("numeric") #empty vector to store the accuracy of our model for each value of K
for (K in 1:50) { #number of K values to iterate through
  for (i in 1:nrow(ccdata)) { #leave one data point out at a time, where i is the data point
    knn_model <- kknn(V11 ~ .,
                      ccdata[-i,], #train on all data points except the ith
                      ccdata[i,],  #predict the held-out ith point
                      k = K,
                      kernel = "optimal",
                      scale = TRUE)
    #fitted() returns the predicted response; since kknn treats the response as continuous,
    #round it so every prediction is either 1 or 0, then store it in the prediction vector
    knn_pred[i] <- round(fitted(knn_model))
  }
  #fraction of points where our prediction matches the data set, i.e. the model's accuracy
  knn_acc <- sum(knn_pred == ccdata[,11]) / nrow(ccdata)
  knn_acc_vector <- c(knn_acc_vector, knn_acc) #for each K, store the accuracy
}
plot(knn_acc_vector)
max(knn_acc_vector) # accurate 85.32% of the time!
## [1] 0.853211
which.max(knn_acc_vector)
## [1] 12
Question 3.1a
Using the full ccdata set, we can train our model with k-fold cross-validation via the cv.kknn function from the
kknn library. Keeping the number of folds constant, we can iterate over K for the nearest-neighbor model. We
could also do the opposite: hold K constant and iterate to determine the best number of folds to use.
k_acc_vec <- vector("numeric")
for (K in 1:50) {
  kmodel3 <- cv.kknn(V11 ~ .,
                     ccdata,
                     kcv = 10, #number of folds
                     k = K,
                     kernel = "optimal",
                     scale = TRUE)
  kmodel3 <- data.frame(kmodel3) #cv.kknn returns its predictions as a list, so coerce to a data frame
  kmodelpred2 <- kmodel3[,2] #the 2nd column holds the model predictions
  rpred2 <- round(kmodelpred2) #round them so that they are 1 or 0
  k_accuracy3 <- sum(rpred2 == ccdata[,11]) / nrow(ccdata)
  k_acc_vec <- c(k_acc_vec, k_accuracy3)
}
plot(k_acc_vec)
max(k_acc_vec)
## [1] 0.8577982
set.seed(3)
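The train.kknn fit and the accuracy computation that produced the number below are missing; a sketch consistent with the model summary printed after it:
kmodel <- train.kknn(V11 ~ ., data = ccdata, kmax = 100,
                     kernel = "optimal", scale = TRUE) #leave-one-out search over k = 1..100
kpred <- round(predict(kmodel, ccdata[,1:10])) #predictions from the best k, rounded to 0/1
sum(kpred == ccdata[,11]) / nrow(ccdata)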
## [1] 0.8776758
kmodel
##
## Call:
## train.kknn(formula = V11 ~ ., data = ccdata, kmax = 100, kernel = "optimal", scale = TRUE)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.1850153
## Minimal mean squared error: 0.1073792
## Best kernel: optimal
## Best k: 58
Question 3.1b
Splitting the data into training, validation, and test sets, we can compare the KNN and SVM models.
set.seed(3)
#Splitting data into 70% training, 15% validation, and 15% testing
ccdatasplit <- sample(1:3, nrow(ccdata), prob = c(.7,.15,.15), replace = TRUE)
cctrain <- ccdata[ccdatasplit == 1,]
ccvalid <- ccdata[ccdatasplit == 2,]
cctest <- ccdata[ccdatasplit == 3,]
#Training the KSVM model using our previous code to find the C value with the lowest training error on our training set.
Cloop <- 10^(-3:3)
ksvm_acc_vec <- vector("numeric")
for(lambda in Cloop) {
ksvm_model <- ksvm(as.matrix(cctrain[,1:10]),
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = lambda,
scaled=TRUE)
a <- colSums(ksvm_model@xmatrix[[1]] * ksvm_model@coef[[1]])
a0 <- -ksvm_model@b
prediction <- predict(ksvm_model,cctrain[,1:10])
ksvm_acc <- sum(prediction == cctrain[,11]) / nrow(cctrain)
ksvm_acc_vec <- c(ksvm_acc_vec,ksvm_acc)
}
ksvm_acc_vec
Our ksvm_model appears to have a fairly consistent accuracy of about 87.11% as long as C is not significantly
small, so we will use C = 100 for the KSVM model we carry into validation.
set.seed(3)
ksvm_model2 <- ksvm(as.matrix(cctrain[,1:10]), #train on training set
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = 100,
scaled=TRUE)
## Setting default kernel parameters
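The scoring step that produced the accuracy below is missing; a sketch, predicting on the validation set:
ksvm_valid_pred <- predict(ksvm_model2, ccvalid[,1:10]) #predict validation responses
sum(ksvm_valid_pred == ccvalid[,11]) / nrow(ccvalid) #validation accuracy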
## [1] 0.877551
I was skeptical about the predict function, so I wanted to see if I could reproduce this accuracy by training the
model on the validation set itself.
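The call that builds ksvm_model_valid is missing; a sketch, training the same linear SVM directly on the validation set so its reported training error can be compared:
ksvm_model_valid <- ksvm(as.matrix(ccvalid[,1:10]), as.factor(ccvalid[,11]),
                         type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)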
ksvm_model_valid
1 - 0.122449
## [1] 0.877551
We got the same answer. From the validation data, our ksvm_model has a training error of about 0.122449, so it
is accurate about 87.75% of the time, slightly higher than on our training set. Let's see how our KNN model
performs on the training set.
set.seed(3)
#Finding the best K on the training set by iterating through different values of K. The K with the
#highest accuracy will be used for the kknn model scored on the validation set. (Most of this loop
#was lost in extraction; the body below is reconstructed to mirror the Question 2.2 loop.)
knn_pred2 <- rep(0, nrow(cctrain))
knn_acc_vector2 <- vector("numeric")
for (K in 1:50) {
  for (i in 1:nrow(cctrain)) {
    knn_model2 <- kknn(V11 ~ ., cctrain[-i,], cctrain[i,], k = K, kernel = "optimal", scale = TRUE)
    knn_pred2[i] <- round(fitted(knn_model2))
  }
  knn_acc2 <- sum(knn_pred2 == cctrain[,11]) / nrow(cctrain)
  knn_acc_vector2 <- c(knn_acc_vector2, knn_acc2)
}
plot(knn_acc_vector2)
max(knn_acc_vector2) # 84.67% accurate!
## [1] 0.8466667
which.max(knn_acc_vector2)
## [1] 10
K = 10 had the highest accuracy on the training set, at about 84.67%. Let's see how this performs on the
validation set.
set.seed(3)
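The kknn fit and scoring step that produced the accuracy below are missing; a sketch, fitting K = 10 on the training set and scoring on the validation set (object names are assumptions):
knn_model_valid <- kknn(V11 ~ ., cctrain, ccvalid,
                        k = 10, kernel = "optimal", scale = TRUE)
knn_valid_pred <- round(fitted(knn_model_valid)) #round continuous responses to 0/1
sum(knn_valid_pred == ccvalid[,11]) / nrow(ccvalid) #validation accuracy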
## [1] 0.8571429
The kknn model performed slightly better on the validation set (85.71%) than on the training set. Since the SVM
model still performed best on the validation set, we will use the SVM model on the test data set to see how well
it predicts.
set.seed(3)
ksvm_prediction_test <- predict(ksvm_model2, cctest[,1:10])
ksvm.acc2 <- sum(ksvm_prediction_test == cctest[,11]) / nrow(cctest)
ksvm.acc2 # 82.07% accurate on the test set!
## [1] 0.8207547
Conclusion
The KSVM model is the better model for our data, given its higher accuracy on the validation set compared to
KNN. Applied to the test set, the KSVM model is accurate about 82.07% of the time, down from the 87.11%
accuracy we observed on the training set.