
Analysis Course HW2

Uploaded by

12joe

HW1

Question 2.1
An everyday situation where a classification model could be applied is deciding whether an individual
should buy or sell a stock based on its predicted close price. Predictors for this response could include
the stock's high, low, close, and volume over a range of time.
Finance data from a source like Yahoo Finance could be used to train, validate, and test the model.

Question 2.2
Libraries Needed

library(kernlab)
library(kknn)
library(rsample)
library(caret)

Importing the file “credit_card_data.txt” and storing it as “ccdata”

ccdata <- read.table("credit_card_data.txt", header = FALSE, stringsAsFactors = FALSE)


set.seed(3)
head(ccdata)

## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1

tail(ccdata)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 649 1 40.58 3.290 3.50 0 1 0 0 400 0 0
## 650 1 21.08 10.085 1.25 0 1 0 1 260 0 0
## 651 0 22.67 0.750 2.00 0 0 2 0 200 394 0
## 652 0 25.25 13.500 2.00 0 0 1 0 200 1 0
## 653 1 17.92 0.205 0.04 0 1 0 1 280 750 0
## 654 1 35.00 3.375 8.29 0 1 0 0 0 0 0

0.1 KSVM Model


Iterating through different magnitudes of C and storing the accuracy of each model’s predictions in a vector called
accuracy_vector.

Cloop <- 10^(-5:2)


accuracy_vector <- vector("numeric")
for(lambda in Cloop) {
model <- ksvm(as.matrix(ccdata[,1:10]),
as.factor(ccdata[,11]),
type = "C-svc",
kernel = "vanilladot",
C = lambda,
scaled=TRUE)
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a0 <- -model@b
pred <- predict(model,ccdata[,1:10])
accuracy <- sum(pred == ccdata[,11]) / nrow(ccdata)
accuracy_vector <- c(accuracy_vector,accuracy)
}

## Setting default kernel parameters


## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters

plot(accuracy_vector)
max(accuracy_vector)

## [1] 0.8639144

which.max(accuracy_vector)

## [1] 4

Calculating a1…am with C = 10^2 (the model left in memory after the last loop iteration):

a <- colSums(model@xmatrix[[1]] * model@coef[[1]])


a

## V1 V2 V3 V4 V5
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## V6 V7 V8 V9 V10
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995

Calculating a0 with C = 10^2:

a0 <- -model@b
a0
## [1] 0.08158492
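
A minimal sketch of how a and a0 turn into a prediction. It assumes that the positive side of the hyperplane maps to class 1 (this is consistent with the large positive V5 coefficient, but it is an assumption about kernlab's sign convention) and that the point has already been scaled the same way ksvm scales the training data. The coefficients are rounded from the output above; `classify_point` and `x_scaled` are hypothetical names and values used only for illustration.

```r
# Coefficients rounded from the a and a0 output above
a <- c(V1 = -0.0010, V2 = -0.0012, V3 = -0.0016, V4 = 0.0030, V5 = 1.0049,
       V6 = -0.0028, V7 = 0.0003, V8 = -0.0005, V9 = -0.0012, V10 = 0.1064)
a0 <- 0.0816

# Hypothetical helper: evaluate a1*x1 + ... + a10*x10 + a0 and
# return 1 if the score is positive, 0 otherwise (assumed sign convention)
classify_point <- function(x_scaled, a, a0) {
  score <- sum(a * x_scaled) + a0
  as.integer(score > 0)
}

# Hypothetical, already-scaled data point (not taken from the data set)
x_scaled <- c(0.5, -0.2, 0.1, 0.3, 1.0, -1.0, 0.4, 0.9, -0.3, 0.2)
label_pos <- classify_point(x_scaled, a, a0)

# Same point with the V5 value flipped; V5 dominates the score
x_flipped <- x_scaled
x_flipped[5] <- -1.0
label_neg <- classify_point(x_flipped, a, a0)
```

Because the V5 coefficient is roughly ten times larger than any other, flipping V5 alone is enough to move this hypothetical point across the boundary.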

Model Prediction with C = 10^2:

pred <- predict(model,ccdata[,1:10])


pred

## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

The accuracy vector shows that 86.39% is both the maximum and the most common value: every C from 10^-2
through 10^2 reaches it, so we will use C = 10^2. Accuracy only drops once C is 10^-3 or smaller.

accuracy_vector

## [1] 0.5474006 0.5474006 0.8379205 0.8639144 0.8639144 0.8639144 0.8639144


## [8] 0.8639144
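
The tie among the C values can be read off programmatically. The sketch below copies the printed accuracies and maps positions back to C values; `best_C` and `tied_C` are names introduced here for illustration. Note that `which.max` returns only the first position of the maximum, which is why it reports 4 even though five values of C tie.

```r
Cloop <- 10^(-5:2)

# Accuracies copied from the printed accuracy_vector above
accuracy_vector <- c(0.5474006, 0.5474006, 0.8379205, 0.8639144,
                     0.8639144, 0.8639144, 0.8639144, 0.8639144)

# which.max() returns the FIRST index that attains the maximum
best_index <- which.max(accuracy_vector)
best_C <- Cloop[best_index]

# All C values that reach the maximum accuracy
tied_C <- Cloop[accuracy_vector == max(accuracy_vector)]
```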

With a and a0, the equation for our classifier can be expressed as follows:

equation <- paste("0 =", a[1], "* V1 +", a[2], "* V2 +", a[3], "* V3 +", a[4], "* V4 +", a[5],
"* V5 +", a[6], "* V6 +", a[7], "* V7 +", a[8], "* V8 +", a[9], "* V9 +", a[10], "* V10 +", a0)
equation

## [1] "0 = -0.00100653481057611 * V1 + -0.00117290480611665 * V2 + -0.00162619672236963 * V3 +
0.0030064202649194 * V4 + 1.00494056410556 * V5 + -0.00282594323043472 * V6 +
0.000260029507016313 * V7 + -0.000534955143494997 * V8 + -0.00122837582291523 * V9 +
0.106363399527188 * V10 + 0.081584921659538"

0.2 KKNN Model

The KKNN loop below iterates through different values of K and stores the accuracy of each model in a
vector so that we can determine which K value gives us the highest accuracy on the dataset.

knn_pred <- rep(0, nrow(ccdata)) # vector of 0's, one per data point, to be filled with our model's predictions
knn_acc_vector <- vector("numeric") # empty vector to store the accuracy of our model for each K
for (K in 1:50) { # K values to iterate through
  for (i in 1:nrow(ccdata)) { # leave out the ith data point
    knn_model <- kknn(ccdata[-i,11]~.,
                      ccdata[-i,1:10], # train on the predictors of all but the ith data point
                      ccdata[i,1:10], # test on the predictors of the ith data point
                      k = K,
                      kernel = "optimal",
                      scale = TRUE)
    # fitted() returns the predicted response. kknn reads the response as continuous,
    # so round() turns each prediction into 1 or 0 before it is stored in knn_pred.
    knn_pred[i] <- round(fitted(knn_model))
  }
  # fraction of data points where our prediction matches the data set
  knn_acc <- sum(knn_pred == ccdata[,11]) / nrow(ccdata)
  knn_acc_vector <- c(knn_acc_vector,knn_acc) # for each K, store the accuracy in a vector
}

plot(knn_acc_vector)
max(knn_acc_vector) # Accurate 85.32% of the time!

## [1] 0.853211

which.max(knn_acc_vector) #Max accuracy @ K = 12

## [1] 12
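
The leave-one-out structure of the loop above can be seen in isolation with a small base-R sketch. This is not kknn: it swaps in a plain, unweighted nearest-neighbor vote on synthetic data so that it runs without any packages. The data frame `toy` and the function `loo_knn_accuracy` are hypothetical and exist only for this illustration.

```r
set.seed(1)

# Synthetic two-class data (NOT the credit card data)
n <- 40
toy <- data.frame(x1 = c(rnorm(n / 2, 0), rnorm(n / 2, 3)),
                  x2 = c(rnorm(n / 2, 0), rnorm(n / 2, 3)),
                  y  = rep(c(0, 1), each = n / 2))

# Leave-one-out accuracy of an unweighted k-nearest-neighbor vote.
# For simplicity the predictors are scaled once using all rows.
loo_knn_accuracy <- function(data, k) {
  X <- scale(as.matrix(data[, c("x1", "x2")]))
  y <- data$y
  pred <- numeric(nrow(data))
  for (i in seq_len(nrow(data))) {
    # Euclidean distance from point i to every other point
    diffs <- sweep(X[-i, , drop = FALSE], 2, X[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- order(d)[1:k]
    # Average the neighbors' 0/1 labels and round to get a class
    pred[i] <- round(mean(y[-i][nearest]))
  }
  mean(pred == y)
}

# One accuracy per K, as in the loop above
acc <- sapply(1:10, function(k) loo_knn_accuracy(toy, k))
best_k <- which.max(acc)
```

As in the original loop, point i is never among its own neighbors, so each prediction is made by a model that has not seen that point's label.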

Question 3.1a
Using the full ccdata set, we can train our model with k-fold cross-validation using the function cv.kknn from
library(kknn). Keeping the number of folds constant, we can iterate over K for the nearest
neighbor model. We could also do the opposite: keep K constant and iterate to determine the
best number of folds to use.

k_acc_vec <- vector("numeric")
for (K in 1:50) {
kmodel3 <- cv.kknn(V11 ~ .,
ccdata,
kcv = 10, # # of folds
k = K,
kernel = "optimal",
scale = TRUE)
# cv.kknn returns its results as a list, so we use data.frame to flatten it into a regular table
kmodel3 <- data.frame(kmodel3)
kmodelpred2 <- kmodel3[,2] #the 2nd column has our model predictions
rpred2 <- round(kmodelpred2) #round them so that they are 1 or 0
k_accuracy3 <- sum(rpred2 == ccdata[,11]) / nrow(ccdata)
k_acc_vec <- c(k_acc_vec, k_accuracy3)
}
plot(k_acc_vec)

max(k_acc_vec) # 85.78% accurate

## [1] 0.8577982

which.max(k_acc_vec) # Most accurate with a K value of 5


## [1] 5
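
To make the idea of folds concrete, here is a minimal base-R sketch of splitting 654 rows into 10 folds of nearly equal size. This is an illustration of k-fold partitioning in general and is not necessarily how cv.kknn assigns folds internally; `fold_id` and `held_out_counts` are names introduced here.

```r
set.seed(3)

n <- 654    # number of rows in ccdata, per the tail() output
kcv <- 10   # number of folds

# Give every row a fold label 1..kcv, then shuffle the labels
fold_id <- sample(rep(seq_len(kcv), length.out = n))

# Each fold is held out once while the other nine are used for training
held_out_counts <- sapply(seq_len(kcv), function(f) {
  test_rows  <- which(fold_id == f)
  train_rows <- which(fold_id != f)
  # a model would be fit on train_rows and evaluated on test_rows here
  length(test_rows)
})
```

Every row lands in exactly one held-out fold, so each data point is predicted once by a model that did not train on it.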

Training of kknn via leave-one-out cross validation method

set.seed(3)

kmodel <- train.kknn(V11 ~.,
                     ccdata,
                     kmax = 100,
                     kernel = "optimal",
                     scale = TRUE)

kpred <- predict(kmodel, ccdata)


roundedpred <- round(kpred)
k_accuracy <- sum(roundedpred == ccdata[,11])/ nrow(ccdata)
k_accuracy

## [1] 0.8776758

kmodel

##
## Call:
## train.kknn(formula = V11 ~ ., data = ccdata, kmax = 100, kernel = "optimal", scale = TRUE)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.1850153
## Minimal mean squared error: 0.1073792
## Best kernel: optimal
## Best k: 58

Question 3.1b
Splitting the data into training, validation, and test sets lets us compare the KNN and SVM models.

set.seed(3)
#Splitting data into roughly 70% training, 15% validation, and 15% testing
ccdatasplit <- sample(1:3, nrow(ccdata), prob = c(.7,.15,.15), replace = TRUE)
cctrain <- ccdata[ccdatasplit == 1,]
ccvalid <- ccdata[ccdatasplit == 2,]
cctest <- ccdata[ccdatasplit == 3,]
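
Because `sample(1:3, ..., prob = ...)` draws a label for each row independently, the three sets only approximate 70/15/15. If exact sizes are wanted, one alternative (a sketch, not the method used in this report) is to shuffle the row indices and cut them at fixed positions. The names `shuffled`, `train_idx`, `valid_idx`, and `test_idx` are introduced here for illustration.

```r
set.seed(3)

n <- 654                      # number of rows in ccdata, per the tail() output
shuffled <- sample(n)         # random permutation of the row indices

n_train <- floor(0.70 * n)    # 457 rows
n_valid <- floor(0.15 * n)    # 98 rows; the remainder goes to the test set

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

split_sizes <- c(train = length(train_idx),
                 valid = length(valid_idx),
                 test  = length(test_idx))
```

With the data loaded, the subsets would then be taken as `ccdata[train_idx, ]`, `ccdata[valid_idx, ]`, and `ccdata[test_idx, ]`.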

#Training KSVM Model using our previous code to find the C value that has the lowest training error on our training set.
Cloop <- 10^(-3:3)
ksvm_acc_vec <- vector("numeric")
for(lambda in Cloop) {
ksvm_model <- ksvm(as.matrix(cctrain[,1:10]),
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = lambda,
scaled=TRUE)
a <- colSums(ksvm_model@xmatrix[[1]] * ksvm_model@coef[[1]])
a0 <- -ksvm_model@b
prediction <- predict(ksvm_model,cctrain[,1:10])
ksvm_acc <- sum(prediction == cctrain[,11]) / nrow(cctrain)
ksvm_acc_vec <- c(ksvm_acc_vec,ksvm_acc)
}

## Setting default kernel parameters


## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters

ksvm_acc_vec

## [1] 0.8200000 0.8711111 0.8711111 0.8711111 0.8711111 0.8711111 0.8711111

Our ksvm_model has a consistent training accuracy of 87.11% for every C from 10^-2 through 10^3; only
C = 10^-3 does worse (82.00%). We will use C=100 for our KSVM model for validation.

set.seed(3)
ksvm_model2 <- ksvm(as.matrix(cctrain[,1:10]), #train on training set
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = 100,
scaled=TRUE)
## Setting default kernel parameters

#predicting how the model will do on our validation set's predictors.
ksvm_prediction_valid <- predict(ksvm_model2, ccvalid[,1:10])
ksvm.acc <- sum(ksvm_prediction_valid == ccvalid[,11]) / nrow(ccvalid)
ksvm.acc # 87.75% accurate!

## [1] 0.877551

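I was skeptical about the predict function, so I wanted to see whether I could reproduce this accuracy by training the
model on the validation set itself. Note that this fits a separate model, so its training error is a sanity check on
the number rather than a direct test of predict.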

#Validating KSVM Model


set.seed(3)
ksvm_model_valid <- ksvm(as.matrix(ccvalid[,1:10]),
as.factor(ccvalid[,11]),
type = "C-svc",
kernel = "vanilladot",
C = 100,
scaled=TRUE)

## Setting default kernel parameters

ksvm_model_valid

## Support Vector Machine object of class "ksvm"


##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 42
##
## Objective Function Value : -2400.483
## Training error : 0.122449

1-.122449

## [1] 0.877551

Got the same answer! The model refit on the validation data has a training error of about 0.122449, so it is
accurate about 87.75% of the time, slightly higher than the 87.11% on our training set. Let's see how our KNN model
performs on the training set.

set.seed(3)
#Finding best K on the training set by iterating through different values of K. The K with the highest
#accuracy will be used for our kknn model that uses the validation dataset.

knn_pred2 <- rep(0, nrow(cctrain))


knn_acc_vector2 <- vector("numeric")
for (K in 1:50) {
for (i in 1:nrow(cctrain)) {
knn_model2 <- kknn(cctrain[-i,11]~.,
cctrain[-i,1:10],
cctrain[i,1:10],
k = K,
kernel = "optimal",
scale = TRUE)
knn_pred2[i] <- round(fitted(knn_model2))
knn_acc2 <- sum(knn_pred2 == cctrain[,11]) / nrow(cctrain)

}
knn_acc_vector2 <- c(knn_acc_vector2,knn_acc2)
}

plot(knn_acc_vector2)
max(knn_acc_vector2) # 84.667% Accurate!

## [1] 0.8466667

which.max(knn_acc_vector2) #K value of 10 is the most accurate

## [1] 10

K = 10 had the highest accuracy on the training set, at 84.667%. Let's see how this performs on
the validation set.

set.seed(3)

knn_pred3 <- rep(0, nrow(ccvalid))


knn_acc_vector3 <- vector("numeric")
for (i in 1:nrow(ccvalid)) {
  knn_model3 <- kknn(ccvalid[-i,11]~.,
                     ccvalid[-i,1:10],
                     ccvalid[i,1:10],
                     k = 10,
                     kernel = "optimal",
                     scale = TRUE)
  knn_pred3[i] <- round(predict(knn_model3))
}
knn.acc <- sum(knn_pred3 == ccvalid[,11])/ nrow(ccvalid)

knn.acc # 85.7% accurate with the validation set

## [1] 0.8571429

The KKNN model performed slightly better on the validation set (85.71%) than on the training set (84.67%). Since the
SVM model still performed best on the validation set (87.75% versus 85.71%), we will use our SVM model on the test
data set to see how well our model can predict.

set.seed(3)
ksvm_prediction_test <- predict(ksvm_model2, cctest[,1:10])
ksvm.acc2 <- sum(ksvm_prediction_test == cctest[,11]) / nrow(cctest)
ksvm.acc2 # 82.07% accurate on the test set!

## [1] 0.8207547

Conclusion
The KSVM model is a better model for our data due to its higher performance on the validation set compared to
KNN. Using our KSVM model on the test set, our model is accurate about 82.07% of the time, down from the
87.11% accuracy we observed on the training set.
