The goal of this work is to build a model that identifies how well a participant in the experiment performs a particular exercise. Participants were asked to perform it 5 times according to the experiment specification as follows:
More about the experiment can be read here.
Once downloaded, the data can be loaded into R:
training <- read.csv(file = 'pml-training.csv', header = TRUE, sep = ',', na.strings=c("NA","","#DIV/0!"))
testing <- read.csv(file = 'pml-testing.csv', header = TRUE, sep = ',', na.strings=c("NA","","#DIV/0!"))
Both the training and testing datasets have 160 variables (features) each:
length(names(training))
## [1] 160
length(names(testing))
## [1] 160
The variables in the dataset are measurements acquired by the arm, forearm, belt and dumbbell sensors. For each sensor, the x, y and z positions were monitored, as well as the accelerations and some additional measures.
As the goal is to predict the correctness of the exercise, we don't need all the variables in the dataset, only those related to the x, y and z axes and the accelerations. Because the dataset contains the total acceleration for each sensor, we use it to represent the resultant acceleration. This way we can discard the individual per-axis accelerations.
The 17 columns selected for our model (16 predictors plus the outcome classe) were:
features <- which(sapply(X = names(training), FUN = grepl, pattern = '^total_accel|gyros|classe'))
features
## total_accel_belt gyros_belt_x gyros_belt_y
## 11 37 38
## gyros_belt_z total_accel_arm gyros_arm_x
## 39 49 60
## gyros_arm_y gyros_arm_z total_accel_dumbbell
## 61 62 102
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## 113 114 115
## total_accel_forearm gyros_forearm_x gyros_forearm_y
## 140 151 152
## gyros_forearm_z classe
## 153 160
training <- training[, features]
testing <- testing[, features]
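The pattern passed to grepl combines three alternatives, and only the first one is anchored to the start of the name. A minimal sketch on a hand-picked set of column names shows which ones the pattern keeps. The vector example_names is illustrative and is not the full header of the file; names such as var_total_accel_belt and accel_belt_x are assumed to follow the dataset's naming scheme.

```r
# Illustrative column names only; not the full header of pml-training.csv.
example_names <- c("total_accel_belt", "var_total_accel_belt",
                   "accel_belt_x", "gyros_arm_y", "roll_belt", "classe")

# Same pattern as in the report: "^" applies to the first alternative only.
pattern <- "^total_accel|gyros|classe"

# grepl is vectorised over the names, so no sapply is needed here.
keep <- grepl(pattern, example_names)
selected <- example_names[keep]
print(selected)
```

Names that merely contain total_accel (for example a variance column) are dropped because of the anchor, while any name containing gyros or classe is kept.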
The study design used 70% of the observations in the training dataset to train the model and held out the remaining 30% for validation.
library(caret)
index <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validation <- training[-index,]
training <- training[index,]
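createDataPartition samples within each level of classe, so both subsets keep roughly the same class proportions. The idea can be sketched in base R as follows. The toy data frame toy and the helper stratified_index are hypothetical names introduced for illustration, and this is not caret's actual implementation. Setting the seed before sampling makes the split reproducible.

```r
set.seed(32343)  # fix the seed before sampling so the split can be reproduced

# Toy outcome with unbalanced classes (made-up data, for illustration only).
toy <- data.frame(classe = factor(rep(c("A", "B", "C"), times = c(100, 60, 40))))

# Hypothetical helper: sample a proportion p of the row indices within each class.
stratified_index <- function(y, p) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

index <- stratified_index(toy$classe, p = 0.7)
toy_train <- toy[index, , drop = FALSE]
toy_valid <- toy[-index, , drop = FALSE]

# Class counts in each subset
print(table(toy_train$classe))
print(table(toy_valid$classe))
```

Each class contributes 70% of its rows to the training subset, so rare classes are not lost from either subset by chance.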
To build the model, the doParallel library was used to parallelize the computation. The algorithm chosen was random forest.
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(32343)
modelFit <- train(classe ~ ., data = training, method = 'rf')
stopCluster(cl)
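One practical detail of the cluster set-up: if the training call fails, stopCluster() is never reached and the worker processes stay alive. A possible pattern, sketched here with the base parallel package and a trivial stand-in task instead of the real model fit, wraps the work in a function and registers the cleanup with on.exit(). The function name run_in_parallel and the two-worker cap are assumptions made for this example.

```r
library(parallel)

# Hypothetical wrapper: start a cluster, run a task, always stop the cluster.
run_in_parallel <- function(task_inputs, task_fun, max_workers = 2L) {
  cores <- detectCores()
  if (is.na(cores)) cores <- 1L          # detectCores() may return NA
  cl <- makeCluster(min(cores, max_workers))
  on.exit(stopCluster(cl), add = TRUE)   # runs even if the task errors
  parSapply(cl, task_inputs, task_fun)
}

# Stand-in task (squaring numbers) in place of the real model fit.
squares <- run_in_parallel(1:4, function(x) x^2)
print(squares)
```

In the report's setting, registerDoParallel(cl) followed by the train() call would take the place of the parSapply() line.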
modelFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.88%
## Confusion matrix:
## A B C D E class.error
## A 3808 24 33 35 6 0.02508961
## B 93 2496 48 11 10 0.06094808
## C 51 35 2273 32 5 0.05133556
## D 71 21 88 2051 21 0.08925400
## E 13 29 35 9 2439 0.03405941
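As shown above, the best-performing model tried two randomly selected predictors at each split (mtry = 2).
The error rate for each class decreased as the forest grew toward 500 trees, but the final out-of-bag (OOB) estimate was still 4.88%, with per-class errors ranging from about 2.5% (class A) to 8.9% (class D):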
plot(modelFit$finalModel, main= "Error Rates")
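The class.error column and the overall OOB estimate in the printed model summary follow directly from its confusion matrix. As a check, the sketch below re-enters the printed counts by hand (the object name oob is introduced here for illustration) and recomputes both quantities.

```r
# OOB confusion matrix copied from the printed output (rows = true class).
oob <- matrix(c(3808,   24,   33,   35,    6,
                  93, 2496,   48,   11,   10,
                  51,   35, 2273,   32,    5,
                  71,   21,   88, 2051,   21,
                  13,   29,   35,    9, 2439),
              nrow = 5, byrow = TRUE,
              dimnames = list(true = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts over the total.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(100 * oob_error, 2))  # 4.88, matching the printed OOB estimate
```

Class D has the largest error because 201 of its 2252 training observations were assigned to other classes, mostly A and C.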
The accuracy of the candidate models evaluated during tuning, from which the optimal one was selected, was:
plot(modelFit, main = "Model Accuracy", xlab = "Predictors",
ylab = "Accuracy")
After building the model, predictions on the held-out validation set were generated to evaluate it:
validation$prediction <- predict(modelFit, validation)
confusionMatrix(validation$prediction, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1645 32 26 36 6
## B 5 1071 16 5 15
## C 10 32 962 41 10
## D 13 3 18 875 2
## E 1 1 4 7 1049
##
## Overall Statistics
##
## Accuracy : 0.9519
## 95% CI : (0.9461, 0.9572)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9391
## Mcnemar's Test P-Value : 8.985e-12
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9827 0.9403 0.9376 0.9077 0.9695
## Specificity 0.9763 0.9914 0.9809 0.9927 0.9973
## Pos Pred Value 0.9427 0.9631 0.9118 0.9605 0.9878
## Neg Pred Value 0.9930 0.9858 0.9867 0.9821 0.9932
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2795 0.1820 0.1635 0.1487 0.1782
## Detection Prevalence 0.2965 0.1890 0.1793 0.1548 0.1805
## Balanced Accuracy 0.9795 0.9658 0.9592 0.9502 0.9834
The confusion matrix above shows that the model predicted the validation classes with an accuracy of 0.9519, and that its Kappa value, which measures the agreement between true and predicted classes beyond chance, was about 0.9391.
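The Accuracy and Kappa lines of the confusionMatrix output can be recomputed from the counts in the matrix itself. The sketch below re-enters the printed validation counts by hand (the object names cm and kappa_hat are illustrative choices) and applies the usual definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the row and column totals.

```r
# Validation confusion matrix from the printed output
# (rows = prediction, columns = reference).
cm <- matrix(c(1645,   32,  26,  36,    6,
                  5, 1071,  16,   5,   15,
                 10,   32, 962,  41,   10,
                 13,    3,  18, 875,    2,
                  1,    1,   4,   7, 1049),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5]))

n   <- sum(cm)
p_o <- sum(diag(cm)) / n                       # observed agreement (accuracy)
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_hat <- (p_o - p_e) / (1 - p_e)

# Sensitivity per class: correct predictions over the reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = p_o, kappa = kappa_hat), 4))  # 0.9519 and 0.9391
print(round(sensitivity, 4))
```

Of the 5885 validation observations, 5602 lie on the diagonal, which gives the accuracy of 0.9519 shown in the printed statistics.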
As requested in the second part of the project, the model was used to predict the classes of 20 test samples. The results were:
testing$prediction <- predict(modelFit, testing)
as.character(testing$prediction)
## [1] "B" "A" "C" "A" "A" "E" "D" "B" "A" "A" "A" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
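A quick way to summarise these 20 predictions is to tabulate them by class. The sketch below re-enters the printed vector by hand (the name predicted is illustrative); in the report's own session, table(testing$prediction) should give the same counts.

```r
# Predictions copied from the printed output above.
predicted <- c("B", "A", "C", "A", "A", "E", "D", "B", "A", "A",
               "A", "C", "B", "A", "E", "E", "A", "B", "B", "B")

# Fixing the factor levels keeps the classes in A-E order,
# even if one of them had not been predicted at all.
counts <- table(factor(predicted, levels = LETTERS[1:5]))
print(counts)
```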