The goal of this work is to create a model that can identify how well a participant in the experiment performs a particular exercise. Each participant was asked to perform the exercise in 5 different fashions relative to the experiment specification, labeled A through E: exactly according to the specification (A), throwing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D) and throwing the hips to the front (E).

More about the experiment can be read here.

Acquiring Data

After downloading the data, it can be loaded into R:

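# Treat literal NA, empty strings and spreadsheet division errors ('#DIV/0!') as missing values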
training <- read.csv(file = 'pml-training.csv', header = TRUE, sep = ',', na.strings=c("NA","","#DIV/0!"))
testing <- read.csv(file = 'pml-testing.csv', header = TRUE, sep = ',', na.strings=c("NA","","#DIV/0!"))

Exploratory Data Analysis

Both the training and testing datasets have 160 variables/features each.

length(names(training))
## [1] 160
length(names(testing))
## [1] 160

The variables in the dataset correspond to measurements acquired by the arm, forearm, belt and dumbbell sensors. For each sensor, the x, y and z readings were monitored, as well as its accelerations and other additional measures.
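
For instance, the variables associated with one sensor can be listed with a simple pattern match (a quick inspection sketch; the pattern and the number of names shown are arbitrary choices):

head(grep('belt', names(training), value = TRUE), 12)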

As the goal is to predict the correctness of the exercise, we don't need all the variables in the dataset, just those related to the x, y and z axes and their accelerations. Since the dataset already contains the total acceleration for each sensor, we use it because it represents the resultant acceleration; this way we can discard the individual axis accelerations.
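
As a sanity check, total_accel_* should track the resultant (Euclidean) magnitude of the three axis accelerations; this assumes the feature was derived that way, possibly up to a sensor scale factor:

with(training, head(cbind(total_accel_belt,
                          resultant = sqrt(accel_belt_x^2 + accel_belt_y^2 + accel_belt_z^2))))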

The features selected for our model (17 in total) were:

features <- which(sapply(X = names(training), FUN = grepl, pattern = '^total_accel|gyros|classe'))
features
##     total_accel_belt         gyros_belt_x         gyros_belt_y 
##                   11                   37                   38 
##         gyros_belt_z      total_accel_arm          gyros_arm_x 
##                   39                   49                   60 
##          gyros_arm_y          gyros_arm_z total_accel_dumbbell 
##                   61                   62                  102 
##     gyros_dumbbell_x     gyros_dumbbell_y     gyros_dumbbell_z 
##                  113                  114                  115 
##  total_accel_forearm      gyros_forearm_x      gyros_forearm_y 
##                  140                  151                  152 
##      gyros_forearm_z               classe 
##                  153                  160
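# Note: in the testing set, column 160 is problem_id rather than classe,
# but only the predictor columns are used for prediction later on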
training <- training[, features]
testing <- testing[, features]

Building The Model

The study design used 70% of the observations in the training dataset to train the model and the remaining 30% for cross-validation.

library(caret)
index <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
validation <- training[-index,]
training <- training[index,]
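
Since createDataPartition performs stratified sampling on classe, the class proportions should be nearly identical in both subsets; a quick check:

round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)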

To build the model, the doParallel library was used so that processing could be parallelized. The algorithm chosen was the random forest.

library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(32343)
modelFit <- train(classe ~ ., data = training, method = 'rf')
stopCluster(cl)
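
Note that train defaults to bootstrap resampling (25 repetitions), which dominates the training time. As a sketch of a cheaper alternative (the object name and the number of folds here are arbitrary choices), explicit k-fold cross-validation can be requested instead:

ctrl <- trainControl(method = 'cv', number = 5, allowParallel = TRUE)
modelFitCV <- train(classe ~ ., data = training, method = 'rf', trControl = ctrl)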

modelFit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.88%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3808   24   33   35    6  0.02508961
## B   93 2496   48   11   10  0.06094808
## C   51   35 2273   32    5  0.05133556
## D   71   21   88 2051   21  0.08925400
## E   13   29   35    9 2439  0.03405941

As shown above, the best tuning found for the algorithm was to split the trees using two randomly selected predictors at each node (mtry = 2).
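
The selected value and the resampled accuracy of each candidate can be read directly from the caret object:

modelFit$bestTune
modelFit$results[, c('mtry', 'Accuracy', 'Kappa')]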

The error rate for each class decreased as trees were added and flattened out well before 500 trees, though classes B, C and D remained above 5%:

plot(modelFit$finalModel, main= "Error Rates")

The accuracy of the candidate models evaluated during tuning, one per value of mtry, was:

plot(modelFit, main = "Model Accuracy", xlab = "Predictors", 
     ylab = "Accuracy")

After building the model, predictions on the validation set were generated to cross-validate it:

validation$prediction <- predict(modelFit, validation)
confusionMatrix(validation$prediction, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1645   32   26   36    6
##          B    5 1071   16    5   15
##          C   10   32  962   41   10
##          D   13    3   18  875    2
##          E    1    1    4    7 1049
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9519          
##                  95% CI : (0.9461, 0.9572)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9391          
##  Mcnemar's Test P-Value : 8.985e-12       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9827   0.9403   0.9376   0.9077   0.9695
## Specificity            0.9763   0.9914   0.9809   0.9927   0.9973
## Pos Pred Value         0.9427   0.9631   0.9118   0.9605   0.9878
## Neg Pred Value         0.9930   0.9858   0.9867   0.9821   0.9932
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2795   0.1820   0.1635   0.1487   0.1782
## Detection Prevalence   0.2965   0.1890   0.1793   0.1548   0.1805
## Balanced Accuracy      0.9795   0.9658   0.9592   0.9502   0.9834

The confusion matrix above shows that the model predicted the validation classes with an accuracy of 0.9519, and its Kappa value, the agreement between true and predicted classes beyond chance, was about 0.9391. The expected out of sample error is therefore about 4.8% (1 - accuracy), consistent with the OOB estimate reported during training.
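
The same accuracy can be computed directly from the validation predictions:

mean(validation$prediction == validation$classe)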

Predict the 20 Samples

As asked in the second part of the project, the model was used to predict the classes of 20 test samples. The results were:

testing$prediction <- predict(modelFit, testing)
as.character(testing$prediction)
##  [1] "B" "A" "C" "A" "A" "E" "D" "B" "A" "A" "A" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
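
If one answer file per sample is needed for submission (assuming the course's one-file-per-problem format), a minimal sketch:

answers <- as.character(testing$prediction)
for (i in seq_along(answers)) {
    write.table(answers[i], file = paste0('problem_id_', i, '.txt'),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
}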