Using Random Forest and Logistic Regression
In this project, I will use two machine learning models, Random Forest and Logistic Regression, to predict heart disease. This is a simple setup in which I use all available variables as predictors, and I won't delve into the problem much beyond the estimation itself. In reality, establishing causal inference requires subject-matter expertise and looking beyond the data. However, today's goal is simply to demonstrate and practice running these two machine learning models.
Importing libraries in R
library(ggplot2)
library(cowplot)
library(randomForest)
library(tidymodels)
library(ROCR)
Using the UCI heart disease dataset
I will be using the UCI heart disease database. It's well known, well structured, and needs only light cleaning before use.
rm(list=ls())
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
Cleaning the data
The cleaning steps below follow Josh Starmer's StatQuest YouTube video. Check out his machine learning videos on YouTube; they are the best resource for learning machine learning.
colnames(data) <- c(
"age",
"sex",# 0 = female, 1 = male
"cp", # chest pain
# 1 = typical angina,
# 2 = atypical angina,
# 3 = non-anginal pain,
# 4 = asymptomatic
"trestbps", # resting blood pressure (in mm Hg)
"chol", # serum cholestoral in mg/dl
"fbs", # fasting blood sugar if less than 120 mg/dl, 1 = TRUE, 0 = FALSE
"restecg", # resting electrocardiographic results
# 1 = normal
# 2 = having ST-T wave abnormality
# 3 = showing probable or definite left ventricular hypertrophy
"thalach", # maximum heart rate achieved
"exang", # exercise induced angina, 1 = yes, 0 = no
"oldpeak", # ST depression induced by exercise relative to rest
"slope", # the slope of the peak exercise ST segment
# 1 = upsloping
# 2 = flat
# 3 = downsloping
"ca", # number of major vessels (0-3) colored by fluoroscopy
"thal", # this is short of thalium heart scan
# 3 = normal (no cold spots)
# 6 = fixed defect (cold spots during rest and exercise)
# 7 = reversible defect (when cold spots only appear during exercise)
"hd" # (the predicted attribute) - diagnosis of heart disease
# 0 = less than 50% diameter narrowing (no disease)
# 1-4 = more than 50% diameter narrowing (disease)
)
str(data)
data[data == "?"] <- NA #some values are ? which we will make NA
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)
# converting integer-coded variables to factors
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$ca <- as.integer(data$ca) # since this column had "?"s in it, R did not read it as a number; convert to integer first...
data$ca <- as.factor(data$ca) # ...then convert the integers to factor levels
data$thal <- as.integer(data$thal) # "thal" also had "?"s in it.
data$thal <- as.factor(data$thal)
## This next line replaces 0 with "Healthy" and any value greater than 0 (1-4) with "Unhealthy"
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd) # Now convert to a factor
Training and Testing Sets
To divide the data into training and test sets, I used initial_split from the tidymodels library in R. It's a handy library with built-in modeling functions for machine learning. However, for this project, I will use the base glm function for the logistic model and the randomForest package for the Random Forest model, since I still need to learn more about tidymodels myself.
set.seed(2059)
df <- data %>% drop_na()
split_data <- initial_split(df,strata = hd)
train <- training(split_data)
test <- testing(split_data)
Baseline Accuracy
table(df$hd)
print(160/297)
print(137/297)
Out of 297 people, 160 are healthy and 137 are unhealthy. By always guessing "healthy," we would be correct about 54% of the time, and by always guessing "unhealthy," about 46% of the time. Therefore, roughly 54% is our baseline accuracy, which we will try to beat with our models.
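As a quick aside (my own addition, not part of the original workflow), the same baseline can be computed directly from the class proportions instead of hard-coding the counts:
# Class proportions in the cleaned data
prop.table(table(df$hd))
# Baseline accuracy: always guess the majority class
max(prop.table(table(df$hd)))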
Using Random Forest for Prediction
Note that I have used a nested loop to fine-tune the model: it searches over mtry (the number of variables tried at each split) and ntree (the number of trees) and keeps the combination with the lowest out-of-bag error. Tuning is important for getting a better model. However, since Random Forest relies on bootstrap sampling (bagging), the best values on your machine and mine may differ unless the same seed is used; the larger the dataset, the more consistent the results are likely to be.
set.seed(2059)
ntree_values <- seq(100, 1000, by = 100)   # candidate numbers of trees
tree <- NA      # best ntree found so far
iter <- NA      # best mtry found so far
optimal <- Inf  # lowest out-of-bag (OOB) error found so far
for (i in 1:10) {                        # i = mtry, variables tried at each split
  for (j in seq_along(ntree_values)) {   # j indexes the candidate ntree values
    temp <- randomForest(hd ~ ., data = train, mtry = i, ntree = ntree_values[j])
    optimal_temp <- temp$err.rate[nrow(temp$err.rate), 1]   # final OOB error rate
    if (optimal_temp < optimal) {
      tree <- ntree_values[j]
      iter <- i
      optimal <- optimal_temp
    }
  }
}
paste('Best ntree is', tree)
# Output is 400
paste('Best mtry is', iter)
# Output is 4
The output of random forest using the best parameters is as follows.
rf_model <- randomForest(hd ~ ., data = train, mtry = iter, ntree = tree) # refit with the tuned mtry and ntree
rf_model
Although the tuning is already done, I will still plot the error rates to see how the random forest performs as each additional tree is added.
error_data <- as.data.frame(rf_model$err.rate)
error_data <- error_data %>% pivot_longer(
  cols = c(OOB, Healthy, Unhealthy),
  names_to = 'Type',
  values_to = 'Value'
)
error_data$Tree <- rep(seq(1, tree), each = 3) # pivot_longer turns each tree's row into 3 long-format rows
head(error_data)
rf <- randomForest(hd ~ ., data = train, mtry = iter, ntree = 1000)
error_df <- as.data.frame(rf$err.rate)
error_df <- error_df %>% pivot_longer(
  cols = c(OOB, Healthy, Unhealthy),
  names_to = 'Type',
  values_to = 'Value'
)
error_df$Tree <- rep(seq(1, 1000), each = 3)
ggplot(data = error_df, aes(x = Tree, y = Value)) +
geom_line(aes(color = Type)) + theme_classic()
Apart from that, let's examine which variables are most important for predicting heart disease. Again, this ranking is based entirely on the data and the model and should not be taken as expert medical opinion.
varImpPlot(rf_model)
‘ca’, the number of major vessels (0-3) colored by fluoroscopy, appears to be the most important variable, followed by ‘cp’ and ‘thalach’.
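If you want the exact numbers behind the plot (a small addition of mine), the importance function from the randomForest package returns the mean decrease in Gini for each variable:
importance(rf_model) # numeric importance scores underlying varImpPlot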
rf_predictions <- predict(rf_model,test,type="response")
table(rf_predictions,test$hd)
confusionMatrix <- table(rf_predictions,test$hd)
sum(diag(confusionMatrix)) / sum(confusionMatrix)
#Output is 0.773
The test-set accuracy is about 0.773, comfortably above the roughly 54% baseline.
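As another quick sanity check, not part of the original workflow, we can compare this test-set accuracy with the out-of-bag estimate that randomForest reports; the two should be broadly similar:
# Final out-of-bag error and the implied OOB accuracy
oob_error <- rf_model$err.rate[nrow(rf_model$err.rate), "OOB"]
1 - oob_error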
Logistic Regression
Now I will use logistic regression to build the model and make predictions. The logistic regression model is available in base R.
lr <- glm(data=train, hd~., family = binomial)
summary(lr)
Now let me make predictions with the logistic model. This model will predict the probability of heart disease. The value of any probability lies between 0 and 1. I need to carefully choose a threshold that provides the best accuracy for the model.
lr_predictions <-predict(lr,test,type = 'response')
ROCRpred <- prediction(lr_predictions,test$hd)
ROCRperf <- performance(ROCRpred, "tpr",'fpr')
plot(ROCRperf,colorize=T,print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.5,2))
threshold <- seq(0.1, 1, 0.1)
a <- c() # accuracy at each threshold
t <- c() # the thresholds themselves
for (i in threshold) {
  lr_pred <- ifelse(lr_predictions <= i, "Pred_Healthy", "Pred_Unhealthy")
  confusionmatrix <- table(lr_pred, test$hd)
  acc <- sum(diag(confusionmatrix)) / sum(confusionmatrix)
  a <- c(a, acc)
  t <- c(t, i)
}
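To see which cutoff wins, we can tabulate the accuracies collected in the loop (this summary is my own addition):
data.frame(threshold = t, accuracy = round(a, 3)) # accuracy at each cutoff
t[which.max(a)] # threshold with the highest accuracy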
Given thresholds tested in increments of 0.1, the best accuracy comes at a threshold of 0.7.
Prediction with a threshold of 0.7
lr_pred <- ifelse(lr_predictions <=0.7,"Pred_Healthy","Pred_Unhealthy")
confusionmatrix <- table(lr_pred, test$hd)
confusionmatrix
However, I can see that several truly unhealthy patients are being classified as healthy. In this kind of scenario, maximizing accuracy may not be the best approach. Instead, we can look at sensitivity (the true positive rate) and specificity, which together tell us how often unhealthy patients are missed.
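A small helper function, which I am adding here for illustration and which assumes both predicted classes appear in the table, makes these two rates easy to read off the confusion matrix, treating "Unhealthy" as the positive class:
sens_spec <- function(cm) {
  c(sensitivity = cm["Pred_Unhealthy", "Unhealthy"] / sum(cm[, "Unhealthy"]), # unhealthy patients we catch
    specificity = cm["Pred_Healthy", "Healthy"] / sum(cm[, "Healthy"]))       # healthy patients we clear
}
sens_spec(confusionmatrix) # rates at the 0.7 threshold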
Here, I will try a threshold of 0.5. The ROC curve above showed a reasonable trade-off around that point, so it is a good starting place. My goal is a model that rarely labels unhealthy cases as healthy; I am more lenient about false positives, because predicting a healthy person as unhealthy is a less severe mistake.
lr_pred <- ifelse(lr_predictions <=0.5,"Pred_Healthy","Pred_Unhealthy")
confusionmatrix <- table(lr_pred, test$hd)
confusionmatrix
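Using the helper defined above (again, my own addition), we can put numbers on the trade-off at this threshold:
sum(diag(confusionmatrix)) / sum(confusionmatrix) # accuracy at the 0.5 threshold
sens_spec(confusionmatrix) # sensitivity and specificity at the 0.5 threshold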
The accuracy with a threshold of 0.5 drops slightly, but we make far fewer errors in identifying unhealthy patients. Accuracy is not always the best metric; it depends on what we want to achieve. Diagnosing a healthy person as unhealthy is less severe than failing to diagnose an unhealthy person and potentially letting them suffer or die, so it is better to reduce the latter error even at the cost of some overall accuracy.
So here, I used two models, Random Forest and Logistic Regression, and both delivered solid accuracy well above the baseline. I hope you can replicate these methods in your own work. Thanks for reading my post.