
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis

by Andreas Wålinder




Institution: Linnæus University
Department:
Year: 2014
Keywords: classification; logistic regression; random forest; metadata; Natural Sciences; Mathematics; Applied Mathematics Programme, 180 credits; Mathematical statistics
Record ID: 1371437
Full text PDF: http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126


Abstract

Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest, and compare them on the basis of prediction accuracy and metadata analysis. Both models were trained on 25 diverse datasets, and their prediction accuracies were calculated using RapidMiner. We also collected metadata for each dataset: the number of observations, the number of predictor variables and the number of classes in the response variable.

The prediction accuracies of logistic regression and random forest are significantly correlated, with a correlation coefficient of 0.60 and a confidence interval of [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by the choice of model.

Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant: Student's t-test gives a p-value of 0.088.

Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with the performance of either model. The regression of logistic regression performance on the metadata has a p-value of 0.66, and the regression of random forest performance on the metadata has a p-value of 0.89.

We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.
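The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual procedure, which used RapidMiner and is not detailed here. The accuracy values below are invented purely for illustration. The Fisher z-transform for the confidence interval, the 95% level and the paired form of the t statistic are all assumptions, since the abstract does not say which variants were used.

```python
import math
from statistics import NormalDist, mean, stdev


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z-transform.

    Assumption: the abstract does not state how its interval was computed.
    """
    z = math.atanh(r)                      # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


def paired_t_statistic(x, y):
    """t statistic for the mean of the per-dataset differences x - y.

    Assumption: a paired test; the abstract only says "Student's t-test".
    """
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Hypothetical per-dataset accuracies (NOT the thesis data).
lr_acc = [0.71, 0.64, 0.88, 0.93, 0.55, 0.79, 0.82, 0.60]
rf_acc = [0.78, 0.70, 0.90, 0.95, 0.66, 0.77, 0.91, 0.72]

r = pearson_r(lr_acc, rf_acc)
lo, hi = fisher_ci(r, len(lr_acc))
t = paired_t_statistic(rf_acc, lr_acc)

print(f"correlation r = {r:.3f}, interval = [{lo:.3f}, {hi:.3f}]")
print(f"paired t statistic = {t:.3f} with {len(lr_acc) - 1} degrees of freedom")
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide. A statistics package such as SciPy could supply that step.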