Universiti Teknologi Malaysia Institutional Repository

Variable selection in high dimensional data with interactions

Jaafar, Zuharah and Ismail, Norazlina (2022) Variable selection in high dimensional data with interactions. International Journal of Advances in Soft Computing and its Applications, 14 (2). pp. 152-166. ISSN 2074-8523


Official URL: http://dx.doi.org/10.15849/IJASCA.220720.11

Abstract

Variable selection in high-dimensional settings is a common research area in statistical machine learning, and numerous effective approaches have been developed in recent years to address it. This study presents a double-step variable selection method for data in which pairwise interactions between the explanatory variables exist. The aim is to improve the model's prediction accuracy for the given dataset while choosing the smallest set of explanatory variables, taking the interactions among them into account. A double-step method combining Random Forest (RF) and Adaptive Elastic Net (AENET) was examined on data simulated to mimic potential health effects of environmental contamination. The double-step approach was compared with the single-step adaptive elastic net method and with a two-step method pairing CART with the adaptive elastic net, both when interactions were present in the data and when there were none. Performance was measured using RMSE, R², and the number of variables selected for the final model. The double-step RF+AENET approach produces a simple, constrained model. Despite the complex association between exposure variables, it has the lowest false detection rate for null interactions. The screening and variable reduction carried out in the RF step of the RF+AENET approach effectively retain a set of variables that are correlated with the outcome. The double-step RF+AENET approach predicts better than a single technique and selects a sparse model that is close to the true model. The study concludes that, when pairwise interactions between variables are present in the simulated biological dataset, the double-step technique is the better method for model prediction and parameter estimation.
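The evaluation quantities named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names, the `x1:x2` naming of interaction terms, and the definition of the false detection rate (the share of selected terms that are absent from the true model) are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def add_pairwise_interactions(rows, names):
    """Append the product x_i * x_j for every pair of explanatory variables."""
    pairs = list(combinations(range(len(names)), 2))
    # Hypothetical naming convention: "x1:x2" denotes the x1-by-x2 interaction.
    new_names = list(names) + [f"{names[i]}:{names[j]}" for i, j in pairs]
    new_rows = [list(r) + [r[i] * r[j] for i, j in pairs] for r in rows]
    return new_rows, new_names


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def false_detection_rate(selected, true_terms):
    """Share of selected terms not in the true model (assumed definition)."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_terms)) / len(selected)
```

With three explanatory variables, `add_pairwise_interactions` produces three interaction columns. In a simulation where the true model is known, `false_detection_rate` can be applied to the terms that a selection method keeps.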

Item Type: Article
Uncontrolled Keywords: adaptive elastic net, CART, random forest
Subjects: Q Science > QA Mathematics
Divisions: Science
ID Code: 98760
Deposited By: Narimah Nawil
Deposited On: 02 Feb 2023 08:29
Last Modified: 02 Feb 2023 08:29
