Roman Hornung
Dr.
* Former Member
Background: Identifying relevant biomarkers is critical in clinical research and precision medicine, particularly when analysing high-dimensional data. Random forests (RFs) are promising for such settings due to their flexibility, ease of use, and their ability to handle data sets with more variables than samples. RFs assess the importance of each variable in predicting the outcome using variable importance (VIMP) scores. However, since the distribution of VIMP scores is intricate, standard statistical testing and multiple testing adjustments for variable selection are challenging.<br>Methods: We propose shadowVIMP, a novel method for multiple testing-controlled variable selection, based on an approach similar to permutation testing. It generates permuted counterparts for each variable and compares their VIMPs with those of the original variables over multiple iterations to calculate p-values. Unlike conventional permutation testing, shadowVIMP preserves the correlation structure between variables, mitigating biases caused by the over-selection of correlated variables in RFs. We evaluated shadowVIMP against three competing RF variable selection approaches using simulation designs previously employed in studies considering VIMPs and variable selection for RFs. These designs included high- and low-dimensional data, as well as correlated and categorical variables. For illustration, we also applied the method to a real-world example on Alzheimer's disease.<br>Conclusions: Our results showed that, compared to competing approaches, shadowVIMP offers advantages in high-dimensional settings, improving sensitivity while enabling multiple testing-adjusted results. Additionally, it demonstrated robustness against VIMP biases induced by correlated and categorical variables when using permutation-based VIMP. 
The method can be used to annotate standard VIMP plots, visually marking the variable sets selected under different multiple testing adjustments and significance levels. Overall, shadowVIMP is a promising approach for multiple testing-adjusted variable selection that explicitly addresses known biases of the permutation-based VIMP measure of RFs. The method is implemented in the R package shadowVIMP, which is available on CRAN.
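The shadow-variable idea described under Methods can be illustrated with a minimal Python sketch. It is not the shadowVIMP package or its API, and several parts are assumptions made for illustration: the function names `abs_corr` and `shadow_pvalues` are hypothetical, the absolute Pearson correlation stands in for a random forest VIMP so that the example runs with the standard library alone, and the p-value is a generic permutation p-value rather than the exact rule used by the authors. The one feature taken from the abstract is that all shadow variables are produced by a single shared row permutation, so they keep their mutual correlation structure while losing any association with the outcome.

```python
import random
import statistics


def abs_corr(x, y):
    """Absolute Pearson correlation: a stand-in importance score, NOT an RF VIMP."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_pvalues(columns, y, n_iter=200, seed=0):
    """Hypothetical helper: permutation-style p-value per variable (assumed rule)."""
    rng = random.Random(seed)
    n = len(y)
    observed = [abs_corr(col, y) for col in columns]
    exceed = [0] * len(columns)
    for _ in range(n_iter):
        # One shared row permutation for every column: the shadows stay
        # correlated with each other but are detached from the outcome.
        perm = list(range(n))
        rng.shuffle(perm)
        for j, col in enumerate(columns):
            shadow = [col[i] for i in perm]
            # Count iterations where the shadow scores at least as high
            # as the original variable.
            if abs_corr(shadow, y) >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps every p-value strictly positive.
    return [(1 + e) / (1 + n_iter) for e in exceed]


# Toy data: one informative variable and one pure-noise variable.
data_rng = random.Random(1)
n_samples = 100
x_signal = [data_rng.gauss(0, 1) for _ in range(n_samples)]
x_noise = [data_rng.gauss(0, 1) for _ in range(n_samples)]
y = [2 * a + data_rng.gauss(0, 1) for a in x_signal]

pvals = shadow_pvalues([x_signal, x_noise], y)
print(pvals)
```

In this sketch the resulting p-values could be passed to any standard multiple testing adjustment before selecting variables; replacing `abs_corr` with a forest-based importance would bring it closer to the setting the abstract describes.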
article MHS+26
BibTeXKey: MHS+26