ALPHA: Active Learning With PAC-Bayesian Theory for Android Malware Detection

MCML Authors

Yunru Wang

→ Group Johannes Kinder
Programming Languages and Artificial Intelligence

Debarghya Ghoshdastidar

Prof. Dr.

Core PI

Theoretical Foundations of Artificial Intelligence

Johannes Kinder

Prof. Dr.

Collaborating PI

Programming Languages and Artificial Intelligence

Abstract

Learning-based malware detection for Android is sensitive to multiple forms of distribution drift. Temporal drift includes (i) the emergence of new families and (ii) variant-level evolution within existing families, while spatial drift manifests as (iii) population-level shifts in the overall app distribution. Although recent work applies active learning to mitigate the resulting performance degradation, it commonly relies on margin-based sampling, which prioritizes samples near the decision boundary and lacks theoretical grounding for improving adaptation to the test distribution. To address this limitation, we propose ALPHA, a drift-aware active learning framework guided by PAC-Bayes theory. The PAC-Bayes bound decomposition expresses test-domain error as the combination of the empirical training error and three discrepancy terms capturing, respectively, out-of-support mass, boundary instability, and distributional density mismatch. These components align closely with the three drift types of Android malware, allowing us to derive active learning strategies that are theoretically motivated by the decomposition. Using this perspective, we first examine two commonly used Android malware benchmarks and show that they exhibit substantially different degrees of distribution drift. Evaluating ALPHA, we show that it improves the classification F1-score by 15.4% to 22.8% over uncertainty-based sampling strategies. Further, ALPHA achieves greater gains on the high-drift benchmark, and we validate this relationship through statistical analysis. Finally, through targeted case studies, we provide empirical evidence that connects the PAC-Bayes decomposition to the three forms of drift observed in the evaluated Android malware datasets.
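
For readers unfamiliar with the baseline named in the abstract, margin-based sampling can be sketched as below. This is a generic illustration of the technique that ALPHA is compared against, not ALPHA's own strategy and not the authors' code. The function names, the pool of predicted probabilities, and the labeling budget are all hypothetical.

```python
from typing import List, Sequence


def margin(probs: Sequence[float]) -> float:
    """Gap between the two highest class probabilities for one sample.

    A small gap means the classifier is torn between two classes,
    i.e. the sample lies close to the decision boundary.
    """
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_by_margin(pool_probs: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return the indices of the `budget` samples with the smallest margin."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: margin(pool_probs[i]))
    return ranked[:budget]


# Hypothetical predicted probabilities (benign, malicious) for four unlabeled apps.
pool_probs = [
    [0.95, 0.05],  # confident, far from the boundary
    [0.55, 0.45],  # uncertain
    [0.70, 0.30],
    [0.51, 0.49],  # most uncertain
]

picked = select_by_margin(pool_probs, budget=2)
print(picked)  # indices of the apps that would be sent for labeling
```

The selection depends only on how close each sample sits to the current decision boundary. Nothing in the score reflects how the unlabeled pool's distribution differs from the training data, which is the gap the abstract says ALPHA targets.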

misc HWG+26

Preprint

May 2026

Authors

Y. Han • Y. Wang • D. Ghoshdastidar • J. Kinder

Links

URL GitHub

Research Areas

A1 | Statistical Foundations & Explainability

A3 | Computational Models

BibTeXKey: HWG+26
