
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

MCML Authors

Abstract

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline.
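The abstract attributes the gains to removing a shared global component from anisotropic visual features. The paper's actual PRISM procedure is not reproduced here; as a hypothetical illustration only, the sketch below shows the underlying idea with simple mean-centering: when every feature vector carries a large shared offset, pairwise cosine similarity is dominated by that offset, and re-centering restores similarities that reflect sample-specific content. The `recenter` and `mean_cosine` helpers are made up for this demonstration.

```python
import numpy as np

def recenter(features: np.ndarray) -> np.ndarray:
    # Subtract the dataset-mean vector so that pairwise similarities
    # reflect per-sample content rather than a shared global offset.
    # (Illustrative stand-in; not the paper's implicit re-centering.)
    return features - features.mean(axis=0, keepdims=True)

def mean_cosine(x: np.ndarray) -> float:
    # Average off-diagonal cosine similarity across all pairs.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    s = x @ x.T
    n = len(x)
    return float((s.sum() - np.trace(s)) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Toy "anisotropic" features: small per-sample signal plus a large
# shared offset that all vectors have in common.
feats = rng.normal(scale=0.1, size=(200, 32)) + 5.0

print(mean_cosine(feats) > 0.95)              # shared offset dominates
print(abs(mean_cosine(recenter(feats))) < 0.1)  # drift removed
```

Running this prints `True` twice: before re-centering, all vectors point in nearly the same direction (mean cosine close to 1), and after re-centering the residual similarities scatter around zero.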



Preprint

Feb. 2025

Authors

J. Bi • Y. Wang • D. Yan • Aniri • W. Huang • Z. Jin • X. Ma • A. Hecker • M. Ye • X. Xiao • H. Schütze • V. Tresp • Y. Ma

Links

arXiv • GitHub

Research Areas

A3 | Computational Models

B2 | Natural Language Processing

BibTeX Key: BWY+25
