
GRAtt-VIS: Gated Residual Attention for Video Instance Segmentation

MCML Authors

Abstract

Recent trends in Video Instance Segmentation (VIS) show a growing reliance on online methods to model complex and lengthy video sequences. However, representation degradation and noise accumulation in online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation offers a promising direction, at the cost of quadratic-memory attention. Moreover, it is susceptible to degraded instance features under the aforementioned challenges and suffers from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features with their past representations. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction that uses the gate activation as a mask for self-attention. This masking strategy dynamically restricts unrepresentative instance queries in self-attention and preserves vital information for long-term tracking. We refer to this novel combination of gated residual connection and masked self-attention as the GRAtt block, which can easily be integrated into existing propagation-based frameworks. Furthermore, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods.
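The three steps described in the abstract (Gumbel-Softmax gating, gated residual fallback to past features, and gate-masked self-attention) lend themselves to a compact sketch. The following PyTorch code is a minimal, hypothetical illustration under standard assumptions; names such as GRAttBlock, gate_proj, curr_q, and prev_q are invented for this example and do not reflect the authors' implementation.

```python
# Hypothetical sketch of a GRAtt-style block, based only on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRAttBlock(nn.Module):
    """Gated residual connection + gate-masked self-attention (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 2)  # logits for a binary keep/revert gate
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr_q: torch.Tensor, prev_q: torch.Tensor) -> torch.Tensor:
        # curr_q, prev_q: (batch, num_queries, dim) instance queries for the
        # current frame and the propagated queries from the previous frame.

        # 1) Gumbel-Softmax gate: a hard {0, 1} decision per instance query,
        #    kept differentiable via the straight-through estimator.
        logits = self.gate_proj(curr_q)
        gate = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., :1]  # (B, N, 1)

        # 2) Gated residual: keep the current feature where the gate is open;
        #    otherwise fall back to the (presumably cleaner) past representation,
        #    so no dedicated external memory is needed.
        q = gate * curr_q + (1.0 - gate) * prev_q

        # 3) Masked self-attention: gated-off queries are excluded as keys, so
        #    degraded instances do not contaminate inter-instance interaction.
        #    (A real implementation would guard against all keys being masked.)
        key_padding_mask = gate.squeeze(-1) == 0  # True = ignore this key
        attn_out, _ = self.self_attn(q, q, q, key_padding_mask=key_padding_mask)
        return self.norm(q + attn_out)
```

Under this reading, a gated-off query simultaneously reverts to its previous-frame feature and drops out of the key set for self-attention, which matches the abstract's claim that degraded queries are rectified from their past representations while being restricted from influencing other instances.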



ICPR 2024

27th International Conference on Pattern Recognition. Kolkata, India, Dec 01-05, 2024.

Authors

T. Hannan • R. Koner • M. Bernhard • S. Shit • B. Menze • V. Tresp • M. Schubert • T. Seidl

Links

DOI • GitHub

Research Area

A3 | Computational Models

BibTeX Key: HKB+24
