FOCUS: Towards Universal Foreground Segmentation

AAAI 2025

1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing

†: Corresponding author, *: These authors contributed equally.

Abstract

Foreground segmentation is a fundamental task in computer vision that encompasses a variety of subdivision tasks. Previous research typically designs task-specific architectures for each of these tasks, resulting in a lack of a unified framework. Moreover, these methods primarily focus on recognizing foreground objects without effectively distinguishing the foreground from the background. In this paper, we argue that the background and its relationship with the foreground matter. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework, which can handle multiple foreground tasks. We develop a multi-scale semantic network that uses the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method that integrates a contrastive learning strategy to refine the predicted masks in a multi-modal feature space. We conduct extensive experiments on 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms state-of-the-art task-specific models on most metrics.

Pipeline

Overview of the FOCUS framework

To represent the foreground and background universally, we borrow the object-query concept from DETR and introduce ground queries. We adopt a multi-scale strategy to extract image features that are fed to the transformer decoder, where masked attention lets the ground queries focus on the features relevant to the foreground and the background, respectively. The feature map obtained from the backbone initializes the masked attention and serves as a localization prior. During this process, the ground queries adapt to learn features relevant to the context of different tasks, making them universal representations.
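The sketch below illustrates, in PyTorch-style pseudocode, how ground queries could attend to one scale of image features under masked attention, with the attention mask derived from a coarse mask prediction (at the first layer, the backbone prior). It is a minimal illustration under stated assumptions: the module name GroundQueryDecoderLayer, the layer ordering, and the 0.5 threshold are ours, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundQueryDecoderLayer(nn.Module):
    """One masked-attention decoder layer: ground queries attend only to
    pixels kept by the current (coarse) mask prediction."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, feats, attn_mask):
        # queries:   (B, Q, C) ground queries (foreground + background)
        # feats:     (B, HW, C) one scale of image features
        # attn_mask: (B * heads, Q, HW), True where attention is blocked
        q = self.norm1(queries + self.cross_attn(queries, feats, feats, attn_mask=attn_mask)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))

def make_attn_mask(mask_logits, num_heads):
    # mask_logits: (B, Q, H, W) coarse per-query mask; at the first layer this
    # comes from the backbone feature map acting as a localization prior.
    B, Q, H, W = mask_logits.shape
    attn_mask = mask_logits.flatten(2).sigmoid() < 0.5      # block low-confidence pixels
    attn_mask[attn_mask.all(-1)] = False                    # never block an entire row
    return attn_mask.repeat_interleave(num_heads, dim=0)    # (B * heads, Q, HW)

In a full pipeline, such a layer would be applied repeatedly across the multi-scale features, with the attention mask re-derived from each intermediate mask prediction.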

To fully leverage the background information in images, we employ a contrastive learning strategy. We propose the CLIP refiner, which uses CLIP's powerful multi-modal alignment ability to correct the masks generated by the previous modules: we fuse the mask with the image and align the fused image with its corresponding text in the multi-modal feature space. This not only refines the edges of the mask but also accentuates the distinction between foreground and background. We treat foreground segmentation and background segmentation as two independent tasks, and at inference the foreground and background probability maps jointly determine the boundary of the MoI.
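Below is a hedged sketch of how a CLIP-based refinement loss and the foreground/background combination at inference could look. The prompt wording, the fusion by element-wise multiplication, and the averaging rule in combine_fg_bg are illustrative assumptions, not the paper's exact formulation; the code uses the OpenAI clip package (clip.load, clip.tokenize, encode_image, encode_text).

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP; open_clip offers an equivalent API

# Illustrative prompts (assumption): one text per ground query type.
PROMPTS = ["a photo of the foreground object", "a photo of the background"]

def clip_refine_loss(images, fg_prob, bg_prob, clip_model):
    # images:  (B, 3, 224, 224), already CLIP-preprocessed (resized + normalized);
    #          clip_model is assumed to be cast to float32 (clip_model.float()).
    # fg_prob, bg_prob: (B, 1, H, W) probability maps from the decoder.
    fg = F.interpolate(fg_prob, size=images.shape[-2:], mode="bilinear", align_corners=False)
    bg = F.interpolate(bg_prob, size=images.shape[-2:], mode="bilinear", align_corners=False)
    fused = torch.cat([images * fg, images * bg], dim=0)              # mask-image fusion
    img_emb = F.normalize(clip_model.encode_image(fused), dim=-1)
    txt_emb = F.normalize(clip_model.encode_text(clip.tokenize(PROMPTS).to(images.device)), dim=-1)
    logits = 100.0 * img_emb @ txt_emb.t()                            # (2B, 2) image-text similarity
    targets = torch.arange(2, device=images.device).repeat_interleave(images.size(0))
    return F.cross_entropy(logits, targets)                           # pull each fused image toward its prompt

def combine_fg_bg(fg_prob, bg_prob):
    # At inference both maps jointly decide the boundary; one simple rule
    # (assumption) is to average the foreground probability with the
    # complement of the background probability.
    return 0.5 * (fg_prob + (1.0 - bg_prob)) > 0.5

The contrastive objective pushes the foreground-masked image toward the foreground prompt and the background-masked image toward the background prompt, which is one way to realize the alignment described above.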

Results

BibTeX

@inproceedings{you2025focus,
  title     = {{FOCUS}: Towards Universal Foreground Segmentation},
  author    = {You, Zuyao and Kong, Lingyu and Meng, Lingchen and Wu, Zuxuan},
  booktitle = {AAAI},
  year      = {2025},
}