Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

Fudan University

Abstract

Overview of the segmentation ability of Seg-R1.

We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in a next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 reaches a 0.873 S-measure on COD10K with purely RL-based training and without complex model modifications. Moreover, we find that pure RL training demonstrates strong open-world generalization: despite being trained solely on foreground segmentation data without text annotations, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg val and 71.0 cIoU on ReasonSeg val, comparable to many fully supervised baselines.

Pipeline

Overview of the training pipeline of Seg-R1.

We propose a new paradigm that leverages RL to equip LMMs with segmentation capabilities. We introduce Seg-R1, a simple yet effective framework for pixel-level learning. Our approach is built upon Qwen-2.5-VL-3B and SAM2, where Qwen-2.5-VL-3B is trained to generate bounding box and point prompts to guide SAM2 in producing segmentation masks. We incorporate GRPO into the segmentation task, requiring the model to output the reasoning process and mask prompts explicitly. To guide learning, we design a reward function that combines a format reward with a segmentation reward based on IoU and S-Measure, striking a balance between global accuracy and fine-grained structural fidelity.
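To make the reward design concrete, here is a minimal sketch, in Python, of how a GRPO reward combining a format term with an IoU plus S-measure segmentation term could be composed. The function names (compute_reward, format_reward), the <think>/<answer> output layout, and the term weights are illustrative assumptions rather than the released implementation, and the s_measure helper is only a crude placeholder for a proper structure-measure implementation.

import re
import numpy as np

def iou(pred, gt):
    # Intersection-over-union between two binary masks of shape (H, W).
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def s_measure(pred, gt):
    # Placeholder for the structure measure (S-measure); swap in an
    # off-the-shelf SOD metrics implementation for faithful scoring.
    return 1.0 - np.abs(pred.astype(float) - gt.astype(float)).mean()

def format_reward(completion):
    # 1 if the completion follows the assumed <think>...</think><answer>...</answer> layout.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def compute_reward(completion, pred_mask, gt_mask, w_fmt=0.5, w_iou=1.0, w_s=1.0):
    # Format reward plus a segmentation reward balancing global accuracy (IoU)
    # and fine-grained structural fidelity (S-measure).
    seg = w_iou * iou(pred_mask, gt_mask) + w_s * s_measure(pred_mask, gt_mask)
    return w_fmt * format_reward(completion) + seg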

To explore how far pure RL can drive segmentation in LMMs, we adopt a two-stage RL training strategy. Seg-R1 is first pre-trained with GRPO on the high-resolution DIS5K dataset to acquire fundamental knowledge of segmentation structure and formatting, and is then fine-tuned on COD10K to improve both its segmentation precision and reasoning ability. Notably, our method requires no architectural modifications to Qwen-2.5-VL and introduces no special tokens. Seg-R1 autonomously learns to construct annotation trajectories and generate high-quality prompts for SAM2. As a result, it achieves state-of-the-art performance on weakly supervised camouflaged object detection and demonstrates remarkable open-world segmentation capabilities.
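Below is a minimal sketch of the prompt-to-mask interface: parsing the box and point prompts emitted by the LMM and handing them to SAM2. The JSON answer layout, the parse_prompts helper, and the checkpoint name are assumptions for illustration; the SAM2ImagePredictor calls mirror the usage documented in the official SAM2 repository and should be checked against the installed version.

import json
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Assumed answer format: the LMM emits reasoning in <think>...</think> and a JSON
# prompt set in <answer>...</answer>, e.g.
#   <answer>{"boxes": [[x1, y1, x2, y2]], "points": [[x, y]], "labels": [1]}</answer>
def parse_prompts(answer_text):
    payload = json.loads(answer_text.split("<answer>")[1].split("</answer>")[0])
    boxes = np.array(payload.get("boxes", []), dtype=np.float32)
    points = np.array(payload.get("points", []), dtype=np.float32)
    labels = np.array(payload.get("labels", []), dtype=np.int32)
    return boxes, points, labels

# Checkpoint name is illustrative; use whichever SAM2 weights are available locally.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def prompts_to_mask(image_path, answer_text):
    boxes, points, labels = parse_prompts(answer_text)
    image = np.array(Image.open(image_path).convert("RGB"))
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=points if len(points) else None,
        point_labels=labels if len(labels) else None,
        box=boxes[0] if len(boxes) else None,  # single-object case for simplicity
        multimask_output=False,
    )
    return masks[0]  # binary mask scored against ground truth during RL training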

Results

Referring segmentation results: comparison of multi-object referring segmentation in the wild.
COD/SOD segmentation results: comparison of single-object segmentation in the wild.

BibTeX

@article{you2025segr1,
  title     = {{Seg-R1}: Segmentation Can Be Surprisingly Simple with Reinforcement Learning},
  author    = {You, Zuyao and Wu, Zuxuan},
  journal   = {arXiv preprint arXiv:2506.22624},
  year      = {2025}
}