GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models


1Fudan University 2ByteDance
Project Lead

Abstract

Overview of GaMMA across multi-turn music conversation, temporal music understanding, and detailed music captioning.

We introduce GaMMA, a large multimodal model that handles both global music understanding and temporal music reasoning within a unified parameter space. Built on a streamlined encoder-decoder paradigm, GaMMA combines language modeling with dual audio experts and a gated fusion mechanism to model both non-temporal musical semantics and time-dependent musical structure. A progressive training pipeline of pretraining, supervised fine-tuning, and reinforcement learning further strengthens instruction following, full-song understanding, and temporal reasoning.

Pipeline

Overview of the model architecture and dataset construction pipeline of GaMMA.

GaMMA jointly models temporal and non-temporal music understanding with dual Whisper-based audio experts: a temporal expert captures time-dependent structure, and a global expert models holistic musical semantics. Their representations are fused through a gated extractor-injector design for full-song reasoning. On the data side, the pipeline first constructs SFT data by segmenting songs into fine-grained musical structure and combining aligned music and lyrics to produce expert-verified music reports and multi-turn QA pairs across diverse musical dimensions. It then builds RL data by selecting moderately difficult questions through rollout-based pass-rate filtering and upgrading them into harder but answer-consistent candidates.
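The rollout-based pass-rate filtering step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the data layout (a mapping from question ID to per-rollout correctness flags), and the 0.25 to 0.75 band that defines "moderately difficult" are all assumptions made for illustration, since the page does not state the actual thresholds or the number of rollouts.

```python
from typing import Dict, List


def pass_rate(rollout_correct: List[bool]) -> float:
    """Fraction of sampled rollouts that answered the question correctly."""
    if not rollout_correct:
        raise ValueError("need at least one rollout per question")
    return sum(rollout_correct) / len(rollout_correct)


def select_moderate(
    results: Dict[str, List[bool]],
    low: float = 0.25,   # assumed lower bound, not from the source
    high: float = 0.75,  # assumed upper bound, not from the source
) -> List[str]:
    """Keep question IDs whose pass rate lies in the closed band [low, high].

    Questions the model always solves (too easy) or never solves (too hard)
    fall outside the band and are dropped.
    """
    return [
        qid
        for qid, outcomes in results.items()
        if low <= pass_rate(outcomes) <= high
    ]


# Hypothetical rollout outcomes for three questions, four rollouts each.
rollouts = {
    "q_easy": [True, True, True, True],
    "q_mid": [True, False, True, False],
    "q_hard": [False, False, False, False],
}
selected = select_moderate(rollouts)
print(selected)  # ['q_mid']
```

Only the question with a mixed outcome survives the filter. In the described pipeline, such retained questions would then be upgraded into harder but answer-consistent candidates, a step this sketch does not cover.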

Qualitative Results

We provide demos with audible music clips here to give a more intuitive view of the qualitative results in the main paper and to further demonstrate GaMMA's capabilities in music understanding. Note that for each music clip, only the audio is used as input; the model has no access to the song title or artist information. The demos are organized as follows:

  • Multi-turn Music Conversation (Sec. A)
  • Multilingual Capabilities (Sec. B)
  • Text-Only Conversation (Sec. C)
  • Detailed Music Captioning (Sec. D)

A. Multi-turn Music Conversation

GaMMA supports multi-turn conversations grounded in full-song audio, answering follow-up questions about musical highlights, emotional arcs, structure, and creative interpretation while keeping context across turns.

Demo 1
What's the biggest highlight of this song?
Black Myth: Wukong
GaMMA
If you were to summarize the "feel" of this music with three keywords, what would you choose? Please explain your reasoning for each.
GaMMA
Is there an "emotional turning point" or "climax"? How is this part foreshadowed?
GaMMA
If you were to use this song as the soundtrack for a game themed around a character from one of the Four Great Classical Novels, which character would you choose?
GaMMA
Demo 2
How does this song control the emotional tension in the verse and chorus sections through the arrangement of instrumentation? Please illustrate with specific timestamps.
Golden
HUNTR/X
GaMMA
Demo 3
I'm shooting a cute pet video. From which second to which second should this piece of music be used as the background music for the video?
Bad Guy
Billie Eilish
GaMMA
Demo 4
What is unusual about this song?
让我们荡起双桨(重金属版)
suno.ai
GaMMA
What instruments are used in this song? Please list them by category: lead instruments, harmony instruments, bass instruments, and percussion instruments.
GaMMA
In what scenarios is this song suitable for?
GaMMA
Demo 5
Where can you hear the whistle?
City of Stars
Ryan Gosling
GaMMA
Which lyric is repeated the most?
GaMMA
Where is the guitar?
GaMMA
Demo 6
Create a video story script for this music.
Espresso
Sabrina Carpenter
GaMMA
Demo 7
Can you please analyze this song?
Gaussian Noise
GaMMA

B. Multilingual Capabilities

GaMMA handles music understanding and discussion in multiple languages, supporting questions about meaning, lyrics, vocal attributes, and time-specific musical events across Chinese, Japanese, Spanish, and more.

Demo 8
这首歌想表达什么? (What is this song trying to express?)
南方
达达乐队
GaMMA
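歌曲里面有一段呐喊将情感推向高潮,具体的时间是在哪? (There is a shout in the song that pushes the emotion to a climax; at what exact time does it occur?)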
GaMMA
描述一下这首歌的outro。 (Describe the outro of this song.)
GaMMA
Demo 9
この歌は何語で歌われていますか? (What language is this song sung in?)
GO!
CORTIS
GaMMA
サビが始まる時間を教えてください。 (Please tell me the time at which the chorus starts.)
GaMMA
この歌、どうしてこんなにクセになるの? (Why is this song so addictive?)
GaMMA
Demo 10
¿En qué idioma está cantada esta canción? (In what language is this song sung?)
Tout a changé (Rien n'a changé)
Helena
GaMMA
¿Esta canción la canta un hombre o una mujer? Describe su timbre. (Is this song sung by a man or a woman? Describe their timbre.)
GaMMA
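Dime el tiempo de inicio y final de la segunda frase de la letra de esta canción, y tradúcela al inglés. (Tell me the start and end times of the second line of this song's lyrics, and translate it into English.)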
GaMMA

C. Text-Only Conversation

GaMMA provides general text-only interaction for everyday dialogue, knowledge questions, and lightweight reasoning tasks even when no music input is provided.

Demo 11
Hi, how are you today?
GaMMA
Introduce the Beatles to me.
GaMMA
How many letter "r" in the word "strawberry"
GaMMA
Help me make a 3-day travel itinerary for Singapore.
GaMMA

D. Detailed Music Captioning

GaMMA generates detailed music reports with fine-grained descriptions of structure, lyrics, instrumentation, harmony, and progression, providing long-form analysis grounded in the full audio content.

Demo 12
Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.
DAISIES
Justin Bieber
GaMMA
Demo 13
Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.
Let It Be
The Beatles
GaMMA
Demo 14
Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.
Love Is Gone (with JOSHUA of SEVENTEEN)
SLANDER
GaMMA
Demo 15
Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.
Haunt Me
Anson Seabra
GaMMA
Demo 16
Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.
Brutus
Em Beihold
GaMMA
Demo 17
Analyze this piece of music, and output the analysis results in JSON format.
Bad Guy
Billie Eilish
GaMMA

MusicBench

Hierarchical evaluation dimensions in MusicBench, comprising 3,739 manually labeled multiple-choice questions.

MusicBench is a comprehensive benchmark for music-language models with 3,739 human-curated multiple-choice questions covering both temporal and non-temporal understanding. It is organized into two primary subsets. MusicBench-Global contains 2,741 questions on global and fine-grained musical attributes such as genre, mood, instrumentation, key, BPM, melody, rhythm, and association. MusicBench-Temporal contains 998 questions focused on reasoning over time, including vocals, instruments, structure, chords, and lyrics. Together they provide a structured evaluation of broad music understanding and temporal reasoning.
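As a rough illustration of how a multiple-choice benchmark split into two subsets might be scored, the sketch below computes per-subset and overall accuracy. The record layout (subset name, predicted option, gold option), the case-insensitive letter matching, and the function name are assumptions for illustration; the page does not describe MusicBench's file format or its official metric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subset(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute accuracy per subset plus an 'overall' score.

    Each record is (subset, predicted_option, gold_option); this layout is
    hypothetical. Option letters are compared case-insensitively.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subset, predicted, gold in records:
        hit = predicted.strip().upper() == gold.strip().upper()
        for name in (subset, "overall"):
            total[name] += 1
            correct[name] += int(hit)
    return {name: correct[name] / total[name] for name in total}


# Toy predictions, not real MusicBench data.
toy_records = [
    ("global", "A", "A"),
    ("global", "b", "B"),
    ("global", "C", "D"),
    ("temporal", "A", "B"),
]
scores = accuracy_by_subset(toy_records)
print(scores["overall"])   # 0.5
print(scores["temporal"])  # 0.0
```

Because the overall score pools all questions, it is implicitly weighted by subset size; with 2,741 global and 998 temporal questions, reporting the two subsets separately keeps the smaller temporal subset from being overshadowed.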

Quantitative Results

BibTeX

@misc{you2026gamma,
  title     = {GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models},
  author    = {You, Zuyao and Yu, Zhesong and Liu, Mingyu and Zhu, Bilei and Wan, Yuan and Wu, Zuxuan},
  journal   = {arXiv},
  year      = {2026}
}