GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Project Page

Zuyao You¹,², Zhesong Yu¹, Mingyu Liu¹, Bilei Zhu¹†, Yuan Wan¹, Zuxuan Wu²

¹Fudan University ²ByteDance

†Project Lead

Abstract

Overview of GaMMA across multi-turn music conversation, temporal music understanding, and detailed music captioning.

We introduce GaMMA, a large multimodal model that handles both global music understanding and temporal music reasoning within a unified parameter space. Built on a streamlined encoder-decoder paradigm, GaMMA combines language modeling with dual audio experts and a gated fusion mechanism to model both non-temporal musical semantics and time-dependent musical structure. A progressive training pipeline of pretraining, supervised fine-tuning, and reinforcement learning further strengthens instruction following, full-song understanding, and temporal reasoning.
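To make the idea of gated fusion concrete, the following is a minimal sketch of a generic gate that blends per-frame temporal features with a single global feature vector. It is not GaMMA's actual fusion module: the function name `gated_fuse`, the scalar per-frame gate, and the weight parameters `w_temporal`, `w_global`, and `bias` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(
    temporal_frames: Sequence[Sequence[float]],
    global_vec: Sequence[float],
    w_temporal: Sequence[float],
    w_global: Sequence[float],
    bias: float = 0.0,
) -> List[List[float]]:
    """Blend each temporal frame with the global vector (hypothetical design).

    For every frame, a scalar gate in (0, 1) is computed from both inputs,
    and the output is gate * frame + (1 - gate) * global_vec.
    """
    # The global contribution to the gate is the same for every frame.
    global_score = sum(w * g for w, g in zip(w_global, global_vec))
    fused = []
    for frame in temporal_frames:
        frame_score = sum(w * t for w, t in zip(w_temporal, frame))
        gate = sigmoid(frame_score + global_score + bias)
        fused.append([gate * t + (1.0 - gate) * g for t, g in zip(frame, global_vec)])
    return fused


# With all-zero gate weights the gate is exactly 0.5, so the output is the
# plain average of the temporal frame and the global vector.
print(gated_fuse([[1.0, 3.0]], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0]))
```

A learned gate of this kind lets the model lean on time-dependent features for some frames and on holistic features for others, which is the general motivation for gating rather than simple concatenation.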

Pipeline

Overview of the model architecture and dataset construction pipeline of GaMMA.

GaMMA jointly models temporal and non-temporal music understanding with dual Whisper-based audio experts: a temporal expert captures time-dependent structure, and a global expert models holistic musical semantics. Their representations are fused through a gated extractor-injector design for full-song reasoning. On the data side, the pipeline first constructs SFT data by segmenting songs into fine-grained musical structure and combining aligned music and lyrics to produce expert-verified music reports and multi-turn QA pairs across diverse musical dimensions. It then builds RL data by selecting moderately difficult questions through rollout-based pass-rate filtering and upgrading them into harder but answer-consistent candidates.
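The rollout-based pass-rate filtering step can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the record keys `question` and `answer`, the `rollout` callable, the number of rollouts, and the thresholds `low` and `high` are all hypothetical, since the page does not specify them.

```python
from typing import Callable, Dict, List


def pass_rate(
    question: str,
    answer: str,
    rollout: Callable[[str], str],
    n_rollouts: int,
) -> float:
    """Fraction of sampled model answers that match the reference answer."""
    correct = sum(
        1 for _ in range(n_rollouts) if rollout(question).strip() == answer.strip()
    )
    return correct / n_rollouts


def select_moderate(
    questions: List[Dict[str, str]],
    rollout: Callable[[str], str],
    n_rollouts: int = 8,   # assumed value
    low: float = 0.25,     # assumed lower bound on pass rate
    high: float = 0.75,    # assumed upper bound on pass rate
) -> List[Dict[str, str]]:
    """Keep questions the model answers correctly sometimes, but not always."""
    kept = []
    for item in questions:
        rate = pass_rate(item["question"], item["answer"], rollout, n_rollouts)
        # Pass rates near 1 are too easy and near 0 too hard to be informative.
        if low <= rate <= high:
            kept.append(item)
    return kept
```

In practice `rollout` would sample one answer from the current model at a nonzero temperature, so that repeated calls on the same question can disagree.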

Qualitative Results

We provide demos with audible music clips to present the qualitative results of the main paper more intuitively and to further demonstrate GaMMA's capabilities in music understanding. For each music clip, only the audio is used as input; the model has no access to the song title or artist information. The demos are organized as follows:

Multi-turn Music Conversation (Sec. A)
Multilingual Capabilities (Sec. B)
Text-Only Conversation (Sec. C)
Detailed Music Captioning (Sec. D)

A. Multi-turn Music Conversation

GaMMA supports multi-turn conversations grounded in full-song audio, answering follow-up questions about musical highlights, emotional arcs, structure, and creative interpretation while keeping context across turns.

Demo 1

What's the biggest highlight of this song?

Black Myth: Wukong

GaMMA

If you were to summarize the "feel" of this music with three keywords, what would you choose? Please explain your reasoning for each.

GaMMA

Is there an "emotional turning point" or "climax"? How is this part foreshadowed?

GaMMA

If you were to use this song as the soundtrack for a game themed around a character from one of the Four Great Classical Novels, which character would you choose?

GaMMA

Demo 2

How does this song control the emotional tension in the verse and chorus sections through the arrangement of instrumentation? Please illustrate with specific timestamps.

Golden

HUNTR/X

GaMMA

Demo 3

I'm shooting a cute pet video. From which second to which second should this piece of music be used as the background music for the video?

Bad Guy

Billie Eilish

GaMMA

Demo 4

What is unusual about this song?

让我们荡起双桨（重金属版）

suno.ai

GaMMA

What instruments are used in this song? Please list them by category: lead instruments, harmony instruments, bass instruments, and percussion instruments.

GaMMA

In what scenarios is this song suitable for?

GaMMA

Demo 5

Where can you hear the whistle?

City of Stars

Ryan Gosling

GaMMA

Which lyric is repeated the most?

GaMMA

Where is the guitar?

GaMMA

Demo 6

Create a video story script for this music.

Espresso

Sabrina Carpenter

GaMMA

Demo 7

Can you please analyze this song?

Gaussian Noise

GaMMA

B. Multilingual Capabilities

GaMMA handles music understanding and discussion in multiple languages, supporting questions about meaning, lyrics, vocal attributes, and time-specific musical events across Chinese, Japanese, Spanish, and more.

Demo 8

这首歌想表达什么？ (What is this song trying to express?)

南方

达达乐队

GaMMA

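歌曲里面有一段呐喊将情感推向高潮，具体的时间是在哪？ (There is a shout in the song that pushes the emotion to its climax. At what time exactly does it occur?)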

GaMMA

描述一下这首歌的outro。 (Describe the outro of this song.)

GaMMA

Demo 9

この歌は何語で歌われていますか？ (What language is this song sung in?)

GO!

CORTIS

GaMMA

サビが始まる時間を教えてください。 (Please tell me when the chorus starts.)

GaMMA

この歌、どうしてこんなにクセになるの？ (Why is this song so addictive?)

GaMMA

Demo 10

¿En qué idioma está cantada esta canción? (What language is this song sung in?)

Tout a changé (Rien n'a changé)

Helena

GaMMA

¿Esta canción la canta un hombre o una mujer? Describe su timbre. (Is this song sung by a man or a woman? Describe their timbre.)

GaMMA

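Dime el tiempo de inicio y final de la segunda frase de la letra de esta canción, y tradúcela al inglés. (Tell me the start and end times of the second line of this song's lyrics, and translate it into English.)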

GaMMA

C. Text-Only Conversation

GaMMA provides general text-only interaction for everyday dialogue, knowledge questions, and lightweight reasoning tasks even when no music input is provided.

Demo 11

Hi, how are you today?

GaMMA

Introduce the Beatles to me.

GaMMA

How many letter "r" in the word "strawberry"

GaMMA

Help me make a 3-day travel itinerary for Singapore.

GaMMA

D. Detailed Music Captioning

GaMMA generates detailed music reports with fine-grained descriptions of structure, lyrics, instrumentation, harmony, and progression, providing long-form analysis grounded in the full audio content.

Demo 12

Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.

DAISIES

Justin Bieber

GaMMA

Demo 13

Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.

Let It Be

The Beatles

GaMMA

Demo 14

Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.

Love Is Gone (with JOSHUA of SEVENTEEN)

SLANDER

GaMMA

Demo 15

Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.

Haunt Me

Anson Seabra

GaMMA

Demo 16

Please provide a detailed analysis of this song's structure, including the lyrics, the instruments, the chord progressions then analyze them.

Brutus

Em Beihold

GaMMA

Demo 17

Analyze this piece of music, and output the analysis results in JSON format.

Bad Guy

Billie Eilish

GaMMA

MusicBench

Hierarchical evaluation dimensions in MusicBench, comprising 3,739 manually labeled multiple-choice questions.

MusicBench is a comprehensive benchmark for music-language models with 3,739 human-curated multiple-choice questions covering both temporal and non-temporal understanding. It is organized into two primary subsets. MusicBench-Global contains 2,741 questions on global and fine-grained musical attributes such as genre, mood, instrumentation, key, BPM, melody, rhythm, and association. MusicBench-Temporal contains 998 questions focused on reasoning over time, covering vocals, instruments, structure, chords, and lyrics. Together, the two subsets provide a structured evaluation of broad music understanding and temporal reasoning.
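Because MusicBench consists of multiple-choice questions split into two subsets, results can be summarized as per-subset and overall accuracy. The sketch below shows one way to compute these; the record layout (keys `subset`, `prediction`, `answer`) and the function name `score` are assumptions for illustration, not the benchmark's released format or official scoring script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def score(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return accuracy per subset plus a question-weighted overall accuracy."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for record in records:
        total[record["subset"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["subset"]] += 1

    result = {name: correct[name] / total[name] for name in total}
    n_questions = sum(total.values())
    # Overall accuracy weights every question equally (micro-average).
    result["overall"] = sum(correct.values()) / n_questions if n_questions else 0.0
    return result


# Toy example with hypothetical predictions.
toy = [
    {"subset": "Global", "prediction": "A", "answer": "A"},
    {"subset": "Global", "prediction": "B", "answer": "C"},
    {"subset": "Temporal", "prediction": "D", "answer": "D"},
    {"subset": "Temporal", "prediction": "A", "answer": "A"},
]
print(score(toy))  # {'Global': 0.5, 'Temporal': 1.0, 'overall': 0.75}
```

Since the Global subset has 2,741 questions and the Temporal subset has 998, a question-weighted overall score is dominated by the Global subset, so reporting the two subsets separately gives a clearer picture of temporal reasoning.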

Quantitative Results

BibTeX

@misc{you2026gamma,
  title     = {GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models},
  author    = {You, Zuyao and Yu, Zhesong and Liu, Mingyu and Zhu, Bilei and Wan, Yuan and Wu, Zuxuan},
  journal   = {arXiv},
  year      = {2026}
}