MusicGen-Large (3.3B)

LUMINA Gradient Extraction

A comprehensive guide to understanding how LUMINA extracts signatures from audio using gradient-based techniques to determine training data influence.

📄 15 min read 🎯 Advanced 📅 March 2026

The Big Picture

What Problem Are We Solving?

When MusicGen generates a piece of audio, we want to answer a critical question: "Which songs from the training data influenced this output, and by how much?"

This is called attribution or influence estimation. It's essential for royalty distribution, copyright compliance, and building trust with rightsholders.

Why Is This Hard?

MusicGen is a 3.3 billion parameter neural network. When it generates audio, the output is influenced by all the training data in complex, non-linear ways. We can't simply "look inside" the model and see which songs it's thinking of.

💡 Key Insight

The solution is gradient signatures: if two pieces of audio cause similar changes (gradients) in the model's weights when processed, they are "similar" in a musically meaningful way.

Think of it like this:

  • Every song creates a unique "pattern of activation" when passed through the model
  • We capture this pattern as a fingerprint (a 216-dimensional vector per channel, 432D total)
  • To find which training songs influenced an output, we compare signatures

Understanding the Two Channels

Why Two Channels?

Music copyright has two distinct types of rights. LUMINA separates them so we can attribute them independently:

| Channel   | Rights Type           | What It Captures           | Legal Implication     |
|-----------|-----------------------|----------------------------|-----------------------|
| Channel P | Composition Influence | Melody, harmony, structure | Songwriting royalties |
| Channel M | Recording Influence   | Sound, timbre, production  | Recording royalties   |

A song could sample someone's production style (Master) without copying their melody (Publishing), or vice versa.

Technical Separation

We hook into different parts of the MusicGen architecture:

MusicGen Architecture

Transformer Language Model (48 layers)

Upper Layers (42–47) → Channel P (Composition)
  • 6 tensors: self_attn.in_proj, self_attn.out_proj, cross_attn.in_proj, cross_attn.out_proj, linear1, linear2
  • Capture: structure, harmony, melody → 216D

Mid Layers (12–17) → Channel M (Production)
  • 6 tensors: linear1, linear2, self_attn.in_proj, self_attn.out_proj, cross_attn.in_proj, cross_attn.out_proj
  • Capture: timbre, texture, sound design → 216D

EnCodec (Compression Model)
  • Encoder → Channel P
  • Decoder → Channel M
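The selective hooking described above can be sketched as follows. This is a hypothetical illustration: the layer ranges come from the text, but the 48-layer stack, `captured` dict, and `make_hook` helper are stand-ins, not MusicGen's real modules or LUMINA's real code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of selective layer hooking (stand-in modules).
CHANNEL_P_LAYERS = range(42, 48)  # upper: composition
CHANNEL_M_LAYERS = range(12, 18)  # mid: production

captured = {"P": {}, "M": {}}

def make_hook(channel, name):
    def hook(grad):
        # Store a detached copy of the gradient, pass it through unchanged
        captured[channel][name] = grad.detach().clone()
        return grad
    return hook

# 48 stand-in transformer layers (tiny, for illustration only)
layers = nn.ModuleList(
    [nn.TransformerDecoderLayer(d_model=8, nhead=2) for _ in range(48)]
)

for idx, layer in enumerate(layers):
    if idx in CHANNEL_P_LAYERS:
        channel = "P"
    elif idx in CHANNEL_M_LAYERS:
        channel = "M"
    else:
        continue  # unhooked layers never store gradients
    for pname, param in layer.named_parameters():
        param.register_hook(make_hook(channel, f"layer{idx}.{pname}"))

# A backward pass through a hooked layer fires its hooks:
tgt, mem = torch.randn(2, 3, 8), torch.randn(2, 3, 8)
layers[42](tgt, mem).sum().backward()
```

Because only hooked layers store copies of their gradients, the memory cost scales with the 12 selected layers rather than all 48.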

Script 1: The Hooking System

hooks/lumina_hooker.py – Low-level machinery to intercept gradients

What is a "Hook"?

In PyTorch, a hook is a callback function that gets triggered during the forward or backward pass. It lets us "spy on" what's happening inside the model without modifying its behavior.

# Simplified example: how hooks work
captured_grads = []

def my_hook(gradient):
    """This function is called whenever a gradient flows through."""
    print(f"Caught gradient with shape: {gradient.shape}")
    captured_grads.append(gradient.detach().clone())  # save for later
    return gradient  # Pass it through unchanged

# Attach the hook to a parameter
parameter.register_hook(my_hook)

Selective Layer Hooking

MusicGen has 48 transformer layers. Hooking all of them would use ~40GB of VRAM.

# Production layer ranges for channel separation
CHANNEL_P_LAYERS = list(range(42, 48))  # Upper layers: composition
CHANNEL_M_LAYERS = list(range(12, 18))  # Lower layers: production
✓ Memory Efficiency

By hooking only layers 42–47 for P and 12–17 for M (12 total layers, 6 tensors each), we reduce VRAM usage to ~11GB while capturing the most informative gradient signals. 6 summary statistics per tensor × 6 tensors × 6 layers = 216D per channel.

Channel Classification (Layer-Range Based)

# v2 Architecture: Both channels use ALL 6 tensors
# Differentiated by layer range, not tensor type
ALL_TENSORS = [
    "self_attn.in_proj_weight",  # Self-attention input
    "self_attn.out_proj",        # Self-attention output
    "cross_attention.in_proj",   # Cross-attention input
    "cross_attention.out_proj",  # Cross-attention output
    "linear1",                   # FFN first layer
    "linear2",                   # FFN second layer
]

# Channel separation by layer range
CHANNEL_P_LAYERS = range(42, 48)  # Upper: composition
CHANNEL_M_LAYERS = range(12, 18)  # Mid: production

The intuition:

  • Upper layers (42–47) encode abstract compositional decisions: melody, harmony, structure. All tensor types in these layers contribute to Publishing.
  • Mid layers (12–17) encode concrete production characteristics: timbre, texture, sound design. All tensor types in these layers contribute to Master.
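The layer-range rule above reduces to a few lines. A minimal sketch (the function name `classify_channel` is illustrative, not LUMINA's API):

```python
def classify_channel(layer_idx: int):
    """Map a transformer layer index to its attribution channel."""
    if 42 <= layer_idx < 48:
        return "P"  # upper layers: composition (Publishing)
    if 12 <= layer_idx < 18:
        return "M"  # mid layers: production (Master)
    return None     # layer is not hooked at all
```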

Script 2: Signature Extraction

extract_signatures.py – Process your entire catalog into a database

The Overall Flow

Input: Training Songs (WAV) → Output: signatures.h5 (Database)

1. Processing Loop: Load Audio → MusicGen Forward Pass → Capture Gradients.
2. Gradient Stats Collection: Compute 6 summary statistics (mean, std, L2 norm, max, min, skew) per tensor → 216D per channel.
3. Storage: Write to HDF5: signatures_p, signatures_m, and song_ids.

Step-by-Step Walkthrough

Step 1: Load the Audio

import torchaudio
import torchaudio.functional as AF

def load_audio(audio_path, target_sr=32000):
    """Load and prepare audio for MusicGen."""
    waveform, sr = torchaudio.load(audio_path)

    # Resample to 32kHz (MusicGen's sample rate)
    if sr != target_sr:
        waveform = AF.resample(waveform, sr, target_sr)

    # Convert stereo to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    return waveform

Why 32kHz mono? MusicGen was trained on 32kHz mono audio. Mismatched formats would corrupt signatures.

Step 2: Capture Gradients from Transformer LM Layers

# Teacher forcing: LM predicts codes from codes
lm_output = lm.compute_predictions(codes=codes, conditions=attrs)
logits, mask = lm_output.logits, lm_output.mask
loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), codes.reshape(-1))
loss.backward()

# Channel P: gradients from layers 42-47 (upper transformer)
# Channel M: gradients from layers 12-17 (mid transformer)

Upper layers capture composition (melody, harmony), mid layers capture production (timbre, texture).

Step 3: Forward Pass

with torch.no_grad():
    # Encode: audio โ†’ codes
    codes, scale = compression_model.encode(audio)
    
    # Decode: codes โ†’ audio (triggers decoder hooks)
    _ = compression_model.decode(codes, scale)

Step 4: Compute Gradient Statistics → 216D

# For each tensor in each layer, compute 6 stats
stats = [
    grad.mean(),
    grad.std(),
    torch.norm(grad, p=2),   # L2 norm
    grad.max(),
    grad.min(),
    skewness(grad),
]

# Channel P: 6 layers × 6 tensors × 6 stats = 216D
signature_p = torch.tensor(all_p_stats)
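The `skewness` call above is not a built-in torch op. A minimal implementation, assuming the standard definition of sample skewness, might look like:

```python
import torch

def skewness(t: torch.Tensor) -> torch.Tensor:
    # Sample skewness of a flattened tensor: E[(x - mean)^3] / std^3,
    # with a small epsilon for numerical stability on near-constant tensors.
    x = t.flatten().float()
    mean = x.mean()
    std = x.std(unbiased=False)
    return ((x - mean) ** 3).mean() / (std ** 3 + 1e-12)
```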

Step 5: Normalize

# L2 normalize: vector length = 1
signature_p = F.normalize(signature_p, p=2, dim=-1)

# Now similarity = dot product!
similarity = signature_a @ signature_b  # Value: -1 to 1
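With normalized vectors, searching a whole signature database is one matrix multiply. A sketch with random stand-in data (the 99 × 216 shapes follow the text; the values are not real signatures):

```python
import torch
import torch.nn.functional as F

# Stand-in database: one L2-normalized 216D signature per training song
database = F.normalize(torch.randn(99, 216), p=2, dim=1)
query = F.normalize(torch.randn(216), p=2, dim=0)

similarities = database @ query            # (99,) cosine scores in [-1, 1]
best_match = similarities.argmax().item()  # index of the closest training song
```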

Script 3: Attribution Module

core/gradient_attribution.py – Runtime attribution interface

How It Differs

| Aspect | extract_signatures.py | gradient_attribution.py |
|--------|-----------------------|-------------------------|
| Uses   | EnCodec activations   | LM gradients            |
| Speed  | ~100ms per song       | ~2–5s per song          |
| When   | Building database     | Runtime analysis        |

💡 Gradients vs Activations

Gradients are more informative than activations because they encode sensitivity: how much each parameter contributed to the output.

LUMINA-WTA Teacher Forcing (Engine Approach)

The modern LUMINA engine uses cross-entropy teacher forcing, a more robust method that extracts gradients by computing the loss between predicted and actual codebook tokens.

The Core Algorithm

# 1. Encode audio to codebook tokens
with torch.no_grad():
    codes, scale = compression_model.encode(audio_chunk)

# 2. Teacher forcing: LM predicts codes from codes
lm_output = lm.compute_predictions(codes=codes, conditions=attributes)
logits = lm_output.logits  # [B, K, T, vocab]
mask = lm_output.mask      # [B, K, T]

# 3. Cross-entropy loss
loss = F.cross_entropy(
    logits.reshape(-1, logits.shape[-1]),
    codes.reshape(-1),
    reduction='none'
)
mask_flat = mask.reshape(-1).float()
loss = (loss * mask_flat).sum() / (mask_flat.sum() + 1e-8)

# 4. Backpropagate to extract gradients
loss.backward()

Chunked Processing (10s Sweet Spot)

To handle songs of arbitrary length and manage VRAM, audio is processed in 10-second chunks. Gradients are accumulated and averaged across chunks:

CHUNK_DURATION = 10.0  # The "Sweet Spot"
chunk_samples = int(CHUNK_DURATION * sample_rate)

grads_p_sum, grads_m_sum = None, None
num_chunks = 0

for start_idx in range(0, total_samples, chunk_samples):
    chunk = audio[..., start_idx:start_idx + chunk_samples]
    # Run the teacher-forcing pass on this chunk, then accumulate
    # (extract_chunk_gradients stands in for the per-chunk extraction)
    chunk_grads_p, chunk_grads_m = extract_chunk_gradients(chunk)
    if grads_p_sum is None:
        grads_p_sum, grads_m_sum = chunk_grads_p, chunk_grads_m
    else:
        grads_p_sum = [a + b for a, b in zip(grads_p_sum, chunk_grads_p)]
        grads_m_sum = [a + b for a, b in zip(grads_m_sum, chunk_grads_m)]
    num_chunks += 1

# Average across all chunks
grads_p = [g / num_chunks for g in grads_p_sum]
grads_m = [g / num_chunks for g in grads_m_sum]

Channel Separation in Practice

| Channel   | Gradient Source                                 | Collection Method                        |
|-----------|-------------------------------------------------|------------------------------------------|
| Channel P | Self-Attention + Cross-Attention (layers 42–47) | layer.self_attn + cross_attn (6 tensors) |
| Channel M | FFN layers (layers 12–17)                       | layer.linear1 + linear2 (6 tensors)      |

✓ Why Teacher Forcing Works

By computing how well the model "would have predicted" the actual audio tokens, we get gradients that encode the model's internal representation of that song. Songs that produce similar gradient patterns share similar musical DNA.

The Math: Gradient Summary Statistics

The Problem

Raw gradients can be millions of dimensions. We reduce this to 216D per channel using 6 summary statistics computed per tensor per layer.

The Solution: Statistical Fingerprinting

def compute_gradient_stats(grad_tensor):
    """Compute 6 summary statistics from a gradient tensor."""
    stats = [
        grad_tensor.mean().item(),
        grad_tensor.std().item(),
        grad_tensor.norm(2).item(),           # L2 norm
        grad_tensor.max().item(),
        grad_tensor.min().item(),
        skewness(grad_tensor).item(),   # Skew
    ]
    return stats

Why It Works (Intuition)

Distances are preserved! The 6 summary statistics capture the essential distributional properties of each gradient tensor. Combined across 6 layers × 6 tensors, this produces a 216D fingerprint that is both compact and highly discriminative.
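This intuition can be sanity-checked with a toy experiment on random stand-in tensors: a small perturbation of a tensor should land much closer in stats space than a tensor from a very different distribution.

```python
import torch

torch.manual_seed(0)

def stats6(t):
    # The same 6 statistics used for fingerprinting (skew computed inline)
    x = t.flatten().float()
    mean, std = x.mean(), x.std()
    skew = ((x - mean) ** 3).mean() / (std ** 3 + 1e-12)
    return torch.stack([mean, std, x.norm(2), x.max(), x.min(), skew])

base = torch.randn(10_000)
near = base + 0.01 * torch.randn(10_000)  # small perturbation of base
far = 5.0 * torch.randn(10_000) + 3.0     # very different distribution

d_near = (stats6(base) - stats6(near)).norm()
d_far = (stats6(base) - stats6(far)).norm()
# d_near should be much smaller than d_far
```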

✓ Empirical Noise Calibration

The noise floor is calibrated empirically using 50 GTZAN control tracks (5 per genre). Cosine similarities between control and training fingerprints determine the true μ and σ of the noise distribution, replacing the theoretical 1/√d approximation.
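A sketch of that calibration step. The similarity matrix here is random stand-in data (the real pipeline would use actual control-vs-training cosine scores), and the 3σ margin is illustrative, not LUMINA's documented threshold rule:

```python
import numpy as np

rng = np.random.default_rng(422024)
# Stand-in for cosine similarities: 50 control tracks x 99 training songs
control_sims = rng.normal(0.0, 0.07, size=(50, 99))

mu = control_sims.mean()    # empirical noise mean
sigma = control_sims.std()  # empirical noise spread

# A match must clear the empirical noise floor, e.g. by 3 sigma:
threshold = mu + 3.0 * sigma
```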

First Codebook Dominance Fix

The Problem

The first codebook (CB1) dominates the gradient signal because fundamental pitch carries the highest energy. But for attribution, we care about all aspects equally.

The Solution: Weighted Gradients

codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4

def _apply_codebook_weighting(self, grad, param_name):
    """Attenuate CB1 to 40%, boost CB2-CB4 to 120-140%."""
    weighted_grad = grad.clone()
    # Assume the leading dimension splits evenly into 4 codebook sections
    section_size = grad.shape[0] // 4
    for cb_idx in range(4):
        section = slice(cb_idx * section_size, (cb_idx + 1) * section_size)
        weighted_grad[section] *= codebook_boosts[cb_idx]
    return weighted_grad

✓ Effect

Timbral and textural similarities are now properly weighted, preventing fundamental pitch from drowning out production style influences.
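A toy demonstration of the weighting on a stand-in gradient (the boost values are from the text; the 8-element tensor and even split are illustrative):

```python
import torch

codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4
grad = torch.ones(8)                     # pretend gradient, 2 rows per codebook
weighted = grad.clone()

section_size = grad.shape[0] // len(codebook_boosts)
for cb_idx, boost in enumerate(codebook_boosts):
    weighted[cb_idx * section_size:(cb_idx + 1) * section_size] *= boost
# weighted is now [0.4, 0.4, 1.2, 1.2, 1.3, 1.3, 1.4, 1.4]
```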

End-to-End Pipeline

Training Time: Building the Database

Input: 99 Training Songs (WAV) → Output: signatures.h5 (~400KB)

1. Load Audio: Load WAV files, resample to 32kHz, and convert to mono.
2. EnCodec Forward Pass: Run audio through encode/decode with gradient hooks attached.
3. Collect Activations: Capture raw signals from Attention (Channel P) and FFN (Channel M).
4. Compute Gradient Statistics: Compute 6 summary stats per tensor per layer → 216D per channel.
5. Normalize & Store: L2-normalize vectors and save to HDF5 database with Song IDs.

Inference Time: Attribution

Input: Generated Audio → Output: Attribution Report (~0.8ms)

1. Extract Signature: Run the same extraction pipeline on the new audio to get 216D vectors per channel.
2. Similarity Search: Compute cosine similarity (dot product) against the entire 99-song database as one matrix multiplication.
3. Thresholding: Apply the cosine threshold (empirically calibrated via GTZAN) and compute each song's excess above it.
4. Share Calculation: Potency-weighted attribution shares: Share = excess / Σ(excess).
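The thresholding and share formula together fit in a few lines. A sketch with illustrative similarity values and threshold (not real calibration outputs):

```python
import numpy as np

# Illustrative cosine similarities for 4 candidate training songs
similarities = np.array([0.82, 0.41, 0.15, 0.05])
threshold = 0.10  # stand-in for the GTZAN-calibrated threshold

excess = np.clip(similarities - threshold, 0.0, None)  # below threshold -> 0
shares = excess / excess.sum()                         # Share_i = excess_i / sum
# shares sum to 1; the 0.05 song receives no share at all
```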

Reproducibility Guarantees

LUMINA is designed for legal and forensic contexts. Every step must be reproducible.

| Component           | Fixed Value                 | Why It Matters                |
|---------------------|-----------------------------|-------------------------------|
| JL Projection Seed  | 422024                      | Same projection matrix always |
| Signature Dimension | 216 per channel (432 total) | Consistent database schema    |
| Sample Rate         | 32kHz                       | Matches MusicGen training     |
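The fixed-seed guarantee is easy to illustrate: seeding a generator makes the random projection matrix identical on every run. The seed below is the one from the table; the dimensions and function name are illustrative:

```python
import torch

def jl_projection(dim_in, dim_out, seed=422024):
    # A fixed seed yields the exact same projection matrix every time
    g = torch.Generator().manual_seed(seed)
    return torch.randn(dim_in, dim_out, generator=g) / dim_out ** 0.5

A1 = jl_projection(1024, 216)
A2 = jl_projection(1024, 216)
# A1 and A2 are identical, run after run
```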

Summary

  1. Gradients reveal influence: What the model learned from a piece of audio can be traced back through its gradients.
  2. Two channels for two rights: Attention layers (42–47) → Publishing. FFN layers (12–17) → Master.
  3. 6 gradient stats preserve similarity: Millions of gradient dimensions reduce to 216D per channel.
  4. Fixed seed = reproducibility: Essential for legal audit.