LUMINA for MusicGen | Gradient Extraction Deep Dive

The Big Picture

What Problem Are We Solving?

When MusicGen generates a piece of audio, we want to answer a critical question: "Which songs from the training data influenced this output, and by how much?"

This is called attribution or influence estimation. It's essential for royalty distribution, copyright compliance, and building trust with rightsholders.

Why Is This Hard?

MusicGen is a 3.3 billion parameter neural network. When it generates audio, the output is influenced by all the training data in complex, non-linear ways. We can't simply "look inside" the model and see which songs it's thinking of.

💡 Key Insight

The solution is gradient signatures: if two pieces of audio cause similar changes (gradients) in the model's weights when processed, they are "similar" in a musically meaningful way.

Think of it like this:

Every song creates a unique "pattern of activation" when passed through the model
We capture this pattern as a signature (a 512-dimensional vector)
To find which training songs influenced an output, we compare signatures

Understanding the Two Channels

Why Two Channels?

Music copyright has two distinct types of rights. LUMINA separates them so we can attribute them independently:

Channel	Rights Type	What It Captures	Legal Implication
Channel P	Composition Influence	Melody, harmony, structure	Songwriting royalties
Channel M	Recording Influence	Sound, timbre, production	Recording royalties

A song could borrow someone's production style (Channel M, the master recording side) without copying their melody (Channel P, the publishing side), or vice versa.

Technical Separation

We hook into different parts of the MusicGen architecture:

MusicGen Architecture

Transformer Language Model (48 layers)

Attention Layers → Channel P (Composition)

Query, Key, Value projections
Learns relationships between musical elements
Capture: Structure, Harmony, Note Relations

FFN & Embedding Layers → Channel M (Production)

Feed-forward networks
RVQ Codebooks & LayerNorm
Capture: Timbre, Texture, Sound Design

EnCodec (Compression Model)

Encoder → Channel P

Decoder → Channel M

Script 1: The Hooking System

hooks/lumina_hooker.py — Low-level machinery to intercept gradients

What is a "Hook"?

In PyTorch, a hook is a callback function that gets triggered during the forward or backward pass. It lets us "spy on" what's happening inside the model without modifying its behavior.

# Simplified example: How hooks work
def my_hook(gradient):
    """This function is called whenever a gradient flows through."""
    print(f"Caught gradient with shape: {gradient.shape}")
    save_for_later(gradient)
    return gradient  # Pass it through unchanged

# Attach the hook to a parameter
parameter.register_hook(my_hook)
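
To make the hook mechanism concrete, here is a minimal runnable sketch on a toy layer. The `nn.Linear` stand-in, the `captured` dictionary, and the `make_grad_hook` helper are illustrative assumptions and are not taken from lumina_hooker.py; only `Tensor.register_hook` is the real PyTorch call.

```python
import torch
import torch.nn as nn

# Toy stand-in for one group of MusicGen parameters (assumption for illustration)
layer = nn.Linear(4, 2)
captured = {}  # hypothetical storage: parameter name -> copy of its gradient

def make_grad_hook(name):
    """Build a hook that stores a detached copy of the gradient under `name`."""
    def hook(grad):
        captured[name] = grad.detach().clone()
        # Returning None leaves the gradient unchanged
    return hook

# register_hook returns a handle that can later remove the hook
handles = [p.register_hook(make_grad_hook(n)) for n, p in layer.named_parameters()]

# Any scalar loss works; backward() is what triggers the hooks
loss = layer(torch.ones(3, 4)).sum()
loss.backward()

for name, grad in captured.items():
    print(name, tuple(grad.shape))

# Remove the hooks so they do not fire on later backward passes
for h in handles:
    h.remove()
```

Storing a detached clone keeps the captured gradient independent of later in-place updates, at the cost of extra memory per hooked parameter.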

Selective Layer Hooking

MusicGen has 48 transformer layers. Hooking all of them would use ~40GB of VRAM.

# Which layers to hook: every 4th from 48 total = 12 layers
DEFAULT_HOOKED_LAYERS = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]

✓ Memory Efficiency

By hooking every 4th layer, we reduce VRAM usage to ~11GB while preserving enough information for accurate attribution. The signal is redundant across adjacent layers anyway.

Channel Classification (SpinTrak-Aligned)

# Parameter names indicating Attention (self and cross) → Channel P
CHANNEL_P_PATTERNS = [
    "self_attn.in_proj_weight",  # Combined QKV projection
    "self_attn.out_proj",        # Output projection
    "cross_attention.*",         # Cross-attention (text → audio)
]

# Parameter names indicating FFN, output linear, and norm layers → Channel M
CHANNEL_M_PATTERNS = [
    "linear1", "linear2",      # FFN layers
    "lm.linears",               # Output codebook projections
    "norm1", "norm2",          # Layer normalization
]

The intuition:

Attention layers learn what elements relate to each other. This is compositional: "This chord should follow that chord." That's Publishing.
FFN layers learn what patterns to generate. This is about sounds and textures. That's Master.
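
The layer selection and channel patterns above could be combined into a single classifier along the following lines. This is a sketch under assumptions: the `classify_parameter` helper, the regex form of the patterns, and the example parameter names are hypothetical and may not match how lumina_hooker.py or the real model names its parameters.

```python
import re

# Every 4th of 48 transformer layers, as in DEFAULT_HOOKED_LAYERS
HOOKED_LAYERS = set(range(0, 48, 4))

# The patterns from above, written as regular expressions (assumed matching style)
CHANNEL_P_PATTERNS = [r"self_attn\.in_proj_weight", r"self_attn\.out_proj", r"cross_attention\."]
CHANNEL_M_PATTERNS = [r"linear1", r"linear2", r"lm\.linears", r"norm1", r"norm2"]

def classify_parameter(name):
    """Return "P", "M", or None for a parameter name."""
    # Skip transformer layers outside the hooked subset
    layer_match = re.search(r"layers\.(\d+)\.", name)
    if layer_match and int(layer_match.group(1)) not in HOOKED_LAYERS:
        return None
    if any(re.search(p, name) for p in CHANNEL_P_PATTERNS):
        return "P"
    if any(re.search(p, name) for p in CHANNEL_M_PATTERNS):
        return "M"
    return None

# Illustrative names only; real parameter names may differ
for example in [
    "lm.transformer.layers.4.self_attn.in_proj_weight",
    "lm.transformer.layers.4.linear1.weight",
    "lm.transformer.layers.5.linear1.weight",
    "lm.linears.0.weight",
]:
    print(example, "->", classify_parameter(example))
```

Checking the attention patterns first means a name that somehow matched both lists would land in Channel P; that ordering is a choice of this sketch, not something the source specifies.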

Script 2: Signature Extraction

extract_signatures.py — Process your entire catalog into a database

The Overall Flow

Input: Training Songs (WAV)
Output: signatures.h5 (Database)

Processing Loop

Load Audio → EnCodec Forward Pass → Capture Activations.

Projection & Normalization

Project high-dim activations to 512D and L2-normalize.

Storage

Write to HDF5: signatures_p, signatures_m, and song_ids.

Step-by-Step Walkthrough

Load the Audio

import torchaudio
import torchaudio.functional as AF

def load_audio(audio_path, target_sr=32000):
    """Load audio and convert it to the format MusicGen expects."""
    waveform, sr = torchaudio.load(audio_path)  # Shape: [channels, samples]

    # Resample to 32kHz (MusicGen's sample rate)
    if sr != target_sr:
        waveform = AF.resample(waveform, sr, target_sr)

    # Convert stereo to mono by averaging the channels
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    return waveform

Why 32kHz mono? MusicGen was trained on 32kHz mono audio. Mismatched formats would corrupt signatures.

Hook into EnCodec

# Attach hooks to encoder (Channel P)
for layer in compression_model.encoder.model:
    layer.register_forward_hook(make_hook(activations_p))

# Attach hooks to decoder (Channel M)
for layer in compression_model.decoder.model:
    layer.register_forward_hook(make_hook(activations_m))

Encoder captures structure, decoder captures sound.
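
The `make_hook` factory used above is not shown in this document. A minimal sketch of what such a factory might look like follows; the toy `nn.Sequential` stands in for `compression_model.encoder.model`, and the sketch assumes every hooked module returns a single tensor.

```python
import torch
import torch.nn as nn

def make_hook(storage):
    """Return a forward hook that appends each module's output to `storage`."""
    def hook(module, inputs, output):
        # Assumes `output` is a single tensor; detach so no graph is kept alive
        storage.append(output.detach())
    return hook

# Toy stand-in for compression_model.encoder.model (assumption for illustration)
toy_encoder = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=3, padding=1),
    nn.ELU(),
    nn.Conv1d(4, 8, kernel_size=3, padding=1),
)

activations_p = []
handles = [layer.register_forward_hook(make_hook(activations_p)) for layer in toy_encoder]

with torch.no_grad():
    toy_encoder(torch.zeros(1, 1, 16))  # [batch, channels, samples]

print([tuple(a.shape) for a in activations_p])

# Remove the hooks once the activations have been collected
for h in handles:
    h.remove()
```

Because the list is shared across hooks, activations arrive in execution order, which keeps the concatenated vector layout stable from song to song.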

Forward Pass

with torch.no_grad():
    # Encode: audio → codes
    codes, scale = compression_model.encode(audio)
    
    # Decode: codes → audio (triggers decoder hooks)
    _ = compression_model.decode(codes, scale)

Project to 512D

# Flatten each captured activation, then concatenate into one long vector
flat_p = torch.cat([a.flatten() for a in activations_p])  # Millions of numbers

# Project using fixed-seed matrix
signature_p = flat_p @ projection_matrix  # → 512D

Normalize

# L2 normalize: vector length = 1 (dim=0 because the signature is a 1-D vector)
signature_p = F.normalize(signature_p, p=2, dim=0)

# Now cosine similarity = dot product
similarity = signature_a @ signature_b  # Value: -1 to 1
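
The claim that similarity reduces to a dot product after L2 normalization can be checked with a small standard-library sketch. The helper names and the three-dimensional toy vectors are illustrative assumptions; real signatures are 512-dimensional.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
sig_a, sig_b = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point rounding (about 0.96 for these vectors)
print(dot(sig_a, sig_b), cosine(a, b))
```

Normalizing once at storage time means every later comparison is a plain dot product, which is what allows the whole database to be scored in one matrix multiplication.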

Script 3: Attribution Module

core/gradient_attribution.py — Runtime attribution interface

How It Differs

Aspect	extract_signatures.py	gradient_attribution.py
Uses	EnCodec activations	LM gradients
Speed	~100ms per song	~2-5s per song
When	Building database	Runtime analysis

💡 Gradients vs Activations

Gradients are more informative than activations because they encode sensitivity: how much the loss would change if each parameter changed.

SpinTrak Teacher Forcing (Engine Approach)

The modern LUMINA engine uses cross-entropy teacher forcing — a more robust method that extracts gradients by computing the loss between predicted and actual codebook tokens.

The Core Algorithm

# 1. Encode audio to codebook tokens
with torch.no_grad():
    codes, scale = compression_model.encode(audio_chunk)

# 2. Teacher forcing: LM predicts codes from codes
lm.zero_grad()  # Clear gradients left over from a previous chunk
lm_output = lm.compute_predictions(codes=codes, conditions=attributes)
logits = lm_output.logits  # [B, K, T, vocab]
mask = lm_output.mask      # [B, K, T]

# 3. Cross-entropy loss over valid positions only.
# Selecting with the mask (rather than multiplying by it) keeps any
# NaN logits at invalid positions from turning the whole loss into NaN.
valid = mask.reshape(-1).bool()
loss = F.cross_entropy(
    logits.reshape(-1, logits.shape[-1])[valid],
    codes.reshape(-1)[valid],
)

# 4. Backpropagate to extract gradients
loss.backward()

Chunked Processing (10s Sweet Spot)

To handle songs of arbitrary length and manage VRAM, audio is processed in 10-second chunks. Gradients are accumulated and averaged across chunks:

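CHUNK_DURATION = 10.0  # The "Sweet Spot"
chunk_samples = int(CHUNK_DURATION * sample_rate)

grads_p_sum, grads_m_sum, num_chunks = None, None, 0
for start_idx in range(0, total_samples, chunk_samples):
    chunk = audio[..., start_idx:start_idx + chunk_samples]
    # extract_chunk_gradients is a placeholder name for the teacher-forcing
    # step above; it returns one gradient tensor per hooked parameter
    chunk_grads_p, chunk_grads_m = extract_chunk_gradients(chunk)

    # Process chunk and accumulate gradients, parameter by parameter
    if grads_p_sum is None:
        grads_p_sum = [torch.zeros_like(g) for g in chunk_grads_p]
        grads_m_sum = [torch.zeros_like(g) for g in chunk_grads_m]
    for i in range(len(chunk_grads_p)):
        grads_p_sum[i] += chunk_grads_p[i]
    for i in range(len(chunk_grads_m)):
        grads_m_sum[i] += chunk_grads_m[i]
    num_chunks += 1

# Average across all chunks
grads_p = [g / num_chunks for g in grads_p_sum]
grads_m = [g / num_chunks for g in grads_m_sum]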
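
To see how the chunk boundaries and the averaging behave, here is a small standard-library sketch. The helper names `chunk_bounds` and `average_chunks` are hypothetical, and plain lists of floats stand in for gradient tensors.

```python
def chunk_bounds(total_samples, sample_rate=32000, chunk_duration=10.0):
    """Return (start, end) sample indices for each chunk; the last may be short."""
    chunk_samples = int(chunk_duration * sample_rate)
    return [
        (start, min(start + chunk_samples, total_samples))
        for start in range(0, total_samples, chunk_samples)
    ]

def average_chunks(per_chunk_values):
    """Element-wise mean over chunks; every chunk counts equally."""
    n = len(per_chunk_values)
    return [sum(vals) / n for vals in zip(*per_chunk_values)]

# A 25-second song at 32kHz yields two full chunks and one 5-second remainder
bounds = chunk_bounds(total_samples=25 * 32000)
print(bounds)

# Toy per-chunk "gradients" with two values each
print(average_chunks([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

With a plain average, a short final chunk carries the same weight as a full 10-second chunk. Whether that is desirable, or whether chunks should be weighted by length, is a design decision this document does not address.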

Channel Separation in Practice

Channel	Gradient Source	Collection Method
Channel P	Self-Attention (Q, K, V, Out)	`layer.self_attn.parameters()`
Channel M	Output Linear Projections	`lm.linears.parameters()`

✓ Why Teacher Forcing Works

By computing how well the model "would have predicted" the actual audio tokens, we get gradients that encode the model's internal representation of that song. Songs that produce similar gradient patterns share similar musical DNA.

The Math: Johnson-Lindenstrauss Projection

The Problem

Raw activations can have millions of dimensions. We need to reduce this to something manageable (512D) without losing similarity information.

The Solution: Random Projection

import math
import torch

def create_projection_matrix(input_dim, output_dim=512, seed=422024):
    """Random Gaussian projection with fixed seed."""
    torch.manual_seed(seed)  # CRITICAL: Same seed = reproducible

    # Gaussian entries scaled by 1/sqrt(output_dim), so vector norms are
    # preserved in expectation after projection
    matrix = torch.randn(input_dim, output_dim) / math.sqrt(output_dim)

    return matrix

Why It Works (Intuition)

Distances are approximately preserved, with high probability. If A and B were similar in the original space, A' and B' will be similar in the projected space.

⚠️ The Fixed Seed is Critical

jl_seed = 422024 appears throughout the codebase. If we used different random projections for database vs. query, signatures wouldn't be comparable!
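
Both points above (similarity survives projection, and the seed must match) can be checked on toy data. The sketch below uses only the standard library and much smaller dimensions (1000 → 128) than the real pipeline; the helper names and the synthetic vectors are assumptions made for illustration.

```python
import math
import random

def create_projection(input_dim, output_dim, seed):
    """Gaussian random matrix scaled by 1/sqrt(output_dim), reproducible per seed."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(output_dim)]
            for _ in range(input_dim)]

def project(vec, matrix):
    """Compute vec @ matrix for a list-of-rows matrix."""
    out = [0.0] * len(matrix[0])
    for x, row in zip(vec, matrix):
        for j, r in enumerate(row):
            out[j] += x * r
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

data_rng = random.Random(0)
a = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
b = [x + 0.5 * data_rng.gauss(0.0, 1.0) for x in a]  # a noisy copy of a

matrix = create_projection(1000, 128, seed=422024)
same_seed = create_projection(1000, 128, seed=422024)
other_seed = create_projection(1000, 128, seed=1)

# Similarity before and after projection with the shared matrix
print(cosine(a, b), cosine(project(a, matrix), project(b, matrix)))

# The same vector projected with two different seeds no longer matches itself
print(cosine(project(a, matrix), project(a, other_seed)))
```

The first line should print two nearby values, while the second should print a value near zero: a query signature built with a different seed is effectively unrelated to the database entry for the identical audio.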

First Codebook Dominance Fix

The Problem

CB1 (the first codebook) dominates because fundamental pitch has the highest energy. But for attribution, we care about all aspects equally.

The Solution: Weighted Gradients

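codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4

def _apply_codebook_weighting(self, grad, param_name):
    # Attenuate CB1 to 40%
    # Boost CB2-CB4 to 120-140%
    weighted_grad = grad.clone()  # Leave the original gradient untouched
    for cb_idx in range(4):
        # section: the slice of grad that belongs to codebook cb_idx
        weighted_grad[section] *= codebook_boosts[cb_idx]
    return weighted_grad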

✓ Effect

Timbral and textural similarities are now properly weighted, preventing fundamental pitch from drowning out production style influences.

End-to-End Pipeline

Training Time: Building the Database

Input: 99 Training Songs (WAV)
Output: signatures.h5 (~400KB)

Load Audio

Load WAV files, resample to 32kHz, and convert to mono.

EnCodec Forward Pass

Run audio through encode/decode with forward hooks attached.

Collect Activations

Capture raw signals from the EnCodec encoder (Channel P) and decoder (Channel M).

Johnson-Lindenstrauss Projection

Project millions of dimensions down to 512 using fixed seed 422024.

Normalize & Store

L2-normalize vectors and save to HDF5 database with Song IDs.
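
The ~400KB figure quoted for signatures.h5 is consistent with a quick back-of-the-envelope check. This sketch assumes signatures are stored as 32-bit floats and ignores song IDs and HDF5 overhead; neither detail is stated in this document.

```python
NUM_SONGS = 99
NUM_CHANNELS = 2       # signatures_p and signatures_m
SIGNATURE_DIM = 512
BYTES_PER_VALUE = 4    # assumption: float32 storage

raw_bytes = NUM_SONGS * NUM_CHANNELS * SIGNATURE_DIM * BYTES_PER_VALUE
print(raw_bytes, "bytes, about", round(raw_bytes / 1024), "KiB")
```

Under these assumptions the raw signature data comes to 405,504 bytes, so the database grows by roughly 4 KiB per song.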

Inference Time: Attribution

Input: Generated Audio
Output: Attribution Report (~0.8ms)

Extract Signature

Run the same extraction pipeline on the new audio to get 512D vectors.

Similarity Search

Compute cosine similarity (a dot product, since signatures are L2-normalized) against the entire 99-song database in a single matrix multiplication.

Thresholding

Apply 1σ (4.4%) threshold. Calculate LIP (LUMINA Influence Potency).

Share Calculation

Potency-weighted attribution shares: Share = LIP / Σ(LIP).
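
The search, thresholding, and share steps can be sketched as follows. How LIP is computed from similarity is not specified in this document, so the sketch takes LIP values as given inputs; the function names, the toy two-dimensional signatures, and the threshold value are all hypothetical.

```python
def similarity_search(query, database):
    """Dot product of the query signature against every stored signature."""
    return {
        song_id: sum(q * s for q, s in zip(query, sig))
        for song_id, sig in database.items()
    }

def apply_threshold(scores, threshold):
    """Keep only songs whose score reaches the threshold."""
    return {song_id: s for song_id, s in scores.items() if s >= threshold}

def shares_from_lip(lip_by_song):
    """Share = LIP / sum(LIP); returns an empty dict if there is nothing to share."""
    total = sum(lip_by_song.values())
    if total <= 0:
        return {}
    return {song_id: lip / total for song_id, lip in lip_by_song.items()}

# Toy unit-length signatures (real ones are 512-dimensional)
database = {"song_a": [1.0, 0.0], "song_b": [0.6, 0.8], "song_c": [0.0, 1.0]}
scores = similarity_search([1.0, 0.0], database)

# 0.5 is an arbitrary cutoff for this toy data, not the 1σ threshold
kept = apply_threshold(scores, threshold=0.5)
print(sorted(kept))

# Hypothetical LIP values for the surviving songs
print(shares_from_lip({"song_a": 3.0, "song_b": 1.0}))
```

Because the shares are normalized by the sum of LIP over surviving songs, they always add up to 1, and raising the threshold redistributes share among fewer songs.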

Reproducibility Guarantees

LUMINA is designed for legal and forensic contexts. Every step must be reproducible.

Component	Fixed Value	Why It Matters
JL Projection Seed	`422024`	Same projection matrix always
Signature Dimension	`512`	Consistent database schema
Sample Rate	`32kHz`	Matches MusicGen training

Summary

Gradients reveal influence: What the model learned from training audio can be traced back to individual songs.
Two channels for two rights: Attention → Publishing. FFN → Master.
JL projection preserves similarity: High-dim to 512D.
Fixed seed = reproducibility: Essential for legal audit.