MusicGen-Large (3.3B)

LUMINA Gradient Extraction

A comprehensive guide to understanding how LUMINA extracts signatures from audio using gradient-based techniques to determine training data influence.

📄 15 min read · 🎯 Advanced · 📅 January 2026

The Big Picture

What Problem Are We Solving?

When MusicGen generates a piece of audio, we want to answer a critical question: "Which songs from the training data influenced this output, and by how much?"

This is called attribution or influence estimation. It's essential for royalty distribution, copyright compliance, and building trust with rightsholders.

Why Is This Hard?

MusicGen is a 3.3 billion parameter neural network. When it generates audio, the output is influenced by all the training data in complex, non-linear ways. We can't simply "look inside" the model and see which songs it's thinking of.

💡 Key Insight

The solution is gradient signatures: if two pieces of audio cause similar changes (gradients) in the model's weights when processed, they are "similar" in a musically meaningful way.

Think of it like this:

  • Every song creates a unique "pattern of activation" when passed through the model
  • We capture this pattern as a signature (a 512-dimensional vector)
  • To find which training songs influenced an output, we compare signatures

Understanding the Two Channels

Why Two Channels?

Music copyright has two distinct types of rights. LUMINA separates them so we can attribute them independently:

Channel     Rights Type            What It Captures             Legal Implication
Channel P   Composition Influence  Melody, harmony, structure   Songwriting royalties
Channel M   Recording Influence    Sound, timbre, production    Recording royalties

A song could sample someone's production style (Master) without copying their melody (Publishing), or vice versa.

Technical Separation

We hook into different parts of the MusicGen architecture:

MusicGen Architecture

Transformer Language Model (48 layers)
  Attention Layers → Channel P (Composition)
    • Query, Key, Value projections
    • Learns relationships between musical elements
    • Captures: structure, harmony, note relations
  FFN & Embedding Layers → Channel M (Production)
    • Feed-forward networks
    • RVQ codebooks & LayerNorm
    • Captures: timbre, texture, sound design

EnCodec (Compression Model)
  Encoder → Channel P
  Decoder → Channel M

Script 1: The Hooking System

hooks/lumina_hooker.py: low-level machinery to intercept gradients

What is a "Hook"?

In PyTorch, a hook is a callback function that gets triggered during the forward or backward pass. It lets us "spy on" what's happening inside the model without modifying its behavior.

# Simplified example: how hooks work
captured_gradients = []

def my_hook(gradient):
    """This function is called whenever a gradient flows through."""
    print(f"Caught gradient with shape: {gradient.shape}")
    captured_gradients.append(gradient.detach().clone())  # save a copy for later
    return gradient  # Pass it through unchanged

# Attach the hook to a parameter
parameter.register_hook(my_hook)
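
To see exactly when such a hook fires, here is a tiny self-contained demo using a throwaway nn.Linear layer rather than MusicGen itself; the names layer and grads are purely illustrative:

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
grads = []

# A hook registered on a parameter fires during the backward pass
layer.weight.register_hook(lambda g: grads.append(g.detach().clone()))

out = layer(torch.randn(1, 4)).sum()
out.backward()  # triggers the hook

print(grads[0].shape)  # torch.Size([2, 4]) -- same shape as layer.weight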

Selective Layer Hooking

MusicGen has 48 transformer layers. Hooking all of them would use ~40GB of VRAM.

# Which layers to hook: every 4th from 48 total = 12 layers
DEFAULT_HOOKED_LAYERS = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]

✓ Memory Efficiency

By hooking every 4th layer, we reduce VRAM usage to ~11GB while preserving enough information for accurate attribution. The signal is redundant across adjacent layers anyway.
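
As a concrete illustration (not the actual lumina_hooker.py code), attaching a hook to every parameter of just these layers might look like the sketch below; the lm.transformer.layers attribute and the attach_hooks helper are assumptions made for this example.

DEFAULT_HOOKED_LAYERS = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]

def attach_hooks(lm, hook_fn, layer_indices=DEFAULT_HOOKED_LAYERS):
    """Register a gradient hook on every parameter of the selected layers."""
    handles = []
    for idx in layer_indices:
        layer = lm.transformer.layers[idx]  # assumed layer container
        for name, param in layer.named_parameters():
            handles.append(param.register_hook(hook_fn))
    return handles  # keep the handles so the hooks can be removed later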

Channel Classification (SpinTrak-Aligned)

# Parameter names indicating Self-Attention → Channel P
CHANNEL_P_PATTERNS = [
    "self_attn.in_proj_weight",  # Combined QKV projection
    "self_attn.out_proj",        # Output projection
    "cross_attention.*",         # Cross-attention (text → audio)
]

# Parameter names indicating Output Linears → Channel M
CHANNEL_M_PATTERNS = [
    "linear1", "linear2",        # FFN layers
    "lm.linears",                # Output codebook projections
    "norm1", "norm2",            # Layer normalization
]

The intuition:

  • Attention layers learn which elements relate to each other. This is compositional: "This chord should follow that chord." That's Publishing.
  • FFN layers learn what patterns to generate. This is about sounds and textures. That's Master. (A small classification sketch follows below.)
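
Putting the two pattern lists to work, a hypothetical classifier (illustrative only, not code from the repository) could route each parameter name to a channel with simple wildcard matching, reusing the pattern lists defined above:

import fnmatch

def classify_channel(param_name):
    """Map a parameter name to 'P', 'M', or None using the patterns above."""
    for pattern in CHANNEL_P_PATTERNS:
        if fnmatch.fnmatch(param_name, f"*{pattern}*"):
            return "P"
    for pattern in CHANNEL_M_PATTERNS:
        if fnmatch.fnmatch(param_name, f"*{pattern}*"):
            return "M"
    return None  # parameter is not hooked at all

print(classify_channel("transformer.layers.4.self_attn.out_proj.weight"))  # P
print(classify_channel("transformer.layers.4.linear1.weight"))             # M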

Script 2: Signature Extraction

extract_signatures.py: process your entire catalog into a signature database

The Overall Flow

Input: training songs (WAV)  →  Output: signatures.h5 (database)

  1. Processing Loop: load audio → MusicGen forward pass → capture gradients.
  2. Projection & Normalization: project high-dimensional activations to 512D and L2-normalize.
  3. Storage: write to HDF5: signatures_p, signatures_m, and song_ids.

Step-by-Step Walkthrough

Step 1: Load the Audio

import torchaudio

def load_audio(audio_path, target_sr=32000):
    """Load and prepare audio for MusicGen."""
    waveform, sr = torchaudio.load(audio_path)

    # Resample to 32 kHz (MusicGen's sample rate)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)

    # Convert stereo to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    return waveform

Why 32kHz mono? MusicGen was trained on 32kHz mono audio. Mismatched formats would corrupt signatures.

Step 2: Hook into EnCodec

# Attach hooks to encoder (Channel P)
for layer in compression_model.encoder.model:
    layer.register_forward_hook(make_hook(activations_p))

# Attach hooks to decoder (Channel M)
for layer in compression_model.decoder.model:
    layer.register_forward_hook(make_hook(activations_m))

The encoder captures structure (Channel P); the decoder captures sound (Channel M). The make_hook factory simply stores each layer's output; a minimal version is sketched below.
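
A minimal sketch of what that hook factory might look like (the real implementation may differ):

def make_hook(storage):
    """Return a forward hook that flattens and stores a layer's output."""
    def hook(module, inputs, output):
        storage.append(output.detach().flatten().float().cpu())
    return hook

activations_p, activations_m = [], []  # filled during the forward pass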

Step 3: Forward Pass

with torch.no_grad():
    # Encode: audio → codes
    codes, scale = compression_model.encode(audio)

    # Decode: codes → audio (triggers decoder hooks)
    _ = compression_model.decode(codes, scale)
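
For context, one way the compression model and audio could be set up, assuming the audiocraft package (these API calls are our assumption, not taken from the LUMINA scripts):

import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-large")
compression_model = model.compression_model  # EnCodec encoder + decoder

audio = load_audio("my_song.wav").unsqueeze(0)  # shape [batch=1, channels=1, samples]
# ...attach the Step 2 hooks, then run the encode/decode pass shown above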

Step 4: Project to 512D

# Concatenate all activations
grad_p = torch.cat(activations_p)  # millions of numbers

# Project using fixed-seed matrix
signature_p = grad_p @ projection_matrix  # → 512D

Step 5: Normalize

import torch.nn.functional as F

# L2 normalize: vector length = 1
signature_p = F.normalize(signature_p, p=2, dim=-1)

# Now similarity = dot product!
similarity = signature_a @ signature_b  # Value: -1 to 1
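
Because every stored signature has unit length, searching a whole database is one matrix multiplication. A minimal, self-contained sketch with stand-in random signatures:

import torch
import torch.nn.functional as F

database = F.normalize(torch.randn(99, 512), dim=-1)  # stand-in for signatures.h5 rows
query = F.normalize(torch.randn(512), dim=0)          # stand-in for a new signature

similarities = database @ query           # cosine similarity per training song
top_vals, top_idx = similarities.topk(5)  # the five most influential songs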

Script 3: Attribution Module

core/gradient_attribution.py: runtime attribution interface

How It Differs

Aspect   extract_signatures.py   gradient_attribution.py
Uses     EnCodec activations     LM gradients
Speed    ~100 ms per song        ~2-5 s per song
When     Building the database   Runtime analysis

💡 Gradients vs. Activations

Gradients are more informative than activations because they encode sensitivity: how much each parameter contributed to the output.
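
To give a flavor of what "using LM gradients" means, here is a sketch under our own assumptions (a simplified single-codebook view, not the gradient_attribution.py implementation): score the generated token sequence with the language model, backpropagate a next-token loss, and read the gradients off the hooked parameters.

import torch
import torch.nn.functional as F

def extract_lm_gradient_vector(lm, tokens, hooked_params):
    """Backprop a next-token loss and concatenate the hooked parameters' gradients.

    lm: a causal LM mapping token ids -> logits [batch, seq, vocab] (assumed interface)
    tokens: LongTensor [batch, seq] of generated audio codes
    hooked_params: parameters selected by the hooking system
    """
    lm.zero_grad()
    logits = lm(tokens[:, :-1])               # predict each next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [batch*(seq-1), vocab]
        tokens[:, 1:].reshape(-1),            # [batch*(seq-1)]
    )
    loss.backward()                           # populates param.grad
    return torch.cat([p.grad.detach().flatten() for p in hooked_params])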

The Math: Johnson-Lindenstrauss Projection

The Problem

Raw gradient and activation vectors can have millions of dimensions. We need to reduce them to something manageable (512D) without losing similarity information.

The Solution: Random Projection

import numpy as np
import torch

def create_projection_matrix(input_dim, output_dim=512, seed=422024):
    """Random Gaussian projection with fixed seed."""
    torch.manual_seed(seed)  # CRITICAL: Same seed = reproducible

    # Random Gaussian entries, scaled so lengths are preserved on average
    matrix = torch.randn(input_dim, output_dim) / np.sqrt(output_dim)

    return matrix

Why It Works (Intuition)

Distances are preserved. The Johnson-Lindenstrauss lemma guarantees that a random projection into a few hundred dimensions keeps pairwise distances nearly unchanged with high probability, so if A and B were similar in the original space, A' and B' will be similar in the projected space.
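
A quick empirical check of that claim (purely illustrative, using random vectors instead of real activations):

import torch

torch.manual_seed(0)
a = torch.randn(100_000)
b = a + 0.1 * torch.randn(100_000)   # b is close to a in the original space

proj = create_projection_matrix(100_000, 512)
a_p, b_p = a @ proj, b @ proj

# The relative distance barely changes after projection
print(((a - b).norm() / a.norm()).item())        # original space
print(((a_p - b_p).norm() / a_p.norm()).item())  # projected space (≈ the same)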

โš ๏ธ The Fixed Seed is Critical

jl_seed = 422024 appears throughout the codebase. If we used different random projections for database vs. query, signatures wouldn't be comparable!

First Codebook Dominance Fix

The Problem

The first RVQ codebook (CB1) dominates the gradients because the fundamental pitch it encodes carries the highest energy. But for attribution, we care about all aspects of the sound equally.

The Solution: Weighted Gradients

# Per-codebook weights: attenuate CB1 to 40%, boost CB2-CB4 to 120-140%
codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4

def _apply_codebook_weighting(self, grad, param_name):
    """Rescale each codebook's slice of the gradient (assumes four equal sections)."""
    weighted_grad = grad.clone()
    section_size = grad.shape[0] // 4
    for cb_idx in range(4):
        section = slice(cb_idx * section_size, (cb_idx + 1) * section_size)
        weighted_grad[section] *= codebook_boosts[cb_idx]
    return weighted_grad

✓ Effect

Timbral and textural similarities are now properly weighted, preventing fundamental pitch from drowning out production style influences.

End-to-End Pipeline

Training Time: Building the Database

Input: 99 training songs (WAV)  →  Output: signatures.h5 (~400 KB)

  1. Load Audio: load the WAV files, resample to 32 kHz, and convert to mono.
  2. EnCodec Forward Pass: run the audio through encode/decode with the forward hooks attached.
  3. Collect Activations: capture raw signals from the encoder (Channel P) and the decoder (Channel M).
  4. Johnson-Lindenstrauss Projection: project millions of dimensions down to 512 using fixed seed 422024.
  5. Normalize & Store: L2-normalize the vectors and save them to the HDF5 database with song IDs.
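
Step 5's storage could be as simple as the following h5py sketch (the dataset names match those mentioned earlier; the exact schema is an assumption):

import h5py
import numpy as np

def save_signatures(path, song_ids, signatures_p, signatures_m):
    """Write the signature database to an HDF5 file (signatures as float arrays)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("signatures_p", data=np.asarray(signatures_p, dtype=np.float32))
        f.create_dataset("signatures_m", data=np.asarray(signatures_m, dtype=np.float32))
        f.create_dataset("song_ids", data=np.array(song_ids, dtype="S"))  # byte strings

# save_signatures("signatures.h5", ids, sigs_p, sigs_m)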

Inference Time: Attribution

Input: generated audio  →  Output: attribution report (~0.8 ms)

  1. Extract Signature: run the same extraction pipeline on the new audio to get 512D vectors.
  2. Similarity Search: compute cosine similarity (a single matrix multiplication) against the entire 99-song database.
  3. Thresholding: apply the 1σ (4.4%) threshold and calculate LIP (LUMINA Influence Potency).
  4. Share Calculation: compute potency-weighted attribution shares: Share = LIP / Σ(LIP).
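
A sketch of steps 3 and 4, reading the 4.4% value as a fixed similarity threshold and LIP as the above-threshold similarity (our interpretation of the description, not code from the repository):

import torch

def attribution_shares(similarities, threshold=0.044):
    """Turn a [num_songs] similarity vector into per-song attribution shares."""
    lip = similarities * (similarities > threshold)  # zero out sub-threshold songs
    total = lip.sum()
    if total == 0:
        return torch.zeros_like(similarities)  # nothing crosses the threshold
    return lip / total                         # Share = LIP / Σ(LIP)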

Reproducibility Guarantees

LUMINA is designed for legal and forensic contexts. Every step must be reproducible.

Component             Fixed Value   Why It Matters
JL Projection Seed    422024        Same projection matrix every run
Signature Dimension   512           Consistent database schema
Sample Rate           32 kHz        Matches MusicGen training

Summary

  1. Gradients reveal influence: what the model "learned" from each piece of audio can be traced back through its gradients.
  2. Two channels for two rights: Attention โ†’ Publishing. FFN โ†’ Master.
  3. JL projection preserves similarity: High-dim to 512D.
  4. Fixed seed = reproducibility: Essential for legal audit.