The Big Picture
What Problem Are We Solving?
When MusicGen generates a piece of audio, we want to answer a critical question: "Which songs from the training data influenced this output, and by how much?"
This is called attribution or influence estimation. It's essential for royalty distribution, copyright compliance, and building trust with rightsholders.
Why Is This Hard?
MusicGen is a 3.3 billion parameter neural network. When it generates audio, the output is influenced by all the training data in complex, non-linear ways. We can't simply "look inside" the model and see which songs it's thinking of.
The solution is gradient fingerprinting: if two pieces of audio cause similar changes (gradients) in the model's weights when processed, they are "similar" in a musically meaningful way.
Think of it like this:
- Every song creates a unique "pattern of activation" when passed through the model
- We capture this pattern as a fingerprint (a 216-dimensional vector per channel, 432D total)
- To find which training songs influenced an output, we compare signatures
Understanding the Two Channels
Why Two Channels?
Music copyright has two distinct types of rights. LUMINA separates them so we can attribute them independently:
| Channel | Rights Type | What It Captures | Legal Implication |
|---|---|---|---|
| Channel P | Composition Influence | Melody, harmony, structure | Songwriting royalties |
| Channel M | Recording Influence | Sound, timbre, production | Recording royalties |
A song could sample someone's production style (Master) without copying their melody (Publishing), or vice versa.
Technical Separation
We hook into different parts of the MusicGen architecture:
- Channel P (Publishing): layers 42–47 (upper transformer)
  - 6 tensors: self_attn.in_proj, self_attn.out_proj, cross_attn.in_proj, cross_attn.out_proj, linear1, linear2
  - Capture: Structure, Harmony, Melody → 216D
- Channel M (Master): layers 12–17 (lower transformer)
  - 6 tensors: linear1, linear2, self_attn.in_proj, self_attn.out_proj, cross_attn.in_proj, cross_attn.out_proj
  - Capture: Timbre, Texture, Sound Design → 216D
Script 1: The Hooking System
hooks/lumina_hooker.py: low-level machinery to intercept gradients
What is a "Hook"?
In PyTorch, a hook is a callback function that gets triggered during the forward or backward pass. It lets us "spy on" what's happening inside the model without modifying its behavior.
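# Simplified example: how hooks work
import torch

captured = []  # gradients saved for later inspection

def my_hook(gradient):
    """This function is called whenever a gradient flows through."""
    print(f"Caught gradient with shape: {gradient.shape}")
    captured.append(gradient.detach().clone())  # save a copy for later
    return gradient  # Pass it through unchanged

# Attach the hook to a parameter (a small stand-in tensor here)
parameter = torch.nn.Parameter(torch.randn(4))
parameter.register_hook(my_hook)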
Selective Layer Hooking
MusicGen has 48 transformer layers. Hooking all of them would use ~40GB of VRAM.
# Production layer ranges for channel separation
CHANNEL_P_LAYERS = list(range(42, 48)) # Upper layers: composition
CHANNEL_M_LAYERS = list(range(12, 18)) # Lower layers: production
By hooking only layers 42–47 for P and 12–17 for M (12 total layers, 6 tensors each), we reduce VRAM usage to ~11GB while capturing the most informative gradient signals. 6 summary statistics per tensor × 6 tensors × 6 layers = 216D per channel.
Channel Classification (Layer-Range Based)
# v2 Architecture: Both channels use ALL 6 tensors
# Differentiated by layer range, not tensor type
ALL_TENSORS = [
    "self_attn.in_proj_weight",  # Self-attention input
    "self_attn.out_proj",        # Self-attention output
    "cross_attention.in_proj",   # Cross-attention input
    "cross_attention.out_proj",  # Cross-attention output
    "linear1",                   # FFN first layer
    "linear2",                   # FFN second layer
]
# Channel separation by layer range
CHANNEL_P_LAYERS = range(42, 48)  # Upper: composition
CHANNEL_M_LAYERS = range(12, 18)  # Mid: production
The intuition:
- Upper layers (42–47) encode abstract compositional decisions: melody, harmony, structure. All tensor types in these layers contribute to Publishing.
- Mid layers (12–17) encode concrete production characteristics: timbre, texture, sound design. All tensor types in these layers contribute to Master.
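The layer-range rule above can be sketched as a small classifier over parameter names. This is only an illustration: the name layout matched by the regular expression (for example `transformer.layers.43.linear1.weight`) and the helper names `classify_parameter` and `signature_dim` are assumptions, not taken from lumina_hooker.py.

```python
import re
from typing import Optional

CHANNEL_P_LAYERS = range(42, 48)  # upper layers: composition (Publishing)
CHANNEL_M_LAYERS = range(12, 18)  # mid layers: production (Master)
TENSORS_PER_LAYER = 6             # the six tensor types hooked in each layer
STATS_PER_TENSOR = 6              # summary statistics per gradient tensor

# Assumed naming scheme: "<prefix>.layers.<index>.<tensor name>"
_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def classify_parameter(name: str) -> Optional[str]:
    """Return "P", "M", or None if the parameter is outside both ranges."""
    match = _LAYER_RE.search(name)
    if match is None:
        return None
    layer = int(match.group(1))
    if layer in CHANNEL_P_LAYERS:
        return "P"
    if layer in CHANNEL_M_LAYERS:
        return "M"
    return None


def signature_dim(layers: range) -> int:
    """Signature size: layers x tensors per layer x statistics per tensor."""
    return len(layers) * TENSORS_PER_LAYER * STATS_PER_TENSOR
```

With these constants, `signature_dim(CHANNEL_P_LAYERS)` reproduces the 216D figure for one channel, and adding the M channel gives the 432D total.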
Script 2: Signature Extraction
extract_signatures.py: process your entire catalog into a database
The Overall Flow
The extraction run produces signatures_p, signatures_m, and song_ids.
Step-by-Step Walkthrough
Load the Audio
import torchaudio

def load_audio(audio_path, target_sr=32000):
    """Load and prepare audio for MusicGen."""
    waveform, sr = torchaudio.load(audio_path)
    # Resample to 32kHz (MusicGen's sample rate)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    # Convert stereo to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform
Why 32kHz mono? MusicGen was trained on 32kHz mono audio. Mismatched formats would corrupt signatures.
Capture Gradients from Transformer LM Layers
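# Teacher forcing: LM predicts codes from codes
lm_output = lm.compute_predictions(codes=codes, conditions=attrs)
logits, mask = lm_output.logits, lm_output.mask  # [B, K, T, vocab], [B, K, T]
# cross_entropy expects [N, vocab] logits and [N] targets; keep valid positions only
valid = mask.reshape(-1).bool()
loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1])[valid], codes.reshape(-1)[valid])
loss.backward()
# Channel P: gradients from layers 42-47 (upper transformer)
# Channel M: gradients from layers 12-17 (mid transformer)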
Upper layers capture composition (melody, harmony), mid layers capture production (timbre, texture).
Forward Pass
with torch.no_grad():
    # Encode: audio → codes
    codes, scale = compression_model.encode(audio)
    # Decode: codes → audio (triggers decoder hooks)
    _ = compression_model.decode(codes, scale)
Compute Gradient Statistics → 216D
# For each tensor in each layer, compute 6 stats
stats = [
    grad.mean(),
    grad.std(),
    torch.norm(grad, p=2),  # L2 norm
    grad.max(),
    grad.min(),
    skewness(grad),
]
# Channel P: 6 layers × 6 tensors × 6 stats = 216D
signature_p = torch.tensor(all_p_stats)
Normalize
# L2 normalize: vector length = 1
# dim=0 because the signature is a 1-D vector (the default dim=1 would raise an error)
signature_p = F.normalize(signature_p, p=2, dim=0)
# Now similarity = dot product!
similarity = signature_a @ signature_b  # Value: -1 to 1
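The normalize-then-dot-product comparison above extends naturally to ranking a whole catalog against one query. The sketch below is written in plain Python so it runs without PyTorch; `l2_normalize` and `top_matches` are hypothetical helper names, and a real database lookup would more likely use a single matrix product over stored, pre-normalized signatures.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def top_matches(
    query: Sequence[float],
    catalog: Sequence[Sequence[float]],
    song_ids: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k catalog songs with the highest cosine similarity to query."""
    q = l2_normalize(query)
    scored = []
    for song_id, signature in zip(song_ids, catalog):
        s = l2_normalize(signature)
        # Dot product of unit vectors equals cosine similarity, in [-1, 1]
        scored.append((song_id, sum(a * b for a, b in zip(q, s))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because every vector is normalized first, the scores are comparable across songs regardless of the raw gradient magnitudes.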
Script 3: Attribution Module
core/gradient_attribution.py: runtime attribution interface
How It Differs
| Aspect | extract_signatures.py | gradient_attribution.py |
|---|---|---|
| Uses | EnCodec activations | LM gradients |
| Speed | ~100ms per song | ~2-5s per song |
| When | Building database | Runtime analysis |
Gradients are more informative than activations because they encode sensitivity: how much each parameter contributed to the output.
LUMINA-WTA Teacher Forcing (Engine Approach)
The modern LUMINA engine uses cross-entropy teacher forcing, a more robust method that extracts gradients by computing the loss between predicted and actual codebook tokens.
The Core Algorithm
# 1. Encode audio to codebook tokens
with torch.no_grad():
    codes, scale = compression_model.encode(audio_chunk)
# 2. Teacher forcing: LM predicts codes from codes
lm_output = lm.compute_predictions(codes=codes, conditions=attributes)
logits = lm_output.logits  # [B, K, T, vocab]
mask = lm_output.mask  # [B, K, T]
# 3. Cross-entropy loss, averaged over valid (unmasked) positions.
#    Selecting by mask instead of multiplying by it keeps any non-finite
#    logits at masked positions out of the loss.
valid = mask.reshape(-1).bool()
loss = F.cross_entropy(
    logits.reshape(-1, logits.shape[-1])[valid],
    codes.reshape(-1)[valid],
)
# 4. Backpropagate to extract gradients
loss.backward()
Chunked Processing (10s Sweet Spot)
To handle songs of arbitrary length and manage VRAM, audio is processed in 10-second chunks. Gradients are accumulated and averaged across chunks:
CHUNK_DURATION = 10.0  # seconds; the "Sweet Spot"
chunk_samples = int(CHUNK_DURATION * sample_rate)
num_chunks = 0
for start_idx in range(0, total_samples, chunk_samples):
    # Process chunk and accumulate gradients, tensor by tensor
    for i in range(len(grads_p_sum)):
        grads_p_sum[i] += chunk_grads_p[i]
    for i in range(len(grads_m_sum)):
        grads_m_sum[i] += chunk_grads_m[i]
    num_chunks += 1
# Average across all chunks
grads_p = [g / num_chunks for g in grads_p_sum]
grads_m = [g / num_chunks for g in grads_m_sum]
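To make the chunk boundaries concrete, here is a minimal sketch of how a song could be split into 10-second windows. `chunk_bounds` is a hypothetical helper, and keeping a partial final chunk (rather than dropping or padding it) is an assumption; the text above does not say how the engine treats the tail.

```python
from typing import List, Tuple


def chunk_bounds(
    total_samples: int,
    sample_rate: int,
    chunk_duration: float = 10.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive chunks.

    The last chunk is kept even if it is shorter than chunk_duration.
    """
    chunk_samples = int(chunk_duration * sample_rate)
    if total_samples < 0 or chunk_samples < 1:
        raise ValueError("total_samples must be >= 0 and chunks at least 1 sample long")
    return [
        (start, min(start + chunk_samples, total_samples))
        for start in range(0, total_samples, chunk_samples)
    ]
```

For a 25-second clip at 32kHz this gives two full chunks and one 5-second chunk. Since the accumulation divides by `num_chunks`, that short chunk would count as much as a full one in the average, which may be worth keeping in mind for very short tails.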
Channel Separation in Practice
| Channel | Gradient Source | Collection Method |
|---|---|---|
| Channel P | All 6 tensor types in layers 42–47 | layer.self_attn + cross_attn + linear1 + linear2 (6 tensors per layer) |
| Channel M | All 6 tensor types in layers 12–17 | layer.self_attn + cross_attn + linear1 + linear2 (6 tensors per layer) |
By computing how well the model "would have predicted" the actual audio tokens, we get gradients that encode the model's internal representation of that song. Songs that produce similar gradient patterns share similar musical DNA.
The Math: Gradient Summary Statistics
The Problem
Raw gradients can be millions of dimensions. We reduce this to 216D per channel using 6 summary statistics computed per tensor per layer.
The Solution: Statistical Fingerprinting
def compute_gradient_stats(grad_tensor):
    """Compute 6 summary statistics from a gradient tensor."""
    stats = [
        grad_tensor.mean().item(),
        grad_tensor.std().item(),
        grad_tensor.norm(2).item(),  # L2 norm
        grad_tensor.max().item(),
        grad_tensor.min().item(),
        skewness(grad_tensor).item(),  # Skew
    ]
    return stats
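The `skewness` helper called above is not defined in this section. One plausible definition is the mean of cubed z-scores (population skewness); that choice is an assumption, as is everything else in the sketch below, which computes the same six statistics on a flat list of floats in plain Python so it can be checked by hand.

```python
import math
import statistics
from typing import List, Sequence


def skewness(values: Sequence[float]) -> float:
    """Population skewness: mean of cubed z-scores (assumed definition)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return 0.0  # a constant tensor has no asymmetry
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)


def gradient_stats(values: Sequence[float]) -> List[float]:
    """Six summary statistics of a flattened gradient (needs >= 2 values)."""
    return [
        statistics.fmean(values),
        statistics.stdev(values),              # sample std, like torch.std's default
        math.sqrt(sum(v * v for v in values)),  # L2 norm
        max(values),
        min(values),
        skewness(values),
    ]
```

A symmetric input such as `[1.0, 2.0, 3.0, 4.0]` has zero skew, while a single large outlier pushes the skew positive, which is the kind of shape information the mean and standard deviation alone would miss.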
Why It Works (Intuition)
Similarity is approximately preserved. The 6 summary statistics capture the essential distributional properties of each gradient tensor. Combined across 6 layers × 6 tensors, this produces a 216D fingerprint that is both compact and highly discriminative.
The noise floor is calibrated empirically using 50 GTZAN control tracks (5 per genre). Cosine similarities between control and training fingerprints determine the true μ and σ of the noise distribution, replacing the theoretical 1/√d approximation.
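A minimal sketch of that calibration step: given the cosine similarities measured between control tracks and training fingerprints, estimate the mean and standard deviation of the noise distribution, then express any new similarity as a z-score against it. The function names are hypothetical, and using the sample standard deviation and a z-score are assumptions; the text does not state how the estimates are consumed downstream.

```python
import statistics
from typing import Sequence, Tuple


def calibrate_noise_floor(control_similarities: Sequence[float]) -> Tuple[float, float]:
    """Estimate (mean, std) of similarities between unrelated tracks."""
    mu = statistics.fmean(control_similarities)
    sigma = statistics.stdev(control_similarities)  # sample std; needs >= 2 values
    return mu, sigma


def z_score(similarity: float, mu: float, sigma: float) -> float:
    """How many noise standard deviations a similarity sits above the mean."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return (similarity - mu) / sigma
```

A similarity that sits several standard deviations above the control mean is unlikely to be explained by chance alignment alone.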
First Codebook Dominance Fix
The Problem
CB1 (the first codebook) dominates because fundamental pitch has the highest energy. But for attribution, we care about all aspects equally.
The Solution: Weighted Gradients
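codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4

def _apply_codebook_weighting(self, grad, param_name):
    # Attenuate CB1 to 40%
    # Boost CB2-CB4 to 120-140%
    weighted_grad = grad.clone()  # leave the original gradient untouched
    for cb_idx in range(4):
        # `section` selects the part of the gradient that belongs to
        # codebook cb_idx (its computation is not shown here)
        weighted_grad[section] *= codebook_boosts[cb_idx]
    return weighted_grad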
Timbral and textural similarities are now properly weighted, preventing fundamental pitch from drowning out production style influences.
End-to-End Pipeline
Training Time: Building the Database
Inference Time: Attribution
Share = excess / Σ(excess).
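The share formula can be illustrated with a short sketch. It assumes that "excess" means a song's similarity minus a noise-floor value, clamped at zero so that songs at or below the floor receive nothing; that definition, the `noise_floor` parameter, and the function name are assumptions for illustration only.

```python
from typing import Dict


def attribution_shares(similarities: Dict[str, float], noise_floor: float) -> Dict[str, float]:
    """Split attribution in proportion to each song's excess over the noise floor."""
    # Excess: similarity above the floor, never negative (assumed definition)
    excess = {sid: max(0.0, sim - noise_floor) for sid, sim in similarities.items()}
    total = sum(excess.values())
    if total == 0.0:
        # Nothing rises above the noise floor: no song is attributed
        return {sid: 0.0 for sid in excess}
    return {sid: value / total for sid, value in excess.items()}
```

For similarities of 0.5, 0.3 and 0.1 against a floor of 0.2, the excesses are 0.3, 0.1 and 0, so the first song receives three quarters of the attribution and the second one quarter.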
Reproducibility Guarantees
LUMINA is designed for legal and forensic contexts. Every step must be reproducible.
| Component | Fixed Value | Why It Matters |
|---|---|---|
| JL Projection Seed | 422024 | Same projection matrix always |
| Signature Dimension | 216 per channel (432 total) | Consistent database schema |
| Sample Rate | 32kHz | Matches MusicGen training |
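To show why a fixed seed matters, the sketch below builds a Gaussian random projection matrix from a seeded generator, so every run produces the identical matrix. Only the seed value comes from the table; the dimensions, the scaling, and the use of Python's `random` module are assumptions, and this section does not say where in the pipeline the projection is applied.

```python
import math
import random
from typing import List

JL_PROJECTION_SEED = 422024  # fixed value from the table above


def projection_matrix(in_dim: int, out_dim: int, seed: int = JL_PROJECTION_SEED) -> List[List[float]]:
    """Build an out_dim x in_dim Gaussian projection matrix from a fixed seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
```

Calling `projection_matrix(8, 4)` twice returns equal matrices, while changing the seed changes every entry. Reproducibility across library versions or hardware is a separate concern that a seed alone does not settle.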
Summary
- Gradients reveal influence: What the model learned from training audio can be traced back.
- Two channels for two rights: Upper layers (42–47) → Publishing. Mid layers (12–17) → Master.
- 6 gradient stats preserve similarity: High-dim to 216D per channel.
- Fixed seed = reproducibility: Essential for legal audit.