The Big Picture
What Problem Are We Solving?
When MusicGen generates a piece of audio, we want to answer a critical question: "Which songs from the training data influenced this output, and by how much?"
This is called attribution or influence estimation. It's essential for royalty distribution, copyright compliance, and building trust with rightsholders.
Why Is This Hard?
MusicGen is a 3.3 billion parameter neural network. When it generates audio, the output is influenced by all the training data in complex, non-linear ways. We can't simply "look inside" the model and see which songs it's thinking of.
The solution is gradient signatures: if two pieces of audio cause similar changes (gradients) in the model's weights when processed, they are "similar" in a musically meaningful way.
Think of it like this:
- Every song creates a unique "pattern of activation" when passed through the model
- We capture this pattern as a signature (a 512-dimensional vector)
- To find which training songs influenced an output, we compare signatures
Understanding the Two Channels
Why Two Channels?
Music copyright has two distinct types of rights. LUMINA separates them so we can attribute them independently:
| Channel | Rights Type | What It Captures | Legal Implication |
|---|---|---|---|
| Channel P | Composition Influence | Melody, harmony, structure | Songwriting royalties |
| Channel M | Recording Influence | Sound, timbre, production | Recording royalties |
A song could sample someone's production style (Master) without copying their melody (Publishing), or vice versa.
Technical Separation
We hook into different parts of the MusicGen architecture:
- Channel P (Publishing): the Query, Key, and Value projections, which learn relationships between musical elements. Captures: structure, harmony, note relations.
- Channel M (Master): the feed-forward networks, RVQ codebooks, and LayerNorm, which shape the generated sound. Captures: timbre, texture, sound design.
Script 1: The Hooking System
hooks/lumina_hooker.py – Low-level machinery to intercept gradients
What is a "Hook"?
In PyTorch, a hook is a callback function that gets triggered during the forward or backward pass. It lets us "spy on" what's happening inside the model without modifying its behavior.
```python
# Simplified example: How hooks work
import torch

captured = []  # Gradients saved by the hook land here

def my_hook(gradient):
    """This function is called whenever a gradient flows through."""
    print(f"Caught gradient with shape: {gradient.shape}")
    captured.append(gradient.detach().clone())  # Save a copy for later
    return gradient  # Pass it through unchanged

# Attach the hook to a parameter
parameter = torch.nn.Parameter(torch.randn(4, 4))
parameter.register_hook(my_hook)
```
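To see it fire, run any backward pass that touches the parameter:

```python
loss = (parameter ** 2).sum()
loss.backward()  # Prints: Caught gradient with shape: torch.Size([4, 4])
```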
Selective Layer Hooking
MusicGen has 48 transformer layers. Hooking all of them would use ~40GB of VRAM.
```python
# Which layers to hook: every 4th from 48 total = 12 layers
DEFAULT_HOOKED_LAYERS = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44]
```
By hooking every 4th layer, we reduce VRAM usage to ~11GB while preserving enough information for accurate attribution. The signal is redundant across adjacent layers anyway.
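A minimal sketch of the layer filter, assuming parameter paths contain `layers.<idx>.` as in typical PyTorch transformer stacks (the real MusicGen module paths may differ):

```python
import re

def should_hook(param_name, hooked_layers=frozenset(DEFAULT_HOOKED_LAYERS)):
    """True if the parameter lives in one of the 12 hooked layers."""
    match = re.search(r"layers\.(\d+)\.", param_name)
    return bool(match) and int(match.group(1)) in hooked_layers

print(should_hook("transformer.layers.8.self_attn.out_proj.weight"))  # True
print(should_hook("transformer.layers.9.self_attn.out_proj.weight"))  # False
```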
Channel Classification (SpinTrak-Aligned)
```python
# Parameter names indicating Self-Attention → Channel P
CHANNEL_P_PATTERNS = [
    "self_attn.in_proj_weight",  # Combined QKV projection
    "self_attn.out_proj",        # Output projection
    "cross_attention.*",         # Cross-attention (text → audio)
]

# Parameter names indicating Output Linears → Channel M
CHANNEL_M_PATTERNS = [
    "linear1", "linear2",  # FFN layers
    "lm.linears",          # Output codebook projections
    "norm1", "norm2",      # Layer normalization
]
```
The intuition:
- Attention layers learn what elements relate to each other. This is compositional: "This chord should follow that chord." That's Publishing.
- FFN layers learn what patterns to generate. This is about sounds and textures. That's Master.
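A minimal routing helper built on these patterns (a sketch, not the actual lumina_hooker.py logic; fnmatch handles the cross_attention.* wildcard):

```python
import fnmatch

def classify_channel(param_name):
    """Route a parameter name to Channel P, Channel M, or neither."""
    if any(fnmatch.fnmatch(param_name, f"*{pat}*") for pat in CHANNEL_P_PATTERNS):
        return "P"
    if any(fnmatch.fnmatch(param_name, f"*{pat}*") for pat in CHANNEL_M_PATTERNS):
        return "M"
    return None  # Not part of either signature

print(classify_channel("transformer.layers.4.self_attn.out_proj.weight"))  # P
print(classify_channel("transformer.layers.4.linear1.weight"))             # M
```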
Script 2: Signature Extraction
extract_signatures.py – Process your entire catalog into a database
The Overall Flow
Each catalog song is loaded and resampled, passed through the hooked EnCodec model, and its captured activations are projected to 512D and normalized. The database stores three aligned arrays: signatures_p, signatures_m, and song_ids.
Step-by-Step Walkthrough
Load the Audio
```python
import torchaudio

def load_audio(audio_path, target_sr=32000):
    """Load and prepare audio for MusicGen."""
    waveform, sr = torchaudio.load(audio_path)
    # Resample to 32kHz (MusicGen's sample rate)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    # Convert stereo to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform
```
Why 32kHz mono? MusicGen was trained on 32kHz mono audio. Mismatched formats would corrupt signatures.
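For example (hypothetical path):

```python
waveform = load_audio("catalog/song_0001.wav")
print(waveform.shape)  # e.g. torch.Size([1, 960000]) for a 30s clip at 32kHz
```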
Hook into EnCodec
```python
activations_p, activations_m = [], []

def make_hook(storage):
    """Forward hook that stashes this layer's flattened output."""
    return lambda module, inputs, output: storage.append(output.detach().flatten())

# Attach hooks to encoder (Channel P)
for layer in compression_model.encoder.model:
    layer.register_forward_hook(make_hook(activations_p))

# Attach hooks to decoder (Channel M)
for layer in compression_model.decoder.model:
    layer.register_forward_hook(make_hook(activations_m))
```
Encoder captures structure, decoder captures sound.
Forward Pass
```python
with torch.no_grad():
    # Encode: audio → codes
    codes, scale = compression_model.encode(audio)
    # Decode: codes → audio (triggers decoder hooks)
    _ = compression_model.decode(codes, scale)
```
Project to 512D
```python
# Concatenate all activations
grad_p = torch.cat(activations_p)  # Millions of numbers

# Project using fixed-seed matrix
signature_p = grad_p @ projection_matrix  # → 512D
```
Normalize
```python
import torch.nn.functional as F

# L2 normalize: vector length = 1
signature_p = F.normalize(signature_p, p=2, dim=-1)

# Now similarity = dot product!
similarity = signature_a @ signature_b  # Value: -1 to 1
```
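With every signature normalized, attribution against the whole catalog is a single matrix multiply. A sketch, assuming signatures_db is the (N, 512) matrix built by extract_signatures.py and song_ids its row labels:

```python
scores = signatures_db @ query_signature      # (N,) cosine similarities
top_vals, top_idx = torch.topk(scores, k=10)  # 10 most similar catalog songs
for score, idx in zip(top_vals.tolist(), top_idx.tolist()):
    print(f"{song_ids[idx]}: {score:+.3f}")
```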
Script 3: Attribution Module
core/gradient_attribution.py – Runtime attribution interface
How It Differs
| Aspect | extract_signatures.py | gradient_attribution.py |
|---|---|---|
| Uses | EnCodec activations | LM gradients |
| Speed | ~100ms per song | ~2-5s per song |
| When | Building database | Runtime analysis |
Gradients are more informative than activations because they encode sensitivity: how much each parameter contributed to the output.
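A minimal sketch of the runtime path, assuming a cross-entropy loss over the generated codes (run_lm_teacher_forced is a hypothetical helper, not the actual gradient_attribution.py API):

```python
import torch
import torch.nn.functional as F

model.zero_grad()
# Hypothetical: re-run the LM on its own output to get logits vs. targets
logits, targets = run_lm_teacher_forced(model, generated_codes)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
loss.backward()  # Gradients now sit on every parameter

# Build the Channel P signature from gradients of hooked attention params
grads_p = [p.grad.flatten() for n, p in model.named_parameters()
           if should_hook(n) and classify_channel(n) == "P"]
signature_p = F.normalize(torch.cat(grads_p) @ projection_matrix, p=2, dim=-1)
```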
The Math: Johnson-Lindenstrauss Projection
The Problem
Raw activation vectors can have millions of dimensions. We need to reduce them to something manageable (512D) without losing similarity information.
The Solution: Random Projection
```python
import numpy as np
import torch

def create_projection_matrix(input_dim, output_dim=512, seed=422024):
    """Random Gaussian projection with fixed seed."""
    torch.manual_seed(seed)  # CRITICAL: Same seed = reproducible
    # Random Gaussian entries, scaled appropriately
    matrix = torch.randn(input_dim, output_dim) / np.sqrt(output_dim)
    return matrix
```
Why It Works (Intuition)
Distances are approximately preserved: if A and B were similar in the original space, their projections A′ and B′ will be similar in the projected space.
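Formally, the Johnson-Lindenstrauss lemma guarantees that for any ε ∈ (0, 1), a random projection of n points into k = O(log n / ε²) dimensions preserves every pairwise distance up to a (1 ± ε) factor:

(1 − ε)·‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)·‖u − v‖²

A quick empirical check of the intuition (illustrative dimensions, not the real activation size):

```python
import torch
import torch.nn.functional as F

a = torch.randn(10_000)
b = a + 0.5 * torch.randn(10_000)  # Correlated with a
P = create_projection_matrix(10_000)

before = F.cosine_similarity(a, b, dim=0)
after = F.cosine_similarity(a @ P, b @ P, dim=0)
print(f"before: {before:.3f}  after: {after:.3f}")  # Nearly identical
```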
The seed jl_seed = 422024 appears throughout the codebase. If we used different random projections for the database and the query, signatures wouldn't be comparable!
First Codebook Dominance Fix
The Problem
MusicGen's RVQ tokenizer represents audio with four codebooks. CB1 dominates the gradients because it encodes the fundamental pitch, which carries the highest energy. But for attribution, we care about all aspects of the sound equally.
The Solution: Weighted Gradients
```python
codebook_boosts = [0.4, 1.2, 1.3, 1.4]  # CB1, CB2, CB3, CB4

def _apply_codebook_weighting(self, grad, param_name):
    # Assumes the first dimension splits evenly into 4 per-codebook sections
    weighted_grad = grad.clone()
    section_size = grad.shape[0] // 4
    for cb_idx in range(4):
        # Attenuate CB1 to 40%; boost CB2-CB4 to 120-140%
        section = slice(cb_idx * section_size, (cb_idx + 1) * section_size)
        weighted_grad[section] *= codebook_boosts[cb_idx]
    return weighted_grad
```
Timbral and textural similarities are now properly weighted, preventing fundamental pitch from drowning out production style influences.
End-to-End Pipeline
Training Time: Building the Database
Every catalog song is processed by extract_signatures.py: load at 32kHz mono, run the hooked EnCodec forward pass, project the captured activations with the JL matrix (seed 422024), normalize, and store the result with its song_id.
Inference Time: Attribution
The generated output is analyzed by gradient_attribution.py: its gradient signatures are compared against the database, and each catalog song's royalty share is its influence score divided by the total: Share = LIP / Σ(LIP).
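As a sketch, assuming LIP is simply the non-negative similarity score per catalog song (the actual LIP definition lives elsewhere in the codebase):

```python
lip = torch.clamp(scores, min=0.0)  # Negative similarity earns no share
shares = lip / lip.sum()            # Share_i = LIP_i / Σ_j LIP_j
```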
Reproducibility Guarantees
LUMINA is designed for legal and forensic contexts. Every step must be reproducible.
| Component | Fixed Value | Why It Matters |
|---|---|---|
| JL Projection Seed | 422024 | Same projection matrix always |
| Signature Dimension | 512 | Consistent database schema |
| Sample Rate | 32kHz | Matches MusicGen training |
Summary
- Gradients reveal influence: what the model learned from a piece of audio can be traced back through its gradients.
- Two channels for two rights: Attention → Publishing. FFN → Master.
- JL projection preserves similarity: High-dim to 512D.
- Fixed seed = reproducibility: Essential for legal audit.