LUMINA Technical Due Diligence
Comprehensive documentation of statistical methods, architectural decisions, and evaluation metrics for AI music attribution.
LUMINA Architecture
End-to-end pipeline from audio generation to rightsholder attribution, leveraging gradient-based signature extraction and dual-channel analysis.
Channel P & Channel M
Attribution is computed through two complementary signal pathways, each capturing different aspects of musical influence.
Channel P (Composition)
Source: All gradients from layers 42–47 (upper transformer).
Tensors: self_attn.in/out, cross_attn.in/out, linear1, linear2.
Captures: Melody, harmony, structure → 216D (6 layers × 6 tensors × 6 stats).
Channel M (Production)
Source: All gradients from layers 12–17 (mid transformer).
Tensors: self_attn.in/out, cross_attn.in/out, linear1, linear2.
Captures: Timbre, texture, sound design → 216D (6 layers × 6 tensors × 6 stats).
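In code terms, the per-channel layout works out as follows (a minimal sketch; the constant names are illustrative, not lumina-engine identifiers):
# Illustrative constants mirroring the channel definitions above
CHANNEL_P_LAYERS = range(42, 48)   # composition: upper transformer
CHANNEL_M_LAYERS = range(12, 18)   # production: mid transformer
TENSORS = ("self_attn.in", "self_attn.out", "cross_attn.in",
           "cross_attn.out", "linear1", "linear2")
STATS = ("mean", "std", "l2_norm", "max", "min", "skew")
assert len(CHANNEL_P_LAYERS) * len(TENSORS) * len(STATS) == 216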
LUMINA-WTA Gradient Extraction
The engine uses cross-entropy teacher forcing — computing how well MusicGen would predict existing audio tokens rather than generating new audio. This provides stable, reproducible influence signatures.
# LUMINA-WTA Core Algorithm (lumina-engine)
import torch
import torch.nn.functional as F

# Encode audio to discrete tokens; the codec needs no gradients
with torch.no_grad():
    codes, _ = compression_model.encode(audio_chunk)

# Teacher forcing: the LM predicts the existing codes from the codes themselves
lm_output = lm.compute_predictions(codes=codes, conditions=attrs)
logits, mask = lm_output.logits, lm_output.mask

# Per-token cross-entropy, masked to valid positions, then averaged
loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                       codes.reshape(-1), reduction="none")
loss = (loss * mask.reshape(-1)).sum() / mask.sum()

# Backpropagate to populate the LM weight gradients
loss.backward()

# After backward(), collect 6 stats per tensor per layer for each channel:
#   Channel P: all 6 tensors from layers 42-47
#   Channel M: all 6 tensors from layers 12-17
#   stats per tensor: mean, std, L2 norm, max, min, skew
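The reduction from raw gradients to the six statistics is not spelled out above; the following is a minimal sketch of one plausible implementation (the helper names and the layers.{i}.{tensor} naming scheme are assumptions, not lumina-engine's actual API):
# Illustrative sketch: reducing weight gradients to one 216-D channel fingerprint
import torch

def grad_stats(g: torch.Tensor) -> list[float]:
    """Six summary statistics of a gradient tensor: mean, std, L2 norm, max, min, skew."""
    g = g.flatten().float()
    mean, std = g.mean(), g.std()
    skew = (((g - mean) / (std + 1e-12)) ** 3).mean()
    return [mean.item(), std.item(), g.norm().item(),
            g.max().item(), g.min().item(), skew.item()]

def channel_fingerprint(named_grads: dict[str, torch.Tensor],
                        layers: range, tensors: tuple[str, ...]) -> torch.Tensor:
    """Concatenate 6 layers x 6 tensors x 6 stats into a (216,) vector."""
    values = []
    for layer in layers:
        for name in tensors:
            values.extend(grad_stats(named_grads[f"layers.{layer}.{name}"]))
    return torch.tensor(values)
Calling channel_fingerprint with the Channel P layer range and the six tensor names would then yield that channel's 216-D vector.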
10s Chunked Processing
Audio is split into 10-second windows. Gradients are accumulated and averaged across chunks. This provides temporal stability while fitting in ~11GB VRAM.
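A minimal sketch of the chunking-and-averaging loop, assuming MusicGen's 32 kHz compression model and a hypothetical extract_chunk_gradients wrapper around the teacher-forcing pass above:
# Illustrative sketch: 10-second chunking with per-chunk fingerprint averaging
import torch

SAMPLE_RATE = 32_000               # MusicGen's compression model operates at 32 kHz
CHUNK_SAMPLES = 10 * SAMPLE_RATE   # 10-second windows

def song_fingerprint(audio: torch.Tensor) -> torch.Tensor:
    """Average the 216-D chunk fingerprints over all full 10 s windows of a track."""
    chunks = []
    for start in range(0, audio.shape[-1] - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        chunk = audio[..., start:start + CHUNK_SAMPLES]
        chunks.append(extract_chunk_gradients(chunk))   # hypothetical: -> (216,) per channel
    return torch.stack(chunks).mean(dim=0)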
Why Teacher Forcing Works
Gradients encode how the model would change to better predict each sample. Songs with similar gradients share "influence DNA" — the model represents them internally the same way.
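The comparison itself is a cosine similarity between per-channel fingerprints (a sketch; the function below is illustrative, not the engine's API):
# Illustrative sketch: comparing two 216-D gradient fingerprints
import torch
import torch.nn.functional as F

def influence_similarity(fp_generated: torch.Tensor, fp_training: torch.Tensor) -> float:
    """Cosine similarity between per-channel gradient fingerprints."""
    return F.cosine_similarity(fp_generated.unsqueeze(0), fp_training.unsqueeze(0)).item()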
Significance Thresholds
Thresholds are derived from the expected cosine-similarity distribution. The noise floor σ is calibrated empirically using 50 GTZAN control tracks (5 per genre, all outside the training data); the resulting tiers are tabulated below and sketched in code after the table.
| Threshold | Confidence | Meaning |
|---|---|---|
| < 1σ | < 68% | Indistinguishable from noise |
| ≥ 1σ | ≥ 68% | Qualified Influence |
| ≥ 2σ | ≥ 95% | High Confidence |
| ≥ 3σ | ≥ 99.7% | Definitive Proof |
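A sketch of applying the table, assuming the calibrated noise floor is available as noise_sigma (a hypothetical parameter name):
# Illustrative sketch: mapping a cosine similarity to its sigma level and tier
def classify(cos_sim: float, noise_sigma: float) -> tuple[float, str]:
    sigma_level = cos_sim / noise_sigma
    if sigma_level >= 3:
        return sigma_level, "Definitive Proof (>= 99.7%)"
    if sigma_level >= 2:
        return sigma_level, "High Confidence (>= 95%)"
    if sigma_level >= 1:
        return sigma_level, "Qualified Influence (>= 68%)"
    return sigma_level, "Indistinguishable from noise (< 68%)"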
Why The σ Rules Apply
In 216D space, random unit vectors are nearly orthogonal: their dot products follow a tight Gaussian distribution around 0 (standard deviation on the order of 1/√216 ≈ 0.068 in theory), with σ calibrated empirically from the GTZAN control tracks. This makes outlier detection robust.
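A quick Monte Carlo check of the near-orthogonality claim (purely illustrative, not part of the engine):
# Illustrative check: dot products of random 216-D unit vectors cluster around 0
import torch
import torch.nn.functional as F

d, n = 216, 100_000
a = F.normalize(torch.randn(n, d), dim=1)
b = F.normalize(torch.randn(n, d), dim=1)
dots = (a * b).sum(dim=1)
print(dots.mean().item(), dots.std().item())   # ~= 0.0 and ~= 1/sqrt(216) ~= 0.068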
Attribution Share System
Songs with cos_sim ≥ cos_threshold (empirically calibrated) qualify. Shares are proportional to the excess above the threshold: share_i = excess_i / Σ_j excess_j, where excess_i = cos_sim_i − cos_threshold.
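A sketch of the share computation under those rules (function and parameter names are illustrative):
# Illustrative sketch: attribution shares from per-song cosine similarities
def attribution_shares(cos_sims: dict[str, float], cos_threshold: float) -> dict[str, float]:
    """Qualify songs at or above the threshold; split shares by excess similarity."""
    excess = {song: sim - cos_threshold
              for song, sim in cos_sims.items() if sim >= cos_threshold}
    total = sum(excess.values())
    return {song: e / total for song, e in excess.items()} if total > 0 else {}
For example, similarities of 0.30 and 0.25 against a threshold of 0.20 give excesses of 0.10 and 0.05, i.e. shares of 2/3 and 1/3.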
System Performance
Measured on an NVIDIA H100 SXM5 (80 GB).
Validation Measures
Rigorous safeguards implemented to ensure attribution accuracy, prevent false positives, and handle edge cases.
Low-Energy Filter
Problem: Silent or low-volume segments (e.g., a cappella breaks) can produce random, high-variance gradients.
Solution: Audio segments with RMS energy below -50 dB are strictly excluded from attribution.
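A sketch of the gate, assuming samples normalized to [-1, 1] (the helper name is illustrative):
# Illustrative sketch: the -50 dB low-energy gate
import torch

def passes_energy_gate(chunk: torch.Tensor, threshold_db: float = -50.0) -> bool:
    """Reject chunks whose RMS level (dBFS, samples in [-1, 1]) is below the threshold."""
    rms = chunk.float().pow(2).mean().sqrt()
    return (20 * torch.log10(rms + 1e-12)).item() >= threshold_db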
Causal Verification
Method: "Ablation Testing". We remove the top-attributed song from the training data and regenerate.
Pass Condition: Output similarity to the removed song must drop by at least 2σ.
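In code terms, the pass condition amounts to the following check (a sketch; noise_sigma is the calibrated noise floor):
# Illustrative sketch: ablation pass condition
def ablation_passes(sim_before: float, sim_after: float, noise_sigma: float) -> bool:
    """Similarity to the removed song must drop by at least 2 sigma after regeneration."""
    return (sim_before - sim_after) >= 2 * noise_sigma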
Reproducibility
Guarantee: 100% Deterministic.
Fixed seed 422024 ensures that the same audio always produces the exact same 216D fingerprint per channel, essential for legal audits.
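A minimal sketch of what pinning the seed involves; the exact set of determinism controls LUMINA uses is not specified here, so these are standard PyTorch settings:
# Illustrative sketch: pinning the fixed seed for reproducible extraction
import random
import numpy as np
import torch

SEED = 422024

def seed_everything(seed: int = SEED) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)   # fail loudly on nondeterministic kernels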
Positive-Only Policy
Rule: Negative cosine similarity is ignored.
Rationale: "Anti-influence" (doing the opposite of a song) does not constitute copyright infringement or influence.
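In code terms, the policy is a clamp at zero before any thresholding (a sketch, not the engine's exact implementation):
# Illustrative sketch: negative similarities count as zero influence
def positive_only(cos_sim: float) -> float:
    return max(0.0, cos_sim)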