Overview
What Is This Pipeline?
This pipeline fine-tunes Meta's MusicGen, a state-of-the-art text-to-music generation model with 3.3 billion parameters, using LoRA (Low-Rank Adaptation) to specialize the model for specific music genres while keeping the base model frozen.
Instead of training all 3.3B parameters (which would require massive compute and risk catastrophic forgetting), LoRA injects small trainable adapters into the transformer's attention layers. This reduces trainable parameters by ~99.8% while still achieving strong genre specialization.
A full fine-tune of MusicGen-Large requires ~26 GB of VRAM just for parameters + gradients. With LoRA (rank 64), you train only ~2M parameters, the adapter checkpoint is ~8 MB, and training fits comfortably on a single GPU with 24 GB+ VRAM.
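The savings follow from simple per-layer arithmetic. The sketch below illustrates the idea; the dimensions used are illustrative stand-ins, not MusicGen-Large's exact layer shapes:

```python
# Per-layer LoRA arithmetic: a rank-r adapter on a d_in x d_out linear
# layer trains r * (d_in + d_out) parameters instead of d_in * d_out.
# The dimension below is illustrative, not MusicGen-Large's exact shape.
def lora_layer_params(d_in: int, d_out: int, rank: int) -> int:
    # A is (rank x d_in), B is (d_out x rank); only A and B are trained.
    return rank * (d_in + d_out)

d = 1024
full = d * d                                # frozen weight matrix
adapter = lora_layer_params(d, d, rank=16)  # trainable adapter factors
print(adapter, full)  # 32768 1048576 -- the adapter is ~3% of one layer
```

Summed over every targeted projection in every transformer block, this is where the roughly two-orders-of-magnitude reduction in trainable parameters comes from.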
Key Capabilities
- Multi-genre training: train on multiple genres simultaneously or focus on a single genre
- Stereo output: full stereo generation at a 32 kHz sample rate
- Data augmentation: built-in pitch shift, time stretch, Gaussian noise, and gain augmentation
- Automatic metadata: AI-generated text descriptions via Gemini or OpenAI for conditioning
- Vocal removal: automatic instrumental extraction using HT-Demucs
- W&B integration: real-time experiment tracking with Weights & Biases
- Docker deployment: containerized for serverless GPU training (RunPod, Lambda)
- Checkpoint management: automatic best-model saving, early stopping, and run isolation
Architecture
Pipeline Flow
The complete training pipeline flows from raw audio through preprocessing (manifest creation, vocal removal, metadata generation, segmentation) into LoRA training and checkpointing, as detailed in the sections below.
Model Architecture
Under the hood, MusicGen consists of two main components:
| Component | Role | Details |
|---|---|---|
| EnCodec | Audio tokenizer | Compresses raw audio into discrete tokens using 8 codebooks with 2048 codes each. Stereo interleaved. |
| Transformer LM | Token predictor | A 3.3B-parameter decoder-only transformer that generates audio tokens conditioned on text prompts. |
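Some quick arithmetic on the tokenizer's output, assuming EnCodec's commonly cited 50 Hz frame rate at 32 kHz (the 8-codebook figure is from the table above):

```python
# Back-of-envelope token counts for a training segment. The 50 Hz
# frame rate is the commonly cited EnCodec rate at 32 kHz; the
# 8 codebooks (stereo interleaved) come from the component table.
def token_count(duration_s: float, frame_rate_hz: int = 50, codebooks: int = 8) -> int:
    return int(duration_s * frame_rate_hz) * codebooks

print(token_count(30))  # 12000 discrete tokens per 30 s segment
```

So each 30-second training segment becomes a few thousand discrete tokens for the transformer to model, rather than nearly a million raw audio samples per channel.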
LoRA adapters are injected into the Transformer LM's attention layers: specifically the `q_proj`, `k_proj`, `v_proj`, and `out_proj` linear layers in each transformer block.
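A sketch of what that targeting means in practice; the module names below are hypothetical stand-ins for the transformer's real named modules, not output from the actual model:

```python
# Selecting which modules receive LoRA adapters by suffix match.
# Module names here are hypothetical stand-ins for the real
# transformer's named_modules() output.
TARGET_SUFFIXES = ("q_proj", "k_proj", "v_proj", "out_proj")

modules = [
    "transformer.layers.0.self_attn.q_proj",
    "transformer.layers.0.self_attn.k_proj",
    "transformer.layers.0.self_attn.v_proj",
    "transformer.layers.0.self_attn.out_proj",
    "transformer.layers.0.ffn.linear1",   # feed-forward: stays frozen
]
targets = [m for m in modules if m.endswith(TARGET_SUFFIXES)]
print(targets)  # only the four attention projections match
```

The feed-forward layers stay frozen; only the attention projections get trainable low-rank factors.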
Prerequisites
GPU Requirements
| Tier | GPU | VRAM | Batch Size | Notes |
|---|---|---|---|---|
| Recommended | H100 / A100 80GB | 80 GB | 4–8 | Fastest training, supports large batches |
| Good | A100 40GB / A6000 | 40–48 GB | 2–4 | Comfortable for most experiments |
| Minimum | RTX 4090 / 3090 | 24 GB | 1–2 | Works with gradient accumulation, slower |
Software Requirements
- Python 3.10+
- CUDA 12.1+ with cuDNN 8
- PyTorch 2.1.0+ (cu121)
- Docker (optional, for containerized training)
- NVIDIA Container Toolkit (for Docker GPU passthrough)
API Keys (Optional)
| Service | Purpose | Required? |
|---|---|---|
| Weights & Biases | Experiment tracking and visualization | Recommended |
| Google Gemini | Auto-generate text metadata for audio conditioning | Optional |
Installation
Option A: Bare Metal (Direct GPU Access)
Clone the repository
```bash
# Clone lumina-musicgen
git clone https://github.com/FoldArtists/lumina-musicgen.git
cd lumina-musicgen
```
Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.1
pip install torch==2.1.0 torchaudio==2.1.0 \
  --index-url https://download.pytorch.org/whl/cu121

# Install project dependencies
pip install -e .
```
Verify the installation
```bash
python scripts/verify_deps.py
# Should output:
# ✓ PyTorch 2.1.0+cu121
# ✓ CUDA available
# ✓ AudioCraft 1.3.0
# ✓ PEFT 0.18.1
```
Option B: Docker (Recommended for Serverless)
Build the Docker image
```bash
cd lumina-musicgen
docker build -f docker/Dockerfile.trainer \
  -t lumina-musicgen-trainer:1.0 .
```
Image size: ~10 GB. Includes all frozen dependencies from the proven H100 training environment.
Verify the build
```bash
# Dry-run test (validates GPU, imports, config)
docker run --gpus all \
  -v /path/to/your/audio:/data \
  -v /tmp/test-output:/output \
  -e DRY_RUN=true \
  lumina-musicgen-trainer:1.0
```
Data Preparation
Directory Structure
Organize your audio files by genre in subdirectories:
```
/your/audio/data/
├── blues/
│   ├── track001.wav
│   ├── track002.wav
│   └── ...
├── jazz/
│   ├── track001.wav
│   └── ...
├── rock/
│   └── ...
└── classical/
    └── ...
```
Audio Requirements
| Property | Requirement | Notes |
|---|---|---|
| Format | `.wav`, `.mp3`, or `.flac` | WAV preferred for quality |
| Duration | ≥ 10 seconds | Shorter files are auto-skipped |
| Content | Instrumental preferred | Vocals are auto-removed if present |
| Quantity | 10+ tracks per genre | More data = better generalization |
For experimentation, we provide a script to download the GTZAN dataset (1000 tracks, 10 genres):
```bash
python scripts/prepare_gtzan.py --output-dir /data/gtzan
```
Automatic Data Processing
The pipeline automatically handles these preprocessing steps during training:
- Manifest creation: scans audio files and creates `train.jsonl`/`val.jsonl` splits
- Vocal removal: uses HT-Demucs to extract instrumental stems (configurable)
- Metadata generation: creates text descriptions using Gemini AI for text-conditioning
- Segmentation: splits audio into 30-second segments at 32 kHz for training
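The manifest step can be sketched as a filter-shuffle-split; the function and field names here are illustrative, not the pipeline's actual code:

```python
import random

# Minimal sketch of manifest creation: drop tracks under min_duration,
# shuffle deterministically, then split into train/val entry lists.
# Function and field names are illustrative, not the pipeline's code.
def build_manifests(tracks, train_ratio=0.8, min_duration=10.0, seed=0):
    """tracks: list of (path, duration_s) -> (train_entries, val_entries)."""
    usable = [(p, d) for p, d in tracks if d >= min_duration]
    random.Random(seed).shuffle(usable)
    cut = int(len(usable) * train_ratio)
    entries = [{"path": p, "duration": d} for p, d in usable]
    return entries[:cut], entries[cut:]

tracks = [(f"blues/track{i:03d}.wav", 25.0 + i) for i in range(10)]
tracks.append(("blues/short.wav", 4.0))  # under 10 s: skipped
train, val = build_manifests(tracks)
print(len(train), len(val))  # 8 2
```

Each entry would then be serialized as one JSON object per line in `train.jsonl`/`val.jsonl`.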
Configuration Reference
Configuration uses OmegaConf YAML with a base + experiment override pattern. The base config (`configs/base.yaml`) defines all defaults; experiment configs in `configs/experiments/` override specific values.
| Key | Default | Description |
|---|---|---|
| `model.base` | `facebook/musicgen-stereo-large` | HuggingFace model ID. Also supports medium and small variants. |
| `model.sample_rate` | `32000` | Audio sample rate in Hz |
| `model.segment_duration` | `30` | Training segment length in seconds |
| `model.channels` | `2` | Stereo (2) or mono (1) |
| Key | Default | Description |
|---|---|---|
| `lora.rank` | `16` | LoRA rank. Higher = more capacity but slower. Try 32–64 for genre specialization. |
| `lora.alpha` | `32` | LoRA scaling factor. Rule of thumb: alpha = 2× to 3× rank. |
| `lora.target_modules` | `[q_proj, v_proj, k_proj, out_proj]` | Attention layers to apply LoRA to |
| `lora.dropout` | `0.05` | LoRA dropout for regularization |
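The alpha rule of thumb follows from how LoRA scales its update: the low-rank product is multiplied by alpha / rank, so alpha is usually raised together with rank. A trivial sketch with the values from this document:

```python
# LoRA multiplies its low-rank update by alpha / rank, so raising rank
# without raising alpha shrinks the update's effective magnitude.
# Values mirror the default config and the rank-64 suggestion above.
def lora_scale(alpha: float, rank: int) -> float:
    return alpha / rank

print(lora_scale(32, 16))   # default config: 2.0
print(lora_scale(192, 64))  # rank 64 with alpha = 3x rank: 3.0
```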
| Key | Default | Description |
|---|---|---|
| `training.epochs` | `7` | Number of training epochs |
| `training.batch_size` | `4` | Batch size. Reduce to 1–2 on 24 GB GPUs. |
| `training.optimizer.lr` | `1e-5` | Learning rate. Use 1e-4 for aggressive fine-tuning. |
| `training.scheduler.name` | `cosine` | LR schedule type with warmup |
| `training.early_stopping.patience` | `3` | Stop if val_loss doesn't improve for N epochs |
| `training.gradient_accumulation_steps` | `1` | Simulate larger batch sizes on small GPUs |
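The patience mechanism can be sketched as a tiny tracker on `val_loss`; this is an illustrative stand-in, not the pipeline's actual callback:

```python
# Sketch of patience-based early stopping on val_loss: stop after N
# consecutive epochs without improvement. Illustrative, not the
# trainer's real implementation.
class EarlyStopping:
    def __init__(self, patience=3):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        """Record one epoch's val_loss; return True when training should stop."""
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience

stop = EarlyStopping(patience=3)
history = [4.1, 3.9, 3.95, 3.96, 3.97]   # no improvement after epoch 2
print([stop.step(v) for v in history])    # [False, False, False, False, True]
```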
| Key | Default | Description |
|---|---|---|
| `data.source_dir` | `/data/gtzan/instrumental` | Path to raw audio files |
| `data.dataset_dir` | `/data/gtzan/processed` | Path for processed manifests |
| `data.splits.train` | `0.80` | Train/val/test split ratio |
| `data.min_duration` | `10.0` | Skip audio shorter than this (seconds) |
| Key | Default | Description |
|---|---|---|
| `augmentation.pitch_shift` | ±2 semitones, p=0.4 | Random pitch shifting |
| `augmentation.time_stretch` | 0.9×–1.1×, p=0.3 | Random tempo changes |
| `augmentation.gaussian_noise` | 0.001–0.01 amp, p=0.2 | Noise injection for robustness |
| `augmentation.gain` | ±3 dB, p=0.3 | Volume variation |
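Each augmentation is applied independently with its probability p. A minimal sketch of the gain case (names and ranges mirror the table, but this is not the pipeline's implementation):

```python
import random

# Probability-gated gain augmentation (±3 dB, p=0.3); the other
# augmentations follow the same "apply with probability p" pattern.
# Illustrative sketch, not the pipeline's actual augmentation code.
def maybe_gain(samples, rng, p=0.3, db_range=3.0):
    if rng.random() >= p:
        return samples                                    # skipped this time
    gain = 10 ** (rng.uniform(-db_range, db_range) / 20)  # dB -> linear
    return [s * gain for s in samples]

rng = random.Random(0)
out = maybe_gain([0.1, -0.2, 0.3], rng)
print(len(out))  # length is always preserved, gained or not
```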
Creating an Experiment Config
Create a YAML file in `configs/experiments/` that overrides only the settings you want to change:
```yaml
# configs/experiments/my_experiment.yaml
data:
  source_dir: "/path/to/my/audio"
  dataset_dir: "/path/to/processed"
lora:
  rank: 64
  alpha: 192  # 3x rank
training:
  epochs: 50
  batch_size: 2  # For 24 GB GPUs
  optimizer:
    lr: 1.0e-4
logging:
  wandb:
    name: "my-experiment"
```
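The override semantics can be illustrated with a plain-Python deep merge (OmegaConf handles this for real; the function and config values below are just a sketch):

```python
# Sketch of base + experiment override merging: nested dicts merge
# recursively, leaves in the override win. OmegaConf does this for
# real; this pure-Python version only illustrates the semantics.
def deep_merge(base, override):
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

base = {"lora": {"rank": 16, "alpha": 32}, "training": {"epochs": 7}}
exp = {"lora": {"rank": 64, "alpha": 192}}
merged = deep_merge(base, exp)
print(merged["lora"]["rank"], merged["training"]["epochs"])  # 64 7
```

Keys you do not mention in the experiment file (like `training.epochs` above) keep their base defaults.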
Training Guide
Running a Training Job
Set environment variables
```bash
export WANDB_API_KEY="your_wandb_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
```
Launch training
```bash
# Dry run first (prints config, no training)
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml \
  --dry-run

# Actual training
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml
```
Monitor on Weights & Biases
Training metrics are logged to W&B in real-time:
`train_loss`, `val_loss`, learning rate, gradient norms, and audio samples at configurable intervals.
Training Tips
Based on validated H100 training runs:
- LoRA rank 64, alpha 192: best balance for genre adaptation
- Learning rate 1e-4: aggressive enough for small datasets
- Batch size 2: safe on all GPUs ≥ 24 GB
- Early stopping patience 15: generous for long runs
- Cosine schedule with 200 warmup steps

Common gotchas:

- CUDA OOM: reduce `batch_size` to 1 and increase `gradient_accumulation_steps`
- Stale manifests: delete `train.jsonl`/`val.jsonl` if you change your dataset
- Windows line endings: if running on Linux from Windows-edited files, run `sed -i 's/\r$//' config.yaml`
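The cosine-with-warmup schedule behaves as sketched below; a minimal illustration with made-up step counts, not the trainer's actual scheduler:

```python
import math

# Cosine decay with linear warmup: LR ramps from 0 to base_lr over the
# warmup steps, then follows a half-cosine down to ~0. Step counts are
# illustrative; the trainer's real scheduler may differ in detail.
def lr_at(step, total_steps, base_lr=1e-5, warmup=200):
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(100, 10_000))     # mid-warmup: half the base LR
print(lr_at(10_000, 10_000))  # end of schedule: decayed to ~0
```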
Docker Deployment
The Docker image packages the entire training environment with frozen, proven dependencies. This is the recommended approach for serverless GPU platforms.
Volume Mounts
| Host Path | Container | Purpose |
|---|---|---|
| `/path/to/audio` | `/data` | Audio files (WAV/MP3/FLAC) |
| `/path/to/configs` | `/config` | Experiment YAML overrides |
| `/path/to/output` | `/output` | Checkpoints, samples, logs |
Environment Variables
| Variable | Required | Description |
|---|---|---|
| `EXPERIMENT_CONFIG` | No | Config file name or path. Defaults to `base.yaml` |
| `WANDB_API_KEY` | No | W&B logging. Disabled if missing. |
| `GEMINI_API_KEY` | No | Gemini metadata generation |
| `DRY_RUN` | No | Set `true` for config validation only |
Full Docker Run Command
```bash
docker run --gpus all \
  -v /home/user/audio:/data \
  -v /home/user/configs:/config \
  -v /home/user/output:/output \
  -e WANDB_API_KEY=your_key \
  -e EXPERIMENT_CONFIG=docker_preset_c.yaml \
  lumina-musicgen-trainer:1.0
```
Serverless Platforms
Tested on these serverless GPU providers:
- RunPod β Use GPU Pod with Docker image. Mount network volumes for data persistence.
- Lambda Labs β Push image to ECR/DockerHub, launch via API.
- Vast.ai β Upload image, select GPU tier, configure volume mounts.
Outputs & Evaluation
Output Structure
```
/output/runs/experiment_20260305-151233/
├── config.yaml           # Frozen config snapshot
├── training.log          # Full training log
├── metadata.json         # Data metadata copy
├── checkpoints/
│   ├── epoch_1/          # Per-epoch adapter checkpoints
│   ├── epoch_2/
│   └── best/             # Best model (lowest val_loss)
│       ├── lora_A.pt
│       └── lora_B.pt
├── final/
│   └── adapter_final.pt  # Final merged adapter
└── samples/
    ├── epoch_5_sample_0.wav
    └── epoch_10_sample_0.wav
```
Generated Audio Samples
The pipeline generates audio samples at configurable intervals during training. These allow you to aurally monitor how the model adapts to your target genre. Each sample is a 30-second stereo WAV at 32 kHz.
Evaluation Metrics
| Metric | What It Measures | Good Values |
|---|---|---|
| val_loss | How well the model predicts held-out audio tokens | Should decrease and stabilize. Our best: 3.73 |
| FAD (CLAP) | Fréchet Audio Distance: distribution similarity to reference audio | Lower is better. < 5.0 is good. |
| CLAP Score | Text-audio alignment score | Higher is better |
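For intuition about FAD, the univariate special case of the Fréchet distance between two Gaussian fits is easy to compute by hand (real FAD uses multivariate embeddings from a pretrained audio model, and a matrix square root in place of the scalar one):

```python
import math

# Univariate Fréchet distance between two Gaussians: the scalar
# special case of the FAD formula. Real FAD works on multivariate
# embedding statistics; this sketch is for intuition only.
def frechet_1d(mu1, var1, mu2, var2):
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

print(frechet_1d(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(frechet_1d(0.0, 1.0, 3.0, 1.0))  # separated means -> 9.0
```

The distance is zero only when the generated and reference statistics match; it grows with any mismatch in mean or spread, which is why lower FAD indicates audio that is distributionally closer to the reference set.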
Troubleshooting
Cause: Batch size too large for available VRAM.
Fix: Reduce `training.batch_size` to 1 or 2, and increase `training.gradient_accumulation_steps` to compensate.
```yaml
# Effective batch size = batch_size x gradient_accumulation_steps
training:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
```
Cause: data.source_dir path is wrong, or stale manifests exist.
Fix:
- Verify audio files exist at the configured `source_dir` path
- Delete stale manifests: `rm /path/to/dataset_dir/train.jsonl /path/to/dataset_dir/val.jsonl`
- Check for Windows line endings in YAML: `sed -i 's/\r$//' config.yaml`
Fix: Use sudo docker or add your user to the docker group:
```bash
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
```
Cause: Missing audio codec libraries in container.
Fix: The Dockerfile already includes FFmpeg dev headers. If you rebuild, ensure `libavformat-dev` and `libavcodec-dev` are installed.
Cause: Likely training too aggressively (high LR, too many epochs).
Fix: Reduce learning rate, add early stopping, check that
text descriptions in metadata are reasonable.