Overview
What Is This Pipeline?
This pipeline fine-tunes Meta's MusicGen, a state-of-the-art text-to-music generation model with 3.3 billion parameters, using LoRA (Low-Rank Adaptation) to specialize the model for specific music genres while keeping the base model frozen.
Instead of training all 3.3B parameters (which would require massive compute and risk catastrophic forgetting), LoRA injects small trainable adapters into the transformer's attention layers. This reduces trainable parameters by ~99.8% while still achieving strong genre specialization.
A full fine-tune of MusicGen-Large requires ~26 GB of VRAM just for parameters + gradients. With LoRA (rank 64), you train only ~2M parameters, the adapter checkpoint is ~8 MB, and training fits comfortably on a single GPU with 24 GB+ VRAM.
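The savings follow from simple per-layer arithmetic. The sketch below illustrates the idea; the dimensions used are illustrative stand-ins, not MusicGen-Large's exact layer shapes:

```python
# Per-layer LoRA arithmetic: a rank-r adapter on a d_in x d_out linear
# layer trains r * (d_in + d_out) parameters instead of d_in * d_out.
# The dimension below is illustrative, not MusicGen-Large's exact shape.
def lora_layer_params(d_in: int, d_out: int, rank: int) -> int:
    # A is (rank x d_in), B is (d_out x rank); only A and B are trained.
    return rank * (d_in + d_out)

d = 1024
full = d * d                                # frozen weight matrix
adapter = lora_layer_params(d, d, rank=16)  # trainable adapter factors
print(adapter, full)  # 32768 1048576 -- the adapter is ~3% of one layer
```

Summed over every targeted projection in every transformer block, this is where the roughly two-orders-of-magnitude reduction in trainable parameters comes from.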
Key Capabilities
- Multi-genre training: train on multiple genres simultaneously or focus on a single genre
- Stereo output: full stereo generation at a 32 kHz sample rate
- Data augmentation: built-in pitch shift, time stretch, Gaussian noise, and gain augmentation
- Automatic metadata: AI-generated text descriptions via Gemini or OpenAI for conditioning
- Vocal removal: automatic instrumental extraction using HT-Demucs
- W&B integration: real-time experiment tracking with Weights & Biases
- Docker deployment: containerized for serverless GPU training (RunPod, Lambda)
- Checkpoint management: automatic best-model saving, early stopping, and run isolation
Architecture
Pipeline Flow
The complete training pipeline flows from raw audio through preprocessing (manifest creation, vocal removal, metadata generation, segmentation) into LoRA training and checkpointing, as detailed in the sections below.
Model Architecture
Under the hood, MusicGen consists of two main components:
| Component | Role | Details |
|---|---|---|
| EnCodec | Audio tokenizer | Compresses raw audio into discrete tokens using 8 codebooks with 2048 codes each. Stereo interleaved. |
| Transformer LM | Token predictor | A 3.3B-parameter decoder-only transformer that generates audio tokens conditioned on text prompts. |
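Some quick arithmetic on the tokenizer's output, assuming EnCodec's commonly cited 50 Hz frame rate at 32 kHz (the 8-codebook figure is from the table above):

```python
# Back-of-envelope token counts for a training segment. The 50 Hz
# frame rate is the commonly cited EnCodec rate at 32 kHz; the
# 8 codebooks (stereo interleaved) come from the component table.
def token_count(duration_s: float, frame_rate_hz: int = 50, codebooks: int = 8) -> int:
    return int(duration_s * frame_rate_hz) * codebooks

print(token_count(30))  # 12000 discrete tokens per 30 s segment
```

So each 30-second training segment becomes a few thousand discrete tokens for the transformer to model, rather than nearly a million raw audio samples per channel.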
LoRA adapters are injected into the Transformer LM's attention layers: specifically the `q_proj`, `k_proj`, `v_proj`, and `out_proj` linear layers in each transformer block.
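A sketch of what that targeting means in practice; the module names below are hypothetical stand-ins for the transformer's real named modules, not output from the actual model:

```python
# Selecting which modules receive LoRA adapters by suffix match.
# Module names here are hypothetical stand-ins for the real
# transformer's named_modules() output.
TARGET_SUFFIXES = ("q_proj", "k_proj", "v_proj", "out_proj")

modules = [
    "transformer.layers.0.self_attn.q_proj",
    "transformer.layers.0.self_attn.k_proj",
    "transformer.layers.0.self_attn.v_proj",
    "transformer.layers.0.self_attn.out_proj",
    "transformer.layers.0.ffn.linear1",   # feed-forward: stays frozen
]
targets = [m for m in modules if m.endswith(TARGET_SUFFIXES)]
print(targets)  # only the four attention projections match
```

The feed-forward layers stay frozen; only the attention projections get trainable low-rank factors.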
Prerequisites
GPU Requirements
| Tier | GPU | VRAM | Batch Size | Notes |
|---|---|---|---|---|
| Recommended | H100 / A100 80GB | 80 GB | 4–8 | Fastest training, supports large batches |
| Good | A100 40GB / A6000 | 40–48 GB | 2–4 | Comfortable for most experiments |
| Minimum | RTX 4090 / 3090 | 24 GB | 1–2 | Works with gradient accumulation, slower |
Software Requirements
- Python 3.10+
- CUDA 12.1+ with cuDNN 8
- PyTorch 2.1.0+ (cu121)
- Docker (optional, for containerized training)
- NVIDIA Container Toolkit (for Docker GPU passthrough)
API Keys (Optional)
| Service | Purpose | Required? |
|---|---|---|
| Weights & Biases | Experiment tracking and visualization | Recommended |
| Google Gemini | Auto-generate text metadata for audio conditioning | Optional |
Installation
Option A: Bare Metal (Direct GPU Access)
Clone the repository
```bash
# Clone lumina-musicgen
git clone https://github.com/FoldArtists/lumina-musicgen.git
cd lumina-musicgen
```
Create a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.1
pip install torch==2.1.0 torchaudio==2.1.0 \
  --index-url https://download.pytorch.org/whl/cu121

# Install project dependencies
pip install -e .
```
Verify the installation
```bash
python scripts/verify_deps.py
# Should output:
# ✓ PyTorch 2.1.0+cu121
# ✓ CUDA available
# ✓ AudioCraft 1.3.0
# ✓ PEFT 0.18.1
```
Option B: Docker (Recommended for Serverless)
Build the Docker image
```bash
cd lumina-musicgen
docker build -f docker/Dockerfile.trainer \
  -t lumina-musicgen-trainer:1.0 .
```
Image size: ~10 GB. Includes all frozen dependencies from the proven H100 training environment.
Verify the build
```bash
# Dry-run test (validates GPU, imports, config)
docker run --gpus all \
  -v /path/to/your/audio:/data \
  -v /tmp/test-output:/output \
  -e DRY_RUN=true \
  lumina-musicgen-trainer:1.0
```
Data Preparation
Directory Structure
Organize your audio files by genre in subdirectories:
```
/your/audio/data/
├── blues/
│   ├── track001.wav
│   ├── track002.wav
│   └── ...
├── jazz/
│   ├── track001.wav
│   └── ...
├── rock/
│   └── ...
└── classical/
    └── ...
```
Audio Requirements
| Property | Requirement | Notes |
|---|---|---|
| Format | `.wav`, `.mp3`, or `.flac` | WAV preferred for quality |
| Duration | ≥ 10 seconds | Shorter files are auto-skipped |
| Content | Instrumental preferred | Vocals are auto-removed if present |
| Quantity | 10+ tracks per genre | More data = better generalization |
For experimentation, we provide a script to download the GTZAN dataset (1000 tracks, 10 genres):
```bash
python scripts/prepare_gtzan.py --output-dir /data/gtzan
```
Automatic Data Processing
The pipeline automatically handles these preprocessing steps during training:
- Manifest creation: scans audio files and creates `train.jsonl`/`val.jsonl` splits
- Vocal removal: uses HT-Demucs to extract instrumental stems (configurable)
- Metadata generation: creates text descriptions using Gemini AI for text-conditioning
- Segmentation: splits audio into 30-second segments at 32 kHz for training
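The manifest step can be sketched as a filter-shuffle-split; the function and field names here are illustrative, not the pipeline's actual code:

```python
import random

# Minimal sketch of manifest creation: drop tracks under min_duration,
# shuffle deterministically, then split into train/val entry lists.
# Function and field names are illustrative, not the pipeline's code.
def build_manifests(tracks, train_ratio=0.8, min_duration=10.0, seed=0):
    """tracks: list of (path, duration_s) -> (train_entries, val_entries)."""
    usable = [(p, d) for p, d in tracks if d >= min_duration]
    random.Random(seed).shuffle(usable)
    cut = int(len(usable) * train_ratio)
    entries = [{"path": p, "duration": d} for p, d in usable]
    return entries[:cut], entries[cut:]

tracks = [(f"blues/track{i:03d}.wav", 25.0 + i) for i in range(10)]
tracks.append(("blues/short.wav", 4.0))  # under 10 s: skipped
train, val = build_manifests(tracks)
print(len(train), len(val))  # 8 2
```

Each entry would then be serialized as one JSON object per line in `train.jsonl`/`val.jsonl`.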
Configuration Reference
Configuration uses OmegaConf YAML with a base + experiment override pattern. The base config (`configs/base.yaml`) defines all defaults; experiment configs in `configs/experiments/` override specific values.
| Key | Default | Description |
|---|---|---|
| `model.base` | `facebook/musicgen-stereo-large` | HuggingFace model ID. Also supports medium and small variants. |
| `model.sample_rate` | `32000` | Audio sample rate in Hz |
| `model.segment_duration` | `30` | Training segment length in seconds |
| `model.channels` | `2` | Stereo (2) or mono (1) |
| Key | Default | Description |
|---|---|---|
| `lora.rank` | `16` | LoRA rank. Higher = more capacity but slower. Try 32–64 for genre specialization. |
| `lora.alpha` | `32` | LoRA scaling factor. Rule of thumb: alpha = 2× to 3× rank. |
| `lora.target_modules` | `[q_proj, v_proj, k_proj, out_proj]` | Attention layers to apply LoRA to |
| `lora.dropout` | `0.05` | LoRA dropout for regularization |
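The alpha rule of thumb follows from how LoRA scales its update: the low-rank product is multiplied by alpha / rank, so alpha is usually raised together with rank. A trivial sketch with the values from this document:

```python
# LoRA multiplies its low-rank update by alpha / rank, so raising rank
# without raising alpha shrinks the update's effective magnitude.
# Values mirror the default config and the rank-64 suggestion above.
def lora_scale(alpha: float, rank: int) -> float:
    return alpha / rank

print(lora_scale(32, 16))   # default config: 2.0
print(lora_scale(192, 64))  # rank 64 with alpha = 3x rank: 3.0
```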
| Key | Default | Description |
|---|---|---|
| `training.epochs` | `7` | Number of training epochs |
| `training.batch_size` | `4` | Batch size. Reduce to 1–2 on 24 GB GPUs. |
| `training.optimizer.lr` | `1e-5` | Learning rate. Use 1e-4 for aggressive fine-tuning. |
| `training.scheduler.name` | `cosine` | LR schedule type with warmup |
| `training.early_stopping.patience` | `3` | Stop if val_loss doesn't improve for N epochs |
| `training.gradient_accumulation_steps` | `1` | Simulate larger batch sizes on small GPUs |
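The patience mechanism can be sketched as a tiny tracker on `val_loss`; this is an illustrative stand-in, not the pipeline's actual callback:

```python
# Sketch of patience-based early stopping on val_loss: stop after N
# consecutive epochs without improvement. Illustrative, not the
# trainer's real implementation.
class EarlyStopping:
    def __init__(self, patience=3):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        """Record one epoch's val_loss; return True when training should stop."""
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience

stop = EarlyStopping(patience=3)
history = [4.1, 3.9, 3.95, 3.96, 3.97]   # no improvement after epoch 2
print([stop.step(v) for v in history])    # [False, False, False, False, True]
```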
| Key | Default | Description |
|---|---|---|
| `data.source_dir` | `/data/gtzan/instrumental` | Path to raw audio files |
| `data.dataset_dir` | `/data/gtzan/processed` | Path for processed manifests |
| `data.splits.train` | `0.80` | Train/val/test split ratio |
| `data.min_duration` | `10.0` | Skip audio shorter than this (seconds) |
| Key | Default | Description |
|---|---|---|
| `augmentation.pitch_shift` | ±2 semitones, p=0.4 | Random pitch shifting |
| `augmentation.time_stretch` | 0.9×–1.1×, p=0.3 | Random tempo changes |
| `augmentation.gaussian_noise` | 0.001–0.01 amp, p=0.2 | Noise injection for robustness |
| `augmentation.gain` | ±3 dB, p=0.3 | Volume variation |
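Each augmentation is applied independently with its probability p. A minimal sketch of the gain case (names and ranges mirror the table, but this is not the pipeline's implementation):

```python
import random

# Probability-gated gain augmentation (±3 dB, p=0.3); the other
# augmentations follow the same "apply with probability p" pattern.
# Illustrative sketch, not the pipeline's actual augmentation code.
def maybe_gain(samples, rng, p=0.3, db_range=3.0):
    if rng.random() >= p:
        return samples                                    # skipped this time
    gain = 10 ** (rng.uniform(-db_range, db_range) / 20)  # dB -> linear
    return [s * gain for s in samples]

rng = random.Random(0)
out = maybe_gain([0.1, -0.2, 0.3], rng)
print(len(out))  # length is always preserved, gained or not
```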
Creating an Experiment Config
Create a YAML file in `configs/experiments/` that overrides only the settings you want to change:
```yaml
# configs/experiments/my_experiment.yaml
data:
  source_dir: "/path/to/my/audio"
  dataset_dir: "/path/to/processed"
lora:
  rank: 64
  alpha: 192  # 3x rank
training:
  epochs: 50
  batch_size: 2  # For 24 GB GPUs
  optimizer:
    lr: 1.0e-4
logging:
  wandb:
    name: "my-experiment"
```
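The override semantics can be illustrated with a plain-Python deep merge (OmegaConf handles this for real; the function and config values below are just a sketch):

```python
# Sketch of base + experiment override merging: nested dicts merge
# recursively, leaves in the override win. OmegaConf does this for
# real; this pure-Python version only illustrates the semantics.
def deep_merge(base, override):
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out

base = {"lora": {"rank": 16, "alpha": 32}, "training": {"epochs": 7}}
exp = {"lora": {"rank": 64, "alpha": 192}}
merged = deep_merge(base, exp)
print(merged["lora"]["rank"], merged["training"]["epochs"])  # 64 7
```

Keys you do not mention in the experiment file (like `training.epochs` above) keep their base defaults.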
Training Guide
Running a Training Job
Set environment variables
```bash
export WANDB_API_KEY="your_wandb_key"
export GEMINI_API_KEY="your_gemini_key"  # optional
```
Launch training
```bash
# Dry run first (prints config, no training)
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml \
  --dry-run

# Actual training
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml
```
Monitor on Weights & Biases
Training metrics are logged to W&B in real-time:
`train_loss`, `val_loss`, learning rate, gradient norms, and audio samples at configurable intervals.
Training Tips
Based on validated H100 training runs:
- LoRA rank 64, alpha 192: best balance for genre adaptation
- Learning rate 1e-4: aggressive enough for small datasets
- Batch size 2: safe on all GPUs ≥ 24 GB
- Early stopping patience 15: generous for long runs
- Cosine schedule with 200 warmup steps

Common gotchas:

- CUDA OOM: reduce `batch_size` to 1 and increase `gradient_accumulation_steps`
- Stale manifests: delete `train.jsonl`/`val.jsonl` if you change your dataset
- Windows line endings: if running on Linux from Windows-edited files, run `sed -i 's/\r$//' config.yaml`
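The cosine-with-warmup schedule behaves as sketched below; a minimal illustration with made-up step counts, not the trainer's actual scheduler:

```python
import math

# Cosine decay with linear warmup: LR ramps from 0 to base_lr over the
# warmup steps, then follows a half-cosine down to ~0. Step counts are
# illustrative; the trainer's real scheduler may differ in detail.
def lr_at(step, total_steps, base_lr=1e-5, warmup=200):
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(100, 10_000))     # mid-warmup: half the base LR
print(lr_at(10_000, 10_000))  # end of schedule: decayed to ~0
```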
Docker Deployment
The Docker image packages the entire training environment with frozen, proven dependencies. This is the recommended approach for serverless GPU platforms.
Volume Mounts
| Host Path | Container | Purpose |
|---|---|---|
| `/path/to/audio` | `/data` | Audio files (WAV/MP3/FLAC) |
| `/path/to/configs` | `/config` | Experiment YAML overrides |
| `/path/to/output` | `/output` | Checkpoints, samples, logs |
Environment Variables
| Variable | Required | Description |
|---|---|---|
| `EXPERIMENT_CONFIG` | No | Config file name or path. Defaults to `base.yaml` |
| `WANDB_API_KEY` | No | W&B logging. Disabled if missing. |
| `GEMINI_API_KEY` | No | Gemini metadata generation |
| `DRY_RUN` | No | Set `true` for config validation only |
Full Docker Run Command
```bash
docker run --gpus all \
  -v /home/user/audio:/data \
  -v /home/user/configs:/config \
  -v /home/user/output:/output \
  -e WANDB_API_KEY=your_key \
  -e EXPERIMENT_CONFIG=docker_preset_c.yaml \
  lumina-musicgen-trainer:1.0
```
Serverless Platforms
Tested on these serverless GPU providers:
- RunPod β Use GPU Pod with Docker image. Mount network volumes for data persistence.
- Lambda Labs β Push image to ECR/DockerHub, launch via API.
- Vast.ai β Upload image, select GPU tier, configure volume mounts.
Outputs & Evaluation
Output Structure
```
/output/runs/experiment_20260305-151233/
├── config.yaml           # Frozen config snapshot
├── training.log          # Full training log
├── metadata.json         # Data metadata copy
├── checkpoints/
│   ├── epoch_1/          # Per-epoch adapter checkpoints
│   ├── epoch_2/
│   └── best/             # Best model (lowest val_loss)
│       ├── lora_A.pt
│       └── lora_B.pt
├── final/
│   └── adapter_final.pt  # Final merged adapter
└── samples/
    ├── epoch_5_sample_0.wav
    └── epoch_10_sample_0.wav
```
Generated Audio Samples
The pipeline generates audio samples at configurable intervals during training. These allow you to aurally monitor how the model adapts to your target genre. Each sample is a 30-second stereo WAV at 32 kHz.
Evaluation Metrics
| Metric | What It Measures | Good Values |
|---|---|---|
| val_loss | How well the model predicts held-out audio tokens | Should decrease and stabilize. Our best: 3.73 |
| FAD (CLAP) | Fréchet Audio Distance: distribution similarity to reference audio | Lower is better. < 5.0 is good. |
| CLAP Score | Text-audio alignment score | Higher is better |
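For intuition about FAD, the univariate special case of the Fréchet distance between two Gaussian fits is easy to compute by hand (real FAD uses multivariate embeddings from a pretrained audio model, and a matrix square root in place of the scalar one):

```python
import math

# Univariate Fréchet distance between two Gaussians: the scalar
# special case of the FAD formula. Real FAD works on multivariate
# embedding statistics; this sketch is for intuition only.
def frechet_1d(mu1, var1, mu2, var2):
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

print(frechet_1d(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(frechet_1d(0.0, 1.0, 3.0, 1.0))  # separated means -> 9.0
```

The distance is zero only when the generated and reference statistics match; it grows with any mismatch in mean or spread, which is why lower FAD indicates audio that is distributionally closer to the reference set.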
Troubleshooting
Cause: Batch size too large for available VRAM.
Fix: Reduce `training.batch_size` to 1 or 2, and increase `training.gradient_accumulation_steps` to compensate.
```yaml
# Effective batch size = batch_size x gradient_accumulation_steps
training:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
```
Cause: data.source_dir path is wrong, or stale manifests exist.
Fix:
- Verify audio files exist at the configured `source_dir` path
- Delete stale manifests: `rm /path/to/dataset_dir/train.jsonl /path/to/dataset_dir/val.jsonl`
- Check for Windows line endings in YAML: `sed -i 's/\r$//' config.yaml`
Fix: Use sudo docker or add your user to the docker group:
```bash
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
```
Cause: Missing audio codec libraries in container.
Fix: The Dockerfile already includes FFmpeg dev headers. If you rebuild, ensure `libavformat-dev` and `libavcodec-dev` are installed.
Cause: Likely training too aggressively (high LR, too many epochs).
Fix: Reduce learning rate, add early stopping, check that
text descriptions in metadata are reasonable.