ACE-Step 1.5 · 815M DiT LoRA Fine-Tuning Docker · Serverless

ACE-Step 1.5 Training Pipeline

Complete guide to fine-tuning an 815M-parameter Diffusion Transformer for music generation using LoRA adapters. From dataset ingestion to checkpoint deployment — everything for serverless GPU training.

📄 20 min read 🎯 Graduate Level 📅 March 2026

Overview

What Is This Pipeline?

This pipeline fine-tunes ACE-Step 1.5 — a state-of-the-art Diffusion Transformer (DiT) for text-to-music generation with 815 million parameters — using LoRA (Low-Rank Adaptation). Unlike autoregressive models (MusicGen, AudioLM), ACE-Step generates audio through flow matching: iteratively denoising a latent representation to produce high-fidelity music.

💡 Why ACE-Step + LoRA?

ACE-Step's DiT architecture has 48 transformer layers processing music latents with cross-attention for text/lyrics conditioning. LoRA adapters (rank 64) inject ~12M trainable parameters into attention layers — just 1.5% of the total — while keeping the full model frozen. This keeps training fast and memory-efficient, and allows adapters to be hot-swapped at inference time.
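A quick back-of-envelope check of the figures quoted above (815M total parameters, ~12M trainable):

```python
# Sanity-check the trainable-parameter fraction quoted above.
total_params = 815_000_000   # full ACE-Step 1.5 parameter count
lora_params = 12_000_000     # approximate trainable LoRA parameters
fraction = lora_params / total_params
print(f"trainable fraction: {fraction:.1%}")  # roughly 1.5%
```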

Key Capabilities

  • Multi-genre training — Train on 50-200 tracks spanning any mix of genres
  • Text + lyrics conditioning — Separate cross-attention for style tags and lyrics
  • Serverless execution — Fully containerized with Docker, runs on Lambda/RunPod/Modal
  • Hot-swap adapters — Switch between adapters without reloading the base model
  • WTA attribution — Integrated Wasserstein Trajectory Attribution for IP tracking
  • S3 storage — Centralized checkpoint and dataset management on AWS S3
  • Validated configs — Tested presets with 11/11 validation tests passing

Architecture

Pipeline Flow

The complete training pipeline follows this flow:

🎵 Audio + Tags (MP3/WAV/FLAC files with style metadata)
→ 📦 HF Dataset (Arrow format for fast loading)
→ 🐳 Docker Train (GPU container with auto-patching)
→ 🧠 LoRA Adapter (~50 MB adapter checkpoint)

Model Architecture

| Component | Role | Details |
| --- | --- | --- |
| Music DCAE | Audio encoder/decoder | Compresses raw audio into continuous latent representations. Replaces discrete tokenization. |
| DiT Decoder | 48-layer Diffusion Transformer | Generates music by iteratively denoising latent states using flow matching. |
| T5 Encoder | Text conditioning | Encodes style tags and lyrics for cross-attention guidance. |

LoRA adapters target self-attention and cross-attention layers: linear_q, linear_k, linear_v, to_q, to_k, to_v, to_out.0 across all 48 DiT layers.
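As a sketch of what rank-64 adaptation costs per projection: LoRA replaces the update to a d_out × d_in linear layer with two low-rank factors B (d_out × r) and A (r × d_in), so each adapted projection adds r·(d_in + d_out) trainable parameters. The layer width of 1024 below is a hypothetical value for illustration, not ACE-Step's actual hidden size.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A,
    # with B: (d_out, r) and A: (r, d_in).
    return r * (d_in + d_out)

# Hypothetical square projection of width 1024 at Preset C's rank 64:
print(lora_param_count(1024, 1024, 64))  # 131072 parameters per projection
```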

Quick Start

Step 1: Clone and build

# Clone the training-pipelines repo
git clone https://github.com/aenfr/training-pipelines.git
cd training-pipelines/ace-step-1-5

# Build the Docker image
cd docker
docker build -f Dockerfile.trainer -t lumina-trainer:v1 .
Step 2: Prepare your dataset

# Build HuggingFace dataset from your audio files
python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio

See Dataset Preparation for the full tutorial.

Step 3: Smoke test (5 gradient steps)

MODE=smoke ./docker/train.sh

Validates GPU, data loading, and training loop in ~2 minutes.

Step 4: Full training (Preset C)

# Runs with validated Preset C defaults
./docker/train.sh

Expected time: ~8 hours on A100 for 100 tracks × 100 epochs.

Training Presets

Preset C — Multi-Style (Production)

✅ Validated & Production-Ready

11/11 validation tests passed. Used for the production multi-style adapter.

🧠 Preset C Configuration
| Parameter | Value | Notes |
| --- | --- | --- |
| LoRA Rank | 64 | Capacity/overfitting balance |
| LoRA Alpha | 192 | 3× rank for scaled learning |
| LoRA Dropout | 0.05 | Mild regularization |
| Learning Rate | 5e-5 | Halved for multi-genre stability |
| Epochs | 100 | For ~100 tracks |
| Grad Accumulation | 4 | Effective batch = 4 |
| Grad Clip | 0.5 | Prevents explosion |
| Precision | bf16-mixed | Half-precision for efficiency |
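The effective batch of 4 comes from gradient accumulation: the trainer averages gradients over 4 micro-batches (assuming a micro-batch of 1) before each optimizer step. A minimal sketch of the arithmetic, with toy gradient values:

```python
# Toy gradient accumulation: 4 micro-batches, one optimizer step.
grad_accum = 4
micro_grads = [0.5, 0.25, 0.125, 0.125]  # toy per-micro-batch gradients
assert len(micro_grads) == grad_accum
step_grad = sum(micro_grads) / grad_accum  # averaged before the weight update
print(step_grad)  # 0.25
```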

LOO Baseline — Research Validation

Used for Leave-One-Out causal validation experiments on GTZAN (90 tracks, 10 genres).

🔬 LOO Baseline Configuration
| Parameter | Value | Notes |
| --- | --- | --- |
| LoRA Rank | 64 | Same architecture |
| LoRA Alpha | 128 | Standard 2× rank |
| Learning Rate | 1e-4 | Higher for shorter runs |
| Epochs | 500 | Longer for smaller dataset |

Docker Setup

Container Volumes

| Container Path | Purpose | Mode |
| --- | --- | --- |
| /model | ACE-Step base model weights | Read-only |
| /data | HuggingFace dataset (.arrow) | Read-only |
| /output | Training output & checkpoints | Read-write |
| /lora-cfg | LoRA config JSON | Read-only |

Entrypoint Auto-Patching

The container entrypoint automatically handles these known issues:

  1. TorchCodec incompatibility — Patches torchaudio.load() → librosa.load()
  2. Audio save failure — Patches torchaudio.save() → soundfile.write()
  3. Step-0 plot crash — Skips inference at step 0 to prevent gradient corruption
💡 No manual patches needed

All audio compatibility patches are applied automatically by entrypoint.sh at container startup. You don't need to modify any source files.
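These patches follow the standard monkey-patching pattern: reassign the broken attribute before the trainer uses it. A minimal pure-Python sketch of the idea (the objects below are stand-ins for illustration; the real entrypoint swaps torchaudio.load for a librosa-backed loader):

```python
import types

# Stand-in for the torchaudio module (hypothetical, for illustration only).
torchaudio = types.SimpleNamespace(load=lambda path: ("torchcodec", path))

def librosa_backed_load(path):
    # Replacement loader, as entrypoint.sh substitutes at container startup.
    return ("librosa", path)

torchaudio.load = librosa_backed_load  # the monkey-patch itself
print(torchaudio.load("song_01.mp3"))  # ('librosa', 'song_01.mp3')
```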

Custom Presets

Override any parameter via environment variables:

# Example: High-rank, low-LR for single artist
EPOCHS=200 \
LEARNING_RATE=2e-5 \
GRAD_ACCUM=8 \
EXP_NAME="my_custom_preset" \
./docker/train.sh

For LoRA architecture changes, create a JSON config:

{
    "r": 128,
    "lora_alpha": 256,
    "lora_dropout": 0.1,
    "target_modules": [
        "linear_q", "linear_k", "linear_v",
        "to_q", "to_k", "to_v", "to_out.0"
    ],
    "use_rslora": false
}
LORA_CONFIG=/path/to/my_config.json ./docker/train.sh
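Before pointing LORA_CONFIG at a hand-written file, a quick schema check can catch typos. The required keys below follow the JSON example above; the validation rules themselves are assumptions for illustration, not something the pipeline enforces:

```python
import json

REQUIRED_KEYS = {"r", "lora_alpha", "lora_dropout", "target_modules"}

def check_lora_config(path: str) -> dict:
    """Sketch of a pre-flight check for a custom LoRA config JSON."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if cfg["lora_alpha"] % cfg["r"] != 0:
        # Both documented presets use alpha as an integer multiple of r (2x or 3x).
        raise ValueError("lora_alpha is normally an integer multiple of r")
    return cfg
```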

Dataset Preparation

Required Files Per Track

| File | Description | Example |
| --- | --- | --- |
| Audio (.mp3/.wav/.flac) | Full mix, 30-60s recommended | song_01.mp3 |
| Tags | Comma-separated style descriptors | jazz, piano, smooth, 90bpm |
| Lyrics | Song lyrics or [Instrumental] | [verse]\nMidnight... |
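Assembled, a single track's entry might look like the record below. The field names are hypothetical, chosen for illustration; check the builder script for its actual schema:

```python
# One training record combining the three required files (field names
# are hypothetical; the builder script defines the real schema).
record = {
    "audio": "song_01.mp3",
    "tags": "jazz, piano, smooth, 90bpm",
    "lyrics": "[Instrumental]",
}
assert all(record.values())  # every track needs all three fields
print(record["tags"].split(", "))  # ['jazz', 'piano', 'smooth', '90bpm']
```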

Build the Dataset

python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio
⚠️ Tag Quality Matters

The model learns style associations through cross-attention with tags. Include 5-10 descriptive tags per track: vocal type, instruments, genre, mood, tempo (BPM), key. See the full Dataset Ingestion Guide in the repository docs.

S3 Storage

Bucket Layout

s3://lumina-data-foldartists/
├── models/ace-step-1.5/         # Base model weights
├── lora/                        # Fine-tuned adapters
│   ├── multi-style-gen-c/       # ✅ Production
│   └── loo-subsets/             # Validation models
├── datasets/                    # HF datasets
├── trajectories/                # WTA trajectory data
└── results/                     # Experiment outputs

Common Operations

# Download production adapter
aws s3 sync s3://lumina-data-foldartists/lora/multi-style-gen-c/ \
    ~/my_adapter/

# Upload new checkpoint after training
aws s3 sync /output/my_experiment/ \
    s3://lumina-data-foldartists/lora/my-new-adapter/

See the full S3 Setup Guide for bucket creation, IAM roles, and cost estimates.

Monitoring Training

Key Metrics

| Metric | Healthy Range | Action if Abnormal |
| --- | --- | --- |
| Training loss | 0.25 – 0.50 | If > 1.0: reduce LR. If NaN: reduce grad_clip |
| Learning rate | Follows schedule | Should warm up, then hold at target |
| GPU memory | < 70% VRAM | If OOM: reduce batch or grad_accum |
| Steps/second | > 1.0 on A100 | If slow: check num_workers |
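The training-loss check can be scripted when tailing logs. A sketch using the healthy range from the table (NaN must be tested explicitly, since it compares false against any range):

```python
import math

def loss_status(loss: float, lo: float = 0.25, hi: float = 0.50) -> str:
    # Check NaN first: NaN compares false against every threshold.
    if math.isnan(loss):
        return "NaN: reduce grad_clip"
    if loss > 1.0:
        return "too high: reduce LR"
    return "healthy" if lo <= loss <= hi else "watch"

print(loss_status(0.31))          # healthy
print(loss_status(float("nan")))  # NaN: reduce grad_clip
```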

Duration Estimates

| Dataset | GPU | Preset C (100 epochs) |
| --- | --- | --- |
| 50 tracks | A100 40GB | ~4 hours |
| 100 tracks | A100 40GB | ~8 hours |
| 100 tracks | H100 80GB | ~5 hours |

Troubleshooting

| Issue | Symptom | Fix |
| --- | --- | --- |
| Audio decode failure | "Empty examples" or TorchCodec error | Auto-patched by entrypoint. Check audio file paths if still failing. |
| CUDA OOM | "CUDA out of memory" | Reduce GRAD_ACCUM, or use A100 80GB |
| No checkpoints | Empty output dir after training | every_n_train_steps defaults high. Override with --every_n_train_steps 50 |
| NaN loss | Loss becomes NaN | Lower GRAD_CLIP to 0.1. Check for corrupt audio files |