PIXART-α is a Transformer-based text-to-image diffusion model that delivers photorealistic image generation competitive with leading systems such as Imagen, SDXL, and Midjourney, while requiring dramatically less compute and cost to train. Built on a Diffusion Transformer (DiT) architecture with cross-attention modules for text conditioning, the model supports high-resolution image synthesis up to 1024px and was accepted as a Spotlight paper at ICLR 2024.
Three design choices account for this efficiency. First, a training strategy decomposition separates the optimization of pixel dependency, text-image alignment, and aesthetic quality into distinct training phases. Second, an efficient T2I Transformer integrates cross-attention into the DiT backbone to inject text conditions and removes the computation-intensive class-conditioning branch. Third, a highly informative data pipeline uses a large vision-language model to auto-label images with dense pseudo-captions, which strengthens text-image alignment while training on only 25 million images.
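
To make the second point concrete, the following is a minimal sketch of cross-attention, in which image tokens act as queries and text tokens supply keys and values. It is an illustration under simplifying assumptions, not the model's implementation: it uses a single head, omits the learned query/key/value projections and normalization layers, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Single-head cross-attention with a residual connection.

    Queries come from the image tokens; keys and values both come from
    the text tokens. Learned projections are omitted (assumption made
    to keep the sketch short).
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    conditioned = []
    for query in image_tokens:
        # Scaled dot-product score of this image token against each text token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_tokens]
        weights = softmax(scores)
        # Weighted sum of the text tokens (used here as values).
        mixed = [sum(w * value[d] for w, value in zip(weights, text_tokens)) for d in range(dim)]
        # Residual: add the text-derived signal back onto the image token.
        conditioned.append([q + m for q, m in zip(query, mixed)])
    return conditioned


# Toy inputs: 3 image (latent patch) tokens and 2 text tokens, each of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
conditioned = cross_attention(image_tokens, text_tokens)
```

Each image token ends up pulled toward the text tokens it matches most closely, which is the mechanism that lets the caption steer every patch of the image.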
The result is a training cost of approximately $26,000 and roughly 675 A100 GPU days. That is about 10.8% of the training time of Stable Diffusion v1.5, with CO2 emissions cut by 90% compared to that baseline. Against the larger RAPHAEL model, PIXART-α achieves comparable quality at just 1% of the training cost.
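
The percentages above can be checked with a few lines of arithmetic. The PIXART-α figures come from the text; the Stable Diffusion v1.5 baseline values (about 6,250 A100 GPU days and about $320,000) are not stated in this summary and are assumed here from the figures reported in the PIXART-α paper.

```python
# PIXART-α figures as stated in the text above.
PIXART_GPU_DAYS = 675
PIXART_COST_USD = 26_000

# Assumed Stable Diffusion v1.5 baseline (as reported in the PIXART-α paper).
SD15_GPU_DAYS = 6_250
SD15_COST_USD = 320_000

time_ratio = PIXART_GPU_DAYS / SD15_GPU_DAYS
cost_ratio = PIXART_COST_USD / SD15_COST_USD
savings_usd = SD15_COST_USD - PIXART_COST_USD

print(f"Training time relative to SD v1.5: {time_ratio:.1%}")
print(f"Training cost relative to SD v1.5: {cost_ratio:.1%}")
print(f"Estimated savings: ${savings_usd:,}")
```

Under these assumed baselines, the 10.8% figure is the ratio of GPU days, while the dollar cost ratio is somewhat lower.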
PIXART-δ, the follow-up variant, integrates Latent Consistency Models (LCM) and ControlNet support. It generates 1024×1024 images in under 0.5 seconds on an A100 GPU, and inference can fit in under 8GB of VRAM. The repository provides training scripts, inference pipelines, DreamBooth fine-tuning, LoRA training, ControlNet customization, Gradio demos, and full Hugging Face Diffusers integration.
Typical use cases include:
- Generating photorealistic images from detailed text prompts at up to 1024px resolution
- Fine-tuning the model on custom subject images using DreamBooth for personalized image generation
- Applying ControlNet conditioning with HED edge maps to guide image structure and composition
- Training text-to-image models from scratch using the decomposed three-stage training strategy
- Running fast inference at 1024px in under 0.5 seconds using the PIXART-δ LCM variant
- Producing images in under 8GB of GPU VRAM using optimized diffusers integration
- Auto-captioning large image datasets with LLaVA to generate dense pseudo-captions for training
- Extracting T5 text features and VAE image features to speed up training pipelines
- Fine-tuning with LoRA for lightweight model adaptation on custom datasets
- Experimenting with multiple samplers including DPM-Solver, SA-Solver, and IDDPM
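
As a starting point for the inference use cases above, here is a hedged sketch based on the Hugging Face Diffusers `PixArtAlphaPipeline`. The checkpoint IDs, the `generate_image` helper, and the output filename are assumptions that do not appear in the text above, so verify them against the repository before relying on them. The sketch also assumes `torch`, `diffusers`, `transformers`, and `accelerate` are installed and a CUDA GPU is available.

```python
def generate_image(prompt, use_lcm=False, out_path="pixart_sample.png"):
    """Generate one image with the Diffusers PixArt-α pipeline (illustrative helper)."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from diffusers import PixArtAlphaPipeline

    # Assumed Hugging Face Hub checkpoint IDs; confirm before use.
    model_id = (
        "PixArt-alpha/PixArt-LCM-XL-2-1024-MS"
        if use_lcm
        else "PixArt-alpha/PixArt-XL-2-1024-MS"
    )
    pipe = PixArtAlphaPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

    # Keep idle submodules on the CPU to lower peak GPU memory (requires accelerate).
    pipe.enable_model_cpu_offload()

    # LCM checkpoints are distilled for few-step sampling without
    # classifier-free guidance; otherwise use the pipeline defaults.
    kwargs = {"num_inference_steps": 4, "guidance_scale": 0.0} if use_lcm else {}

    image = pipe(prompt, **kwargs).images[0]
    image.save(out_path)
    return out_path
```

A call such as `generate_image("a small cactus wearing a straw hat in the desert", use_lcm=True)` would exercise the fast LCM path. Actual latency and memory use depend on hardware and library versions.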

