https://onfjbfzboswbvycybxaj.supabase.co/storage/v1/object/public/Icons/pix_art_alpha.jpg

PixArt-α

Fast Diffusion Transformer for photorealistic text-to-image synthesis
Creative

DEVELOPER
PixArt-alpha
STARTING PRICE
Free
PRICING TYPE
Free
BEST FOR
Personal/Business
SUPPORTED LANGUAGES
EN
Description

PixArt-α is a Transformer-based text-to-image diffusion model whose photorealistic output is competitive with leading systems such as Imagen, SDXL, and Midjourney, while requiring far less compute and money to train. Built on a Diffusion Transformer (DiT) architecture with cross-attention modules for text conditioning, the model supports high-resolution image synthesis up to 1024px. The accompanying paper was accepted as a Spotlight at ICLR 2024.

Three core design principles power the system. First, a decomposed training strategy optimizes pixel dependency, text-image alignment, and aesthetic quality in separate training phases. Second, an efficient T2I Transformer integrates cross-attention into the DiT backbone and streamlines the computation-intensive class-conditioning branch. Third, a highly informative data pipeline uses a large vision-language model to generate dense pseudo-captions automatically, which sharpens text-image alignment with only 25 million training images.
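To make the second principle concrete, here is a minimal single-head cross-attention sketch in plain Python, in which image tokens act as queries and text tokens supply keys and values. It is an illustration only, not the repository's implementation: the function names, the tiny list-based matrices, and the absence of multi-head splitting, normalization, and residual connections are all simplifying assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend over text tokens.

    Hypothetical helper for illustration; shapes are
    image_tokens (n_img, d_in), text_tokens (n_txt, d_in), weights (d_in, d).
    """
    q = matmul(image_tokens, w_q)  # queries come from the image latents
    k = matmul(text_tokens, w_k)   # keys come from the text embeddings
    v = matmul(text_tokens, w_v)   # values come from the text embeddings
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose k to (d, n_txt)
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1 over text tokens
    return matmul(weights, v), weights
```

Each image token receives a weighted mix of text-token values, which is how the prompt can influence every spatial position of the latent.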

The result is a training cost of approximately $26,000 and roughly 675 A100 GPU days (about 10.8% of Stable Diffusion v1.5's training requirements), with CO2 emissions cut by 90% compared to that baseline. Against the larger RAPHAEL model, PixArt-α achieves comparable quality at just 1% of the training cost.
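As a quick sanity check on these figures, the short calculation below derives the Stable Diffusion v1.5 budget implied by the 10.8% ratio, along with the implied price per GPU day. Only the three input numbers come from the text; the derived values are back-of-the-envelope estimates, and the variable and function names are illustrative.

```python
def implied_baseline(value, fraction):
    """Return the baseline quantity that `value` is `fraction` of."""
    return value / fraction

# Figures quoted in the description above.
pixart_gpu_days = 675      # A100 GPU days
pixart_cost_usd = 26_000   # approximate training cost
share_of_sd15 = 0.108      # 10.8% of Stable Diffusion v1.5's requirements

# Derived estimates (not stated in the description).
sd15_gpu_days = implied_baseline(pixart_gpu_days, share_of_sd15)
cost_per_gpu_day = pixart_cost_usd / pixart_gpu_days

print(f"Implied SD v1.5 budget: ~{sd15_gpu_days:,.0f} A100 GPU days")
print(f"Implied price: ~${cost_per_gpu_day:.2f} per A100 GPU day")
```

Under these inputs the implied baseline is roughly 6,250 A100 GPU days, at an implied rate of about $38.52 per GPU day.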

PixArt-δ, the follow-up variant, integrates Latent Consistency Models (LCM) and ControlNet support, enabling 1024×1024 image generation in under 0.5 seconds on an A100 GPU. With the optimized Hugging Face Diffusers integration, inference can also run in under 8GB of GPU VRAM. The repository provides training scripts, inference pipelines, DreamBooth fine-tuning, LoRA training, ControlNet customization, Gradio demos, and full Hugging Face Diffusers integration.
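For readers who want to try the Diffusers integration, the sketch below shows one plausible way to call it. It assumes `torch` and `diffusers` are installed and a CUDA GPU is available. The two Hugging Face checkpoint names and the 4-step, guidance-free LCM settings are assumptions not stated in this description, so verify them against the current model cards. The helper names `sampling_kwargs` and `generate_image` are hypothetical.

```python
# Checkpoint identifiers are assumptions; check the Hugging Face model cards.
BASE_MODEL_ID = "PixArt-alpha/PixArt-XL-2-1024-MS"
LCM_MODEL_ID = "PixArt-alpha/PixArt-LCM-XL-2-1024-MS"

def sampling_kwargs(use_lcm):
    """Return sampler settings: few steps and no guidance for the LCM variant."""
    if use_lcm:
        return {"num_inference_steps": 4, "guidance_scale": 0.0}
    return {}  # fall back to the pipeline's own defaults

def generate_image(prompt, use_lcm=False, device="cuda"):
    """Generate one image from a text prompt and return it as a PIL image."""
    # Imported lazily so this file loads even without torch/diffusers installed.
    import torch
    from diffusers import PixArtAlphaPipeline

    model_id = LCM_MODEL_ID if use_lcm else BASE_MODEL_ID
    pipe = PixArtAlphaPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    result = pipe(prompt, **sampling_kwargs(use_lcm))
    return result.images[0]

# Example call (downloads several GB of weights on first run):
# generate_image("a lighthouse at dusk, photorealistic", use_lcm=True).save("out.png")
```

The few-step LCM path is what makes sub-second generation feasible; actual latency and memory use depend on the hardware and on how the text encoder is loaded.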

Use cases
  • Generating photorealistic images from detailed text prompts at up to 1024px resolution
  • Fine-tuning the model on custom subject images using DreamBooth for personalized image generation
  • Applying ControlNet conditioning with HED edge maps to guide image structure and composition
  • Training text-to-image models from scratch using the decomposed three-stage training strategy
  • Running fast inference at 1024px in under 0.5 seconds using the PixArt-δ LCM variant
  • Producing images in under 8GB of GPU VRAM using the optimized Hugging Face Diffusers integration
  • Auto-captioning large image datasets with LLaVA to generate dense pseudo-captions for training
  • Extracting T5 text features and VAE image features to speed up training pipelines
  • Fine-tuning with LoRA for lightweight model adaptation on custom datasets
  • Experimenting with multiple samplers including DPM-Solver, SA-Solver, and IDDPM
Features
  • Diffusion Transformer (DiT) architecture
  • Text-to-image synthesis up to 1024px
  • ControlNet image conditioning
  • DreamBooth personalization
  • LCM fast inference support
  • LoRA fine-tuning scripts
  • Hugging Face Diffusers integration
  • Multi-scale VAE feature extraction
  • LLaVA auto-captioning pipeline
  • Gradio local demo app
