Prodigy is a downloadable, self-hosted data annotation and machine teaching tool built for developers, researchers, and data science teams who need to create high-quality training and evaluation data for machine learning models. It is developed by Explosion, the team behind the spaCy NLP library, and is designed as a Python package with a built-in web application that can be extended, scripted, and deployed entirely on your own infrastructure.
The tool supports a wide range of annotation tasks: named entity recognition, text classification, span categorization, part-of-speech tagging, dependency parsing, and coreference resolution for text; image classification and segmentation for computer vision; and transcription and speaker diarization for audio and video. Prodigy integrates natively with spaCy and also supports Hugging Face models and major large language model API providers out of the box.
A core design principle of Prodigy is its scriptable recipe system. Recipes are Python functions that define full annotation workflows, allowing teams to implement custom data processing, model-assisted pre-annotation, active learning, and training logic without requiring front-end expertise. The web application exposes 23 annotation interfaces covering text, images, audio, video, relations, multiple-choice, HTML, and conflict resolution tasks.
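As a rough illustration of the recipe idea, the sketch below shows a plain Python function that returns the pieces of an annotation workflow: a dataset name, a stream of tasks, and an interface ID. It assumes Prodigy's convention of returning a dictionary with `dataset`, `stream`, and `view_id` keys. The names `classify_recipe`, `make_stream`, and `demo_dataset` are hypothetical. Registration with the `@prodigy.recipe` decorator is deliberately left out so that the snippet runs without Prodigy installed.

```python
from typing import Dict, Iterable, Iterator


def make_stream(texts: Iterable[str], label: str) -> Iterator[Dict[str, str]]:
    """Turn raw texts into annotation tasks, skipping empty lines."""
    for text in texts:
        text = text.strip()
        if text:
            # Each task is a plain dict; the annotator accepts or rejects the label.
            yield {"text": text, "label": label}


def classify_recipe(dataset: str, texts: Iterable[str], label: str) -> Dict[str, object]:
    """Hypothetical recipe body: describes the workflow as a dict of components.

    With Prodigy installed, a function like this would be registered with the
    @prodigy.recipe decorator and started from the command line (assumption:
    omitted here so the example has no external dependencies).
    """
    return {
        "dataset": dataset,                   # where accepted/rejected answers are saved
        "stream": make_stream(texts, label),  # lazily generated tasks
        "view_id": "classification",          # which annotation interface to render
    }


components = classify_recipe("demo_dataset", ["Great product", "", "Too slow"], "POSITIVE")
tasks = list(components["stream"])
```

Because the stream is a generator, tasks are produced lazily, which is what lets a workflow of this shape filter, reorder, or pre-annotate examples before they reach the annotator.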
Prodigy itself runs entirely offline and never connects to external servers on its own, making it compatible with air-gapped environments and strict data privacy requirements; only workflows that you explicitly configure to call a hosted LLM API send data outside your network. All data, models, and outputs remain on your own machines. Licenses are issued as one-time lifetime purchases with 12 months of free upgrades included, and cover unlimited personal and professional use.
Common use cases include:
- Annotating named entities in text corpora for custom NLP model training
- Training and fine-tuning spaCy pipelines using human-labeled annotation data
- Building text classification datasets with active learning model assistance
- Classifying and segmenting images for computer vision model development
- Annotating audio and video files including transcription and speaker diarization
- Developing and evaluating LLM prompts with structured annotation workflows
- Extracting structured information from unstructured documents and articles
- Creating evaluation benchmarks for custom machine learning models
- Running annotation pipelines in air-gapped or high-security environments
- Building custom annotation interfaces with HTML, CSS, and JavaScript extensions
- Processing and labeling financial, healthcare, legal, and media domain data
- Enabling non-technical annotators to label data through a configured web interface
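
The active learning use case in the list above can be sketched as simple uncertainty sampling: the examples a model is least sure about are sent to the annotator first. This is a generic illustration of the technique, not Prodigy's own implementation. The scoring function `toy_predict` and its hard-coded scores are hypothetical stand-ins for a real model.

```python
from typing import Callable, Iterable, List


def uncertainty(score: float) -> float:
    """Map a probability to an uncertainty: highest at 0.5, lowest at 0.0 or 1.0."""
    return 1.0 - abs(score - 0.5) * 2.0


def most_uncertain(texts: Iterable[str], predict: Callable[[str], float], k: int) -> List[str]:
    """Return the k texts whose predicted score is closest to 0.5."""
    scored = [(uncertainty(predict(text)), text) for text in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def toy_predict(text: str) -> float:
    """Hypothetical model: returns a fixed probability per known text."""
    scores = {
        "refund please": 0.95,
        "ok I guess": 0.55,
        "love it": 0.05,
        "not sure yet": 0.48,
    }
    return scores.get(text, 0.5)


queue = most_uncertain(
    ["refund please", "ok I guess", "love it", "not sure yet"],
    toy_predict,
    k=2,
)
```

In a real workflow the model would be updated as answers come in, so the ranking shifts over time toward whatever the model still finds ambiguous.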

