Azure Speech in Foundry Tools is a comprehensive suite of AI-powered speech APIs that enables developers to build voice-enabled, multilingual applications and intelligent agents. Part of Microsoft Foundry (formerly Azure AI Services), it provides production-ready capabilities for speech-to-text transcription, text-to-speech synthesis, speech translation, speaker recognition, and real-time voice interactions.
The service supports real-time and batch speech-to-text transcription with high accuracy, including the latest OpenAI Whisper model integration. Developers can transcribe call center and meeting conversations, generate captions and subtitles in more than 100 languages, and apply custom speech models trained on domain-specific acoustic and language data to improve recognition quality.
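
As a rough illustration of the captioning scenario, the sketch below turns timed transcription segments into SubRip (SRT) captions. It assumes each segment carries an offset and a duration in 100-nanosecond ticks, which is how the Speech SDK reports timing on recognition results. The `Segment` type and both function names are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass

TICKS_PER_MS = 10_000  # one tick is 100 ns, so 10,000 ticks per millisecond


@dataclass
class Segment:
    """Hypothetical container for one recognized phrase (not an SDK type)."""
    offset_ticks: int    # start of the phrase, in 100 ns ticks
    duration_ticks: int  # length of the phrase, in 100 ns ticks
    text: str


def ticks_to_srt_time(ticks: int) -> str:
    """Format a tick count as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks // TICKS_PER_MS
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = ticks_to_srt_time(seg.offset_ticks)
        end = ticks_to_srt_time(seg.offset_ticks + seg.duration_ticks)
        blocks.append(f"{index}\n{start} --> {end}\n{seg.text}")
    return "\n\n".join(blocks) + "\n"
```

In practice the segments would be filled from recognition results rather than built by hand, and long phrases would likely need to be split to fit caption length limits.
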
Text-to-speech capabilities include a broad library of prebuilt neural voices available out of the box in more than 100 languages and locales, alongside custom neural voice options that allow organizations to create branded, realistic voices for their products. High-definition neural voices and Azure OpenAI neural voices are also available. The text-to-speech avatar feature converts text into digital video of a photorealistic human speaking with a natural-sounding voice, supporting both real-time and batch synthesis.
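
Voice selection for text-to-speech is commonly expressed in Speech Synthesis Markup Language (SSML). The following is a minimal sketch of building such a request body with the Python standard library only. The function name `build_ssml` is hypothetical, and the default voice `en-US-JennyNeural` is just an example of a prebuilt neural voice name, so check the current voice list before relying on it.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap plain text in a single-voice SSML document.

    The text is XML-escaped so characters such as '&' or '<' do not
    break the markup; attribute values are quoted safely.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NAMESPACE}" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
```

The resulting string could then be handed to a synthesis call that accepts SSML, for example the Speech SDK's `speak_ssml_async`, assuming a configured synthesizer is available.
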
Speech translation enables real-time speech-to-speech or speech-to-text translation across a wide range of languages.
The Voice Live API provides a unified real-time speech-to-speech interface for building scalable, production-ready voice AI agents that integrate transcription, generative AI models, synthesis, and conversational enhancements in a single low-latency pipeline.
Embedded speech supports on-device scenarios where cloud connectivity is intermittent or unavailable. The service is accessible via Speech CLI, Speech SDK, and REST APIs and can be deployed in the cloud or at the edge with containers.
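
To make the REST access path a little more concrete, here is a hedged sketch that prepares (but does not send) a speech-to-text request for a short WAV clip using only the standard library. The host pattern, path, and header names follow the speech-to-text REST API for short audio as I understand it, and should be verified against the current documentation. The function name `build_short_audio_request` and its parameters are hypothetical.

```python
import urllib.parse
import urllib.request


def build_short_audio_request(
    region: str, key: str, wav_bytes: bytes, language: str = "en-US"
) -> urllib.request.Request:
    """Prepare a POST request for short-audio transcription.

    Assumptions: `region` is the resource's region identifier, `key` is a
    valid resource key, and `wav_bytes` holds 16 kHz PCM WAV audio.
    """
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=wav_bytes, headers=headers, method="POST")
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual call. For longer recordings or streaming audio, the Speech SDK or batch transcription is likely the better fit.
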
Common use cases include:
- Transcribing call center conversations in real time to assist agents and automate post-call analytics
- Generating captions and subtitles for audio and video content in more than 100 languages
- Building conversational voice bots and assistants with natural-sounding prebuilt or custom neural voices
- Enabling real-time speech-to-speech translation for multilingual communication applications
- Creating custom neural voices that reflect a brand's identity for differentiated user experiences
- Powering voice-enabled AI agents with end-to-end speech capabilities, including customized transcription and avatars
- Transcribing and summarizing meeting conversations for productivity and documentation workflows
- Providing pronunciation assessment feedback to language learners in real time
- Building text-to-speech avatar videos for customer service, education, and marketing content
- Deploying on-device speech recognition and synthesis in environments with intermittent connectivity
- Analyzing audio and video call recordings to extract business insights using foundation models
- Integrating fast batch transcription for voicemail processing, media captioning, and archiving workflows

