Trivia Cafe
18

What three new foundational AI models for image generation, voice generation, and speech-to-text transcription did Microsoft launch in April 2026?

MAI-Transcribe-1, MAI-Voice-1, MAI-Image-2 — current events

In April 2026, Microsoft significantly expanded its in-house artificial intelligence capabilities by launching three new foundational models: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for voice generation, and MAI-Image-2 for image creation. The launch is a strategic move to diversify Microsoft's AI portfolio and reduce its reliance on external partners such as OpenAI, strengthening its position against other tech giants like Google and Amazon. The introduction of these models under the Microsoft AI (MAI) division underscores a push towards multimodal, proprietary AI systems.

MAI-Transcribe-1 is Microsoft's first dedicated transcription model, engineered to convert audio into text across 25 languages with enterprise-grade accuracy. It outperforms existing solutions, including OpenAI's Whisper-large-v3 and Google's Gemini 3.1 Flash, and runs up to 2.5 times faster than Microsoft's previous Azure Fast transcription model. The model is designed for applications such as video captioning, meeting transcription, and voice-enabled agents, and is available through Microsoft Foundry and the MAI Playground.

Complementing the transcription model, MAI-Voice-1 is a high-fidelity speech generation model capable of producing up to a minute of natural, emotionally rich audio in just one second. This model emphasizes preserving speaker identity and emotional tone, and it now supports custom voice creation from only a few seconds of audio.

For visual content, MAI-Image-2, the second generation of Microsoft's in-house image model, offers at least twice the generation speed of its predecessor while delivering more realistic details in areas like skin tone, lighting, and textures. MAI-Image-2 has quickly ranked among the top image model families on the Arena.ai leaderboard and is being integrated into Microsoft products like Bing and PowerPoint.

These foundational models are available for commercial use through Microsoft Foundry, empowering developers to build and scale their AI solutions with advanced, in-house Microsoft technology.