Microsoft's self-developed AI "trinity" goes live, with the bold goal of independently developing cutting-edge large models by 2027.


American technology company Microsoft announced on Thursday that three internally developed AI models have officially been launched for broad commercial use, highlighting the company’s efforts to move away from relying on longtime partner OpenAI.

Specifically, the MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 models, developed by Microsoft's AI superintelligence team, cover the three most commercially valuable capabilities in enterprise AI: speech-to-text transcription, voice generation, and image creation.

(These updates were announced by Microsoft CEO Nadella; Source: X)

Microsoft said that MAI-Transcribe-1 is the most accurate of the widely used transcription models on the market: across the languages tested, its average error rate is 3.9%, versus 4.2% for OpenAI's GPT-Transcribe and 4.9% for Gemini 3.1 Flash.
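Transcription error rates like these are typically reported as word error rate (WER): the edit distance between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the metric, for illustration only (not Microsoft's evaluation code):

```python
# Word error rate (WER): substitutions + deletions + insertions,
# counted via edit distance over words, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quack brown fox"))  # 0.25
```

A 3.9% WER means roughly one word in twenty-six is wrong relative to the reference; a few tenths of a percentage point, as between the models quoted above, is a meaningful gap at scale.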

The MAI-Voice-1 voice generation model is said to be able to generate 60 seconds of audio in less than one second on a “single GPU,” and to maintain voice consistency during the generation of long-form content.

MAI-Image-2 was first released on March 19. On Thursday, it was also launched for broad commercial use along with the other two models. At present, the model ranks third in the text-to-image generation rankings on the “Large Model Arena,” behind Google’s hit product Nano Banana 2 and OpenAI’s GPT-Image 1.5.

In a side-by-side comparison of pricing, MAI-Image-2’s text input starts at $5 per 1 million tokens, and image output starts at $33 per 1 million tokens. Google’s Gemini 3 Pro image generation model is $120 per 1 million tokens, and Gemini 3.1 Flash image is $60 per 1 million tokens.
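A back-of-envelope sketch puts the per-million-token prices above into per-image terms. The per-image output token count used here (1,100 tokens) is a hypothetical figure for illustration; actual counts vary by model and resolution:

```python
# Quoted output prices, USD per 1 million image-output tokens.
PRICES_PER_M_OUTPUT = {
    "MAI-Image-2": 33,
    "Gemini 3.1 Flash image": 60,
    "Gemini 3 Pro image": 120,
}

def image_cost(model: str, output_tokens: int) -> float:
    """Cost in USD for one generation emitting `output_tokens` tokens."""
    return PRICES_PER_M_OUTPUT[model] * output_tokens / 1_000_000

# Assumed ~1,100 output tokens per image (hypothetical, for illustration).
for model in PRICES_PER_M_OUTPUT:
    print(f"{model}: ${image_cost(model, 1_100):.4f} per image")
```

On these assumptions, MAI-Image-2's output pricing comes in at roughly half of Gemini 3.1 Flash's and a quarter of Gemini 3 Pro's, which matches the positioning the price list suggests.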

Goal: Independently develop world-leading large models

Microsoft’s latest move traces back to last October, when the company restructured its relationship with OpenAI, gaining the right to pursue artificial general intelligence either on its own or with third-party partners. The previous agreement gave Microsoft access to OpenAI’s intellectual property but barred it from developing competing AI systems.

Microsoft’s AI CEO Mustafa Suleyman has said publicly that the team’s goal by 2027 is “to be able to truly reach the state of the art,” encompassing models that can respond to or generate text, images, and audio.

Suleyman said the company is building the computing power required to train such models, and began deploying Nvidia GB200 chips last October.

He said: “From that point, over roughly the next 12 to 18 months, we will gradually scale up to reach computing capability at the cutting edge.”

A co-founder of Google DeepMind, Suleyman joined Microsoft in 2024 to lead the integration of AI into the company’s consumer products. After Microsoft and OpenAI finalized their deal last October, he took over full-time leadership of Microsoft’s AI superintelligence team in November. In an internal reorganization last month, his responsibilities were narrowed to model development, while former Snap executive Jacob Andrio took over Microsoft’s Copilot assistant products for enterprise and individual users.

Suleyman told the media: “What we want to emphasize is the importance of advancing our own most advanced AI capabilities and achieving long-term autonomy as a strategic mission over the next three to five years.” He added that the company will continue to host models developed by other companies.

From a long-term perspective, Microsoft’s deep access to OpenAI’s intellectual property expires between 2032 and 2033, so building its own large models is also an important risk hedge.

Even so, Microsoft’s newly launched self-developed models have quite a few shortcomings, suggesting Suleyman’s team has plenty of work ahead in the coming year.

For example, MAI-Image-2 currently supports only a 1:1 aspect ratio, with no landscape or portrait options, and lacks features common in other AI applications, such as image-to-image editing and reference-image support. MAI-Transcribe-1 cannot distinguish different speakers in a conversation and does not yet support context biasing or streaming output; Microsoft says all three functions are under development.

(Source: Caixin)
