Microsoft's trio of in-house AI models goes into broad release, with the bold goal of independently developing cutting-edge large models by 2027.


American technology company Microsoft announced on Thursday that three internally developed AI models are officially being rolled out for broad commercial use, showing the company’s efforts to break away from its long-term reliance on OpenAI.

Specifically, the MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 models, developed by Microsoft's AI superintelligence team, cover the three most commercially valuable capabilities in enterprise AI: speech transcription, speech generation, and image creation.

(Microsoft CEO Satya Nadella announced this update; source: X)

Microsoft says MAI-Transcribe-1 has the highest accuracy among the most widely used transcription models on the market, with an average error rate of 3.9% across the languages tested, versus 4.2% for OpenAI's GPT-Transcribe and 4.9% for Gemini 3.1 Flash.

MAI-Voice-1, a speech generation model, is said to produce 60 seconds of audio in under one second on a single GPU while maintaining a consistent voice across longer generations.

MAI-Image-2 was first released on March 19, and on Thursday it was rolled out for broad commercial use alongside the other two models. Currently, in the “Large Model Arena” text-to-image ranking, this model ranks third, behind Google’s blockbuster Nano Banana 2 and OpenAI’s GPT-Image 1.5.

In a side-by-side comparison of pricing, MAI-Image-2’s text input starts at $5 per 1 million tokens, while image output starts at $33 per 1 million tokens. Google’s Gemini 3 Pro image generation model is $120 per 1 million tokens, and Gemini 3.1 Flash image generation is $60 per 1 million tokens.
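At those per-million-token rates, the cost of a single image scales linearly with however many output tokens the provider counts per image. A minimal sketch of the comparison, assuming a purely hypothetical 1,000 output tokens per image (the article does not state token counts):

```python
# Per-million-token output rates quoted in the article (USD).
RATES_PER_M_TOKENS = {
    "MAI-Image-2 (output)": 33,
    "Gemini 3.1 Flash (image gen)": 60,
    "Gemini 3 Pro (image gen)": 120,
}

# Hypothetical assumption for illustration only; real token counts
# per image vary by provider and are not given in the article.
ASSUMED_TOKENS_PER_IMAGE = 1_000


def cost_per_image(rate_per_million: float,
                   tokens: int = ASSUMED_TOKENS_PER_IMAGE) -> float:
    """Dollar cost of one image at a given per-million-token rate."""
    return rate_per_million * tokens / 1_000_000


for name, rate in RATES_PER_M_TOKENS.items():
    print(f"{name}: ${cost_per_image(rate):.4f} per image")
```

Under that assumption, MAI-Image-2's per-image output cost would be roughly half of Gemini 3.1 Flash's and about a quarter of Gemini 3 Pro's, mirroring the ratio of the quoted rates.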

Goal: Independently develop world-leading large models

Microsoft’s latest move traces back to last October, when the company restructured its partnership with OpenAI, allowing Microsoft to pursue artificial general intelligence on its own or with third-party partners. The earlier agreement let Microsoft use OpenAI's intellectual property but barred it from developing competing AI systems.

Microsoft’s CEO of AI, Mustafa Suleyman, has said publicly that the team’s goal by 2027 is to “truly reach the state of the art,” covering models that can respond to or generate text, images, and audio.

Suleyman said the company is building the compute capacity needed to train its models, and has been deploying Nvidia GB200 chips since last October.

He said, “From there, over roughly the next 12 to 18 months, we will be gradually scaling up to reach state-of-the-art compute.”

A co-founder of Google DeepMind, Suleyman joined Microsoft in 2024 to integrate AI into its consumer products. After Microsoft and OpenAI finalized their agreement last October, he moved in November to lead Microsoft's AI superintelligence team full time. In an internal reshuffle last month, Suleyman's responsibilities were narrowed to model development, with former Snap executive Jacob Andrio taking over Microsoft's Copilot assistant products for enterprise and individual users.

Suleyman told the media, “We want to emphasize the importance of driving our own most advanced AI capabilities over the next three to five years and achieving this strategic mission of long-term independence.” He added that the company will continue to host models developed by other companies as well.

From a long-term perspective, Microsoft’s deep access to OpenAI’s intellectual property will expire in 2032, so developing its own large models is also an important risk hedge.

Microsoft's fledgling in-house model business still has plenty of shortcomings, suggesting that Suleyman's team has a great deal of work ahead in the coming year.

For example, MAI-Image-2 currently supports only a 1:1 aspect ratio, with no horizontal or vertical options. Capabilities common in other AI applications, such as image-to-image editing and reference image support, are also unavailable. MAI-Transcribe-1 cannot distinguish between different speakers in a conversation and does not support context biasing or streaming; Microsoft says all three features are under development.

(Source: Caixin Global)
