Testing the 'smartest in the world' Grok3: have the marginal returns of scaling models really hit their limit?

On February 18, Beijing time, Musk and the xAI team officially released the latest version of Grok, Grok3, during a live broadcast.

Even before the launch event, a steady stream of leaks and Musk's round-the-clock hype had raised global expectations for Grok3 to an unprecedented level. A week earlier, while commenting on DeepSeek R1 in a live stream, Musk confidently declared that "xAI is about to launch a better AI model."

According to the figures shown at the event, Grok3 surpasses all mainstream models on math, science, and programming benchmarks. Musk even claimed that Grok3 will one day handle calculations for SpaceX's Mars missions, and predicted a "Nobel Prize-level breakthrough within three years."

For now, though, these are just Musk's words. After the release, the author tested the latest beta version of Grok3 with the classic question that trips up large models: "Which is bigger, 9.11 or 9.9?"

Unfortunately, asked plainly, with no qualifiers or hints, the self-proclaimed smartest model still could not answer this question correctly.

Grok3 failed to grasp what the question was really asking | Image source: Geek Park
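To see why the question is a trap, here is a minimal Python illustration (an added sketch, not anything from xAI): read as a version number or a date, 9.11 comes "after" 9.9, but read as a decimal, 9.11 is the smaller value.

```python
# Two readings of "9.11 vs 9.9" that collide. A model that falls back on
# version-number or date intuitions picks 9.11; plain arithmetic picks 9.9.

a, b = "9.11", "9.9"

# Version-style reading: split on "." and compare the parts as integers.
version_a = tuple(int(part) for part in a.split("."))   # (9, 11)
version_b = tuple(int(part) for part in b.split("."))   # (9, 9)
print("as versions:", a if version_a > version_b else b, "comes later")  # 9.11

# Decimal reading: the mathematically correct comparison.
print("as numbers: ", a if float(a) > float(b) else b, "is bigger")      # 9.9
```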

Once posted, the test quickly drew widespread attention. Overseas users ran many similar trials: Grok3 was also found to fumble basic physics and math questions such as "which ball hits the ground first when dropped from the Leaning Tower of Pisa," earning it the mocking label of "a genius unwilling to answer simple questions."

Grok3 stumbled over many common-sense questions in real-world testing | Image source: X

Beyond the netizens' basic-knowledge tests, Grok3 also slipped up at xAI's own launch event. During the live stream, Musk had Grok3 analyze the character classes and Ascendancy effects in Path of Exile 2, a game he claims to play often. In reality, most of the answers Grok3 gave were wrong, and Musk did not notice these obvious errors during the broadcast.

Grok3 also produced plenty of incorrect data during the live stream | Image source: X

The slip not only gave overseas netizens fresh ammunition to mock Musk for having someone else play his account, it also put another big question mark over Grok3's reliability in practical applications.

For such a "genius," whatever its raw ability, its reliability in extremely complex scenarios such as Mars exploration missions has to be seriously questioned.

Testers who received early access to Grok3 a few weeks ago, as well as those who only tried the model for a few hours yesterday, point to the same conclusion about its current performance:

"Grok3 is good, but it is not better than R1 or o1-Pro"

"Grok3 is good, but it's not better than R1 or o1-Pro" | Image Source: X

In the official slides for Grok3's release, it sits in a "far ahead" position on the Chatbot Arena leaderboard. But the chart relies on a small plotting trick: the vertical axis only covers the 1300-1400 score range, which makes a roughly 1% difference in results look dramatic.

The "far ahead" effect in the official launch slides | Image source: X
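The effect is easy to reproduce. Below is a small matplotlib sketch using made-up scores (not the slide's real data) that shows how truncating the y-axis turns a roughly 1% gap into an apparent runaway lead.

```python
# Illustrative Arena-style scores about 1% apart; not the actual leaderboard data.
import matplotlib.pyplot as plt

models = ["Grok3", "Model B", "Model C"]
scores = [1402, 1390, 1385]

fig, (ax_trunc, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

ax_trunc.bar(models, scores)
ax_trunc.set_ylim(1300, 1410)   # truncated axis: the gap looks enormous
ax_trunc.set_title("Axis starts at 1300")

ax_full.bar(models, scores)
ax_full.set_ylim(0, 1500)       # full axis: the gap nearly disappears
ax_full.set_title("Axis starts at 0")

plt.tight_layout()
plt.show()
```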

In actual scores, Grok3 leads DeepSeek R1 and GPT-4o by less than 1-2%, which translates to "no perceptible difference" in users' hands-on experience.

In actual scores, Grok3 leads its rivals by only 1%-2% | Image source: X

Moreover, although Grok3's score tops every publicly tested model, many people don't buy it: xAI has been accused of gaming this leaderboard since the Grok2 era, and once the board controls for answer length and style, Grok's score drops sharply. Industry insiders therefore often dismiss it as "high score, low ability."

Whether it's the leaderboard gaming or the tricks in chart design, both reveal how obsessed Musk and xAI are with appearing "far ahead" in model capability.

And Musk paid a steep price for that sliver of a lead: at the launch he stated, almost boastfully, that nearly 200,000 H100s were used to train Grok3 (in the live stream he said "more than 100,000"), for a claimed total of two billion GPU-hours of training. This has led some to conclude that Grok3 is another major boon for the GPU industry, and that the shock DeepSeek dealt the sector was a "foolish" overreaction.

Many people believe that stacking computing power will be the future of model training | Image source: X

In fact, some netizens compared this with DeepSeek V3, which was trained on about 2,000 H800s for two months, and calculated that Grok3's actual training compute is roughly 263 times that of V3. Yet on the Chatbot Arena leaderboard, the gap between DeepSeek V3 and Grok3, which scored 1402, is not even 100 points.
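A quick back-of-envelope sketch in Python makes the mismatch concrete. The 263x multiple and the "under 100 points" gap are the netizens' estimates quoted above, not official disclosures, and the V3 GPU-hours figure is a rough reading of "about 2,000 H800s for two months."

```python
# Rough arithmetic only: every input here is an assumption taken from the
# figures quoted in the article, not a verified disclosure.
v3_gpu_hours = 2_000 * 2 * 30 * 24      # ~2.88 million H800 GPU-hours for V3
compute_multiple = 263                   # estimated Grok3 / V3 training compute
grok3_gpu_hours = v3_gpu_hours * compute_multiple

print(f"Implied Grok3 training: ~{grok3_gpu_hours / 1e6:.0f} million GPU-hours")
print(f"Compute spent: {compute_multiple}x V3 -> Arena gap: <100 points on a ~1400-point score")
```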

Once these numbers came out, many people quickly realized that behind Grok3's "world's strongest" title lies the same old logic of "bigger model, stronger performance," and that the marginal returns of that approach are clearly shrinking.

Even the "high score, low ability" Grok2 was propped up by a large amount of high-quality first-party data from the X (Twitter) platform. In training Grok3, xAI inevitably ran into the same "ceiling" OpenAI now faces: the shortage of high-quality training data quickly exposes the diminishing marginal returns of the model's capabilities.

The people who realized this first, and who understand it most deeply, are surely Grok3's development team and Musk himself. That is why Musk has kept stressing on social media that the version users are currently trying is "just a beta" and that "the full version will be released in the coming months." Musk has personally taken on the role of Grok3's product manager, urging users to report the problems they run into directly in the comments.

He is probably the product manager with the most fans on earth | Image Source: X

But in less than a day, Grok3's performance has undoubtedly sounded the alarm for anyone hoping to build stronger large models through brute-force scaling alone. Based on publicly available information from Microsoft, OpenAI's GPT-4 is estimated to have about 1.8 trillion parameters, more than ten times GPT-3, and the rumored GPT-4.5 may be larger still.

As model parameter counts soar, training costs skyrocket with them | Image source: X

With Grok3 as a cautionary example, GPT-4.5 and anyone else hoping to keep "burning money" for better performance through sheer parameter count must now reckon with the ceiling that is plainly in sight, and with how to break through it.

Against this backdrop, the remark made last December by OpenAI's former chief scientist Ilya Sutskever, that "pre-training as we know it will come to an end," has been brought up again as people search for the real way forward in training large models.

Ilya's views have already sounded the alarm for the industry | Image source: X

At the time, Ilya accurately foresaw that usable new data was nearing exhaustion and that models would struggle to keep improving by ingesting more of it. He likened the situation to fossil fuels: "just as oil is a finite resource, the content humans generate on the internet is limited too."

In Sutskever's prediction, the generation of models that follows pre-trained ones will have "true autonomy" and, at the same time, "human-like" reasoning abilities.

Unlike today's pre-trained models, which mainly match new inputs against content they have previously learned, future AI systems will be able to learn and build up problem-solving methods step by step, in a way closer to human "thinking."

A human can reach basic competence in a field with just a few foundational textbooks, while a large AI model needs millions of examples to reach even entry-level performance, and still misreads those basic questions the moment the phrasing changes. The model has not improved in genuine intelligence: the simple questions mentioned at the start of this article, which Grok3 still cannot answer correctly, are an intuitive illustration of exactly that.

But if, looking beyond brute-force scaling, Grok3 really does show the industry that "pre-trained models are nearing their end," it would still carry an important lesson for the field.

Perhaps, once the Grok3 frenzy gradually subsides, we will see more cases like Fei-Fei Li's "fine-tuning a high-performance model on a specific dataset for $50," and it is in those explorations that the real path to AGI will ultimately be found.
