Muon's confidence is well calibrated during training, but it becomes overconfident on new samples.

CoinWorld News reports that the Muon optimizer is well calibrated during training but overconfident on new samples. The paper "Too Sharp, Too Sure: When Calibration Follows Curvature" finds that models trained with Muon assess their confidence accurately on the training set, but on the test set their stated confidence exceeds their actual accuracy. In experiments on CIFAR-10 image classification, Muon's test expected calibration error (ECE) is 0.065, versus 0.061 for AdamW, 0.081 for SGD, and 0.020 for SAM; Muon's training ECE is nearly zero, indicating an especially large gap between training and test calibration. The proposed Calmo method reduces Muon's test ECE to 0.019, but it has not yet been validated on large language models. The DeepSeek V4 technical report notes that some modules still use AdamW, a reminder that Muon's behavior under generalization warrants continued monitoring.
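The ECE figures quoted above are presumably the standard binned expected calibration error: samples are grouped by predicted confidence, and the metric is the sample-weighted average gap between each bin's mean confidence and its actual accuracy. The summary does not give the paper's evaluation protocol, so the sketch below is a generic NumPy implementation; the bin count and the equal-width binning scheme are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Binned ECE: weighted average |accuracy - confidence| across bins.

    confidences: (n,) top-class probabilities in [0, 1]
    predictions: (n,) predicted class indices
    labels:      (n,) true class indices
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Samples whose top-class confidence falls in this bin.
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = (predictions[mask] == labels[mask]).mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Example: probs is an (n, num_classes) array of softmax outputs on the test set.
# ece = expected_calibration_error(probs.max(axis=1), probs.argmax(axis=1), labels)
```

Under this definition, a near-zero ECE (as the summary attributes to Muon on the training set) means the model's confidence tracks its accuracy closely, while a test ECE of 0.065 means confidence overshoots accuracy by about 6.5 points on average.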
