I came across an interesting development: last week, on April 21, Moondream launched a new service called "Lens," specialized in improving the accuracy of visual language models (VLMs).
VLMs have performed excellently in laboratory settings, but their accuracy often drops sharply in real-world scenarios. Lens is a fine-tuning service designed to close that gap, supporting both reinforcement learning and supervised fine-tuning. It operates as a pay-as-you-go API, so you pay only for what you use.

What’s remarkable is that it achieves significant improvements with a small amount of data. For example, when used to analyze live NBA broadcast footage, the F1 score jumped from 28% to 79%, and false detections were greatly reduced.
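For context on that metric: F1 is the harmonic mean of precision and recall, so a jump from 28% to 79% implies large gains on both sides. A minimal sketch of how F1 is computed from detection counts (the numbers below are made-up for illustration, not Moondream's data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of actual objects that were detected
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 79 true positives, 21 false positives, 21 missed objects
print(round(f1_score(tp=79, fp=21, fn=21), 2))  # → 0.79
```

Because F1 penalizes whichever of precision or recall is lower, reducing false detections (fp) raises it just as directly as catching more true objects.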

It is also said to outperform existing models on tasks like identifying countries from street-view images and analyzing medical images. It feels like a real step toward making visual language models practical.

Moondream’s early partner, PTZOptics, plans to incorporate Lens to improve the accuracy of its target tracking and anomaly detection. Moondream previously released the Photon inference engine; Lens complements it, with Photon addressing speed and Lens addressing accuracy in VLM deployment.

Steady improvements like these, which tackle real-world application challenges, will likely drive the widespread adoption of VLMs.