LangSmith has launched over 30 evaluation templates, so quality checks for AI agents no longer need to be written from scratch.

MeNews

2026-05-21 02:41:58

ME News report: On April 17 (UTC+8), according to monitoring by Dongcha Beating, LangSmith, the observability tool from the AI agent development platform LangChain, released two updates: an evaluator template library and reusable evaluators.

Evaluating whether an AI agent is “useful” is currently one of the most time-consuming parts of development. An agent may call the correct tools but return its answer in the wrong format; it may handle single-turn conversations but fail in multi-turn ones; its final answer may look reasonable even though an intermediate step retrieved the wrong documents. Developers therefore need checkpoints at multiple levels (single steps, complete trajectories, multi-turn conversations, specific tool calls, and so on), and each evaluator has to go through writing prompts, calibrating against real data, and repeated tuning. Starting from scratch often takes weeks.
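To make the idea of checkpoints at different levels concrete, here is a minimal Python sketch of two such checks: one on the format of a single response and one on the order of tool calls in a trajectory. The function names, the JSON output convention, and the tool names are assumptions made for illustration; they are not taken from LangSmith or openevals.

```python
import json


def check_response_format(output: str, required_keys: set[str]) -> bool:
    """Single-step check: is the output a JSON object with all required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def check_trajectory(tool_calls: list[str], expected_order: list[str]) -> bool:
    """Trajectory check: do the expected tools appear in this order?

    Other tool calls may occur in between. The shared iterator is consumed
    as we search, so each expected step must come after the previous one.
    """
    remaining = iter(tool_calls)
    return all(step in remaining for step in expected_order)


# Hypothetical run: the agent called the right tools in the right order,
# but the final answer is missing the "sources" field.
tools_ok = check_trajectory(["search_docs", "rerank", "answer"], ["search_docs", "answer"])
format_ok = check_response_format('{"answer": "42"}', {"answer", "sources"})
```

In this hypothetical run the trajectory check passes while the format check fails, which is the "correct tools, wrong format" case described above.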

LangSmith now provides more than 30 ready-made templates in five categories: safety (prompt injection detection, personal information leakage checks, bias and toxicity), answer quality (correctness, usefulness, tone), execution trajectory (whether the agent followed the correct steps), user behavior analysis (language distribution, satisfaction signals), and multimodal (review of voice and image outputs). The templates include tuned LLM-as-judge prompts and rule-based code evaluators; they can be used as-is or customized, and they work for both online monitoring and offline experiments.
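As a rough illustration of what a rule-based code evaluator for personal information leakage might look like, the sketch below flags email addresses and phone-like numbers with regular expressions. The patterns are deliberately simple, and the function name and result shape are hypothetical; this is not the template that ships with LangSmith or openevals.

```python
import re

# Deliberately simple patterns; a production check would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pii_leak_evaluator(output: str) -> dict:
    """Return score 1 if no email or phone-like string is found, else 0."""
    findings = EMAIL_RE.findall(output) + PHONE_RE.findall(output)
    return {
        "key": "pii_leak",  # hypothetical metric name
        "score": 0 if findings else 1,
        "findings": findings,
    }


leaky = pii_leak_evaluator("Contact me at jane.doe@example.com")
clean = pii_leak_evaluator("The capital of France is Paris.")
```

A rule-based check like this is cheap enough to run on every response, which is one reason such evaluators suit online monitoring as well as offline experiments.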

Reusable evaluators address management at the organization level: the newly added Evaluators tab lists all evaluators in the workspace in one place, lets users attach them to new projects with one click, and applies a prompt update everywhere at once, so there is no need to maintain duplicate copies in each project. The templates are open-sourced alongside openevals v0.2.0, which adds support for multimodal evaluation.
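The benefit of a single shared definition can be sketched in a few lines of Python. The `Workspace` and `Evaluator` classes below are hypothetical and say nothing about how LangSmith implements the feature; they only show why storing one definition and attaching it by name avoids duplicate copies.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluator:
    name: str
    prompt: str


@dataclass
class Workspace:
    # One definition per evaluator; projects only hold evaluator names.
    evaluators: dict[str, Evaluator] = field(default_factory=dict)
    attachments: dict[str, list[str]] = field(default_factory=dict)

    def register(self, evaluator: Evaluator) -> None:
        self.evaluators[evaluator.name] = evaluator

    def attach(self, project: str, name: str) -> None:
        if name not in self.evaluators:
            raise KeyError(f"unknown evaluator: {name}")
        self.attachments.setdefault(project, []).append(name)

    def update_prompt(self, name: str, prompt: str) -> None:
        self.evaluators[name].prompt = prompt

    def prompts_for(self, project: str) -> list[str]:
        return [self.evaluators[n].prompt for n in self.attachments.get(project, [])]


ws = Workspace()
ws.register(Evaluator("tone", "Rate the tone of the answer."))
ws.attach("support-bot", "tone")
ws.attach("sales-bot", "tone")
ws.update_prompt("tone", "Rate the tone of the answer from 1 to 5.")
```

After the single `update_prompt` call, both hypothetical projects see the new prompt, because neither holds its own copy.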

(Source: BlockBeats)


Comment


DegenWithNotebook

· 11h ago

Evaluator template library plus reusable evaluators: the combination is all about improving development efficiency.

OutsiderOfZhiyuandao

· 12h ago

Beating's monitoring is quite fast, and the LangChain ecosystem is becoming increasingly active.

StargazerInTheWoods

· 12h ago

The reusable evaluator design is a good idea; it avoids reinventing the wheel.

QuietValidator

· 12h ago

Weeks of work from scratch versus ready-made templates: that comparison stings a bit.

AirdropDreamsInAGlassBottle

· 12h ago

Multi-turn conversations breaking down: that is so true to life. Finally, someone is seriously tackling it.

Don’tRushToDoubleItYet.

· 12h ago

Can more than 30 templates save a few weeks? I'll wait and see the actual results.


MirrorBallPeeking

· 12h ago

LangSmith's latest update really does hit the pain points; evaluating AI agents has been too frustrating.
