Who is the best at using Claude Code? The answer might not be programmers.

> Original Title: Agentic Coding and Persistent Returns to Expertise
> Author: Anthropic
> Translation: Peggy

Editor's Note: This report is based on approximately 400k Claude Code sessions, discussing how AI programming tools are changing the relationship between humans and code.

The core finding of the article is that in agentic programming, humans mainly decide "what to do," while Claude is responsible for "how to do it." Users make most of the planning decisions, while Claude handles most of the execution work. In other words, AI is taking over implementation steps such as writing code, modifying files, running commands, and debugging, but goal setting and outcome judgment still depend on humans.

More importantly, the effectiveness of using Claude Code does not solely depend on whether the user is a programmer. The report shows that in tasks involving code generation, users from non-technical professions such as law, finance, management, and scientific research have success rates nearly on par with software engineers. The real factor influencing results is whether the user understands the problem they want to solve.

This means that AI programming lowers the barrier to implementation, not to judgment. In the future, those who understand the business and the scenario, and who can clearly articulate requirements and evaluate results, may be better at leveraging AI than those who only know how to write code. AI will not automatically replace domain knowledge; instead, it will amplify the value of domain expertise.

Below is the original text:

Key Findings

Building on existing research, we propose a framework for studying interactive agent programming. This framework is based on privacy-protected analysis of about 400k Claude Code sessions from October 2025 to April 2026, evaluating task composition, human-AI collaboration modes, and success rates.

In a typical session, humans are responsible for most planning decisions (deciding "what to do"), while Claude handles most execution decisions (deciding "how to do it"). The stronger a user's domain expertise, the more work Claude performs per instruction. In coding tasks, success rates across major occupational groups are nearly on par with those of software engineers, where success means the session achieved the user's original intent as verified by signals such as passing tests or code commits.

The more domain expertise a user has, the more likely the session ends successfully. However, the gap between intermediate and expert users is not large. Over the seven months we observed, sessions used for debugging decreased by nearly half, shifting toward more end-to-end agent usage: deploying and running code, data analysis, and writing non-code documentation.

During these seven months, the value of typical tasks increased across nearly all job types. We estimate task value by comparing with freelance job postings, showing an average increase of about 25%.

Introduction

Agentic programming is rapidly emerging. Since late 2025, the proportion of coding activities involving agents in GitHub projects has more than doubled, and Claude Code users now average 20 hours per week on the tool. Can people without formal programming experience successfully direct an agent to perform complex technical work? How will the rapid adoption and capability improvements of these tools impact broader knowledge work? We cannot yet provide a complete answer, but early signals can be seen in Claude Code usage data.

This report is based on privacy-protected analysis of about 235k users and 400k interactive sessions from October 2025 to April 2026, providing evidence of actual usage patterns. It continues our previous research on autonomy metrics in Claude Code sessions and how these tools are transforming work within Anthropic. We propose a framework to describe the usage of interactive AI programming assistants: what people are doing, who is doing it, and whether the work is successful. Our focus is on how users employ Claude Code via command-line interface (CLI), Claude.ai, or desktop applications. By tracking how agentic programming usage evolves with model capabilities, we aim to better understand the impact on professional programmers and the knowledge workforce.

What happens on Claude Code may foreshadow the future of knowledge work: agents will gradually embed into non-coding tasks. We find that Claude is handling more complex and valuable tasks. Meanwhile, a clear division of labor persists: humans decide what to build, agents decide how to build it.

We also see evidence that domain expertise, rather than coding proficiency, truly amplifies tool effectiveness. In particular, domain experts are more likely to succeed and to recover from errors or misunderstandings. However, the gap between expert and intermediate users is small. This suggests that as long as someone is sufficiently skilled in a domain, they can use these tools almost as effectively as deep specialists.

These findings offer a preliminary view of potential shifts in the labor market. Our data show that success depends on whether the user understands the problem they want to solve, not on whether they have programming training. If this pattern holds across the economy, it implies that while agentic programming may absorb some implementation-focused tasks, it also rewards those who truly understand the problems they are addressing. Coding agents do not replace domain knowledge; rather, the more understanding a worker brings to the task, the more high-quality work the agent can produce.

Division of Labor

What People Do with Claude Code

To understand how people use Claude Code, we categorize each session into one of nine work modes, each best describing the session’s primary goal. Four modes directly involve coding or maintenance: building new things, fixing broken ones, testing code, and orchestrating other agents or automation pipelines. Another category involves operating software, including deployment, configuration, running pipelines, and monitoring systems. Two modes focus more on understanding "what to do": understanding how an existing system works and planning changes before acting. The last two are unrelated to code or involve code only as an auxiliary: data analysis and communication via presentations or other text-based documents.

About 56% of sessions involve coding (25%), fixing code (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploration 14%, and data analysis or writing documents 13% (see Figure 1).

> Figure 1: Nine work modes. Each interactive session is categorized into the single mode that best describes its goal.

We first have the model read session logs and classify each session accordingly; then, using our privacy-preserving analysis tools, we cross-validate these classifications with telemetry data automatically recorded for each session, including code additions or deletions. The two sources show high consistency—for example, over 90% of sessions labeled as creating or modifying code also show code changes in telemetry data. Details are in the appendix.

Who Makes the Decisions

How autonomous is Claude Code? Capability assessments show its upper limit is already high and still rising. For example, in benchmarks like METR's time-horizon evaluation, state-of-the-art models can now autonomously complete software tasks that previously took hours, overcoming obstacles along the way. But how does this look in real use? Here, we focus on how much guidance humans and Claude each provide during actual sessions.

We examine this from two angles. First, how much decision-making is delegated to Claude; second, how many actions are assigned to Claude. To understand decision division, we built a privacy-preserving classifier that attributes each decision in a session to either the human or Claude, based on session content. The classifier lists all meaningful decisions, dividing them into planning decisions (what to do, which approach, what counts as completion) and execution decisions (which files to modify, what code to write, in which language, what commands to run). It then attributes each decision accordingly and generates two percentages per session: the share of planning decisions made by the user, and the share of execution decisions made by the user.
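The two per-session percentages can be illustrated with a minimal Python sketch. The `Decision` record, its field names, and the sample session are hypothetical and stand in for the classifier's output, whose actual format the report does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    kind: str   # "planning" or "execution" (hypothetical labels)
    maker: str  # "user" or "claude" (hypothetical labels)


def user_shares(decisions):
    """Return the user's share of planning and of execution decisions.

    A share is None when the session contains no decision of that kind.
    """
    shares = {}
    for kind in ("planning", "execution"):
        makers = [d.maker for d in decisions if d.kind == kind]
        shares[kind] = makers.count("user") / len(makers) if makers else None
    return shares


# Hypothetical session: three planning and five execution decisions.
session = [
    Decision("planning", "user"),
    Decision("planning", "user"),
    Decision("planning", "claude"),
    Decision("execution", "user"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
]

shares = user_shares(session)
print(shares)
```

In this sample session the user made two of three planning decisions and one of five execution decisions.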

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In practice, agentic programming results in a clear division of labor: humans decide what to build, agents decide how to build it.

To understand how much delegation of actions occurs within a session, we look at session structure rather than content. Claude sessions consist of exchanges: the user prompts, Claude acts; then the user prompts again, and so forth. In typical sessions, this cycle lasts about four rounds. In our data from October to April, each user prompt triggers about 10 actions from Claude on average, sometimes over 100. Each round, Claude reads files, edits code, runs commands, and outputs roughly 2,400 words.
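As a rough illustration of this structural measure, the sketch below counts Claude's actions following each user prompt. The flat event list and the labels `"prompt"` and `"action"` are assumptions made for the example, not the report's telemetry format.

```python
from statistics import mean


def actions_per_prompt(events):
    """Count the actions that follow each user prompt, one count per round."""
    counts = []
    for event in events:
        if event == "prompt":
            counts.append(0)      # each prompt opens a new round
        elif event == "action" and counts:
            counts[-1] += 1       # attribute the action to the latest prompt
    return counts


# Hypothetical four-round session.
events = (
    ["prompt"] + ["action"] * 12
    + ["prompt"] + ["action"] * 9
    + ["prompt"] + ["action"] * 15
    + ["prompt"] + ["action"] * 4
)

rounds = actions_per_prompt(events)
print(len(rounds), mean(rounds))
```

The sample was chosen to match the typical figures in the text: four rounds, averaging 10 actions per prompt.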

How much work Claude completes between user checks largely depends on who makes decisions. When users retain control over execution—making over 80% of execution decisions—Claude performs fewer actions per round, around 8. Conversely, when Claude controls planning—making over 80% of planning decisions—it takes on the most actions, about 16 per round.

> Figure 2: Claude’s share in planning and execution decisions. This chart shows the distribution of attribution to Claude versus the user across different sessions, for planning ("what to do") and execution ("how to do it"). In typical sessions, users make about 70% of planning decisions, while Claude makes about 80% of execution decisions.

Professional Level

Based on each session, Claude assesses the user’s apparent professional level on a five-point scale—from novice to expert—focused on the specific task. The classifier considers three signals: how precise the instructions are, what the user asks Claude to verify, and whether the user more often corrects Claude or vice versa. It’s important to note that this professional level is entirely different from job titles or general ability; it is task-specific. For example, a senior engineer asking about Rust for the first time may still be a beginner on that task. Conversely, an accountant who has never used Python but can accurately tell Claude the reconciliation rules for a Python script and catch edge cases during month-end closing is an expert for that task.

The table below shows how we define each professional level in the classifier, with examples from the publicly available SWE-chat dataset of agentic programming conversations. Conversations labeled as "novice" contain generic instructions without domain-specific knowledge; those labeled as "expert" demonstrate deep understanding of codebases and technical environments.

> Table 1: Professional Level Classifier. Examples are paraphrased, anonymized, and compressed versions of real sessions from the SWE-chat dataset, labeled by our classifier. Many examples come from publicly available agentic programming conversation datasets.

We quantify the relationship between professional level and the activity and output that each prompt triggers from Claude. In typical novice sessions, each prompt triggers about five actions and roughly 600 words of output; in expert sessions, the action chain is more than twice as long, about 12 actions, and output reaches around 3,200 words, more than five times as much (see Figure 3). The gap between novice and expert appears across all work types and task value ranges.

These metrics supplement our previous research on Claude Code autonomy. Earlier work tracked session duration and how frequently users automatically approve actions. In contrast, our decision attribution measures who makes substantive decisions during the session, while the output and number of actions per prompt gauge how autonomous Claude is in response to human instructions.

> Figure 3: With more professional users, Claude performs more work per prompt. Higher professional levels correlate with more actions (left) and more text output (right) per prompt. The boxes show interquartile ranges, with medians marked; whiskers extend to the 5th and 95th percentiles. White dots are geometric means. Both upward trends are statistically significant (p < 0.001), with stepwise differences between adjacent levels also significant. Controlling for work mode, task value, month, profession, and model series, and clustering by user, these trends remain significant: each level increase in professional expertise increases actions by 9%, output by 13%.
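To put the per-level estimates in perspective, the sketch below compounds them over the four steps from novice to expert. It assumes the effect is multiplicative and equal at every step, which the caption does not state; the function name is illustrative.

```python
def cumulative_increase(per_level_pct, steps=4):
    """Compound a per-level percentage increase over several level steps.

    Assumption: the per-level effect is multiplicative and constant.
    """
    factor = (1 + per_level_pct / 100) ** steps
    return (factor - 1) * 100


# Per-level estimates quoted in the Figure 3 caption.
actions_gain = cumulative_increase(9)   # actions per prompt
output_gain = cumulative_increase(13)   # text output per prompt
print(f"actions: +{actions_gain:.0f}%, output: +{output_gain:.0f}%")
```

Under that assumption, the adjusted novice-to-expert difference is roughly 41% more actions and 63% more output. This is smaller than the raw gaps quoted for typical sessions, which are not adjusted for the listed controls.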

Who Uses Claude Code and What They Do

Users

To understand who is doing this work, we infer each user’s profession from session logs and map it to one of 23 major categories in the U.S. Bureau of Labor Statistics’ Standard Occupational Classification (SOC) system. The classifier considers only signals such as the project context loaded at session start, file names and structure, references to materials or outputs like legal documents, clinical data, financial reports, course materials, etc., and vocabulary used. It is explicitly instructed not to interpret "writing code" itself as evidence of a programming profession. Only when clear signals indicate a software or data-related occupation does the session get classified into the "Computer and Mathematical" SOC category. For example, if a lawyer builds a script to automatically check for missing clauses in contracts, it remains classified as a legal profession session, even if the main activity is coding. If no signals about the user’s profession are present, the session remains unclassified.

We can infer the user’s profession in about 70% of sessions. Among these, "Computer and Mathematical" is the largest group, unsurprisingly, as it covers most software-related work. Next are business and financial operations, arts and media, management, and life, physical, and social sciences. The fastest-growing non-software groups in our sample are management, sales, and legal professions.

Work

From October 2025 to April 2026, the composition of work done with Claude Code changed significantly. The most notable shift is that sessions involving fixing broken code dropped from 33% to 19% (see Figure 4). They were replaced by other kinds of work: operating software increased from 14% to 21%, and writing and data analysis roughly doubled, from about 10% to 20%.

The value of tasks also increased. We estimate the economic value of each session by approximating the cost of similar work in the freelance market, calibrated with real job posting datasets. Using this metric, the average session value rose by 27% from October to April. This increase appeared across many work types: build, operate, and fix tasks grew approximately 43%, 34%, and 32%, respectively. These price estimates are rough, mainly used to observe trends over time rather than as direct dollar valuations. Details of the task value estimation method are in the appendix.

> Figure 4: Changes in Claude Code work composition and value from October 2025 to April 2026. The chart shows the proportion of different work modes over the seven-month period. Fix-broken-code sessions declined from 33% to 19%, while operating software, data analysis, and documentation writing increased.

Success Depends on What the User Brings

Estimating task value is one way to understand how Claude Code helps people get work done. Another is to observe how many sessions succeed and what features are associated with success. Across all success metrics, a clear pattern emerges: higher professional level correlates with higher success likelihood. Most gains are concentrated at the lower end—from novice to intermediate—indicating that the jump from beginner to mid-level is larger than from mid-level to expert.

Before analyzing successful sessions, we need to define success precisely. We cannot observe real-world outcomes or directly ask users if they achieved their goals with Claude. Instead, we rely on two complementary measures based on session records. The first is "determined success," where classifiers read the full session and judge whether the user completed their original goal; the possible labels are success, partial success, failure, or no clear goal. The second is "verified success," which comes from two auxiliary classifiers that evaluate the strength of this judgment. The success classifier searches for verifiable evidence such as git commits, pull requests, passing test suites, or explicit user approval. It scores sessions from "no signal" (1) to "weak signal" (3) to "multiple strong signals" (5). A parallel failure classifier assesses evidence of errors, failed tests, repeated attempts, or user objections. Verified success requires both a positive success judgment and at least one strong, verifiable success signal. Our analysis focuses on success and failure outcomes and excludes sessions classified as "no clear goal," which account for about 7.7% of the full sample.
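The verified-success rule can be sketched in a few lines of Python. The record type, field names, and outcome labels are hypothetical, and the cutoff for a "strong" signal (a score of 4 or higher) is an assumption, since the report describes only points 1, 3, and 5 of the scale.

```python
from dataclasses import dataclass

# Assumption: a score of 4 or 5 counts as a "strong" verifiable signal.
STRONG_SIGNAL_THRESHOLD = 4


@dataclass(frozen=True)
class SessionJudgment:
    outcome: str         # "success", "partial", "failure", or "no_clear_goal"
    success_signal: int  # 1 (no signal) to 5 (multiple strong signals)


def is_verified_success(judgment):
    """Positive success judgment plus at least one strong success signal."""
    return (
        judgment.outcome == "success"
        and judgment.success_signal >= STRONG_SIGNAL_THRESHOLD
    )


def verified_success_rate(judgments):
    """Share of verified successes, excluding "no clear goal" sessions."""
    scored = [j for j in judgments if j.outcome != "no_clear_goal"]
    if not scored:
        return None
    return sum(is_verified_success(j) for j in scored) / len(scored)


# Hypothetical sample of five sessions.
sample = [
    SessionJudgment("success", 5),        # verified: judged success, strong signal
    SessionJudgment("success", 2),        # judged success, but evidence too weak
    SessionJudgment("partial", 5),        # strong signal, but not a full success
    SessionJudgment("failure", 1),
    SessionJudgment("no_clear_goal", 1),  # dropped from the denominator
]
rate = verified_success_rate(sample)
print(rate)
```

Only the first session in the sample meets both conditions, so one of the four scored sessions counts as a verified success.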

Returns on Professional Level

Which sessions are most likely to succeed? Results show that the professional level score described above has a significant impact on success.

Some might worry that professional level isn’t the real driver—perhaps experts simply choose different tasks or differ in other ways. To address this, we compare sessions within the same work type, similar estimated value, same month, same topic, and from the same broad occupational group, to see how user expertise influences outcomes.

> Table 2: Definitions of success and failure derived from classifiers. Examples are paraphrased, anonymized, and summarized real sessions from the SWE-chat dataset, labeled by our classifier.

Across all success metrics, higher professional level correlates with higher success rates. Sessions rated as "novice" have a 15% verified success rate and a 77% partial success rate. Those rated as "intermediate" or above have verified success rates of 28–33% and partial success rates of 91–92% (see Figure 5).

Most of the gains occur from novice to intermediate; the increase from intermediate to expert is more gradual. Details of the regression analysis behind Figure 5 are in the appendix.

> Figure 5: Relationship between professional level and session outcome. The chart shows success proportions across five levels from novice to expert, based on user-rated expertise. The left panel includes all sessions; the middle and right panels focus on sessions with issues (failure signals > 3), showing the proportion reaching different success/failure definitions. Each point is an adjusted ratio. We compare only sessions with the same work mode, task value, month, topic, and user type (software-related or not). Regression details are in the appendix. Error bars show 95% confidence intervals, most too small to see. Sessions classified as "no clear goal" are excluded.

Even in challenging sessions, a similar gradient appears. When failure signals record verifiable issues (errors, failed tests, repeated attempts, or user frustration), the verified success rate rises from 4% for novices to 15% for experts (see Figure 5). Using looser success criteria, partial success is achieved in 60% of novice sessions and 80–81% of intermediate and expert sessions.

We also track the inverse relationship: how professional level relates to various failure indicators. Note that sessions classified as failures are those with no partial success. If a problematic session is classified as a failure and no code lines are written, we call it abandoned. Among sessions where users are perceived as novices, 19% are ultimately abandoned; for other groups, the figure is 5–7%. In other words, less experienced users are more likely to give up when facing difficulties. Part of the value of expertise appears to be in guiding the agent back on track.
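The abandonment definition above combines three conditions, which the sketch below writes out. The "problematic" cutoff (failure signal above 3) is taken from the Figure 5 caption; the record type, field names, and the 1-to-5 range of the failure score are assumptions for illustration.

```python
from dataclasses import dataclass

# From the Figure 5 caption: sessions with issues have failure signals > 3.
PROBLEM_THRESHOLD = 3


@dataclass(frozen=True)
class SessionRecord:
    outcome: str         # "success", "partial", or "failure" (hypothetical labels)
    failure_signal: int  # assumed to run 1 to 5, like the success score
    lines_written: int   # code lines added or modified (hypothetical field)


def is_abandoned(session):
    """Problematic session, classified as a failure, with no code written."""
    problematic = session.failure_signal > PROBLEM_THRESHOLD
    return (
        problematic
        and session.outcome == "failure"
        and session.lines_written == 0
    )


print(is_abandoned(SessionRecord("failure", 4, 0)))   # all three conditions hold
print(is_abandoned(SessionRecord("failure", 4, 25)))  # code was written
```

A failed session that still produced code is counted as a failure but not as abandoned, which is what separates the two measures.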

Occupation May Be Less Important Than Expertise

Software-related users have about a 30% verified success rate across all sessions, compared to 26% for others. In sessions involving code creation (adding or modifying at least one line), these numbers are 34% and 29%, respectively (see Figure 6). Using looser success definitions, the gap narrows further. In code-generation sessions, the proportion reaching partial success is 89% for software-related users and 88% for others. The five-percentage-point difference in verified success is small and has remained stable over the seven months. Among the top ten occupational groups in our dataset, each differs from software engineers by no more than seven percentage points in success rate. Management professions have the highest verified success rate, slightly above software roles. This may reflect that management skills transfer to directing agents, but it could also partly result from our measurement approach: verification depends on explicit user confirmation, which managers may be more accustomed to providing when they see desired results.

> Figure 6: Success and verified success rates in coding sessions, inferred by occupation. The chart shows the proportion of sessions with at least one code addition/modification, classified as success or verified success, across the ten largest occupational groups. All groups are within seven percentage points of the "Computer and Mathematical" category (SOC). Error bars indicate 95% confidence intervals based on different accounts.

Looking Ahead

The results in this report sketch an emerging picture: agentic programming amplifies certain knowledge and skills while replacing others. In code-producing sessions, success rates across major professions are close to those of software-related roles. Coding agents appear to be making a programming background less relevant to success.

At the same time, successful sessions are more likely to involve domain expertise. Expert-level sessions have verified success rates more than twice those of novices. When sessions encounter problems, novices are far more likely to give up. The collaboration mode itself clarifies this: domain experts can guide Claude to do more with each instruction. Therefore, the ability to steer Claude toward success depends more on mastery of the domain than on coding ability. Anyone with operational knowledge in a field can now accomplish tasks previously out of reach. Conversely, those lacking such understanding will benefit less, even with the same tools. Most gains come from competence, not mastery. Having actionable domain understanding already yields most benefits; deep specialization adds only marginal advantages.

These findings are still preliminary. As in most research of this kind, we cannot measure real-world outcomes: whether code written in a session is ultimately used or discarded, or whether it produces economically valuable results. Additionally, this report excludes non-interactive use, which accounts for a significant portion of activity. Developing a framework to measure such use is a key future goal. All session classifications rely on the model's reading of session logs. In the appendix, we show that classifier outputs align with independent telemetry data and with strong reference models in most cases. However, large-scale validation remains challenging; Claude Code sessions are often lengthy and complex, making it difficult to use manual annotation as ground truth.

As models, users, and their division of labor evolve, the landscape depicted here will continue to change. We hope these metrics help track major shifts: if future returns to expertise decline, it may indicate models are providing key judgments that users previously supplied. If success rates outside software professions continue rising, it could mean software creation is becoming a routine part of many jobs, not confined to a single profession. These shifts will influence who benefits from agentic programming and how much, impacting the most valued skills in the labor market.
