Who is the best at using Claude Code? The answer might not be programmers.

> Original Title: Agentic Coding and Persistent Returns to Expertise
> Original Author: Anthropic
> Translation: Peggy
>

Editor's Note: This report is based on approximately 400k Claude Code sessions, discussing how AI programming tools are changing the relationship between humans and code.

The core finding of the article is: In agentic programming, humans mainly decide "what to do," while Claude is responsible for "how to do it." Users handle most planning decisions, and Claude handles most execution tasks. In other words, AI is taking over steps like writing code, modifying files, running commands, and debugging, but goal setting and outcome judgment still depend on humans.

More importantly, the effectiveness of using Claude Code does not depend solely on whether the user is a programmer. The report shows that in tasks involving code generation, users from non-technical professions such as law, finance, management, and scientific research have success rates nearly equal to software engineers. What truly influences the results is whether the user understands the problem they want to solve.

This means that AI programming lowers the barrier to implementation, not to judgment. In the future, people who understand the business and the scenario, and who can clearly articulate requirements and evaluate results, may be better at using AI than those who only know how to write code. AI will not automatically replace domain knowledge; instead, it will amplify the value of domain expertise.

Below is the original text:

Key Findings

Building on existing research, we propose a framework for studying interactive agentic programming. The framework draws on a privacy-preserving analysis of about 400k Claude Code sessions from October 2025 to April 2026 and evaluates task composition, human-AI collaboration patterns, and success rates.

In a typical session, humans are responsible for most planning decisions—deciding "what to do"; Claude handles most execution decisions—deciding "how to do it." The more domain expertise a user has, the more work each instruction triggers Claude to perform. In coding tasks, the average success rate across major professional groups—meaning whether they completed what they originally intended, with verifiable evidence like testing or submitting code—is nearly on par with software engineers.

The stronger the user's domain expertise, the more likely the session ends successfully. However, the gap between intermediate and expert users is not large. Over the seven months we observed, sessions used for debugging decreased by nearly half, shifting toward more end-to-end agent use: deploying and running code, data analysis, and writing non-code documents.

During these seven months, the value of typical tasks increased across nearly all job types. We estimated task value by comparing it to freelance market rates, calibrated with real job posting data, showing an average increase of about 25%.

Introduction

Agentic programming is rapidly emerging. Since late 2025, the proportion of coding agent activities in GitHub projects has more than doubled, and Claude Code users now average 20 hours per week using the tool. Can people without formal programming experience successfully direct an agent to perform complex technical work? How will the rapid adoption and capability improvements of these tools affect broader knowledge work? We cannot yet give a complete answer, but early signals can be seen in Claude Code usage data.

This report is based on privacy-protected analysis of about 235k users and 400k interactive sessions from October 2025 to April 2026, providing evidence of actual usage patterns. It continues our previous research on autonomy metrics in Claude Code sessions and how these tools are changing work within Anthropic. We propose a framework to describe how interactive AI assistants are used: what work people do, who does it, and whether it succeeds. Our focus is on how users employ Claude Code via command-line interfaces (CLI), Claude.ai, or desktop applications. By tracking how agentic programming usage evolves with model capabilities, we aim to better understand the impact on professional programmers and the knowledge workforce.

What happens on Claude Code may foreshadow the future of knowledge work: agents will gradually embed into non-coding tasks. We find that Claude is handling more complex and valuable tasks. Meanwhile, clear division of labor persists: humans decide what to build, agents decide how to build it.

We also see evidence that domain expertise, rather than programming skill, truly amplifies tool effectiveness. Experts are more likely to succeed and recover from errors or misunderstandings. However, the gap between expert and intermediate users is not large. This suggests that as long as someone is sufficiently skilled in a domain, they can use these tools almost as effectively as deep specialists.

These findings allow us to preliminarily observe potential shifts in the labor market. Our data show success depends on whether the user understands the problem, not whether they have programming training. If this pattern holds across the economy, it implies that while agentic programming may absorb some implementation-focused tasks, it also rewards those who truly understand the problems they are solving. Coding agents do not replace domain knowledge; rather, the more understanding a worker brings to the agent, the more high-quality work it can produce.

Division of Labor

What People Do with Claude Code

To understand how people use Claude Code, we categorize each session into one of nine work modes, each best describing the session’s primary goal. Four modes directly involve coding or maintenance: building new things, fixing broken ones, testing code, and orchestrating other agents or automation pipelines. Another category involves operating software, including deployment, configuration, running pipelines, and monitoring systems. Two modes focus more on understanding "what to do": understanding how an existing system works and planning changes before acting. The last two are unrelated to code or only auxiliary: data analysis and communication via presentations or text documents.

About 56% of sessions involve coding (25%), fixing code (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploration 14%, and data analysis or writing text 13% (see Figure 1).

> Figure 1: Nine work modes. Each interactive session is categorized into the single mode that best describes its goal.

We first let the model review session logs and classify each session accordingly; then, using our privacy-protected analysis tools, we cross-validate these classifications with telemetry data automatically recorded for each session, including code additions or deletions. The two sources show high consistency—for example, over 90% of sessions labeled as creating or modifying code also show code changes in telemetry data. Details are in the appendix.

Who Makes the Decisions

How autonomous is Claude Code? Capability assessments show its upper limit is already high and still rising. For example, in evaluations such as METR's, state-of-the-art models can now autonomously complete software tasks that previously took humans hours, overcoming obstacles along the way. But how does this look in real use? Here, we focus on how much direction humans and Claude each provide during actual sessions.

We examine this from two angles. First, how much decision-making do users delegate to Claude? Second, how many actions do they assign to Claude? To understand decision division, we built a privacy-preserving classifier that attributes each meaningful decision in a session to either planning or execution. Planning includes deciding "what to do," "how to do it," and "what counts as completion"; execution includes which files to modify, what code to write, in which language, and which commands to run. The classifier then attributes each decision to Claude or the user, producing two percentages per session: the share of planning decisions made by the user, and the share of execution decisions made by the user.
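
The two per-session percentages described above can be sketched as follows. This is a minimal illustration, not the report's classifier: the `Decision` record, its field values, and the `user_share` helper are hypothetical names, and the sample session is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    # Hypothetical record for one attributed decision in a session.
    kind: str   # "planning" or "execution"
    actor: str  # "user" or "claude"


def user_share(decisions, kind):
    """Fraction of decisions of the given kind made by the user.

    Returns None when the session has no decisions of that kind.
    """
    relevant = [d for d in decisions if d.kind == kind]
    if not relevant:
        return None
    return sum(d.actor == "user" for d in relevant) / len(relevant)


# Invented example session: 3 planning and 4 execution decisions.
session = [
    Decision("planning", "user"),
    Decision("planning", "user"),
    Decision("planning", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "user"),
]

planning_share = user_share(session, "planning")    # user made 2 of 3
execution_share = user_share(session, "execution")  # user made 1 of 4
```

Keeping the two shares separate, rather than pooling all decisions, is what lets a session show high user control over planning and low user control over execution at the same time.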

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In practice, agentic programming forms a clear division of labor: humans decide what to build, agents decide how.

To understand how much delegation of actions occurs within a session, we look at session structure rather than content. Claude sessions consist of exchanges: the user prompts, Claude acts, then the user prompts again, and so forth. In typical sessions, this cycle lasts about four rounds. In our data from October to April, each user prompt triggers about ten actions from Claude on average, sometimes over 100. Each round, Claude reads files, edits code, runs commands, and outputs roughly 2,400 words.
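
The structural measures in this paragraph (rounds per session and actions per prompt) can be computed from an ordered event log. The sketch below assumes a hypothetical log format in which each event is simply the string `"prompt"` or `"action"`; the report does not describe its telemetry schema.

```python
def actions_per_prompt(events):
    """Count the Claude actions that follow each user prompt.

    `events` is an ordered list of "prompt" / "action" strings
    (a hypothetical format). Actions before the first prompt are ignored.
    """
    counts = []
    for event in events:
        if event == "prompt":
            counts.append(0)   # a new round starts
        elif event == "action" and counts:
            counts[-1] += 1    # attribute the action to the latest prompt
    return counts


# Invented example: three rounds with 3, 1, and 2 actions.
events = [
    "prompt", "action", "action", "action",
    "prompt", "action",
    "prompt", "action", "action",
]

counts = actions_per_prompt(events)
rounds = len(counts)
mean_actions = sum(counts) / rounds
```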

How much work Claude completes between user checks largely depends on who makes decisions. When users retain control over execution—making over 80% of execution decisions—Claude performs fewer actions per round, around 8. When Claude controls planning—over 80% of planning decisions—it takes on the most actions, about 16.

> Figure 2: Claude’s share of planning and execution decisions. The chart shows the distribution of the proportion of decisions attributed to Claude versus the user across different sessions. In typical sessions, users make about 70% of planning decisions, while Claude makes about 80% of execution decisions.

Professional Level

Based on each session, Claude assesses the user’s apparent professional level on a five-point scale—from novice to expert—focused on the specific task. The classifier considers three signals: how precise the user’s instructions are, what the user asks Claude to verify, and whether the user more often corrects Claude or vice versa. It’s important to note that this professional level is entirely different from job titles or general ability; it is task-specific. For example, a senior engineer asking about Rust for the first time may still be a beginner in that task. Conversely, an accountant who has never used Python but can accurately tell Claude the reconciliation rules for a Python script and catch edge cases during month-end closing is an expert for that task.

The table below shows how we define each professional level in the classifier, with example prompts from the publicly available SWE-chat dataset. Conversations labeled as "novice" contain generic instructions without domain-specific knowledge; those labeled as "expert" demonstrate deep understanding of codebases and technical environments.

> Table 1: Professional Level Classifier. Examples are paraphrased, anonymized, and compressed from real sessions in the SWE-chat dataset, then labeled by our classifier. Many examples come from the public agent programming conversation dataset SWE-chat.

We quantified the relationship between professional level and the amount of activity and output Claude produces per prompt. In typical novice sessions, each prompt triggers about five actions and roughly 600 words of output; in expert sessions, each prompt triggers more than twice as many actions, about 12, and roughly five times the output, around 3,200 words (see Figure 3). The gap between novice and expert appears across all work types and task value ranges.

These metrics supplement our previous autonomy research on Claude Code. Earlier work tracked session duration and how frequently users automatically approve actions. In contrast, our decision attribution measures who makes substantive decisions during the session, while the output and number of actions per prompt gauge how autonomous Claude is in response to human instructions.

> Figure 3: With more professional users, Claude performs more work per prompt. Higher professional levels correlate with more actions (left) and more text output (right) per prompt. The boxes show interquartile ranges, with median lines; whiskers extend to the 5th and 95th percentiles. The white dots are geometric means. Both upward trends are statistically significant (p < 0.001), with stepwise differences between adjacent levels also significant. Controlling for work mode, task value, month, profession, and model series, and clustering by user, these trends remain significant: each level increase in professionalism increases actions by 9% and output by 13%.
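
As a rough reading of the per-level figures in the caption, suppose the 9% and 13% effects compound multiplicatively over the four steps from novice to expert. This is an assumption made for illustration; the report states only the per-level percentages. The implied adjusted novice-to-expert ratios would then be:

```latex
\[
  1.09^{4} \approx 1.41
  \qquad\text{and}\qquad
  1.13^{4} \approx 1.63
\]
```

If the assumption holds, that is roughly 41% more actions and 63% more output per prompt for an expert than for a novice once the listed controls are applied.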

Who Uses Claude Code and What They Do

Users

To understand who is doing this work, we infer each user’s profession from session logs and map it to one of 23 major categories in the U.S. Bureau of Labor Statistics’ Standard Occupational Classification (SOC) system. The classifier considers only signals such as the context loaded at session start, file names and structure, references to materials or outputs (such as legal documents, clinical data, financial reports, or course materials), and the vocabulary used. It is explicitly instructed not to interpret "writing code" itself as evidence of a programming occupation. A session is assigned to the coding-related category, "Computer and Mathematical Occupations," only when clear signals indicate software or data work. For example, if a lawyer builds a script to automatically check for missing clauses in contracts, the session is classified under the legal profession, even if the main activity is coding. If no signals about the user’s profession are present, the session remains unclassified.

We can infer the profession in about 70% of sessions. Among these, "Computer and Mathematical" is the largest group, unsurprisingly, as it covers most software-related work. Next are business and financial operations; arts and media; management; and life, physical, and social science. The fastest-growing non-software groups in our sample are management, sales, and legal professions.

Work

From October 2025 to April 2026, the composition of work done with Claude Code changed significantly. The most notable shift is that sessions involving fixing broken code dropped from 33% to 19% (see Figure 4). Other kinds of work took their place: operating software increased from 14% to 21%, and writing and data analysis roughly doubled, from about 10% to 20%.

The value of tasks also increased. We estimated the economic value of each session by comparing it to freelance market rates, calibrated with real job posting data. This metric shows an average increase of 27% from October to April. The increase appears across various work types: build, operate, and fix tasks grew by approximately 43%, 34%, and 32%, respectively. These price estimates are rough, mainly used to observe trends over time rather than as precise dollar values. Details on how the task value estimator is built are in the appendix.

> Figure 4: Changes in Claude Code work composition and value from October 2025 to April 2026. The chart shows the proportion of different work modes over the seven-month period. Fixing broken code dropped from 33% to 19%, while operating software, data analysis, and documentation increased.

Success Depends on What the User Brings

Estimating task value is one way to understand how Claude Code helps people get work done. Another is to observe how many sessions succeed and what features are associated with success. Across all success metrics, a clear pattern emerges: higher professional level correlates with higher success probability. Most of the gain comes at the lower end: the jump from novice to intermediate is larger than the jump from intermediate to expert.

Before analyzing successful sessions, we need to define success precisely. We cannot observe real-world outcomes or directly ask users whether they achieved their goals. Instead, we rely on two complementary measures based on session records. The first is "judged success": a classifier reads the full session and judges whether the user completed their original goal, choosing among success, partial success, failure, or no clear goal. Two auxiliary classifiers then evaluate the strength of the evidence behind this judgment, producing the second measure, "verified success." The success classifier looks for verifiable evidence, especially git activity such as commits and pull requests, passing tests, and explicit user approval. It scores sessions from "no signal" (1) to "weak" (3) to "multiple strong signals" (5). A parallel failure classifier scores evidence of errors, failed tests, repeated attempts, or user objections. Verified success requires both conditions: the session is judged successful, and at least one strong verifiable success signal exists. Our analysis focuses on success and failure outcomes and excludes sessions labeled "no clear goal," which account for about 7.7% of the full sample.
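
The two-condition rule for verified success can be sketched in a few lines. The field names, the outcome labels as strings, and in particular the threshold of 4 for "at least one strong signal" are assumptions for illustration; the text gives only the anchor points 1, 3, and 5 on the scale.

```python
# Assumption: a score of 4 or more on the 1-5 scale counts as
# "at least one strong verifiable success signal".
STRONG_SIGNAL_THRESHOLD = 4


def is_verified_success(outcome, success_signal):
    """Both conditions must hold: judged successful AND strong evidence."""
    return outcome == "success" and success_signal >= STRONG_SIGNAL_THRESHOLD


def verified_success_rate(sessions):
    """Share of verified successes, excluding "no clear goal" sessions."""
    scored = [s for s in sessions if s["outcome"] != "no clear goal"]
    if not scored:
        return None
    verified = sum(
        is_verified_success(s["outcome"], s["success_signal"]) for s in scored
    )
    return verified / len(scored)


# Invented example sessions.
sessions = [
    {"outcome": "success", "success_signal": 5},          # verified
    {"outcome": "success", "success_signal": 2},          # judged only
    {"outcome": "partial success", "success_signal": 4},  # not a full success
    {"outcome": "failure", "success_signal": 1},
    {"outcome": "no clear goal", "success_signal": 1},    # excluded
]

rate = verified_success_rate(sessions)  # 1 verified out of 4 scored
```

Requiring both conditions makes the measure conservative: a session the classifier judges successful still does not count unless the log contains independent evidence.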

Returns on Professional Level

Which sessions are most likely to succeed? Results show that the professional level score described above has a significant impact on success.

Some might worry that professional level isn’t the real driver. Perhaps experts simply choose different tasks or differ in other ways. In this section, we compare sessions of the same work type, same estimated value, same month, same topic, and from the same broad occupational group, to partially address this concern and see how different levels affect outcomes.

> Table 2: Definitions of success and failure derived from classifiers. Examples are paraphrased, anonymized, and summarized from real sessions in the SWE-chat dataset, then labeled by our classifiers.

Across all success metrics, higher professional level correlates with higher success rates. Sessions rated as "novice" have a verified success rate of 15% and a partial success rate of 77%. Those rated as "intermediate" or above have verified success rates of 28% to 33%, and partial success rates of 91% to 92% (see Figure 5).

Most gains occur from novice to intermediate; the slope flattens from intermediate to expert. Details of the regression analysis behind Figure 5 are in the appendix.

> Figure 5: Relationship between professional level and session outcome. The chart shows session success rates across five levels—from novice to expert—based on user’s professional score. The left panel includes all sessions; the middle and right panels focus on sessions with problems, i.e., those with failure signals greater than 3, showing the proportion reaching different success or failure definitions. Each point is an adjusted ratio. We compare only sessions with the same work mode, task value, month, topic, and user type (software-related or not). Regression details are in the appendix. Error bars show 95% confidence intervals, most too small to see. Sessions labeled as "no clear goal" are excluded.

Even in sessions with challenges, a similar gradient appears. When verified failure signals are recorded, we consider the session "encountered problems." This includes errors, failed tests, repeated attempts, or user expressions of frustration. Among such sessions, the verified success rate rises from 4% for novices to 15% for experts (see Figure 5). Using a looser success definition, at least partial success occurs in 60% of novice sessions and 80-81% of intermediate and expert sessions.

We also track a reverse relationship: between professional level and various failure indicators. Note that sessions judged as failures are those with no partial success. If a problematic session is judged as failed and no code lines are written, we call it abandoned. Among sessions where users appear as novices, 19% are ultimately abandoned; among other groups, the figure is 5-7%. In other words, less experienced users are more likely to give up when facing difficulties. Part of the value of expertise seems to lie in the ability to steer the agent back on track.
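
The "abandoned" label combines three conditions, which can be written out explicitly. In this sketch the function and parameter names are hypothetical, and "problematic" is taken to mean a failure-signal score greater than 3, following the Figure 5 caption.

```python
def is_abandoned(outcome, failure_signal, lines_written):
    """A session is abandoned if it hit problems, was judged a failure,
    and produced no code.

    Assumption: "encountered problems" means failure_signal > 3 on the
    1-5 scale, as in the Figure 5 caption.
    """
    encountered_problems = failure_signal > 3
    return encountered_problems and outcome == "failure" and lines_written == 0


# Invented examples.
gave_up = is_abandoned("failure", failure_signal=4, lines_written=0)
failed_with_code = is_abandoned("failure", failure_signal=4, lines_written=12)
recovered = is_abandoned("partial success", failure_signal=5, lines_written=0)
```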

Occupation May Be Less Important Than Professional Level

Users in software-related professions have a verified success rate of about 30% across all sessions, compared with 26% for others. In sessions that produce code (at least one line added or modified), these numbers are 34% and 29%, respectively (see Figure 6). Using a looser success criterion, the gap narrows further: 89% and 88% partial success in code-producing sessions. The five-percentage-point difference remains stable over seven months, with success rates rising for both groups. Among the top ten occupational groups in our dataset, none differs from software engineers by more than seven percentage points in success rate. Management professions have the highest verified success rate, slightly above software roles. This may reflect that management skills transfer to directing agents, or it could partly result from our measurement approach: verification depends on explicit user confirmation, which managers may be more accustomed to providing when they see the results they want.

> Figure 6: Success and verified success rates in coding sessions, inferred by occupation. The chart shows the proportion of sessions with at least one line added or modified, classified as success or verified success, across the ten largest occupational groups. All groups are within seven percentage points of the "Computer and Mathematical" category (SOC). Error bars indicate 95% confidence intervals based on different accounts.

Outlook

The results in this report sketch an emerging picture: agentic programming amplifies certain knowledge and skills while replacing others. In code-producing sessions, success rates across major professions are close to those of software-related roles. It appears that coding agents are making whether one has a programming background less relevant for success.

At the same time, successful sessions are more likely to involve domain expertise. Expert-level sessions have verified success rates more than twice those of novices. When sessions encounter problems, novices are much more likely to give up. The collaboration pattern itself clarifies this: domain experts can guide Claude to do more with each instruction. Therefore, the ability to steer Claude toward success depends more on domain mastery than on coding ability. Anyone with operational understanding of a domain can now accomplish technical tasks previously out of reach. Conversely, those lacking this understanding, even with the same tools, will benefit less. Most gains come from competence, not mastery. Having actionable domain knowledge already yields most benefits; deep specialization adds only marginal advantages.

These findings are still preliminary. Like most research, we cannot measure real-world outcomes—whether code written in a session is ultimately used or discarded, or whether it produces economic value. Additionally, this report excludes non-interactive use, which accounts for a significant portion of activity. Developing a framework to measure such use is a future priority. All session classifications rely on the model’s reading of session logs. In the appendix, we show that the classifier aligns with independent telemetry data and with strong reference models in most cases. However, large-scale validation remains challenging; Claude Code sessions are often lengthy and complex, making manual annotation difficult as a ground truth.

As models, users, and their division of labor evolve, the landscape depicted here will continue to change. We hope these metrics help track major shifts: for example, if future returns to professional level decline, it may indicate the model is providing key judgments that users previously made themselves, expanding benefits from domain experts to broader populations. If success rates for non-software users continue to rise, it could mean software production is becoming a routine part of general work, not confined to a single profession. These shifts will influence who benefits from agentic programming and how much, impacting the most valued skills in the labor market.

