Who is the best at using Claude Code? The answer might not be programmers.

Title: Agentic Coding and Persistent Returns to Expertise
Author: Anthropic
Translation: Peggy

Original Author: Rhythm BlockBeats

Original Source:

Reprint: Mars Finance

Editor's Note: This report is based on approximately 400k Claude Code sessions, discussing how AI programming tools are changing the relationship between humans and code.

The core finding of the article is: In agentic programming, humans mainly decide "what to do," while Claude is responsible for "how to do it." Users bear most of the planning decisions, while Claude handles most of the execution work. In other words, AI is taking over the implementation steps such as writing code, modifying files, running commands, and debugging, but goal setting and outcome judgment still depend on humans.

More importantly, the effectiveness of using Claude Code does not solely depend on whether the user is a programmer. The report shows that in tasks involving code generation, users from non-technical professions such as law, finance, management, and scientific research have success rates nearly on par with software engineers. The real factor influencing results is whether the user understands the problem they want to solve.

This means that AI programming lowers the barrier to implementation, not to judgment. In the future, those who understand the business, the scenario, and can clearly articulate requirements and evaluate results may be better at leveraging AI than those who only know how to write code. AI will not automatically replace domain knowledge; instead, it will amplify the value of domain expertise.

Below is the original text:

Key Findings

Building on existing research, we propose a framework for studying interactive agentic programming. The framework rests on a privacy-preserving analysis of about 400k Claude Code sessions from October 2025 to April 2026 and evaluates task composition, human-AI collaboration modes, and task success rates.

In a typical session, humans make most of the planning decisions (deciding "what to do"), while Claude makes most of the execution decisions (deciding "how to do it"). The stronger the user’s domain expertise, the more work each instruction prompts Claude to perform. In coding tasks, every major professional group achieves an average success rate nearly on par with software engineers, where success means completing the originally intended task with verifiable evidence such as passing tests or committed code.

The more domain expertise a user has, the more likely the session ends successfully. However, the gap between intermediate and expert users is not large. Over the seven months we observed, the proportion of debugging sessions nearly halved, shifting toward more end-to-end agent usage: deploying and running code, data analysis, and writing non-code documentation.

During these seven months, the value of typical tasks increased across nearly all job types. We estimated task value by comparing it to freelance market costs, calibrated with real job posting datasets, and found an average increase of about 25%.

Introduction

Agentic programming is rapidly emerging. Since late 2025, the proportion of coding activities involving intelligent agents in GitHub projects has more than doubled, and Claude Code users now average 20 hours per week using the tool. Can people without formal programming experience successfully direct an agent to perform complex technical work? How will the rapid adoption and capability improvements of these tools impact broader knowledge work? We cannot yet give a complete answer, but early signals can be seen in Claude Code usage data.

This report is based on privacy-protected analysis of approximately 235k users and 400k interactive sessions from October 2025 to April 2026, providing evidence of actual Claude Code usage. It continues our previous research on autonomy metrics within Claude Code sessions and how Claude Code is transforming work within Anthropic. We propose a framework to describe the usage of interactive AI programming assistants: what people are doing, who is doing it, and whether the work is successful. We focus on how users employ Claude Code via command-line interfaces (CLI), Claude.ai, or desktop applications. By tracking how agentic programming usage evolves with model capabilities, we aim to better understand the impact on professional programmers and the knowledge workforce.

What happens on Claude Code may foreshadow the future of knowledge work: agents will gradually embed into non-coding tasks. We find that Claude is handling more complex and valuable tasks. Meanwhile, clear division of labor persists: humans decide what to build, agents decide how to build it.

We also see evidence that the real amplifier of tool effectiveness is domain expertise, not coding proficiency. In particular, domain experts are more likely to succeed and to recover from errors or misunderstandings. However, the gap between experts and intermediate users is not large. This suggests that anyone sufficiently skilled in a domain can use these tools almost as effectively as deep specialists.

These findings allow us to preliminarily observe potential shifts in the labor market. Our data shows success depends on whether the user understands the problem they want to solve, not whether they have programming training. If this pattern holds across the economy, it implies that while agentic programming may absorb some implementation-focused tasks, it also rewards those who truly understand the problems they are addressing. Coding agents do not replace domain knowledge; rather, the more understanding a worker brings to the task, the more high-quality work the agent can produce.

Division of Labor

What do people do with Claude Code?

To understand how people use Claude Code, we categorize each session into one of nine work modes, representing the activity best describing the session’s goal. Four modes directly involve coding or maintenance: building new things, fixing broken ones, testing code, and orchestrating other agents or automation pipelines. Another category involves operating software: deploying, configuring, running pipelines, and monitoring systems. Two modes focus on understanding "what to do": understanding how an existing system works and planning changes before acting. The last two are non-coding or auxiliary to code: data analysis and communication via presentations or other text-based documents.

Approximately 56% of sessions involve coding (25%), fixing code (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploration 14%, and data analysis or writing text 13% (see Figure 1).

We first have the model review session logs and classify each session accordingly; then, using our privacy-preserving analysis tools, we cross-validate these classifications with telemetry data automatically recorded per session, including whether code lines were added or removed. The two sources show high consistency—for example, over 90% of sessions classified as creating or modifying code also show code changes in telemetry data. Details are in the appendix.
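The cross-check described above can be illustrated with a minimal sketch. The record layout, the mode names, and the choice of which modes count as "creating or modifying code" are all assumptions made for illustration; the report does not publish its schema or tooling.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical session record; field names are illustrative only."""
    work_mode: str      # label assigned by the model-based classifier
    lines_changed: int  # lines added plus removed, taken from telemetry


# Assumed set of work modes that count as creating or modifying code.
CODE_MODES = {"build", "fix"}


def code_change_agreement(sessions: list[Session]) -> float:
    """Share of code-labelled sessions whose telemetry also shows a code change."""
    labelled = [s for s in sessions if s.work_mode in CODE_MODES]
    if not labelled:
        raise ValueError("no sessions labelled as code work")
    confirmed = sum(1 for s in labelled if s.lines_changed > 0)
    return confirmed / len(labelled)


sample = [
    Session("build", 120),
    Session("fix", 8),
    Session("build", 0),    # labelled as code work, but telemetry shows no change
    Session("analyze", 0),  # not code work, so the check ignores it
]
print(code_change_agreement(sample))
```

A value above 0.9 on real data would correspond to the "over 90%" agreement the report cites; the four-row sample here is made up and only exercises the logic.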

Who makes decisions?

How autonomous is Claude Code? Capability assessments show its upper limit is already high and still rising. For example, in benchmarks like METR’s time horizon evaluation, state-of-the-art models can now autonomously complete software tasks that would take humans hours, overcoming obstacles along the way. But how does this look in real use? Here, we focus on how much direction humans and Claude each provide during actual sessions.

We examine this from two angles. First, how much do users delegate decision-making to Claude? Second, how many actions do they assign to Claude? To understand decision division, we built a privacy-preserving classifier that attributes each meaningful decision in a session to either the user or Claude. It categorizes decisions into planning (what to do, how to do it, what counts as completion) and execution (which files to modify, what code to write, in which language, and what commands to run). The classifier then attributes each decision accordingly and produces two metrics per session: the proportion of planning decisions made by the user, and the proportion of execution decisions made by the user.

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In practice, agentic programming forms a clear division of labor: humans decide what to build, agents decide how to build it.
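As a rough illustration of the two per-session metrics, the sketch below computes the user’s share of planning decisions and of execution decisions from a list of already-attributed decisions. The `Decision` type and the `user_share` helper are hypothetical names; the attribution itself is done by the report’s classifier, which is not reproduced here.

```python
from typing import Literal, NamedTuple, Optional


class Decision(NamedTuple):
    """One meaningful decision, already attributed by a classifier (hypothetical type)."""
    kind: Literal["planning", "execution"]
    made_by: Literal["user", "claude"]


def user_share(decisions: list[Decision], kind: str) -> Optional[float]:
    """Fraction of decisions of the given kind made by the user.

    Returns None when the session contains no decision of that kind.
    """
    relevant = [d for d in decisions if d.kind == kind]
    if not relevant:
        return None
    return sum(d.made_by == "user" for d in relevant) / len(relevant)


# Made-up session: the user sets the goals, Claude picks files and commands.
session = [
    Decision("planning", "user"),
    Decision("planning", "user"),
    Decision("planning", "user"),
    Decision("planning", "claude"),
    Decision("execution", "user"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
    Decision("execution", "claude"),
]
print(user_share(session, "planning"))   # 0.75
print(user_share(session, "execution"))  # 0.2
```

Averaging these two numbers over many sessions would give figures comparable to the roughly 70% and 20% reported above, assuming decisions are counted with equal weight.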

To understand how much delegation occurs in each session, we look at session structure rather than content. Claude sessions consist of exchanges: the user sends prompts, Claude acts; then the user sends the next prompt, and so on. Typically, there are about four rounds per session. In our data from October to April, each user prompt triggers about 10 actions from Claude, sometimes over 100. Each round, Claude reads files, edits code, runs commands, and outputs roughly 2,400 words on average.

How much work Claude completes between user checks largely depends on who makes decisions. When users retain control over execution—making over 80% of execution decisions—Claude performs fewer actions per round, around 8. When Claude controls planning—making over 80% of planning decisions—it takes on the highest number of actions, about 16.

Professional Level

Based on each session, Claude assesses the user’s apparent professional level for the task on a five-point scale from novice to expert. The classifier focuses on three signals: how precise the user’s instructions are, what the user asks Claude to verify, and whether the user corrects Claude more often or vice versa. It’s important to note that this professional level is entirely different from job titles or general ability; it is task-specific. For example, a senior engineer asking about Rust for the first time may still be a beginner on that task. An accountant who has never used Python but can accurately tell Claude the reconciliation rules for a Python script and catch edge cases during month-end closing is an expert for that task.

The table below shows how we define each professional level in the classifier, with example prompts from the publicly available SWE-chat dataset. Conversations classified as "novice" contain generic instructions without domain-specific knowledge; those labeled "expert" demonstrate deep understanding of codebases and technical environments.

We quantify the relationship between professional level and the output and activity volume per prompt from Claude. In typical novice sessions, each prompt triggers about five actions and outputs around 600 words; in expert sessions, the chain of actions is more than double that, about 12 actions, with output reaching roughly 3,200 words—five times more (see Figure 3). The gap between novice and expert appears across all work types and task value ranges.

These metrics complement our prior research on Claude Code autonomy, which tracked runtime and how often users automatically approve its actions. In contrast, our decision attribution captures who makes substantive decisions during the entire session, while output and action counts per prompt measure how much autonomous activity each human instruction can trigger from Claude.

Who Uses Claude Code and What Do They Do?

Users

To understand who is doing this work, we infer each user’s profession from session logs and map it to one of 23 major categories in the U.S. Bureau of Labor Statistics’ SOC system. The classifier relies solely on signals such as the project context loaded at session start, file names and structure, references to materials or outputs like legal documents, clinical data, financial reports, course materials, and vocabulary used. It is explicitly instructed not to interpret "writing code" itself as evidence of a programming profession. Only when clear signals indicate that software or data work is part of the user’s profession will the session be classified into coding-related SOC categories, such as "Computer and Mathematical Occupations." For example, if a lawyer builds a script to automatically check for missing clauses in contracts, even if the session mainly involves coding, it will be classified as a legal profession. If no signals about the user’s profession are present, the session remains unclassified.

We can infer the profession in about 70% of sessions. Among these, "Computer and Mathematical Occupations" is the largest group, unsurprisingly, as it covers most software-related work. Next are business and financial operations, arts, design and media, management, and life, physical, and social sciences. Among non-software professions, the fastest-growing groups are management, sales, and legal.

Work

From October 2025 to April 2026, the composition of work done with Claude Code changed significantly. The most notable shift is the decline in debugging sessions from 33% to 19% (see Figure 4). Taking their place are less code-centric, more end-to-end tasks: operating software rose from 14% to 21%, and writing and data analysis together roughly doubled, from about 10% to 20%.

The value of tasks also increased. We approximate the economic value of each session by estimating the cost of similar work in the freelance market, calibrated with real job posting datasets. According to this metric, the average session value increased by 27% from October to April. This rise appears across various work types: build, operate, and fix tasks increased by approximately 43%, 34%, and 32%, respectively. These price estimates are rough, mainly used to observe trends over time rather than as precise dollar values. Details on how the task value estimator is built are in the appendix.

Success Depends on What the User Brings

Estimating task value is one way to understand how Claude Code helps people do their work. Another is to observe how many sessions succeed and what session features correlate with success. Across all success metrics, a clear pattern emerges: higher professional level correlates with higher success probability. Most gains are seen at the lower end—improving from novice to intermediate yields larger benefits than from intermediate to expert.

Before analyzing features of successful sessions, we need to define success precisely. We cannot observe real-world outcomes or directly ask users whether they achieved their goals. Instead, we rely on two complementary measures based on session records. The first is "determined success": classifiers read the full session and judge whether the user achieved their original goal, choosing among success, partial success, failure, or no clear goal. Two additional classifiers evaluate the strength of the evidence behind this judgment. One looks for verifiable success signals such as git commits, pull requests, passing tests, and explicit user approval, scoring sessions from "no signal" (1) to "multiple strong signals" (5). A parallel failure classifier assesses evidence of errors, failed tests, repeated attempts, or user objections. The second measure, "verified success," requires both a positive success judgment and at least one strong verifiable success signal. Our analysis covers sessions judged as success, partial success, or failure and excludes those judged as "no clear goal," which account for about 7.7% of the full sample.
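The verified-success rule can be written down as a small predicate. This is a sketch under stated assumptions: the outcome labels are spelled as plain strings, and the cutoff of 4 on the 1-to-5 evidence scale is a guess at what "at least one strong signal" means, since the report does not state the threshold.

```python
from dataclasses import dataclass


@dataclass
class SessionJudgment:
    """Hypothetical per-session classifier output."""
    outcome: str           # "success", "partial", "failure", or "no_clear_goal"
    success_evidence: int  # 1 ("no signal") through 5 ("multiple strong signals")


def is_verified_success(j: SessionJudgment, strong_threshold: int = 4) -> bool:
    """True when the session is judged a success AND has strong verifiable evidence.

    strong_threshold is an assumed cutoff, not a value taken from the report.
    """
    return j.outcome == "success" and j.success_evidence >= strong_threshold


def verified_success_rate(judgments: list[SessionJudgment]) -> float:
    """Verified success rate over sessions that have a clear goal."""
    scored = [j for j in judgments if j.outcome != "no_clear_goal"]
    if not scored:
        raise ValueError("no sessions with a clear goal")
    return sum(is_verified_success(j) for j in scored) / len(scored)


sample = [
    SessionJudgment("success", 5),        # counts as verified success
    SessionJudgment("success", 2),        # judged successful, but evidence too weak
    SessionJudgment("partial", 4),        # strong evidence, but only partial success
    SessionJudgment("failure", 1),
    SessionJudgment("no_clear_goal", 1),  # excluded from the denominator
]
print(verified_success_rate(sample))  # 0.25
```

Requiring both conditions makes the measure deliberately strict, which is why verified success rates in the report sit far below the rates for at least partial success.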

Returns on Professional Level

Which sessions are most likely to succeed? Results show that the professional level score described above has a significant impact on success.

Some might worry that professional level isn’t the real driver—perhaps experts choose different tasks or differ in other ways. To address this, we compare sessions within the same work type, same estimated value, same month, same topic, and from the same broad occupational group, to see how user expertise affects outcomes.
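One simple way to make such a like-for-like comparison is exact matching on strata, sketched below. This is a swapped-in illustration, not the regression the report describes in its appendix, and the stratum keys, level names, and weighting scheme are all assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple

Row = Tuple[Hashable, str, bool]  # (stratum key, expertise level, succeeded)


def within_stratum_gap(rows: Iterable[Row], low: str = "novice", high: str = "expert") -> float:
    """Size-weighted average of (high-level rate minus low-level rate) within strata.

    A stratum stands for one combination of work type, value bucket, month,
    topic, and occupational group. Strata missing either level are skipped.
    """
    # stratum -> level -> [successes, total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for stratum, level, succeeded in rows:
        cell = counts[stratum][level]
        cell[0] += int(succeeded)
        cell[1] += 1

    weighted_gap = 0.0
    total_weight = 0
    for levels in counts.values():
        if low not in levels or high not in levels:
            continue  # nothing to compare inside this stratum
        low_successes, low_total = levels[low]
        high_successes, high_total = levels[high]
        weight = low_total + high_total
        weighted_gap += weight * (high_successes / high_total - low_successes / low_total)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no stratum contains both levels")
    return weighted_gap / total_weight


a = ("build", "mid-value", "2026-01", "web", "legal")
b = ("fix", "low-value", "2026-02", "data", "finance")
c = ("operate", "high-value", "2026-03", "infra", "management")
rows = [
    (a, "novice", True), (a, "novice", False), (a, "expert", True), (a, "expert", True),
    (b, "novice", False), (b, "expert", True), (b, "expert", False),
    (c, "expert", True),  # no novice in this stratum, so it is skipped
]
print(within_stratum_gap(rows))  # 0.5
```

Because each difference is taken inside a single stratum, a gap that survives this comparison cannot be explained by experts simply choosing different kinds of tasks along the matched dimensions.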

Across all success metrics, higher professional level correlates with higher success rates. Novice sessions have a verified success rate of 15%, and 77% reach at least partial success. Sessions at intermediate level and above have verified success rates of 28%–33%, and 91%–92% reach at least partial success (see Figure 5).

Most of the gains occur from novice to intermediate; the slope flattens from intermediate to expert. Details of the regression analysis behind Figure 5 are in the appendix.

A similar gradient appears in sessions that run into trouble. Among sessions with verified failure evidence (errors, failed tests, repeated attempts, or user frustration), the verified success rate rises from 4% in novice sessions to 15% in expert sessions (see Figure 5). Under the more lenient measure of at least partial success, the rate is 60% for novices and 80%–81% for intermediate to expert users.

We also track the inverse relationship: how professional level relates to various failure indicators. Note that in this analysis, sessions classified as failures are those with no partial success. If a problematic session is judged as failed and no code lines are written, we call it abandoned. Among sessions where users appear to be novices, 19% are ultimately abandoned; for other groups, the rate is 5%–7%. In other words, less experienced users are more likely to give up when facing difficulties. Part of the value of expertise seems to be in guiding the agent back on track.
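The abandonment definition above translates into a short predicate plus a per-level rate. The field names are hypothetical, and the denominator used here (sessions that ran into problems) is an assumption, since the text does not say which sessions the percentages are taken over.

```python
from collections import Counter
from typing import NamedTuple


class Session(NamedTuple):
    """Hypothetical session summary; field names are illustrative only."""
    level: str          # apparent expertise, e.g. "novice" or "expert"
    had_problems: bool  # failure signals were detected in the session
    outcome: str        # "success", "partial", or "failure"
    lines_written: int  # lines of code written during the session


def is_abandoned(s: Session) -> bool:
    """A problematic session judged as failed with no code lines written."""
    return s.had_problems and s.outcome == "failure" and s.lines_written == 0


def abandonment_rate_by_level(sessions: list[Session]) -> dict[str, float]:
    """Abandoned share of problematic sessions per expertise level (assumed denominator)."""
    troubled: Counter = Counter()
    abandoned: Counter = Counter()
    for s in sessions:
        if not s.had_problems:
            continue
        troubled[s.level] += 1
        if is_abandoned(s):
            abandoned[s.level] += 1
    return {level: abandoned[level] / n for level, n in troubled.items()}


sample = [
    Session("novice", True, "failure", 0),   # abandoned
    Session("novice", True, "partial", 40),  # recovered to partial success
    Session("expert", True, "failure", 25),  # failed, but code was written
    Session("expert", True, "success", 60),
    Session("expert", False, "success", 10), # no problems, so not counted
]
print(abandonment_rate_by_level(sample))  # {'novice': 0.5, 'expert': 0.0}
```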

Occupation May Be Less Important Than Professional Level

Software-related users have a verified success rate of about 30% across all sessions, while other users are at about 26%. In sessions involving code creation (adding or modifying at least one line), these figures are 34% and 29%, respectively (see Figure 6). Under a more lenient success definition, the gap narrows further: 89% and 88% of code-generation sessions reach at least partial success. The five-percentage-point difference in verified success is small and has neither widened nor narrowed over seven months, even as success rates rose for both groups. Each of the top ten occupational groups in our dataset is within seven percentage points of software engineers’ success rate. Management professions have the highest verified success rate, slightly above software occupations. This may reflect that management skills transfer to directing agents, but it could also partly result from our measurement approach, which relies on explicit user confirmation: managers may be more accustomed to expressing satisfaction when they get the results they want.

Outlook

The results sketch a developing picture: agentic programming amplifies certain knowledge and skills while replacing others. In code-generation sessions, success rates across major professions are close to those of software-related roles. It appears that coding agents are making whether one has a programming background less relevant for success.

At the same time, successful sessions are more likely to involve domain expertise. Expert-level sessions have verified success rates more than twice those of novices. When sessions encounter problems, novices are several times more likely to give up. The collaboration mode itself clarifies this: domain experts can guide Claude to do more with each instruction. Therefore, the ability to steer Claude toward success depends more on domain mastery than on coding ability. Anyone with operational understanding in a field can now accomplish technical tasks previously out of reach. Conversely, those lacking this understanding, even with the same tools, will benefit far less. The main gains come from competence, not mastery. Having actionable domain knowledge already yields most benefits; deep specialization adds only marginal extra advantage.

These findings are still preliminary. Like most research, we cannot measure real-world outcomes—whether code written in a session is ultimately used or discarded, or whether it produces economically valuable results. Additionally, this report excludes non-interactive use, which accounts for a significant portion of activity. Developing a framework to measure such use is a key future goal. All session classifications depend on the model’s reading of session logs. In the appendix, we show that the classifier aligns with independent telemetry data and with strong reference models in most cases. However, validating classifiers at scale remains challenging; Claude Code sessions are often lengthy and complex, making manual annotation as a ground truth difficult.

As models, users, and their division of labor evolve, the landscape presented here will continue to change. We hope these metrics help track major shifts: for example, if future returns from professional expertise decline, it would suggest models are providing key judgments that users previously supplied, expanding benefits from domain experts to broader populations. If success rates for non-software users continue rising, it may indicate that software production is becoming a routine part of general work, not confined to a single profession. These shifts will influence who benefits from agentic programming and how much, impacting the most valued skills in the labor market.
