Who is the best at using Claude Code? The answer might not be programmers.

Author: Anthropic; Translation: Peggy, Blockchain Rhythm

This report is based on approximately 400k Claude Code sessions, discussing how AI programming tools are changing the relationship between humans and code.

The core finding of the article is: in intelligent agent programming, humans mainly decide "what to do," while Claude is responsible for "how to do it." Users make most of the planning decisions, and Claude handles most of the execution work. In other words, AI is taking over implementation steps such as writing code, modifying files, running commands, and debugging, but goal setting and outcome judgment still depend on humans.

More importantly, the effectiveness of using Claude Code does not depend solely on whether the user is a programmer. The report shows that in tasks involving code generation, users from non-technical professions such as law, finance, management, and scientific research have success rates nearly on par with software engineers. The real factor influencing results is whether the user understands the problem they want to solve.

This means that AI programming lowers the barrier to implementation, not to judgment. In the future, people who understand the business, the scenario, and can clearly articulate requirements and evaluate results may be better at utilizing AI than those who only know how to write code. AI will not automatically replace domain knowledge; instead, it will amplify the value of domain expertise.

Below is the original text:

Key Findings

Building on existing research, we propose a framework for studying interactive intelligent agent programming. This framework is based on privacy-protected analysis of about 400k Claude Code sessions from October 2025 to April 2026, evaluating task composition, human-AI collaboration methods, and success rates.

In a typical session, humans are responsible for most planning decisions, deciding "what to do"; Claude handles most execution decisions, deciding "how to do it." The stronger a user’s expertise in a domain, the more work each instruction triggers Claude to perform. In coding tasks, the success rates of the major occupational groups are nearly on par with that of software engineers, where success means completing the user’s original intent as verified through tests, code commits, and similar signals.

The more domain expertise a user has, the more likely the session is to end successfully. However, the gap between intermediate and expert users is not large. Over the seven months we observed, the proportion of debugging sessions nearly halved, and usage shifted toward more end-to-end agent workflows: deploying and running code, analyzing data, and writing non-code documentation.

During these seven months, the value of typical tasks increased across nearly all job types. We estimate task value by comparing it to freelance market rates, calibrated with real public job data. The average estimated value of sessions increased by about 25% from October to April.

Introduction

Interactive agent programming is rapidly emerging. Since late 2025, the proportion of coding activities involving agents in GitHub projects has more than doubled, and Claude Code users now average 20 hours per week on the tool. Can people without formal programming experience successfully direct an agent to perform complex technical tasks? How will the rapid adoption and capability improvements of these tools affect broader knowledge work? We cannot yet give a complete answer, but early signals can be seen in Claude Code usage data.

This report is based on privacy-protected analysis of approximately 235k users and 400k interactive sessions from October 2025 to April 2026, providing evidence of actual usage patterns. It extends previous research on autonomy metrics in Claude Code sessions and how Claude Code is transforming work within Anthropic. We introduce a framework to describe how interactive AI programming assistants are used: what work people are doing, who is doing it, and whether the work is successful. Our focus is on users interacting via command-line interfaces (CLI), Claude.ai, or the Claude Code desktop app. By tracking how agent programming usage evolves with model capabilities, we can better understand the impact of these tools on professional programmers and the knowledge worker labor market.

What happens on Claude Code may foreshadow the future of knowledge work: agents will gradually embed into non-coding tasks. We find that Claude is handling more complex and valuable tasks. Meanwhile, clear division of labor persists in agent programming: humans decide what to build, agents decide how to build it.

We also see evidence that the real amplifier of tool effectiveness is domain expertise, not coding proficiency. In particular, domain experts are more likely to succeed and to recover from errors and misunderstandings. However, the gap between experts and intermediate users is small. This suggests that as long as someone is sufficiently skilled in a domain, they can use these tools almost as effectively as deep specialists.

These findings allow us to preliminarily observe potential shifts in the labor market. Our data show success depends on whether a person understands the problem they want to solve, not whether they have formal coding training. If these patterns hold across the economy, it implies that while agent programming may absorb some implementation-focused work, it also rewards those who truly understand the problems they are addressing. Coding agents do not replace domain knowledge; rather, the more understanding a worker brings to the task, the more high-quality work the agent can produce.

Division of Labor

What do people do with Claude Code

To understand how people use Claude Code, we categorize each session into one of nine work modes—each best describing the session’s primary goal. Four of these modes directly involve coding or maintenance: building new things, fixing broken ones, testing code, and orchestrating other agents or automation pipelines. Another category involves operating software: deploying, configuring, running pipelines, and monitoring systems. Two modes focus more on understanding "what to do": understanding how an existing system works and planning changes before acting. The last two are non-code tasks or involve code only as an auxiliary: data analysis and communication via presentations and other text-based documents.

About 56% of sessions involve coding (25%), fixing code (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploration 14%, and data analysis or writing text 13% (see Figure 1).

> Figure 1: Nine work modes. Each interactive session is categorized into the single mode that best describes its goal.

We first have the model read session logs and classify each session accordingly; then, using our privacy-protected analysis tools, we cross-validate these classifications with telemetry data automatically recorded for each session, including whether code lines were added or removed. The two sources show high consistency—for example, in sessions labeled as creating or modifying code, over 90% also show code changes in telemetry data. Details are in the appendix.

Who makes decisions

How autonomous is Claude Code? Capability assessments show its upper bound is already high and still rising. For example, in benchmarks like METR’s time-horizon evaluation, state-of-the-art models can now autonomously complete software tasks that previously took hours, overcoming obstacles on their own. But how does this look in real use? Here, we focus on how much guidance humans and Claude each provide during actual sessions.

We examine this from two angles. First, how much do people delegate decision-making to Claude? Second, how many actions do they assign to Claude? To understand decision division, we build a privacy-protected classifier that attributes each meaningful decision in a session to either planning (what to do, how to do it, what counts as done) or execution (which files to modify, what code to write, in what language, what commands to run). The classifier then attributes each decision to Claude or the user, producing two numbers per session: the proportion of planning decisions made by the user, and the proportion of execution decisions made by the user.
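As a rough illustration of the two per-session numbers described above, the sketch below assumes every decision has already been labeled with a kind (planning or execution) and an actor (user or Claude). The `Decision` record and the `user_shares` function are hypothetical names chosen for this example; they are not part of the report's classifier, and the sample session is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    kind: str   # "planning" or "execution"
    actor: str  # "user" or "claude"


def user_shares(decisions):
    """Return (user share of planning, user share of execution) for one session.

    A share is None when the session contains no decisions of that kind.
    """
    shares = []
    for kind in ("planning", "execution"):
        of_kind = [d for d in decisions if d.kind == kind]
        if not of_kind:
            shares.append(None)
            continue
        by_user = sum(1 for d in of_kind if d.actor == "user")
        shares.append(by_user / len(of_kind))
    return tuple(shares)


# Hypothetical session: 10 planning decisions (7 by the user)
# and 10 execution decisions (2 by the user).
session = (
    [Decision("planning", "user")] * 7
    + [Decision("planning", "claude")] * 3
    + [Decision("execution", "user")] * 2
    + [Decision("execution", "claude")] * 8
)
print(user_shares(session))  # (0.7, 0.2)
```

Averaging these two shares over many sessions would give figures of the kind reported in the next paragraph.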

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In practice, agent programming forms a clear division of labor: humans decide what to build, agents decide how.

To understand how much delegation of actions occurs within a session, we look at the session structure rather than content. Claude Code sessions consist of exchanges: the user prompts, Claude acts; then the user prompts again, and so forth. In typical sessions, this cycle occurs about four times. In our data from October to April, each user prompt triggers about 10 actions from Claude on average, sometimes over 100. Each round, Claude reads files, edits code, runs commands, and outputs roughly 2,400 words.

How much work Claude completes between user checks depends heavily on who makes decisions. When users retain control over execution—making over 80% of execution decisions—Claude performs fewer actions per round, around 8. When Claude controls planning—over 80% of planning decisions—it takes on the most actions, about 16.
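The per-prompt action counts discussed above can be sketched from session structure alone. This minimal example assumes a session is available as an ordered list of `"prompt"` and `"action"` events; that event format and the `actions_per_prompt` helper are assumptions made for illustration, not the report's telemetry schema.

```python
def actions_per_prompt(events):
    """Count the Claude actions that follow each user prompt.

    `events` is an ordered list of "prompt" and "action" strings.
    Actions that occur before the first prompt are ignored.
    """
    counts = []
    for event in events:
        if event == "prompt":
            counts.append(0)      # start a new round
        elif event == "action" and counts:
            counts[-1] += 1       # attribute the action to the latest prompt
    return counts


# Hypothetical two-round session.
events = ["prompt"] + ["action"] * 12 + ["prompt"] + ["action"] * 8
counts = actions_per_prompt(events)
print(counts)                     # [12, 8]
print(sum(counts) / len(counts))  # 10.0
```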

> Figure 2: Claude’s share of planning and execution decisions. The figure shows the distribution of the proportion of planning (what to do) and execution (how to do it) decisions attributed to Claude versus the user across different sessions. In typical sessions, users make about 70% of planning decisions, while Claude makes about 80% of execution decisions.

Professional Level

Based on each session, Claude assesses the user’s apparent professional level on a five-point scale—from novice to expert—focused on the specific task. The classifier considers three signals: how precise the user’s instructions are, what the user asks Claude to verify, and whether the user corrects Claude more often or vice versa. Note that this professional level is entirely different from job titles or general ability; it is task-specific. An experienced engineer asking about Rust for the first time may still be a beginner on that task. An accountant who has never used Python but can accurately tell Claude the reconciliation rules for a Python script and catch edge cases during month-end closing is an expert for that task.

The table below shows how we define each professional level in the classifier, with example prompts from the public dataset SWE-chat. Conversations labeled as "novice" contain generic instructions without domain-specific knowledge; those labeled as "expert" demonstrate deep understanding of codebases and technical environments.

> Table 1: Professional level classifier. Examples are paraphrased, anonymized, and compressed real sessions from our annotated dataset SWE-chat. Many examples are from publicly available agent programming sessions.

We quantify the relationship between professional level and Claude’s output and activity volume per prompt. In typical novice sessions, each prompt triggers about 5 actions and roughly 600 words of output; in expert sessions, each prompt triggers more than twice as many actions, about 12, and around 3,200 words of output, roughly five times as much (see Figure 3). The gap between novice and expert appears across all work types and task value ranges.

These metrics complement our prior research on Claude Code autonomy, which tracked runtime and how often users automatically approve actions. In contrast, our decision attribution captures who makes substantive decisions during the entire session, while the output and action counts per prompt measure how much autonomous activity each human instruction can trigger from Claude.

> Figure 3: More professional users get more work done per prompt. Higher professional levels correlate with more actions (left) and more text output (right) per prompt. The boxes show interquartile ranges, with medians at the center; whiskers extend to the 5th and 95th percentiles. Dots are geometric means. Both upward trends are statistically significant (p < 0.001), with each step up in professional level associated with a 9% increase in actions and a 13% increase in output volume, after controlling for work mode, task value, month, profession, and model series, and clustering by user.
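For a sense of scale, the per-step estimates in the caption can be compounded across the four steps from novice to expert. This is only a back-of-the-envelope sketch: it assumes the increases are multiplicative and equal at every step, which the caption does not state, and `cumulative_multiplier` is a hypothetical helper.

```python
def cumulative_multiplier(per_step_increase, steps):
    """Compound a fixed fractional increase over a number of steps."""
    return (1 + per_step_increase) ** steps


STEPS = 4  # novice -> expert on a five-point scale

print(round(cumulative_multiplier(0.09, STEPS), 2))  # 1.41 (actions)
print(round(cumulative_multiplier(0.13, STEPS), 2))  # 1.63 (output volume)
```

Under that assumption, the adjusted novice-to-expert difference is smaller than the raw gap quoted earlier (about 5 versus 12 actions), which would be consistent with part of the raw gap being associated with the controlled variables rather than with professional level itself.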

Who Uses Claude Code and What Do They Do With It

Users

To understand who is doing these tasks, we infer each user’s profession from session logs and map it to one of 23 major categories in the U.S. Bureau of Labor Statistics’ Standard Occupational Classification (SOC) system. The classifier is instructed to judge solely based on signals such as the context loaded at session start, file names and structure, references to materials or products like legal documents, clinical data, financial reports, course materials, etc., and vocabulary used. It is explicitly instructed not to interpret "writing code" itself as evidence of a programming occupation. Only when clear signals indicate that software or data work is part of the user’s profession will the session be classified as related to coding, e.g., "Computer and Mathematical Occupations." For example, if a lawyer builds a script to automatically check for missing clauses in a set of contracts, it will be classified as a legal profession even if the main activity is coding this time. If no signals about the user’s profession are present, the session remains unclassified.

We can infer the profession in about 70% of sessions. Among these, "Computer and Mathematical Occupations" is the largest group, unsurprising given it covers most software-related work. Next are business and financial operations, arts, design and media, management, and life, physical, and social sciences. Among non-software professions, the fastest-growing groups are management, sales, and legal.

Work

From October 2025 to April 2026, the composition of work done with Claude Code changed significantly. The most notable shift is the decline in sessions involving fixing broken code, from 33% down to 19% (see Figure 4). Taking its place is more code-adjacent work. Operating software increased from 14% to 21%. Writing and data analysis roughly doubled, from about 10% to 20%.

The value of tasks also increased. We estimate the economic value of each session by comparing it to freelance market rates, calibrated with real public job data. According to this measure, the average session value increased by about 27% from October to April. This rise appears across many work types: build, operate, and fix tasks increased approximately 43%, 34%, and 32%, respectively. These price estimates are rough, mainly used to compare trends over time rather than as precise dollar values. Details on how the task value estimator is built are in the appendix.

> Figure 4: Changes in Claude Code work composition and value from October 2025 to April 2026. The figure shows the proportion of different work modes over a seven-month window. Fixing broken code declined from 33% to 19%, while operating software, analyzing data, and writing documents increased.

Success Depends on What the User Brings

Estimating task value is one way to understand how Claude Code helps people get work done. Another is to observe how many sessions succeed and what features are associated with success. Across all success metrics, a clear pattern emerges: higher professional level correlates with higher likelihood of success. Most of the gain comes at the lower end of the scale: the jump from novice to intermediate is larger than the one from intermediate to expert.

Before analyzing features of successful sessions, we need to define success precisely. We cannot observe real-world outcomes or directly ask users if they achieved their goals. Instead, we rely on two complementary, session-record-based measures. The first is "determined success," where a classifier reads the full session and judges whether the user completed their original goal; the options are success, partial success, failure, or no clear goal. The second is "verified success," for which two auxiliary classifiers evaluate the strength of the evidence behind this judgment. The success classifier looks for verifiable evidence, especially matching git activity such as commits and pull requests, passing tests, and explicit user approval. It scores sessions on a scale from "no signal" (1) through "weak signal" to "multiple strong signals" (5). A parallel failure classifier scores evidence of errors, failed tests, repeated attempts, or user objections. Verified success requires both that the session is judged successful and that it has at least one strong verifiable success signal. Our analysis focuses on success or failure levels, excluding sessions classified as "no clear goal," which account for about 7.7% of the full sample.
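The verified-success rule described above can be sketched as a simple conjunction. The example below is a minimal illustration under stated assumptions: the `SessionJudgment` record, the outcome labels, and in particular the `STRONG_SIGNAL` cutoff of 4 are hypothetical, since the report describes a 1 to 5 evidence scale without saying which score counts as a strong signal. The sample judgments are invented.

```python
from dataclasses import dataclass

# Assumed cutoff: the report does not state which score on the 1-5 scale
# qualifies as "at least one strong verifiable success signal".
STRONG_SIGNAL = 4


@dataclass(frozen=True)
class SessionJudgment:
    outcome: str         # "success", "partial", "failure", or "no_clear_goal"
    success_signal: int  # 1 (no signal) to 5 (multiple strong signals)


def is_verified_success(s):
    """Judged successful AND backed by strong verifiable evidence."""
    return s.outcome == "success" and s.success_signal >= STRONG_SIGNAL


def verified_success_rate(sessions):
    """Share of verified successes, excluding sessions with no clear goal."""
    scored = [s for s in sessions if s.outcome != "no_clear_goal"]
    if not scored:
        return None
    return sum(is_verified_success(s) for s in scored) / len(scored)


# Hypothetical judgments, not data from the report.
sessions = [
    SessionJudgment("success", 5),        # verified success
    SessionJudgment("success", 2),        # judged successful, weak evidence
    SessionJudgment("partial", 4),        # strong evidence, only partial
    SessionJudgment("failure", 1),
    SessionJudgment("no_clear_goal", 1),  # excluded from the denominator
]
print(verified_success_rate(sessions))  # 0.25
```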

Return on Professional Level

Which sessions are most likely to succeed? Results show that the professional level score described above has a strong impact on success.

Some might worry that professional level isn’t the real driver—perhaps experts choose different tasks or differ in other ways. In this section, we partially address this concern by comparing sessions of the same work type, same estimated value, same month, same topic, from the same broad occupational group, and controlling for other variables. This isolates how different professional levels influence success.

> Table 2: Definitions of success and failure derived from classifiers. Examples are paraphrased, anonymized, and summarized real sessions from the SWE-chat dataset, labeled by our classifiers.

Across all success metrics, higher professional level correlates with higher success rates. Sessions rated as "novice" have a verified success rate of 15% and a partial success rate of 77%. Those rated as "intermediate" or above have verified success rates of 28–33% and partial success rates of 91–92% (see Figure 5).

Most gains occur from novice to intermediate; the slope flattens from intermediate to expert. Details of the regression analysis behind Figure 5 are in the appendix.

> Figure 5: Relationship between professional level and session outcome. The figure shows session results across five levels from novice to expert. The left panel includes all sessions; the middle and right panels focus on sessions with problems (failure signals > 3), showing the proportion reaching different success/failure definitions. Each point is an adjusted ratio. We compare sessions with the same work mode, task value, month, topic, and user type (whether they are in software-related professions) to estimate differences by professional level. Regression details are in the appendix. Error bars show 95% confidence intervals, most too small to see. Sessions classified as "no clear goal" are excluded.

A similar gradient appears in sessions with challenges. When verified failure signals are recorded, we consider the session "encountered a problem." This includes errors, failed tests, repeated attempts, or user frustration. Among sessions with problems, verified success rises from 4% in novices to 15% in experts (see Figure 5). Using a broader success definition, partial success is 60% in novices and 80–81% in intermediates and experts.

We also track the opposite relationship: how failure signals vary with professional level. Note that sessions classified as failures are those with no partial success. If a problematic session is judged as failed and no code lines are written, we call it abandoned. Among sessions where users appear to be novices, 19% are ultimately abandoned; for other groups, the rate is 5–7%. Less experienced users are more likely to give up when facing difficulties. Part of the value of professionalism seems to be in guiding the agent back on track.
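The definition of an abandoned session given above combines three conditions, which can be sketched as follows. The function names and outcome labels are hypothetical; the problem threshold follows the Figure 5 caption ("failure signals > 3"), and `lines_written` stands in for the telemetry count of code lines added or modified.

```python
PROBLEM_THRESHOLD = 3  # per the Figure 5 caption: "failure signals > 3"


def encountered_problem(failure_signal):
    """A session 'encountered a problem' when failure evidence is strong."""
    return failure_signal > PROBLEM_THRESHOLD


def is_abandoned(outcome, failure_signal, lines_written):
    """Problematic session, judged failed, with no code lines written."""
    return (
        encountered_problem(failure_signal)
        and outcome == "failure"
        and lines_written == 0
    )


print(is_abandoned("failure", 5, 0))   # True
print(is_abandoned("failure", 5, 40))  # False: code was written
print(is_abandoned("partial", 4, 0))   # False: not judged a failure
print(is_abandoned("failure", 2, 0))   # False: no problem recorded
```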

Profession May Be Less Important Than Professional Level

Users in software-related professions have about a 30% verified success rate, compared to 26% for others. In sessions involving code creation (at least one line added or modified), these numbers are 34% and 29%, respectively (see Figure 6). Using a broader success definition, the gap narrows further: 89% and 88% partial success. The five-percentage-point difference remains stable over seven months, with success rates rising for both groups. In the top ten largest occupational groups in our dataset, the success rate gap with software engineers is within seven percentage points. Management professions have the highest verified success rate, slightly above software occupations. This may reflect that management skills transfer to commanding agents, but it could also partly result from our measurement approach—verification depends on explicit user confirmation, which managers may be more accustomed to providing when they get the results they want.

> Figure 6: Success and verified success rates in coding sessions by inferred occupation. The figure shows the proportion of sessions with at least one line added or modified, classified as success or verified success, across the ten largest occupational groups. All groups are within seven percentage points of the "Computer and Mathematical" category (SOC). Error bars show 95% confidence intervals based on different accounts.

Looking Ahead

The results in this report sketch a developing picture: agent programming amplifies certain knowledge and skills while replacing others. In code-producing sessions, success rates across major professions are close to those of software-related roles. It appears that coding agents are making whether someone has a programming background less relevant for success.

At the same time, successful sessions tend to involve domain expertise. Expert-rated sessions have verified success rates more than twice those of novices. When sessions encounter problems, novices are much more likely to give up. The collaboration pattern makes this clearer: domain experts can guide Claude to do more with each instruction. Therefore, the ability to steer Claude toward success depends more on domain mastery than on coding skill. Anyone with operational understanding of a domain can now accomplish technical work that was previously out of reach. Conversely, those lacking this understanding, even with the same tools, will benefit less. The main gains come from competence, not mastery. Having actionable domain knowledge already yields most benefits; deep specialization adds only marginal advantages.

These findings are still preliminary. Like most research, we cannot measure real-world outcomes—whether code written in a session is ultimately used or discarded, or whether it produces economic value. Additionally, non-interactive use, which accounts for a significant portion of activity, is excluded here. Developing a framework to measure such use is a key future goal. All session classifications rely on the model’s reading of session logs. In the appendix, we show that classifier outputs align with independent telemetry data and with strong reference models in most cases. But in large-scale scenarios, verifying classifiers remains challenging; Claude Code sessions are often lengthy and complex, making manual annotation as a ground truth difficult.

As models, users, and their division of labor evolve, the landscape depicted here will continue to change. We hope these metrics help track major shifts: for example, if the return on professional level begins to decline, it may indicate models are providing key judgments that users currently bring, expanding benefits from domain experts to broader populations. If success rates outside software professions continue to rise, it could mean software creation is becoming a routine part of general work, not confined to a single profession. These shifts will influence who benefits from agent programming, how much, and will impact the most valued skills in the labor market.
