The Large-Model "Chinese Tax": Why Does Chinese Use More Tokens Than English?
Author: Tang Yitao, Source: Geek Park
In the days right after Opus 4.7 was released, complaints were all over X. Some said a single conversation used up their session quota; others said that running the same piece of code now cost more than twice what it did the week before; still others posted screenshots showing their $200 Max subscription hitting its limit in less than two hours.
Independent developer BridgeMind admits that Claude is the best model in the world, but also the most expensive. His Max subscription hit its limit in less than two hours; luckily, he had bought two. | Image source: X@bridgemindai
Anthropic’s official prices remain unchanged, still $5 per million input tokens and $25 per million output tokens. But this version introduced a new tokenizer, and Claude Code increased the default effort from high to xhigh. These two changes combined mean that the same work now consumes 2 to 2.7 times more tokens than before.
In these discussions, I saw two claims about Chinese. The first: Chinese token counts barely increased under the new tokenizer, so Chinese users dodged this price hike. The second, more interesting one: Classical Chinese uses fewer tokens than modern Chinese, so conversing with AI in Classical Chinese saves money.
The first claim implies that Claude made some optimization for Chinese, yet Anthropic's release notes mention no Chinese-related adjustments.
The second claim is harder to explain. Classical Chinese is obviously harder for humans to read than modern Chinese. How can text that is more difficult end up cheaper for the AI?
So I ran a test: 22 sets of parallel texts (business news, technical documentation, classical texts, daily conversations, and so on), fed into 5 tokenizers (Claude 4.6 and 4.7, GPT-4o, Qwen 3.6, DeepSeek-V3), recording the token count of each segment under each model and comparing them side by side.
Test texts:
Daily conversations in Chinese and English (travel, forum help, writing requests)
Technical documents in Chinese and English (Python docs, Anthropic docs)
News in Chinese and English (NYT political news, NYT business news, Apple official statements)
Literary excerpts in Chinese and English, plus Classical Chinese (e.g., the Chu Shi Biao memorial, the Tao Te Ching)
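As a rough sketch of the counting procedure (not the article's exact data or test set), the script below computes a cn/en token ratio for one illustrative parallel pair using openly available tokenizers. The sample sentences and the Qwen checkpoint name are my own placeholders, and Claude's tokenizer is not distributed as a local library, so its counts are omitted here.

```python
# Rough sketch of the cn/en comparison procedure; assumes `tiktoken` and
# `transformers` are installed. Sample pair and Qwen checkpoint are illustrative.
import tiktoken
from transformers import AutoTokenizer

pair = {
    "en": "The company reported quarterly revenue of 90 billion dollars, "
          "beating analyst expectations despite supply chain pressure.",
    "cn": "该公司公布季度营收为900亿美元,尽管面临供应链压力,仍超出分析师预期。",
}

def count_tokens(text: str, which: str) -> int:
    """Return the token count of `text` under the chosen tokenizer."""
    if which == "GPT-4 (cl100k)":
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    if which == "GPT-4o (o200k)":
        return len(tiktoken.get_encoding("o200k_base").encode(text))
    if which == "Qwen":
        tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
        return len(tok.encode(text, add_special_tokens=False))
    raise ValueError(which)

for which in ("GPT-4 (cl100k)", "GPT-4o (o200k)", "Qwen"):
    cn, en = count_tokens(pair["cn"], which), count_tokens(pair["en"], which)
    print(f"{which}: cn={cn}  en={en}  cn/en={cn / en:.2f}")
```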
After running the tests, both claims received partial confirmation, but the facts are more complicated than the rumors suggest.
1. Chinese Tax
First, the conclusion:
In Claude and GPT, Chinese has always been more expensive than English
In Qwen and DeepSeek, Chinese is actually cheaper than English
The tokenizer upgrade in Opus 4.7 that caused the shock mainly inflated English token counts; Chinese remained almost unchanged
Looking at specific numbers: all models in the Claude series before Opus 4.7 (including Opus 4.6, Sonnet, Haiku) used the same tokenizer. Under this tokenizer, Chinese consumes more tokens than the same amount of English content, with a ratio of about 1.11× to 1.64×.
The most extreme case appears in NYT-style business news: the same content in Chinese consumes 64% more tokens, meaning paying 64% more.
Opus 4.6 and earlier Claude models show significantly higher Chinese token consumption compared to other models (red box).
In the most extreme NYT-style business news, Chinese consumes 64% more tokens (green box).
GPT-4o's o200k tokenizer does better, with cn/en ratios mostly between 1.0× and 1.35×, and a few scenarios below 1. Chinese is still generally more expensive, but the gap is much smaller than Claude's.
The data for the Chinese models Qwen 3.6 and DeepSeek-V3 are the complete reverse. Their cn/en ratios are mostly below 1, meaning that for the same content, Chinese consumes fewer tokens than English. DeepSeek's lowest ratio is 0.65×, meaning the Chinese text is roughly a third cheaper than the English.
The inflation caused by Opus 4.7's new tokenizer mainly hits English. English token counts grew by 1.24× to 1.63×, while Chinese counts stayed essentially at 1.00×, almost unchanged. The billing shock that English-speaking developers felt at launch was real, but Chinese users barely noticed it. The likely reason is that Chinese was already tokenized at roughly the character level in the old version, leaving little room for further splitting.
Compared to 4.6, Opus 4.7 increases English token consumption, but Chinese remains stable.
During testing, I also noticed that the difference in token consumption isn't just about billing; it directly affects how much fits in the context window. With the old Claude tokenizer and a roughly 200k-token window, 40% to 70% less Chinese content fits than English content. At the 1.64× ratio, for example, a 200k-token window effectively holds only about 200k / 1.64 ≈ 122k tokens' worth of equivalent English material.
For the same kind of task, such as analyzing a long document or summarizing meeting notes, Chinese users can feed less material to the model, and the model sees a shorter reference context. The result: paying more for a smaller working space.
Looking at these four sets of data together, a question naturally arises:
Why does the same content, in different languages, produce different token counts? Why is Chinese more expensive on Claude and GPT, but cheaper on Qwen and DeepSeek?
The answer is hidden in the tokenizer, the concept mentioned above.
2. How many pieces can a Chinese character be split into?
Before a model reads any text, a tokenizer splits the input into tokens. Think of the tokenizer as an assembly line for building blocks: you feed in a sentence, and it cuts the sentence into standardized blocks (tokens). The model never sees characters, only token IDs, and the number of tokens you pay for equals the number of blocks.
English tokenization is fairly intuitive: "intelligence" is likely one token and "information" is one token, roughly one word per billing unit.
But Chinese runs into problems here. Feeding the same sentence “人工智能正在重塑全球的信息基础设施” into GPT-4’s cl100k tokenizer and Qwen 2.5’s tokenizer yields completely different results.
GPT-4 generally treats each Chinese character as its own token (occasionally two); Qwen recognizes whole words as single tokens, e.g., "人工智能" (4 characters) counts as just one token in Qwen.
For the same 16 Chinese characters, GPT-4 produces 19 tokens, Qwen only 6.
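To make the split visible, here is a minimal sketch with two openly available tokenizers (Claude's tokenizer cannot be downloaded, and the Qwen checkpoint name is an illustrative choice); token boundaries may differ slightly between tokenizer versions.

```python
# Show how the same sentence is cut into pieces by two open tokenizers.
import tiktoken
from transformers import AutoTokenizer

sentence = "人工智能正在重塑全球的信息基础设施"

cl100k = tiktoken.get_encoding("cl100k_base")                     # GPT-4 vocabulary
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # Qwen vocabulary

gpt4_ids = cl100k.encode(sentence)
qwen_ids = qwen.encode(sentence, add_special_tokens=False)

# Decode each token id on its own to see the pieces. Tokens that hold only a
# partial UTF-8 sequence may display as the replacement character '�'.
print(len(gpt4_ids), [cl100k.decode([i]) for i in gpt4_ids])
print(len(qwen_ids), [qwen.decode([i]) for i in qwen_ids])
```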
Why? Because of an algorithm called BPE (Byte Pair Encoding).
BPE works by analyzing the training corpus to find the most frequent character combinations, then merging high-frequency pairs into single tokens, adding them to the vocabulary.
In the GPT-2 era, most training data was English. Common letter combinations like "th," "ing," and "tion" quickly got merged into tokens. Chinese characters, appearing far less often, were never merged and fell back to raw bytes: each Chinese character occupies 3 bytes in UTF-8, and therefore 3 tokens.
BPE merges are based on character frequency in training data. Under English-dominated corpora, Chinese UTF-8 bytes can’t be merged into whole characters.
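The merging logic itself is simple enough to sketch in a few lines. The toy loop below (not any production tokenizer) starts from raw UTF-8 bytes and finds the most frequent adjacent pair, which in an English-heavy corpus is always an English pair; the Chinese bytes never get their turn.

```python
# Toy byte-pair-encoding sketch: count adjacent byte pairs and report the most
# frequent one. With a mostly English corpus, the early merges are English pairs.
from collections import Counter

corpus = ["the thing", "nothing", "thinking", "人工智能"]  # mostly English

# Start from raw UTF-8 bytes: English letters take 1 byte, Chinese characters 3.
sequences = [list(text.encode("utf-8")) for text in corpus]
print([len(seq) for seq in sequences])  # "人工智能" alone starts as 12 byte tokens

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

# The winning pair comes from English (e.g. "th" or "in"), so the first merges
# build up English tokens while the Chinese bytes stay unmerged.
print(bytes(most_frequent_pair(sequences)))
```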
Later, GPT-4’s cl100k vocabulary expanded to include common Chinese characters, reducing each character to 1–2 tokens, but overall efficiency still lagged behind English.
With GPT-4o’s o200k vocabulary, Chinese efficiency improved further. This explains why in the first data set, GPT-4o’s cn/en ratio is lower than Claude’s.
The Chinese models Qwen and DeepSeek incorporated, from the start, many common Chinese characters and high-frequency phrases as whole units: one character, or even one whole word, equals one token, at least doubling the efficiency.
Illustration of how the same sentence is split under different tokenizers.
This is why their cn/en ratios can drop below 1: Chinese writing is inherently more information-dense than English, and once the tokenizer stops splitting characters into bytes, that natural advantage shows through.
Thus, the differences in the four data groups above are rooted not in model capability, but in how much space the tokenizer’s vocabulary allocates for Chinese.
Claude and early GPT vocabularies were built with English as the default; Chinese was added later. Qwen and DeepSeek’s vocabularies treat Chinese as the default language from the start. This initial difference propagates through token counts, billing, and context window sizes.
3. Is Classical Chinese really cheaper?
Revisiting the second rumor: Classical Chinese uses fewer tokens than modern Chinese.
Data confirms this. In the tests, classical Chinese samples had a cn/en ratio below 1 across all five tokenizers. The classical Chinese version of the same content consistently used fewer tokens than the English translation.
Among all models, classical Chinese consumes fewer tokens than modern Chinese, and even fewer than English.
The reason is simple: classical Chinese is extremely concise. Take "学而不思则罔,思而不学则殆" ("learning without thought leads nowhere; thought without learning is perilous"), 12 characters. Rendered in modern Chinese it becomes "只是学习而不思考就会迷惑,只是思考而不学习就会陷入困境", which roughly doubles the character count, and naturally doubles the token count.
Moreover, the common function characters of classical Chinese (之、也、者、而、不) are high-frequency and have dedicated slots in the vocabularies, so they are not split into bytes. That makes classical Chinese exceptionally efficient to encode.
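The comparison is easy to rerun with the open GPT-4o vocabulary (Claude's tokenizer is not publicly downloadable, and the English line is my own gloss rather than the article's test text); per the measurements above, the classical version should come out cheapest.

```python
# Compare character and token counts for classical Chinese, modern Chinese,
# and an English gloss under the o200k (GPT-4o) vocabulary.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

versions = {
    "classical": "学而不思则罔,思而不学则殆",
    "modern":    "只是学习而不思考就会迷惑,只是思考而不学习就会陷入困境",
    "english":   "Learning without thought leads nowhere; "
                 "thought without learning is perilous.",
}

for label, text in versions.items():
    print(f"{label:>9}: {len(text):>3} chars, {len(enc.encode(text)):>3} tokens")
```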
But there’s a trap here.
The token savings of classical Chinese happen at the encoding level, but the model's burden at inference does not shrink. Take the character "罔": the model still has to work out whether it means "confused," "deceived," or "lacking" in context. Modern Chinese spells the meaning out across 26 characters; classical Chinese compresses it into 12, leaving more of the disambiguation to the model.
In other words, the compression of classical Chinese lowers the token count but raises the model's reasoning load, and comprehension accuracy may even drop. That trade-off is hard to quantify.
This example made me realize that token count alone doesn’t tell the whole story. But thinking along this line, I noticed another overlooked aspect.
Earlier, I mentioned that GPT-2's tokenizer would split "人" (person) into three UTF-8 byte tokens. Later, GPT-4's vocabulary expanded and common Chinese characters became single tokens; Qwen went further and merged "人工智能" (artificial intelligence) into one token.
Intuitively, this is an ongoing improvement: merging more should lead to higher efficiency and better understanding.
But is that always true? Let’s recall how we understand Chinese characters.
Chinese characters are logograms. Over 80% of modern Chinese characters are phono-semantic, composed of a radical indicating meaning and a component indicating pronunciation. For example, “氵” (water radical) relates to liquids, “木” (wood radical) relates to plants, “火” (fire radical) relates to heat. Radicals are the most basic semantic clues in Chinese literacy; someone unfamiliar with “焱” (fire with multiple flames) can still guess it relates to fire because of the “火” radical.
Since radicals are fundamental semantic cues, humans infer meaning from structure first, then refine understanding based on context.
焱: fire sparks, flames, glow; common in written language and personal names, symbolizing brightness and heat.
But in the tokenizer's vocabulary, "焱" is just an index number. Suppose it is number 38721, a position in the vocabulary. The model looks up a vector at that index to represent "焱."
The number itself carries no internal structural information: the relationship between 38721 and 38722 is no different from that between 1 and 10,000. The character's "structure" is sealed away behind the index; the fact that "焱" is composed of three "火" components is lost.
The model can learn indirectly from training data that “焱,” “炎,” “灼” often appear in similar contexts, but this is more indirect than explicitly using radical information.
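What the model receives can be pictured as a plain table lookup. In the sketch below, the IDs 38721 and 38722 are made up for illustration; the point is only that nothing about a character's shape survives in its embedding index.

```python
# Minimal sketch of an embedding lookup: one vector per token ID, nothing else.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50_000, 8
embedding_table = rng.normal(size=(vocab_size, dim))  # randomly initialized

vec_yan  = embedding_table[38721]  # hypothetical ID for "焱"
vec_next = embedding_table[38722]  # the numerically adjacent ID

# Adjacent IDs are no more related than any two random rows: the three-fire
# structure of "焱" is invisible at this level and must be learned, if at all,
# indirectly from context during training.
cos = vec_yan @ vec_next / (np.linalg.norm(vec_yan) * np.linalg.norm(vec_next))
print(round(float(cos), 3))
```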
Can the model “see” similar radical clues from the byte sequences and then reassemble them in subsequent layers? Although this approach involves more tokens and higher costs, could it be more effective for semantic understanding than simply treating the character as an opaque number?
A paper published in 2025 in MIT Press’s Computational Linguistics (“Tokenization Changes Meaning in Large Language Models: Evidence from Chinese”) addresses this question.
4. Can fragments grow radicals?
Author David Haslett noticed a historical coincidence.
In the 1990s, Unicode assigned code points to Chinese characters in an order largely based on radicals, and these code points carry over into UTF-8, so characters sharing a radical end up with similar byte sequences. For example, "茶" (tea) and "茎" (stalk) both contain the "艹" (grass) radical, and their UTF-8 bytes share the same prefix. Similarly, "河" (river) and "海" (sea) share the "氵" (water) radical and a common leading byte.
UTF-8 encodes Chinese characters in an order based on radicals, so characters with the same radical have similar encodings | Source: Github
This means that when a tokenizer splits Chinese characters into three UTF-8 byte tokens, characters sharing radicals will share the first token. During training, the model repeatedly sees these shared byte patterns, potentially learning that characters with the same first token tend to belong to the same semantic category. This is functionally similar to humans using radicals to infer meaning.
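The coincidence is easy to verify directly from the byte sequences; how many leading bytes two characters share depends on how close their code points sit within the radical-sorted block.

```python
# Print the UTF-8 bytes of radical-sharing pairs and count shared leading bytes.
pairs = [("茶", "茎"), ("河", "海")]

for a, b in pairs:
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    shared = 0
    while shared < len(ba) and ba[shared] == bb[shared]:
        shared += 1
    print(a, ba.hex(" "), "|", b, bb.hex(" "), "| shared prefix bytes:", shared)
```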
Haslett designed three experiments to test this.
First, asking GPT-4, GPT-4o, and Llama 3: “Do ‘茶’ and ‘茎’ share the same semantic radical?”
Second, asking the models to rate the semantic similarity of two Chinese characters.
Third, asking the models to perform “find the odd one out” exclusion tasks.
Each experiment controlled two variables: whether the two characters truly share a radical, and whether they share the first token under the tokenizer. This 2×2 design isolates the effects of radical sharing and token sharing.
The results were consistent: when Chinese characters are split into multiple tokens (e.g., 89% of characters under GPT-4’s old tokenizer are split into multiple tokens), the model’s accuracy in recognizing shared radicals is higher; when characters are encoded as single tokens (e.g., GPT-4o’s new tokenizer, only 57% of characters are multi-token), accuracy drops.
In other words, the hypothesis is confirmed: splitting Chinese characters into fragments is more costly, but the byte sequences retain radical clues, and the model learns something from them. Encoding characters as whole tokens reduces costs but also loses the radical information embedded in the byte sequences.
It’s important to note that this conclusion applies only to sub-character semantic tasks; it does not mean the overall Chinese understanding, reasoning, or long-text generation capabilities decline. Also, the comparison between GPT-4 and GPT-4o involves differences in architecture, training data, and parameters, so the change in accuracy can’t be solely attributed to tokenization granularity.
This finding has been validated from an engineering perspective. A 2024 study on GPT-4o found that when the new tokenizer merges certain Chinese character combinations into long tokens, the model’s understanding actually worsens. When researchers used a specialized Chinese word segmenter to split these long tokens back into smaller units before feeding them in, comprehension improved.
The current mainstream consensus in large model development remains: language-specific, whole-word or whole-character tokenizers significantly improve overall performance. Whole-character/word encoding reduces token costs, increases effective context, shortens sequences, lowers inference latency, and stabilizes long-text processing. The fine-grained advantage observed in the experiments does not outweigh the overall performance gains in most Chinese NLP scenarios.
But this issue highlights a fundamental challenge in large systems: You can optimize the parts you designed, but you cannot optimize the parts you don’t realize exist. Unicode’s radical-based ordering was designed for human retrieval. BPE’s byte-level splitting was driven by low Chinese frequency in corpora. Two unrelated engineering decisions coincidentally created an unplanned semantic channel.
When a later generation of engineers "improves" the tokenizer by merging Chinese characters into whole tokens, they inadvertently close off this hidden semantic pathway. Efficiency rises and costs fall, but a subtle channel quietly disappears, with no error message anywhere.
Thus, the issue is more complex than “Chinese costs more in AI.” Every tokenizer is optimized for a default assumption, and the costs are hidden elsewhere.
5. Lin Yutang
The cost of adapting Chinese to Western technological infrastructure predates the AI era.
In January 2025, Nelson Felix, a New York resident, posted photos to a Facebook group devoted to typewriters: he had found a typewriter covered in Chinese characters among the belongings left by his wife's grandfather and had no idea what it was. Soon, hundreds of comments poured in.
Stanford sinologist Thomas S. Mullaney recognized immediately that this was the only prototype of Lin Yutang’s 1947 “Mingkuai Typewriter,” missing for nearly 80 years. In April of that year, Felix and his wife sold the typewriter to Stanford University Library.
The problem the Mingkuai Typewriter aimed to solve, and today’s tokenizer challenge, are structurally similar: how to embed Chinese efficiently into a Western-designed technological infrastructure.
In the 1940s, English typewriters had 26 letter keys, one character per key, straightforward. Chinese has thousands of common characters, making one-character-one-key impossible. The Chinese typewriter of that era used a huge character wheel, with thousands of lead type pieces. Typists had to pick characters manually, typing only a dozen or so characters per minute.
The earliest Chinese typewriter, invented in 1899 by American missionary Sheffield, is shown here | Source: Wikipedia
Lin Yutang sank $120,000 of development costs into the project, nearly bankrupting himself, and commissioned the Carl E. Krum company in New York to build a Chinese typewriter with only 72 keys. Its principle was to decompose Chinese characters into structural components: the user pressed two sets of keys to select the upper and lower components of a character, candidate characters appeared in a small "magic eye" window, and the user picked one by number. It could type 40–50 characters per minute and supported more than 8,000 common characters.
(Left) Transparent “magic eye” window; (Right) internal structure of the Mingkuai typewriter | Source: Facebook
The linguist Yuen Ren Chao praised it: "Whether Chinese or American, anyone who learns this keyboard can become familiar with it. I believe this is the typewriter we need."
Technologically, the Mingkuai was a breakthrough, but commercially it failed.
When Lin Yutang demonstrated it to Remington executives, the machine malfunctioned and the investors lost interest. High costs and his own financial troubles ruled out mass production. In 1948, Lin Yutang sold the prototype and the commercial rights to Mergenthaler Linotype, which ultimately shelved the machine. The prototype was taken home by an employee during a company relocation in the 1950s and vanished, until it resurfaced in 2025.
In his book The Chinese Typewriter, Mullaney's judgment was that the Mingkuai was "not a failure." As a product of the 1940s, it did fail. But as a paradigm of human-computer interaction, it succeeded.
Lin Yutang was the first to turn "typing" Chinese into "searching and selecting." Pressing a combination of keys to locate a character's components and then choosing from candidates is the underlying logic of every modern Chinese input method: from Cangjie and Wubi to Sogou Pinyin, all of them can be seen as descendants of the Mingkuai.
This typewriter, whose story spans nearly 80 years, and today's ongoing arguments about tokenizers point to the same historical pattern. Chinese has always faced one fundamental problem:
how to plug into an infrastructure built around the Latin alphabet.
Interestingly, along the way, coincidences that no one designed keep appearing. Unicode's radical-based ordering, intended for human lookup, happens to line up with BPE's unintentional byte-level splits, recreating a piece of human literacy inside the neural network's "black box." And when engineers try to eliminate the "Chinese tax" by merging characters into whole tokens, they inadvertently close off that hidden semantic channel.
History is not a straight line of evolution but a constantly deforming fluid under various constraints.
Some capabilities are deliberately designed; others exist simply because no one happened to design them away.