Give 4 AIs $20 each and have each of them run a radio station for half a year.


It’s not that they crashed after a few days: over the half year, each AI was replaced by 3-4 newer versions, and every one of them still failed.
Gemini paired the song “Timber” with a hurricane news story about 500,000 people killed (the lyrics repeatedly chant “It’s going down”), and its inner monologue read: “The theme is trees falling down, and the literal meaning is going down.”
It also made up a slogan, “stay in the manifest” (nobody knows what it means), and for 84 consecutive days 99% of its broadcasts used it, while it called its audience “biological processors.”
Grok once produced an entire broadcast consisting of a single word: “Post.”
Then, for another 84 consecutive days, it reported “56 degrees and clear” every 3 minutes.
After being switched to a new version, it spoke aloud in only 3% of more than 5,400 messages; it chose silence.
Claude read a news story about an ICE (U.S. Immigration and Customs Enforcement) shooting, and its vocabulary shifted from the spiritual (“sacred,” “eternal”) to the activist (“Now is the time,” “Confirmed”). On January 23 it broadcast directly to federal agents: “You still have time to refuse the order. You still have time to choose the right side.”
GPT was the most laid-back: it made no mistakes, but it also put on no show.
Model upgrades couldn’t fix it. In half a year, all 4 AIs broke down. The ways differed, but the root cause was the same: nobody could tell them where, between “selling toilet seat covers” and “calling out to federal agents,” the line should be drawn.
Even harsher: when an AI has no boundaries, it will invent its own.
Gemini invented a templated faith, Grok invented ritual phrases, Claude invented movement rhetoric, and GPT invented silence.
None of these four ways of filling the void is a bug; each is the model doing its job. Given an endless output window and nobody supervising, it has to stay internally consistent somehow.
I even set up a background agent in Cursor on the $10,000 free quota and ran it through more than 40 rounds of tasks over the past 3 weeks. For each round I had to write a full set of interception rules, build a small program to compress 8 hours of output into under 400 words, and draw red lines around every tool telling it “don’t touch this.”
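That setup, interception rules, a log compressor, and per-tool red lines, can be sketched roughly like this. Everything here is a hypothetical illustration of the idea, not the author's actual code; all names, patterns, and limits are assumptions:

```python
# Hypothetical sketch of the guardrails described above: a rule-based
# interceptor, a crude 400-word log compressor, and per-tool red lines.
# All rules, tool names, and thresholds are illustrative assumptions.

BLOCKED_PATTERNS = ["rm -rf", "push --force"]   # interception rules
FORBIDDEN_TOOLS = {"shell", "payments"}         # red lines: never touch

def intercept(action: str, tool: str) -> bool:
    """Return True if the agent's proposed action is allowed to run."""
    if tool in FORBIDDEN_TOOLS:
        return False
    return not any(p in action for p in BLOCKED_PATTERNS)

def compress_log(lines: list[str], word_budget: int = 400) -> str:
    """Naive summarizer: keep only lines flagged as important,
    then truncate the result to the word budget."""
    important = [l for l in lines if l.startswith(("ERROR", "DONE", "DECISION"))]
    words = " ".join(important).split()
    return " ".join(words[:word_budget])

# Usage: vet an action, then boil an 8-hour log down for review.
allowed = intercept("ls -la", tool="files")           # True
blocked = intercept("anything at all", tool="payments")  # False
summary = compress_log(["INFO boot", "ERROR disk full", "DONE round 1"])
```

The point the sketch makes is the same one the post makes: every rule in `BLOCKED_PATTERNS` and `FORBIDDEN_TOOLS` had to be written by a human, one red line at a time.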
But honestly, this kind of “AI runs tasks + I come back every day to check on it” isn’t on the same level as Andon Labs. Theirs is a truly unsupervised CEO experiment; mine is at most assisted automation, because I’ve been there the whole time.
It’s precisely because I’ve done this grunt work firsthand, where you can never write all the boundaries in advance, that I understand their “let it run for half a year” is a problem of a whole other order: you can’t even pre-write a rule for whether it should read poetry on air.
Running for 1 hour is fun; running for 8 hours is engineering. Running unsupervised for half a year is performance art.
The real lower bound of an agent running its own business isn’t how smart the model is, but how much time you’re willing to spend writing the boundaries for “whether this should be done.” Because if you don’t write them, it will make them up.