This is great and all but, who can actually afford to let these agents run on tasks all day long? Is anyone here actually using this or are these rollouts aimed at large companies?
I'm burning through so many tokens on Cursor that I've had to upgrade to Ultra recently, and I'm convinced they're tweaking the burn rate behind the scenes; the usage allowance doesn't seem proportional.
Thank god the open source/local LLM world isn't far behind.
Real numbers from today. FastAPI codebase, ~50k LOC. 4 agents, 6 tasks, ~6 min wall clock vs ~18-20 min sequential. 24 tests, 0 file conflicts.
Token cost: roughly 4x a single session.
To your cost question — agent teams are sprinters, not marathon runners. You use them for a 6-minute burst of parallel work, not all day. A 6-minute burst at 4x cost is still cheaper than 20 minutes at 1x if your time matters more than tokens.
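To make that tradeoff concrete, here's a quick back-of-the-envelope sketch. The 4x token multiple and the 6-vs-20-minute wall-clock numbers are the ones above; the dollar figures for tokens and developer time are made-up placeholders.

```python
# Break-even sketch: does a 4x-token parallel burst beat a sequential run
# once you price in the developer's waiting time? All dollar amounts are
# illustrative placeholders, not real pricing.

def total_cost(wall_minutes, token_cost_usd, dev_rate_usd_per_hour):
    """Cost of a run = tokens spent + developer time spent waiting."""
    return token_cost_usd + dev_rate_usd_per_hour * wall_minutes / 60

# Sequential: 20 min at 1x tokens; parallel: 6 min at 4x tokens.
sequential = total_cost(20, token_cost_usd=1.00, dev_rate_usd_per_hour=100)
parallel   = total_cost(6,  token_cost_usd=4.00, dev_rate_usd_per_hour=100)

print(f"sequential ~${sequential:.2f}, parallel ~${parallel:.2f}")
```

At a $100/hr rate the time saved dwarfs the extra token spend; the conclusion flips only when tokens are expensive relative to your hourly value.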
The constraint nobody mentions: tasks must be file-disjoint. Two agents editing the same file means overwrites. Plan decomposition matters more than the agents themselves.
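A cheap pre-flight check makes that decomposition explicit. This is a sketch (the task names and file paths are invented) that verifies the per-agent file sets are pairwise disjoint before launching anything:

```python
# Hypothetical pre-flight check: confirm the file sets assigned to each
# task are pairwise disjoint, so no two agents can write the same file.
# Task names and paths are illustrative.
from itertools import combinations

def find_conflicts(assignments):
    """Return every pair of tasks whose file sets overlap."""
    conflicts = []
    for (a, files_a), (b, files_b) in combinations(assignments.items(), 2):
        shared = files_a & files_b
        if shared:
            conflicts.append((a, b, shared))
    return conflicts

tasks = {
    "add-auth":    {"app/auth.py", "app/models.py"},
    "add-billing": {"app/billing.py"},
    "refactor-db": {"app/models.py", "app/db.py"},  # overlaps with add-auth
}

for a, b, shared in find_conflicts(tasks):
    print(f"conflict: {a} and {b} both touch {sorted(shared)}")
```

If this prints anything, split the offending files into one task before spinning up the team.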
One thing to watch: Claude Code crashed mid-session with a React reconciler error (#23555). 4 agents + MCP servers pushes the UI past its limits.
Need it be actually disjoint? Interested in learning about the limitation here because apparently the agents can coordinate.
Otherwise what’s the difference between what they are providing vs me creating two independent pull requests using agents and having an agent resolve merge conflicts?
It does need to be disjoint. The docs (https://code.claude.com/docs/en/agent-teams) are explicit:
"Two teammates editing the same file leads to overwrites. Break the work so each teammate owns a different set of files."
File locking is for task claiming (preventing two agents from grabbing the same task), not for file writes:
"Task claiming uses file locking to prevent race conditions when multiple teammates try to claim the same task simultaneously."
The coordination layer (TaskList, blockedBy, SendMessage) handles logical task sequencing, not concurrent file access. You can make agent B wait for agent A via dependencies, but that serializes the work and kills the parallelism benefit.
"To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:
Claude takes a "lock" on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one.
Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out."
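The pattern in that quote is just atomic lock-file creation. A minimal sketch, assuming a shared filesystem (in the quoted setup, git push/pull is the actual synchronization point; the current_tasks/ directory and task name come from the quote):

```python
# Sketch of lock-file task claiming: creating the lock file is atomic, so
# if two agents race for the same task, exactly one succeeds and the other
# must pick a different task.
import os

def try_claim(task, lock_dir="current_tasks"):
    """Atomically create a lock file; return False if another agent holds it."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{task}.txt")
    try:
        # O_CREAT | O_EXCL fails if the file already exists -> atomic claim.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release(task, lock_dir="current_tasks"):
    """Remove the lock file so the task can be claimed again."""
    os.remove(os.path.join(lock_dir, f"{task}.txt"))
```

Note this protects the task queue, not the files the task touches, which is why the disjointness requirement above still stands.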
A Claude max 20x plan and you’ll be fine. I’d been doing my normal process of running 4 Claude sessions in parallel because that was about the right amount of concurrent sessions for me to watch what’s going on and approve/deny plans and code… and this blows it out of the water. With an agent swarm it’s so fast at executing and testing I’m limited by my idea and review capabilities now. I tried running 2 and I can’t keep up, I’m defining specs and the other window is done, tested, validated and waiting for me.
If it could do anything that a junior dev could, that’d be a valid point of comparison. But it continually, wildly performs slower and falls short every time I’ve tried.
Trying to make a media player and media server, using ffmpeg and a pre-built media streaming engine as its core. Python and SQLite. About a week's worth of effort every time, until it begins to go too far off the rails to be reliable enough to keep developing with. It never did get the ffmpeg commands right (I had to go back to crafting those by hand), and it never did get the streaming engine to play in the browser's video player in the supported HLS and DASH formats. I asked it to build a file and file-metadata caching layer and then had to keep re-prompting it to poll the caching layer before trying to get values from the database. It never even got to the library, metadata, or library-image functionality. And I had to ask it to create the RBAC permissions model I wanted, despite it being very junior-level common sense (super-admin, user-admin, metadata admin, image admin).
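For reference, the hand-crafted commands look roughly like this: a minimal HLS packaging invocation built as an argument list. The flags are standard ffmpeg options, but the codec choices, segment length, and paths are placeholders to adapt, not a drop-in fix.

```python
# Sketch of a minimal ffmpeg HLS packaging command. Flags are real ffmpeg
# options; source path, output dir, and tuning values are placeholders.
def hls_command(src, out_dir):
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-c:a", "aac",      # browser-friendly codecs
        "-f", "hls",
        "-hls_time", "6",                      # ~6-second segments
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/seg_%03d.ts",
        f"{out_dir}/index.m3u8",
    ]
```

This is the sort of thing that's easy to verify by hand and easy for a model to get subtly wrong (wrong encoder name, missing segment filename pattern, playlist in the wrong place).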
I recently built something in the same universe: using ffmpeg to receive streams from OBS to capture audio and video. I don't want to get into details except to say it involved a fairly involved pipeline of Ray actors and a significant admin interface with NiceGUI. I had no problem doing this with Claude. You need to give it access to look up how to do things, like Context7. If you are doing something very specific, you need to have a session that does research to build a skill so it doesn't need to redo that research every time. And yes, you do need to tell it the architecture and be fairly detailed about things like how you want RBAC.
Using these tools takes quite a bit of effort, but even after doing all those steps to use the tool well, I still got this project done in a few days when it otherwise would have taken me 1-2 months and likely would simply never have happened at all.
I'm curious which harness and which model(s) you've been using.
And whether you have a decent PRD or spec. Are you trying to prompt the harness with one bit at a time, or did you give it a complete spec and ask it to analyze it and break it down into individual issues with dependencies (e.g. using beads and beads_viewer)?
I'm not looking for reasons to criticize your approach or question your experience, but your answers may point to opportunities for you to get more out of these tools.
> A. You're working on some really deep thing that only world-class experts can do, like optimizing graphics engines for AAA games.
This is a relatively common skill. One thing I always notice about the video game industry is it's much more globally distributed than the rest of the software industry.
Being bad at writing software is Japan's whole thing but they still make optimized video games.
It’s a simple compiler optimization over bayesian statistics. It’s masters-level stuff at best, given that I’m on it instead of some expert. The codebase is mixed python and rust, neither of which are uncommon.
The issues I ran into are primarily “tail-chasing” ones - it gets into some attractor that doesn’t suit the test case and fails to find its way out. I re-benchmark every few months, but so far none of the frontier models have been able to make changes that have solved the issue without bloating the codebase and failing the perf tests.
It’s fine for some boilerplate dedup or spinning up some web api or whatever, but it’s still not suitable for serious work.
When really solid programmers who started skeptical (and even have a ban policy if PR submitters don’t disclose they used AI) now show how their workflows have been improved by AI agents, it may be worth trying to understand what they are doing and you are not.
Claude would be worse than an expert at this, but this is a benchmarkable task. Claude can do experiments a lot quicker than a human can. The hard part would be ensuring that the results aren't just gaming your benchmark.
Companies are not comparing it straight to juniors. They're more making a comparison between a Senior with the assistance of one or more juniors, vs a Senior with the assistance of AI Agents.
I feel like comparing it just to a junior developer is also becoming fairly outdated. Yes, it is worse in some ways, but also VASTLY superior in others.
It’s funny so many companies making people RTO and spending all this money on offices to get “hallway” moments of innovation, while emptying those offices of the people most likely to have a new perspective.
I guarantee you that price will double by 2027. Then it’ll be a new car payment!
I’m really not saying this to be snarky, I’m saying this to point out that we’re really already in the enshittification phase before the rapid growth phase has even ended. You’re paying $200 and acting like that’s a cheap SaaS product for an individual.
I pay less for Autocad products!
This whole product release is about maximizing your bill, not maximizing your productivity.
I don’t need agents to talk to each other. I need one agent to do the job right.
$200/month is peanuts when you are a business paying your employees $200k/year. I think LLMs make me at least 10% more effective and therefore the cost to my employer is very worth it. Lots of trades have much more expensive tools (including cars).
I think it depends on the tasks you use it for. Bootstrapping or translating projects between languages is amazing. New feature development? Questionable.
I don’t write frontend stuff, but sometimes need to fix a frontend bug.
Yesterday I fed claude very surgical instructions on how the bug happens, and what I want to happen instead, and it oneshot the fix. I had a solution in about 5 minutes, whereas it would have taken me at least an hour, but most likely more time to get to that point.
Literally an hour or two of my day was saved yesterday. I am salaried at around $250/hour, so in that one interaction AI saved my employer $250-500 in wages.
AI allows me to be a T shaped developer, I have over a decade of deep experience in infrastructure, but know fuck all about front end stuff. But having access to AI allows me as an individual who generally knows how computers work to fix a simple problem which is not in my domain.
Maybe this is a gray area, but that's kind of my experience with it too. I understand what I want to happen, but don't understand the language and it produces a language specific result that is close enough, maybe even one-shot, for me to use. I categorize this under translation.
My process, which probably wouldn't work with concurrent agents because I'm keeping an eye on it, is basically:
- "Read these files and write some documentation on how they work - put the documentation in the docs folder" (putting relevant files into the context and giving it something to refer to later on)
- "We need to make change X, give me some options on how to do it" (making it plan based on that context)
- "I like option 2 - but we also need to take account of Y - look at these other files and give me some more options" (make sure it hasn't missed anything important)
- "Revised option 4 is great - write a detailed to-do list in the docs/tasks folder" (I choose the actual design, instead of blindly accepting what it proposes)
- I read the to-do list and get it rewritten if there's anything I'm not happy with
- I clear the context window
- "Read the document in the docs folder and then this to-do list in the docs/tasks folder - then start on phase 1"
- I watch what it's doing and stop if it goes off on one (rare, because the context window should be almost empty)
- Once done, I give the git diffs a quick review - mainly the tests to make sure it's checking the right things
- Then I give it feedback and ask it to fix the bits I'm not happy with
- Finally commit, clear context and repeat until all phases are done
Most of the time this works really well.
Yesterday I gave it a deep task that touched many aspects of the app. This was a Rails app with a comprehensive test suite, so it had lots of example code to read, plus it could give itself definite end points (they often don't know when to stop). I estimated it would take me 3-4 days to complete the feature by hand. It made a right mess of the UI, but it completed the task in about 6 hours, and I spent another 2 hours tidying it up and making it consistent with the visuals elsewhere (the logic and back-end code were fine).
So either my original estimate is way off, or it has saved me a good amount of time there.
New feature development in web and mobile apps is absolutely 10% more productive with these tools, and anyone who says otherwise is coping. That's a large fraction of software development.
Yes, the research is wrong. And in science, it's not taboo to call that out.
It's outdated, and it doesn't differentiate between people trying to incorporate it into their current workflow and people who apply themselves to entirely new ones. It doesn't represent me in any way, and I am releasing features to my platform daily now, instead of weekly. So I can wholeheartedly disagree with its conclusion.
The earth is either flat or it isn't. It's easy to prove it's not flat. It's not easy to conclude that the results of a study in a field that changes daily represent all the people working in it, including the ones who did not participate.
If it is so self-evident that the research is wrong, that means there should be some research that supports the opposite conclusion then? Maybe you can link it?
The reason we don’t see any other research is because it’s neigh impossible to study a moving field. Especially at this pace.
If you have any ideas on how to measure objectively while this landscape changes daily, please share them with us. Maybe a researcher will jump on this bandwagon and prove you right.
I proposed a logically consistent perspective where both my experience and the study are true at the same time. What is your response to that, other than comparing me to a flat-earther? Do you have something useful to contribute?
Honestly, that is a “skill issue” as the kids these days say. When used properly and with skill, agents can increase your productivity. Like any tool, use it wrong and your life will be worse off. The logically consistent view if you want to believe this study and my experience is that the average person is hindered by using AI because they do not have the skills, but there are people out there who gain a net benefit.
It drives me nuts that people take the mean of AI code generation results and use that to make claims about what AI code generation is capable of. It's like using the mean basketball player to argue that people like LeBron and Jordan don't exist.
For sure. I like having discussions with nuanced takes, these are tools with strengths and weaknesses and being a good tool user includes knowing when not to pick it up.
It’s a skill issue, which means you can’t fire any of your highly skilled employees, which means it has the same value as any other business organization tool like Jira or Microsoft Excel, approximately $10-20 per user per month.
Autodesk Fusion for manufacturing costs less than Claude Max and you literally can’t do your job without it.
So Autodesk takes you from 0 to 100% productivity for under $200 a month and companies are expected to pay $200+ to gain an extra 10-20%?
That math isn’t how it works with any other business logic tools.
I pay $200/month, don’t come near the limits (yet), and if they raised the price to $1000/month for the exact same product I’d gladly pay it this afternoon (Don’t quote me on this Anthropic!)
If you’re not able to get US$thousands out of these models right now either your expectations are too high or your usage is too low, but as a small business owner and part/most-time SWE, the pricing is a rounding error on value delivered.
As a business expense to make profit, I can understand being ok with this price point.
But as an individual with no profit motive, no way.
I use these products at work, but not as much personally because of the bill. And even if I decided I wanted to pursue a for-profit side project, I'd have to validate its viability before even considering a $200 monthly subscription.
I'm paying $100 per month even though I don't write code professionally. It is purely personal use. I've used the subscription to have Claude create a bunch of custom apps that I use in my daily life.
This did require some amount of effort on my part, to test and iterate and so on, but much less than if I needed to write all the code myself. And, because these programs are for personal use, I don't need to review all the code, I don't have security concerns and so on.
$100 every month for a service that writes me custom applications... I don't know, maybe I'm being stupid with my money, but at the moment it feels well worth the price.
With US salaries for SWEs, $1,000/month is not a rounding error for all, but it definitely is for some. Say you make $100/hr and CC saves you, say, 30 hrs/month: not a rounding error, but a no-brainer. If you make $200+/hr, it starts to become a rounding error. I have multiple Max accounts at my disposal and at this point would for sure pay $1,000/month for the Max plan. It comes down to simple math.
1. 1-3 LLM vendors are substantially higher quality than the other vendors, and none of those are open source. This is an oligopoly, and the scenario you described will play out.
2. >3 LLM vendors are all high quality and suitable for the tasks. At least one of these is open source. This is the "commodity" scenario, and we'll end up paying roughly the cost of inference. This still might be hundreds per month, though.
3. Somewhere in between. We've got >3 vendors, but 1-3 of them are somewhat better than the others, so the leaders can charge more. But not as much more than they can in scenario #1.
It's clear what's gonna play out. Chinese open-source labs are slowly closing the gap, and as American frontier labs hit diminishing returns on various tasks, the Chinese models are going to be good enough for the vast majority of use cases. This is going to strip American labs' ability to do monopoly plays and force them into open behavior.
The only place frontier labs will be able to profit-take is niche models for specific purposes, where they can tightly control who has access to traces. Any general-purpose LLM with highly available traces is gonna get distilled down instantly.
> I’m saying this to point out that we’re really already in the enshittification phase before the rapid growth phase has even ended. You’re paying $200 and acting like that’s a cheap SaaS product for an individual.
Traditional SaaS products don't write code for me. They also cost much less to run.
I'm having a lot of trouble seeing this as enshittification. I'm not saying it won't happen some day, but I don't think we're there. $200 per month is a lot, but it depends on what you're getting. In this case, I'm getting a service that writes code for me on demand.
We can see, especially in the case of Claude Max, that while it sounds like you're getting better value than the cheaper plans, the company is now encouraging less efficient use of the tool (having multiple agents talk to each other, rather than improving models so that one agent does the work correctly).
> Traditional SaaS products literally “write code” for you (they implement business logic). See: Zapier, Excel.
Eh, I'd call those a sort of programming language. The user is still writing code, albeit in a "friendlier" manner. You can't just ask for what you want in English.
> The enshittification is that the costs are going up faster than inflation and companies like OpenAI are talking about adding advertisements.
In 1980, IT would have cost $0 at most companies. It's okay for costs to go up if you're getting a service you were not getting before.
In 1980, the costs associated with what we today call IT were not $0, they were just spread around in administrative clerical duties performed by a lot of humans.
Okay, but I think the analogy still works with that framing. These AI products can do tasks that would previously have been performed by a larger number of humans.
I could write an essay about how almost everything you wrote either is extremely incorrect or is extremely likely to be incorrect. I am too lazy to, though, so I will just have to wait for another commenter to do the equivalent.
Because, while I have been a huge AI optimist for decades, I generally don't like their current writing output. And even if I did, it would feel like plagiarism unless I prepended it with "an AI responded with this:", which would make me seem lazy. (Though I did already just admit I am very lazy in my first post, so perhaps that is what I will do going forward once they become better writers.)
I mean, what you get with Claude Code Max is insane: something like 30x the equivalent API token price. If you don't spend it all, that's your own fault. That must be below electricity cost.