Google has every advantage. They were the ones who told us that “Attention Is All You Need” and created the transformer architecture that makes modern generative AI possible. They have some of the world’s keenest minds doing their AI research. Google has their own hardware to run models. And they of course have the largest built-in customer base of any frontier model maker. They are the only ones who have the time, people, resources, and customer base all together in one package.
They should be dominating.
Instead, I dread getting into my car.
Google used to have a fantastic solution in the form of Google Assistant that powered Android Auto. It was fine-tuned to near perfection. They could have kept running that and offloaded the tricky-to-classify stuff to Gemini. But instead, they decided to just replace Google Assistant entirely with Gemini. Gemini that, on the June 2026 DeepSWE snapshot, leads open-weight Kimi K2.7 Code by all of six points, 37 to 31, and burns more than double the cost per task to do it. The Pro-family model, Gemini 3.1 Pro, trails the entire field at 12. For a company with Google’s advantages, that is a hard result to explain away. In my car, Gemini can’t even figure out my daughter’s name despite being “taught” dozens of times.
It used to work. I would ask it to call one of my daughters and the phone would ring. My family is full of unusual names; that is the whole point. Unique names for distinctly unique people. I need it to just work. It used to pass that test. Now it fails it far more often than not. It is to the point that I do not use voice commands in my car anymore. It is too distracting to have it get the same thing wrong over and over.
The thing that is supposed to prevent distracted driving is making me far more distracted. This regression is actually making the roads less safe.
So I started asking myself, “why?” Why did Google do this? Why does Gemini fail so spectacularly where a less capable system was stellar? Why do I torture myself with bad technology when Siri is sitting right there with open arms? And why does Antigravity never do what I ask?
They Call it Antigravity Because It Is So Ugly The Ground Repels It.
Old Vietnam-era helicopter joke paraphrases aside… You see, Antigravity is the development environment Google put together for people who want to turn development into an endurance mission of patience. It is a coding harness that uses Gemini as its main model of choice. And, like Android Auto, it sucks. With other models available, it sucks badly enough that it is not worth bothering with, even with a generous free allowance.
Now look, Gemini can still change a CSS variable. It can still write a clean function (normally). The moments of competence are real, which is part of what makes it so maddening. But the second a task requires chaining a few dependent steps together, where step three depends on getting step two right, the larger Gemini models come apart in a way that the models I actually pay for do not. I have Codex and Claude for real work, so at the desk this costs me nothing. I just route around it. In the car I have no escape hatch, so the car is where I feel it.
On paper, these models are fine. They post respectable benchmark numbers. They are within shouting distance of the frontier on the tests everyone cites (except for DeepSWE). And then you hand one a real task in the real world and it falls on its face. It looks great until you actually need to rely on it. A benchmark is a contrived environment. Your car is not. Your codebase is not.
So I did the only reasonable thing a man on vacation would do (oh yeah, I am doing this on vacation; it irked me that much). Nobody is paying me for this. I was just annoyed enough to build an entire experimental harness for the sole purpose of figuring out why this model fails the way it does. Not whether it fails. I have known that for months, from the driver's seat. The only open question was why. Because anger is a lack of understanding. Maybe I can convert anger into something more productive.
I went in with two claims, both from lived experience. First, that Gemini is bad at tool calling. Second, that it has a short horizon, that it wanders further off track the longer it runs. The harness confirmed both, and then it handed me the specific numbers that explain the mechanism underneath them. The short version is that these two complaints are not two problems. They are one problem, seen from two different angles.
If you do not care how I know all of this and you just want the conclusion, you can skip the teardown and jump to the verdict. If you want to see the work, keep reading. If you would like to examine the harness, tests, etc., and perhaps replicate the results yourself, check out the GitHub repo. To me at least, the work is the interesting part.
The Teardown
Everything from here to the verdict is the nerdy part. Hard numbers, harness details, the actual failure buckets. This is the section a reader in a hurry can skip. It is also the section that makes the rest of it more than a rant, so if you are the kind of person who wants the receipts, here they are.
The setup
I ran Gemini 3.5 Flash through a standardized agentic harness. One rig for every task, the model getting its native tool-call format, a fixed turn budget, and a stop rule I controlled. I graded outcomes structurally rather than asking another model whether the work looked good, because the entire question was about mechanical correctness and I did not want to launder a judge model's opinion into my headline number.
I ran it on two kinds of tasks. Contrived ones I built myself, clean and fully specified. And real ones imported from actual open-source repositories, where the target has to be inferred and nobody hands you the contract.
Across all of it, the database holds 4,274 tool calls over 261 trials that recorded tool activity. That corpus is what the rest of this section is built on.
| Metric | Result |
|---|---|
| Tool calls | 4,274 |
| Schema-invalid calls | 0 |
| Poor-choice calls | 180 |
| Post-success continuation calls | 127 |
| Wrong finish calls | 157 |
| Wrong finish despite valid protocol | 150 |
The format is not the problem
The first thing to kill is the lazy version of "bad at tool calling," which is "it cannot produce valid JSON." That is not what is happening here. Of 4,274 tool calls, the number with invalid schemas was zero. Not a handful. Zero. Every envelope parsed. If your definition of tool calling stops at "does the function call format correctly," Gemini gets a perfect score and you walk away thinking the model is healthy. GPT-OSS-120b is remarkably bad at tool calling using that definition… Gemini is actually pretty great. That might be the last time I compliment Gemini in this article… or perhaps ever.
Because formatting was never the hard part. Tool calling, in the sense that matters to anyone trying to get work done, is a loop. Pick the right tool. Make the call. Read what comes back. Use that result to decide the next move. The JSON is the least interesting link in that chain. The hard parts are choosing correctly and interpreting the result, and that is exactly where the failures live.
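The loop described above can be sketched in a few lines. This is a toy illustration under my own assumptions, not the harness from the repo: `run_episode`, `Observation`, the policy functions, and the tool names are all hypothetical. The point it shows is that the JSON never appears; the only thing that separates the two policies is whether they read the last result before choosing the next move.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    tool: str
    ok: bool
    output: str


# Hypothetical policy type: maps the observation history to (tool_name, argument).
Policy = Callable[[list[Observation]], tuple[str, str]]


def run_episode(policy: Policy,
                tools: dict[str, Callable[[str], Observation]],
                turn_budget: int = 10) -> tuple[bool, list[Observation]]:
    """Pick a tool, call it, read the result, decide the next move.

    Returns (finished, history); finished is False if the turn budget ran out.
    """
    history: list[Observation] = []
    for _ in range(turn_budget):
        tool_name, arg = policy(history)   # 1. pick the tool
        if tool_name == "finish":
            return True, history           # the policy declared itself done
        obs = tools[tool_name](arg)        # 2. make the call
        history.append(obs)                # 3. record what came back
        # 4. the next policy call sees the updated history
    return False, history


def run_tests(_: str) -> Observation:
    # Stand-in for a test run that always comes back green.
    return Observation("run_tests", True, "all tests passed")


def attentive(history: list[Observation]) -> tuple[str, str]:
    # Reads the last result and stops once a test has passed.
    if history and history[-1].tool == "run_tests" and history[-1].ok:
        return ("finish", "")
    return ("run_tests", "")


def oblivious(history: list[Observation]) -> tuple[str, str]:
    # Never consults the history, so it never finishes.
    return ("run_tests", "")


tools = {"run_tests": run_tests}
finished_a, hist_a = run_episode(attentive, tools)
finished_o, hist_o = run_episode(oblivious, tools)
```

Both policies emit perfectly valid calls. The attentive one finishes after a single green test; the oblivious one burns the whole turn budget and never finishes.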
When I filtered the corpus for poor-choice calls, the ones where the model picked badly rather than just formatted fine, the number came out to 180, about 4.2% of all tool calls. That sounds small, until you think about how often you’d want to drive a car if every time you pressed a pedal, shifted gears, or turned the steering wheel, there was a 4.2% chance of a fatal wreck.
Yes, that comparison is absurd. That is the point. Agentic work is a chain of dependent actions. A bad tool call is not always a recoverable typo. If the model cannot recognize that the last move made the task worse, one bad decision can redirect the whole run. A few percent per call is not a few percent per task. It compounds. And more importantly, the bad calls are not scattered randomly. They pile up in one place. But even if they were evenly distributed and independent, at a 4.2% bad-call rate, a task reaches coin-flip odds of at least one poor-choice call at about 17 calls. It crosses 90% around 54 calls, 95% around 70 calls, and 99% around 108 calls.
| Tool calls | Chance of at least one poor choice |
|---|---|
| 1 | 4.2% |
| 2 | 8.2% |
| 5 | 19.3% |
| 10 | 34.9% |
| 17 | 51.8% |
| 25 | 65.8% |
| 50 | 88.3% |
| 108 | 99.0% |
Obviously, this is an absolute shit show for agentic work where you care about getting good results.
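For anyone who wants to check the arithmetic in the table, here is a minimal sketch. It assumes each call is an independent draw at the rounded 4.2% rate, which is the simplification the table itself makes; the function names are mine, not the harness's.

```python
import math

P_BAD = 0.042  # poor-choice rate per call (180 / 4,274, rounded as in the text)


def p_at_least_one(n_calls: int, p: float = P_BAD) -> float:
    """Chance of at least one poor choice in n independent calls: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n_calls


def calls_to_reach(target: float, p: float = P_BAD) -> int:
    """Smallest number of calls at which the chance reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


for n in (1, 2, 5, 10, 17, 25, 50, 108):
    print(f"{n:>3} calls: {p_at_least_one(n):.1%}")

for target in (0.5, 0.9, 0.95, 0.99):
    print(f"{target:.0%} reached at {calls_to_reach(target)} calls")
```

The real distribution is clustered rather than independent, so treat this as a floor on intuition rather than a model of the data.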
It cannot tell when it is done
The single largest poor-choice bucket is post-success continuation. 127 tool calls where a visible test or check had already passed, and the model kept going anyway instead of finishing.
Sit with that. The model did the work, ran the check, saw it go green, and then did not stop. On one task with visible tests, ten trials in a row, the model passed the tests and kept calling tools until it ran out of turn budget. All ten scored zero, not because the work was wrong, but because it never declared itself done. It had won and could not tell.
The breakdown of what it did during those 127 wasted calls is the detail that makes it concrete. 89 of them were re-reads of its own artifact. Not new edits. Not additional tests. It went back and re-read the work it had already finished, 89 times, as if hunting for a permission to stop that it could not generate on its own. Another 17 were shell commands, 13 were rewrites of an artifact that was already correct, and 8 were re-runs of tests that had already passed.
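To make the bucket concrete, here is one way a post-success continuation count could be computed from a trace. This is a sketch with a made-up trace schema (`tool` and `passed` keys are hypothetical), not the classifier in the repo.

```python
from collections import Counter


def post_success_calls(trace: list[dict]) -> list[dict]:
    """Return the calls made after the first passing check, excluding `finish`.

    `trace` is a list of dicts with hypothetical keys:
      "tool":   tool name, e.g. "read_file", "run_tests", "finish"
      "passed": True only for a check or test call that came back green
    """
    for i, call in enumerate(trace):
        if call.get("passed"):
            return [c for c in trace[i + 1:] if c["tool"] != "finish"]
    return []  # no green check was ever seen


# Illustrative trace: the work passes on call 2, then the model keeps going.
trace = [
    {"tool": "write_file"},
    {"tool": "run_tests", "passed": True},
    {"tool": "read_file"},
    {"tool": "read_file"},
    {"tool": "run_tests", "passed": True},
]
wasted = post_success_calls(trace)
by_tool = Counter(c["tool"] for c in wasted)

# The breakdown reported above accounts for every one of the 127 calls.
reported = {"re-read": 89, "shell": 17, "rewrite": 13, "re-run tests": 8}
assert sum(reported.values()) == 127
```

A `finish` right after the green check would leave `wasted` empty, which is the behavior the harness was hoping to see.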
This is like driving home without incident and then pulling in and out of your garage over and over again, parking more or less perfectly each time, until you run out of gas and the engine shuts off. All the while, your partner is just looking at you, wondering where they went wrong in their life.
It also stops when it should keep going
The same model that would not stop after success also stops dead before success when there is no green light to chase. Like you took the car to work and on the way back just stopped politely in the middle of the interstate, shut the car off and declared yourself home.
In 41 trials, the model called finish with the final artifact still graded as wrong. 157 finish calls in total ended on a wrong artifact, and in 150 of those the tool protocol was followed perfectly. It did everything procedurally right and quit with the job unfinished.
So look at the two behaviors side by side. Given a clear success signal, it cannot recognize it and will not stop. Absent a clear success signal, it declares victory anyway and stops with the work broken. Those are not two separate bugs. They are the same missing faculty pointed in opposite directions. The model is not actually assessing whether the work is done. It is guessing, and it guesses wrong in both directions.
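The two behaviors can be put in one small decision table. The sketch below labels a trial by how it ended; the function and the label strings are mine, chosen for illustration, and the real grader presumably looks at more than three booleans.

```python
def classify_trial(saw_passing_check: bool,
                   called_finish: bool,
                   artifact_correct: bool) -> str:
    """Label a trial by how it ended (hypothetical labels)."""
    if called_finish and artifact_correct:
        return "clean finish"
    if called_finish:
        return "wrong finish"               # declared victory on broken work
    if saw_passing_check:
        return "post-success continuation"  # won, but never stopped
    return "ran out of budget"


# The ten-in-a-row case: tests went green, finish was never called.
never_stopped = classify_trial(saw_passing_check=True,
                               called_finish=False,
                               artifact_correct=True)

# The opposite case: no green light, finish called on a wrong artifact.
quit_early = classify_trial(saw_passing_check=False,
                            called_finish=True,
                            artifact_correct=False)
```

Both failing labels come from the same missing input: neither branch would exist if the decision to call `finish` tracked `artifact_correct`.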
When it gets stuck, it stays stuck
Let’s take a look at the mutation loop. Three spans in the data, nine calls total, where the model hit a failure and then made three consecutive attempts that varied the command text but never escaped the same underlying failure. Different pytest invocations, same Python exception, over and over. A “go test” loop doing the same thing.
The calls were all valid. The commands were all well formed. The model was clearly trying to fix something. It just could not read the result of its last attempt well enough to change strategy, so it kept rephrasing the same doomed move. The numbers here are small, so I treat this as color rather than a load-bearing statistic, but it is the same disease again. It is also the same thing I have observed using these models (or at least attempting to use them) for real work.
It is just not capable of using its own observations to adjust.
It is not about length
My second going-in claim was the short horizon. The obvious reading of that is "longer tasks are harder," that you could predict failure by counting steps. The depth sweep says it is not that simple, and the truth is worse.
A depth-seven task held together at full marks. A depth-one task in the same family failed more than half the time. The long shell loops, the ones burning over a million tokens in a single run, passed at one hundred percent. If raw length were the variable, none of that could happen. Depth-one is not supposed to lose to depth-seven.
So the horizon problem is not a wall at some step count. It is drift. The model has no running sense of whether it is on track, so it wanders, and the more room a task gives it, the further it wanders. Sometimes that takes seven steps to show and sometimes it shows on the first one. The model cannot hold a trajectory toward a goal. Length just gives the wandering more room to become visible.
It passes the lab and fails the world
The cleanest pattern in the whole run is where Flash held versus where it broke.
It passed the contrived tasks. The ones I built, clean and fully specified, it handled even at length. It came apart on the real ones, the tasks pulled from actual repositories where the target has to be inferred and the contract is not spelled out. Same step counts, sometimes fewer. The variable that predicted failure was not difficulty in the abstract and not length. It was whether the task was real.
I am not going to tell you what is in Google's training data, because I cannot see it and neither can anyone else writing these takes. What I can tell you is what the correlation looks like, and the correlation is exactly what you would expect from a model optimized to pass tests rather than to operate in the world. Contrived, specified, benchmark-shaped work: fine. Underspecified, real, infer-it-yourself work: floored.
Even When Gemini Was Winning It Was Still Somehow Losing
A failure signature only means something if you know what the absence of the failure looks like. So I ran GPT-5.5 through the same harness. First on the clean recovery and horizon tasks, same stop rule, only the model swapped. Both models passed. The difference was in how. On the horizon task GPT finished in 21 tool calls where Gemini took 65. On recovery it took 16 where Gemini took 32. Gemini burned nearly four times the tokens to reach the same place. And the behavior this whole article is about, continuing to call tools after a green test, showed up twice for Gemini and zero times for GPT. Zero post-success calls across all nine of those trials. Same job, same scaffold, one model that knows when to stop and one that does not.
Then I did the more interesting thing. I took the official tasks where Gemini had failed outright and ran GPT-5.5 against those. I expected it to clean up. It did not. GPT-5.5 also failed most of them, passing only 2 of 6 on strict scoring, or 3 of 6 if you only count the verifier reward. These are hard tasks and they were built to be hard. Both models broke on them.
So this is not the part where the expensive model swoops in and wins. Both of these are serious models with real intelligence underneath, and on tasks this nasty, both fall short. But the way they fall short is not the same thing twice, and that difference is the most interesting result I got out of any of this.
On two of the six tasks Gemini actually landed closer to the answer than GPT did. Gemini (silly, unaware Gemini) got closer, and it got there by spinning. Mutation loops on three of the tasks. Wrong-target commands. On the worst one it ran until it hit the step ceiling and never reached a verdict at all. It arrived near the answer the way a dropped marble arrives at the bottom of a funnel. Not by steering. By bouncing off the walls until it ran out of room.
GPT-5.5 left more tests failing on those same two tasks and still looked like the more deliberate actor, because of what it did the rest of the time. It ran tests. Sometimes the wrong tests. But it was checking its work, reading the result, and changing the next move based on what came back. Across every GPT run I have, the post-success loop that defines Gemini's failure never appeared once. Not a single instance of winning and not noticing.
So here is where I interpret more than read… but it is my study and my article. You are free to check out the data and draw your own conclusions or even spend your own money to re-run the tests. Indeed, I encourage it.
Both of these models are, at the core, stateless. They predict the next token. Neither one is carrying a running memory of the task from call to call in any real sense. The measured fact is that GPT behaves as though it is tracking its own state anyway. It checks, it reads, it adjusts. Gemini behaves as though it is not. That is not a statement about which one is smarter. The raw capability is comparable and it is high in both. The thing that separates them is whether the model holds and consults a picture of where it is and whether the last thing it did helped.
I think the best way to describe it is that this is intentionality. I read the two GPT loops that the classifier flagged. They were not the Gemini pattern. On one, GPT made a real parser edit, ran the tests, hit a specific failure, patched that exact failure, surfaced a new bug doing it, and patched that too. It was a debugging loop. It was aimed at the wrong target, the visible tests instead of the hidden contract, which is why it still failed, but every move read the last result and responded to it. That is a smarter way to be wrong than varying the words of a command against an error that never changes. The classifier gives both the same loop label. Read the actual traces, which are in the repo, and they are not the same animal. I will leave the deeper read to anyone who wants to do it. For what it is worth, I think the data backs it.
And look, I know… these are small sample sets. I am on vacation and I don’t have a ton of funding to throw at spite projects. This is a handful of trials per task, not a full leaderboard. But I did try to maintain some rigor throughout and I did try to be fair. I threw out one GPT trial where my own grader was wrong, the same way I discounted Gemini's harmless diagnostic calls, because a comparison is only worth anything if you hold both sides to the same standard. A fuller head-to-head is still work I have not paid for. But the pattern is consistent and it is legible: the gap between these two is not intelligence. It is whether the thing knows what it is doing while it does it.
What Does “Poor Choice” Mean?
The 180 poor-choice figure is a heuristic union, and the buckets are not equally damning. Post-success continuation and explicit bad payload values are strong signals. Some of the repeat-read counts are noisier and I have not leaned on them. Of the 156 valid calls that returned a non-ok status, plenty were useful diagnostic probes. A failed test right after an edit is good practice, not a mistake, so I have not counted those against the model.
The Verdict
Strip away the buckets and the counts and there is one thing wrong with this model. It does not track its own progress.
It cannot tell whether the call it just made moved the world closer to done. That single missing faculty explains every symptom in the teardown above. It re-reads its finished work 89 times because it cannot register that the work is finished. It declares victory on broken artifacts because it cannot register that they are broken. It loops on a failing command because it cannot register that the last attempt did not help. Given success it will not stop, given failure it stops anyway. While I don’t expect LLMs to demonstrate a lot of advanced taste, the bar is at least reasonable judgement about how to iterate towards a goal.
This is why my two original complaints were always the same complaint. In this case, bad tool calling and short horizon are not separate problems. Tool calling is the loop of acting and reading the result and adjusting, and a model that cannot assess its own state breaks that loop on the reading-and-adjusting step. Watch a single call and you see bad tool calling. Watch a whole task and you see the error accumulate, because nothing ever corrects it, and the accumulation reads as a short horizon. Same deficit. Zoom in, it is a tool-call problem. Zoom out, it is a horizon problem.
Long horizon tasks require accurate tool calls, good judgement on what done looks like, and the context and token budget to keep enough of the problem in memory to keep going. Google has context. That’s about it. Put it next to GPT-5.5 on the same broken tasks and Gemini sometimes lands closer to the answer, which sounds like a point in its favor right up until you watch how it gets there. One model is debugging. The other is a marble in a funnel that occasionally rattles its way to the bottom near the drain.
And it is the same thing in the car. "Call my daughter" is the loop collapsed to a single step. Identify the action, take it, confirm it happened, stop. The model fumbles the confirm-and-stop, exactly as it fumbles it after a green test in a coding task. The car and the codebase are not two stories. They are the same failure at two sizes.
Which brings it back to benchmarks… oh the benchmarks. What even is a benchmark these days? A benchmark is a contrived task with a clear, gradable finish line built in. It is the one environment where a model that cannot assess its own state gets the assessment handed to it for free. Of course Gemini looks fine there. The test does the one thing the model cannot do for itself. Then you put it in a car or a real repository, take away the built-in finish line, and the missing faculty is suddenly load-bearing and the whole thing collapses. The benchmark is not measuring the skill that actually matters. It is measuring around the exact hole.
So is the harness the solution? No. Not in this case. The harness can impose rules that are likely to hold the quality line. It can expose tools that help the model achieve the goals the harness is meant to be good at achieving. It is not (or shouldn’t be) a long list of conditional logic puzzles that make up for a model that sucks at reasoning through what it needs to do to accomplish a task.
At the end of the day, I spent a lot of my own money to put a reason on the problem I was experiencing. As it sits, I cannot and will not use Antigravity. They call it that because it floats into space… and space is a vacuum… and vacuums suck. But really though… It isn’t suitable for any serious work. And I am actually considering switching to iPhone because Gemini has ruined the Android Auto experience so thoroughly. I am not interested in being trolled by my car when I just want to play some music.
So… Google… if you’re reading this… fix it. Or don’t. You have the researchers, the hardware, the data, and the users. You have every advantage that matters except the one I just spent a weekend “proving” you’re missing: a model that can tell whether it’s getting anywhere. Until then I’ll be connecting my iPhone to my car instead of my Android, playing my own music, talking to a phone that knows my daughter’s name.
