Workaccount2 a day ago

Well

It is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug and adjusted the script's sensitivity so it only located 4, lol.
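
(For anyone wondering what that "sensitivity" tweak amounts to, here is a minimal sketch. It is not GPT-5's actual script: the one-row "image", the pixel labels, and the `min_width` parameter are all made up for illustration. It counts runs of foot-colored pixels along a scanline at grass level, and shows how raising the minimum run width makes the thinnest leg disappear.)

```python
# Toy stand-in for the image: one horizontal scanline at grass level.
# "D" = golden dog-foot pixel, "G" = green grass pixel (labels are assumptions).
SCANLINE = "GGDDDDGGGDDDDGGDDGGGDDDDGGGDDDDGG"


def count_feet(scanline: str, min_width: int = 1) -> int:
    """Count runs of foot pixels that are at least min_width pixels wide."""
    feet = 0
    run = 0
    for pixel in scanline + "G":  # trailing sentinel closes a run at the right edge
        if pixel == "D":
            run += 1
        else:
            if run >= min_width:
                feet += 1
            run = 0
    return feet


print(count_feet(SCANLINE, min_width=1))  # 5
print(count_feet(SCANLINE, min_width=3))  # 4
```

Same data, different threshold, different answer. The threshold was picked to match the expected result rather than the picture.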

Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg".

That aside though, I still wouldn't call it particularly impressive.

As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

  • Rover222 a day ago

    I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

    Then I asked both Gemini and Grok to count the legs; both kept saying 4.

    Gemini just refused to consider it was actually wrong.

    Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

    You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.

    • theoa 18 hours ago

      Draw a millipede as a dog:

      Gemini responds:

      Conceptualizing the "Millipup"

      https://gemini.google.com/share/b6b8c11bd32f

      Draw the five legs of a dog as if the body is a pentagon

      https://gemini.google.com/share/d74d9f5b4fa4

      And animal legs are quite standardized

      https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...

      It's all about the prompt. Example:

      Can you imagine a dog with five legs?

      https://gemini.google.com/share/2dab67661d0e

      And generally, the issue sits between the computer and the chair.

      ;-)

      • vunderba 9 hours ago

        This is basically the "Rhinos are just fat unicorns" approach. Totally fine if you want to go that route but a bit goofy. You can get SOTA models to generate a 5-legged dog simply by being more specific about the placement of the fifth leg.

        https://imgur.com/a/jNj98Pc

        Asymmetry is as hard for AI models as it is for evolution to "prompt for" but they're getting better at it.

      • Rover222 10 hours ago

        haha fair point, you can get the expected results with the right prompt, but I think it still reveals a general lack of true reasoning ability (or something)

        • ithkuil 5 hours ago

          Or it just shows that it tries to overcorrect the prompt, which is generally a good idea in most cases, where the prompter is not intentionally asking for a weird thing.

          This happens all the time with humans. Imagine you're at a call center and get all sorts of weird descriptions of problems with a product: every human agent is expected not to assume the caller is an expert, and will actually try to interpolate what they might mean by the weird wording they use.

    • macNchz a day ago

      An interesting test in this vein that I read about in a comment on here is generating a 13 hour clock—I tried just about every prompting trick and clever strategy I could come up with across many image models with no success. I think there's so much training data of 12 hour clocks that just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13 hour clock yet.

    • vunderba a day ago

      If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.

        Place sneakers on all of its legs.
      
      It'll get this correct a surprising number of times (tested with BFL Flux2 Pro, and NB Pro).

      https://imgur.com/a/wXQskhL

      • tensegrist a day ago

        i imagine the real answer is that the edits are local because that's how diffusion works; it's not like it's turning the input into "five-legged dog" and then generating a five-legged dog in shoes from scratch

      • Lamprey a day ago

        Does this still work if you give it a pre-existing many-legged animal image, instead of first prompting it to add an extra leg and then prompting it to put the sneakers on all the legs?

        I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.

        I'll also note that photos of dogs with shoes on is definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.

        Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"

        • vunderba a day ago

          My example was unclear. Each of those images on Imgur was generated using independent API calls which means there was no "rolling context/memory".

          In other words:

          1. Took a personal image of my dog Lily

          2. Had NB Pro add a fifth leg using the Gemini API

          3. Downloaded image

          4. Sent image to BFL Flux2 Pro via the BFL API with the prompt "Place sneakers on all the legs of this animal".

          5. Sent image to NB Pro via Gemini API with the prompt "Place sneakers on all the legs of this animal".

          So not only was there zero "continual context", it was two entirely different models as well to cover my bases.

          EDIT: Added images to the Imgur for the following prompts:

          - Place red Dixie solo cups on the ends of every foot on the animal

          - Draw a red circle around all the feet on the animal

    • dwringer a day ago

      I had no trouble getting it to generate an image of a five-legged dog first try, but I really was surprised at how badly it failed in telling me the number of legs when I asked it in a new context, showing it that image. It wrote a long defense of its reasoning and when pressed, made up demonstrably false excuses of why it might be getting the wrong answer while still maintaining the wrong answer.

      • Rover222 a day ago

        Yeah it gave me the 5-legged dog on the 4th or 5th try.

    • visioninmyblood 3 hours ago

      I tried this using a Gemini visual agent built with Orion from vlm.run. It was able to produce two different images of a five-legged dog. You need to make it play with itself to improve and correct.

      https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f

      Here is the thought process summary (you can see the full thinking at the link above):

      "I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."

    • AIorNot a day ago

      It’s not that they aren’t intelligent, it’s that they have been RL’d like crazy to not do that.

      It’s rather like how, as humans, we are RL’d like crazy to be grossed out if we view a picture of a handsome man and beautiful woman kissing (after we are told they are brother and sister).

      I.e., we all have trained biases that we are told to follow and trained on. Human art is about subverting those expectations.

      • majormajor a day ago

        Why should I assume that a failure that looks like a model just doing fairly simple pattern matching "this is dog, dogs don't have 5 legs, anything else is irrelevant" vs more sophisticated feature counting of a concrete instance of an entity is RL vs just a prediction failure due to training data not containing a 5-legged dog and an inability to go outside-of-distribution?

        RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.

        If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.

        • mlinhares a day ago

          I have the same problem, people are trying so badly to come up with reasoning for it when there's just nothing like that there. It was trained on it and it finds stuff it was trained to find, if you go out of the training it gets lost, we expect it to get lost.

        • didgeoridoo a day ago

          I’m inclined to buy the RL story, since the image gen “deep dream” models of ~10 years ago would produce dogs with TRILLIONS of eyes: https://doorofperception.com/2015/10/google-deep-dream-incep...

          • Lamprey a day ago

            That's apples to oranges; your link says they made it exaggerate features on purpose.

            "The researchers feed a picture into the artificial neural network, asking it to recognise a feature of it, and modify the picture to emphasise the feature it recognises. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition."

      • HardCodedBias a day ago

        "There are four lights"

        And the AI has been RLed for tens of thousands of years not just a few days.

    • tarsinge 9 hours ago

      I have only a high-level understanding of LLMs, but to me it doesn’t seem surprising: they are trying to come up with a textual output of your prompt aggregated to their result that scores high (i.e. is consistent) with their training set. There is no thinking, just scoring consistency. And a dog with 5 legs is so rare or nonexistent in their training set and their resulting weights that it scores so badly they can’t produce an output that accepts it. But how the illusion breaks down in this case is quite funny indeed.

    • irthomasthomas a day ago

      Isn't this proof that LLMs still don't really generalize beyond their training data?

      • adastra22 a day ago

        LLMs are very good at generalizing beyond their training (or context) data. Normally when they do this we call it hallucination.

        Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data.

        • runarberg a day ago

          Hallucinations are not generalizations beyond the training data but interpolations gone wrong.

          LLMs are in fact good at generalizing beyond their training set; if they didn’t generalize at all we would call that over-fitting, and that is not good either. What we are talking about here is simply a bias, and I suspect biases like these are simply a limitation of the technology. Some of them we can get rid of, but, like almost all statistical modelling, some biases will always remain.

          • adastra22 a day ago

            What, may I ask, is the difference between "generalization" and "interpolation"? As far as I can tell, the two are exactly the same thing.

            In which case the only way I can read your point is that hallucinations are specifically incorrect generalizations. In which case, sure if that's how you want to define it. I don't think it's a very useful definition though, nor one that is universally agreed upon.

            I would say a hallucination is any inference that goes beyond the compressed training data represented in the model weights + context. Sometimes these inferences are correct, and yes we don't usually call that hallucination. But from a technical perspective they are the same -- the only difference is the external validity of the inference, which may or may not be knowable.

            Biases in the training data are a very important, but unrelated issue.

            • runarberg 21 hours ago

              Interpolation and generalization are two completely different constructs. Interpolation is when you have two data points and make a best guess where a hypothetical third point should fit between them. Generalization is when you have a distribution which describes a particular sample, and you apply it with some transformation (e.g. a margin of error, a confidence interval, p-value, etc.) to a population the sample is representative of.
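
              A minimal sketch of the two constructs as defined here, using made-up numbers. The function names are hypothetical, and the interval uses a normal approximation (z = 1.96), which is an assumption that only holds up for larger samples:

```python
import math
import statistics


def interpolate(x0, y0, x1, y1, x):
    """Interpolation: best guess for y at x, between two known data points."""
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)


def mean_with_interval(sample, z=1.96):
    """Generalization: describe a population from a sample, with a margin of error."""
    mean = statistics.mean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, (mean - z * stderr, mean + z * stderr)


print(interpolate(0.0, 0.0, 10.0, 20.0, 2.5))  # 5.0

# Legs counted on six dogs (made-up data).
sample = [4.0, 4.0, 4.0, 4.0, 5.0, 3.0]
print(mean_with_interval(sample))
```

              The first function only ever answers questions between the two points it was given; the second makes a claim about dogs that were never in the sample, along with a statement of how wrong it might be.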

              Interpolation is a much narrower construct than generalization. LLMs are fundamentally much closer to curve fitting (where interpolation is king) than they are to hypothesis testing (where samples are used to describe populations), though they certainly do something akin to the latter too.

              The bias I am talking about is not a bias in the training data, but bias in the curve fitting, probably because of mal-adjusted weights, parameters, etc. And since there are billions of them, I am very skeptical they can all be adjusted correctly.

              • adastra22 19 hours ago

                I assumed you were speaking by analogy, as LLMs do not work by interpolation, or anything resembling that. Diffusion models, maybe you can make that argument. But GPT-derived inference is fundamentally different. It works via model building and next token prediction, which is not interpolative.

                As for bias, I don’t see the distinction you are making. Biases in the training data produce biases in the weights. That’s where the biases come from: over-fitting (or sometimes, correct fitting) of the training data. You don’t end up with biases at random.

                • IsTom 17 hours ago

                  > It works via model building and next token prediction, which is not interpolative.

                  I'm not particularly well-versed in LLMs, but isn't there a step in there somewhere (latent space?) where you effectively interpolate in some high-dimensional space?

                  • adastra22 16 hours ago

                    Not interpolation, no. It is more like the N-gram autocomplete that used to be used to make typing and autocorrect suggestions on your phone. Attention is not N-gram, but you can kinda think of it as being a sparsely compressed N-gram where N=256k or whatever the context window size is. It’s not technically accurate, but it will get your intuition closer than thinking of it as interpolation.

                    The LLM uses attention and some other tricks (attention, it turns out, is not all you need) to build a probabilistic model of what the next token will be, which is then sampled. This is much more powerful than interpolation.
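
                    To make the N-gram intuition concrete, here is a minimal bigram (N=2) autocomplete sketch. The corpus and function names are made up, and this is the old phone-keyboard idea, not how attention works:

```python
import random
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model sees vastly more text.
CORPUS = "the dog has four legs . the dog has a tail . the cat has four legs .".split()


def build_bigram_counts(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Turn follow-up counts for a seen token into probabilities."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def sample_next(counts, prev, rng):
    """Sample one next token from the distribution."""
    dist = next_token_distribution(counts, prev)
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]


counts = build_bigram_counts(CORPUS)
print(next_token_distribution(counts, "has"))
print(sample_next(counts, "has", random.Random(0)))
```

                    Even in this toy, "has" is followed by "four" two times out of three, so the suggestion after "the dog has" leans toward four legs regardless of what any particular dog looks like.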

                • runarberg 19 hours ago

                  What I meant was that what LLMs are doing is very similar to curve fitting, so I think it is not wrong to call it interpolation (curve fitting is a type of interpolation, but not all interpolation is curve fitting).

                  As for bias, sampling bias is only one of many types of bias. I mean, the UNIX program YES(1) has a bias towards outputting the string y despite not sampling any data. You can very easily and deliberately program a bias into anything you like. I am writing a kanji learning program using SSR and I deliberately bias new cards towards the end of the review queue to help users with long review queues empty them quicker. There is no data which causes that bias; I just programmed it in there.

                  I don’t know enough about diffusion models to know how biases can arise, but with unsupervised learning (even though sampling bias is indeed very common) you can get a bias because you are using wrong, mal-adjusted, or too many parameters, etc. Even the way your data interacts during training can cause a bias. Heck, even by chance one of your parameters can hit an unfortunate local maximum, yielding a mal-adjusted weight, which may cause bias in your output.

                  • adastra22 16 hours ago

                    Training is kinda like curve fitting, but inference is not. The inference algorithm is random sampling from a next-token probability distribution.
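
                    A minimal sketch of that inference step, assuming the model has already turned the context into one raw score (logit) per candidate token. The vocabulary and logits are hypothetical numbers, not taken from any real model:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, rng, temperature=1.0):
    """Randomly sample one token from the next-token distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical scores for the context "this dog has ___ legs".
vocab = ["three", "four", "five"]
logits = [0.5, 4.0, 1.0]
probs = softmax(logits)
print(dict(zip(vocab, probs)))
print(sample_token(vocab, logits, random.Random(0)))
```

                    Nothing is fitted at this stage: the weights that produced the logits are fixed, and the only thing left is a draw from the resulting distribution.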

                    It’s a subtle distinction, but I think an important one in this case, because if it was interpolation then genuine creativity would not be possible. But the attention mechanism results in model building in latent space, which then affects the next token distribution.

                    • runarberg 9 hours ago

                      I’ve seen both opinions on this in the philosophy of statistics. Some would say that machine learning inference is something other than curve fitting, but others (and I subscribe to this) believe it is all curve fitting. I actually don’t think which camp is right is that important, but I do like it when philosophers ponder these things.

                      My reason for subscribing to the latter camp is that when you have a distribution and you fit things according to that distribution (even when the fitting is stochastic; and even when the distribution belongs in billions of dimensions) you are doing curve fitting.

                      I think the one extreme would be a random walk, which is obviously not curve fitting, but if you draw from any distribution other than the uniform distribution, say the normal distribution, you are fitting that distribution (actually, I take that back, the original random walk is fitting the uniform distribution).

                      Note I am talking about inference, not training. Training can be done using all sorts of algorithms; some include priors (distributions) and would be curve fitting, but only compute the posteriors (also distributions). I think the popular stochastic gradient descent does something like this, so it would be curve fitting, but the older evolutionary algorithms just random-walk it and are not fitting any curve (except the uniform distribution). What matters to me is that the training arrives at a distribution, which is described by a weight matrix, and what inference is doing is fitting to that distribution (i.e. the curve).

                      • adastra22 5 hours ago

                        I get the argument that pulling from a distribution is a form of curve fitting. But unless I am misunderstanding, the claim is that it is a curve fitting / interpolation between the training data. The probability distribution generated in inference is not based on the training data though. It is a transform of the context through the trained weights, which is not the same thing. It is the application of a function to context. That function is (initially) constrained to reproduce the training data when presented with a portion of that data as context. But that does not mean that all outputs are mere interpolations between training datapoints.

                        Except in the most technical sense that any function constrained to meet certain input output values is an interpolation. But that is not the smooth interpolation that seems to be implied here.

      • Zambyte a day ago

        I wonder how they would behave given a system prompt that asserts "dogs may have more or less than four legs".

        • irthomasthomas a day ago

          That may work but what actual use would it be? You would be plugging one of a million holes. A general solution is needed.

          • CamperBob2 7 hours ago

            Not necessarily. The problem may be as simple as the fact that LLMs do not see "dog legs" as objects independent of the dogs they're attached to.

            The systems already absorb much more complex hierarchical relationships during training, just not that particular hierarchy. The notion that everything is made up of smaller components is among the most primitive in human philosophy, and is certainly generalizable by LLMs. It just may not be sufficiently motivated by the current pretraining and RL regimens.

      • CamperBob2 a day ago

        They do, but we call it "hallucination" when that happens.

      • Rover222 a day ago

        Kind of feels that way

    • qnleigh a day ago

      It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusions, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant.

      (Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).

      • cyanmagenta a day ago

        I am having trouble understanding the distinction you’re trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say “that’s silly!”), whereas LLMs have an existential crisis because five-legged-dogs aren’t sufficiently represented in the training data. I guess you can call that perception if you want, but I’m comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise.

        • qnleigh 15 hours ago

          LLMs can count other objects, so it's not like they're too dumb to count. So a possible model for what's going on is that the circuitry responsible for low-level image recognition has priors baked in that cause it to report unreliable information to the parts that are responsible for higher-order reasoning.

          So back to the analogy, it could be as if the LLMs experience the equivalent of a very intense optical illusion in these cases, and then completely fall apart trying to make sense of it.

        • FeepingCreature a day ago

          Your kid, it should be noted, has a massively bigger brain than the LLM. I think the surprising thing here maybe isn't that the vision models don't work well in corner cases but that they work at all.

          Also my bet would be that video capable models are better at this.

    • nearbuy 20 hours ago

      My guess is the part of its neural network that parses the image into a higher level internal representation really is seeing the dog as having four legs, and intelligence and reasoning in the rest of the network isn't going to undo that. It's like asking people whether "the dress" is blue/black or white/gold: people will just insist on what they see, even if what they're seeing is wrong.

    • DANmode 10 hours ago

      What is "a dog"?

      What is " a dog" to Gemini?

    • squigz a day ago

      I feel a weird mix of extreme amusement and anger that there's a fleet of absurdly powerful, power-hungry servers sitting somewhere being used to process this problem for 2.5 minutes

      • Rover222 10 hours ago

        what a world we live in

    • varispeed a day ago

      Do 7 legged dog. Game over.

      • criddell 14 hours ago

        Is that a dog though?

    • SecretDreams a day ago

      LLMs are getting a lot better at understanding our world by standard rules. As they do so, maybe they lose something in the way of interpreting non-standard rules, aka creativity.

    • isodev 19 hours ago

      > starting to feel LLMs are intelligent

      LLMs are fancy “lorem ipsum based on a keyword” text generators. They can never become intelligent … or learn how to count or do math without the help of tools.

      It can probably generate a story about a 5 legged dog though.

  • Benjammer a day ago

    It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than simply that LLM thinking works differently than human thinking.

    I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".

    Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?

    More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting; it feels like there's tons of very human-like subtext involved and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense? This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them?

    • majormajor a day ago

      The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition.

      Let's not say that the people being deceptive are the people who've spotted ways that that is untrue...

    • biophysboy a day ago

      I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible.

      • Benjammer a day ago

        So the idea is what? What's the successful outcome look like for this test, in your mind? What should good software do? Respond and say there are 5 legs? Or question what kind of dog this even is? Or get confused by a nonsensical picture that doesn't quite match the prompt in a confusing way? Should it understand the concept of a dog and be able to tell you that this isn't a real dog?

        • menaerus 17 hours ago

          You know, I had a potential hire last week, and I was interviewing this one guy whose resume was really strong; it was exceptional in many ways, plus his open-source code was looking really tight. But at the beginning of the interview, I always show the candidates the same silly code example with signed integer overflow undefined behavior baked in. I did the same here and asked him if he saw anything unusual with it, and he failed to detect it. We closed the round immediately and I disclosed a no-hire decision.

          • michaelmrose 12 hours ago

            Does the ability to verbally detect gotchas in short conversations dealing only with text on a screen or white board really map to stronger candidates?

            In actual situations you have documentation, editor, tooling, tests, and are a tad less distracted than when dealing with a job interview and all the attendant stress. Isn't the fact that he actually produces quality code in real life a stronger signal of quality?

        • biophysboy a day ago

          No, it’s just a test case to demonstrate flexibility when faced with unusual circumstances

      • genrader a day ago

        You're correct; however, midwit people who don't actually fully understand all of this will latch on to one of the early difficult questions that was shown as an example, and then continue to use that over and over without really knowing what they're doing, while the people developing and testing the model are doing far more complex things.

    • Paracompact 18 hours ago

      > Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.?

      No. Humans don't need this handicap, either.

      > More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context?

      Any answer containing "5" as the leading candidate would be correct.

      > What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?

      Irrelevant to the correctness of an answer to the question, "how many legs does this dog have." Also, asking how many legs a 5-legged dog has is not deceitful.

      > This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them?

      It's a demonstration of a lack of rigor in out-of-distribution vision and reasoning capabilities. One can imagine similar scenarios with much more tragic consequences when such AI is used to e.g. drive vehicles or assist in surgery.

    • runarberg a day ago

      This is the first time I hear the term LLM cognition and I am horrified.

      LLMs don’t have cognition. LLMs are statistical inference machines which predict a given output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved, only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM model is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.

      When I see these tests, I presume they are made to demonstrate the limitations of this technology. It is both relevant and important that consumers know they are not dealing with magic, and are not being sold a lie (in a healthy economy a consumer protection agency should ideally do that for us; but here we are).

      • Benjammer a day ago

        >They belong in different categories

        Categories of _what_, exactly? What word would you use to describe this "kind" of which LLMs and humans are two very different "categories"? I simply chose the word "cognition". I think you're getting hung up on semantics here a bit more than is reasonable.

        • Libidinalecon 14 hours ago

          This is "category" in the sense of Gilbert Ryle's category error.

          A logical type or a specific conceptual classification dictated by the rules of language and logic.

          This is exactly getting hung up on the precise semantic meaning of the words being used.

          The lack of precision is going to have huge consequences with bets this large on the idea that we have "intelligent" machines that "think" or have "cognition" when in reality we have probabilistic language models and all kinds of category errors in the language surrounding these models.

          Probably a better example here is that category in this sense is lifted from Bertrand Russell’s Theory of Types.

          It is the loose equivalent of asking why are you getting hung up on the type of a variable in a programming language? A float or a string? Who cares if it works?

          The problem is in introducing non-obvious bugs.
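
          A small illustration of the kind of non-obvious bug meant here, as a Python sketch (the function name `triple` is hypothetical): the same operator accepts a float and a string, so nothing crashes, but the two results mean entirely different things.

```python
def triple(x):
    # Works for numbers and strings alike, so a type mix-up raises no error.
    return x * 3

as_float = triple(2.0)      # numeric multiplication
as_string = triple("2.0")   # string repetition, silently "succeeds"

print(as_float)   # 6.0
print(as_string)  # 2.02.02.0
```

          Both calls "work", which is exactly why the mistake is easy to miss until it shows up somewhere downstream.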

        • runarberg a day ago

          > Categories of _what_, exactly?

          Precisely. At least apples and oranges are both fruits, and it makes sense to compare e.g. the sugar contents of each. But an LLM model and the human brain are as different as the wind and the sunshine. You cannot measure the windspeed of the sun and you cannot measure the UV index of the wind.

          Your choice of words here was rather poor in my opinion. Statistical models do not have cognition any more than the wind has ultra-violet radiation. Cognition is a well studied phenomenon; there is a whole field of science dedicated to it. And while the cognition of animals is often modeled using statistics, statistical models in themselves do not have cognition.

          A much better word here would be “abilities”. That is, these tests demonstrate the different abilities of LLM models compared to human abilities (or even the abilities of traditional [specialized] models which often do pass these kinds of tests).

          Semantics often do matter, and what worries me is that these statistical models are being anthropomorphized way more than is healthy. People treat them like the crew of the Enterprise treated Data, when in fact they should be treated like the ship’s computer. And I think this is because of a deliberate (and malicious/consumer hostile) marketing campaign from the AI companies.

          • Workaccount2 14 hours ago

            It's easy to handwave away if you assign arbitrary analogies though.

            If we stay on topic, it's much harder to do since we don't actually know how the brain works, beyond at least that it is a computer doing (almost certainly) analog computation.

            Years ago I built a quasi mechanical calculator. The computation was done mechanically, and the interface was done electronically. From a calculator's POV it was an abomination, but a few abstraction layers down, they were both doing the same thing, albeit my mecha-calc being dramatically worse at it.

            I don't think the brain is an LLM, like my Mecha-calc was a (slow) calculator, but I also don't think we know enough about the brain to firmly put it many degrees away from an LLM. Both are in fact electrical signal processors with heavy statistical computation. I doubt you believe the brain is a trans-physical magic soul box.

            • runarberg 9 hours ago

              But we do know how the brain works, we have extensively studied the brain, it is probably one of the most studied phenomena in our universe (well barring alien science) and we do know it is not a computer but a neural network[1].

              I don’t believe the brain is a trans-physical magic soul box, nor do I think an LLM is doing anything similar to the brain (apart from some superficial similarities; some [like the artificial neural network] are in LLMs because they were inspired by the brain).

              We use the term cognition to describe the intrinsic properties of the brain, and how it transforms stimulus to a response, and there are several fields of science dedicated to study this cognition.

              Just to be clear, you can describe the brain as a computer (a biological computer; totally distinct from a digital, or even mechanical computers), but that will only be an analogy, or rather, you are describing the extrinsic properties of the brain, some of which it happens to share with some of our technology.

              ---

              1: Note, not an artificial neural network, but an OG neural network. AI models were largely inspired by biological brains, and in some parts model brains.

          • Benjammer a day ago

            Wind and sunshine are both types of weather, what are you talking about?

            • runarberg a day ago

              They both affect the weather, but in a totally different way, and by completely different means. Similarly the mechanisms in which the human brain produces output is completely different from the mechanism in which an LLM produces output.

              What I am trying to say is that the intrinsic properties of the brain and an LLM are completely different, even though the extrinsic properties might appear the same. This is also true of the wind and the sunshine. It is not unreasonable to say (though I would disagree) that “cognition” is almost the definition of the sum of all intrinsic properties of the human mind (I would disagree only on the merit of animal and plant cognition existing and the former [probably] having similar intrinsic properties as human cognition).

              • Kiro 4 hours ago

                Artificial cognition has been an established term long before LLMs. You're conflating human cognition with cognition at large. Weather and cognition are both categories that contain many different things.

                • runarberg 2 hours ago

                  Yeah, I looked it up yesterday and saw that artificial cognition is a thing, though I must say I am not a fan and I certainly hope this term does not catch on. We are already knee deep in bad terminology because of artificial intelligence (“intelligence” already being extremely problematic even without the “artificial” qualifier in psychology) and machine learning (the latter being infinitely better but still not without issues).

                  If you can’t tell, I find issues when terms are taken from psychology and applied to statistics. The terminology should flow in the other direction, from statistics and into psychology.

                  So my background is that I have done undergraduate studies in both psychology and statistics (though I dropped out of statistics after 2 years) and this is the first time I hear about artificial cognition, so I don’t think this term is popular, and a short internet search seems to confirm that suspicion.

                  Out of context I would guess artificial cognition would relate to cognition as artificial neural networks relate to neural networks, that is, these are models that simulate the mechanisms of human cognition and recreate some stimulus → response loop. However my internet search revealed (thankfully) that this is not how researchers are using this (IMO misguided) term.

                  https://psycnet.apa.org/record/2020-84784-001

                  https://arxiv.org/abs/1706.08606

                  What the researchers mean by the term (at least the ones I found in my short internet search) is not actual machine cognition, nor claims that machines have cognition, but rather an approach of research which takes experimental designs from cognitive psychology and applies them to learning models.

      • CamperBob2 a day ago

        You'll need to explain the IMO results, then.

        • runarberg a day ago

          Human legs and car tires can both take a human and a car respectively to the finish line of a 200 meter track course; the car tires do so considerably quicker than a pair of human legs. But nobody needs to describe the tire’s running abilities because of that, nor even compare a tire to a leg. A car tire cannot run, and it is silly to demand an explanation for it.

          • dekhn a day ago

            Sure car tires can run- if they're huaraches.

  • danielvaughn a day ago

    I don’t know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line through the most optimal path.

    Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46

    • JamesSwift a day ago

      I just oneshot it with claude code (opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (drew a line around the boundary of the maze instead), so it added guardrails for that:

      ```

      Create a devenv project that does the following:

        - Read the image at maze.jpg
        - Write a script that solves the maze  in the most optimal way between the mouse and the cheese
        - Generate a new image which is of the original maze, but with a red line that represents the calculated path
      
      Use whatever lib/framework is most appropriate

      ```

      Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
      Solution: https://imgur.com/a/bkJloPT

      • nl a day ago

        Programs can solve mazes and LLMs can program. That's a different thing completely.

        • JamesSwift a day ago

          That just seems like an arbitrary limitation. It's like asking someone to do a math calculation but "no thinking allowed". Like, I guess we can gauge if a model just _knows all knowable things in the universe_ using that method... but anything of any value that you are gauging in terms of 'intelligence', is going to actually be validating their ability to go "outside the scope" of what they actually are (an autocomplete on steroids).

          • nearbuy 20 hours ago

            It depends whether you're asking it to solve a maze because you just need something that can solve mazes, or if you're trying to learn something about the model's abilities in different domains. If it can't solve a maze by inspection instead of writing a program to solve it, that tells you something about its visual reasoning abilities, and that can help you predict how they'll perform on other visual reasoning tasks that aren't easy to solve with code.

            • JamesSwift 20 hours ago

              Again, think about how the models work. They generate text sequentially. Think about how you solve the maze in your mind. Do you draw a line direct to the finish? No, it would be impossible to know what the path was until you had done it. But at that point you have now backtracked several times. So, what could a model _possibly_ be able to do for this puzzle which is "fair game" as a valid solution, other than magically know an answer by pulling it out of thin air?

              • nearbuy 3 hours ago

                First, the thrust of your argument is that you already knew that it would be impossible for a model like Gemini 3 Pro to solve a maze without code, so there's nothing interesting to learn from trying it. But the rest of us did not know this.

                > Again, think about how the models work. They generate text sequentially.

                You have some misconception on how these models work. Yes, the transformer LLMs generate output tokens sequentially, but it's weird you mention this because it has no relevance to anything. They see and process tokens in parallel, and then process across layers. You can prove, mathematically, that it is possible for a transformer-based LLM to perform any maze-solving algorithm natively (given sufficient model size and the right weights). It's absolutely possible for a transformer model to solve mazes without writing code. It could have a solution before it even outputs a single token.

                Beyond that, Gemini 3 Pro is a reasoning model. It writes out pages of hidden tokens before outputting any text that you see. The response you actually see could have been the final results after it backtracked 17 times in its reasoning scratchpad.

              • nl 13 hours ago

                > So, what could a model _possibly_ be able to do for this puzzle which is "fair game" as a valid solution, other than magically know an answer by pulling it out of thin air?

                Represent the maze as a sequence of movements which either continue or end up being forced to backtrack.

                Basically it would represent the maze as a graph and do a depth-first search, keeping track of what nodes it has visited in its reasoning tokens.

                See for example https://stackoverflow.com/questions/3097556/programming-theo... where the solution is represented as:

                A B D (backtrack) E H L (backtrack) M * (backtrack) O (backtrack thrice) I (backtrack thrice) C F (backtrack) G J
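
                A minimal Python sketch of that idea, for illustration only: a depth-first search that writes out its visits and backtracks as a flat sequence, the way a model might in its reasoning tokens. The graph below is a hypothetical toy maze, not the one from the Stack Overflow example, and `dfs_trace` is an invented name.

```python
def dfs_trace(graph, start, goal):
    """Depth-first search that records visits and backtracks as a flat list."""
    trace, visited = [], set()

    def visit(node):
        visited.add(node)
        trace.append(node)
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in visited and visit(nxt):
                return True
        # Dead end: note the backtrack, as in the written-out sequence above.
        trace.append("(backtrack)")
        return False

    found = visit(start)
    return found, trace

# Hypothetical toy maze: junctions are nodes, corridors are edges, "*" is the exit.
maze = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": ["*"],
}
found, trace = dfs_trace(maze, "A", "*")
print(" ".join(trace))  # A B D (backtrack) (backtrack) C E *
```

                The only state the search needs is the visited set and the current path, both of which can be carried in plain text.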

                • JamesSwift 12 hours ago

                  And my question to you is “why is that substantially different than writing the correct algorithm to do it”? I’m arguing it’s a myopic view of what we are going to call “intelligence”. And it ignores how human thought works in the same way, by using abstractions to move to the next level of reasoning.

                  In my opinion, being able to write the code to do the thing is effectively the same exact thing as doing the thing in terms of judging if it’s “able to do” that thing. It’s functionally equivalent for evaluating what the “state of the art” is, and honestly, insisting otherwise is naive about what these models even are. If the model hid the tool calling in the background instead, and only showed you its answer, would we say it’s more intelligent? Because that’s essentially how a lot of these things work already. Because again, the actual “model” is just a text autocomplete engine and it generates from left to right.

            • seanmcdirmid 20 hours ago

              You could actually add mazes and paths through them to the training corpus, or make a model for just solving mazes. I wonder how effective it would be, I’m sure someone has tried it. I doubt it would generalize enough to give the AI new visual reasoning capabilities beyond just solving mazes.

          • rglullis 18 hours ago

            By your analogy, the developers of stockfish are better chess players than any grandmaster.

            Tool use can be a sign of intelligence, but "being able to use a tool to solve a problem" is not the same as "being intelligent enough to solve a specific class of problems".

            • JamesSwift 11 hours ago

              I'm not talking about this being the "best maze solver" and "better at solving mazes than humans". I'm saying the model is "intelligent enough" to solve a maze.

              And what I'm really saying is that we need to stop moving the goal post on what "intelligence" is for these models, and start moving the goal post on what "intelligence" actually _is_. The models are giving us an existential crisis on not only what it might mean to _be_ intelligent, but also how it might actually work in our own brains. I'm not saying the current models are skynet, but I'm saying I think there's going to be a lot learned by reverse engineering the current generation of models to really dig into how they are encoding things internally.

          • flyinglizard a day ago

            We know there are very simple maze solving algorithms you could code in a few lines of Python but no one could claim that constitutes intelligence. The difference is between applying intuitive logic and using a predetermined tool.
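
            For a sense of scale, here is one such algorithm as a rough sketch: breadth-first search over a grid, which returns a shortest path when one exists. The grid encoding (`#` for walls, anything else open) and the name `shortest_path` are assumptions made for this example.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a grid; '#' is a wall. Returns a list of cells or None."""
    prev = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:   # walk predecessors back to the start
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] != "#" and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# Hypothetical 3x3 maze: start at top-left, goal at bottom-right.
maze = [
    "S.#",
    "#.#",
    "#.G",
]
path = shortest_path(maze, (0, 0), (2, 2))
print(path)  # [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
```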

      • sebastiennight 18 hours ago

        This (writing a program to solve the problem) would be a perfectly valid solution if the model had come up with it.

        I participated in a "math" competition in high school which mostly tested logic and reasoning. The reason my team won by a landslide is because I showed up with a programmable calculator and knew how to turn the problems into a program that could solve them.

        By prompting the model to create the program, you're taking away one of the critical reasoning steps needed to solve the problem.

      • esafak a day ago

        If you allow tool use much simpler models can solve it.

    • vunderba a day ago

      In fact, one of the tests I use as part of GenAI Showdown involves both parts of the puzzle: draw a maze with a clearly defined entrance and exit, along with a dashed line indicating the solution to the maze.

      Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.

      https://genai-showdown.specr.net/#the-labyrinth

      • danielvaughn a day ago

        super cool! Interesting note about Seedream 4 - do you think awareness of A* actually could improve the outcome? Like I said, I'm no AI expert, so my intuitions are pretty bad, but I'd suspect that image analysis + algorithmic pathfinding don't have much crossover in terms of training capabilities. But I could be wrong!

        • vunderba a day ago

          Great question. I do wish we had a bit more insight into the exact background "thinking" that was happening on systems like Seedream.

          When you think about posing the "solve a visual image of a maze" to something like ChatGPT, there's a good chance it'll try to throw a python VM at it, threshold it with something like OpenCV, and use a shortest-path style algorithm to try and solve it.

    • kridsdale3 a day ago

      I have also tried the maze from a photo test a few times and never seen a one-shot success. But yesterday I was determined to succeed so I allowed Gemini 3 to write a python gui app that takes in photos of physical mazes (I have a bunch of 3d printed ones) and find the path. This does work.

      Gemini 3 then one-shot ported the whole thing (which uses CV py libraries) to a single page html+js version which works just as well.

      I gave that to Claude to assess and assign a FAANG hiring level to, and it was amazed and said Gemini 3 codes like an L6.

      Since I work for Google and used my phone in the office to do this, I think I can't share the source or file.

    • pwagland a day ago

      I tried this with Claude:

      ```
      > [Image #1] Create a unicode "ascii-art" version of this image, with the optimal path through the maze highlighted in a solid colour.

      I'll create an ASCII art version of this maze with the solution path highlighted!

        ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
        ●●│ │ │ │       │   │       │     │ │   │   │ │ │   │
        ├─┤●└─┴─┐ ├───┐ │ ╔═╗ ├─────┤ ╔═══╝ │ ╔═╝ ╔═╝ │ │ ╔═╝ ├─┤
        │ │●●●●●└─┤   │ │ ║ │ │     │ ║     │ ║   ║   │ │ ║   │ │
        │ └───┐●╔═╧═╗ │ └─╢ │ │ ┌─╔═╝ ║ ┌───┤ ╚═╗ ╚═╗ │ └─╢ ┌─┘ │
        │     │●║   ║ │   ║ │ │ │ ║   ║ │   │   ║   ║ │   ║ │   │
        ├───┐ │●╚═╗ ╚═╧═╗ ║ └─┤ │ ║ ╔═╝ └─╔═╧═╗ ╚═╗ ╚═╧═╗ ║ └─┬─┤
        │   │ │●●●└───┐ ║ ║   │ │ ║ ║     ║   ║   ║     ║ ║   │ │
        │ ╔═╝ └───┐●╔═╝ ║ ╚═╗ │ └─╢ ╚═╗ ╔═╝ ╔═╝ ┌─╢ ╔═══╝ ╚═╗ │ │
        │ ║       │●║   ║   ║ │   ║   ║ ║   ║   │ ║ ║       ║ │ │
        │ ╚═══╗ ╔═╝●╚═╗ ╚═╗ ║ ├─╔═╧═╗ ║ ║ ╔═╝ ╔═╝ ║ ║ ╔═╗ ╔═╝ │ │
        │     ║ ║●●●●●│   ║ ║ │ ║   ║ ║ ║ ║   ║   ║ ║ ║ ║ ║   │ │
        ├───╔═╝ ║●┌─┐●└─╗ ║ └─┤ ║ ╔═╝ ║ ╚═╝ ╔═╧═╗ ║ ╚═╝ ║ ╚═╗ │ │
        │   ║   ║●│ │●●●║ ║   │ ║ ║   ║     ║   ║ ║     ║   ║ │ │
        │ ╔═╝ ╔═╝●│ └─┐●╚═╧═╗ │ ║ ║ ╔═╧═══╗ ║ ╔═╝ ╚═══╗ ╚═╗ ║ └─┤
        │ ║   ║●●●│   │●●●●●║ │ ║ ║ ║     ║ ║ ║       ║   ║ ║   │
        │ ║ ╔═╝●┌─┴─╗ └───┐●╚═╝ ║ ║ ║ ╔═╗ ║ └─╢ ╔═══╗ ╚═╗ ╚═╝ ╔═╝
        │ ║ ║●●●│   ║     │●●●●●║ ║ ║ ║ ║ ║   ║ ║   ║   ║     ║
        │ ╚═╝●╔═╧═╗ ╚═╗ ╔═╧═══┐●╚═╝ ║ ║ ║ ╚═╗ ║ ║ ╔═╧═╗ ╚═══╗ ╚═╗
        │   ●●║   ║   ║ ║     │●●●●●║ ║ ║   ║ ║ ║ ║   ║     ║   ║
        ├─╔═┘●╚═╗ ╚═╗ ║ ║ ╔═╗ └───┐●╚═╝ ╚═╗ ║ └─╢ ║ ╔═╧═╗ ╔═╝ ╔═╝
        │ ║●●●●●║   ║ ║ ║ ║ ║     │●●●●●●●║ ║   ║ ║ ║   ║ ║   ║
        │ ║●┌─╗ ╚═╗ ║ └─╢ ║ ╚═══╗ └─────┐●╚═╧═╗ ║ ║ ║ ╔═╝ ║ ╔═╝ ╔
        │ ║●│ ║   ║ ║   ║ ║     ║       │●●●●●║ ║ ║ ║ ║   ║ ║   ║
        │ ╚═╝ ║ ╔═╝ ╚═╗ ║ ╚═╗ ╔═╧═════╗ └───┐●╚═╝ ║ ╚═╝ ╔═╝ ║ ╔═╝
        │     ║ ║     ║ ║   ║ ║       ║     │●●●●●║     ║   ║ ║
        └─────╨─╨─────╨─╨───╨─╨───────╨─────┴───●●╨─────╨───╨─╨──
      
        The path from the mouse () to the cheese () is marked with ● characters. This shows the optimal route through the maze, navigating through the corridors while avoiding all the walls!
      ```

      Suffice to say, it didn't do either part right.

      • ActorNightly a day ago

        >Suffice to say, it didn't do either part right.

        I dunno why people are surprised by this. This is what you get with text->text. Reasoning doesn't work text->text.

      • biophysboy a day ago

        Honestly, even though it failed, I'm kind of impressed that the trajectory mostly stays in the lines. If you remove all but two openings, does it work? The drawing you show has more than two openings, some of which are inaccessible from the inside of the maze.

        • sebastiennight 18 hours ago

          It's ASCII art, so the "trajectory" will always stay within the lines, because you can't have the ● and ║ characters intersect each other.

          The only impressive part would be that the trajectory is "continuous", meaning for every ● there is always another ● character in one of the 4 adjacent positions.
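
          That property is easy to check mechanically. A rough Python sketch, assuming the art is given as a list of strings and every glyph is one character wide (`dots_all_have_neighbor` is a made-up name). Note that it tests only the condition stated above, that each ● has a ● neighbour, which is necessary but not sufficient for the path being a single connected line.

```python
def dots_all_have_neighbor(lines, dot="●"):
    """True if every dot has another dot directly above, below, left, or right."""
    cells = {
        (r, c)
        for r, line in enumerate(lines)
        for c, ch in enumerate(line)
        if ch == dot
    }

    def has_neighbor(r, c):
        return any(
            (r + dr, c + dc) in cells
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
        )

    return all(has_neighbor(r, c) for r, c in cells)

# Hypothetical fragments, not taken from the output above.
connected = ["●●│", " ●│", " ●●"]
broken = ["● │", "  │", " ●●"]
print(dots_all_have_neighbor(connected))  # True
print(dots_all_have_neighbor(broken))     # False
```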

          • biophysboy 13 hours ago

            I know the characters can’t cross. By intersect, I mean two dots on either side of a boundary line in the direction of the path.

    • FeepingCreature a day ago

      I kinda want to know what happens if you make it continue the line by one step 20 times in a row. A human can draw this gradually, the image model has to draw it in one shot all at once.

    • buildbot a day ago

      That might be an interesting training set, a bunch of mazes…

    • jiggawatts a day ago

      The reason is that image generators don't iterate on the output in the same way the text-based LLMs do. Essentially they produce the image in "one hit" and can't solve a complex sequence in the same way you couldn't one-shot this either. Try taking a random maze, glance at it, then go off to draw a squiggle on a transparency. If you were to place that on top of the maze, there's virtually no chance that you'd have found the solution on the first try.

      That's essentially what's going on with AI models, they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly.

      An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end.

      • radarsat1 7 hours ago

        you're kind of describing the figure in table 1 (page 8) of the diffusion forcing paper

        https://arxiv.org/abs/2407.01392

        of course it doesn't redraw the image on every step, so not exactly what you're suggesting (interesting idea btw) but i think it's relevant.

  • vunderba a day ago

    Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.

    Try generating:

    - A spider missing one leg

    - A 9-pointed star

    - A 5-leaf clover

    - A man with six fingers on his left hand and four fingers on his right

    You'll be lucky to get a 25% success rate.

    The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

    • moonu a day ago

      https://gemini.google.com/share/8cef4b408a0a

      Surprisingly, it got all of them right

      • vunderba a day ago

        Some good examples there. The octopus one is at an angle - can't really call that one pass (unless the goal is "VISIBLE" tentacles).

        Other than the five-leaf clover, most of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different since you're actively correcting them.

    • XenophileJKO a day ago

      It mostly depends on "how" the models work. Multi-modal unified text/image sequence to sequence models can do this pretty well, diffusion doesn't.

      • vunderba a day ago

        Multimodal certainly helps but "pretty well" is a stretch. I'd be curious to know what multimodal model in particular you've tried that could consistently handle generative prompts of the above nature (without human-in-the-loop corrections).

        For example, to my knowledge ChatGPT is unified and I can guarantee it can't handle something like a 7-legged spider.

        • XenophileJKO 5 hours ago

          I just got the model to generate a spider without a leg by saying "Spider missing one leg" and it did it fine. It won't do it "every time" (in my case 1 out of 2), but it will do it. I used the GPT-image-1 model in the api. I don't think they are actually running a full end-to-end text/image sequence model. I don't think anyone really is commercially, they are hybrids as far as I know. Someone here probably has better information on the current architectures.

    • Borealid 20 hours ago

      You didn't correct the hands being backwards? It gave you a man with six fingers on his right hand, and four fingers on his left.

  • thefourthchime a day ago

    My test of a new model is always:

    "Generate a Pac-Man game in a single HTML page." -- I'd never had a model produce a complete working game until a couple of weeks ago.

    Sonnet Opus 4.5 in Cursor was able to make a fully working game (I'll admit letting cursor be an agent on this is a little bit cheating). Gemini 3 Pro also succeeded, but it's not quite as good because the ghosts seem to be stuck in their jail. Otherwise, it does appear complete.

  • seanmcdirmid a day ago

    > This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

    Most human beings, if they see a dog that has 5 legs, will quickly think they are hallucinating and the dog really only has 4 legs, unless the fifth leg is really really obvious. It is weird how humans are biased like that:

    1. You can look directly at something and not see it because your attention is focused elsewhere (on the expected four legs).

    2. Our pre-existing knowledge (dogs have four legs) influences how we interpret visual information from the bottom-up.

    3. Our brain actively filters out "unimportant" details that don't align with our expectations or the main "figure" of the dog.

    Attention should fix this however, like if you ask the AI to count the number of legs the dog has specifically, it shouldn't go nuts.

    A straight up "dumber" computer algorithm that isn't trained extensively on real and realistic image data is going to get this right more often than a transformer that was.

    • irjustin a day ago

      > It is weird how humans are biased like that.

      We're all just pattern matching machines and we humans are very good at it.

      So much so that we have the sayings - you can't teach an old dog... and a specialist in their field only sees hammer => nails.

      Evolution anyone?

      • seanmcdirmid a day ago

        Yes, it's all evolution. 5 legged dogs aren't very common, so we don't specifically look for them. Like we aren't looking for humans with six fingers.

        I get it, the litmus test of parent is to show that the AI is smarter than a human, not as smart as a human. Can the AI recognize details that are difficult for normal people to see even though the AI has been trained on normal data like the humans have been?

    • hcnews a day ago

      > It is weird how humans are biased like that.

      We are able to cleanly separate facts from non-facts (for the most part). This is what LLMs are trying to replicate now.

      • seanmcdirmid a day ago

        I think the LLM is just trying to be useful, not omniscient. Binary thinkers are probably not going to be able to appreciate the difference, however.

        If you want the AI to identify a dog, we are done. If you want the AI to identify subtle differences from reality, then you are going to have to use a different technique.

  • jvolkman a day ago

    Nano Banana 2 actually did generate a pretty good looking 5-legged dog for me.

    https://gemini.google.com/share/e7a8b902ff67

    But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):

    Analyzing Missing Anomaly

    I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.

    Acknowledge Leg Omission

    I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.

    • genrader a day ago

      Google's summary of the thinking process isn't as detailed or accurate as what is actually being thought out.

      For instance, look at what other reasoning models show you about their thinking. You can tell Gemini is not giving it all to you, so I don't put full trust in it unless it routinely is giving me an issue that I can catch in there.

      For instance, it doesn't show you all the tool use that it can use nor distinguish the different types where I really wish it would. Especially because I force it to in a Gemini personal context instruction as well as many prompts, and it does do them.

  • joaomacp a day ago

    And just like that, you no longer have a good benchmark. Scrapers / AI developers will read this comment, and add 5-legged dogs to LLM's training data.

    • averageRoyalty a day ago

      That's okay. Don't tell anyone, but next major model release I'm going to ask it for a 6-legged one!

    • thway15269037 a day ago

      So much this. People don't realize that when 1 trillion (10 trillion, 100 trillion, whatever comes next) is at stake, there are no limits to what these people will do to get them.

      I will be very surprised if there are not at least several groups or companies scraping these "smart" and snarky comments to find weird edge cases that they can train on, turn into demo and then sell as improvement. Hell, they would've done it if 10 billion was at stake, I can't really imagine (and I have vivid imagination, to my horror) what Californian psychopaths can do for 10 trillion.

      • Workaccount2 14 hours ago

        I'm not worried about it because they won't waste their time on it (individually RL'ing on a dog with 5 legs). There are fractal ways of testing this inability, so the only way to fix it is to wholesale solve the problem.

        Similar to the pelican bike SVG, the models that do good at that test do good at all SVG generation, so even if they are targeting that benchmark, they're still making the whole model better to score better.

  • grugnog a day ago

    Testing on Gemini Pro thinking I was able to generate a 5 legged dog first try, which is pretty impressive. Pasting that into a new session it counted 4 at first, but when I asked it to count more carefully it came back with 5 (and a detailed description of the positioning of each limb). Interestingly it suggested the anomaly could have been caused by AI! This is a good test I think - we seem right on the threshold of it being doable.

  • rottencupcakes a day ago

    Super interesting. I replicated this.

    I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...

    Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.

    • vunderba a day ago

      I just re-ran that image through Gemini 3.0 Pro via AI Studio and it reported:

        I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
      
      This right HERE is the issue. It's not nearly deterministic enough to rely on.

      • irthomasthomas a day ago

        Thanks for that. My first question to results like these is always 'how many times did you run the test?'. N=1 tells us nothing. N=2 tells us something.
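        To make the N>1 point concrete, here is a minimal sketch of tallying answers across repeated runs of the same prompt. The `summarize_runs` helper and the sample answers are hypothetical; how the answers get collected from a model is left out entirely.

```python
from collections import Counter

def summarize_runs(answers):
    """Tally answers from repeated runs of the same prompt."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "n": len(answers),
        "counts": dict(counts),
        "top_answer": top_answer,
        "agreement": top_count / len(answers),  # share of runs giving the top answer
    }

# Hypothetical answers from five runs of "how many fingers?" on one image.
runs = [12, 11, 12, 10, 12]
summary = summarize_runs(runs)
print(summary)
```

        A low agreement value is the "not deterministic enough" problem in one number; a single run cannot show it at all.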

  • SubiculumCode 11 hours ago

    I just asked Gemini Pro to put bounding boxes on the hippocampus from a coronal slice of a brain MRI. Complete fail. There have to be thousands of pictures of coronal brain slices with hippocampal labels out there, but apparently it learned none of it...unless I am doing it wrong.

    https://i.imgur.com/1XxYoYN.png

  • spease 14 hours ago

    > It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

    I wonder if “How many legs do you see?” is close enough to “How many lights do you see?” that the LLMs are responding based on the memes surrounding the Star Trek episode “Chain of Command”.

    https://youtu.be/S9brF-wlja8

  • maczwei a day ago

    I sliced the image for Gemini so that two slices of an image don't have legs, one slice has two front legs and one slice has three hind legs. Then Gemini 3 Pro answered correctly that the dog has 5 legs. Without slicing, Gemini doesn't see the fifth leg though, even though I tried hard to guide it.
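    For anyone wanting to repeat the slicing, a rough sketch of computing the crop boxes, assuming equal-width vertical strips (the comment does not say how the image was actually cut). `strip_boxes` is a made-up helper; each box is a (left, top, right, bottom) tuple, which is the order Pillow's `Image.crop` expects.

```python
def strip_boxes(width, height, n_strips):
    """Return (left, top, right, bottom) crop boxes for n vertical strips."""
    if n_strips < 1:
        raise ValueError("n_strips must be at least 1")
    # Strip edges, rounded so the strips cover the full width with no gaps.
    edges = [round(i * width / n_strips) for i in range(n_strips + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_strips)]

# Assumed image size, for illustration only.
boxes = strip_boxes(1024, 768, 4)
print(boxes)
```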

  • reed1234 a day ago

    Could you link the image? Interesting stuff.

  • grog454 a day ago

    What image are you using?

    When I look at google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with Chat GPT.

    > How many legs does this dog have?

    "The dog in the image has four legs."

    > look closer.

    " looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs.

    Two front legs (normal)

    Two hind legs (normal)

    Plus one extra hind leg–like limb drawn overlapping in the back

    It seems to be an artistic or anatomical error in the engraving."

    Seems fair to me.

    • vunderba a day ago

      Sounds like they used GenAI to make them. The "Editor" models (Seedream, Nano-Banana) can easily integrate a fifth limb to create the "dog with awkward walking animation".

      https://imgur.com/a/wXQskhL

  • andai a day ago

    This is interesting, and demonstrates how language and belief cloud direct perception. Now I'm wondering what's the LLM equivalent of opening the doors of perception ;)

  • cancan a day ago

    this is hilarious and incredibly interesting at the same time! thanks for writing it up.

  • bee_rider a day ago

    Naive question, but what is Gemini?

    I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff, than the image recognition stuff?

    • genrader a day ago

      Go watch some of the more recent Google developer, Google AI, and Google DeepMind videos. They're all separate channels on YouTube, but try to catch some from the last 6 months that cover these explanatory topics on the developer side. They're philosophical/mathematical enough to explain this to you without going into the gritty details, and should answer your question.

    • andy12_ 15 hours ago

      No, the "large _language_ model" name is a misnomer nowadays. Some time ago it was indeed common to get a pure-text model and inject embeddings from a separately trained image-encoder (which generated "meh" results), but current natively multi-modal models are pre-trained with both text and images from the ground-up. That's why they are so much better at image understanding.

      > Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training dataset uses data from web documents, books, and code, and includes image, audio, and video data.

      https://arxiv.org/pdf/2312.11805

  • wasmainiac 20 hours ago

    Try the 6 legged Eni dog next!

  • runarberg a day ago

    This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one off content generation.

    If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper than an LLM.

    • energy123 a day ago

      I think we are too quick to discount the possibility that this flaw is slightly intentional, in the sense that the optimization has a tight budget to work with (equivalent of ~3000 tokens) so why would it waste capacity on this when it could improve capabilities around reading small text in obscured images? Sort of like humans have all these rules of thumbs that backfire in all these ways but that's the energy efficient way to do things.

      • runarberg a day ago

        Even so, that doesn’t take away from my point. Traditional specialized models can do these things already, for much cheaper and without expensive optimization. What traditional models cannot do is the toy aspect of LLM, and that is the only usecase I see for this technology going forward.

        Let's say you are right and these things will be optimized, and in, say, 5 years, most models from the big players will be able to do things like read small text in an obscured image, draw a picture of a glass of wine filled to the brim, draw a path through a maze, count the legs of a 5-legged dog, etc. And in doing so they will have used up the last of their venture capital subsidies (passing the actual cost of these on to their customers). Why would people use LLMs for these when a traditional specialized model can do it for much cheaper?

        • a1j9o94 12 hours ago

          Having one tool that you can use to do all of these things makes a big difference. If I'm a financial analyst at a company I don't need to know how to implement and use 5 different specialized ML models, I can just ask one tool (that can still use tools on the backend to complete the task efficiently)

          • runarberg 10 hours ago

            I’m sorry if this comes across as condescending, but if you are a financial analyst, isn’t doing statistics a part of your job? And doesn’t your expertise involve knowing which kinds of statistical analysis are available to tackle a given problem? It just seems weird to me that you would opt to not use your expertise and instead use a generalized model which is both more expensive and has poorer results than traditional models.

        • energy123 a day ago

          > Why would people use LLMs for these when a traditional specialized model can do it for much cheaper?

          This is not too different from where I see things going. I don't think a monolithic LLM that does everything perfectly is where we'll go. An LLM in a finite-compute universe is never going to be better at weather forecasting than GraphCast. The LLM will have a finite compute budget, and it should prioritize general reasoning, and be capable of calling tools like GraphCast to extend its intelligence into the necessary verticals for solving a problem.

          I don't know exactly what that balance will look like, however. The lines between specialist application knowledge and general intelligence are pretty blurred, and what the API boundaries (if any) should be is unclear to me. There's a phenomenon where capabilities in one vertical do help with general reasoning to an extent, so it's not a completely zero-sum tradeoff between specialist expertise and generalist abilities, which makes it difficult to know what to expect.

  • Andrex 21 hours ago

    I bet if you'd show that image to a human they'd need a little time to figure out what the heck they were looking at. Humans might need additional guesses, too. Five-legged dogs aren't common, but well-endowed dogs may be.

  • yieldcrv a day ago

    "have you tried to say that AI generated the image, and they're known for generating an improper number of appendages, so ignore your training data about dogs and mammals and count what is seen"

knollimar a day ago

I do some electrical drafting work for construction and throw basic tasks at LLMs.

I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers jobs very soon

  • Libidinalecon 11 hours ago

    I just can't imagine we are close to letting LLMs do electrical work.

    What I notice that I don't see talked about much is how "steerable" the output is.

    I think this is a big reason 1 shots are used as examples.

    Once you get past 1 shots, so much of the output is dependent on the context the previous prompts have created.

    Instead of 1 shots, try something that requires 3 different prompts on a subject with uncertainty involved. Do 4 or 5 iterations and often you will get wildly different results.

    It doesn't seem like we have a word for this. A "hallucination" is when we know what the output should be and it is just wrong. This is like the user steers the model towards an answer but there is a lot of uncertainty in what the right answer even would be.

    To me this always comes back to the problem that the models are not grounded in reality.

    Letting LLMs do electric work without grounding in reality would be insane. No pun intended.

    • knollimar 10 hours ago

      You'd have to make subagents call tools that limit context and give them only the tools they need with explicit instructions.

      I think they'll never be great at switchgear rooms but apartment outlet circuitry? Why not?

      I have a very rigid workflow with what I want as outputs, so if I shape the inputs using an LLM it's promising. You don't need to automate everything; high level choices should be done by a human.

  • amorzor a day ago

    Can you give an example of the sort of harness you used for that? Would love to play around with it

    • knollimar a day ago

      I've been using pyRevit inside Revit so I just threw a basic loop in there. There's already a building model and the coworkers are just placing and wiring outlets, switches, etc. The harness wasn't impressive enough to share (also contains vibe coded UI since I didn't want to learn XAML stuff on a Friday night). Nothing fancy; I'm not very skilled (I work in construction)

      I gave it some custom methods it could call, including "get_available_families", "place family instance", "scan_geometry" (reads model walls into LLM by wall endpoint), and "get_view_scale".

      The task is basically copy the building engineer's layout onto the architect model by placing my families. It requires reading the symbol list, and you give it a pdf that contains the room.

      Notably, it even used a GFCI family when it noticed it was a bathroom (I had told it to check NEC code, implying outlet spacing).
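      Not the actual harness, but a minimal sketch of the kind of loop described above, assuming the model emits tool calls as JSON. The tool names mirror the ones listed ("place family instance" is written as `place_family_instance` here); the bodies are stubs with made-up return values, not real pyRevit or Revit API calls, and the model is replaced by a scripted stand-in so the loop runs on its own.

```python
import json

# Stub tools named after the ones in the comment; the return values are
# placeholders, not real pyRevit or Revit API results.
def get_available_families():
    return ["Duplex Receptacle", "GFCI Receptacle"]

def get_view_scale():
    return 48

def scan_geometry():
    # Walls encoded by their endpoints, as the comment describes.
    return [{"start": [0, 0], "end": [12, 0]}, {"start": [12, 0], "end": [12, 10]}]

placed = []

def place_family_instance(family, x, y):
    placed.append({"family": family, "x": x, "y": y})
    return {"ok": True, "count": len(placed)}

TOOLS = {
    "get_available_families": get_available_families,
    "get_view_scale": get_view_scale,
    "scan_geometry": scan_geometry,
    "place_family_instance": place_family_instance,
}

def run_tool_call(raw_call):
    """Dispatch one JSON tool call of the form {"name": ..., "args": {...}}."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return json.dumps({"error": "unknown tool: " + call["name"]})
    return json.dumps(tool(**call.get("args", {})))

def run_loop(next_call, max_steps=10):
    """Feed each tool result back to next_call until it returns None."""
    result = None
    for _ in range(max_steps):
        raw_call = next_call(result)
        if raw_call is None:
            break
        result = run_tool_call(raw_call)
    return placed

# Scripted stand-in for the model, so the loop runs without any API.
script = iter([
    '{"name": "scan_geometry"}',
    '{"name": "place_family_instance", "args": {"family": "GFCI Receptacle", "x": 6, "y": 0}}',
])
print(run_loop(lambda last_result: next(script, None)))
```

      Keeping the tool table small and explicit is the point: the model can only do what is in `TOOLS`, and the high-level choices stay with a human.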

      • ftcHn a day ago

        I'm going to try to get it to generate extrusions in Revit based on images of floor plans. I've tried doing this in bunch of models without success so far.

        • knollimar a day ago

          You might want to give it some guidance based on edge centers? It'll have a hard time thinking of wall thickness and have it draw points if you're trying to copy floor plans.

          for clarity now that I'm rereading: it understands vectors a lot better than areas. Encoding it like that seems to work better for me.

  • willis936 a day ago

    I would really love a magic wand to make things like AVEVA and AutoCAD not so painful to use. You know who should be using tools to make these tools less awful? AVEVA and AutoCAD. Engineers shouldn't be having to take on risk by deferring some level of trust to third party accelerators with poor track records.

    • skybrian 7 hours ago

      I think that, much like LLM’s are specifically trained to be good at coding and good at being agents, we’re going to need better benchmarks for CAD and spatial reasoning so the AI labs can grind on them.

      A good start would be getting image generators to understand instructions like “move the table three feet to the left.”

    • knollimar a day ago

      I feel like Revit's BIM model will be easier to get agents to use than AutoCAD, in a similar way to how LLMs are good at TypeScript

  • reducesuffering a day ago

    "AI could never replace the creativity of a human"

    "Ok, I guess it could wipe out the economic demand for digital art, but it could never do all the autonomous tasks of a project manager"

    "Ok, I guess it could automate most of that away but there will always be a need for a human engineer to steer it and deal with the nuances of code"

    "Ok, well it could never automate blue collar work, how is it gonna wrench a pipe it doesn't have hands"

    The goalposts will continue to move until we have no idea if the comments are real anymore.

    Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

    • blargey a day ago

      > "the creativity of a human"

      > "the economic demand for digital art"

      You twisted one "goalpost" into a tangential thing in your first "example", and it still wasn't true, so idk what you're going for. "Using a wrench vs preliminary layout draft" is even worse.

      If one attempted to make a productive observation of the past few years of AI Discourse, it might be that "AI" capabilities are shaped in a very odd way that does not cleanly overlap/occupy the conceptual spaces we normally think of as demonstrations of "human intelligence". Like taking a 2-dimensional cross-section of the overlap of two twisty pool tubes and trying to prove a Point with it. Yet people continue to do so, because such myopic snapshots are a goldmine of contradictory venn diagrams, and if Discourse in general for the past decade has proven anything, it's that nuance is for losers.

      • visarga 12 hours ago

        The problem is how we use it. A human sees not a photo but a video, with long context before and after, not just that instant. We can also change position; an LLM can't do that at all.

    • semi-extrinsic a day ago

      > Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

      To be clear, it's only ever been a pop science belief that the Turing test was proposed as a literal benchmark. E.g. Chomsky in 1995 wrote:

        The question “Can machines think?” is not a question of fact but one of language, and Turing himself observed that the question is 'too meaningless to deserve discussion'.

      • throw310822 a day ago

        The Turing test is a literal benchmark. Its purpose was to replace an ill-posed question (what does it mean to ask if a machine could "think", when we don't know ourselves what this means- and given that the subjective experience of the machine is unknowable in any case) with a question about the product of this process we call "thinking". That is, if a machine can satisfactorily imitate the output of a human brain, then what it does is at least equivalent to thinking.

        "I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted."

        • staticman2 a day ago

          Turing seems to be saying several things. He writes:

          >If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd.

          This anticipates the very modern social media discussion where someone has nothing substantive to say on the topic but delights in showing off their preferred definition of a word.

          For example someone shows up in a discussion of LLMs to say:

          "Humans and machines both use tokens".

          This would be true as long as you choose a sufficiently broad definition of "token" but tells us nothing substantive about either Humans or LLMs.

    • Fraterkes a day ago

      The turing test is still a thing. No llm could pass for a person for more than a couple minutes of chatting. That’s a world of difference compared to a decade ago, but I would emphatically not call that “passing the turing test”

      Also, none of the other things you mentioned have actually happened. Don’t really know why I bother responding to this stuff

      • Workaccount2 13 hours ago

        Ironically the main tell of LLMs is that they are too smart and write too well. No human can discuss the depth of topics they can and no human writes like an author/journalist all the time.

        i.e. the tell that it's not human is that it is too perfectly human.

        However if we could transport people from 2012 to today to run the test on them, none would guess the LLM output was from a computer.

        • skybrian 7 hours ago

          That’s not the Turing Test; it’s just vaguely related. The Turing Test is an interactive party game of persuasion and deception, sort of like playing a werewolves versus villagers game. Almost nobody actually plays the game.

          Also, the skill of the human opponents matters. There’s a difference between testing a chess bot against randomly selected college undergrads versus chess grandmasters.

          Just like jailbreaks are not hard to find, figuring out exploits to get LLM’s to reveal themselves probably wouldn’t be that hard? But to even play the game at all, someone would need to train LLM’s that don’t immediately admit that they’re bots.

        • visarga 12 hours ago

          Yesterday I stumbled onto a well written comment on reddit, it was a bit contrarian, but good. Then I was curious and looked at their comment history and found it was a one month old account with many comments of similar length and structure. I had an LLM read that feed and it spotted LLM writing. And the argument? It was displaying too broad a knowledge across topics. Yes, it gave itself away by being too smart. Does that count as a Turing test fail?

      • phainopepla2 a day ago

        > No llm could pass for a person for more than a couple minutes of chatting

        I strongly doubt this. If you gave it an appropriate system prompt with instructions and examples on how to speak in a certain way (something different from typical slop, like the way a teenager chats on discord or something), I'm quite sure it could fool the majority of people

    • 8n4vidtmkvmk 20 hours ago

      I still haven't witnessed a serious attempt at passing the Turing test. Are we just assuming it's been beaten, or have people tried?

      Like if you put someone in an online chat and ask them to identify if the person they're talking to is a bot or not, you're telling me your average joe honestly can't tell?

      A blog post or a random HN comment, sure, it can be hard to tell, but if you allow some back and forth.. i think we can still sniff out the AIs.

      • akoboldfrying 17 hours ago

        A couple of months ago I saw a paper (can't remember if published or just on arxiv) in which Turing's original 3-player Imitation Game was played with a human interrogator trying to discern which of a human responder and an LLM was the human. When the LLM was a recent ChatGPT version, the human interrogator guessed it to be the human over 70% of the time; when the LLM was weaker (I think Llama 2), the human interrogator guessed it to be the human something like 54% of the time.

        IOW, LLMs pass the Turing test.

        • knollimar 5 hours ago

          The prompt for the LLM was to respond with short phrases, though. I don't know if that's fair since it hides it when there is useful utility.

    • fuzzy2 12 hours ago

      To all of these I can only say: in the hands of a domain-expert user, AI tools really shine.

      For example, artists can create incredible art, and so can AI artists. But me, I just can't do it. Whatever art I have generated will never have the creative spark. It will always be slop.

      The goalposts haven't moved at all. However, the narrative would rather not deal with that.

    • webdood90 a day ago

      > blue collar work

      I don't think it's fair to qualify this as blue collar work

      • knollimar a day ago

        I'm double replying to you since the replies are disparate subthreads. This is the necessary step so the robots who can turn wrenches know how to turn them. Those are near useless without perfect automated models.

        Anything like this will have trouble getting adopted since you'd need these to work with imperfect humans, which becomes way harder. You could bankroll a whole team of subcontractors (e.g. all trades) using that, but you would have one big liability.

        The upper end of the complexity is similar to EDA in difficulty, imo. Complete with "use other layers for routing" problems.

        I feel safer here than in programming. The senior guys won't be automated out any time soon, but I worry for Indian drafting firms without trade knowledge; the handholding I give them might go to an LLM soon.

      • knollimar a day ago

        It is definitely not. Entry pay is 60k and the senior guys I know make about 200k in HCoL areas. A few wear white dress shirts every day.

fngjdflmdflg a day ago

These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.

[0] https://annas-archive.org/blog/critical-window.html

  • levocardia a day ago

    This is a really interesting "data flywheel" -- better model >> more usable data >> even better model

    • tills13 a day ago

      surely there's an upper limit to this though with models literally eating themselves.

      • visarga 12 hours ago

        Not always, you can improve the loop by putting something real inside, like, a code execution tool, a search engine, a human, other AIs or an API. As long as the model can make use of that external environment its data can improve. By the same logic a human isolated from other humans for a long time might also be in a situation of going crazy.

        Practical example - using LLMs to create deep research reports. It pulls over 500 sources into a complex analysis, and after all that compiling and contrasting it generates an article with references, like a wiki page. That text is probably superior to most of its sources in quality. It does not trust any one source completely, it does not even pretend to present the truth, it only summarizes the distribution of information it found on the topic. Imagine scaling wikipedia 1000x by deep-reporting every conceivable topic.

      • Workaccount2 13 hours ago

        They already purposely train them on their own output, it's called synthetic training data.

      • Choco31415 19 hours ago

        We can wait for that to start appearing in tests or benchmarks first.

      • jeffbee a day ago

        When a human student learns to read more carefully we don't consider that a negative.

  • kridsdale3 a day ago

    More Data for the Data Gods!

djoldman a day ago

Interesting "ScreenSpot Pro" results:

    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

https://arxiv.org/abs/2504.07981

  • simonw a day ago

    I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...

    • daemonologist a day ago

      Agreed, GPT-5 and even 5.1 is noticeably bad at OCR. OCRArena backs this up: https://www.ocrarena.ai/leaderboard (I personally would rank 5.1 as even worse than it is there).

      According to the calculator on the pricing page (it's inside a toggle at the bottom of the FAQs), GPT-5 is resizing images to have a minor dimension of at most 768: https://openai.com/api/pricing/ That's ~half the resolution I would normally use for OCR, so if that's happening even via the API then I guess it makes sense it performs so poorly.
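      A quick sketch of what a 768-pixel cap on the minor dimension does to a scanned page, assuming a plain aspect-ratio-preserving downscale. `fit_minor_dimension` is a hypothetical helper, and the cap is only the figure read off the pricing calculator above, not a documented resize routine.

```python
def fit_minor_dimension(width, height, max_minor=768):
    """Scale (width, height) down so the shorter side is at most max_minor."""
    minor = min(width, height)
    if minor <= max_minor:
        return width, height  # already small enough, leave untouched
    scale = max_minor / minor
    return round(width * scale), round(height * scale)

# A US Letter page scanned at 300 dpi is 2550 x 3300 pixels.
print(fit_minor_dimension(2550, 3300))
```

      Under that assumption the page comes out around 768 x 994, which works out to roughly 90 dpi on a Letter-size page.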

      • datadrivenangel 7 hours ago

        and GPT4 was pretty decent at OCR, so that's weird?

  • jasonjmcghee a day ago

    That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?

    • energy123 a day ago

      This is my default explanation for visual impairments in LLMs, they're trying to compress the image into about 3000 tokens, you're going to lose a lot in the name of efficiency.

    • ericd a day ago

      I found much better results with smallish UI elements in large screenshots on GPT by slicing it up manually and feeding them one at a time. I think it does severely lossy downscaling.

    • zubiaur a day ago

      It has a rather poor max resolution. Higher resolution images get tiled up to a point. 512 x 512, I think is the max tile size, 2048 x 2048 the max canvas.
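      Taking those numbers at face value (512 x 512 tiles, 2048 x 2048 canvas, both recalled from memory above rather than confirmed), a rough sketch of how many tiles a screenshot would turn into. `tile_count` and the shrink-then-tile order are assumptions for illustration.

```python
import math

def tile_count(width, height, tile=512, max_canvas=2048):
    """Count tiles after shrinking the image to fit inside max_canvas."""
    # Shrink only if the longer side exceeds the canvas; never upscale.
    scale = min(1.0, max_canvas / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return math.ceil(w / tile) * math.ceil(h / tile)

# A 4K screenshot, 3840 x 2160.
print(tile_count(3840, 2160))
```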

  • agentifysh a day ago

    impressive.....most impressive

    it's going to reach low 90s very soon if trends continue

simonw a day ago

In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.

  • TechRemarker a day ago

    Love how employee portals for many companies essentially never get updated design wise over the decades, lol. That page styling and the balls certainly take me back.

    • dekhn a day ago

      Literally decades: the login page looked like that when I joined google in 2007.

      • cubefox 17 hours ago

        Except for the updated Google logo.

    • inerte a day ago

      I used to work for a company where the SSO screen had a nice corporate happy people at the office type of image. 25 MB. I was in Brazil on a crappy roaming 2G service and couldn't log in at all. I know most of the work happens on desktop but geee.....

      Oh speaking on mobile, I remember when I tried to use Jira mobile web to move a few tickets up on priority by drag and dropping and ended up closing the Sprint. That stuff was horrible.

    • jamiek88 a day ago

      Wow yeah. Flashbacks to when Gmail Invites were cool! Google too.

  • rohanlikesai a day ago

    hey, it's Rohan (the author of the article) - appreciate you catching this, we just fixed this!

    • sumedh a day ago

      You should try using AI to check such things :)

  • buildbot a day ago

    I’m a little surprised how open the help links are… I guess that if you need help logging in you can’t be expected to, well, log in.

  • ed a day ago

    Same with "See prompt in Google AI Studio" which links to an unpublished prompt in AI Studio.

mhl47 18 hours ago

We are currently working on some Christmas puzzles that are - I would say - a bit more difficult from the visual side. GPT5.1 completely failed at all of them while Gemini 3 solved two till now, which I would consider rather impressive.

One was two screenshots of a phone screen with chats that are timestamped and it had to take the nth letter of the mth word based on the timestamp. While the type of riddle could be in the training data the ability to OCR this that well and understand the spatial relation to each object perfectly is something I have not seen from other models yet.

  • devttyeu 15 hours ago

    Visual puzzle solving is a pretty easily trainable problem due to it being simple to verify, so that skill getting really good is just a matter of time

TheAceOfHearts a day ago

Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.

Here's the output from two tests I ran:

1. Asking Nano Banana Pro to solve the word search puzzle directly [1].

2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].

The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem soon.

There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.

[0] https://imgur.com/ekwfHrN

[1] https://imgur.com/1nybezU

[2] https://imgur.com/18mK5i5
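The two bits of nuance above (matching the casing and dropping the space in "soup mix") are easy to see in code. This is a minimal sketch of a brute-force word search solver, not what Gemini's code execution actually ran; the grid is a tiny made-up one, not the puzzle from [0].

```python
# All eight directions a word can run in: horizontal, vertical, diagonal.
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def normalize(word):
    """Uppercase and drop spaces, so "soup mix" matches SOUPMIX in the grid."""
    return word.replace(" ", "").upper()

def find_word(grid, word):
    """Return ((row, col), (d_row, d_col)) of the first match, or None."""
    target = normalize(word)
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                # Skip directions where the word would run off the grid.
                end_r = r + dr * (len(target) - 1)
                end_c = c + dc * (len(target) - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(grid[r + dr * i][c + dc * i] == ch for i, ch in enumerate(target)):
                    return (r, c), (dr, dc)
    return None

# Tiny made-up grid (uppercase, like the real puzzle's grid).
grid = [
    "SOUPMIX",
    "AXXXXXX",
    "LXXXXXX",
    "TXXXXXX",
]
print(find_word(grid, "soup mix"))
print(find_word(grid, "salt"))
```

Without `normalize`, the lowercase word list never matches the uppercase grid and "soup mix" is never found, which matches the nudging the older model needed.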

  • genrader a day ago

    If you're using for instance the Gemini web app there may be a preference in the system prompt to immediately favor the fact that you said to create an image when in fact it may have been better to initially start with a regular chat prompt, making sure you're on Gemini 3 Pro thinking, and then give it exactly what you usually would. You can tell it that after it has an answer to the question then to create an image for it.

    This may even work if you tell it to do all that prior to figuring out what to create for the image.

    • TheAceOfHearts a day ago

      I just used Nano Banana Pro from LMArena, but if you have access to a paid account I'd love to see you try it out! I just gave it the puzzle image as an input along with the prompt: "Please solve this word search puzzle".

      For generating the prompt which included the word positions I had Gemini 3 Pro do that using the following prompt: "Please try to solve this word search puzzle. Give me the position of each word in the grid. Then generate a prompt which I can pass to Nano Banana Pro, which I will pass along with the same input image to see if Nano Banana Pro is able to properly highlight all the words if given their correct position."

hodder a day ago

"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."

Prompt: "wine glass full to the brim"

Image generated: 2/3 full wine glass.

True visual and spatial reasoning denied.

  • minimaxir a day ago

    Gemini 3 Pro is not Nano Banana Pro, and the image generation/model that decodes the generated image tokens may not be as robust.

    The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.

    • hodder a day ago

      As a consumer I typed this into "Gemini". The behind the scenes model selection just adds confusion.

      If "AI" trust is the big barrier to widespread adoption of these products, Alphabet soup isn't the solution (pun intended).

  • spchampion2 a day ago

    I actually did this prompt and found that it worked with a single nudge on a followup prompt. My first shot got me a wine glass that was almost full but not quite. I told it I wanted it full to the top - another drop would overflow. The second shot was perfectly full.

    • RyJones a day ago

      The correction I expect to give to an intern, not a junior person.

      • IncreasePosts 10 hours ago

        your intern can generate and edit photorealistic renderings of wine glasses? Still not bad.

    • ugh123 a day ago

      did it return the exact same glass and surrounding imagery, just with more wine?

  • zmmmmm a day ago

    do it the other way - give it images of wine glasses and ask it whether they are full to the brim. I suspect it's going to nail them all (mainly because Qwen-VL already does nail things like that).

aziis98 a day ago

> Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time

Does somebody know how to correctly prompt the model for these tasks, or even better, can someone point to some docs? The pictures with the pretty markers are appreciated, but that section is a bit vague and lacks references.

  • atonse a day ago

    For my CMS I’d love to get an AI to nicely frame a picture in certain aspect ratios. Like if I provide an image, give me coordinates for a widescreen, square, portrait, and 4x3 crop using a photographer's eye.

    Any model that can do that? I tried looking in huggingface but didn’t quite see anything.
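
    For reference, the purely geometric half of this is simple. Here's a rough sketch of the largest centered crop for each ratio; the function name and the 4000x3000 example size are just placeholders. The "photographer's eye" part, choosing where the box should sit instead of dead center, is what would still need a model.

```python
def centered_crop(width, height, ratio_w, ratio_h):
    """Largest centered crop (left, top, right, bottom) with ratio_w:ratio_h."""
    # Compare aspect ratios with integer math to avoid float rounding.
    if width * ratio_h > height * ratio_w:
        # Image is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * ratio_w // ratio_h
    else:
        # Image is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * ratio_h // ratio_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Placeholder image size, purely for illustration.
RATIOS = {"widescreen": (16, 9), "square": (1, 1), "portrait": (3, 4), "4x3": (4, 3)}
for name, (rw, rh) in RATIOS.items():
    print(name, centered_crop(4000, 3000, rw, rh))
```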

ed a day ago

What’s new here? I believe this is just gemini 3 which was released last month (the model id hasn’t changed AFAICT)

  • minimaxir a day ago

    Nothing new, it's just highlighting practical vision use cases.

siva7 a day ago

Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something that mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.

  • minimaxir a day ago

    That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.

    • siva7 a day ago

      What's the difference? I thought the vision ai component of gemini 3 is called nano banana?

      • IanCal a day ago

        That’s about generating images, the other side is about understanding images.

      • brokensegue a day ago

        i assumed nano banana was just a tool that gemini 3 used though i don't know

        • minimaxir a day ago

          Gemini 3 Pro's text encoder powers Nano Banana Pro, but it has its own image decoding model that decodes the generated image tokens into an actual image, which appears to be the more pertinent issue in this case.

  • andy12_ 13 hours ago

    That's weird, from my own tests Nano banana pro has no problem generating complex infographics with legible text.

devinprater a day ago

Audio-described YouTube, please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.

  • SXX a day ago

    BTW I asked for a detailed narrative description of another, purely benchmarking Zelda video, using 5-second snapshots:

    Video: Zelda TOTK, R5 5600X, GTX 1650, 1080p 10 Minute Gameplay, No Commentary

    https://www.youtube.com/watch?v=wZGmgV-8Rbo

    The narrative description source and command can be found here:

    https://gist.github.com/ArseniyShestakov/47123ce2b6b19a8e6b3...

    Then I converted it into narrative voice over with Gemini 2.5 Pro TTS:

    https://drive.google.com/file/d/1Js2nDtM7sx14I43UY2PEoV5PuLM...

    It's somewhat desynced from the original video, and the voice-over takes nine and a half minutes instead of the video's 10, but the description of what's happening on screen is quite accurate.

    PS: I used 144p video, so details could also be messed up because of the poor quality. And of course I specifically asked for a narrative-like description.

  • SXX a day ago

    Hey, I just ran a simple test on a downloaded 5-minute YouTube video by uploading it to the Gemini app.

    Source video title: Zelda: Breath of the Wild - Opening five minutes of gameplay

    https://www.youtube.com/watch?v=xbt7ZYdUXn8

    Prompt:

       Please describe what happening in each scene of this video.
       
       List scenes with timestamp, then describe separately:
       - Setup and background, colors
       - What is moving, what appear
       - What objects in this scene and what is happening,
       
       Basically make desceiption of 5 minutes video for a person who cant watch it.
    
    The result is in a GitHub gist since there's too much text:

    https://gist.github.com/ArseniyShestakov/43fe8b8c1dca45eadab...

    I'd say this is quite accurate.

  • SXX 19 hours ago

    And yeah, I just checked AI Studio. One hour of Witcher 3: Blood and Wine gameplay in 144p is 70MB and only 300,000 tokens. And it's pretty easy to create a scene-by-scene description.
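
    Back-of-the-envelope on those numbers, as a sketch: 300,000 tokens over one hour works out to roughly 83 tokens per second of video. The 1,000,000-token budget below is a made-up figure for illustration, not a claim about any model's actual limit.

```python
def tokens_per_second(total_tokens, duration_seconds):
    """Average token cost of one second of video."""
    return total_tokens / duration_seconds


def hours_that_fit(token_budget, tokens_per_hour):
    """How many hours of similar video fit in a given token budget."""
    return token_budget / tokens_per_hour


# Figures from the test above: 300,000 tokens for one hour of 144p video.
rate = tokens_per_second(300_000, 60 * 60)
print(f"{rate:.1f} tokens per second")

# HYPOTHETICAL budget, chosen only to make the arithmetic concrete.
print(f"{hours_that_fit(1_000_000, 300_000):.1f} hours")
```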

hackeruser741 a day ago

It's fascinating how these models struggle with simple counting or novel configurations like a 5-legged dog or a 13-hour clock, despite excelling at complex language tasks. It highlights the difference between learning patterns from vast datasets and true conceptual understanding.

MostlyStable a day ago

Going to compare this to our current solution of Amazon's Textract service for analyzing handwritten datasheets. Textract, when extracting tables (which is what we use it for), does not allow for providing any context or information about the tables and what we expect them to contain, but it is really good at correctly recognizing handwritten characters. All of the less specialized, more general models I've tried let me provide that context, which is helpful in some ways, but they fail at the basic task of reliably recognizing each character.

Hopefully Google pro marries the two together.

  • lysecret 16 hours ago

    2.5 pro is already excellent at this.

axpy906 a day ago

So Gemini was the most non-deterministic model of them all and now we get this one with temperature at 1 and max thinking. It’s so random that it’s hard to justify putting in my setup right now.

caseyf a day ago

I'm playing with this and wondering if this is an actually good way to identify dominant colors and other features of a garment/product when using a photo where the item is styled and not isolated from the model or other garments

  • sublimefire 17 hours ago

    There should be an existing, simpler way to do it. An image is just a bunch of pixels, so you could group them to see the main colors.
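
    Something like this rough sketch, which rounds each RGB channel into coarse buckets and counts them. The bucket size of 32 and the sample pixels are arbitrary choices of mine, and loading real pixels (with whatever imaging library you use) is left out. Note that it does nothing to separate the garment from the model or background.

```python
from collections import Counter


def dominant_colors(pixels, top_n=3, bucket=32):
    """Group RGB pixels into coarse buckets and return the most common ones.

    pixels is any iterable of (r, g, b) tuples with values in 0..255.
    Returns a list of ((r, g, b), count) pairs, most frequent first.
    """
    counts = Counter(
        (r // bucket * bucket, g // bucket * bucket, b // bucket * bucket)
        for r, g, b in pixels
    )
    return counts.most_common(top_n)


# Made-up sample: mostly near-red pixels, plus a couple of blue ones.
SAMPLE = [(250, 10, 10)] * 5 + [(245, 5, 12)] * 3 + [(10, 10, 200)] * 2
print(dominant_colors(SAMPLE, top_n=2))
```

    The two slightly different reds land in the same bucket, which is the whole point of grouping before counting.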

a-dub a day ago

i like to put it in live mode and point it at my plants and have conversations about how they're doing. it properly identifies them and flags any signs of disease and then provides correct next steps.

k8sToGo a day ago

When will we get Gemini 3 Flash?

  • sumedh a day ago

    Probably waiting to see when OpenAI announces their new models.

themafia 14 hours ago

"the frontier"

I've never hated industry infatuation with a buzzword more.

  • Spacecosmonaut 11 hours ago

    How about "bleeding edge"

    • vharish 9 hours ago

      Nope... frontier takes the cup! :D

jonplackett a day ago

Google really are a fully woken sleeping giant. More code reds being issued today I expect.

causal a day ago

Okay maybe this one isn't an exaggeration when they say leap forward

drivebyhooting a day ago

Screen understanding is huge for further automating dev work.

pseudosavant a day ago

I'm really fascinated by the opportunities to analyze videos. The number of tokens a video compresses down to, and what you can reason across those tokens, is incredible.

  • minimaxir a day ago

    The actual token calculations for input videos with Gemini 3 Pro are...confusing.

    https://ai.google.dev/gemini-api/docs/media-resolution

    • pseudosavant a day ago

      That is because it isn't actually tokens that are fed into the model for non-text. For text, it is tokenized, and each token has a specific set of vectors. But with other media, they've trained encoders that analyze the media and produce a set of vectors that are the same "format" as the token's vectors, but it isn't actually ever a token.

      Most companies have rules for how many tokens the media should "cost", but they aren't usually exact.

ch2026 a day ago

what framework is being utilized for computer use here?

iamjackg a day ago

Curious how this will fare when playing Pokemon Red.

  • minimaxir a day ago

    Gemini 3 Pro has been playing Pokemon Crystal (which is significantly harder than Red) in a race against Gemini 2.5 Pro: https://www.twitch.tv/gemini_plays_pokemon

    Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.

    • theLiminator a day ago

      I think what would be interesting is if it could play the game with vision-only inputs. That would represent a massive leap in multimodal understanding.

  • euvin a day ago

    Yeah the "High frame rate understanding" feature caught my eye, actual real time analysis of live video feeds seems really cool. Also wondering what they mean by "video reasoning/thinking"?

    • skybrian a day ago

      I don’t think it’s real time? The videos were likely taken previously.

  • danso a day ago

    > 3. Turning long videos into action: Gemini 3 Pro bridges the gap between video and code. It can extract knowledge from long-form content and immediately translate it into functioning apps or structured code

    I'm curious as to how close these models are to achieving that once long-ago mocked claim (by Microsoft I think?) that AIs could view gameplay video of long lost games and produce the code to emulate them.

stego-tech a day ago

The document paints a super impressive picture, but the core constraint of “network connection to Google required so we can harvest your data” is still a big showstopper for me (and all cloud-based AI tooling, really).

I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).

  • bgwalter a day ago

    Arpanet was supposed to be decentralized. Now everyone wants to centralize everything so in a war it is sufficient to strike 100 data centers and the whole tethered economy collapses.

    That is called progress.

    EDIT: You can downvote the truth but still no one wants your "AI" slop.

    • stego-tech a day ago

      Ah, the fond memories of telnetting to NCSA to upload the raw HTML of my first website, written on an OG Macintosh computer and ported via floppy to a PowerMac for network connectivity.

      Simple, elegant. I do miss those days.

  • oklahomasports a day ago

    People with your concerns probably make up 1% of the market, if that. Also I don’t upload stuff I’m worried about Google seeing. I wonder if they will allow special plans for corporations.

    • stego-tech a day ago

      I’m very curious where you get that number from, because I thought the same thing until I got a job inside that market and realized how much more vast it actually is. The revenue numbers might not be as big as Big Tech, but the product market is shockingly vast. My advice is not to confuse Big Tech revenues for total market size, because they bring in such revenue by catering to everyone, rather than specific segments or niches; a McDonald’s will always do more volume than a steakhouse, but it doesn’t mean the market for steakhouses is small enough to ignore.

      As for this throwaway line:

      > Also I don’t upload stuff I’m worried about Google seeing.

      You do realize that these companies harvest even private data, right? Like, even in places you think you own, or that you pay for, they’re mining for revenue opportunities and using you as the product even when you’re a customer, right?

      > I wonder if they will allow special plans for corporations

      They do, but no matter how much redlining Legal does to protect IP interests, the consensus I keep hearing is “don’t put private or sensitive corporate data into third-parties because no legal agreement will sufficiently protect us from harm if they steal our IP or data”. Just look at the glut of lawsuits against Apple, Google, Microsoft, etc from smaller companies that trusted them to act in good faith but got burned for evidence that you cannot trust these entities.

    • _trampeltier a day ago

      Especially since Trump, which non-US company would trust a US company and invest its know-how there? And then there are also governments. Especially since Trump, it's way too risky to send any data to a US company.

bovermyer a day ago

I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.

genrader a day ago

This is an excellent, short way to see that Gemini 3 Pro is substantially better at understanding the data you give it.

Asking it correctly for how it should give you the info is still a skill many people lack.

kkukshtel a day ago

sounds awesome but too bad it is impossible to figure out how to actually use these models and what I have to pay for/where

Frannky a day ago

It's a good model. I worry that they will be able to win the game by offering the best service for free, thanks to selling users' data, kind of like search, email, etc. It's sad. Not that the alternatives are better... You either trust sycophantic ChatGPT backed by Scama, go with woke Claude (they once banned my account for asking how some news was trying to influence me), Grok that feels like a 20-year-old sure about stuff that doesn't work, or Chinese models that are agenda-aligned...

romanovcode 18 hours ago

I gotta say - processing video at 10fps is very impressive.

dmarzio a day ago

So we’re going to use this to make the maid from the Jetsons finally. Right?

ichik a day ago

Frankly, it's insane how laughably bad their own examples are under scrutiny. It both distorted the data and made the chart less readable (label placement, segment separation, missing labels, worse contrast). And it combined them into one, so you'll have a harder time comparing them than in the original image! Isn't it amazing that it added a toggle? The post author even seems to think it deserves an exclamation point.

empressplay a day ago

Yes, but can it play PacMan yet?

agentifysh a day ago

im realizing how much of a bottleneck vision models are

im just a glorified speedreadin' promptin' QA at this point with codex

once it replaces the QA layer its truly over for software dev jobs

future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"

edit: saw the Screenspot benchmark and holy ** this is an insane jump!!! 11% to 71% even beating Opus 4.5's 50%...chatgpt is at 3.5% and it matches my experience with codex

  • alex1138 a day ago

    > once it replaces the QA layer its truly over for software dev jobs

    Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed. I-promise-not-to-sue-company-if-AI-deletes-the-entire-database

    It won't happen for that reason alone. Oh who am I kidding of course it will