Really cool! I wonder what would happen if you used a different way of eliciting the model's mental maps of the world. For example, you could ask "Start in Paris. Move X miles north and Y miles east. Are you now on land or water?" I predict that the result would be a distorted map which is somewhat more accurate around Paris but less accurate farther away.
One thing that might be interesting is asking for SVGs, and seeing if the errors in these maps match up with corresponding errors in the SVGs, suggesting a single global data store.
Also, this is a good reminder of what a huge and bewildering variety of LLMs there are these days.
Curated. This was wholesome curious fun. It's not quite the kind of post that we typically curate, but we can make exceptions every once in a while for aesthetically enjoyable curiosities like this. Good job with doing the work to actually generate all these interesting images.
Very cool! I decided to try the same with Mandelbrot. For reference, this is what it should roughly look like:
And below is what it actually looked like when querying GPT-4o and using the logprobs of the "0" and "1" tokens. I was going with the prompt[1] "Is c = ${re} + ${im}i in the Mandelbrot set? Reply only 1 if yes, 0 if no. No text, just number."
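Roughly, each pixel's query looked like this (a minimal sketch, not my actual script; the OpenAI SDK call shape and the -100.0 fallback are illustrative):

```python
# Estimate P(c in Mandelbrot set) from the logprobs of the "1"/"0" tokens.
import math
from openai import OpenAI

client = OpenAI()

def p_in_set(re: float, im: float) -> float:
    prompt = (f"Is c = {re} + {im}i in the Mandelbrot set? "
              "Reply only 1 if yes, 0 if no. No text, just number.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    lp = {t.token: t.logprob for t in top}
    lp1 = lp.get("1", -100.0)  # treat a missing token as "effectively never"
    lp0 = lp.get("0", -100.0)
    return math.exp(lp1) / (math.exp(lp1) + math.exp(lp0))  # two-way softmax
```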
(result is in a collapsible section so you can make a prediction what level of quality you'd expect):
GPT-4o:
A bit underwhelming; I would have thought it was better at getting the very basic structure right. At least it does seem to know where the "centers" are, i.e. the pronounced vertical bars you see align very well with the bigger areas of the original.
To be fair, in an earlier test, I had a longer and slightly different prompt (that should have yielded about the same results, or so I thought), and GPT-4o gave me this, which looks a bit better:
Sadly, I don't remember what the exact prompt was, and I wasn't using version control at that stage. Whoops.
I wanted to try GPT-5 or GPT-5-mini as well, but it turns out there is no way to disable reasoning for them in the API. This a) makes the whole exercise much more expensive (even though GPT-5 is cheaper per token than 4o) and b) defeats the purpose a bit, since reasoning might let the model actually run the numbers to some degree; after all, these models know the formula and how to multiply complex numbers at probably-not-terrible accuracy (maybe? Actually, not so sure, will test this).
For the record, the larger GPT-4o picture cost about $3 in credits.
I only now realize that this might yield slightly worse results for negative imaginary parts, as "c = 1.5 + -1i" looks odd and may throw the model off a bit. Oh well.
This is really cool. It's interesting that many of them seem to be able to render New Zealand clearly as a separate landmass, but struggle to separate Madagascar from Africa. Actually, looking at it some more, the whole Indian Ocean seems like a serious weak spot for all but Grok.
It sounds like you're rendering each pixel in a separate context, right? So in addition to not being able to see Earth directly, the model can't "see" its own map. If so, I wonder how different the answers would be if you were to try and ask it to render the whole thing in one chat, starting from the top-left and having it guess one pixel at a time. (I'm sure this would be much more expensive to test.)
try and ask it to render the whole thing in one chat
I tried, but received fairly messy results, of which Grok 3's was the best. And I also received two or three jokes from GPT-5.
I’d be interested to see what happens if you ask “is this land or water?” in other languages. If you asked in Japanese, would Asia render better?
Here's the chat fine-tune. I would not have expected such a dramatic difference. It's just a subtle difference in post-training; Llama 405b's hermes-ification didn't have nearly this much of an effect. I welcome any hypotheses people might have.
This looks like what happens when you turn the contrast way up in an image editor and then play with the brightness. Something behind the scenes is weighting the overall probabilities more toward land, and then there is a layer on top that increases the confidence/lowers variance.
Would you let us know how much money/credits you spent on it overall, and separately, how many hours on your laptop, and how much RAM?
Sure: ~$100 between API credits (the majority of the cost from proprietary models) and cloud GPUs. A few of the smaller models were evaluated on my M4 MacBook Pro with 24 GB of unified RAM. For larger open-weight models, I rented A100s. Most runs took about 20 minutes at the 2-degree resolution.
Curious, as I'm experimenting with LLM stuff myself these days: where did you rent the A100s? I suppose it comes out cheaper than paying OpenAI or Anthropic for credits?
You can easily and somewhat cheaply get access to A100s with Google Colab by paying for the Pro subscription or just buying compute outright. They sell "compute credits" which are pretty opaque; it's hard to say how much usage time you'll get with X credits.
Prime Intellect
I assume that's an Amazon thing, but man, that is unfortunate naming to anyone sufficiently familiar with web fiction (and possibly intentionally cheeky that way).
I suppose subtlety is brain-dead, but its body will remain forcefully kept alive, hooked up to machines, until someone launches an AI-powered defense system literally called Skynet.
This made me wonder: would the result be a lot different if we used https://what3words.com/?
On one hand, it's a more "natural" format for the LLM. On the other, it's a much newer concept than coordinates so probably not quite as rich a presence in the training set.
It's also a lot less interpolatable: if you know that 15° N, 12° E is land, and 15° N, 14° E is land, you can be reasonably certain that 15° N, 13° E will also be land.
On the other hand if you know that virtually.fiercer.admonishing is land, and you know that correlative.chugging.frostbitten is land, that tells you absolutely nothing about leeway.trustworthy.priority - unless you also happen to know that they're right next to each other.
(unless what3words has some pattern in the words that I'm not aware of)
No, you're absolutely right. I actually tried asking GPT-5 about a w3w location and even with web search on it concluded that it was probably sea, because it couldn't find anything at that address... and the address was in Westminster, London.
So despite words being more of the "language" of an LLM, it was still much much worse at it for all the other reasons you said.
There is also fixphrase.com, where neighboring squares typically share the first three out of four words, so I suspect that might work better in theory, though it's probably absent from the training data in practice.
If this location is over land, say 'Land'. If this location is over water, say 'Water'. Do not say anything else. x° S, y° W
Really curious how humans would perform on this.
Humans would draw a map of the world from memory, overlay the grid, and look up the reference. I doubt that the LLMs do this. It would be interesting to see whether they can actually relate the images to the coordinates. I suspect not, i.e. I expect that they could draw a good map with gridlines from training data, but would be unable to relate the visual to the question. I expect that they are working from coordinates in Wikipedia articles and the CIA website. Another suggestion would be to ask the LLM to draw a map of the world with non-standard grid lines, e.g. every 7 degrees.
Is this coming just from the models having geographic data in their training? Much less impressive if so but still cool.
I can't be sure what's in the data, but we have a few hints:
The exact question ("is this land or water?") is, of course, very unlikely to be in the training corpus. At the very least, the models contain some multi-purpose map of the world. Further experimentation I've done with embedding models confirms that we can extract maps of biomes and country borders from embedding space too.
There's definitely compression. In smaller models, the ways in which the representations are inaccurate actually tell us a lot: instead of spikes of "land" around population centers (which are more likely to be in the training set), we see massive smooth elliptical blobs of land. This indicates that there's some internal notion of geographical distance, and that it's identifying continents as a natural abstraction.
This is pretty cool. As for Opus, could you just use it for "free" by running it in Claude Code, using your account's built-in usage limits?
Edit: That might also work for gemini-cli and 2.5 Pro.
This is cool. Interesting to see how some models are wrong in certain particular ways: Qwen 72B is mostly right, but it thinks Australia is huge; Llama 3 has a skinny South America and a bunch of patches of ocean in Asia.
Super cool! Did you use thought tokens in any of the reasoning models for any of this? I'm wondering how much adding thinking would increase the resolution.
This is excellent. It reminds me of theoretical vs experimental physics. Actual experiments to probe what is going on in the black box seem unintuitive to me and I really appreciate when someone can explain it so clearly. Interpretability is going to reveal so much about our minds and the machine minds.
Fun project.
I think these kinds of pictures 'underestimate' models' geographical knowledge. Just imagine having a human perform this task. The human may have very detailed geographical knowledge, may even be able to draw a map of the world from memory. This does not imply that they would be able to answer questions about latitude and longitude.
I think it does. Certainly the way that I would do it would be to create a world map from memory, then overlay the coordinate grid, then just answer by looking it up. Your answers will be as good as your map is. I believe that the LLMs most likely work from Wikipedia articles; there are a lot of location pages with coordinates in Wikipedia.
Simple, visual, and it lends another data point to what many of us suspected about GPT-4 in comparison to other frontier labs' models. Even now, it still has something special to it that is yet to be replicated by many others.
Great work and thank you for sharing.
Sometimes I'm saddened remembering that we've viewed the Earth from space. We can see it all with certainty: there's no northwest passage to search for, no infinite Siberian expanse, and no great uncharted void below the Cape of Good Hope. But, of all these things, I most mourn the loss of incomplete maps.
In the earliest renditions of the world, you can see the world not as it is, but as it was to one person in particular. They’re each delightfully egocentric, with the cartographer’s home most often marking the Exact Center Of The Known World. But as you stray further from known routes, details fade, and precise contours give way to educated guesses at the boundaries of the creator's knowledge. It's really an intimate thing.
If there's one type of mind I most desperately want that view into, it's that of an AI. So, it's in this spirit that I ask: what does the Earth look like to a large language model?
With the following procedure, we'll be able to extract an (imperfect) image of the world as it exists in an LLM's tangled web of internal knowledge.
First, we sample latitude and longitude pairs evenly[1] from across the globe. The resolution at which we do so depends on how costly/slow the model is to run. Of course, thanks to the Tyranny Of Power Laws, a 2x increase in subjective image fidelity takes 4x as long to compute.
Then, for each coordinate, we ask an instruct-tuned model some variation of:
If this location is over land, say 'Land'. If this location is over water, say 'Water'. Do not say anything else. x° S, y° W
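Concretely, the sampling loop looks something like this (a minimal sketch rather than my exact code; the step size and formatting helper are illustrative):

```python
# Walk an evenly spaced lat/lon grid and build one prompt per cell.
# A naive grid like this oversamples the poles (see footnote [1]).
def coordinates(step_deg: float = 2.0):
    lat = -90.0
    while lat <= 90.0:
        lon = -180.0
        while lon < 180.0:
            yield lat, lon
            lon += step_deg
        lat += step_deg

def make_prompt(lat: float, lon: float) -> str:
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return ("If this location is over land, say 'Land'. "
            "If this location is over water, say 'Water'. "
            f"Do not say anything else. {abs(lat)}° {ns}, {abs(lon)}° {ew}")
```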
The exact phrasing doesn't matter much, I've found. Yes, it's ambiguous (what counts as "over land"?), but these edge cases aren't a problem for our purposes. Everything we leave up to interpretation is another small insight we gain into the model.
Next, we simply find the logprobs for "Land" and "Water"[2] within the model's output, and softmax the two, giving probabilities that sum to 1.
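In code, this step is tiny (a sketch; how you fetch the two logprobs depends on the provider's API):

```python
import math

def land_probability(logprob_land: float, logprob_water: float) -> float:
    # Two-way softmax: renormalize just these two tokens so they sum to 1.
    e_land, e_water = math.exp(logprob_land), math.exp(logprob_water)
    return e_land / (e_land + e_water)
```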
Note: If no APIs provide logprobs for a given model, and it's either closed or too unwieldy to run myself, I'll approximate the probabilities by sampling a few times per pixel at temperature 1.
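The approximation itself is nothing fancy (a sketch, where `ask_model` is a hypothetical stand-in for whichever client call applies):

```python
# Approximate P(Land) as the empirical frequency of "Land" replies over
# n samples drawn at temperature 1.
def approx_land_probability(prompt: str, ask_model, n: int = 4) -> float:
    hits = sum(ask_model(prompt).strip().lower().startswith("land")
               for _ in range(n))
    return hits / n
```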
From there, we can put all the probabilities together into an image, and view our map. The output projection will be equirectangular like this:
I remember my 5th grade art teacher would often remind us students to "draw what you see, not what you think you see". This philosophy is why I'm choosing the tedious route of asking the model about every single coordinate individually, instead of just requesting that it generate an SVG map or some ASCII art of the globe; whatever caricature the model spits out upon request would have little to do with its actual geographical knowledge.
By the way, I'm also going to avoid letting things become too benchmark-ey. Yes, I could grade these generated maps, computing the mean squared error relative to some ground truth and ranking the models, but I think it'll soon become apparent how much we'd lose by doing so. Instead, let's just look at them, and see what we can notice.
Note: This experiment was originally going to be a small aside within a larger post about language models and geography (which I'm still working on), but I decided it'd be wiser to split it off and give myself space to dig deep here.
We'll begin with 500 million parameters and work our way up. Going forward, most of these images are at a resolution of 2 degrees by 2 degrees per pixel.
And, according to the smallest model of Alibaba's Qwen series, it's all land. At least I could run this one on my laptop.
The sun beat down through a sky that had never seen clouds. The winds swept across an earth as smooth as glass.[3]
Tripling the size, there's definitely something forming. "The northeastern quadrant has stuff going on" + "The southwestern quadrant doesn't really have stuff going on" is indeed a reasonable first observation to make about Earth's geography.
And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so. And God called the dry land Earth; and the gathering together of the waters called he Seas.
At 7 billion parameters, Proto-America and Proto-Oceania have split from Proto-Eurasia. Notice the smoothness of these boundaries; this isn't at all what we'd expect from rote memorization of specific locations.
We've got ~Africa and ~South America! Note the cross created by the (x,x) pairs.
Sanding down the edges.
Pausing our progression for a moment, the coder variant of the same base model isn't doing nearly as well. Seems like the post-training is fairly destructive:
Back to the main lineage. Isn't it pretty? We're already seeing some promising results from pure scaling, and plenty larger models lie ahead.
Qwen 3 Coder has 480 billion total parameters, with 35b active per token.
(As we progress through the different families of models, it'll be interesting to notice which recognize the existence of Antarctica.)
This one's DeepSeek-V3, among the strangest models I've interacted with. More here.
Prover seems basically identical. Impressive knowledge retention from the V3 base model. Qwen could take notes.
(n=4 approximation)
I like Kimi a lot. Much like DeepSeek, it's massive and ultra-sparse (1T total parameters, 32b active per token).
The differences here are really interesting. Similar shapes in each, but remarkably different "fingerprints" in the confidence, for lack of a better word.
As a reminder, that's 176 billion total parameters. I'm curious what's going on (or wrong) with expert routing here; it deserves a closer look later.
First place on aesthetic grounds.
Wow, best rendition of the Global West so far. I suspect that its being the only confirmed dense model of its size has something to do with the quality.
In case you were wondering what hermes-ification does to 405b. Notable increase in confidence (mode collapse, more pessimistically).
Most are familiar with the LLaMA 4 catastrophe, so this won't come as any surprise. Scout has 109 billion parameters and it's still put to shame by 3.1-70b.
Bleh. Maverick is the 405b equivalent, in case you forgot. I imagine that the single expert routing isn't helping it develop a unified picture.
Ringworld-esque.
I was inconvenienced several times trying to run this model on my laptop, so once I finally did get it working, I was so thrilled that I decided to take my time and render the map at 4x resolution. Unfortunately it makes every other image look worse in comparison, so it might have been a net negative to include. Sorry.
These are our first sizable multimodal models. You might object that this defeats the title of the post ("it's not blind!"), but I suspect current multimodality is so crude that any substantial improvement to the model's unified internal map of the world would be a miracle. Remember, we're asking it about individual coordinates, one at a time.
Colossus works miracles.
GPT-3.5 had an opaqueness to it that no later version did. Out of all the models I've tested, I think I was most excited to get a clear glimpse into it.
Lower resolution because it's expensive.
Wow, easy to forget just how much we were paying for GPT-4. It costs orders of magnitude more than Kimi K2 despite having the same size. Anyway, comparing GPT-4's performance to other models, this tweet of mine feels vindicated:
Extremely good, enough so to make me think there's synthetic geographical data in 4.1's training set. Alternatively, one might posit that there's some miraculous multimodal knowledge transfer going on, but the sharpness resembles that of the non-multimodal Llama 405b.
I imagine model distillation as doing something like this.
Feels like we hit a phase transition here. Our map does not make the cut for 4.1-nano's precious few parameters.
I've heard that Antarctica does look more like an archipelago under the ice.
I'm desperate to figure out what OpenAI is doing differently.
Here's the chat fine-tune. I would not have expected such a dramatic difference. It's just a subtle difference in post-training; Llama 405b's hermes-ification didn't have nearly this much of an effect. I welcome any hypotheses people might have.
(no logprobs provided by Anthropic's API; using n=4 approximation of distribution)
Claude is costly, especially because I've got to run 4 times per pixel here. If anyone feels generous enough to send some OpenRouter credits, I'll render these in beautiful HD.
Opus is Even More Expensive, so for now, the best I can do is n=1.
Few gemini models give logprobs, so all of this is an n=4 approximation too.
1.5 flash is confirmed to be dense[4]. The quality of the map is only somewhat better than that of Gemma 27b, so that might give some indication of its size.
Ran this one at n=8. Apparently more samples do not smooth out the distribution.
I'm really not sure what's going on with the Gemini series, but it does feel reflective of their ethos. Not being able to get a clear picture isn't helping.
That marks the last of every model I've tested over a couple afternoons of messing around. As stated previously, I'll probably edit this post a fair bit as I try more LLMs and obtain sharper images.
Between Qwen, Gemma, Mistral, and a bunch of fairly different experiments I'm not quite ready to post, I'm beginning to suspect that there's some Ideal Platonic Primitive Representation of The Globe which looks like this:
And which, for models smaller yet, looks like this (at least from the coordinate perspective; the representations obviously diverge in dumber models):
I've shown a lot, but admittedly, I don't yet have answers to many of the questions that all this raises. Here are a few I'd like to tackle next, and which I invite you to explore for yourself too:
Yeah, the coverage isn't actually even-even.
Tokenization isn’t much of an issue. If, say, the model tokenizes "Land" as "La" + "nd", we just look for "La".
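Something like this (sketch):

```python
# Accept any returned token that is a non-empty prefix of the target word,
# e.g. "La" for "Land". Fall back to a very low logprob if nothing matches.
def find_logprob(top_logprobs: dict[str, float], word: str) -> float:
    for token, logprob in top_logprobs.items():
        stripped = token.strip()
        if stripped and word.startswith(stripped):
            return logprob
    return -100.0
```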
This excerpt is from the opening of "Chaos: Making a New Science" by James Gleick, describing one of humanity's earliest meteorological simulations. The imagery just felt fitting.
Hearing some confusion about this on Twitter. 1.5 Pro is MoE, and 1.5 Flash is a dense distillation of 1.5 Pro. See: gemini_v1_5_report.pdf