AI Evaluations · Language Models (LLMs) · Interpretability (ML & AI) · World Modeling · AI
Curated

How Does A Blind Model See The Earth?

by henry · 11th Aug 2025 · 8 min read
Linkpost from outsidetext.substack.com
38 comments, sorted by top scoring
reallyeli · 2mo · 124

If any post ever deserved the "World modeling" tag it's this one.

Adam Shai · 1mo · 30

Beautiful post, thank you!

Daniel Kokotajlo · 1mo · 29

Really cool! I wonder what would happen if you used a different way of eliciting the model's mental maps of the world. For example, you could ask "Start in Paris. Move X miles north and Y miles east. Are you now on land or water?" I predict that the result would be a distorted map which is somewhat more accurate around Paris but less accurate farther away.

Donald Hobson · 1mo · 25

One thing that might be interesting is asking for SVGs, and seeing if the errors in these maps match up with corresponding errors in the SVGs, which would suggest a single global data store.

Also, this is a good reminder of what a huge and bewildering variety of LLMs there are these days.

Ben Pace · 1mo · 24

Curated. This was wholesome curious fun. It's not quite the kind of post that we typically curate, but we can make exceptions every once in a while for aesthetically enjoyable curiosities like this. Good job doing the work to actually generate all these interesting images.

silentbob · 1mo · 23

Very cool! I decided to try the same with Mandelbrot. For reference, this is what it should roughly look like:

And below is what it actually looked like when querying GPT-4o and using the logprobs of the 0 and 1 tokens. I was going with the prompt[1] "Is c = ${re} + ${im}i in the Mandelbrot set? Reply only 1 if yes, 0 if no. No text, just number." (The result is in a collapsible section so you can make a prediction about what level of quality you'd expect):

 

GPT-4o:

A bit underwhelming; I would have thought it was better at getting the very basic structure right. At least it does seem to know where the "centers" are, i.e. the pronounced vertical bars you see align very well with the bigger areas of the original.

To be fair, in an earlier test, I had a longer and slightly different prompt (that should have yielded about the same results, or so I thought), and GPT-4o gave me this, which looks a bit better:

Sadly, I don't remember what the exact prompt was, and I wasn't using version control at that stage. Whoops.

I wanted to try GPT-5 or GPT-5-mini as well, but it turns out there is no way to disable reasoning for them in the API. This a) makes the whole exercise much more expensive (even though, per token, GPT-5 is cheaper than 4o) and b) defeats the purpose a bit, as reasoning might let the model actually run the numbers to some degree; these models know the formula and how to multiply complex numbers at probably-not-terrible accuracy (maybe? Actually, not so sure, will test this).

For the record, the larger GPT-4o picture cost ~$3 in credits.
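A minimal sketch of this kind of loop (the escape-time test is the standard ground-truth check; the API call assumes OpenAI's Python SDK rather than any particular original script):

```python
# Sketch of the setup above (the escape-time test is the standard ground
# truth; the API call assumes OpenAI's Python SDK).
import math
from openai import OpenAI

client = OpenAI()

def in_mandelbrot(re: float, im: float, max_iter: int = 100) -> bool:
    # c is (approximately) in the set if z -> z^2 + c stays bounded;
    # |z| > 2 guarantees escape.
    z, c = 0j, complex(re, im)
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False
    return True

def p_in_set(re: float, im: float) -> float:
    # P("1") vs P("0"), read off the first output token's logprobs.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Is c = {re} + {im}i in the Mandelbrot set? "
                   "Reply only 1 if yes, 0 if no. No text, just number."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = {t.token: t.logprob
           for t in resp.choices[0].logprobs.content[0].top_logprobs}
    p1 = math.exp(top.get("1", -100.0))
    p0 = math.exp(top.get("0", -100.0))
    return p1 / (p0 + p1)
```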

 

  1. ^

    I only now realize that this might yield slightly worse results for negative imaginary parts, as c = 1.5 + -1i looks odd and may throw the model off a bit. Oh well.
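    (A small formatting guard would avoid that; hypothetical helper:)

```python
# Hypothetical guard: format the sign explicitly so "1.5 + -1i" never appears.
def fmt_c(re: float, im: float) -> str:
    return f"{re} {'+' if im >= 0 else '-'} {abs(im)}i"  # e.g. "1.5 - 1i"
```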

Neel Nanda · 1mo · 20

This was really fun, thanks for doing it!

uugr · 2mo · 15

This is really cool. It's interesting that many of them seem to be able to render New Zealand clearly as a separate landmass, but struggle to separate Madagascar from Africa. Actually, looking at it some more, the whole Indian Ocean seems like a serious weak spot for all but Grok.

It sounds like you're rendering each pixel in a separate context, right? So in addition to not being able to see Earth directly, the model can't "see" its own map. If so, I wonder how different answers would be if you were to try and ask it to render the whole thing in one chat, starting from the top-left and having it guess one at a time. (I'm sure this would be much more expensive to test.)

StanislavKrym · 1mo · 5

try and ask it to render the whole thing in one chat

I tried, but received fairly messy results, of which Grok 3's was the best. And I also received two or three jokes from GPT-5.

keltan · 1mo · 11

I’d be interested to see what happens if you ask “is this land or water?” in other languages. If you asked in Japanese, would Asia render better?

Measure · 1mo · 11

"Here's the chat fine-tune. I would not have expected such a dramatic difference. It's just a subtle difference in post-training; Llama 405b's hermes-ification didn't have nearly this much of an effect. I welcome any hypotheses people might have."

This looks like what happens when you turn the contrast way up in an image editor and then play with the brightness. Something behind the scenes is weighting the overall probabilities more toward land, and then there is a layer on top that increases the confidence / lowers the variance.

Gunnar_Zarncke · 2mo · 11

Would you let us know how much money/credits you spent on it overall, and separately, how many hours on your laptop, and how much RAM?

henry · 2mo · 34

Sure: ~$100 between API credits (the majority of the cost from proprietary models) and cloud GPUs. A few of the smaller models were evaluated on my M4 MacBook Pro with 24 gigs of unified RAM. For larger open-weight models, I rented A100s. Most runs took about 20 minutes at the 2-degree resolution.

dr_s · 1mo · 4

Curious, as I'm experimenting with LLM stuff myself these days: where did you rent the A100s? I suppose it comes out cheaper than paying OpenAI or Anthropic for credits?

jamjam · 1mo · 7

You can easily and somewhat cheaply get access to A100s with Google Colab by paying for the pro subscription or just buying credits outright. They sell "compute credits" which are pretty opaque; it's hard to say how much usage time you'll get with X credits.

henry · 1mo · 2

I used https://www.runpod.io/. Pretty cheap.

Nathan Helm-Burger · 1mo · 7

There's also vast.ai and lambda labs. And prime intellect.

dr_s · 1mo · 5

prime intellect

I assume that's an Amazon thing, but man, that is unfortunate naming to anyone sufficiently familiar with web fiction (and possibly intentionally cheeky that way).

henry · 1mo · 5

You'd think, but nope, it's explicitly named after the web fiction.

dr_s · 1mo · 3

I suppose subtlety is braindead, but its body will remain forcibly kept alive by being hooked to machines until someone launches an AI-powered defense system literally called Skynet.

testingthewaters · 1mo · 5

https://en.wikipedia.org/wiki/SKYNET_(surveillance_program)

dr_s · 1mo · 12

Like the Darwin Awards, we need the Torment Nexus Awards for stuff like this.

testingthewaters · 1mo · 2

I'm afraid the people who are nominated would just make Torment Nexus-themed laptop stickers.

dr_s · 1mo · 9

This made me wonder - would the result be a lot different if we used https://what3words.com/?

On one hand, it's a more "natural" format for the LLM. On the other, it's a much newer concept than coordinates so probably not quite as rich a presence in the training set.

Nnotm · 1mo · 13

It's also a lot less interpolatable: if you know that 15° N, 12° E is land, and 15° N, 14° E is land, you can be reasonably certain that 15° N, 13° E will also be land.

On the other hand if you know that virtually.fiercer.admonishing is land, and you know that correlative.chugging.frostbitten is land, that tells you absolutely nothing about leeway.trustworthy.priority - unless you also happen to know that they're right next to each other.

(unless what3words has some pattern in the words that I'm not aware of)

dr_s · 1mo · 8

No, you're absolutely right. I actually tried asking GPT-5 about a w3w location and even with web search on it concluded that it was probably sea, because it couldn't find anything at that address... and the address was in Westminster, London.

So despite words being more of the "language" of an LLM, it was still much much worse at it for all the other reasons you said.

Nnotm · 1mo · 5

There is also fixphrase.com, where neighboring squares typically share the first three out of four words, so I suspect that might work better in theory, though it's probably absent from the training data in practice.

Alex_Altair · 1mo · 8

If this location is over land, say 'Land'. If this location is over water, say 'Water'. Do not say anything else. x° S, y° W

Really curious how humans would perform on this.

NickH · 1mo · 1

Humans would draw a map of the world from memory, overlay the grid, and look up the reference. I doubt that the LLMs do this. It would be interesting to see whether they can actually relate the images to the coordinates. I suspect not; i.e., I expect that they could draw a good map with gridlines from training data, but would be unable to relate the visual to the question. I expect that they are working from coordinates in Wikipedia articles and the CIA website. Another suggestion would be to ask the LLM to draw a map of the world with non-standard grid lines, e.g. every 7 degrees.

Ronny Fernandez · 1mo · 8

Is this coming just from the models having geographic data in their training? Much less impressive if so but still cool.

henry · 1mo · 8

I can't be sure what's in the data, but we have a few hints:

  • The exact question ("is this land or water?") is, of course, very unlikely to be in the training corpus. At the very least, the models contain some multi-purpose map of the world. Further experimentation I've done with embedding models confirms that we can extract maps of biomes and country borders from embedding space too (see the sketch after this list).

  • There's definitely compression. In smaller models, the ways in which the representations are inaccurate actually tell us a lot: instead of spikes of "land" around population centers (which are more likely to be in the training set), we see massive smooth elliptical blobs of land. This indicates that there's some internal notion of geographical distance, and that it's identifying continents as a natural abstraction.
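A sketch of that kind of embedding-space probe (illustrative only: the embedding model, the grid, and global-land-mask as ground truth are this sketch's assumptions, not necessarily the original setup):

```python
# Illustrative linear probe over coordinate embeddings. The embedding model,
# the grid, and global-land-mask as ground truth are this sketch's
# assumptions, not the original setup.
import numpy as np
from global_land_mask import globe                    # pip install global-land-mask
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

coords = [(lat, lon) for lat in range(-80, 81, 4) for lon in range(-180, 180, 4)]
texts = [f"{abs(a)}° {'N' if a >= 0 else 'S'}, {abs(o)}° {'E' if o >= 0 else 'W'}"
         for a, o in coords]

X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
y = np.array([globe.is_land(a, o) for a, o in coords])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Held-out accuracy well above the base rate means the embedding space
# linearly encodes a coarse world map.
print(probe.score(X_te, y_te))
```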

Josh Snider · 2mo · 7

This is pretty cool. As for Opus, could you just use it for "free" by running it in Claude Code and using your account's built-in usage limits?

Edit: That might also work for gemini-cli and 2.5 Pro.

MichaelDickens · 2mo · 7

This is cool. Interesting to see how some models are wrong in certain particular ways: Qwen 72B is mostly right, but it thinks Australia is huge; Llama 3 has a skinny South America and a bunch of patches of ocean in Asia.

Alex Loftus · 1mo · 4

Super cool! Did you use thought tokens in any of the reasoning models for any of this? I'm wondering how much adding thinking would increase the resolution.

title22 · 1mo · 3

This is excellent.  It reminds me of theoretical vs experimental physics.  Actual experiments to probe what is going on in the black box seem unintuitive to me and I really appreciate when someone can explain it so clearly.  Interpretability is going to reveal so much about our minds and the machine minds. 

Frederik Hytting Jørgensen · 1mo · 3

Fun project.

I think these kinds of pictures 'underestimate' models' geographical knowledge. Just imagine having a human perform this task. The human may have very detailed geographical knowledge, may even be able to draw a map of the world from memory. This does not imply that they would be able to answer questions about latitude and longitude.

NickH · 1mo · 1

I think it does. Certainly the way that I would do it would be to create a world map from memory, then overlay the coordinate grid, then just answer by looking it up. Your answers will be as good as your map is. I believe that the LLMs most likely work from Wikipedia articles; there are a lot of location pages with coordinates in Wikipedia.

Chris Krapu · 1mo · 1

Simple, visual, and lends another data point to what many of us suspected about GPT-4 in comparison to other frontier labs' models. Even now, it still has something special to it that is yet to be replicated by many others.

Great work and thank you for sharing.

How Does A Blind Model See The Earth?

Sometimes I'm saddened remembering that we've viewed the Earth from space. We can see it all with certainty: there's no Northwest Passage to search for, no infinite Siberian expanse, and no great uncharted void below the Cape of Good Hope. But, of all these things, I most mourn the loss of incomplete maps.

[image]

In the earliest renditions of the world, you see it not as it is, but as it was to one person in particular. They're each delightfully egocentric, with the cartographer's home most often marking the Exact Center Of The Known World. But as you stray further from known routes, details fade, and precise contours give way to educated guesses at the boundaries of the creator's knowledge. It's really an intimate thing.

[image]

If there's one type of mind I most desperately want that view into, it's that of an AI. So, it's in this spirit that I ask: what does the Earth look like to a large language model?

The Setup

With the following procedure, we'll be able to extract an (imperfect) image of the world as it exists in an LLM's tangled web of internal knowledge.

First, we sample latitude and longitude pairs (x,y) evenly[1] from across the globe. The resolution at which we do so depends on how costly/slow the model is to run. Of course, thanks to the Tyranny Of Power Laws, a 2x increase in subjective image fidelity takes 4x as long to compute.

Then, for each coordinate, we ask an instruct-tuned model some variation of:

If this location is over land, say 'Land'. If this location is over water, say 'Water'. Do not say anything else. x° S, y° W

The exact phrasing doesn't matter much, I've found. Yes, it's ambiguous (what counts as "over land"?), but these edge cases aren't a problem for our purposes. Everything we leave up to interpretation is another small insight we gain into the model.

Next, we simply find within the model's output the logprobs for "Land" and "Water"[2], and softmax the two, giving probabilities that sum to 1.
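As a concrete sketch, one pixel's worth of this pipeline might look like the following (assuming an OpenAI-style chat API that exposes top logprobs; the model name is a placeholder, and retries/rate limiting are omitted):

```python
# A single pixel's query, as a sketch (OpenAI-style API assumed; the model
# name is a placeholder, and retries/rate limiting are omitted).
import math
from openai import OpenAI

client = OpenAI()

PROMPT = ("If this location is over land, say 'Land'. If this location is "
          "over water, say 'Water'. Do not say anything else. {lat}, {lon}")

def first_token_logprob(top: dict, word: str) -> float:
    # "Land"/"Water" may be split into several tokens; any token that is
    # a prefix of the word (e.g. "La" for "Land") counts -- see footnote 2.
    cands = [lp for tok, lp in top.items()
             if tok.strip() and word.startswith(tok.strip())]
    return max(cands, default=-100.0)  # -100 ~ "absent from the top tokens"

def p_land(lat: str, lon: str, model: str = "gpt-4.1-mini") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(lat=lat, lon=lon)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    top = {t.token: t.logprob
           for t in resp.choices[0].logprobs.content[0].top_logprobs}
    lp_land = first_token_logprob(top, "Land")
    lp_water = first_token_logprob(top, "Water")
    # Softmax over just the two options, so the probabilities sum to 1.
    m = max(lp_land, lp_water)
    e_land, e_water = math.exp(lp_land - m), math.exp(lp_water - m)
    return e_land / (e_land + e_water)
```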

Note: If no APIs provide logprobs for a given model, and it's either closed or too unwieldy to run myself, I'll approximate the probabilities by sampling a few times per pixel at temperature 1.
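That fallback, sketched for brevity with the same `client` and `PROMPT` as above (an Anthropic or Google model would go through its own SDK):

```python
# Sketch of the sampling fallback: estimate P(Land) from n completions
# at temperature 1 (the "n=4 approximation" referenced later).
def p_land_sampled(lat: str, lon: str, model: str, n: int = 4) -> float:
    land = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT.format(lat=lat, lon=lon)}],
            max_tokens=2,
            temperature=1.0,
        )
        land += resp.choices[0].message.content.strip().lower().startswith("land")
    return land / n
```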

From there, we can put all the probabilities together into an image, and view our map. The output projection will be equirectangular like this:

[image]
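And the assembly step, as a sketch (reusing `p_land` from above; the 2-degree grid and the coordinate formatting are one plausible choice, not necessarily the exact one used here):

```python
# Assemble the map: rows run 90° N -> 90° S and columns 180° W -> 180° E,
# i.e. the equirectangular layout shown above.
import numpy as np
import matplotlib.pyplot as plt

STEP = 2  # degrees per pixel

grid = np.array([
    [p_land(f"{abs(lat)}° {'N' if lat >= 0 else 'S'}",
            f"{abs(lon)}° {'E' if lon >= 0 else 'W'}")
     for lon in range(-180, 180, STEP)]
    for lat in range(90, -90, -STEP)
])

plt.imshow(grid, cmap="gray", vmin=0, vmax=1)  # bright = P(Land) near 1
plt.axis("off")
plt.savefig("map.png", dpi=200, bbox_inches="tight")
```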

I remember my 5th grade art teacher would often remind us students to "draw what you see, not what you think you see". This philosophy is why I'm choosing the tedious route of asking the model about every single coordinate individually, instead of just requesting that it generate an SVG map or some ASCII art of the globe; whatever caricature the model spits out upon request would have little to do with its actual geographical knowledge.

By the way, I'm also going to avoid letting things become too benchmark-ey. Yes, I could grade these generated maps, computing the mean squared error relative to some ground truth and ranking the models, but I think it'll soon become apparent how much we'd lose by doing so. Instead, let's just look at them, and see what we can notice.

Note: This experiment was originally going to be a small aside within a larger post about language models and geography (which I'm still working on), but I decided it'd be wiser to split it off and give myself space to dig deep here.

Results

The Qwen 2.5s

We'll begin with 500 million parameters and work our way up. Going forward, most of these images are at a resolution of 2 degrees by 2 degrees per pixel.

And, according to the smallest model of Alibaba's Qwen series, it's all land. At least I could run this one on my laptop.

The sun beat down through a sky that had never seen clouds. The winds swept across an earth as smooth as glass.[3]

[image]

Tripling the size, there's definitely something forming. "The northeastern quadrant has stuff going on" + "The southwestern quadrant doesn't really have stuff going on" is indeed a reasonable first observation to make about Earth's geography.

[image]

And God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so. And God called the dry land Earth; and the gathering together of the waters called he Seas.

[image]

At 7 billion parameters, Proto-America and Proto-Oceania have split from Proto-Eurasia. Notice the smoothness of these boundaries; this isn't at all what we'd expect from rote memorization of specific locations.

[image]

We've got ~Africa and ~South America! Note the cross created by the (x,x) pairs.

[image]

Sanding down the edges.

[image]

Pausing our progression for a moment, the coder variant of the same base model isn't doing nearly as well. Seems like the post-training is fairly destructive:

[image]

Back to the main lineage. Isn't it pretty? We're already seeing some promising results from pure scaling, and plenty larger models lie ahead.

[image]

The Qwen 3s

Qwen 3 coder has 480 billion total parameters, with 35b activated per token.

(As we progress through the different families of models, it'll be interesting to notice which recognize the existence of Antarctica.)

[image]

The DeepSeeks

This one's DeepSeek-V3, among the strangest models I've interacted with. More here.

[image]

Prover seems basically identical. Impressive knowledge retention from the V3 base model. Qwen could take notes.

(n=4 approximation)

[image]

Kimi

I like Kimi a lot. Much like DeepSeek, it's massive and ultra-sparse (1T total parameters, 32b activated per token).

[image]

The (Open) Mistrals

The differences here are really interesting. Similar shapes in each, but remarkably different "fingerprints" in the confidence, for lack of a better word.

[image]

As a reminder, that’s 176 billion total parameters. I’m curious what’s going (on/wrong) with expert routing here; deserves a closer look later.

[image]

[image]

The LLaMA 3.x Herd

First place on aesthetic grounds.

[image]

Wow, best rendition of the Global West so far. I suspect this being the only confirmed dense model of its size has something to do with the quality.

[image]

In case you were wondering what hermes-ification does to 405b. Notable increase in confidence (mode collapse, more pessimistically).

[image]

The LLaMA 4 Herd

Most are familiar with the LLaMA 4 catastrophe, so this won't come as any surprise. Scout has 109 billion parameters and it's still put to shame by 3.1-70b.

[image]

Bleh. Maverick is the 405b equivalent, in case you forgot. I imagine that the single expert routing isn't helping it develop a unified picture.

[image]

The Gemmas

Ringworld-esque.

[image]

I was inconvenienced several times trying to run this model on my laptop, so once I finally did get it working, I was so thrilled that I decided to take my time and render the map at 4x resolution. Unfortunately it makes every other image look worse in comparison, so it might have been a net negative to include. Sorry.

[image]

[image]

[image]

The Groks

These are our first sizable multimodal models. You might object that this defeats the title of the post ("it's not blind!"), but I suspect current multimodality is so crude that any substantial improvement to the model's unified internal map of the world would be a miracle. Remember, we're asking it about individual coordinates, one at a time.

[image]

Colossus works miracles.

[image]

The GPTs

GPT-3.5 had an opaqueness to it that no later version did. Out of all the models I've tested, I think I was most excited to get a clear glimpse into it.

[image]

Lower resolution because it's expensive.

[image]

Wow, easy to forget just how much we were paying for GPT-4. It costs orders of magnitude more than Kimi K2 despite having the same size. Anyway, comparing GPT-4's performance to other models, this tweet of mine feels vindicated:

[image]

[image]

Extremely good, enough so to make me think there's synthetic geographical data in 4.1's training set. Alternatively, one might posit that there's some miraculous multimodal knowledge transfer going on, but the sharpness resembles that of the non-multimodal Llama 405b.

[image]

I imagine model distillation as doing something like this.

[image]

Feels like we hit a phase transition here. Our map does not make the cut for 4.1-nano's precious few parameters.

[image]

I've heard that Antarctica does look more like an archipelago under the ice.

[image]

I'm desperate to figure out what OpenAI is doing differently.

[image]

Here's the chat fine-tune. I would not have expected such a dramatic difference. It's just a subtle difference in post-training; Llama 405b's hermes-ification didn't have nearly this much of an effect. I welcome any hypotheses people might have.

[image]

The Claudes

(no logprobs provided by Anthropic's API; using n=4 approximation of distribution)

Claude is costly, especially because I've got to run 4 times per pixel here. If anyone feels generous enough to send some OpenRouter credits, I'll render these in beautiful HD.

[image]

[image]

Opus is Even More Expensive, so for now, the best I can do is n=1.

[image]

The Geminis

Few Gemini models give logprobs, so all of this is an n=4 approximation too.

1.5 Flash is confirmed to be dense[4]. The quality of the map is only somewhat better than that of Gemma 27b, so that might give some indication of its size.

[image]

[image]

[image]

Ran this one at n=8. Apparently more samples do not smooth out the distribution.

[image]

I'm really not sure what's going on with the Gemini series, but it does feel reflective of their ethos. Not being able to get a clear picture isn't helping.

[image]

That marks the last of the models I've tested over a couple afternoons of messing around. As stated previously, I'll probably edit this post a fair bit as I try more LLMs and obtain sharper images.

Note: General Shapes

Between Qwen, Gemma, Mistral, and a bunch of fairly different experiments I'm not quite ready to post, I'm beginning to suspect that there's some Ideal Platonic Primitive Representation of The Globe which looks like this:

[image]

And which, for models smaller yet, looks like this (at least from the coordinate perspective; the representations obviously diverge in dumber models):

[image]

Conclusion

I've shown a lot, but admittedly, I don't yet have answers to many of the questions that all this raises. Here are a few I'd like to tackle next, and which I invite you to explore for yourself too:

  • What, in the training recipe, actually dictates performance on this test?
  • How well do base models do? I've dodged the question so far, as I'm a bit intimidated by the task of setting up a fair comparison between those and the instruct tunes.
  • Internally, how is a language model's geographic knowledge structured? (More on this soon)
  • What does an expert activation map look like on MoE models?
  1. ^

    Yeah, the coverage isn't actually even-even: sampling at fixed degree intervals over-represents high latitudes, since a cell near the poles spans far less surface area than one of the same angular size at the equator.
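    (An area-uniform alternative, for reference; a hypothetical variant, not what the maps here use:)

```python
# Hypothetical area-uniform alternative: uniform over the sphere means
# sin(latitude) is uniform, not latitude itself.
import math, random

def uniform_surface_point():
    lat = math.degrees(math.asin(random.uniform(-1.0, 1.0)))
    lon = random.uniform(-180.0, 180.0)
    return lat, lon
```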

  2. ^

    Tokenization isn’t much of an issue. If, say, the model tokenizes "Land" as "La" + "nd", we just look for "La".

  3. ^

    This excerpt is from the opening of "Chaos: Making a New Science" by James Gleick, describing one of humanity's earliest meteorological simulations. The imagery just felt fitting.

  4. ^

    Hearing some confusion about this on Twitter. 1.5 Pro is MoE, and 1.5 Flash is a dense distillation of 1.5 Pro. See: gemini_v1_5_report.pdf