Does anyone know whether GPT-4 successfully generates ASCII art?

GPT-3.5 couldn't:

the hell

Which makes sense, because of the whole words-can't-convey-visual-phenomena thing.

I'd expect multimodality to solve this problem, though?

6 Answers
Note that ASCII art isn't the only kind of art. I just asked GPT-4 and Claude to each make an SVG of a knight fighting a dragon.

Here's Claude's attempt:

And GPT-4's:

I asked them both to make it more realistic. Claude responded with the exact same thing plus some extra text; GPT-4 returned:

I followed up, asking it for more muted colors and a simple background, and it returned:

I tried again, accidentally using GPT-3.5 this time. It initially gave something really lame, but when I said "more realistic please", it gave me:

Next I asked it:


It responded with this image:


<svg width="300" height="300" viewBox="0 0 300 300" xmlns="http://www.w3.org/2000/svg">
  <!-- Background circle -->
  <circle cx="150" cy="150" r="140" fill="none" stroke="black" stroke-width="2"/>

  <!-- Body -->
  <ellipse cx="150" cy="100" rx="30" ry="40" fill="none" stroke="black" stroke-width="2"/>
  <rect x="140" y="140" width="20" height="60" fill="none" stroke="black" stroke-width="2"/>
  <line x1="100" y1="140" x2="200" y2="140" stroke="black" stroke-width="2"/>
... (read more)

Any chance you still have the generated SVGs, not just the resulting bitmap renders?

Here was the final one:

<svg viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">
  <!-- Field -->
  <rect x="0" y="0" width="100%" height="100%" fill="#8BC34A"/>
  <!-- Sky and sun -->
  <rect x="0" y="0" width="100%" height="40%" fill="#90CAF9"/>
  <circle cx="700" cy="100" r="50" fill="#FFEB3B"/>
  <!-- Mountains -->
  <polygon points="100,300 300,100 500,300" fill="#BDBDBD"/>
  <polygon points="350,400 550,200 750,400" fill="#9E9E9E"/>
  <!-- Castle -->
  <rect x="200" y="150" width="200" height="200" fill="#F5F5F5"/>
  <rect x="250" y="200" width="100" height="100" fill="#BDBDBD"/>
  <rect x="220" y="190" width="60" height="60" fill="#8BC34A"/>
  <rect x="320" y="190" width="60" height="60" fill="#8BC34A"/>
  <rect x="290" y="230" width="20" height="60" fill="#BDBDBD"/>
  <polygon points="200,150 400,150 300,100" fill="#F5F5F5"/>
  <!-- Dragon -->
  <path d="M 550 400 Q 600 300 700 300 Q 800 300 750 400 Z" fill="#F44336"/>
  <ellipse cx="600" cy="350" rx="100" ry="50" fill="#E53935"/>
  <ellipse cx="660" cy="330" rx="30" ry="20" fill="#F5F5F5"/>
  <ellipse cx="540" cy="330" rx="30" ry="20" fill="#F5F5F5"/>
  <circle cx="620" cy="340" r="5" fill="#000"/>
  <circle cx="580" cy="340" r="5" fill="#000"/>
  <path d="M 600 380 Q 640 400 660 350" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 600 380 Q 560 400 540 350" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 520 330 Q 580 330 600 300" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 700 350 Q 680 320 680 340" stroke="#000" stroke-width="2" fill="none"/>
  <path d="M 700 350 Q 720 320 720 340" stroke="#000" stroke-width="2" fill="none"/>
  <!-- Knight -->
  <path d="M 250 450 L 300 350 L 350 450 L 325 500 L 275 500 Z" fill="#BDBDBD"/>
  <path d="M 325 500 L 325 550" stroke="#000" stroke-width="10" fill="none"/>
  <path d="M 275 500 L 275 550" stroke="#000" stroke-width="10" fill="none"/>
  <circle cx="312.5" cy="362.5" r="37.5" fill="#8BC34A"/>
  <rect x="290" y="375" width=

It's not great, but it's trying.

That's actually a rather good depiction of a dog's head, in my opinion.

I think it makes sense that it fails in this way. ChatGPT really doesn't see lines arranged vertically; it just sees the prompt as one long line. But given that it has been trained on a lot of ASCII art, it will probably be successful at copying some of it some of the time.

In case there is any doubt, here is GPT4's own explanation of these phenomena:

Lack of spatial awareness: GPT-4 doesn't have a built-in understanding of spatial relationships or 2D layouts, as it is designed to process text linearly. As a result, it struggles to maintain the correct alignment of characters in ASCII art, where spatial organization is essential.

Formatting inconsistencies in training data: The training data for GPT-4 contains a vast range of text from the internet, which includes various formatting styles and inconsistent examples of ASCII art. This inconsistency makes it difficult for the model to learn a single, coherent way of generating well-aligned ASCII art.

Loss of formatting during preprocessing: When text is preprocessed and tokenized before being fed into the model, some formatting information (like whitespaces) might be lost or altered. This loss can affect the model's ability to produce well-aligned ASCII art.
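To make the linear-processing point concrete, here is a toy Python sketch (my own illustration, using individual characters rather than real BPE tokens, which would make the arithmetic even murkier): a two-dimensional picture reaches the model as one flat sequence, and "the character directly above" is only recoverable by counting out a full row stride.

```python
# A tiny 2-D "picture" the way a human sees it:
art = (
    " /\\ \n"
    "/  \\\n"
    "----\n"
)

# The model receives it as one flat sequence; newlines are ordinary tokens.
flat = list(art)

# Vertical adjacency is only recoverable by counting: the character directly
# above index i sits exactly one row-stride earlier in the sequence.
stride = art.index("\n") + 1  # row width, including the newline

def char_above(i):
    return flat[i - stride]

# The '-' at row 2, column 0 (flat index 2 * stride) has '/' above it.
print(char_above(2 * stride))  # prints "/"
```

Nothing in the flat sequence marks the column boundaries; the model has to infer the stride statistically, which is exactly the "lack of spatial awareness" described above.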

This is a more sensible representation of a balloon than the one in the post; it's just small. More prompts tested on both ChatGPT-3.5 and GPT-4 would clarify the issue.

ChatGPT really doesn't see lines arranged vertically, it just sees the prompt as one long line.

Vision can be implemented in transformers by representing pictures with linear sequences of tokens, which stand for small patches of the picture, left-to-right, top-to-bottom (see appendix D.4 of this paper). The model then needs to learn on its own how the rows fit together into columns and so on... (read more)
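As a rough sketch of that serialization (the function name and details here are my own illustration, not the paper's exact pipeline): an image becomes a flat sequence of small patches in raster order, and any notion of "the patch below" has to be inferred from position alone.

```python
def to_patch_sequence(img, p):
    """Flatten an H x W grid into p x p patches, left-to-right, top-to-bottom.

    `img` is a list of rows; the model only ever sees the returned 1-D
    sequence and must learn on its own how rows stack into columns.
    """
    h, w = len(img), len(img[0])
    assert h % p == 0 and w % p == 0
    seq = []
    for i in range(0, h, p):          # patch rows, top to bottom
        for j in range(0, w, p):      # patch columns, left to right
            seq.append([img[i + di][j + dj]
                        for di in range(p) for dj in range(p)])
    return seq

# A 4x4 image yields four 2x2 patches in raster order.
img = [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]]
print(to_patch_sequence(img, 2))
# prints [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Note that vertically adjacent pixels (e.g. 1 and 5) can end up in the same patch or a full row of patches apart; reconstructing that geometry is the learning problem.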

It's a subjective matter whether the above is a successful ASCII-art balloon or not. If we hold GPT to the same standards we do for text generation, I think we can safely call the above depiction a miserable failure. Its lack of symmetry and overall childishness suggest it has understood nothing about the spatiality, and only by random luck manages to approximate something it has explicitly seen in the training data. (I've done a fair bit of repeated generation, and the results all come out poorly.) I thought the Transformer paper was interesting as well, although the authors do mention the approach only works when there is a large amount of training data. Otherwise, the inductive biases of CNNs do have their advantages, and combining both is probably superior, since the added computational burden of a CNN in conjunction with a Transformer is hardly worth talking about.

See my reply here for a partial exploration of this. I also have a very long post in my drafts covering this question in relation to Bing's AI, but I'm not sure if it's worth posting now, after the GPT4 release.

In my understanding, this is only possible by rote memorization.

3 comments

There is some discussion in comments to this Manifold question, suggesting GPT-4 still doesn't have a good visual understanding of ASCII art, at least not to the point of text recognition.

But it doesn't address the question for pictures of cats or houses instead of pictures of words. Or for individual letters of the alphabet.

I like this question. If it proves true that GPT-4 can produce recognizable ASCII art of things, that would mean it was somehow modelling an internal sense of vision and an ability to recognize objects.

For this very reason, I was intrigued by the possibility of teaching them vision this way. 

I think ASCII art in its general form is an unfair setup, though. ChatGPT has no way of knowing the spacing or width of individual letters; that is not how they perceive them, and hence they have no way of seeing which letters are above each other. Basically, if you were given a piece of ASCII art as a string, but had no idea how the individual characters looked or what width they had, you would have no way to interpret the image.

ASCII art works because we perceive characters both in their visual shape and in their encoded meaning, and we also see them depicted in a particular font with particular kerning settings. That entails a lot of information that is simply missing for them. With a bunch of the pics they produce, you notice they are off in basically the way you would expect if you didn't know the width of individual characters.

This changes if we only use characters of equal width and a square frame, say only 8 and 0, and tell them that if a row is 8 characters long, the ninth character will be right under the first, the tenth right under the second, etc. This would enable them to learn the spatial relations between the numbers, first in 2D, then in 3D. I've been meaning to try that, to see if I can teach them spatial reasoning this way, save the convo, and report it to the developers for training data, but I was unsure whether that was a good way for them to retain the information, or whether it would become superfluous as image recognition is incoming.
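The proposed fixed-width scheme can be stated exactly: with rows of length N, character k of the flat string sits at row k // N, column k % N, so "directly below" is simply index k + N. A minimal Python sketch of that hypothetical setup (names are my own):

```python
N = 8  # fixed row length, as proposed

def grid(s):
    """Lay a flat string of '8'/'0' characters out as rows of length N."""
    return [s[i:i + N] for i in range(0, len(s), N)]

def below(s, k):
    """The character directly below position k, or None at the bottom edge."""
    return s[k + N] if k + N < len(s) else None

# A flat 16-character string encodes a 2 x 8 grid; the ninth character
# (index 8) lands directly under the first (index 0).
s = "8000000808888880"
for row in grid(s):
    print(row)
print(below(s, 0))  # prints "0"
```

Because the stride is fixed and stated up front, the model could in principle learn these index relations without ever seeing the rendered grid, which is exactly what general ASCII art denies it.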

I've seen attempts to not just request ASCII art, but to teach it. And notably, ChatGPT learned across the conversation and improved, even though I was struck by how the humans were giving the kind of explanation that is terrible if the person you are explaining things to cannot see your interface. We need to explain ASCII art as if to someone who is blind and feeling along a set of equally spaced beads, telling them to arrange the beads into a 3D construct in their heads.

It is clearly something tricky for them, though. ChatGPT learned language first, not math; they struggle to do things like accurately counting characters. I find it all the more impressive that what they generate is not meaningless, and improves.

With a lot of the scenarios where people say ChatGPT failed, I have found that the prompts as given did not work, but if you explain things gently and step by step, it can do them. You can aid the AI in figuring out the correct solution, and I find it fascinating that this is possible, that you can watch them learn through the conversation. The difference is really whether you want to prove an AI can't do something, or instead treat it like a mutual teaching interaction, as though you were teaching a bright student with specific disabilities. E.g. not seeing your interface visually is not a cognitive failing, and judging them for it reminds me of people mistaking hearing-impaired people for intellectually disabled people because they keep mishearing instructions.
