Swimmer963 highlights DALL-E 2 struggling with anime, realistic faces, text in images, multiple characters/objects arranged in complex ways, and editing. (Of course, many of these are still extremely good by the standards of just months ago, and the glass is definitely more than half full.) itsnotatumor asks:
How many of these "cannot do's" will be solved by throwing more compute and training data at the problem? Anyone know if we've started hitting diminishing returns with this stuff yet?
In general, we have not topped out on pretty much any scaling curve. Whether it's language modeling, image generation, DRL, or whathaveyou, AFAIK, not a single modality can be truly said to have been 'solved' with the scaling curve broken. Either the scaling curve is flat, or we're still far away. (There are some sound-related ones which seem to be close, but nothing all that important.) Diffusion models' only scaling law I know of is an older one which bends a little but probably reflects poor hyperparameters, and no one has tried eg. Chinchilla on them yet.
So yes, we definitely can just make all the compute-budgets 10x larger without wasting it.
To go through the specific issues (caveat: we do...
Google Brain just announced Imagen (Twitter), which on skimming appears to be not just as good as DALL-E 2 but convincingly better. The main change appears to be reducing the CLIP reliance in favor of a much larger and more powerful text encoder before doing the image diffusion stuff. They make a point of noting superiority on "compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts." The samples also show text rendering fine inside the images as well.
I take this as strong support (already) for my claims 2-3: the problems with DALL-E 2 were not major or deep ones, do not require any paradigm shift to fix, or even any fix, really, beyond just scaling the components almost as-is. (In Kuhnian terms, the differences between DALL-E 2 and Imagen or Make-A-Scene are so far down in the weeds of normal science/engineering that even people working on image generation will forget many of the details and have to double-check the papers.)
EDIT: Google also has a more traditional autogressive DALL-E-1-style 1024px model, "Parti", competing with diffusion Imagen; it is slightly better in COCO FID than Imagen. It likewise does well on all those issues, with again no special fancy engineering aimed specifically at those issues, mostly just scaling up to 20b.
Will future generative models choke to death on their own excreta? No.
Now that goalposts have moved from "these neural nets will never work and that's why they're bad" to "they are working and that's why they're bad", a repeated criticism of DALL·E 2 etc is that their deployment will 'pollute the Internet' by democratizing high-quality media, which may (given all the advantages of machine intelligence) quickly come to exceed 'regular' (artisanally-crafted?) media, and that ironically this will make it difficult or impossible to train better models. I don't find this plausible at all but lots of people seem to and no one is correcting all these wrong people on the Internet, so here's a quick rundown why:
It Hasn't Happened Yet: there is no such thing as 'natural' media on the Internet, and never has been. Even a smartphone photograph is heavily massaged by a pipeline of algorithms (increasingly DL-based) before it is encoded into a codec designed to throw away human-perceptually-unimportant data such as JPEG. We are saturated in all sorts of Photoshopped, CGIed, video-game-rendered, Instagram-filtered, airbrushed, posed, lighted, (extremely heavily) curated media. If these models
I've seen it several times on Twitter, Reddit, and HN, and that's excluding the people like Jack Clark who has pondered it repeatedly in his Import.ai newsletter & used it as theme in some of his short stories (but much more playfully & thoughtfully in his case so he's not the target here). I think probably the one that annoyed me enough to write this was when Imagen hit HN and the second lengthy thread was all about 'poisoning the well' with most of them accepting the premise. It has also been asked here on LW at least twice in different places. (I've also since linked this writeup at least 4 times to various people asking this exact question about generative models choking on their own exhaust, and the rise of ChatGPT has led to it coming up even more often.)
Wow, this is going to explode picture books and book covers.
Hiring an illustrator for a picture book costs a lot, as it should given it's bespoke art.
Now publishers will have an editor type in page descriptions, curate the best and off they go. I can easily imagine a model improvement to remember the boy drawn or steampunk bear etc.
Book cover designers are in trouble too. A wizard with lighting in hands while mountain explodes behind him - this can generate multiple options.
It's going to get really wild when A/B split testing is involved. As you mention regarding ads you'd give the system the power to make whatever images it wanted and then split test. Letting it write headlines would work too.
Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds. And so on. At that point it's like Pixar in a box. We'll see an explosion of directors who work alone, typing descriptions, testing camera angles, altering scenes on the fly. Do that again but more violent. Do that again but with more blood splatter.
Animation in the style of Family Guy seems a natural first ...
Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds.
Video is on the horizon (video generation bibliography eg. FDM), in the 1-3 year range. I would say that video is solved conceptually in the sense that if you had 100x the compute budget, you could do DALL-E-2-but-for-video right now already. After all, if you can do a single image which is sensible and logical, then a video is simply doing that repeatedly. Nor is there any shortage of video footage to work with. The problem there is that a video is a lot of images: at least 24 images per second, so you could have 192 different samples, or 1 8s clip. Most people will prefer the former: decorating, say, a hundred blog posts with illustrations is more useful than a single OK short video clip of someone dancing.
So video's game is mostly about whether you can come up with an approach which can somehow economize on that, like clever tricks in reusing frames to update only a little while updating a latent vector, as a way to take a shortcut to that point in the future where you had so much compute tha...
Most DALL·E questions can be answered by just reading the paper of it or its competitors, or are dumb. This is probably the most interesting question that can't be, and also one of the most common: can DALL·E (which we'll use just as a generic representative of image generative models, since no one argues that one arch or model can and the others cannot AFAIK) invent a new style? DALL·E is, like GPT-3 in text, admittedly an incredible mimic of many styles, and appears to have gone well beyond any mere 'memorization' of the images depicting styles because it can so seamlessly insert random objects into arbitrary styles (hence all the "Kermit Through The Ages" or "Mughal space rocket" variants); but simply being a gifted epigone of most existing styles is not guarantee you can create a new one.
If we asked a Martian what 'style' was, it would probably conclude that "'style' is what you call it when some especially mentally-ill humans output the same mistakes for so long that other humans wearing nooses try to hide the defective output by throwing small pieces of green paper at the outputs, and a third group of humans wearing dresses try to exchange large ...
I think DALL-E has been nerfed (as a sort of low-grade "alignment" effort) and some of what you're talking about as "limitations" are actually bugs that were explicitly introduced with the goal of avoiding bad press.
OpenAI has made efforts to implement model-level technical mitigations that ensure that DALL·E 2 Preview cannot be used to directly generate exact matches for any of the images in its training data. However, the models may still be able to compose aspects of real images and identifiable details of people, such as clothing and backgrounds. (sauce)
It wouldn't surprise me if they just used intelligibility tools to find the part of the vectorspace that represents "the face of any famous real person" and then applied some sort of noise blur to the model itself, as deployed?
Except! Maybe not a "blur" but some sort of rotation of a subspace or something? This hint is weirdly evocative:
...they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definite
Thread of all known anime examples.
whereas anime more broadly is probably pulling in a lot of poorer-quality anime art...(Also, sometimes if asked for “anime” it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.)
That's how you know it's not a problem of pulling in lots of poorer-quality anime art. First, poorer-quality doesn't impede learning that much; remember, you just prompt for high-quality. Compute allowing, more n is always better. And second, if it was a master of poorer-quality anime drawings, it wouldn't be desperately 'sliding away', if you will, like squeezing a balloon, from rendering true anime, as opposed to CGI of anime or Western fanart of anime or photographs of physical objects related to anime. It would just do it (perhaps generating poorer-quality anime), not generate high-quality samples of everything but anime. (See my comment there for more examples.)
The problem is it's somehow not trained on anime. Everything it knows about anime seems to come primarily from adjacent images and the CLIP guidance (which does know plenty about anime, but we also know that pixel generation from CLIP guidance never works as well).
A prompt i'd love to see: "Anomalocaris Canadensis flying through space." I'm really curious how well it does with an extinct species which has very little existing artistic depictions. No text->image model i've played with so far has managed to create a convincing anomalocaris, but one interestingly did know it was an aquatic creature and kept outputting lobsters.
Going by the Wikipedia page reference, I think it got it somewhat closer than "lobsters" at least?
I'd rate these highly, there are many forms of anomalocarids (https://en.m.wikipedia.org/wiki/Radiodonta#/media/File%3A20191201_Radiodonta_Amplectobelua_Anomalocaris_Aegirocassis_Lyrarapax_Peytoia_Laggania_Hurdia.png) and it looks to have picked a wide variety aside from just candensis, but I'm thoroughly impressed that it got the form right in nearly all 10.
Challenging prompt ideas to try:
First one: ....yeah no, DALL-E 2 can't count to five, it definitely doesn't have the abstract reasoning to double areas. Image below is literally just "a horizontal row of five squares".
Yeah, this matches with my sense. It has a really extensive knowledge of the expected relationships between elements, extending over a huge number of kinds of objects, and so it can (in one of the areas that are easy for it) successfully fill in the blanks in a way that looks very believable, but the extent to which it has a gears-y model of the scene seems very minimal. I think this also explains its difficulty with non-stereotypical scenes that don't have a single focal element – if it's filling in the blanks for both "pirate ship scene" and "dogs in Roman uniforms scene" it gets more confused.
The Elon Musk one has realistic faces so I can't share it; I have, however, confirmed that DALL-E does not speak ASL with "The ASL word for "thank you"":
"Pages from a flip book of a water glass spilling" I...think DALL-E 2 does not know what a flip book is.
Slightly reworded to "a game as complex tic-tac-toe, screenshots showing the rules of the game", I am pretty sure DALL-E is not able to generate and model consistent game rules though.
Thanks for this thorough account. The bit where you tried to shorten the hair really made me laugh.
DALL-E is often sensitive to exact wording, and in particular it’s fascinating how “in the style of x” often gets very different results from “screenshot from an x movie”. I’m guessing that in the Pixar case, generic “Pixar style” might capture training data from Pixar shorts or illustrations that aren’t in their standard recognizable movie style.
I've seen this prompt programming bug noted on Twitter by DALL-E 2 users as well. With earlier models, there didn't seem to be that much difference between 'by X' vs 'in the style of X', but with the new high-e...
Thanks for that awesome sumup,
I tried to generate character (Dark Elf / Drow), Magic Items and Scene in a Dungeon and Dragon or Magic the Gathering style like so many cool images on Pinterest :
https://www.pinterest.fr/rbarlow177/dd-character-art/
It was very very difficult !
- Character style is very crappy like old Google Search clipart
- Some "technical term" like Dark Elf or Drow match nothing
The Idea was to generate Medieval Fantasy style for Card Game like Magic but it's very hard to get something good. I fail after 30+ attempt
This is great! I'm generally most interested to see people finding weaknesses of new DL tools, which in and of itself is a sign of how far the technology has progressed.
I'm having real trouble finding out about Dall E and copyright infringement. There are several comments about how Dall E can "copy a style" without it being a violation to the artist, but seriously, I'm appalled. I'm even having trouble looking at some of the images without feeling "the death of artists." It satisfies the envy of anyone who every wanted to do art without making the effort, but on whose backs? Back in the day, we thought that open source would be good advertising, but there is NO reference to any sources. I'm a...
I wonder if you could get it to generate Minecraft screenshots, such as:
It would also be interesting to see how “as a screenshot from Minecraft“ combines with other styles:
You could also append “as a screenshot from Minecraft” to more abstract prompts, for example:
Finally, some other miscellaneo...
The "one character" limitation makes it look like DALL-E was spawned from ongoing, massive programs to develop object recognizing systems, not any sort of general generative system.
Would it be accurate to characterize DALL-E as "basically inverted object recognition"?
My understanding is that the face model limitation may have been deliberate to avoid deepfakes of celebrities, etc. Interestingly, DALL-E can nonetheless at least sometimes do perfectly reasonable faces, either as photographs or in various art styles, if they're the central element of a scene. (And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.)
FWIW, OpenAI just changed the requirements on face samples, loosening it consi...
Prompt from my brother:
What people from 1920 thought 2020 would look like. 1920's Artist's depiction of 2020
When they released the first Dall-E, didn't OpenAI mention that prompts which repeated the same description several times with slight re-phrasing produced improved results?
I wonder how a prompt like:
"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney."
-would compare with something like:
"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney. A painting of an ornate robotic feline made of brass and a man wearing futuristic tribal clothing. A steampunk scene by James Gurney featuring a robot shaped like a panther and a high-tech shaman."
even if it theoretically understands the English language.
If you mix up a prompt into random words so that it's no longer grammatically correct English, does it give worse results? That is, I wonder how much it's basically just going off keywords.
Curated. I think this post is a great demonstration of what our last curation choice suggested
Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.
I'm not yet convinced this will especially fruitful, but t...
I gather we're allowed to suggest prompts we wish to see? Here's a prompt trying to create fanart for my favourite web serial, Pale by Wildbow:
"A girl with red-blonde hair, in a forest. The girl is wearing a deer-mask with short antlers, a cape over a jersey and shorts, and a witch's hat. The girl is holding a hockey stick. Every branch of every tree has a bright ribbon tied to it. The cape rests atop her shoulders and falls over one arm like a musketeer's cape. The witch's hat and the cape are both navy-blue."
Thanks Swimmer963! This was very interesting.
I have a general question for the community. Does anyone know of any similar such descriptions of model limitations with so many examples performed for any language models such as GPT-3?
My personal experience is that visual output is inherently more intuitive, but I'd love to explore my intuition around language models with an equivalent article for GPT-3 or PaLM for example.
I'd predict such articles exist with high confidence but finding the appropriate article with sufficient quality might be trickier. I'm curious which articles commenters here would select in particular.
Some prompts:
The Last Supper by Leonardo Da Vinci, but painted from behind.
The Last Supper by Leonardo Da Vinci, but painted from above, looking straight downwards.
The Last Supper by Leonardo Da Vinci, as an X-ray image.
Relativity by Escher, as a high-resolution photograph.
Boris Johnson dressed as a clown and riding a unicycle along a tightrope, spray-painted onto a wall, in the style of Banksy.
"The Last Supper by Leonardo Da Vinci, as an X-ray image" It's trying!
I especially like this one (close-up): https://labs.openai.com/s/QsWCxHvbwRaIJEB7xbTCnvwx
This is great! Thanks.
A nitpick:
adding stylistic prompts actively changes the some of what I would consider to be content
Your examples here are not good since e.g. "...painting by Alphonse Mucha" is not just a rewording of "...in the style of Alphonse Mucha": the former isn't a purely stylistic prompt. For a [painting by x], x gets to decide what is in the painting - so it should be expected that this will change the content.
Similarly for "screenshots from the miyazaki anime movie".
Of course it's still a limitation if you can only get really good style results by using such not-purely-stylistic prompts.
Some prompts I’d love to see: “Infinite Jest” “Bedroom with impossible geometry” “Coffeeshop in non-Euclidian hyperbolic space” “Screenshot of Wikipedia front page” “The shadow in the corner of the room stared at me”
"A screenshot of the Wikipedia home page" this is one of the results that makes me feel ~anthropomorphized fondness for DALL-E. It's trying so hard!
Dall-E knows locations. We put a watercolor painting I did of our cabin on a lake and asked Dall-E to create a "variation". The watercolor image Dall-e created was literally my next door neighbors cabin which is a few hundred feet away from ours. Blew my mind how Dall-E even knew the location just based on the image I put in.
Big black furry monsters with tall heads are wearing construction outfits and are dripping with water and seaweed. They are using a dolly to pick up a dumpster in an alley and pointing at where to bring it. Realistic HD photo.
I am so confused by two completions of a human girl here. How is this possibly close in image-space to all the other images, especially given this prompt?
What if the prompt literally doesn't make sense? Like having a coherent prompt structure, but the content isn't logically valid.
For example, "A painting of a woman drawing herself, in the style of clocks"
Thank you for sharing all of these DALL-E tests!
I wonder whether it can reproduce three objects that reliably appear together in images. How about one of these prompts:
A bronze statue of three wise monkeys.
See no evil, hear no evil, speak no evil, statue of monkeys.
The Bill Watterson one requires me to request black bears attacking a black forest campground at midnight.
Optionally: "...as pixel art".
I have to ask, how does one get hold of any of the programs in this vein? I've seen Gwern's TWDNE, and now your experiments with DALL-E, and I'd love to mess with them myself but have no idea where to go. A bit of googling suggests one can buy GPT3 time from OpenAI, but I gather that's for text generation, which I can do just fine already.
It'd be interesting to see (e.g.):
Full body x-ray scan of a {X}. Detailed, medical professional scan.
Medical illustration of {X} skeleton, with labels. High quality, detailed, professional medical illustration.
Where X is some fictional creature, such as: mermaid, Pikachu, dragon.
Could you please return 10 for each of these prompts, I give you my best, ones that should get out of it interesting vividness:
1) Bright macro shot of a plush toy robot pikachu eating a hamburger in a nurse outfit against a white brick wall with mud splashed on pikachu from a tire on the road. 8K HD incredibly detailed.
2) Macro shot of the cool pikachu wearing black chains and laughing as seen in a truck selfie in the desert next to a sand castle with piranha plants seen through the heat. 8K incredibly detailed.
3) Future 2377 hospital with beds in glass co...
Could you try this?
"A DJ stands mightily on a festival stage with thousands of people cheering and dancing. The DJ's T-Shirt reads "CHUNTED". Drawn in the style of Bruno Mangyoku."
You say it performs well when two characters have a single trait which is different between them. I wonder how much better it performs better when you give character A many masculine traits, and B many feminine traits, without directly stating A is male and B is female, compared to if you randomize those traits for A or B.
In general, assigning traits which correlate highly with each other should give better results. Perhaps a problem is that the more characters and traits you assign, the less correlated those traits are with all the other traits, and so far lower performance is seen.
Some prompt requests for my daughter:
"A wild boar and an angel walking side by side along the beach - beautiful hyperrealistic art"
"A piggo-saurus - a pig-like dinosaur - hyper realistic art"
"A piggo-saurus - an illustration of a pig-like dinosaur"
"A little forest gnome leaving through his magic book - beautiful and detailed illustration"
I wanted: the Star-Eyed Goddess
Maybe DALL-E thought you meant Movie-Star-Eyed Goddess? 'Cause that's what the picture looks like to me :)
Regarding text, if the problem comes from encoding, does that mean the model does better with individual letters and digits? Eg
"The letter A"
"The letters X Y and Z"
"Number 8"
"A 3D rendering of the number 5"
Awesome writeup!
To further explore the interplay between style and content, how about trying something not very specific that could gain specificity from the style context?
For example "Aliens are conducting experiments on human subjects":
Prompt: A cartoon honey badger wearing a Brazilian Jiu Jitsu GI with a black belt, shooting in for a wrestling takedown
Can you try this one:
Glossy black crystal temples with silver barred gates releasing smoke along a metal path with spikes along it next to a red river, and a layer of smoke. Chains everywhere. A black portal is at the end with heavy glossy techno bosses guarding it. Realistic HD photo.
Here is an idea that I hope will give some interesting results:
A complex Rube Goldberg machine.
Some possible variations:
A Rube Goldberg machine made out of candy.
A photograph of a steampunk Rube Goldberg machine.
"White haired girl wearing white tights with a girl with black hair wearing opaque black tights and blushing, Anime fanart, danbooru, deviantart, advanced digital art settings"
(since there is 2 girls, it doesn't qualify as "explicit" and more just anime fanart)
As a cinematographer now I'm curious of how much it can understand more advanced photography techniques. For example can it do something like "Double exposure photo of the silhouette of a man with fireworks in the background"? I made a similar photo two years ago and I'll leave it here as reference to see how similar it can get: https://i.gyazo.com/ace7c2bd76a8f2710859362314a1f8c0.jpg
This is so incredible. I'm a cinematographer and I'm looking forward to having access because I'm curious how it'll perform in using it to make references for projects. I'm curious if it can take any specific film (not franchise) and take that style. An example of this would be something like "A man with a blue shirt walking through a dark hallway, in the style of Blade Runner 2049". If this works it would also explain why it is a bit loose when you mention Pixar the production company instead of a specific film with a more consistent style. A lot like the...
Heyyyy I got a prompt request:
Illustrated artwork by Hirohiko Araki depicting Shrek and Donkey in the style of Jojo's Bizarre Adventure.
If you want specific words spelled correctly try putting quotations on the specific words in the prompt
Prompt:
Axis and Allies board game 2022 setup. Digital image official concept
(Remove some words if it doesn't work)
Small white cat wearing a red collar with a bell on it hugging a shadow person. Cute digital art, enhanced digital image
Prompt request!
Prompt:
"Chi in Chi's Sweet Home japanese animation. Streaming service Crunchyroll. Screenshot of episode with Chi, who is a cute tabby-white mixed cat. 2D, Google Search Screenshot, Pinterest"
Very insightful post. May I use your images in my PhD dissertation to illustrate limitations of current image generation methods? Thanks!
Gabriel Huang
Reference Picture of Kyubey. Drawn By Puella Magi Madoka Magica. Digital Art Clip Studio Paint Anime, Pretty and Shining. Advanced Image of Kyubey. The character is Kyubey from Puella Magi Madoka Magica
I got access to DALL-E 2 earlier this week, and have spent the last few days (probably adding up to dozens of hours) playing with it, with the goal of mapping out its performance in various areas – and, of course, ending up with some epic art.
Below, I've compiled a list of observations made about DALL-E, along with examples. If you want to request art of a particular scene, or to test see what a particular prompt does, feel free to comment with your requests.
DALL-E's strengths
Stock photography content
It's stunning at creating photorealistic content for anything that (this is my guess, at least) has a broad repertoire of online stock images – which is perhaps less interesting because if I wanted a stock photo of (rolls dice) a polar bear, Google Images already has me covered. DALL-E performs somewhat better at discrete objects and close-up photographs than at larger scenes, but it can do photographs of city skylines, or National Geographic-style nature scenes, tolerably well (just don't look too closely at the textures or detailing.) Some highlights:
Pop culture and media
DALL-E "recognizes" a wide range of pop culture references, particularly for visual media (it's very solid on Disney princesses) or for literary works with film adaptations like Tolkien's LOTR. For almost all media that it recognizes at all, it can convert it in almost-arbitrary art styles.
[Tip: I find I get more reliably high-quality images from the prompt "X, screenshots from the Miyazaki anime movie" than just "in the style of anime", I suspect because Miyazaki has a consistent style, whereas anime more broadly is probably pulling in a lot of poorer-quality anime art.]
Art style transfer
Some of most impressively high-quality output involves specific artistic styles. DALL-E can do charcoal or pencil sketches, paintings in the style of various famous artists, and some weirder stuff like "medieval illuminated manuscripts".
IMO it performs especially well with art styles like "impressionist watercolor painting" or "pencil sketch", that are a little more forgiving around imperfections in the details.
Creative digital art
DALL-E can (with the right prompts and some cherrypicking) pull off some absolutely gorgeous fantasy-esque art pieces. Some examples:
The output when putting in more abstract prompts (I've run a lot of "[song lyric or poetry line], digital art" requests) is hit-or-miss, but with patience and some trial and error, it can pull out some absolutely stunning – or deeply hilarious – artistic depictions of poetry or abstract concepts. I kind of like using it in this way because of the sheer variety; I never know where it's going to go with a prompt.
The future of commercials
This might be just a me thing, but I love almost everything DALL-E does with the prompt "in the style of surrealism" – in particular, its surreal attempt at commercials or advertisements. If my online ads were 100% replaced by DALL-E art, I would probably click on at least 50% more of them.
DALLE's weaknesses
I had been really excited about using DALL-E to make fan art of fiction that I or other people have written, and so I was somewhat disappointed at how much it struggles to do complex scenes according to spec. In particular, it still has a long way to go with:
Scenes with two characters
I'm not kidding. DALL-E does fine at giving one character a list of specific traits (though if you want pink hair, watch out, DALL-E might start spamming the entire image with pink objects). It can sometimes handle multiple generic people in a crowd scene, though it quickly forgets how faces work. However, it finds it very challenging to keep track of which traits ought to belong to a specific Character A versus a different specific Character B, beyond a very basic minimum like "a man and a woman."
The above is one iteration of a scene I was very motivated to figure out how to depict, as a fan art of my Valdemar rationalfic. DALL-E can handle two people, check, and a room with a window and at least one of a bed or chair, but it's lost when it comes to remembering which combination of age/gender/hair color is in what location.
Even in cases where the two characters are pop culture references that I've already been able to confirm the model "knows" separately – for example, Captain America and Iron Man – it can't seem to help blending them together. It's as though the model has "two characters" and then separately "a list of traits" (user-specified or just implicit in the training data), and reassigns the traits mostly at random.
Foreground and background
A good example of this: someone on Twitter had commented that they couldn't get DALL-E to provide them with "Two dogs dressed like roman soldiers on a pirate ship looking at New York City through a spyglass". I took this as a CHALLENGE and spent half an hour trying; I, too, could not get DALL-E to output this, and end up needing to choose between "NYC and a pirate ship" or "dogs in Roman soldier uniforms with spyglasses".
DALL-E can do scenes with generic backgrounds (a city, bookshelves in a library, a landscape) but even then, if that's not the main focus of the image then the fine details tend to get pretty scrambled.
Novel objects, or nonstandard usages
Objects that are not something it already "recognizes." DALL-E knows what a chair is. It can give you something that is recognizably a chair in several dozen different art mediums. It could not with any amount of coaxing produce an "Otto bicycle", which my friend specifically wanted for her book cover. Its failed attempts were both hilarious and concerning.
Objects used in nonstandard ways. It seems to slide back toward some kind of ~prior; when I asked it for a dress made of Kermit plushies displayed on a store mannequin, it repeatedly gave me a Kermit plushie wearing a dress.
DALL-E generally seems to have extremely strong priors in a few areas, which end up being almost impossible to shift. I spent at least half an hour trying to convince it to give me digital art of a woman whose eyes were full of stars (no, not the rest of her, not the background scenery either, just her eyes...) and the closest DALL-E ever got was this.
Spelling
DALL-E can't spell. It really really cannot spell. It will occasionally spell a word correctly by utter coincidence. (Okay, fine, it can consistently spell "STOP" as long as it's written on a stop sign.)
It does mostly produce recognizable English letters (and recognizable attempts at Chinese calligraphy in other instances), and letter order that is closer to English spelling than to a random draw from a bag of Scrabble letters, so I would guess that even given the new model structure that makes DALL-E 2 worse than the first DALL-E, just scaling it up some would eventually let it crack spelling.
At least sometimes its inability to spell results in unintentionally hilarious memes?
Realistic human faces
My understanding is that the face model limitation may have been deliberate to avoid deepfakes of celebrities, etc. Interestingly, DALL-E can nonetheless at least sometimes do perfectly reasonable faces, either as photographs or in various art styles, if they're the central element of a scene. (And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.)
Even more interestingly, it seems to specifically alter the appearance of actors even when it clearly "knows" a particular movie or TV show. I asked it for "screenshots from the second season of Firefly", and they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definitely a different person.
There are a couple of specific cases where DALL-E seems to "remember" how human hands work. The ones I've found so far mostly involve a character doing some standard activity using their hands, like "playing a musical instrument." Below, I was trying to depict a character from A Song For Two Voices who's a Bard; this round came out shockingly good in a number of ways, but the hands particularly surprised me.
Limitations of the "edit" functionality
DALL-E 2 offers an edit functionality – if you mostly like an image except for one detail, you can highlight an area of it with a cursor, and change the full description as applicable in order to tell it how to modify the selected region.
It sometimes works - this gorgeous dress (didn't save the prompt, sorry) originally had no top, and the edit function successfully added one without changing the rest too much.
It often appears to do nothing. It occasionally full-on panics and does....whatever this is.
There's also a "variations" functionality that lets you select the best image given by a prompt and generate near neighbors of it, but my experience so far is that the variations are almost invariably less of a good fit for the original prompt, and very rarely better on specific details (like faces) that I might want to fix.
Some art style observations
DALL-E doesn't seem to hold a sharp delineation between style and content; in other words, adding stylistic prompts actively changes the some of what I would consider to be content.
For example, asking for a coffeeshop scene as painted by Alphonse Mucha puts the woman in in a long flowing period-style dress, like in this reference painting, and gives us a "coffeeshop" that looks a lot to me like a lady's parlor; in comparison, the Miyazaki anime version mostly has the character in a casual sweatshirt. This makes sense given the way the model was trained; background details are going to be systematically different between Nouveau Art paintings and anime movies.
DALL-E is often sensitive to exact wording, and in particular it's fascinating how "in the style of x" often gets very different results from "screenshot from an x movie". I'm guessing that in the Pixar case, generic "Pixar style" might capture training data from Pixar shorts or illustrations that aren't in their standard recognizable movie style. (Also, sometimes if asked for "anime" it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.)
Conclusions
How smart is DALL-E?
I would give it an excellent grade in recognizing objects, and most of the time it has a pretty good sense of their purpose and expected context. If I give it just the prompt "a box, a chair, a computer, a ceiling fan, a lamp, a rug, a window, a desk" with no other specification, it consistently includes at least 7 of the 8 requested objects, and places them in reasonable relation to each other – and in a room with walls and a floor, which I did not explicitly ask for. This "understanding" of objects is a lot of what makes DALL-E so easy to work with, and in some sense seems more impressive than a perfect art style.
The biggest thing I've noticed that looks like a ~conceptual limitation in the model is its inability to consistently track two different characters, unless they differ on exactly one trait (male and female, adult and child, red hair and blue hair, etc) – in which case the model could be getting this right if all it's doing is randomizing the traits in its bucket between the characters. It seems to have a similar issue with two non-person objects of the same type, like chairs, though I've explored this less.
It often applies color and texture styling to parts of the image other than the ones specified in the prompt; if you ask for a girl with pink hair, it's likely to make the walls or her clothes pink, and it's given me several Rapunzels wearing a gown apparently made of hair. (Not to mention the time it was confused about whether, in "Goldilocks and the three bears", Goldilocks was also supposed to be a bear.)
The deficits with the "edit" mode and "variations" mode also seem to me like they reflect the model failing to neatly track a set of objects-with-assigned-traits. It reliably holds the non-highlighted areas of the image constant and only modifies the selected part, but the modifications often seem like they're pulling in context from the entire prompt – for example, when I took one of my room-with-objects images and tried to select the computer and change it to "a computer levitating in midair", DALL-E gave me a levitating fan and a levitating box instead.
Working with DALL-E definitely still feels like attempting to communicate with some kind of alien entity that doesn't quite reason in the same ontology as humans, even if it theoretically understands the English language. There are concepts it appears to "understand" in natural language without difficulty – including prompts like "advertising poster for the new Marvel's Avengers movie, as a Miyazaki anime, in the style of an Instagram inspirational moodboard", which would take so long to explain to aliens, or even just to a human from 1900. And yet, you try to explain what an Otto bicycle is – something which I'm pretty sure a human six-year-old could draw if given a verbal description – and the conceptual gulf is impossible to cross.