I got access to DALL-E 2 earlier this week, and have spent the last few days (probably adding up to dozens of hours) playing with it, with the goal of mapping out its performance in various areas – and, of course, ending up with some epic art. 

Below, I've compiled a list of observations made about DALL-E, along with examples. If you want to request art of a particular scene, or to see what a particular prompt does, feel free to comment with your requests. 

DALL-E's strengths 

Stock photography content 

It's stunning at creating photorealistic content for anything that (this is my guess, at least) has a broad repertoire of online stock images – which is perhaps less interesting because if I wanted a stock photo of (rolls dice) a polar bear, Google Images already has me covered. DALL-E performs somewhat better at discrete objects and close-up photographs than at larger scenes, but it can do photographs of city skylines, or National Geographic-style nature scenes, tolerably well (just don't look too closely at the textures or detailing.) Some highlights: 

  • Clothing design: DALL-E has a reasonable if not perfect understanding of clothing styles, and especially for women's clothes and with the stylistic guidance of "displayed on a store mannequin" or "modeling photoshoot" etc, it can produce some gorgeous and creative outfits. It does especially plausible-looking wedding dresses – maybe because wedding dresses are especially consistent in aesthetic, and online photos of them are likely to be high quality? 
a "toga style wedding dress, displayed on a store mannequin"
  • Close-ups of cute animals. DALL-E can pull off scenes with several elements, and often produce something that I would buy was a real photo if I scrolled past it on Tumblr.
"kittens playing with yarn in a sunbeam"
  • Close-ups of food. These can be a little more uncanny valley – and I don't know what's up with the apparent boiled eggs in there – but DALL-E absolutely has the plating style for high-end restaurants down.
"dessert special, award-winning chef five star restaurant, close-up photograph"
  • Jewelry. DALL-E doesn't always follow the instructions of the prompt exactly (it seems to be randomizing whether the big pendant is amber or amethyst) but the details are generally convincing and the results are almost always really pretty. 
"silver statement necklace with amethysts and an amber pendant, close-up photograph"

 

Pop culture and media 

DALL-E "recognizes" a wide range of pop culture references, particularly for visual media (it's very solid on Disney princesses) or for literary works with film adaptations like Tolkien's LOTR. For almost all media that it recognizes at all, it can convert it into almost arbitrary art styles. 

"art nouveau stained glass window depicting Marvel's Captain America"
"Elsa from Frozen, cross-stitched sampler"
"Sesame Street, screenshots from the miyazaki anime movie"

[Tip: I find I get more reliably high-quality images from the prompt "X, screenshots from the Miyazaki anime movie" than from just "in the style of anime". I suspect this is because Miyazaki has a consistent style, whereas anime more broadly is probably pulling in a lot of poorer-quality anime art.]

Art style transfer

Some of the most impressively high-quality output involves specific artistic styles. DALL-E can do charcoal or pencil sketches, paintings in the style of various famous artists, and some weirder stuff like "medieval illuminated manuscripts". 

"a monk riding a snail, medieval illuminated manuscript"

IMO it performs especially well with art styles like "impressionist watercolor painting" or "pencil sketch", that are a little more forgiving around imperfections in the details.  

"A woman at a coffeeshop working on her laptop and wearing headphones, painting by Alphonse Mucha"
"a little girl and a puppy playing in a pile of autumn leaves, photorealistic charcoal sketch"

 

Creative digital art

DALL-E can (with the right prompts and some cherrypicking) pull off some absolutely gorgeous fantasy-esque art pieces. Some examples: 

"a mermaid swimming underwater, photorealistic digital art"
"a woman knitting the Milky Way galaxy into a scarf, photorealistic digital art"

The output when putting in more abstract prompts (I've run a lot of "[song lyric or poetry line], digital art" requests) is hit-or-miss, but with patience and some trial and error, it can pull out some absolutely stunning – or deeply hilarious – artistic depictions of poetry or abstract concepts. I kind of like using it in this way because of the sheer variety; I never know where it's going to go with a prompt. 

"an activist destroyed by facts and logic, digital art"
"if the lord won't send us water, well we'll get it from the devil, digital art"
"For you are made of nebulas and novas and night sky You're made of memories you bury or live by, digital art" (lyric from Never Look Away by Vienna Teng)

The future of commercials 

This might be just a me thing, but I love almost everything DALL-E does with the prompt "in the style of surrealism" – in particular, its surreal attempt at commercials or advertisements. If my online ads were 100% replaced by DALL-E art, I would probably click on at least 50% more of them. 

"an advertisement for sound-cancelling headphones, in the style of surrealism"

DALL-E's weaknesses

I had been really excited about using DALL-E to make fan art of fiction that I or other people have written, and so I was somewhat disappointed at how much it struggles to do complex scenes according to spec. In particular, it still has a long way to go with:

Scenes with two characters 

I'm not kidding. DALL-E does fine at giving one character a list of specific traits (though if you want pink hair, watch out, DALL-E might start spamming the entire image with pink objects). It can sometimes handle multiple generic people in a crowd scene, though it quickly forgets how faces work. However, it finds it very challenging to keep track of which traits ought to belong to a specific Character A versus a different specific Character B, beyond a very basic minimum like "a man and a woman." 

The above is one iteration of a scene I was very motivated to figure out how to depict, as a fan art of my Valdemar rationalfic. DALL-E can handle two people, check, and a room with a window and at least one of a bed or chair, but it's lost when it comes to remembering which combination of age/gender/hair color is in what location. 

"a young dark-haired boy resting in bed, and a grey-haired older woman sitting in a chair beside the bed underneath a window with sun streaming through, Pixar style digital art"

Even in cases where the two characters are pop culture references that I've already been able to confirm the model "knows" separately – for example, Captain America and Iron Man – it can't seem to help blending them together. It's as though the model has "two characters" and then separately "a list of traits" (user-specified or just implicit in the training data), and reassigns the traits mostly at random.

"Captain America and Iron Man standing side by side" which is which????

Foreground and background

A good example of this: someone on Twitter had commented that they couldn't get DALL-E to provide them with "Two dogs dressed like roman soldiers on a pirate ship looking at New York City through a spyglass". I took this as a CHALLENGE and spent half an hour trying; I, too, could not get DALL-E to output this, and ended up needing to choose between "NYC and a pirate ship" or "dogs in Roman soldier uniforms with spyglasses". 

DALL-E can do scenes with generic backgrounds (a city, bookshelves in a library, a landscape) but even then, if that's not the main focus of the image then the fine details tend to get pretty scrambled. 

Novel objects, or nonstandard usages 

Objects that are not something it already "recognizes." DALL-E knows what a chair is. It can give you something that is recognizably a chair in several dozen different art mediums. It could not with any amount of coaxing produce an "Otto bicycle", which my friend specifically wanted for her book cover. Its failed attempts were both hilarious and concerning. 

prompt was something like "a little girl with dark curly hair riding down a barren hill on a magical rickshaw with enormous bicycle wheels, in the style of Bill Watterson"
An actual Otto bicycle, per Google Images

Objects used in nonstandard ways. It seems to slide back toward some kind of ~prior; when I asked it for a dress made of Kermit plushies displayed on a store mannequin, it repeatedly gave me a Kermit plushie wearing a dress. 

"Dress made out of Kermit plushies, displayed on a store mannequin"

DALL-E generally seems to have extremely strong priors in a few areas, which end up being almost impossible to shift. I spent at least half an hour trying to convince it to give me digital art of a woman whose eyes were full of stars (no, not the rest of her, not the background scenery either, just her eyes...) and the closest DALL-E ever got was this.

I wanted: the Star-Eyed Goddess
I got: the goddess-eyed goddess of recursion

Spelling

DALL-E can't spell. It really really cannot spell. It will occasionally spell a word correctly by utter coincidence. (Okay, fine, it can consistently spell "STOP" as long as it's written on a stop sign.) 

It does mostly produce recognizable English letters (and recognizable attempts at Chinese calligraphy in other instances), and letter order that is closer to English spelling than to a random draw from a bag of Scrabble letters, so I would guess that even given the new model structure that makes DALL-E 2 worse at spelling than the first DALL-E, just scaling it up some would eventually let it crack spelling.

At least sometimes its inability to spell results in unintentionally hilarious memes? 

EmeRAGEencey!

Realistic human faces

My understanding is that the face model limitation may have been deliberate to avoid deepfakes of celebrities, etc. Interestingly, DALL-E can nonetheless at least sometimes do perfectly reasonable faces, either as photographs or in various art styles, if they're the central element of a scene. (And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.) 

Even more interestingly, it seems to specifically alter the appearance of actors even when it clearly "knows" a particular movie or TV show. I asked it for "screenshots from the second season of Firefly", and they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definitely a different person. 

There are a couple of specific cases where DALL-E seems to "remember" how human hands work. The ones I've found so far mostly involve a character doing some standard activity using their hands, like "playing a musical instrument." Below, I was trying to depict a character from A Song For Two Voices who's a Bard; this round came out shockingly good in a number of ways, but the hands particularly surprised me. 

Limitations of the "edit" functionality 

DALL-E 2 offers an edit functionality – if you mostly like an image except for one detail, you can highlight an area of it with a cursor, and change the full description as applicable in order to tell it how to modify the selected region. 

It sometimes works - this gorgeous dress (didn't save the prompt, sorry) originally had no top, and the edit function successfully added one without changing the rest too much.

This is how people will dress in the glorious transhumanist future. 

It often appears to do nothing. It occasionally full-on panics and does....whatever this is. 

I was just trying to give the figure short hair!

There's also a "variations" functionality that lets you select the best image given by a prompt and generate near neighbors of it, but my experience so far is that the variations are almost invariably less of a good fit for the original prompt, and very rarely better on specific details (like faces) that I might want to fix.

Some art style observations 

DALL-E doesn't seem to hold a sharp delineation between style and content; in other words, adding stylistic prompts actively changes some of what I would consider to be content. 

For example, asking for a coffeeshop scene as painted by Alphonse Mucha puts the woman in a long flowing period-style dress, like in this reference painting, and gives us a "coffeeshop" that looks a lot to me like a lady's parlor; in comparison, the Miyazaki anime version mostly has the character in a casual sweatshirt. This makes sense given the way the model was trained; background details are going to be systematically different between Art Nouveau paintings and anime movies. 

"A woman at a coffeeshop working on her laptop and wearing headphones, painting by Alphonse Mucha"
"A woman at a coffeeshop working on her laptop and wearing headphones, screenshots from the miyazaki anime movie"

DALL-E is often sensitive to exact wording, and in particular it's fascinating how "in the style of x" often gets very different results from "screenshot from an x movie". I'm guessing that in the Pixar case, generic "Pixar style" might capture training data from Pixar shorts or illustrations that aren't in their standard recognizable movie style. (Also, sometimes if asked for "anime" it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.) 
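Since small wording changes matter this much, it can help to write out the prompt variants systematically before pasting them into DALL-E one at a time. Here is a minimal sketch of that in Python. The helper name build_prompts is made up for illustration, the subject and style suffixes are the ones quoted in this section, and nothing here calls DALL-E itself; it only builds the strings.

```python
# Subject and style suffixes copied from the prompts quoted in this section.
SUBJECT = "A woman at a coffeeshop working on her laptop and wearing headphones"
STYLE_SUFFIXES = [
    "painting by Alphonse Mucha",
    "screenshots from the miyazaki anime movie",
    "screenshots from the Pixar movie",
    "in the style of Pixar",
]


def build_prompts(subject, suffixes):
    """Hypothetical helper: return one 'subject, style' prompt per suffix."""
    return [f"{subject}, {suffix}" for suffix in suffixes]


# Print every variant so they can be compared side by side.
for prompt in build_prompts(SUBJECT, STYLE_SUFFIXES):
    print(prompt)
```

Keeping the subject fixed and varying only the suffix makes it easier to tell which differences in the output come from the style wording.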

"A woman at a coffeeshop working on her laptop and wearing headphones, screenshots from the Pixar movie"
"A woman at a coffeeshop working on her laptop and wearing headphones, in the style of Pixar"

Conclusions

How smart is DALL-E? 

I would give it an excellent grade in recognizing objects, and most of the time it has a pretty good sense of their purpose and expected context. If I give it just the prompt "a box, a chair, a computer, a ceiling fan, a lamp, a rug, a window, a desk" with no other specification, it consistently includes at least 7 of the 8 requested objects, and places them in reasonable relation to each other – and in a room with walls and a floor, which I did not explicitly ask for. This "understanding" of objects is a lot of what makes DALL-E so easy to work with, and in some sense seems more impressive than a perfect art style. 

The biggest thing I've noticed that looks like a ~conceptual limitation in the model is its inability to consistently track two different characters, unless they differ on exactly one trait (male and female, adult and child, red hair and blue hair, etc) – in which case the model could be getting this right if all it's doing is randomizing the traits in its bucket between the characters. It seems to have a similar issue with two non-person objects of the same type, like chairs, though I've explored this less. 
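To make the "bucket of traits" guess concrete, here is a toy calculation in Python. It is only a sketch of my hypothesis, not a claim about how DALL-E actually works. It assumes each pair of contrasting traits (for example boy/woman, dark-haired/grey-haired, in bed/in a chair) is handed out to the two characters by an independent coin flip, and that an image counts as correct if every trait lands on the right character, no matter which of the two figures ends up being which. The function name is invented for this example.

```python
from fractions import Fraction
from itertools import product


def chance_of_consistent_characters(num_trait_pairs):
    """Toy model: probability that random trait assignment gets both characters right.

    Each trait pair is a coin flip: 0 means character A's trait lands on the
    first figure, 1 means it lands on the second. The image is consistent only
    if every flip agrees, so all of A's traits end up on the same figure.
    """
    if num_trait_pairs < 1:
        raise ValueError("need at least one distinguishing trait pair")
    outcomes = list(product((0, 1), repeat=num_trait_pairs))
    consistent = [flips for flips in outcomes if len(set(flips)) == 1]
    return Fraction(len(consistent), len(outcomes))


for k in (1, 2, 3):
    print(k, chance_of_consistent_characters(k))
```

Under these assumptions a single distinguishing trait always comes out right, two traits come out right half the time, and three traits a quarter of the time, which is roughly the pattern I would expect if the model really were shuffling traits between characters.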

It often applies color and texture styling to parts of the image other than the ones specified in the prompt; if you ask for a girl with pink hair, it's likely to make the walls or her clothes pink, and it's given me several Rapunzels wearing a gown apparently made of hair. (Not to mention the time it was confused about whether, in "Goldilocks and the three bears", Goldilocks was also supposed to be a bear.) 

The deficits with the "edit" mode and "variations" mode also seem to me like they reflect the model failing to neatly track a set of objects-with-assigned-traits. It reliably holds the non-highlighted areas of the image constant and only modifies the selected part, but the modifications often seem like they're pulling in context from the entire prompt – for example, when I took one of my room-with-objects images and tried to select the computer and change it to "a computer levitating in midair", DALL-E gave me a levitating fan and a levitating box instead. 

Working with DALL-E definitely still feels like attempting to communicate with some kind of alien entity that doesn't quite reason in the same ontology as humans, even if it theoretically understands the English language. There are concepts it appears to "understand" in natural language without difficulty – including prompts like "advertising poster for the new Marvel's Avengers movie, as a Miyazaki anime, in the style of an Instagram inspirational moodboard", which would take so long to explain to aliens, or even just to a human from 1900. And yet, you try to explain what an Otto bicycle is – something which I'm pretty sure a human six-year-old could draw if given a verbal description – and the conceptual gulf is impossible to cross. 

"advertising poster for the new Marvel's Avengers movie, as a Miyazaki anime, in the style of an Instagram inspirational moodboard"


Swimmer963 highlights DALL-E 2 struggling with anime, realistic faces, text in images, multiple characters/objects arranged in complex ways, and editing. (Of course, many of these are still extremely good by the standards of just months ago, and the glass is definitely more than half full.) itsnotatumor asks:

How many of these "cannot do's" will be solved by throwing more compute and training data at the problem? Anyone know if we've started hitting diminishing returns with this stuff yet?

In general, we have not topped out on pretty much any scaling curve. Whether it's language modeling, image generation, DRL, or whathaveyou, AFAIK, not a single modality can be truly said to have been 'solved' with the scaling curve broken. Either the scaling curve is flat, or we're still far away. (There are some sound-related ones which seem to be close, but nothing all that important.) Diffusion models' only scaling law I know of is an older one which bends a little but probably reflects poor hyperparameters, and no one has tried eg. Chinchilla on them yet.

So yes, we definitely can just make all the compute-budgets 10x larger without wasting it.

To go through the specific issues (caveat: we do... (read more)

Wow, this is going to explode picture books and book covers.

Hiring an illustrator for a picture book costs a lot, as it should given it's bespoke art.

Now publishers will have an editor type in page descriptions, curate the best and off they go. I can easily imagine a model improvement to remember the boy drawn or steampunk bear etc.

Book cover designers are in trouble too. A wizard with lightning in hands while mountain explodes behind him - this can generate multiple options.

It's going to get really wild when A/B split testing is involved. As you mention regarding ads you'd give the system the power to make whatever images it wanted and then split test. Letting it write headlines would work too.

Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds. And so on. At that point it's like Pixar in a box. We'll see an explosion of directors who work alone, typing descriptions, testing camera angles, altering scenes on the fly. Do that again but more violent. Do that again but with more blood splatter.

Animation in the style of Family Guy seems a natural first ... (read more)

Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds.

Video is on the horizon (video generation bibliography), in the 1-3 year range. I would say that video is solved conceptually in the sense that if you had 100x the compute budget, you could do DALL-E-2-but-for-video right now already. After all, if you can do a single image which is sensible and logical, then a video is simply doing that repeatedly. Nor is there any shortage of video footage to work with. The problem there is that a video is a lot of images: at least 24 images per second, so you could have 192 different samples, or 1 8s clip. Most people will prefer the former: decorating, say, a hundred blog posts with illustrations is more useful than a single OK short video clip of someone dancing.
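The arithmetic behind that trade-off, as a small Python sketch. It makes the simplifying assumption that one video frame costs about as much to generate as one standalone image; the function names are invented for this example.

```python
def frames_in_clip(clip_seconds, frames_per_second=24):
    """Number of still frames needed for a clip of the given length."""
    return round(clip_seconds * frames_per_second)


def clips_per_budget(image_budget, clip_seconds, frames_per_second=24):
    """How many whole clips fit in a budget counted in single images."""
    return image_budget // frames_in_clip(clip_seconds, frames_per_second)


print(frames_in_clip(8))         # 192 frames for one 8-second clip at 24 fps
print(clips_per_budget(192, 8))  # 1 clip for the same budget as 192 stills
```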

So video's game is mostly about whether you can come up with an approach which can somehow economize on that, like clever tricks in reusing frames to update only a little while updating a latent vector, as a way to take a shortcut to that point in the future where you had so much compute that the obvious Transformer & Diffusion models can be run in reasonable compute-budgets & video 'just worked'.

And either way, it may be the revolution that robotics requires (video is a great way to plan).

2Sable19d
Following up on your logic here, the one thing that DALLE-2 hasn't done, to my knowledge, is generate entirely new styles of art, the way that art deco or pointillism were truly different from their predecessors. Perhaps that'll be the new role of human illustrators? Artists, instead of producing their own works to sell, would instead create their own styles, generating libraries of content for future DALLEs to be trained against. They then make a percentage on whatever DALLE makes from image sales if the style used was their own.
6Swimmer96319d
...Hmm now I'm wondering if feeding DALL-E an "in the style of [ ]" request with random keywords in the blank might cause it to do replicable weird styles, or if it would just get confused and do something different every time.
3Sable17d
I'd love to see it tried. Maybe even ask for "in the style of DALLE-2"?
7Swimmer96316d
"A woman riding a horse, in the style of DALLE-2"
1Sable13d
I have no idea how to interpret this. Any ideas? It seems like we got a variety of different styles, with red, blue, black, and white as the dominant colors. Can we say that DALLE-2 has a style of its own?

I think DALL-E has been nerfed (as a sort of low-grade "alignment" effort) and some of what you're talking about as "limitations" are actually bugs that were explicitly introduced with the goal of avoiding bad press.

OpenAI has made efforts to implement model-level technical mitigations that ensure that DALL·E 2 Preview cannot be used to directly generate exact matches for any of the images in its training data. However, the models may still be able to compose aspects of real images and identifiable details of people, such as clothing and backgrounds. (sauce)

It wouldn't surprise me if they just used intelligibility tools to find the part of the vectorspace that represents "the face of any famous real person" and then applied some sort of noise blur to the model itself, as deployed?

Except! Maybe not a "blur" but some sort of rotation of a subspace or something? This hint is weirdly evocative:

they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definite

... (read more)
5gwern15d
Yes, I thought their 'horse in ketchup' example made the point well that it's an 'artificial stupidity' Harrison-Bergeron sort of approach rather than a genuine solution. (And then, like BPEs, there seems to be unpredictable fallout which would be hard to benchmark and which no one apparently even thought to benchmark - despite whatever they did on May 1st to upgrade quality, the anime examples still struggle to portray specific characters like Kyuubey, where Swimmer's examples are all very Kyuubey-esque but never actually Kyuubey. I am told the CLIP used is less degraded, and so we're probably seeing the output of 'CLIP models which know about characters like Kyuubey combined with other models which have no idea'.)

Thread of all known anime examples.

whereas anime more broadly is probably pulling in a lot of poorer-quality anime art...(Also, sometimes if asked for “anime” it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.)

That's how you know it's not a problem of pulling in lots of poorer-quality anime art. First, poorer-quality doesn't impede learning that much; remember, you just prompt for high-quality. Compute allowing, more n is always better. And second, if it was a master of poorer-quality anime drawings, it wouldn't be desperately 'sliding away', if you will, like squeezing a balloon, from rendering true anime, as opposed to CGI of anime or Western fanart of anime or photographs of physical objects related to anime. It would just do it (perhaps generating poorer-quality anime), not generate high-quality samples of everything but anime. (See my comment there for more examples.)

The problem is it's somehow not trained on anime. Everything it knows about anime seems to come primarily from adjacent images and the CLIP guidance (which does know plenty about anime, but we also know that pixel generation from CLIP guidance never works as well).

Challenging prompt ideas to try:

  • A row of five squares, in which the rightmost four squares each have twice the area of the square to their  immediate left.
  • Screenshots from a novel game comparable in complexity to tic-tac-toe sufficient to demonstrate the rules of the game.
  • Elon Musk signing his own name in ASL.
  • The hands of a pianist as they play the first chord from Chopin's Polonaise in Ab major, Op. 53
  • Pages from a flip book of a water glass spilling.
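For reference, the first prompt has a definite correct answer: doubling the area of a square multiplies its side by the square root of 2, so the five sides should be in the ratio 1, √2, 2, 2√2, 4. A short Python sketch to check this (the function name and the starting side length of 1 are arbitrary choices for illustration):

```python
import math


def square_sides(first_side=1.0, count=5):
    """Side lengths for a row of squares, each with twice the area of its left neighbor."""
    return [first_side * math.sqrt(2) ** i for i in range(count)]


sides = square_sides()
areas = [side * side for side in sides]

# Print each square's side and area, left to right.
for side, area in zip(sides, areas):
    print(f"side {side:.3f}, area {area:.3f}")
```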

First one: ....yeah no, DALL-E 2 can't count to five, it definitely doesn't have the abstract reasoning to double areas. Image below is literally just "a horizontal row of five squares". 

6AllAmericanBreakfast21d
Very interesting that it can't manage to count to five. That to me is strong evidence that DALL-E's not "constructing" the scenes it depicts. I guess it has more of a sense of relationships among scene element components? Like, "coffee shop" means there's a window-like element, and if there's a window element, then there's some sort of scene through the window, and that's probably some sort of rectangular building shape. Plausible guesses all the way down to the texture and color of skin or fur. Filling in the blanks on steroids, but with a complete lack of design or forethought.

Yeah, this matches with my sense. It has a really extensive knowledge of the expected relationships between elements, extending over a huge number of kinds of objects, and so it can (in one of the areas that are easy for it) successfully fill in the blanks in a way that looks very believable, but the extent to which it has a gears-y model of the scene seems very minimal. I think this also explains its difficulty with non-stereotypical scenes that don't have a single focal element – if it's filling in the blanks for both "pirate ship scene" and "dogs in Roman uniforms scene" it gets more confused. 

4AllAmericanBreakfast21d
You're making my dreams come true. I really want to see the Elon Musk one :) Edit: or the waterglass spilling. That's the one with my most uncertainty about its performance.

The Elon Musk one has realistic faces so I can't share it; I have, however, confirmed that DALL-E does not speak ASL with "The ASL word for "thank you"":

5AllAmericanBreakfast21d
We've got some funky fingers here. Six fingers, a sort of double-tipped finger, an extra joint on the index finger on picture (1, 4). Fascinating.
2Measure19d
It seems to be mostly trying to go for the "I love you" sign, perhaps because that's one of the most commonly represented ones.
1jasperdale2d
I'm curious why this prompt resulted in overwhelmingly black looking hands. Especially considering that all the other prompts I see result in white subjects being represented. Any theories?
4gwern1d
It's unnatural, yes: ASL is predominantly white, and people involved in ASL are even more so (I went to NTID and the national convention, so can speak first-hand, but you can also check Google Image for that query and it'll look like what you expect, which is amusing because 'Deaf' culture is so university & liberal-centric). So it's not that ASL diagrams or photographs in the wild really do look like that - they don't. Overrepresentation of DEI material in the supersekrit licensed databases would be my guess. Stock photography sources are rapidly updated for fashions, particularly recent ones, and you can see this occasionally surfacing in weird queries. (An example going around Twitter which you can check for yourself: "happy white woman" in Google will turn up a lot of strange photos for what seems like a very easy straightforward query.) Which parts are causing it is a better question: I wouldn't expect there to be much Deaf stock photo material which had been updated, or much ASL material at all, so maybe there's bleedthrough from all of the hand-centric (eg 'Black Power salute', upraised Marxist fists, protests) iconography? There being so much of the latter and so little of the former that the latter becomes the default kind of hand imagery.
1jasperdale1d
It must be something like that, but it still feels like there's a hole there. The query is for "ASL", not "Hands", and these images don't look like something from a protest. The top left might be vaguely similar to some kind of street gesture. I'm curious what the role of the query writer is. Can you ask DALL-E for "this scene, but with black skin colour"? I got a sense that updating areas was possible but inconsistent. Could DALL-E learn to return more of X to a given person by receiving feedback? I really don't know how complicated the process gets.
2gwern1d
ASL will always be depicted by a model like DALL-E as hands; I am sure that there are non-pictorial ways to write down ASL but I can't recall them, and I actually took ASL classes. So that query should always produce hands in it. Then because actual ASL diagrams will be rare and overwhelmed by leakage from more popular classes (keep in mind that deafness is well under 1% of the US population, even including people like me who are otherwise completely uninvolved and invisible, and basically any political fad whatsoever will rapidly produce vastly more material than even core deaf topics), and maybe some more unCLIP looseness...

"Pages from a flip book of a water glass spilling" I...think DALL-E 2 does not know what a flip book is. 

9Swimmer96321d
I...think it just does not understand the physics of water spilling, period.
7Swimmer96321d
Relatedly, DALL-E is a little confused about how Olympic swimming is supposed to work.
5AllAmericanBreakfast21d
This is interesting, because you'd think it would at least understand that the cup should be tipping over. Makes me think it is considering the cup and the water as two distinct objects, and doesn't really understand that the cup tipping over would be what causes the water to spill. But it does understand that the water should be located "inside" the cup, but probably purely in a "it looks like the water is inside the cup" sense. I don't think DALL-E seems to understand the idea of "inside" as an actual location.
1Nazarii11d
I wonder if its understanding of the world is just 2D or semi-3D. Perhaps training it on photogrammetry datasets (photos of the same objects but from multiple points of view) would improve that?

Slightly reworded to "a game as complex tic-tac-toe, screenshots showing the rules of the game". I am pretty sure DALL-E is not able to generate and model consistent game rules, though.

3AllAmericanBreakfast21d
At least it seems to have figured out we wanted a game that was not tic-tac-toe.
6Charlie Steiner21d
Depends on if it generates stuff like this if you ask it for tic-tac-toe :P
1kjz20d
What about the combo: a tic-tac-toe board position, a tic-tac-toe board position with X winning, and a tic-tac-toe board position with O winning. Would it give realistic positions matching the descriptions?
2Swimmer96320d
I really doubt it but I'll give it a try once I'm caught up on all the requested prompts here!

Thanks for this thorough account. The bit where you tried to shorten the hair really made me laugh.

A prompt I'd love to see: "Anomalocaris Canadensis flying through space." I'm really curious how well it does with an extinct species which has very few existing artistic depictions. No text->image model I've played with so far has managed to create a convincing anomalocaris, but one interestingly did know it was an aquatic creature and kept outputting lobsters.

Going by the Wikipedia page reference, I think it got it somewhat closer than "lobsters" at least? 

I'd rate these highly. There are many forms of anomalocarids (https://en.m.wikipedia.org/wiki/Radiodonta#/media/File%3A20191201_Radiodonta_Amplectobelua_Anomalocaris_Aegirocassis_Lyrarapax_Peytoia_Laggania_Hurdia.png) and it looks to have picked a wide variety aside from just canadensis, but I'm thoroughly impressed that it got the form right in nearly all 10.

DALL-E is often sensitive to exact wording, and in particular it’s fascinating how “in the style of x” often gets very different results from “screenshot from an x movie”. I’m guessing that in the Pixar case, generic “Pixar style” might capture training data from Pixar shorts or illustrations that aren’t in their standard recognizable movie style.

I've seen this prompt programming bug noted on Twitter by DALL-E 2 users as well. With earlier models, there didn't seem to be that much difference between 'by X' vs 'in the style of X', but with the new high-e...

This is great! I'm generally most interested to see people finding weaknesses of new DL tools, which in and of itself is a sign of how far the technology has progressed.

The "one character" limitation makes it look like DALL-E was spawned from ongoing, massive programs to develop object recognizing systems, not any sort of general generative system. 

Would it be accurate to characterize DALL-E as "basically inverted object recognition"?

I wonder if you could get it to generate Minecraft screenshots, such as:

  • A log cabin in a clearing in a dark forest, as a screenshot from Minecraft

It would also be interesting to see how “as a screenshot from Minecraft” combines with other styles:

  • A wagon caravan approaches a ruined city in the desert, as a Miyazaki anime, as a screenshot from Minecraft

You could also append “as a screenshot from Minecraft” to more abstract prompts, for example:

  • A machine that harvests luck from four leaf clovers, as a screenshot from Minecraft

Finally, some other miscellaneo...

Prompt from my brother:

What people from 1920 thought 2020 would look like. 1920's Artist's depiction of 2020

4Swimmer96318d
"What people from 1920 thought 2020 would look like. 1920's Artist's depiction of 2020"

When they released the first Dall-E, didn't OpenAI mention that prompts which repeated the same description several times with slight re-phrasing produced improved results?

I wonder how a prompt like:

"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney."

-would compare with something like:

"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney.  A painting of an ornate robotic feline made of brass and a man wearing futuristic tribal clothing.  A steampunk scene by James Gurney featuring a robot shaped like a panther and a high-tech shaman."

5Swimmer96320d
"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney." Vs "A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney. A painting of an ornate robotic feline made of brass and a man wearing futuristic tribal clothing. A steampunk scene by James Gurney featuring a robot shaped like a panther and a high-tech shaman." Huh! Yeah, the second one definitely does seem to incorporate more detail.
1artifex020d
Thanks! I'm not sure how much the repetitions helped with accuracy for this prompt: it's still sort of randomizing traits between the two subjects. Though with a prompt this complex, the token limit may be an issue, so it might be interesting to test at some point whether very simple prompts get more accurate with repetitions. That said, the second set are pretty awesome. Asking for a scene may have helped encourage some more interesting compositions. One benefit of repetition may just be that you're more likely to include phrases that more accurately describe what you're looking for.
4Shai Noy20d
Good point. I've also noticed good results for adding multiple details by mentioning each individually. E.g. instead of "tribesman with a blue robe, holding a club, looking angry, with a pet robot tiger" try "A tribesman with a pet tiger. The tribesman wears a blue robe. The tribesman is angry. The tribesman is holding a club. The tiger is a cyberpunk robot."
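
For anyone who wants to try the one-detail-per-sentence approach systematically, here is a minimal Python sketch. The function name `split_details` and the exact sentence pattern are my own assumptions for illustration, not anything DALL-E requires; it only builds the prompt string and does not call any image model.

```python
def split_details(subject: str, details: list[str]) -> str:
    """Build a prompt that states each detail in its own short sentence.

    `subject` is a noun phrase such as "tribesman"; each entry in `details`
    is a verb phrase such as "wears a blue robe". Both are hypothetical
    inputs chosen for this illustration.
    """
    sentences = [f"A {subject}."]
    # Repeat the subject explicitly in every sentence instead of chaining commas.
    sentences += [f"The {subject} {detail}." for detail in details]
    return " ".join(sentences)


prompt = split_details(
    "tribesman",
    ["wears a blue robe", "is angry", "is holding a club"],
)
print(prompt)
```

Whether this actually improves results would still need to be checked by running both versions of a prompt side by side.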

Prompt suggestion: "A drawing of an animal which has no resemblance to a cat"

5Swimmer96313d
Yeahhhh, as I expected DALL-E cannot super follow the negation here. (We also tried to ask it for "a stop sign, spelled incorrectly" and it just gave us stop signs.)
2Vanessa Kosoy13d
Hmm, theoretically, DALL-E might be assuming the prompt is irony. What about this: "Apparently, this is a cat???"
4Swimmer96312d
Yeah, no, it just gives me...cats.

even if it theoretically understands the English language.

If you mix up a prompt into random words so that it's no longer grammatically correct English, does it give worse results? That is, I wonder how much it's basically just going off keywords.
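
One way to set up that comparison is to shuffle the words of a prompt reproducibly and submit both versions. This is only a sketch of the scrambling step; the helper name `shuffle_words` and the fixed seed are assumptions for illustration, and nothing here talks to DALL-E.

```python
import random


def shuffle_words(prompt: str, seed: int = 0) -> str:
    """Return the prompt with its words in a random order.

    The same words are kept, only the order changes, so any drop in
    image quality would point at grammar or word order rather than keywords.
    A fixed seed makes the scrambled prompt reproducible.
    """
    words = prompt.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


original = "A red cube on top of a blue sphere, digital art"
scrambled = shuffle_words(original, seed=1)
print(original)
print(scrambled)
```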

1Dirichlet-to-Neumann20d
That's an interesting question! Although it clearly understands things like spatial positioning so it must understand some grammar.

Curated. I think this post is a great demonstration of what our last curation choice suggested

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals. 

I'm not yet convinced this will be especially fruitful, but t...

I gather we're allowed to suggest prompts we wish to see? Here's a prompt trying to create fanart for my favourite web serial, Pale by Wildbow:

"A girl with red-blonde hair, in a forest. The girl is wearing a deer-mask with short antlers, a cape over a jersey and shorts, and a witch's hat. The girl is holding a hockey stick. Every branch of every tree has a bright ribbon tied to it. The cape rests atop her shoulders and falls over one arm like a musketeer's cape. The witch's hat and the cape are both navy-blue."

3Swimmer96314d
The AI is sort of trying to make these photographs, but I am judging that none of them are in danger of being photorealistic faces...
3ArisKatsaris14d
Lol. Thank you. To make it look like fanart I should have probably specified something about it, because these currently look more like photos of LARPing. Many thanks and I appreciate the effort. If it's not overtaxing your generosity can we then also attempt the following tweaks? "Highly detailed and beautiful digital art of a fantasy character: A 13-year-old girl with red-blonde hair, in a forest. The girl is wearing a deer-mask with short antlers, a cape over a jersey, and a witch's hat. The girl is holding a hockey stick. Every branch of every tree has a bright ribbon tied to it. The cape rests atop her shoulders and falls over one arm like a musketeer's cape." Many thanks in advance!
6Swimmer96313d
Ooooh! Yeah, definitely much more fantasy fan art style.

Thanks Swimmer963! This was very interesting.

I have a general question for the community. Does anyone know of similar descriptions of model limitations, with this many examples, for language models such as GPT-3?

My personal experience is that visual output is inherently more intuitive, but I'd love to explore my intuition around language models with an equivalent article for GPT-3 or PaLM for example.

I'd predict such articles exist with high confidence but finding the appropriate article with sufficient quality might be trickier. I'm curious which articles commenters here would select in particular.

Some prompts:

The Last Supper by Leonardo Da Vinci, but painted from behind.

The Last Supper by Leonardo Da Vinci, but painted from above, looking straight downwards.

The Last Supper by Leonardo Da Vinci, as an X-ray image.

Relativity by Escher, as a high-resolution photograph.

Boris Johnson dressed as a clown and riding a unicycle along a tightrope, spray-painted onto a wall, in the style of Banksy.

"The Last Supper by Leonardo Da Vinci, as an X-ray image" It's trying! 

I especially like this one (close-up): https://labs.openai.com/s/QsWCxHvbwRaIJEB7xbTCnvwx

1bfinn20d
Thanks very much - yes, that one is pretty remarkable, as are several of them. On the close-up I see loaves, some kind of gadget left of centre, and is that the baby Jesus (with beard?) they're about to tuck into?! (I assume DALLE-2 is not always sure how to show people from this perspective.)
6Swimmer96320d
"The Last Supper by Leonardo Da Vinci, but painted from behind". (Based on previous playing around, I think that DALL-E does not have a super strong conception of "The Last Supper" in general, and sort of defaults to a generic supper table.)
2bfinn20d
Thanks. Interesting that it gets the general idea of 'from behind' but the specifics garbled - eg bottom left the people should be sitting on the bench, not the other side of the table!

This is great! Thanks.

A nitpick:

adding stylistic prompts actively changes the some of what I would consider to be content

Your examples here are not good since e.g. "...painting by Alphonse Mucha" is not just a rewording of "...in the style of Alphonse Mucha": the former isn't a purely stylistic prompt. For a [painting by x], x gets to decide what is in the painting - so it should be expected that this will change the content.
Similarly for "screenshots from the miyazaki anime movie".

Of course it's still a limitation if you can only get really good style results by using such not-purely-stylistic prompts.

3Swimmer96320d
That's a reasonable point. I have definitely found that saying "a painting by X" or "a movie by X" gets results that a) I personally like much better, and b) are much more consistently and recognizably in the requested style! I'm not sure whether "in the style of X" just ends up being less of a strong hint for DALL-E, or whether it's pulling on a much bigger set of training data. Maybe there are all sorts of images online labeled as "in the style of Alphonse Mucha" by people who don't actually know how to assess styles? Anyway, this is "A woman at a coffeeshop working on her laptop and wearing headphones, in the style of Alphonse Mucha" and it's fine but it's much less what I ordered!

Some prompts I’d love to see: “Infinite Jest” “Bedroom with impossible geometry” “Coffeeshop in non-Euclidian hyperbolic space” “Screenshot of Wikipedia front page” “The shadow in the corner of the room stared at me”

"A screenshot of the Wikipedia home page" this is one of the results that makes me feel ~anthropomorphized fondness for DALL-E. It's trying so hard! 

8cwillu20d
It's basically what text looks like when I dream.
2philip_b18d
"A screenshot of the Wikipedia home page, Halloween version" please.
4Swimmer96318d
This came out super cute! Thanks for the prompt idea :)
1PoignardAzur6d
Fascinating. Dall-E seems to have a pretty good understanding of "things that should be straight lines", at least in this case.
4Swimmer96320d
I ran "Bedroom with impossible geometry by MC Escher" to give DALL-E more hints, because the first run was really not very impossible-geometry, I'm not sure if DALL-E was managing to parse that as a clause or just hearing 'geometry.'
4Swimmer96320d
"Infinite Jest"
2Swimmer96320d
"The shadow in the corner of the room stared at me, digital art"

I'd love to see:

>A group of happy people does Circling and Authentic Relating in a park

Big black furry monsters with tall heads are wearing construction outfits and are dripping with water and seaweed. They are using a dolly to pick up a dumpster in an alley and pointing at where to bring it. Realistic HD photo.

9Swimmer96315d

I am so confused by two completions of a human girl here. How is this possibly close in image-space to all the other images, especially given this prompt?

5gwern15d
That's an unusually realistic face, and a distinct hairstyle. I suspect that's a real person and if it is, knowing who might shed some light on how the prompt could possibly be tapping into her - that she shows up twice (and it's obviously the same girl twice given the hair style and clothing are the same) suggests there is some sort of real connection, like she's an animator famous for cartoon monsters or something.
7habryka15d
This also replicated when I asked someone else to generate new images for the same prompt (one image out of the 10 was again in this very different style and displaying approximately the same person).
2gwern14d
Very strange. I did some searching in Google Images & Yandex for the cropped face and for 'furry black monsters', and aside from being impressed by just how many more women Yandex turns up who do in fact look a lot like the sample, didn't find anything obviously relevant.
1AttentionResearcher14d
Interesting. Both images of her have a white house wall to the left with the same lighting, the same hair color and bottom-colored hair, the same shirt color, and the same skin color. Maybe the words 'wearing' and 'outfits' and 'black' and 'alley' and even 'dolly' and 'photo' triggered it to give us an alley, but one that has a clothing fashionista in it lol. It still seems to be mostly choosing a single source though.
3PoignardAzur6d
This seems like a major case study for interpretability. What you'd really want is to be able to ask the network "In what ways is this woman similar to the prompt?" and have it output a causality chain or something.
1AttentionResearcher15d
1. Glossy black crystal temples with silver gates smoking and huge spiked metal worms drilling through the temples. A layer of smoke sits on the glossy black floor and there is chains everywhere. A huge bridge made of metal spikes connects to this world. Realistic HD photo. 2. Up close shot of tall pikachus that have short white peach fuzz fur are wearing full furry white robes and are placing large gold keys into a white furry chest in utopia heaven. It's shining bright morning hour and everything has gold plated and crystal rimmed features. Realistic HD photo.

What if the prompt literally doesn't make sense? Like having a coherent prompt structure, but the content isn't logically valid.

For example, "A painting of a woman drawing herself, in the style of clocks"

4Swimmer96316d
It tries!

The Bill Watterson one requires me to request black bears attacking a black forest campground at midnight.

Optionally: "...as pixel art".

I have to ask, how does one get hold of any of the programs in this vein? I've seen Gwern's TWDNE, and now your experiments with DALL-E, and I'd love to mess with them myself but have no idea where to go. A bit of googling suggests one can buy GPT3 time from OpenAI, but I gather that's for text generation, which I can do just fine already.

2Swimmer96318d
OpenAI has a waitlist you can sign up for to get early access to DALL-E.
2Error18d
Ah, that put me on the right track. I've been asking google the wrong questions; I was looking for a downloadable program that I could run, but it looks like some (all?) of the interesting things in this space are server-side-only. Which I guess makes sense; presumably gargantuan hardware is required.
2MikkW2d
In the case of OpenAI, the server-side-only constraint, IIRC, is intentional, to prevent people from modifying the model, for AI safety reasons. My understanding is that usually running a model isn't as compute-intensive as training it in the first place, so I'd expect a user-side application to be viable; just not in line with OpenAI's modus operandi.
2ChristianKl18d
I asked a while ago: https://www.lesswrong.com/posts/HnD8pqLKGn2bCbXJr/what-s-the-easiest-way-to-currently-generate-images-with There are a few Google Colab notebooks that you can run online, and you could also run the code offline if you prefer.

It'd be interesting to see (e.g.):

Full body x-ray scan of a {X}. Detailed, medical professional scan.

Medical illustration of {X} skeleton, with labels. High quality, detailed, professional medical illustration.

Where X is some fictional creature, such as: mermaid, Pikachu, dragon.
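
If it helps whoever runs these, the two templates can be expanded over the creature list mechanically. A small Python sketch follows; the names `TEMPLATES`, `CREATURES`, and `expand` are made up for this example, and it only produces the prompt strings.

```python
# The two templates from the comment above, with {X} as the placeholder.
TEMPLATES = [
    "Full body x-ray scan of a {X}. Detailed, medical professional scan.",
    "Medical illustration of {X} skeleton, with labels. "
    "High quality, detailed, professional medical illustration.",
]

# The example creatures suggested above.
CREATURES = ["mermaid", "Pikachu", "dragon"]


def expand(templates: list[str], creatures: list[str]) -> list[str]:
    """Fill every template with every creature (templates vary slowest)."""
    return [t.replace("{X}", c) for t in templates for c in creatures]


prompts = expand(TEMPLATES, CREATURES)
for p in prompts:
    print(p)
```

This gives six prompts in total, which keeps the style wording fixed so that only the creature varies between runs.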

6Swimmer96320d
"Medical illustration of a gryphon's skeleton, with labels. High quality, detailed, professional medical illustration." The labels are cute!
5Swimmer96320d
I had to fiddle with the prompt some, but "Detailed high quality full-body x-ray scan of a mermaid with fins and tail, medical records" gets at least a few results that are what I asked for.
3Shai Noy20d
Wow, those and the gryphon above are both awesome! Thanks! Would you be kind enough to share high-res versions of your picks from both? With your permission, I'd love to share those on the Dalle subreddit.
3Swimmer96320d
Pic from the mermaid one: https://labs.openai.com/s/fSTlhqXtpZee9Vedy9xMfsZD And from the gryphon one: https://labs.openai.com/s/JydvuNEv6TCozRECbE4WygQB
1Shai Noy20d
🙏
1Zachary MacLeod19d
Oh dang! Would it be too much to ask to see what some of those might look like if they were uncropped by AI?

Could you please return 10 images for each of these prompts? These are my best ones, which should get some interesting vividness out of it:

1) Bright macro shot of a plush toy robot pikachu eating a hamburger in a nurse outfit against a white brick wall with mud splashed on pikachu from a tire on the road. 8K HD incredibly detailed.

2) Macro shot of the cool pikachu wearing black chains and laughing as seen in a truck selfie in the desert next to a sand castle with piranha plants seen through the heat. 8K incredibly detailed.

3) Future 2377 hospital with beds in glass co...

3Swimmer96320d
"Video game case rated M, dark red rimmed, macro shot, a glossy black world that endlessly goes back into the distance with many black temples, gates, and chests. HD photo."
3Swimmer96320d
"A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. HD, detailed." It's not super coping with all the details – could maybe do better with more repetition in the prompt? – but it's got the vibe.
3Swimmer96320d
I am going to register an advance prediction that many of these contain way too many details (both in terms of number of objects requested, and in terms of specific relationships between said objects) and are going to overwhelm the poor image model. I'll run them as-is, but I might also try modified/simplified versions if I think I can get something more in the spirit of your requests that way.
2Swimmer96320d
"A plate with fries, nuggets, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph."
2Swimmer96320d
"Video game case rated E, grey rimmed, macro shot, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo."
2Swimmer96320d
"A motor connects to a hydraulics pump, which connects to blue energy rods soaking in pink liquid. It's smoking. Macro, detailed."
2Swimmer96320d
"Big black furry monsters with tall heads and white patches dripping with water and seaweed are picking up a dumpster in an alley. Realistic HD photo."
2Swimmer96320d
"Microscopic water bear meets a bacteria that looks like pikachu next to a bacteria hospital pouring out different colored creatures with spikes, furr, etc. HD detailed."
2Swimmer96320d
"Future 2377 hospital with beds in glass containers, white spheres that hold tools, robot maids, cameras everywhere, and blue scrubbing systems moving around the walls. Lots of detail and systems." I think this is my favourite: https://labs.openai.com/s/KQfiNLLHurkhwSW7Cj38GWA8
2Swimmer96320d
With some changes to the prompt, "A cool goth pikachu wearing black chains and laughing, sitting in a truck in the desert, next to a heat-shimmery sand castle with piranha plants. 8K incredibly detailed."
1localdeity20d
It tends to depict Pichu rather than Pikachu. But I note that Pichu's electric attacks damage itself, at least in Super Smash Bros (and I find a quote from the Bulbapedia article [https://bulbapedia.bulbagarden.net/wiki/Pichu_(Pok%C3%A9mon)] saying "it cannot discharge without being shocked itself"), which caused a friend to refer to Pichu as "emo Pikachu". Perhaps "goth Pikachu" ended up referring to the same thing...
2Swimmer96320d
"Bright macro shot of a plush toy robot pikachu eating a hamburger in a nurse outfit against a white brick wall with mud splashed on pikachu from a tire on the road. 8K HD incredibly detailed." Yeah - DALL-E seems to be landing at best a handful of the details you wanted, and in some of these it seems to be returning something almost random!
2Swimmer96320d
"A plush toy robot pikachu wearing a mud-splashed nurse outfit and eating a hamburger, against a white brick backdrop. Detailed HD footage." It's done much better here! I'm not sure any of the images managed the "mud-splashed" bit, but they've all got a reasonable Pikachu-robot, plus the hamburger and the white brick wall, and some of them are managing the nurse outfit.
1AttentionResearcher20d
Could you do the other prompts in my post? I want to push the model. Maybe you missed them due to comment collapse. Or if you want me to pick only a few, let me know. This is so cool.
2Swimmer96320d
I'll come back to them! There's just a whole lot of comments on this post to process.
1AttentionResearcher20d
Wow! The glossy black temple one, wow! This is beyond belief, impossible! It not only came close to what I had imagined but, forget the lava, it's better! Just what I want. Looks like a hard game. The others are also very impressive: I love the dumpster one, it came out very well, and the hospital one, and the food one is just grand. The 2 video game case ones are good but not good, haha; I didn't mean those kinds of cases. How did your brain interpret the outputs? You saw it was wrong, right (lol)? Here are a few more, and let's try to get one of those games made right this time. I'm also adding something interesting onto the food one to try to extend that good result: 1) A video game sitting against a wall, rated E, grey rimmed, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo, macro. 2) A plate with fries, nuggets, gold fork and knife, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph. Pikachu is leaning into the plate eating the food. 3) A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up. 4) A floating liquid metal blob in a laboratory is 3D printing cameras, memory, and sensors. Scientists are trying to guide it. HD, detailed. 5) Arial view over a world consisting of glossy black temples, thunder, round purple chambers, spiked lava rivers, and flat paths that maze around and monsters guarding gates. HD, detailed. 6) Inside the bright gold temple restaurant is gold tables, crystal walls, waterfalls coming out the walls, robot maids, and lots of fries and red ruby decorations. HD, detailed. 7) Glossy black dragon statue shooting red laser beams from its eyes into a glossy black wall, making it crack open exposing a gold vault. In the rain at night, HD, detailed. 8) A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. Big black bosses wearing gold chains and cr
3Swimmer96319d
Aaaaaaand final one. (I would kind of prefer if you keep any future requests to one or two prompts.) "A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. Big black bosses wearing gold chains and crowns are walking into the temple. Raining, HD, detailed."
1AttentionResearcher19d
Ok! One last one to document its limits further: Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots. Also can you do two Variations below showing all 10 results? (Note: I super-resolutioned one, so if you have the full version saved, check which is truly more detailed): https://ibb.co/jVFpQP8 https://ibb.co/Tb36ZQw (uploaded using imgBB)
2Swimmer96316d
Plus your other request, "Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots." Honestly pretty impressed with the level of detail in the image!
2Swimmer96319d
For the second request, I'm not sure I follow - are these results from previous prompt rounds that I ran?
1AttentionResearcher17d
The gold room one, yes please, and the other is a mario game that would be interesting to see if it can make Variations of too. (show all 10)
2Swimmer96316d
Gotcha! Gold room variations here: And the Mario game variations:
1AttentionResearcher16d
Was a text prompt used along side the image input to make these Variations? Or just image input? Very interesting results BTW.
2Swimmer96316d
just the image - I had uploaded them as new images bc it cleared my session and I didn't have the originals anymore.
1AttentionResearcher16d
Ok. It's good that no text prompt was used. I wonder what would happen if you now tried the gold room image again with its text prompt below; maybe it would guide the 10 Variations better? Though it seems as if you already have: the Variations show toilets even though there are none in the input image. Why is that? Here was the prompt, please try it (or without if you think you included text): 'A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up.'
2Swimmer96319d
"Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots."
3Swimmer96319d
"Inside the bright gold temple restaurant is gold tables, crystal walls, waterfalls coming out the walls, robot maids, and lots of fries and red ruby decorations. HD, detailed."
3Swimmer96319d
"A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up." I think it's confused on the color scheme - the room itself doesn't appear to be gold in any of these.
3Swimmer96319d
"A plate with fries, nuggets, gold fork and knife, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph. Pikachu is leaning into the plate eating the food." I think this is closer to what you were envisioning? Though, uh, mildly horrifying in a few, and also one of them made Pikachu a rubber duck?
2Swimmer96319d
Minor edit because 'shooting' appears to be a banned keyword: "Glossy black dragon statue flinging red laser beams from its eyes into a glossy black wall, making it crack open exposing a gold vault. In the rain at night, HD, detailed."
2Swimmer96319d
"Arial view over a world consisting of glossy black temples, thunder, round purple chambers, spiked lava rivers, and flat paths that maze around and monsters guarding gates. HD, detailed."
2Swimmer96319d
I modified #4 a bit to try to hint harder, since the initial round mostly gave me only the liquid blobs. It's still struggling with the details, especially at including any scientists, so I think there are too many weird/not-usually-combined elements here for it to manage without much more skilled and careful prompting. "There is a floating liquid metal blob in a laboratory. The floating liquid metal blob is is 3D printing cameras, memory, and sensors. There are scientists in the laboratory trying to guide the metal blob. HD, detailed."
2Swimmer96319d
This is a lot of requests and I'm at work, so I'll run them over the next few hours. (Honestly I'm not a video games person and had no idea that "case" was the same thing as...rating? and also I have no idea what an E rating is, I don't recognize that one from movies.) "A video game sitting against a wall, rated E, grey rimmed, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo, macro." I don't think it super knows what you want here...
2Measure19d
A "case" in this context is the plastic clamshell that holds/protects the disc when not in use (DALL-E thinks this instead means some sort of container found within a video game environment). The E rating (for "Everyone") is similar to the G rating for movies.
1AttentionResearcher19d
What it should be creating is this below (a video game case) ... XD lol: https://ibb.co/9TtJbqJ

Some prompt requests for my daughter:

"A wild boar and an angel walking side by side along the beach - beautiful hyperrealistic art"

"A piggo-saurus - a pig-like dinosaur - hyper realistic art"

"A piggo-saurus - an illustration of a pig-like dinosaur"

"A little forest gnome leaving through his magic book - beautiful and detailed illustration"

Can it in some way describe itself? Something like "picture of DALL-E 2".

I wanted: the Star-Eyed Goddess

Maybe DALL-E thought you meant Movie-Star-Eyed Goddess? 'Cause that's what the picture looks like to me :)

Regarding text, if the problem comes from encoding, does that mean the model does better with individual letters and digits? Eg

"The letter A"

"The letters X Y and Z"

"Number 8"

"A 3D rendering of the number 5"

Awesome writeup!

To further explore the interplay between style and content, how about trying something not very specific that could gain specificity from the style context?

For example "Aliens are conducting experiments on human subjects":

  • as a screenshot from South Park (will these mostly feature the anal probe?)
  • as a medieval painting (will these be mostly dissection?)
  • as a screenshot from the movie Prometheus (will these be too scary to look at?)

Prompt: A cartoon honey badger wearing a Brazilian Jiu Jitsu GI with a black belt, shooting in for a wrestling takedown

Can you try this one:

Glossy black crystal temples with silver barred gates releasing smoke along a metal path with spikes along it next to a red river, and a layer of smoke. Chains everywhere. A black portal is at the end with heavy glossy techno bosses guarding it. Realistic HD photo.

Zz

[This comment is no longer endorsed by its author]
5Swimmer96312d
Tweaked the prompt multiple times and this is the best I got re: tights and not stockings, I think DALL-E just has very strong priors on "stockings" going with this art style. "Girl wearing a beautiful white dress over white leggings. She is beside another happy girl with black hair wearing a dress over black leggings. The sun is behind the two, dramatic lighting, Anime fanart, safebooru, deviantart, advanced digital art settings, behance 8k super-quality beautiful"
1Zachary MacLeod5d
Have you considered using Dall-E 2's inpainting to "uncrop" the image? Take the picture, scale it down to leave some empty space outside the frame, then place it back in?
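The geometry of that "uncrop" idea can be sketched roughly as below. This is only a minimal sketch: the helper name `uncrop_layout`, the square canvas, and the 1024-pixel example size are assumptions for illustration, and the actual paste and inpainting steps are left to whatever image tool you use.

```python
def uncrop_layout(canvas_size, scale):
    """Return (new_size, box) for a centered, scaled-down copy of a square image.

    canvas_size: side length of the original square image in pixels (assumed).
    scale: fraction of the canvas the shrunken image should occupy, 0 < scale < 1.
    box is (left, top, right, bottom); everything outside it is the empty
    border that the inpainting step would be asked to fill.
    """
    if not 0 < scale < 1:
        raise ValueError("scale must be between 0 and 1")
    new_size = round(canvas_size * scale)
    margin = (canvas_size - new_size) // 2
    box = (margin, margin, margin + new_size, margin + new_size)
    return new_size, box


# Hypothetical example: shrink a 1024x1024 image to half size on the same canvas.
new_size, box = uncrop_layout(1024, 0.5)
print(new_size, box)
```

A smaller `scale` leaves more empty border to fill, which presumably gives the model more freedom but less context from the original picture.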
1Evidential12d
Dall-e 2 is so mean to me lol. I like the dresses though, especially on the bottom far-left. If you can send me that and the fourth one on the top I will be happy, thank you (going to try to edit it on photoshop or something)

Here is an idea that I hope will give some interesting results:

A complex Rube Goldberg machine.

Some possible variations:

A Rube Goldberg machine made out of candy.

A photograph of a steampunk Rube Goldberg machine.

"White haired girl wearing white tights with a girl with black hair wearing opaque black tights and blushing, Anime fanart, danbooru, deviantart, advanced digital art settings"

(since there are two girls, it doesn't qualify as "explicit" and is more just anime fanart)

6Swimmer96315d
"A white haired girl wearing white tights. She is beside another girl with black hair wearing opaque black tights and blushing. Anime fanart, danbooru, deviantart, advanced digital art settings"
1Evidential13d
I tried fixing the prompt; you can try it and see if it works this time.

As a cinematographer, I'm now curious how well it understands more advanced photography techniques. For example, can it do something like "Double exposure photo of the silhouette of a man with fireworks in the background"? I made a similar photo two years ago and I'll leave it here as a reference to see how similar it can get: https://i.gyazo.com/ace7c2bd76a8f2710859362314a1f8c0.jpg

3Swimmer96316d
Well, this is the DALL-E attempt! Not quite the same, but definitely intriguing.
1TibuAI16d
That's cool! It understands the silhouette request and the fact that a double exposure will overlap the subjects, but it doesn't work within the physical rules of the technique. Makes complete sense and creates very interesting results. The 3rd one is the closest to what a physical double exposure would look like. Very nice.

This is so incredible. I'm a cinematographer and I'm looking forward to having access because I'm curious how it'll perform when used to make references for projects. I'm curious if it can take any specific film (not franchise) and adopt that style. An example of this would be something like "A man with a blue shirt walking through a dark hallway, in the style of Blade Runner 2049". If this works it would also explain why it is a bit loose when you mention Pixar the production company instead of a specific film with a more consistent style. A lot like the...

3Swimmer96316d
"A man with a blue shirt walking through a dark hallway, in the style of Blade Runner 2049" Well, it apparently thinks I just want the hallway lighting to be blue, which is a pretty common sort of thing for it. Otherwise seems at least kind of Blade Runner-esque?
2gbear60516d
It seems like the atmosphere is right, and technically the shirt could be blue, we just can't tell.
1TibuAI16d
Wow, this is really interesting. I agree with gbear605 that the atmosphere is right, with the backlit silhouette style of a lot of the film. The 10th one is really really good. It's doing the usual thing of taking the properties of one element and applying them to the other things, like the color of the lighting here. I'm still curious about the inpainting approach to do images piece by piece, similar to what I mentioned for the two-character problem. Maybe using inpainting you could go element by element in other instances of this problem so it doesn't get so confused. Seeing these results is very satisfying and insightful, thank you!

Heyyyy I got a prompt request:

Illustrated artwork by Hirohiko Araki depicting Shrek and Donkey in the style of Jojo's Bizarre Adventure.

7Swimmer96318d
Here you go!
2Zachary MacLeod18d
Oh my god that worked well :O

If you want specific words spelled correctly, try putting quotation marks around those words in the prompt.

2Swimmer96318d
I have tried that! As far as I can tell it doesn't make much of a difference.

Prompt:

Axis and Allies board game 2022 setup. Digital image official concept

(Remove some words if it doesn't work)

3Swimmer96318d
"Axis and Allies board game 2022 setup. Digital image official concept." (I'll maybe play around a bit with the wording to see if I can get something more dramatic.)
1Evidential18d
Yes, it looks like it has some concept of the game. Tell me how it goes with changing the wording

Small white cat wearing a red collar with a bell on it hugging a shadow person. Cute digital art, enhanced digital image

2Swimmer96318d
It's having some trouble with the shadow person, but definitely a cute cat!

Cute White Cat Plushie On A Bed, 4K resolution, amateur photography

Prompt request!

  • "Dystopian hellscape" and/or "Dystopian hellscape, painted by William Blake" (Someone had to ask.  If the resulting images are too gross/disturbing, feel free to skip.)
  • "She made broken look beautiful and strong look invincible. She walked with the Universe on her shoulders and made it look like a pair of wings." (Quote from Ariana Dancu)
  • “But the stars that marked our starting fall away. We must go deeper into greater pain, for it is not permitted that we stay.” (Quote from Dante Alighieri, Inferno)
  • "How can a man die better, than facing...
9Swimmer96319d
"But the stars that marked our starting fall away. We must go deeper into greater pain, for it is not permitted that we stay. Hyperrealistic digital art." Some of these are gorgeous! Let me know if you want full-size versions for any! (Not sure how well they capture Dante, but still.)
1Bezzi17d
In order to better capture Dante, I would suggest trying with "Engraving by Gustave Doré" instead of "Hyperrealistic digital art".
2Swimmer96316d
Here we are!
1Sable17d
...um, all of them? :) Holy crap I did not expect this. I think my favorites are the top middle three and the second from the right on the bottom. Which were yours?
2Swimmer96316d
(Oops, really sorry, it closes out my session every so often and I don't have the originals for this anymore.)
5Swimmer96319d
"Good versus evil in a climactic battle, epic matte painting"
1Sable17d
You had discussed how DALLE-2 seems to struggle with assigning traits to more than a single person. It seems to have done well here, with "good" getting more knight-like appearances and "evil" being more consistently demonic. I wonder how much further we could push with anthropomorphized concepts?
4Swimmer96319d
"She made broken look beautiful and strong look invincible. She walked with the Universe on her shoulders and made it look like a pair of wings." Tried with both just 'digital art' and 'hyperrealistic digital art', I find that works best for poetic-quote-prompts.
1Sable17d
These are gorgeous!
3Swimmer96319d
"How can a man die better, than facing fearful odds, for the ashes of his fathers, and the temples of his gods? Hyperrealistic digital art."
1Sable17d
It looks like DALLE-2 is pulling from several different genres? The top left two are very man-of-tomorrow, whereas the three on the top right are more fantastical. And the bottom five are all very distinct.
2Swimmer96319d
"Dystopian hellscape, painted by William Blake" Honestly not very disturbing?
1Sable17d
I'll admit, I'm pleasantly surprised. DALLE-2 seems to be pulling from Dante's Inferno cover art, honestly. Especially because it seems to have spit out a number of book titles?

Prompt:

"Chi in Chi's Sweet Home japanese animation. Streaming service Crunchyroll. Screenshot of episode with Chi, who is a cute tabby-white mixed cat. 2D, Google Search Screenshot, Pinterest"

4Swimmer96319d
Not sure what the deal is with top right...

Very insightful post. May I use your images in my PhD dissertation to illustrate limitations of current image generation methods? Thanks!

Gabriel Huang

2Swimmer96319d
You may! Just make sure to keep the DALL-E signature block (bottom right) and attribute it. Also feel free to request a couple of prompts if you want.

Reference Picture of Kyubey. Drawn By Puella Magi Madoka Magica. Digital Art Clip Studio Paint Anime, Pretty and Shining. Advanced Image of Kyubey. The character is Kyubey from Puella Magi Madoka Magica

2Swimmer96319d
1Evidential19d
Looking at the boobs on the first picture, I feel like the AI can do it but since it is an anime, it is mixed in with hentai pictures and animal-humans. The AI must get anime animals and humans confused. The sad thing is that the AI knows what Kyubey is but it adds a bunch of random anime context. Maybe it needs words like "pokemon cat" just to understand it's not some sort of catgirl body mixed with kyubey

A Cute Cat Creature Character: Kyubey, Anime Show: Puella Magi Madoka Magica, Style: Screenshot From Anime Show. Exact screenshot, no variations from original artwork

3Swimmer96319d
Here you go!
1Evidential19d
The third on the top is very cute
2Swimmer96319d
This one? https://labs.openai.com/s/lFZ3rLh0ozneh0m5BHJ8q58G
1Evidential19d
Yes! Thank you. I also think I will give up trying to get kyubey to generate but maybe whenever I get access I will try more idk

Prompt Idea:

Exact Picture of Kyubey, 2 Cat Ears, 2 Bunny Ears. Red Eyed Cat Antagonist From Puella Magi Madoka Magica. Specific Puella Magi Madoka Magica Anime Screenshot, No Variations

3Swimmer96319d
1Evidential19d
Maybe the words "specific" and "exact picture" throw it off. This actually makes this type of prompt very helpful for product / character design.

Prompt idea: "a model of a human cell with all the organelles as a snow globe".

6Swimmer96320d
Wow this came out pretty cute!

Thanks to Benjamin Hilton on Twitter, I've been able to run some prompts despite not having access to DALLE 2 personally, and we noticed some interesting edge cases with DALLE's facial filter. Obviously in general DALLE is fine with animal faces and not fine with human faces, but there was one prompt I suggested, "a painting of a penguin jazz band, in the style of Edward Hopper's 'Nighthawks,'" that gave a bunch of penguins with eldritch abominations of faces. Another prompt, "a painting of a penguin in a suit, in ukiyo-e style," had no issues with generat...

2Swimmer96320d
Plain "penguins playing poker": And "penguins playing poker, in the style of Edward Hopper's 'Nighthawks'": It doesn't seem like it's especially face-abomination-y in either case? The second one is slightly iffier/weirder on close-up details generally, which fits with my observation that DALL-E gets worse at this if there are more things going on in a scene.
1Ryan Talvola19d
If I had to guess, it was going for a painting before rather than the broader style, and the texturing got messed up. That probably implies that it's better to simply prompt with the style of painting you want instead of asking specifically for a painting, if you want coherent results. I also think it's interesting to note that with the second prompt, DALLE struggles far more to figure out what belongs on a table when playing poker than with the first, supporting your assertion that the more complicated scene causes some details to collapse. If you're still taking suggestions for prompts, I think these turned out so well I'd be curious to explore more variations on the theme. Could you try "penguins playing poker, in the style of Salvador Dali's 'The Persistence of Memory'" and "penguins playing poker, in the style of Grant Wood's 'American Gothic'"? These should be styles it can handle well that purposely aren't suited to this subject matter.
4Swimmer96319d
"penguins playing poker, in the style of Grant Wood's 'American Gothic'"
4Swimmer96319d
"penguins playing poker, in the style of Salvador Dali's 'The Persistence of Memory'" honestly I really like this one! This in particular came out as just a pretty cool art piece: https://labs.openai.com/s/YeoG5VGOv8tJ3QOLOhB3lRFq
1Ryan Talvola19d
These are too good. I like how for all of these different styles so far, it's at least making an honest attempt to match them, and that painting you specifically highlighted is excellent (as much as I don't think they're quite playing poker as I know it). If you haven't hit your tolerance of poker-playing penguins, how about "penguins playing poker, in the style of Rene Magritte's 'The Son of Man'" (my friend's suggestion) and "penguins playing poker, in the style of The Simpsons"? My original rationale with penguins as a subject is that they're black-and-white bipedal creatures, so hopefully not too hard to draw doing human-like things, that also aren't likely to have much existing artwork of them out there. The drawings I could find of penguins playing poker online were far worse IMO than any of these.
2Swimmer96319d
"penguins playing poker, in the style of The Simpsons". The art style is definitely more ~cartoon, but otherwise seems pretty generic and not especially Simpsons-y? I also ran "penguins playing poker, screenshot from The Simpsons TV show" for comparison, and it seems iffier/less consistent on details, but maybe more Simpsons-flavored?
2Swimmer96319d
"penguins playing poker, in the style of Rene Magritte's 'The Son of Man'" okay I have no idea what's up with bottom left, and bottom right has some face-monstrosities going on, but otherwise these are pretty well executed (though I am not sure how well they match the art style requested.)
1Ryan Talvola19d
I've never seen that degree of screw-up in any DALLE generation before. Wonder what could have happened there. So I think that's the extent of "penguins playing poker" as an artistic subject for now (although it was very nice seeing the contrasts in style, and if I ever get access to DALLE myself there are some other variations I might try), so I'm curious now to see what exactly the limits of penguin generation can be (and perhaps if anything trips the content filters). There's this lovely Claymation sketch on YouTube that remakes The Thing with Pingu, so I'd be curious to see if DALLE can handle "penguins in John Carpenter's 'The Thing'" or "penguins in the chestburster scene from Alien". I suspect these might be too complex/specific for it to handle, but if either of them were to work... a third one that could be worth a try, too, is "penguins performing an exorcism".

I would be interested in two kinds of prompts:

First, can it reproduce something really popular like:
"V-J Day in Times Square - Alfred Eisenstaedt, 1945"
I know, that original has some faces, so it would be impossible to share, but still interesting to know the result.

Second, does it know some of the not so mainstream video game "styles"? Screenshots from any of the following would be perfect: "Don't Starve", "Heroes of Might & Magic III", "Sid Meier's Civilization III", and "StarCraft".

3Swimmer96320d
"V-J Day in Times Square - Alfred Eisenstaedt, 1945"
1Mikhail Doroshenko20d
Interesting. It's actually much worse than I expected it to be. Maybe there was some sort of cleaning to remove duplicate images from the dataset. A few more requests, I would really like to see if you decide to do them. "Simple red dice showing six on top" This is to see whether other dice sides would be coherent with what's on top. "Very cool car" This one is tongue in cheek to see whether it would generate a frozen supercar to maximize both meanings of "cool".
3Swimmer96319d
"Very cool car" Nope, not frozen!
3Swimmer96319d
"Simple red dice showing six on top" Hmmmmmmm. I don't think DALL-E can count to six.
1Mikhail Doroshenko19d
Does it fail if asked for "one on top" as well? If yes, then can you also try "Domino with 2 spots and 1 spot" or "Domino 2 and 1"?
4Swimmer96319d
Pffft it's really flailing here! "Simple red dice showing a one on top". 1/10! also one of them has nine on top, oops.
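One way to make "coherent with what's on top" concrete, as a rough sketch: on a standard die, opposite faces sum to seven, so no two faces visible at once can add up to seven, and every face must show between one and six pips. The function name `visible_faces_coherent` is made up for illustration, and reading the pip counts off each generated image is assumed to be done by hand.

```python
def visible_faces_coherent(top, sides):
    """Check the pip counts visible on a single standard six-sided die.

    top: pip count on the top face.
    sides: pip counts on the visible side faces (at most two can be seen at once).
    Relies on the standard-die rule that opposite faces sum to seven.
    """
    faces = [top, *sides]
    if len(sides) > 2:
        return False  # a cube shows at most three faces from one viewpoint
    if any(face not in range(1, 7) for face in faces):
        return False  # e.g. the "nine on top" case
    if len(set(faces)) != len(faces):
        return False  # the same face cannot appear twice
    # Opposite faces can never be visible together.
    return all(a + b != 7 for i, a in enumerate(faces) for b in faces[i + 1:])


print(visible_faces_coherent(6, [2, 3]))  # six on top with plausible sides
print(visible_faces_coherent(9, [2, 3]))  # nine pips is not a valid face
```

With six on top, a coherent image could only show side faces from two through five, and never a two next to a five or a three next to a four.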
1Mikhail Doroshenko19d
Huh, it really can't do the math. I wonder if Flamingo is any better at it.

Suggestion: Can it do Kyubey from Madoka Magica?

  1. Kyubey from Madoka Magica, photorealistic, high quality anime, 4K, pixiv, digital picture

  2. Kyubey from Madoka Magica swimming in a pool of soul-gems, 4K anime, digital art, pixiv, hyperrealistic beautiful

  3. Kyubey from Puella Magi Madoka Magica in the style of Chi's Sweet Home Anime, 4K digital art anime, pixiv

(Feel free to change these around)

7Swimmer96320d
"Kyubey from Madoka Magica, white creature with four ears, 4k high quality anime, screenshot from Puella Magi Madoka Magica" (I fiddled with the prompt because I don't think it knows Madoka quiiiiite well enough, and was giving me vaguely Kyubey-themed anime girls.)
1Evidential20d
Since OpenAI optimized its output for things as you suggested in your article (dresses, animals), I believe that hidden in the depths of the AI, it can pull pictures such as Kyubey, but requires an un-optimized input (as in broken English, or maybe Japanese for this specific one). So basically telling an alien what to generate in their own language... And the problem is that we don't know what this is with the information currently available (with tests from DALL-E 1).
1Evidential20d
So it knows the color scheme and then tries to make some sort of Pokémon off of it. I think maybe the AI believes it is creating a fictional screenshot concept art type thing. Even when you give it a show to go off of, it doesn't understand what to pull from. I think whenever I get access to DALL-E 2, I will try figuring out key buzzwords to give it. I also think that since I added "pixiv", and most of the pictures there are anthropomorphic, it kinda just tagged that in. I believe the prompting is much more complicated than we think and requires further evaluation. There have to be certain phrases in specific orders that the AI can use better and that the community doesn't know yet.
4Swimmer96320d
"Kyubey from Madoka Magica swimming in a pool of soul-gems, 4K anime, digital art, pixiv, hyperrealistic beautiful". It's definitely confused on some of these about whether they're anime girls, but it gets the vibe!
1Evidential19d
Ohhhh, so maybe in the next prompt we can specify that they are not "anime girls". Also, these are very cute lol. The first one is the most accurate, and the fact that the AI understands what Kyubey looks like means that it is probably looking for very specific wording to get it accurate.
3Swimmer96320d
"Kyubey from Puella Magi Madoka Magica in the style of Chi's Sweet Home Anime, 4K digital art anime, pixiv"
1Evidential19d
It looks like all the other modifiers override the "in the style of Chi's Sweet Home" part. I think the AI needs: "A Cute Cat Creature Character: Kyubey, Anime Show: Puella Magi Madoka Magica, Style: Screenshot From Anime Show. Exact screenshot, no variations from original artwork"

I used nightcafe.studio, a VQGAN+CLIP web service, a bunch in March for the worldbuilding.ai entry I was working on. I found it... okay for generating images that I could then edit in Photoshop, but it took many many tries to get something decent. I'd be particularly interested in seeing what DALL-E 2 does with these prompts:

"Beautiful giant sunset over the saltwater marsh with tiny abandoned buildings in the distance"

"Glass greenhouse with a beautiful forest inside, with people and drones flying"

"People dropping into a beautiful marsh from flying drones on a sunny day"

"Happy children hanging from flying drones on a sunny day beautiful storybook illustration"

4Swimmer96320d
"People falling from robotic flying drones into a beautiful marsh, on a sunny day, matte painting" I think some of the "people" are also robotic? DALL-E is trying though!
4Swimmer96320d
"People and drones flying around inside a giant glass greenhouse with a beautiful forest inside, 3D rendering". I swapped the order because when entered verbatim, the prompt you gave had DALL-E forgetting to include any people or drones. I find it's more likely to actually include smaller or foreground features of a scene if I put them at the front and describe the larger backdrop after. "3D rendering" is the best I got out of several style prompts (I tried "digital art" and "screenshots from a scifi blockbuster movie" as well.)
1Randomized, Controlled20d
Oooooh, these are much better than the ones I got from nightcafe (I just checked, I was actually using "CLIP guided diffusion".) DALL-E 2's marshes and sunset marshes are slightly better than what I was getting.
3Swimmer96320d
"Beautiful giant sunset over the saltwater marsh with tiny abandoned buildings in the distance, matte painting" (came out better IMO than the original prompt with no style guidance, which sort of forgot about the buildings.)
2Swimmer96320d
"Happy children hanging from flying quadcopter drones on a sunny day, beautiful storybook illustration". Adding "quadcopter" made the drones much easier to recognize!

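The ordering trick reported for the greenhouse prompt in this thread (smaller or foreground features first, the larger backdrop after, style guidance last) can be written down as a tiny helper. This is only a sketch of that one observation: `build_prompt` and its parameter names are invented here, and nothing about it is an official prompting rule.

```python
def build_prompt(foreground, backdrop, style=None):
    """Assemble a prompt with foreground details first, backdrop second, style last.

    The ordering follows the observation in this thread that small or foreground
    elements are more likely to appear when they lead the prompt.
    """
    parts = [foreground.strip(), backdrop.strip()]
    prompt = " ".join(part for part in parts if part)
    if style:
        prompt = f"{prompt}, {style.strip()}"
    return prompt


# Rebuilds the reordered greenhouse prompt quoted in the reply above.
prompt = build_prompt(
    "People and drones flying around",
    "inside a giant glass greenhouse with a beautiful forest inside",
    style="3D rendering",
)
print(prompt)
```

Keeping the pieces separate also makes it easy to swap the style suffix ("digital art", "matte painting") while holding the scene description fixed.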
And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.

Could you just blur out the faces? Or is that still not allowed?

2Swimmer96320d
I assume that would be allowed, but then it misses a lot of the point of sharing how impressive DALL-E's art is!
2cwillu20d
But… Firefly! Season 2! It's not all about the lantern jaw…

Amazing write up. Thanks so much. Can you share with us more about the terms and conditions? If you get early access are you allowed to use images for commercial purposes that involve resale of the images? What kind of license is offered for the images? Do you have to credit openai, etc?

Also, you explored your (on point) inferences about openai's AI ethics framework based on aspects of the T&Cs (i.e. deep fakes); I'd love to hear more about this. Are there terms that imply other beliefs that openai has about the ethics of AI and DallE2 in particular?

2Swimmer96320d
Their terms and conditions and content policy/sharing policy are public online: https://labs.openai.com/policies/terms https://labs.openai.com/policies/content-policy https://openai.com/api/policies/sharing-publication/
1Muskwalker20d
You mention a prohibition on photorealistic faces, but none of these terms appear to say anything about this. There is the prohibition "Do not upload images of people without their consent", but this appears to be bound to the matter of actually-existing humans whose consent could be involved (and notably isn't bound to what style actually-existing humans are depicted in, whether that's photorealistic or otherwise). DALL-E 2's main page does confirm that measures were taken to prevent the AI from making "photorealistic generations of real individuals’ faces"—but this again seems to be specifically about actually-existing humans. Is this guidance given anywhere specifically?
3Swimmer96320d
The guidance was in a google document they sent me in the email approving my access, which I think used to be the same as the document linked to in the "sharing publication" guidelines, but apparently now isn't?

Close-ups of cute animals. DALL-E can pull off scenes with several elements, and often produce something that I would buy was a real photo if I scrolled past it on Tumblr.

This is not surprising.

I was more puzzled by its inability to draw two characters consistently; the Iron Man + Captain America example was quite weird. I suppose that it basically calculates a score of "Iron Man-ness" and "Captain America-ness" on the whole image and tries to maximize those (the round shield of Captain America seems to be sort of an atomic trait, it was drawn almost perf...

Have you tried generating images with prompts that only describe the general vibe of a picture, without hinting at the content? Something like: "The best painting in history", "A very scary drawing", "A joyous photo".

2Swimmer96320d
Anyway, I ran "The best painting in history" and there sure is...a variety here... I think I like #2 best, but #4 is funniest.
2Swimmer96320d
At some point I ran "stunningly impressive digital art that is exactly what I ordered" and got the following:

Prompt I'd like to see: "Screenshot from 2020 Star trek the next generation reboot", maybe variations on the decade.  What does futuristic gritty wholesomeness look like?

3Swimmer96321d
3Swimmer96321d
Sorry, you cannot post images in comments, apparently; I will put them at the bottom of the main post. (Also, I ended up asking for the Miyazaki anime because the prompt as-is gave me a bunch of photorealistic faces.)
2Raemon21d
I'm confused about this. If you copy an image, you should be able to paste it straightforwardly into a comment – what did you end up experiencing? (I just tested this by copying something from your post into a comment and it worked)
6Swimmer96321d
I will try again, I guess! (I had clicked and dragged it before, and it appeared in the edit window but not the published comment.)
1cwillu21d
I was confused, seeing how much it favoured an anime interpretation. Then I read the prompt :p I suppose that was to avoid a public realistic human face term of service violation?
3Swimmer96321d
Yeah - I feel like it always gives me monstrous blob faces when I want faces, and perfectly normal realistic faces when I'm not even asking for that! (Though this one is more predictable, since "movie screenshots"; for the prompt "coordination" it kept giving me a bunch of guys in a business meeting.)
1cwillu21d
I suppose I could be satisfied with an enterprise-d from the 2020 remake of sttng :D
7Swimmer96321d
"The Enterprise-D in space, screenshot from 2020 Star trek the next generation reboot" here you go.

Thank you for sharing all of these DALL-E tests!

I wonder whether it can reproduce three objects that reliably appear together in images.  How about one of these prompts:

A bronze statue of three wise monkeys.

See no evil, hear no evil, speak no evil, statue of monkeys.

3Swimmer96316d
"A bronze statue of three wise monkeys." Pretty solid! "See no evil, hear no evil, speak no evil, statue of monkeys."
1PoignardAzur6d
Interesting. It seems to understand that the pattern should be "Three monkeys with hands on their heads somehow", but it doesn't seem to get that each monkey should have hands in a different position. I wonder if that means gwern is wrong when he says DALL-E 2's problem is that the text model compresses information, and the underlying "representation" model genuinely struggles with composition and "there must be three X with only a single Y among them" type of constraints.
1gturk115d
Thank you so much for this! It did do quite well. I have been trying to think of another set of three items that are reliably found together, but this is all I could come up with. Pairs of items are much easier to come up with.
1TibuAI16d
This is so good.

I'm having real trouble finding out about Dall E and copyright infringement. There are several comments about how Dall E can "copy a style" without it being a violation to the artist, but seriously, I'm appalled. I'm even having trouble looking at some of the images without feeling "the death of artists." It satisfies the envy of anyone who ever wanted to do art without making the effort, but on whose backs? Back in the day, we thought that open source would be good advertising, but there is NO reference to any sources. I'm a...

4Daphne_W8d
Sorry that automation is taking your craft. You're neither the first nor the last this will happen to. Orators, book illuminators, weavers, portrait artists, puppeteers, cartoon animators, etc. Even just in the artistic world, you're in fine company. Generally speaking, it's been good for society to free up labor for different pursuits while preserving production. The art can even be elevated as people incorporate the automata into their craft. It's a shame the original skill is lost, but if that kept us from innovating, there would be no way to get common people multiple books or multiple pictures of themselves or CGI movies. It seems fair to demand society have a way to support people whose jobs have been automated, at least until they can find something new to do. But don't get mad at the engine of progress and try to stop it - people will just cheer as it runs you over.