It was long believed that the first jobs to be obsoleted by AI would be those of lawyers and accountants, as those seemed the prime targets. After all, creativity has hardly been the forte of computers for the past half-century, being almost exclusively the product of human effort. However, in recent years, something has begun to change significantly. Widely introduced to the public via OpenAI's original DALL·E model, text-to-image has captured the imaginations of countless individuals who were under the impression that such advancements were still decades away. As even more advanced models rear their heads, such as DALL·E 2 and the (as of writing) brand new Imagen, we can clearly note that the quality of images is increasing at an incredibly rapid pace. While there have been many posts already written on the limitations of DALL·E 2, it is worth highlighting that the newly released Imagen has already solved several of the listed issues, such as text generation and object colorization. All this is to say that text-to-image models are already remarkable, and they've hardly been around for half a decade.

Beyond current models

While existing solutions to the text-to-image problem are incredible, there is clearly a lot of room left to grow. The images are almost photorealistic, and some are even close enough to fool the uninformed, but close inspection always reveals minor flaws. Let's assume that the current rate of progress continues. This realistically leaves us with only a handful of years before images can be generated that are entirely indistinguishable from real photographs. (If you believe that this point will never be reached, I would love to hear your reasoning.) Once that mountaintop is summited, where is there left to go? Higher resolution? Quicker generation? The answer is obvious: video.

There are already models that are beginning to tackle text-to-video generation, such as NÜWA and the more recent (but not-as-appealingly named) Video Diffusion Models.

"Play golf at swimming pool" from NÜWA
"Shinjuku Time Lapse" from Video Diffusion Models

While the results are low quality and nowhere near the text-to-image models of today, they do bear a striking resemblance to the image models of just a few years ago.

If we are to be bold and assume a rate of progress identical to the text-to-image models, we should expect to see near-photorealistic video generation within the next several years, however short those videos might be. If we are slightly more pessimistic and assume it takes twice as long for video models to see the same growth, that still lands us within this decade. Regardless of whether it takes six years, sixteen years, or sixty years, the end point is inescapable: we will eventually be able to instantly create photorealistic videos of anything, on demand.

Clearly, this raises several questions. One is the most prominent issue we face with modern-day deepfakes: how to prevent the technology from being used to create blackmail and other illegal material. However, since this is not an issue exclusive to AI, having existed since the birth of photo editing, we can ignore it for the purposes of this post. Instead, I would like to focus on what this will do to the future of film.

Barriers to entry

Let's assume that, eventually, text-to-video models get to the point that the end user is able to sit down at their computer and manufacture a feature length film in as much time as it would take to watch it, complete with a compelling plot and interesting score. (You can imagine this being as far away as it needs to be to fit your personal timeline, as it doesn't especially matter when it happens.) When this day arrives, a notable barrier in creativity will have been broken. What the digital camera did to the photography industry, namely increase access and decrease the skill level needed to enter the field, synthetic media will do to the film industry. While these factors have been massaged down to a manageable point for some time, with access to film equipment and editing software more ubiquitous than ever, there is still one thing that prevents most people from participating: organization.

Where almost anyone can use a few hours of their spare time to follow a couple YouTube tutorials and make incredible images in Photoshop, there are comparatively few people who are able to make movies. (Note that when I make this point, I am not exclusively referring to Hollywood quality productions. However, I do think there is an important distinction to be made between low-budget indie films and a couple teenagers recording their LARP session on an iPhone.) The main barrier that stands in the way is simply organization. Where one man can easily sit down and create a phenomenal work of art in Photoshop, it is nearly impossible for that same one man to create a feature length film on his own.

Aside from obvious elements that would require talent in multiple fields, such as the aforementioned scripting and scoring, most movies require multiple actors. This simple fact immediately causes the production of a movie to be a multi-person organizational challenge. Namely, the same one man who was having a fine time working on his own with Photoshop now has to find and coordinate actors, who each also need to have their own set of skills that make them worthy of the role. All of this takes time and money; enough of both to make most people who aspire to be filmmakers abandon their aspirations. This is why Hollywood, as corrupt and condemnable as it might be, is so successful. Most people would probably rather make their own films than watch whatever Netflix felt like financing, but they are simply unable to.

Preferences and death

This leaves us with a fascinating question: once individuals are able to create their own movies at the push of a button, where does that leave the film industry? There are already people discussing what future versions of text-to-image models might do to stock photography companies and illustrators. Why would anyone pay $500 to get a custom image drawn by a human when DALL·E 4 is able to do it just as well for free? It would seem that the people who find themselves currently making a living off of providing these services should be considering other career options, as their time is limited. Similarly, I would expect a gradual decline in the revenue generated by studio-produced movies. The shift will be gradual, but as it is now with photos, the writing will be on the wall. Just as the illustrator and the painter will be replaced (perhaps not wholesale, but certainly to a significant degree) by the text-to-image model, the director and the actor will be replaced by the text-to-video model.

In a world where people are able to create new Star Wars movies on demand, why would anyone settle for what Disney believes is the right way to go? If you were able to insert yourself into movies, wouldn't that be something that interests you? Instead of training to be an actor, moving to California and hoping to get lucky, what if you were simply able to tell an AI model to swap you in for the role of Luke Skywalker? I'm no fortune teller, but I would contend that a majority of people would find quite a lot of value in that proposal.

The time for speculating about this potential future is nearing its end. Just as we would now be foolish to imagine that text-to-image models will not result in significant changes to the way we interact with illustrators and photographers, we will soon be equally foolish to dismiss text-to-video models. Hollywood is not yet in any danger, and their vice grip on blockbuster films will remain firm for many years to come, but it won't last forever. Just as the silver screen was the death of the live theater, AI will be the death of the movie theater.

Edit: I have decided to continuously update this post with models and published works relating to this subject, instead of making follow-up posts every time something interesting happens. Check back periodically to see if anything new has been added.

Update 1

05/27/2022

The first update I would like to add is the very recently published Flexible Diffusion Modeling of Long Videos, which has already produced incredible results. One of said results is a 90 minute photorealistic video of a car driving on the road. Granted, the video looks like a 144p YouTube video uploaded in 2008, but it might very well have just set the record for longest coherent AI generated video.

A still from the 90 minute video

While this might seem underwhelming, it is important to keep in mind that the examples mentioned previously in this article were only able to produce videos that were several seconds long. The videos generated with this model might be boring to watch, but make no mistake - this is a sign of things to come. Coherence will forever be the greatest challenge this field of research faces, and it looks like the solution may be near.

Update 2

05/29/2022

The second update comes only a matter of days after the first, which is perhaps something important to take note of. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers might just be the most impressive example of text-to-video so far. While it does not display the long term coherence of the Flexible Diffusion model that was presented in the first update, the results speak for themselves.

Various Prompts from CogVideo

At the time of writing, there is not a substantial amount of information available about this project, aside from the fact that the actual text inputs are in Chinese. Regardless of the lack of information, we can probably take a few educated guesses. In my opinion, the model was likely trained on a large set of stock videos, which would explain the more sterile appearance of the results. (Were a similar model to be trained on a corpus of videos randomly scraped off of YouTube, the results would likely be much more diverse and true to life.)

Judging by a quick visual comparison to the results generated by NÜWA, one of which can be seen in the main text of this post, this new model is certainly more capable of generating realistic results. If you blur your eyes, you might have a hard time telling these videos apart from something actually found in a stock video collection. A noticeable flaw is that, while temporal coherence on a general scale seems to be under control, close inspection reveals that every generated video has trouble staying consistent with fine details. There is an almost indescribable shimmering effect on every video.

Does this get us any closer to the aforementioned death of cinema? In truth, it's hard to tell. While this model does produce nicer looking results than anything previously seen, it is still a far cry from anything that would even be helpful in producing a film. While we will undoubtedly look back on models such as CogVideo and wonder how more people didn't see this technology coming, it is clear that there is still a long way to go.

That being said, this is probably a fire alarm moment. Do with it what you will.

(Update: The paper has been published.)

 

Update 3

6/8/2022

For update number three, which is only slightly more than one week removed from the previous entry, it's time to look at Generating Long Videos of Dynamic Scenes. (Similar to update two, the paper is not published at the time of writing. I'll append it when it is.) 

"Single videos on mountain biking dataset"

At first, the results are almost unbelievable. While there is a good amount of warping and inconsistency in the videos, the overall coherence is remarkable. When compared to StyleGAN-V, which might very well give you a headache if you watch it for too long, one could almost be forgiven for mistaking the new results for real videos recorded with bad cameras. Looking back on the model presented in update one, Flexible Diffusion Modeling of Long Videos, which also features motion combined with improved coherence, it feels as though a year's worth of improvement has been made since then.

Something important to keep in mind about this model is that it isn't text-to-video quite like the other models in this post. Where everything else presented here has followed the formula of "Enter Text = Get Video", this model seems to be specifically trained on datasets finetuned to produce the desired results. Again, the paper isn't yet released, but from what is public as of writing this, it seems that in order to get a video of clouds you have to train it on a specific dataset of clouds and nothing else. Of course, this could just be an artificial limitation placed upon the model for the purposes of demonstration, with any text-to-video capabilities neutered for the sheer sake of appearances. Either way, this is unmistakably a significant step in believable video generation.

There honestly isn't much to say other than it's almost difficult to stay abreast of the progress being made in this field. When this post was made, the most advanced models were NÜWA and Video Diffusion Models, the results of which were muddy and barely passable as videos. I mentioned in update two that I believe we're currently witnessing a fire alarm moment for AI generated video, and my beliefs have only been strengthened by this release. When the first large-scale DALL·E-esque video generation model is put into the spotlight for the general public to see, it will take them by considerable surprise. Those who have been paying attention will be less surprised, but probably still not entirely ready for what comes next. I previously stated in the main post that such a moment could be several years away. Now, however, I could see this occurring by Q4 of this year, 2022.

(Update: The paper has been published.)

14 comments

very good post. I had a scare while reading this, remembering a conversation I had with someone a few years ago, telling them that artists and painters should all be safe, as "I don't think computers will be able to paint". This stuff is scary (and fun somehow).

First the artists will become unemployed, then the software developers... and the truck drivers will keep their jobs until Singularity. The opposite of what many people expected.

I think software developers should keep their jobs for a while. Their jobs involve a lot of 'people asking very specific requests' and I think the only reason it works is because both parties are humans and can understand each other well. I think as long as people don't know how to be specific and robotic with how they request things, software devs should be fine.

Idk, language models are getting increasingly good at responding to fairly vague prompts. They aren't incredible at it or anything, but I expect that skill to increase over time, probably to human-level, but maybe even above that.

It was pretty clear that most of Dall-e's limitations were temporary, but Imagen looks really amazing. Would love to see all the Dall-e limitations reassessed with Imagen. Also, I'd bet OpenAI will update Dall-e within 2022 (possibly without announcing Dall-e 3).

It wouldn't surprise me at all if we soon see an update to DALL·E similar to the instruct series for GPT-3. Not a total reworking, but a significant enough change that it produces improved results.

this approach of forward thinking is what I was looking for. Does anybody know of more content like this? (youtube, blog, LW)

Speculating and writing about how and in what order these new tools can change components of future everyday life seems very valuable. We all know change is coming at an ever-increasing rate. But there is little written speculation about the near-term future. I'd love to help piece together such thinking.

I think most people on LW try to keep their speculations to a minimum mainly to avoid embarrassment for when they don't come true. I have no such worries when it comes to this technology specifically, since the outcome is so obvious that it would be more unlikely for it to not happen.

While I'm not certain about any particular other content that might scratch the same itch, I definitely plan to write more posts on other topics similar to this one.

One possible barrier to that is going to be copyright laws. I have a feeling film studios won't take kindly to people creating movies using their intellectual property, and if the models required to generate such movies are larger than most private individuals can afford (which I would strongly expect, at least for a few years), then they may be able to get the entire endeavour shut down for a long while.

There's nothing stopping people from making fan films at the current moment, generally with the limitation that it isn't put up for sale. I would find them being able to shut down progress on this tech dubious at best, but certainly not outside the realm of possibility.

It's important to consider the idea that these models could also be used in tandem with actual footage as a cheap alternative to the modern CGI pipeline. Instead of paying hundreds of artists to painstakingly make your CGI alien world, you could ask a future model to inpaint the green screens with an alien world that it generates.

Very true, the only exception I can think of is that if I wanted a movie in which I was the main character I would have to spend immense time defining and explaining every aspect of my life and interests.

I could sit here and speculate on potential workarounds, but you're probably correct. It makes sense that if you want to place yourself in movies, you would need to first build a comprehensive model of yourself for the AI to work with. Fortunately, this is the kind of thing you'd only need to do once.