On Media Synthesis: An Essay on The Next 15 Years of Creative Automation

Yuli_Ban

One of my favorite childhood memories involves something that technically never happened. When I was ten years old, my waking life revolved around cartoons— flashy, colorful, quirky shows that I could find in convenient thirty-minute blocks on a host of cable channels. This love was so strong that I thought to myself one day, "I can create a cartoon." I'd been writing little nonsense stories and drawing (badly) for years by that point, so it was a no-brainer to my ten-year-old mind that I ought to make something similar to (but better than) what I saw on television.
The logical course of action, then, was to search "How to make a cartoon" on the internet. I saw nothing worth my time that I could easily understand, so I realized the trick to play— I would have to open a text file, type in my description of the cartoon, and then feed it into a Cartoon-a-Tron. Voilà! A 30-minute cartoon!
Now I must add that this was in 2005, which ought to communicate how successful my animation career was.

Two years later, I discovered an animation program at the local Wal-Mart and believed that I had finally found the program I had hitherto been unable to find. When I rode home, I felt triumphant in the knowledge that I was about to become a famous cartoonist. My only worry was whether the disk would have all the voices I wanted preloaded.

I used the program once and have never touched it since. Around that same time, I did research on how cartoons were made— though I was aware some required many drawings, I was not clear on the entire process until I read a fairly detailed book filled with technical industry jargon. The thought of drawing thousands of images of singular characters, let alone entire scenes, sounded excruciating. This did not begin to fully encapsulate what one needed to create a competent piece of animation— from brainstorming, storyboarding, and script editing all the way to vocal takes, music production, auditory standards, post-production editing, union rules, and more, the reality shattered every bit of naïveté I held prior about the 'ease' of creating a single 30-minute cartoon (let alone my cavalcade of concepts coming and going with the seasons).

In the most bizarre of twists, my ten- and twelve-year-old selves may have been onto something; their only mistake was holding these ideas decades too soon.
In the early 2010s, progress in the field of machine learning began to accelerate exponentially as deep learning went from an obscure little breakthrough to the forefront of data science. Neural networks— once a nonstarter in the field of artificial intelligence— underwent a "grunge moment" and quickly ushered in a sweltering new AI summer which we are still in.

In very short form, neural networks are sequences of large matrix multiplications interleaved with nonlinear functions, and machine learning is basically statistical modeling fitted by following gradients. Deep learning stacks many such layers into a single network and parses through a frankly stupid amount of data to optimize its outputs.
As it turns out, deep learning is very competent at certain sub-cognitive tasks— things we recognize as language modeling, conceptual understanding, and image classification. In that regard, it was only a matter of time before we used this tool to generate media. Synthesize it, if you will.
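To make the "matrix multiplications with nonlinear functions" description concrete, here is a minimal sketch in plain Python. The layer sizes, the hand-picked weights, and the function names are all made up for illustration, and nothing here is trained; it only shows the shape of the computation a network performs when it turns an input into an output.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """A common nonlinearity: negative values are clipped to zero."""
    return [max(0.0, v) for v in vector]

def forward(layers, inputs):
    """Run the inputs through each (weights, biases) layer in sequence."""
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        pre = [z + b for z, b in zip(matvec(weights, activations), biases)]
        # Apply the nonlinearity on hidden layers; leave the last layer linear.
        activations = relu(pre) if i < len(layers) - 1 else pre
    return activations

# Hypothetical weights chosen by hand (a trained network would learn these).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # 2 inputs -> 2 hidden units
    ([[2.0, 1.0]], [0.5]),                     # 2 hidden units -> 1 output
]
print(forward(layers, [3.0, 1.0]))
```

"Deep" learning, in this picture, is the same loop with far more layers and far larger matrices, plus a training procedure (not shown) that adjusts the weights.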

Media synthesis is an umbrella term that includes deepfakes, style transfer, text synthesis, image synthesis, audio manipulation, video generation, text-to-speech, text-to-image, autoparaphrasing, and more.

AI has been used to generate media for quite some time— speech synthesis goes back to the 1950s, Markov chains have stitched together occasionally-quasi-coherent poems and short stories for decades, and Photoshop involves algorithmic changes to preexisting images. If you want to get very figurative, some mechanical automatons from centuries prior could write and do calligraphy.
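For a sense of how simple the Markov-chain approach is, here is a minimal word-level sketch in Python. The toy corpus and the function names are invented for illustration; real generators of this kind were typically fed much larger texts and often looked at more than one preceding word.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain, picking a random recorded follower at each step."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: no word ever followed this one
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat and the cat ran"
chain = build_chain(corpus)
print(generate(chain, "the", 8, random.Random(0)))
```

Because each word is chosen by looking only at the word before it, the output is locally plausible and globally aimless, which is where the "occasionally quasi-coherent" quality comes from.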
It wasn't until roughly the 2010s that the nascent field of "media synthesis" truly began to grow, thanks in large part to the creation of generative adversarial networks (introduced by Ian Goodfellow and colleagues in 2014, though Jürgen Schmidhuber has argued that related adversarial ideas date back to his work in the 1990s). Another early success in media synthesis was 'DeepDream', an incredibly psychedelic style of image synthesis that bears some resemblance to schizophrenic hallucinations: networks would hallucinate swirling patterns filled with disembodied eyes, doglike faces, and tentacles because they were trained on certain images.
When it came to generating more realistic images, GANs improved rapidly: in 2016, Google's image generation and classification system proved able to create a number of recognizable objects ranging from hotel rooms to pizzas. The next year, image synthesis improved to the point that GANs could create realistic high-definition images.

Neural networks weren't figuring out just images— in 2016, UK-based Google DeepMind unveiled WaveNet for the synthesis of realistic audio. Though it was meant for voices, synthesizing audio waves with such high precision means that you can synthesize any sound imaginable, including musical instruments.

And on Valentine's Day, 2019, OpenAI shocked the sci-tech world with the unveiling of GPT-2, a text-synthesis network with 1.5 billion parameters that is so powerful, it displays just a hint of some narrowly generalized intelligence: from text alone, it is capable of inferring location, distance, sequence, and more without any specialized programming. The text generated by GPT-2 ranges from typically incoherent all the way to humanlike, but the magical part is how consistently it can synthesize humanlike text (in the form of articles, poems, and short stories). GPT-2 beats the competition on the Winograd Schema Challenge by roughly seven percentage points, a barely believable leap forward in the state of the art made even more impressive by the fact that GPT-2 is a single, rather simple network with no augmentation by other algorithms. If given such performance enhancements, its score may reach as high as 75%. If the number of parameters for GPT-2 were increased a thousandfold, it very well could synthesize entire coherent novels, that is, stories that are at least 50,000 words in length.

This is more my area of expertise, and I know how difficult it can be to craft a novel or even a novella (which need only be roughly 20,000 words in length). But I am not afraid of my own obsolescence. Far from it. I fashion my identity more as a media creator who merely resorts to writing: drawing, music, animation, directing, etc. are certainly learnable, but I've dedicated myself to writing. My dream has always been to create "content", not necessarily "books" or any one specific form of media.
This is why I've been watching the progress in media synthesis so closely ever since I had an epiphany on the technology in December of 2017.

We speak of automation as following a fairly predictable path: computers get faster, algorithms get smarter, and we program robots to do drudgery, the difficult blue-collar jobs that no one wants to do but someone has to do for society to function. In a better world, this would free workers to take up more intellectual pursuits in STEM fields and entertainment, though there's the chance that this will merely lead to widespread unemployment and necessitate the implementation of a universal basic income. As more white-collar jobs are automated, humans take to creative jobs in greater numbers, bringing about a flourishing of the arts.

In truth, the progression of automation will likely unfold in the exact opposite pattern. Media synthesis requires no physical body. Art, objectively, requires a medium by which we can enjoy it, whether that's a canvas, a record, a screen, a marble block, or what have you. The actual artistic labor is mental in nature; the physical labor involves transmitting that art through a medium. This can be perfectly replicated with data alone, as these forms of expression can be quantified in digital form. Thus, pure software can automate the creation of entertainment with humans needed only as peripheral agents to enjoy this art (or bring the medium to the software).

This is not the case with most other jobs. A garbageman does not use a medium of expression in order to pick up trash. Neither does an industrial worker. The results of these jobs are also not rooted in data or anything ephemeral: if there is trash to be picked up, you must use physical labor in order to do so. And while many of these jobs have indeed been automated, there is a limit to how automated they can be with current software. Automation works best when there are no variables. If something goes wrong on an assembly line, we send in a human to fix it because the machines are not able to handle errors or unexpected variables. What's more, physical jobs like this require a physical body; they require robotics. And anyone who has worked in the field of machine learning knows that there is a massive gap between what works beautifully in a simulation and what works in real life, because reality has vastly more variables than even present-day computers can model.

To put it another way, in order for blue-collar automation to completely upend the labor market, we require both general-purpose robots (which we technically have) and general AI (which we don't). There will be increasing automation of the industrial and service sectors, sure, but it won't happen quite as quickly as some claim.

Conversely, "disembodied" jobs— the creatives and plenty of white-collar work— could be automated away within a decade. It makes sense that the economic elite would promote the opposite belief since this suggests they are the first on the chopping block of obsolescence, but when it comes to the entertainment industry, there is actually an element of danger in how stupendously close we are to great changes and yet how utterly unprepared we are to deal with them.

Or to put it shortly, jobs that involve the creation of data can be automated without any need for advancements in robotics. 10 years from now, many low and high-skill manual jobs will still be around, but plenty of white-collar and entertainment-based jobs will be obsolete.

There are essentially two types of art: art for art's sake and art as career. Art for art's sake isn't going away anytime soon and never has been in danger of automation. This, pure expression, will survive. Art as career, however, is doomed. What's more, its doom is impending and imminent. If your plan in life is to make a career out of commissioned art, as a professional musician, voice actor, cover model, pop writer, video game designer, keyframe artist, or asset designer, your field has at most 15 years left. In 2017, I felt this was a liberal prediction and that art-as-career would die perhaps in the latter half of the 21st century. Now, just two years later, I'm beginning to believe I was conservative. We need not create artificial general intelligence to effectively destroy most of the model, movie, and music industries.

Models, especially cover models, might find a dearth of work within a year.

Yes, a year. If the industry were technoprogressive, that is. In truth, it will take longer than that. But the technology to completely unemploy most models already exists in a rudimentary form. State-of-the-art image synthesis can generate photorealistic faces with ease—we're merely waiting on the rest of the body at this point. Parameters can be altered, allowing for customization and style transfer between an existing image and a desired style, further giving options to designers. In the very near future, it ought to be possible to feed an image of any clothing item and make someone in a photo "wear" those clothes.

In other words, if I wanted to put Adolf Hitler in a Japanese schoolgirl's clothes for whatever esoteric reason, it wouldn't be impossible for me to do this.

And here is where we shift gears for a moment to discuss the more fun side of media synthesis.

With sufficiently advanced tools, which we might find next decade, it will be possible to take any song you want and remix it any way you desire. My classic example is taking TLC's "Waterfalls" and turning it into a 1900s-style barbershop quartet. This could only be accomplished via an algorithm that understood what barbershop music sounds like and knew to keep the lyrics and melody of the original song, swap the genders, transfer the vocal style to a new one, and subtract the original instrumentation. A similar example of mine is taking Witchfinder General's "Friends of Hell" and doing just two things: changing the singer into a woman, preferably Coven's Jinx Dawson, and changing a few of the lyrics. No pitch change to the music, meaning everything else has to stay right where it is.
The only way to do this today is to actually cover the songs and hope you do a decent enough job. In the very near future, through a neural manipulation of the music, I could accomplish the same on my computer with just a few textual inputs and prompts. And if I can manipulate music to such a level, surely I needn't mention the potential to generate music through this method. Perhaps you'd love nothing more than to hear Foo Fighters but with Kurt Cobain as vocalist (or co-vocalist), or perhaps you'd love to hear an entirely new Foo Fighters album recorded in the style of the very first record.

Another example I like to use is the prospect of the first "computer-generated comic." Not to be confused with a comic using CGI art, the first computer-generated comic will be one created entirely by an algorithm. Or, at least, drawn by algorithm. The human will input text and descriptions, and the computer will do the rest. It could conceivably do so in any art style. I thought this would happen before the first AI-generated animation, but I was wrong— a neural network managed to synthesize short clips of the Flintstones in 2018. Not all of them were great, but they didn't have to be.

Very near in the future, I expect there to be "character creator: the game" utilizing a fully customizable GAN-based interface. We'll be able to generate any sort of character we desire in any sort of situation, any pose, any scene, in any style. From there, we'll be able to create any art scene we desire. If we want Byzantine art versions of modern comic books, for example, it will be possible. If you wanted your favorite artist to draw a particular scene they otherwise never would, you could see the result. And you could even overlay style transferring visuals over augmented reality, turning the entire world itself into your own little cartoon or abstract painting.

Ten years from now, I will be able to accomplish the very thing my ten-year-old self always wanted: I'll be able to download an auto-animation program and create entire cartoons from scratch. And I'll be able to synthesize the voices, any voice, whether I have a clip or not. I'll be able to synthesize the perfect soundtrack to match it. And the cartoon could be in any art style. It doesn't have to have choppy animation; if I wanted it to have fluidity beyond that of any Disney film, it could be done. And there won't be regulations to follow unless I choose to publicly release that cartoon. I won't have to pay anyone, let alone put down hundreds of thousands of dollars per episode. The worst problem I might have is if this technology isn't open-source (most media synthesizing tools are, via GitHub) and it turns out I have to pay hundreds of thousands of dollars for such tools anyway. This would only happen if the big studios of the entertainment industry bought out every AI researcher on the planet or shut down piracy & open source sites with extreme prejudice by then.

But it could also happen willingly if said AI researchers don't trust these tools to be used wisely, as OpenAI so controversially decided with GPT-2.

Surely you've heard of deepfakes. There is quite a bit of entertainment potential in them, and some are beginning to capitalize on this— who wouldn't want to star in a blockbuster movie or see their crush on a porn star's body? Except that last one isn't technically legal.
And this is just where the problems begin. Deepfakes exist as the tip of the warhead that will end our trust-based society. Despite the existence of image manipulation software, most isn't quite good enough to fool people— it's easier to simply mislabel something and present it as something else (e.g. a mob of Islamic terrorists being labeled American Muslims celebrating 9/11). This will change in the coming years when it becomes easy to recreate reality in your favor.

Imagine a phisher using style transferring algorithms to "steal" your mother's voice and then call you asking for your social security number. Someone will be the first. We have no widespread telephone encryption system in place to prevent such a thing because such a thing is so unthinkable to us at the present moment.

Deepfakes would be best at subtly altering things, adding elements that weren't there and that you wouldn't notice at first. But it's also possible for all aspects of media synthesis to erode trust. If you wanted to create events in history, complete with all the "evidence" necessary, there is nothing stopping you. Most probably won't believe you, but some subset will, and that's all you need to start wreaking havoc. At some point, you could pick and choose your own reality. If I had a son and raised him believing that the Beatles were an all-female band, with all live performances and interviews showcasing a female Beatles and all online references referring to them as women, then the inverse, that the Beatles were an all-male band, might very well become "alternate history" to him, because how can he confirm otherwise? Someone else might tell him that the Beatles were actually called Long John & the Beat Brothers because that's the reality they chose.

This total malleability of reality is a symbol of our increasingly advanced civilization, and it's on the verge of becoming the present. Yet outside of mentions of deepfakes, there has been little dialogue on the possibility in the mainstream. It's still a given that Hollywood will remain relatively unchanged even into the 2040s and 2050s besides "perhaps using robots & holographic actors". It's still a given that you could get rich writing schlocky romance novels on Amazon, become a top 40 pop star, or trust most images and videos because we expect (and, perhaps, we want) the future to be "the present with better gadgets" and not the utterly transformative, cyberdelic era of sturm und drang ahead of us.

All my ten-year-old self wants is his cartoon. I'll be happy to give it to him whenever I can.

If you want to see more, come visit the subreddit: https://www.reddit.com/r/MediaSynthesis/

mako yass (7y)

An attempted reply to your concern about deepfakes grew into its own post.

If you wanted to create events in history, complete with all the "evidence" necessary, there is nothing stopping you.

For past footage, some of my proposed solutions wouldn't apply... but this will not attenuate our connection to history by very much. Most important historical documents are not videos. We are reliant on the accounts of honest people, and we always will be, if not for verifying direct evidence, for understanding it.

johnvon (7y)

Mako, I just read your response post.

This proposed solution reminds me very much of some of the solutions the software and music industries proposed in order to stop piracy. Unfortunately, none of these worked or were practical enough to put into widespread use. And of course the adoption has to be UNIVERSAL to be effective.

mako yass (7y)

They're related fields. For various reasons (some ridiculous) I've spent a lot of time thinking about the potential upsides of the thing that Richard Stallman called Treacherous Computing. There are many. We're essentially talking about the difference between having devices that can make promises and devices that can't. Devices that have the option of pledging to tell the truth in certain situations, and devices that can tell any lie that is possible to tell.

I think we have reason to believe Trusted Computing will be easier to achieve with better (cheaper) technology. I also think we have reasons to hope that it will be easier to achieve. Really, Trusted Computing and Treachery are separate qualities. An unsealed device can have secret backdoors. A sealed device can have an open design and an extensively audited manufacturing process.

I'm not sure what you're getting at with the universality concern. If a work could only be viewed in theatres and on TC graphics hardware with sealed screens (do those exist yet?), it would still be very profitable. They would not strictly need universal adoption of sealed hardware.