
One of my first tests: a photo of a VR explorer from the 1970s


Intro

If you work in a remotely creative field, you might have noticed that your social media feeds are increasingly being flooded with a new kind of image (as of Q1 2022). The images are often trippy mixes of topics that don’t seem to belong together at first glance. While the visual quality varies between almost photoreal and painterly, with mixed perspectives and distorted faces, some images could be taken straight from the mind of someone tripping on psychedelics. And people don’t just post one image, but countless variations of the same topic. Yet none of these images were created by a person. They were made by different AI tools (Artificial Intelligence, as people love to say — and sell — these days). More precisely, these images are the result of a text input, called a prompt, and that’s pretty much it.

Text. To. Image.

Just repeat it a couple of times in your mind to grasp the actual impact of this. If Steve Jobs were still alive, these three words alone would have made one hell of a keynote. Even though you know it’s just math, depending on your age this might seem like utter magic.

After more than ten years in the industry of CG image creation and direction (animation and stills), the only moments when I felt this excited about a new technology were the emergence of fully procedural content creation, GPU-based rendering and the rise of modern real-time engines. That said, it feels like the implications for art, design, content creation, education and society in general are much larger this time. Since the public (though invitation-based) release of some of these tools, I’ve had so many heated lunch debates, online conversations and thoughts about the topic that I felt like writing them down — first of all to sort them and form an opinion for myself. And since this topic definitely needs public discourse, I decided to sum up my thoughts in this article.

It’s meant as an intro to the topic for some of you and a thought-starter for all of you, showing possible dystopian and utopian futures and a hopefully healthy way to think about the topic. In the latter part, we’ll discuss how you can make the most of this amazing new technology by actually embracing its limitations.

If you’re already familiar with the topic or have even used the new tools, feel free to jump directly to the chapter Infinite Options. Also feel free to visit my text-to-image blog https://ai.rosch.xyz/ where I collect my experiments.

What do you want to see?

You type your prompt by describing what you want to see:

A cinematic wide-angle photo of a group of astronauts in space

add some additional parameters to guide the neural network in a more specific direction:

star wars, 4K, high detail, octane render

and maybe even add an artist name whose style should be applied. Then you wait a minute or two and get low-resolution previews of a couple of possible variations. Afterwards, you refine your prompt or select a version to increase resolution and detail. Depending on the tool, you can even mark areas of the image that should be altered in a specific way. You might get lucky with your first try, but more often than not, your first result will be garbage. If you have a more precise idea in mind, it might take several rounds of rephrasing your prompt or selecting the right variations to continue with.
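The structure of such a prompt — a core description, followed by style and quality modifiers, and optionally an artist reference — can be sketched as a tiny helper. This is purely illustrative: the actual tools simply take one free-form text string, and the function name and keyword list below are my own invention, not any tool’s API.

```python
def build_prompt(subject, modifiers=(), artist=None):
    """Join a subject, optional modifier keywords, and an optional
    artist reference into a single comma-separated prompt string."""
    parts = [subject, *modifiers]
    if artist:
        parts.append(f"in the style of {artist}")
    return ", ".join(parts)

# Recreating the example prompt from above:
prompt = build_prompt(
    "A cinematic wide-angle photo of a group of astronauts in space",
    modifiers=["star wars", "4K", "high detail", "octane render"],
)
print(prompt)
```

In practice, the workflow is exactly this kind of loop: tweak the subject, swap modifiers in and out, rerun, and compare the resulting variations.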

The images you see these days come mainly from a small group of tools: Midjourney, OpenAI’s DALL·E 2, Disco Diffusion and Google’s Imagen.

The Harbingers of Progress

Over the last couple of years, AI for creative purposes kept inching closer, but still felt far from actual use. Tech demos and papers gave us glimpses into the black box of AI, but they always felt like tests or gimmicks. The early image-to-text recognition experiments, with their occasional racial biases and errors, always had a strange vibe. Google’s DeepDream was impressive in its own right, but it was hard to see a use for it beyond filters. Then there was the seemingly endless stream of GAN morphs, which would have been much more impressive if I had understood anything about the tech behind them. But I didn’t. Style transfer apps were interesting for a very short time, but the quality and the direct copying of styles never felt right. In general, most of these and similar examples were mainly impressive to people who understood the technology behind them — and thanks to AI being the black box that it is, that percentage of people is rather small.

But since the beginning of 2022, the perceived quantity and quality of AI image technology has ramped up massively, and all of a sudden we have a technology that is extremely easy to use, has applications in all kinds of creative processes, and will definitely kill some jobs. Thanks to mighty data centers, we don’t even need massive workstations anymore, but can use these new tools from a smartphone or tablet.

So technically, the craft of producing a digital image is taken out of the equation, and the technological entry barrier is lowered to a smartphone. If that doesn’t make you flinch, you’re either already retired or secretly working on a better technology. It’s easy to see why some artists fear for their jobs and view this as the final nail in the coffin of human creativity. Although that’s not my current opinion, I’m still in awe of how much faster this technology progressed than I even remotely anticipated. To be honest, I wouldn’t have expected this quality of text (and therefore speech) to image generation for another five to seven years.

If you want to get deeper into the topic of general AI and our distorted perception of technological progress, I can highly recommend Tim Urban’s thorough article The Road to Superintelligence. If you want to roughly understand the actual process of text-to-image generation and what happens in the mysterious Latent Space that everyone is talking about, watch this very well-made video.