Text-to-image generators have swept the web in recent months. These AI systems turn a written description into an image. Enter “an astronaut riding a white horse,” and the system produces an image of, well, an astronaut riding a white horse.
One of the first of these services, DALL-E, developed by OpenAI, appeared early last year, producing reasonably well-rendered images. But advances since then have been striking. DALL-E 2, launched earlier this year, produces higher-resolution images of surprising realism. Other systems look equally impressive.
Nevertheless, this technology has generated controversy because of its biases and potential for abuse. For example, ask DALL-E 2 to produce an image of a doctor and it will show you a man in a white coat. Ask it for an image of a nurse and it will invariably produce an image of a woman.
But the approaches for tackling bias and preventing abuse are advancing slowly compared to the technology itself. And that raises questions about the challenges that more advanced AI systems are likely to throw up.
Enter two teams of researchers from Google Research and Meta AI who have developed the next generation of text-to-image machines. Google’s system turns text into virtual 3D objects while Meta’s turns text into short videos.
These approaches open a wide range of exciting new applications that the technology companies are eager to explore. But at the same time, these approaches raise important questions about bias, social norms, deepfakes and accountability that the same technology companies appear less open about solving.
Text-to-image generators have become possible because of the huge datasets of annotated images that can be scraped off the web. These datasets are made up of images with accurate text descriptions of what they show.
OpenAI, Google and others have used these datasets to train AI systems to learn the kinds of images that words and phrases describe and then to create entirely new images based only on a textual description. In August, an image created with an AI system called Midjourney controversially won a prize in the Colorado State Fair's art competition, beating the human artists in its category.
Now the new work goes a step further. Ben Poole at Google Research and colleagues have developed an AI system called DreamFusion that uses text strings to generate virtual 3D models. They have published a number of examples.
One way to do this would be by using a large database of annotated 3D models to train an AI system to associate words and models. However, sufficiently large databases of this kind do not exist.
So Poole and co started with a set of images and taught the system how to use them to create 3D models. By using a text-to-image generator to create the input image, DreamFusion can extrapolate that into a 3D model.
“The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment,” say Poole and co.
They point out that computer games and other digital media rely on thousands of 3D models that have to be generated by hand. DreamFusion should immediately make this process faster and cheaper. Expect to see an explosion of digital worlds as a result.
Uriel Singer and colleagues at Meta AI have developed a system called Make-A-Video that turns text into short videos. It’s easy to imagine that this could be done with a large dataset of videos with written descriptions of what they show.
But again, this kind of dataset is not easy to scrape off the web or to create manually. So instead, Singer and co teach their AI system in general terms how things move in the real world. The system then uses what it has learnt to convert still images into moving ones.
Make-A-Video relies on existing text-to-image AI to generate a picture and then uses its newfound knowledge to make it move. The end result is a short video.
That will also change the landscape for content creators. Video content on television, in movies and on the web contains numerous animations or special effects which are heavily reliant on human content creators.
But Make-A-Video makes this process significantly easier, faster and cheaper. It also gives content creators a head start from which they can develop more detailed and impressive effects (at least in theory).
All that should lead to a step change in the way video and immersive content are created. But it should also lead to a new focus on the problems that text-to-image systems are known for. The first of these is that the AI systems reflect the inherent bias in the datasets they are trained on. These datasets tend to be western, male-oriented and ethnically white.
Various groups have attempted to correct these biases, with varying degrees of success. For example, OpenAI has admitted inserting phrases, such as “black man” or “Asian woman,” into some text strings to correct bias.
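To make the idea of inserting phrases concrete, here is a minimal sketch in Python of what such a step could look like. It is an illustration only, not a description of how any company actually does it: the function name `augment_prompt`, the `rate` parameter and the list of words used to spot prompts that already specify a person are all hypothetical, and the only phrases used are the two examples mentioned above.

```python
import random

# The two example phrases mentioned in the article. The full set of phrases
# and the rules any real system uses are not public; this list is illustrative.
DESCRIPTORS = ["black man", "Asian woman"]

# Hypothetical list of words suggesting the prompt already says who is shown.
SPECIFIED = {"man", "woman", "men", "women", "boy", "girl", "male", "female"}


def augment_prompt(prompt, rng, rate=0.5):
    """Return the prompt, sometimes with a descriptor phrase appended.

    prompt: the text the user typed.
    rng:    a random.Random instance, passed in so behaviour is reproducible.
    rate:   assumed fraction of eligible prompts that get a phrase appended.
    """
    words = {w.strip(".,").lower() for w in prompt.split()}
    # Leave the prompt alone if the user already specified who is shown.
    if words & SPECIFIED:
        return prompt
    # Only modify a fraction of the remaining prompts.
    if rng.random() >= rate:
        return prompt
    return f"{prompt}, {rng.choice(DESCRIPTORS)}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment_prompt("a portrait of a doctor", rng, rate=1.0))
    print(augment_prompt("a woman reading a book", rng, rate=1.0))
```

With `rate=1.0`, a prompt such as "a portrait of a doctor" always gains one of the two phrases, while a prompt that already mentions a woman is returned unchanged. Even this toy version shows where the judgment calls sit: someone has to choose the phrases, the trigger conditions and the rate.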
Google and Meta AI have been less clear on how they intend to tackle the problem. Indeed, they have not granted the public access to their systems, citing this bias as a reason.
But if they do find a way to correct bias, how will they do it? Can it be right that companies and organizations with little or no accountability or transparency suddenly find themselves in charge of determining the nature of bias, of deciding what constitutes a social norm?
Humanity is destined to spend significantly more time in virtual worlds. These worlds are going to be created automatically by AI systems like these. If society is going to decide how to tackle bias in an open and accountable fashion, it will need to grasp this nettle soon.
Refs:
Make-A-Video: Text-to-Video Generation without Text-Video Data: arxiv.org/abs/2209.14792
DreamFusion: Text-to-3D using 2D Diffusion: arxiv.org/abs/2209.14988