READ MORE: From Siri to Photoshop to Google Search — Large AI Models Will Redefine How We Live (Alberto Romero)
Apple announced the iPhone in 2007. Now we can no longer fathom a world without a smartphone in our pockets. The same happened with social media: Facebook and TikTok govern our virtual relationships and how we get our news.
We’re on the verge of a third technology revolution, which will blend with and be fueled by the ubiquity of devices and algorithms.
AI has “world-shaping potential,” explains Alberto Romero, who runs The Algorithmic Bridge.
It’s not any old AI that will impact us in ways we can only imagine; it’s large AI models and their integration into the Internet of Things.
In The Algorithmic Bridge, Romero reviews the rise of various AI tools, concentrating on the tremendous gains made in the field of large language models.
From 2012 to 2022, the AI field evolved at an unprecedented rate.
Today, generative large language models, together with multimodal and art models, dominate the landscape. Tech giants, ambitious startups, and non-profit organizations all aim to leverage their potential, whether for private benefit or to democratize their promise.
These include OpenAI’s GPT-3, arguably the best-known AI model of the decade, and Google’s own LaMDA, the AI that former Google engineer Blake Lemoine claimed earlier this year was sentient.
Even LaMDA has been superseded at Google by PaLM, published in April. PaLM currently holds the title of largest dense language model and posts the highest performance across benchmarks; Romero considers it the state of the art in language AI.
However, the next major advance is already in training. This phase is focused on building AI tools that mimic our other senses, notably hearing and sight — but also human creativity.
OpenAI’s DALL-E 2 is the best known of the AI art models (also called diffusion-based generative visual models). Others include Microsoft’s NUWA, Meta’s Make-A-Scene, Google’s Imagen and Parti, Midjourney, and Stable Diffusion.
“These models, some behind paid memberships and others free-to-use, are redefining the creative process and our understanding of what it means to be an artist,” Romero says.
But that’s no longer news. Projecting this evolution forward, Romero expects AI models combining language, multimodal, and art-based capabilities to become our next virtual assistants.
“This shift from research to production will entail a third technological revolution this century. It will complete a trinity formed by smartphones, social media, and large AI models, an interdependent mix of technologies that will have lasting effects on society and its individuals.”— Alberto Romero
Advanced AI is going to be a “truly conversational Siri or Alexa,” your next search engine will be an “intuitive and more natural Google Search or Bing,” and your next artistic tool “will be a more versatile and creative Photoshop.”
The large-scale AI models are emerging from the lab to find a home in consumer products.
How Does Generative AI Work?
By Abby Spessard
READ MORE: How do DALL-E, Midjourney, Stable Diffusion, and other forms of generative AI work? (Big Think)
Generative AI is taking the tech world by storm even as the debate over AI art rages on. “Meaningful pictures are assembled from meaningless noise” is how Tom Hartsfield, writing at Big Think, sums up the process.
The generative model programs that power the likes of DALL-E, Midjourney and Stable Diffusion can create images almost “eerily like the work of a real person.” But do AIs truly function like a person, Hartsfield asks, and is it accurate to think of them as intelligent?
“Generative Pre-trained Transformer 3 (GPT-3) is the bleeding edge of AI technology,” he notes. Developed by OpenAI and licensed to Microsoft, GPT-3 was built to produce words. However, OpenAI adapted a version of GPT-3 to create DALL-E and DALL-E 2 through the use of diffusion modeling.
Diffusion modeling is a two-step process where AIs “ruin images, then they try to rebuild them,” as Hartsfield explains. “In the ruining sequence, each step slightly alters the image handed to it by the previous step, adding random noise in the form of scattershot meaningless pixels, then handing it off to the next step. Repeated, over and over, this causes the original image to gradually fade into static and its meaning to disappear.
“When this process is finished, the model runs it in reverse. Starting with the nearly meaningless noise, it pushes the image back through the series of sequential steps, this time attempting to reduce noise and bring back meaning.”
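The ruining sequence Hartsfield describes can be sketched in a few lines of Python. This is a toy illustration with NumPy, not anything from the actual models; the step count, noise scale, and tiny 4x4 "image" are all invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def ruin(image, steps=50, noise_scale=0.3):
    """Forward diffusion: repeatedly add a little random noise.

    Each step slightly alters the image handed to it by the
    previous step, so the original gradually fades into static.
    """
    x = image
    for _ in range(steps):
        x = x + noise_scale * rng.standard_normal(x.shape)
    return x

# A tiny 4x4 gradient standing in for a real picture.
image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
static = ruin(image)

# After many steps the accumulated noise dwarfs the original signal.
print(np.std(image), np.std(static))
```

The reverse pass cannot be written this simply: walking back from static to a meaningful image is exactly what the trained parameters are for.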
While the destructive part of the process is primarily mechanical, returning the image to lucidity is where training comes in. “Hundreds of billions of parameters,” including associations between images and words, are adjusted during the reverse process.
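What "adjusting parameters" means here can be illustrated with a deliberately tiny stand-in: fit a simple model to predict the noise that was added, then subtract the prediction. Everything below, including the ramp-shaped toy data and the single linear map in place of a giant network, is an invented simplification, not the real architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16  # each "image" is a flattened 4x4 gradient

# Toy training set: smooth ramps of varying brightness stand in
# for real pictures; each one gets a known dose of random noise.
scales = rng.uniform(0.5, 2.0, size=200)
clean = np.outer(scales, np.linspace(0.0, 1.0, dim))
noise = rng.standard_normal(clean.shape)
noisy = clean + noise

# "Training": adjust the parameters of a linear map W so that
# noisy @ W best predicts the noise that was mixed in.
W, *_ = np.linalg.lstsq(noisy, noise, rcond=None)

# De-noise an unseen example by subtracting the predicted noise.
x_clean = np.linspace(0.0, 1.0, dim)
x_noisy = x_clean + rng.standard_normal(dim)
x_denoised = x_noisy - x_noisy @ W
```

Real diffusion models play the same prediction-and-subtraction game with billions of parameters instead of a 16-by-16 matrix, and with text associations steering which image the de-noising converges toward.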
The DALL-E creators trained their model “on a giant swath of pictures, with associated meanings, culled from all over the web.” This enormous collection of data is partially why Hartsfield says DALL-E isn’t actually very much like a person at all. “Humans don’t learn or create in this way. We don’t take in sensory data of the world and then reduce it to random noise; we also don’t create new things by starting with total randomness and then de-noising it.”
Could generative AI nonetheless be intelligent in some other way? “A better intuitive understanding of current generative model AI programs may be to think of them as extraordinarily capable idiot mimics,” Hartsfield suggests.
As an analogy, Hartsfield compares DALL-E to an artist, “who lives his whole life in a gray, windowless room. You show him millions of landscape paintings with the names of the colors and subjects attached. Then you give him paint with color labels and ask him to match the colors and to make patterns statistically mimicking the subject labels. He makes millions of random paintings, comparing each one to a real landscape, and then alters his technique until they start to look realistic. However, he could not tell you one thing about what a real landscape is.”
Whatever your stance is on generative AI, we’ve landed in a new era, one in which computers can generate fake images and text that are extremely convincing. “While the machinations are lifeless, the result looks like something more. We’ll see whether DALL-E and other generative models evolve into something with a deeper sort of intelligence, or if they can only be the world’s greatest idiot mimics.”
How is it all going to redefine our relationship with technology and with one another?
We’ll find out sooner rather than later.