AI engines are getting pretty good at accurate image recognition but fail spectacularly at understanding what they are looking at. An approach used in natural language processing could address that.
In a shoot-out between humans and the AI smarts of Amazon AWS Rekognition, Google Vision, IBM Watson, and Microsoft Azure Computer Vision, the machines came out on top.
On a pure accuracy basis, Amazon, Google and Microsoft scored higher than human tagging for tags with greater than 90% confidence in a test completed by digital consultancy Perficient Digital, as reported at ZDNet.
However, in a machines versus humans rematch, the engine-generated descriptions matched up poorly with the way that we would describe the image. In other words, the study concluded, there is a clear difference between a tag being accurate and what a human would use to describe an image.
A couple of years on, Steve Teig, CEO of edge-AI accelerator chip company Perceive, says advances in natural language processing (NLP) techniques can be applied to computer vision to give machines a better understanding of what they are seeing.
So-called attention-based neural network techniques, which are designed to mimic cognitive processes by giving an artificial neural network an idea of history or context, could be applied to image processing.
In NLP, the Attention mechanism looks at an input sequence, such as a sentence, and decides after each piece of data in the sequence (a syllable or word) which other parts of the sequence are relevant. This is similar to how you are reading this article: Your brain holds certain words in memory even as it focuses on each new word you’re reading, because the words you’ve already read, combined with the word you’re reading right now, lend valuable context that helps you understand the text.
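As a rough sketch of what that mechanism computes, the following NumPy snippet (an illustration of standard scaled dot-product self-attention, not code from the study or from Teig) scores every position of a toy four-token "sentence" against every other and mixes the tokens accordingly:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores all keys,
    and the normalized scores weight a sum over the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # relevance of every position to every other
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

# Four "words", each embedded as an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = attention(x, x, x)  # self-attention: the sequence attends to itself
print(out.shape)          # (4, 8): each position is now a context-aware mix
```

In a real network, Q, K and V would be learned linear projections of the input rather than the raw embeddings.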
Applying the same concept to a still image (rather than a temporal sequence such as a video) is less obvious, but Teig says Attention can be used in a spatial context here. Syllables or words would be analogous to patches of the image.
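To make the analogy concrete, here is one common way (an assumed illustration, not a method Teig describes) to turn an image into a sequence of patch "tokens" that an attention layer could consume:

```python
import numpy as np

# A toy 32x32 grayscale "image" split into 8x8 patches, each flattened
# into a vector -- the visual analogue of words in a sentence.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))

patch = 8
patches = (image
           .reshape(32 // patch, patch, 32 // patch, patch)
           .transpose(0, 2, 1, 3)           # group by (row block, column block)
           .reshape(-1, patch * patch))
print(patches.shape)  # (16, 64): 16 patch "tokens", each a 64-dim vector
```

Each of the 16 rows can then play the role a word embedding plays in NLP, with attention deciding which patches are relevant to which.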
As outlined by Teig using an example of computer vision applied to an image of a dog, “There’s a brown pixel next to a grey pixel, next to…” is “a terrible description of what’s going on in the picture,” as opposed to “There is a dog in the picture.”
He says new techniques help an AI “describe the pieces of the image in semantic terms. It can then aggregate those into more useful concepts for downstream reasoning.”
Interviewed by EE Times, Teig said, “I think there’s a lot of room to advance here, both from a theory and software point of view and from a hardware point of view, when one doesn’t have to bludgeon the data with gigantic matrices, which I very much doubt your brain is doing. There’s so much that can be filtered out in context without having to compare it to everything else.”
This matters because current NLP processing from the likes of Google is computationally intensive. Deep learning language models like Generative Pre-trained Transformer 3 (GPT-3) require 175 billion parameters, which amounts to trillions of bits of information.
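The arithmetic behind that figure is simple: 175 billion parameters translate into hundreds of gigabytes of weights even at reduced numeric precision (a back-of-envelope calculation, independent of any particular deployment):

```python
# Memory footprint of a 175-billion-parameter model at common precisions.
params = 175e9
for bits in (32, 16, 8):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: {gigabytes:,.0f} GB")
# 32-bit weights: 700 GB
# 16-bit weights: 350 GB
# 8-bit weights: 175 GB
```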
If you want to do this at the network Edge, to fuel next-gen applications over 5G, then think again.
“It’s like… I’m going to ask you a trillion questions in order to understand what you’ve just said,” Teig says. “Maybe it can’t be done in 20,000 or two million, but a trillion — get out of here! The flaw isn’t that we have a small [processor at the Edge]; the flaw there is that having 175 billion parameters means you did something really wrong.”
That said, this is all evolving very fast. He thinks that reducing Attention-based networks’ parameter counts, and representing them efficiently, could bring this kind of embedded vision to Edge devices soon.
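One generic example of "representing them efficiently" is post-training quantization. The sketch below (an illustrative technique, not Perceive's method; the function names are invented for this example) stores a float32 weight matrix at 8 bits per value, a 4x size reduction, at the cost of a small, bounded rounding error:

```python
import numpy as np

def quantize_int8(w):
    """Linear quantization: map weights onto int8 via a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # worst case is half a quantization step
print(q.nbytes, w.nbytes)  # 4096 vs 16384 bytes: a 4x size reduction
```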