Artificial intelligence that can understand object relationships
When humans look at a scene, they see objects and the relationships between them. On top of your desk, a laptop might be sitting in front of a computer monitor.
Deep learning models often struggle to see the world this way because they don’t understand the intricate relationships between individual objects. Without that understanding, a robot designed to assist someone in the kitchen would not be able to follow a command such as “pick up the spatula to the left of the stove and place it on the cutting board.”
MIT researchers have developed a model that understands the relationships between objects in a scene. The model represents individual relationships one at a time, then combines these representations to describe the overall scene. This lets it generate more accurate images from text descriptions, even when the scene contains multiple objects arranged in different relationships with one another.
This work could be useful in cases where industrial robots must perform complex, multi-step manipulation tasks, such as stacking or assembling items. It also moves the field closer to machines that can learn from their environments and interact with them more like humans do.
“When I look at a table, I can’t say that there is an object at XYZ position. Our minds don’t work like that. When we see a scene, our understanding is really based on the relationships between the objects. We believe that if we build a system that understands the relationships between objects, we could use that system to more effectively manipulate and change our environment,” says Yilun Du, a PhD student at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.
Du co-authored the paper with Shuang Li, a CSAIL PhD student; Nan Liu, a graduate student at the University of Illinois at Urbana-Champaign; Joshua B. Tenenbaum, a professor of computational cognitive science in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL. The research will be presented in December at the Conference on Neural Information Processing Systems.
One relationship at a time
The framework the researchers developed can generate an image of a scene based on a text description of objects and their relationships, such as “A wood table to the left of a blue stool. A red couch to the right of a blue stool.”
The system breaks these sentences down into smaller pieces that describe each individual relationship (“a wood table to the left of a blue stool”) and models each piece separately. An optimization process then combines these pieces to generate an image of the scene.
The researchers used a machine-learning technique called an energy-based model to represent the individual object relationships in a scene description. One energy-based model encodes each relational description, and composing those models yields a representation that captures all the objects and their relationships.
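In broad strokes, this kind of composition can be sketched in code: each relation contributes its own energy, the energies are summed, and a candidate image is refined by gradient-based sampling until the total energy is low. The sketch below is only illustrative and is not the authors’ implementation; the `relation_energy` callable stands in for a trained energy-based model, and the hyperparameters and Langevin-style sampling loop are assumptions about how such models are commonly sampled.

```python
import torch

def composed_energy(image, relations, relation_energy):
    # Compose per-relation energy-based models by summing their energies:
    # a low total energy means the image satisfies every relation at once.
    return sum(relation_energy(image, rel) for rel in relations)

def generate_scene(relations, relation_energy, steps=60, step_size=10.0, noise_scale=0.005):
    # Start from random noise and refine the image with Langevin-style
    # gradient steps on the composed energy (a common way to sample EBMs).
    image = torch.rand(1, 3, 64, 64, requires_grad=True)
    for _ in range(steps):
        energy = composed_energy(image, relations, relation_energy)
        grad, = torch.autograd.grad(energy, image)
        with torch.no_grad():
            image -= step_size * grad                       # move toward lower energy
            image += noise_scale * torch.randn_like(image)  # noise keeps the sampler stochastic
            image.clamp_(0.0, 1.0)
    return image.detach()
```

In the paper’s setting, each relation’s energy would come from a neural network trained so that images consistent with that relation receive low energy; here it is just a placeholder callable.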
Li explained that the system can break down the sentences into smaller pieces for each relationship so it can recombine them in many different ways. This allows it to better adapt to scene descriptions it hasn’t seen before.
“Other systems would take all the relations holistically and generate the image one-shot from the description. But such approaches fail when we have out-of-distribution descriptions, such as descriptions with more relations, because these models can’t adapt in one shot to generate images containing more relationships. By composing these separate, smaller models together, we can model a larger number of relationships and adapt to novel combinations,” Du says.
The process can also work in reverse: given an image, the system can find text descriptions that match the relationships between objects in the scene. The model can also be used to edit an image by rearranging the objects in the scene so they match a new description.
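The same composed energy can, in principle, serve as a score for this reverse direction: each candidate description is broken into relations, and the description whose relations the image satisfies best gets the lowest total energy. The snippet below is a hypothetical sketch, reusing the placeholder `relation_energy` from above and assuming an `encode_relations` helper that parses a description into its individual relations; neither is part of the authors’ released code.

```python
def best_matching_description(image, candidate_descriptions, relation_energy, encode_relations):
    # Pick the description whose composed relation energies are lowest for this image,
    # i.e., the set of relationships the image satisfies best.
    def score(description):
        relations = encode_relations(description)  # hypothetical parser: text -> relation encodings
        return sum(float(relation_energy(image, rel)) for rel in relations)

    return min(candidate_descriptions, key=score)
```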
Understanding complex scenes
The researchers compared their model to other deep learning methods that were given text descriptions and tasked with generating images showing the corresponding objects and relationships. In each case, their model outperformed the baselines.
They also asked people to evaluate whether the generated images matched the original scene descriptions. In the most complex examples, where descriptions contained three relationships, 91 percent of participants concluded that the new model performed better.
“One of the most interesting things we found is that we can increase our sentence from having one relation description to having three or four, and our approach continues to be able to generate images that are correctly described by those descriptions,” Du says.
The researchers also showed the model images of scenes it had never seen before, along with several different text descriptions for each image, and it was able to identify the description that best matched the object relationships in the image.
And when the researchers gave the system two descriptions of the same image that were worded in different ways, the model understood that the descriptions were equivalent.
The researchers were impressed by the robustness of their model, particularly when it worked with descriptions it had never encountered before.
“This is very promising because it is closer to how humans work. Humans may only see a few examples, but we can extract useful information from them and combine them to create infinite combinations. Our model can learn from less data but generalize to more complex scenes and image generations,” Li says.
While these early results are encouraging, the researchers want to see how the model performs on real-world images with more complex backgrounds and objects that partially block one another.
They also want to incorporate their model into robotics systems, which would allow a robot to infer object relationships from videos and then apply this knowledge to manipulate objects in the world.
“Developing visual representations that can deal with the compositional nature of the world around us is one of the key open problems in computer vision. This paper makes significant progress on this problem by proposing an energy-based model that explicitly models multiple relations among the objects depicted in the image. The results are really impressive,” says Josef Sivic, a distinguished researcher at the Czech Institute of Informatics, Robotics and Cybernetics, who was not involved in this research.