Researchers train a model to reach human-level performance at recognizing abstract concepts in video.
The ability to reason abstractly about events as they unfold is a defining feature of human intelligence. We know instinctively that crying and writing are means of communicating, and that a panda falling from a tree and a plane landing are variations on descending.
Organizing the world into abstract categories does not come easily to computers, but in recent years researchers have inched closer by training machine learning models on words and images infused with structural information about the world, and how objects, animals, and actions relate. In a new study at the European Conference on Computer Vision this month, researchers unveiled a hybrid language-vision model that can compare and contrast a set of dynamic events captured on video to tease out the high-level concepts connecting them.
Their model did as well as or better than humans at two types of visual reasoning tasks — picking the video that conceptually best completes the set, and picking the video that doesn’t fit. Shown videos of a dog barking and a man howling beside his dog, for example, the model completed the set by picking the crying baby from a set of five videos. Researchers replicated their results on two datasets for training AI systems in action recognition: MIT’s Multi-Moments in Time and DeepMind’s Kinetics.
The original article can be found here.
In addition to AI for reasoning, AI and causal inference is also an important topic. Professor Judea Pearl is a pioneer for developing a theory of causal and counterfactual inference based on structural models. In 2020, Professor Pearl is also awarded as World Leader in AI World Society (AIWS.net) by Michael Dukakis Institute for Leadership and Innovation (MDI) and Boston Global Forum (BGF). In the future, Professor Judea will also contribute to Causal Inference for AI transparency, which is one of important AIWS topics on AI Ethics.