Computer Vision has entered a golden era in which algorithms are being transformed from prototypes into real-world applications. Much of this success is due to the rise of deep learning algorithms, which have successfully tackled new Computer Vision tasks, ranging from object detection to semantic segmentation. Most models rely on the supervised learning paradigm, in which a convolutional neural network architecture is trained on very large datasets. While successful, there are still key challenges that MCML researchers address in this research area: going beyond convolutional neural networks by developing novel models that encode both low-level pixel relationships and high-level object interactions; going beyond supervised learning by proposing new techniques to learn from unlabeled data, focusing on other paradigms such as self-supervised, semi-supervised, or active learning; and going beyond 2D by moving from semantic analysis of images and videos to analyzing and reasoning about the shape, appearance, and motion of the 3D world perceived through the camera.
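As a minimal sketch of the self-supervised paradigm mentioned above, the following contrastive (SimCLR-style) objective learns representations from unlabeled images by pulling two augmented views of the same image together in embedding space while pushing other images apart. The function name, batch size, embedding dimension, and temperature are illustrative assumptions, not a specific MCML method.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over two batches of embeddings (two views of the same images)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit-length rows
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    n = z1.shape[0]
    # Mask out self-similarity so an embedding is never its own positive.
    sim.fill_diagonal_(float("-inf"))
    # The positive for sample i is its other augmented view (i+n or i-n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage: random "embeddings" standing in for the outputs of an image encoder.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```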
Natural Language Processing (NLP) is the subarea of computer science concerned with the understanding and generation of natural language text. The field has been revolutionized in the last five-plus years by the advent of deep learning. In spite of this impressive progress, the gap between the current state of the technology and human-level performance is still very large. MCML researchers tackle a number of challenges: The first is that deep language understanding requires understanding the relationships between the words in a sentence; there are opportunities in addressing this problem by infusing deep learning models with structural biases, both newly designed ones and those inherited from the previous generation of NLP models. The second is that current models do not possess common sense; there is an opportunity here to create experimental environments in which multimodal models can learn about the world by interacting with it. The third is sample efficiency: NLP models are usually trained on very large training sets, and there is a vast discrepancy between what a truly intelligent being could learn from that much data on the one hand and what our current models actually manage to learn on the other.
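As a minimal sketch of how current deep NLP models score pairwise relationships between the words in a sentence, the following implements scaled dot-product self-attention over a toy "sentence" of random word vectors. All dimensions, projection matrices, and names are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model) word vectors; returns contextualized word vectors."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.t() / k.shape[1] ** 0.5  # (seq_len, seq_len) word-pair scores
    weights = F.softmax(scores, dim=-1)     # each word attends over every word
    return weights @ v                      # mix word vectors by attention weight

d = 16
x = torch.randn(5, d)                       # a 5-word "sentence" of random vectors
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)  # torch.Size([5, 16])
```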
The ability of an intelligent, mobile actor to understand its egomotion as well as its surroundings is a fundamental prerequisite for choosing which actions to take. However, vast challenges remain in achieving the necessary levels of safety, and these are deeply rooted in research that MCML aims to carry out: multi-sensor egomotion estimation and environment mapping, scene representations suitable for interaction in an open-ended environment, understanding and forecasting motion and events, and the role of uncertainty in ML blocks as modular elements.
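As a minimal sketch of one common way to expose uncertainty in a learned module, the following uses Monte Carlo dropout: dropout is kept active at prediction time, and the spread of repeated stochastic predictions serves as a per-input uncertainty score. The network shape, input size, and sample count are illustrative assumptions, not a specific MCML design.

```python
import torch
import torch.nn as nn

# A tiny regression network with dropout between its layers.
net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(32, 1))
net.train()  # keep dropout stochastic even while "predicting"

x = torch.randn(1, 4)  # e.g. a small sensor feature vector
with torch.no_grad():
    samples = torch.stack([net(x) for _ in range(100)])  # 100 stochastic passes

mean, std = samples.mean(), samples.std()  # std acts as a per-input uncertainty score
print(f"prediction {mean:.3f} ± {std:.3f}")
```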