VISION AND LANGUAGE: FROM CAPTIONING TO EMBODIED AI

ABSTRACT
Recent progress in the Computer Vision and Natural Language Processing communities has made it possible to connect Vision and Language in a variety of tasks at the intersection of Vision, Language, and Embodied AI. These tasks range from generating meaningful descriptions of images to answering questions and navigating agents in unseen environments via natural language instructions. This integration has grown to the point that it is becoming pervasive in the literature and a fundamental tool for developing AI algorithms. This tutorial will give a comprehensive guide through these advancements and will provide the technical knowledge needed to develop the next Vision and Language algorithms, including state-of-the-art techniques for generating text from images and videos, for treating sequences (Attention, the Transformer paradigm, training with Reinforcement Learning), and for training massive models for cross-modal retrieval and one-shot prediction (e.g. CLIP). It will discuss how these approaches can be used on embodied agents that interact with the physical world, for navigation and for other embodied tasks. A particular focus will be given to training on large-scale data and to the upcoming challenges in the field.

SPEAKERS

Lorenzo Baraldi
Lorenzo Baraldi is a Tenure-Track Assistant Professor at the University of Modena and Reggio Emilia. He works under the supervision of Prof. Rita Cucchiara on Deep Learning, video analysis and Multimedia, and teaches courses on Computer Vision, AI for Automotive and Image Processing. His research has covered Egocentric Vision and Gesture Recognition, Temporal Video Segmentation and Retrieval, Saliency, Video Captioning, Visual-Semantic alignment and Embodied AI. He is the author of more than 70 publications in international journals and conferences, and an Associate Editor of Pattern Recognition Letters and of Frontiers in Artificial Intelligence. He is an elected Scholar of the ELLIS society, the European Laboratory for Learning and Intelligent Systems. Since 2021, he has served as deputy director of the Interdepartmental Centre on Digital Humanities of the University of Modena and Reggio Emilia. In 2017, he worked in the Facebook AI Research laboratory in Paris, under the supervision of Hervé Jégou, where he developed the video copy detection algorithm that currently runs in production on the social network.

Marcella Cornia
Marcella Cornia received the M.Sc. degree in Computer Engineering and the Ph.D. degree in Information and Communication Technologies from the University of Modena and Reggio Emilia in 2016 and 2020, respectively. In 2020 she received the Young Researcher Award for the category “Artificial Intelligence and Big Data” from “Gruppo 2003 per la Ricerca Scientifica”, and in 2021 she received the best Ph.D. thesis award from CVPL, the Italian Association for Computer Vision, Pattern Recognition and Machine Learning. She is currently a Tenure-Track Assistant Professor at the Department of Education and Humanities of the University of Modena and Reggio Emilia. She has authored or coauthored more than 40 publications in scientific journals and international conference proceedings and serves as Associate Editor for IEEE Robotics and Automation Letters (RA-L). Her research interests include image captioning, embodied AI, and computer vision solutions applied to the fashion domain. She has experience organizing workshops and tutorials: she co-organized the “Women in ICPR” workshop at ICPR 2020 and the “Vision, Language and Action: from Captioning to Embodied AI” tutorial at ICIAP 2019.

Silvia Cascianelli
Silvia Cascianelli received her Ph.D. degree cum laude in Information and Industrial Engineering from the University of Perugia in 2019. She is currently an Assistant Professor with the Department of Engineering “Enzo Ferrari” and a member of the Interdepartmental Centre for Digital Humanities (DHMoRe), University of Modena and Reggio Emilia. She works under the supervision of Prof. Rita Cucchiara on Deep Learning and Computer Vision. She has authored or coauthored more than 30 publications in scientific journals and international conference proceedings and serves as Associate Editor for the IEEE Robotics and Automation Letters. Her research interests include embodied AI, multimedia, and computer vision for cultural heritage and digital humanities applications.