Information retrieval and multimedia content access have a long history of comparative evaluation, and many of the advances in the area over the past decade can be attributed to the availability of open datasets that support comparative and repeatable experimentation. Sharing data and code so that other researchers can replicate results is needed in the multimedia modeling field; it will help improve both the performance of systems and the reproducibility of published papers.

This multimedia dataset track will be an opportunity for researchers and practitioners to make their work permanently available and citable in a single forum, as well as to increase the public awareness of their considerable efforts.

Researchers within the multimedia community are encouraged to submit their datasets, or papers related to dataset generation, to this track. Authors of dataset papers are asked to provide a paper describing the dataset's motivation, design, and usage, a brief summary of the experiments performed on it to date, and a discussion of how it can be useful to the community. The benefits for authors who successfully submit are:

  • Accepted contributions will be included in the conference proceedings.
  • Accepted contributions will be listed in a recognised index of multimedia datasets, thereby increasing their visibility.
  • Authors of accepted contributions will be invited to present their dataset as part of the special session programme at MMM2024.

Regarding the submission of a dataset, the authors should make it available by providing a URL for download and agree to the link being maintained on a dedicated MMM datasets site. All datasets must be licensed in such a manner that they can be legally and freely used, with all appropriate ethical and access approvals completed. Authors are encouraged to prepare appropriate and helpful documentation to accompany the dataset, including examples of how it can be used by the community, examples of successful usage, and restrictions on usage.

We may additionally accept position papers of high quality, which we believe can significantly impact multimedia datasets in the future, by addressing various aspects of dataset creation methodologies. We will prefer position papers that are backed up by recent results, which could be already published or appear first in the MDRE submission.

Authors do not need to anonymize their submission due to the inherent difficulty of doing so for open datasets.


  • Klaus Schöffmann, Klagenfurt University, Austria
  • Björn Þór Jónsson, Reykjavik University, Iceland
  • Cathal Gurrin, Dublin City University, Ireland
  • Duc-Tien Dang-Nguyen, University of Bergen, Norway
  • Liting Zhou, Dublin City University, Ireland

Multi-object multi-sensor tracking (MOMST) is a complex problem in computer vision and machine learning, involving the simultaneous tracking of multiple objects using data from multiple sensors. MOMST is essential in many applications, such as surveillance systems, autonomous vehicles, and robotics.

MOMST algorithms typically use a combination of sensor fusion and data association techniques to estimate the state of each object over time. Sensor fusion involves combining data from multiple sensors to obtain a more accurate and complete representation of the objects being tracked (as an enhancement of MOT challenges). Data association involves matching sensor measurements to tracks, which can be challenging in scenarios where there are occlusions, clutter, or sensor failures. One popular approach to MOMST is multiple hypothesis tracking (MHT), which maintains multiple possible tracks for each object and updates the probabilities of each track as new sensor measurements become available. Other approaches include the use of deep learning techniques, such as object detection and tracking with convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
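As a rough illustration of the two building blocks described above, the following minimal sketch fuses two noisy position readings by inverse-variance weighting and then matches measurements to tracks with gated, greedy nearest-neighbour association. The function names `fuse` and `associate`, the 2D position format, and the `gate` threshold are hypothetical choices made for this example. Greedy nearest-neighbour matching is a much simpler stand-in for multiple hypothesis tracking, not an implementation of it.

```python
from math import dist


def fuse(reading_a, var_a, reading_b, var_b):
    """Fuse two 2D position readings by inverse-variance weighting.

    Assumes each sensor reports the same object with independent noise
    of the given (scalar) variance. Returns (fused_position, fused_variance).
    """
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    fused = tuple(
        (w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(reading_a, reading_b)
    )
    return fused, 1.0 / (w_a + w_b)


def associate(tracks, measurements, gate):
    """Greedy nearest-neighbour data association with a distance gate.

    tracks: dict mapping track id -> last known (x, y) position.
    measurements: list of (x, y) positions from the current frame.
    Returns a dict mapping track id -> index of the matched measurement.
    Tracks with no measurement inside the gate are left unmatched.
    """
    # All candidate pairings, closest first.
    pairs = sorted(
        (dist(position, m), track_id, j)
        for track_id, position in tracks.items()
        for j, m in enumerate(measurements)
    )
    assigned, used = {}, set()
    for distance, track_id, j in pairs:
        if distance > gate:
            break  # every remaining pair is farther than the gate
        if track_id in assigned or j in used:
            continue  # each track and each measurement is matched once
        assigned[track_id] = j
        used.add(j)
    return assigned


if __name__ == "__main__":
    print(fuse((1.0, 2.0), 1.0, (3.0, 4.0), 1.0))
    tracks = {"a": (0.0, 0.0), "b": (5.0, 5.0)}
    frame = [(5.1, 5.0), (0.1, 0.0), (20.0, 20.0)]
    print(associate(tracks, frame, gate=1.0))
```

In this sketch the third measurement lies outside every gate and stays unassigned, which is how clutter or a newly appearing object would show up. A globally optimal assignment or a multi-hypothesis scheme would replace the greedy loop in a more realistic tracker.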

MOMST is a challenging problem due to the complexity of real-world environments and the limitations of sensor technology. Sensor measurements can be noisy, incomplete, and prone to errors, and objects can move in unpredictable ways. MOMST algorithms must be robust to handle these challenges and produce accurate and reliable results. Despite these challenges, MOMST has many important applications. In autonomous vehicles, MOMST is used to track other vehicles, pedestrians, and obstacles in the environment to ensure safe and efficient navigation. In robotics, MOMST is used to track objects in dynamic environments, such as warehouses or manufacturing facilities. In surveillance systems, MOMST is used to track people and vehicles in public spaces to prevent crime and enhance public safety.

In conclusion, MOMST is an important and challenging problem in computer vision and machine learning. Advances in sensor technology and algorithmic techniques have enabled significant progress in recent years, but there is still much work to be done to improve the accuracy and reliability of MOMST algorithms in complex real-world scenarios.


  • Mario Döller, University of Applied Sciences Kufstein Tirol, Austria
  • Ruben Tous, Universitat Politècnica de Catalunya (UPC), Spain

Georeferenced multimedia data, such as satellite images and videos, are a key resource for researchers and practitioners in fields such as Earth Observation, urban computing, and lifelogging. However, this type of data is often highly heterogeneous, distributed, and semantically fragmented, which presents significant challenges for effective analysis and retrieval. The emergence of deep learning and multimodal analytics provides an opportunity to overcome these challenges and unlock the full potential of georeferenced multimedia data. By leveraging the strengths of different data modalities, researchers can enhance the value of these datasets and gain insights that were previously impossible to obtain.

In this context, this special session invites papers in the area of multimodal analytics and retrieval that exploit spatial information which, when combined with other data modalities, enhances the value of the original data. Research areas such as cross-modal retrieval, image captioning, image generation, and visual question answering (VQA) are popular in multimedia, yet their potential has not been fully explored in the context of location-based services. It is crucial to deploy interpretable machine learning techniques to unlock the knowledge hidden in multimodal data with geospatial information, given that many of the problems studied in multimedia are NP-hard and must be approximated within the available computing power.

We believe that the MARGeM special session can serve as a joint venue for the different communities working on georeferenced data and its many applications, and thus propel the cross-fertilisation of ideas, methods and software between the communities.

This special session includes presentation of novel research within the following domains:

  • lifelog computing
  • urban computing
  • satellite computing and earth observation

Within these domains, the topics of interest include (but are not restricted to):

  • Multimodal analytics and retrieval techniques for georeferenced multimedia data
  • Deep learning and neural networks for interpretability, understanding, and explainability in artificial intelligence applied to georeferenced multimedia data
  • Cross-modal retrieval, image captioning, image generation, and visual question answering for location-based services
  • Satellite image analysis and retrieval
  • Semantically-aware approaches for handling highly heterogeneous, distributed and semantically fragmented georeferenced multimedia data
  • Interpretable machine learning techniques for unlocking hidden knowledge in big georeferenced multimedia data
  • Digital Twins based on georeferenced multimedia
  • Applications of georeferenced multimedia data in urban and lifelog computing
  • Big data analytics and visualization on GIS platforms for georeferenced multimedia data


  • Maria Pegia, Centre for Research and Technology Hellas, Information Technologies Institute, Greece
  • Ioannis Papoutsis, National Observatory of Athens, Greece
  • Ilias Gialampoukidis, Centre for Research and Technology Hellas, Information Technologies Institute, Greece
  • Björn Þór Jónsson, Department of Computer Science, Reykjavik University, Iceland
  • Stefanos Vrochidis, Centre for Research and Technology Hellas, Information Technologies Institute, Greece

Data has become a critical component of human life in the digital age, where it can be collected from various sources and in real time, providing valuable insights into our living environment. However, each of these data sources represents only a small piece of the larger puzzle of life. Therefore, the ability to collect and analyze data across multiple domains, modalities, and platforms is crucial to solving this puzzle faster. Recent research has focused on multimodal data analytics, but there is a lack of investigation into cross-data analysis and retrieval. This research direction includes cross-modal, cross-domain, and cross-platform data analysis and retrieval. For example, cross-modal retrieval systems use a textual query to look for images, the air quality index can be predicted using lifelogging images, and daily exercises and meals can help predict sleep quality.

To promote intelligent cross-data analytics and retrieval research and create a smarter, sustainable society, we invite submissions to a special article collection on "Intelligent Cross-Data Analysis and Retrieval." We welcome submissions from diverse research domains and disciplines, including well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing. Join us in exploring the exciting field of cross-data analysis and retrieval!

Example topics of interest include, but are not limited to:

  • Event-based cross-data retrieval
  • Data mining and AI technology
  • Complex event processing for dynamically linking sensor data from individuals and regions to broad areas
  • Transfer Learning and Transformers
  • Hypotheses development of the associations within the heterogeneous data
  • Realization of a prosperous and independent region in which people and nature coexist
  • Applications leveraging intelligent cross-data analysis for a particular domain
  • Cross-datasets for repeatable experimentation
  • Federated Analytics and Federated Learning for cross-data
  • Privacy-public data collaboration
  • Integration of diverse multimodal data


  • Minh-Son Dao, National Institute of Information and Communications Technology, Japan
  • Michael Alexander Riegler, Simula Metropolitan Center for Digital Engineering, Norway
  • Duc-Tien Dang-Nguyen, University of Bergen, Norway
  • Thanh-Binh Nguyen, University of Science, Vietnam National University in HCM City

Extended Reality and Multimedia: Advancing Content Creation and Interaction (XR-MACCI) special session at the Multimedia Modelling 2024 conference invites researchers, industry experts, and enthusiasts to explore the latest advancements in extended reality (XR) and multimedia technologies. This session will focus on the development and integration of XR solutions with multimedia analysis, retrieval and processing methods, emphasizing seamless and interactive experiences that transform the way we live, work, and interact with our surroundings.

The XR-MACCI 2024 special session will address the following key topics:

  • Next-Generation XR Technologies: Exploring cutting-edge solutions in virtual reality (VR), augmented reality (AR), and mixed reality (MR) that push the boundaries of immersive multimedia experiences.
  • Real-time 3D Modeling and Rendering: Investigating innovative techniques for creating realistic and dynamic 3D models and environments, enabling high-quality visuals and interactions in XR applications.
  • Adaptive and Interactive Content Delivery: Developing methods for optimizing and personalizing multimedia content based on user preferences, context, and device capabilities, ensuring a seamless XR experience.
  • AI for XR Content Creation: Utilizing artificial intelligence and machine learning for content analysis, understanding and retrieval to facilitate XR content generation.
  • AI-Driven Multimedia and XR Integration: Utilizing artificial intelligence and machine learning to enhance recognition and manipulation in XR environments, leading to more intuitive and engaging experiences.
  • Multisensory Interfaces and Wearable Technologies: Investigating the latest advancements in haptic feedback, gesture recognition, and sensory input/output devices that facilitate natural and immersive interactions with XR and multimedia content.


  • Claudio Gennaro, Information Science and Technologies Institute, National Research Council, Italy
  • Sotiris Diplaris, Information Technologies Institute, Centre for Research and Technology Hellas, Greece
  • Stefanos Vrochidis, Information Technologies Institute, Centre for Research and Technology Hellas, Greece
  • Heiko Schuldt, University of Basel, Switzerland
  • Werner Bailer, Joanneum Research, Austria

Foundation models (FMs), currently in the form of large language models (LLMs) and large vision language models (LVLMs), are reshaping the way in which multimedia content is generated, analyzed, interpreted, and retrieved. While the current cyberspace remains dominated by user generated content (UGC), it is likely to be outnumbered by artificial intelligence generated content (AIGC) in the near future, with multimodal FMs as the driving force for such a seismic change.

Prompt engineering (PE) techniques are being actively developed for harnessing FMs for open-ended multimedia content recognition and interpretation. It can also be anticipated that, in contrast to the current video search engine, which answers a user's query by ranking existing videos by their relevance to the query, a next-generation video search engine will work in a ranking-and-generation manner.

Recognizing that FMs are essential to the future of multimedia computing, the remaining question is how the multimedia community should meet this new future. Given that both the development of FMs and their application to multimedia are still at an early stage, we believe a special session on "Foundation Models for Multimedia" (FMM) will be a very timely reflection of the latest developments on the topic and will hopefully provide a partial answer to the question.


  • Xirong Li, Renmin University of China, China
  • Zhineng Chen, Fudan University, China
  • Xing Xu, University of Electronic Science and Technology of China, China
  • Symeon (Akis) Papadopoulos, Centre for Research and Technology Hellas, Greece
  • Jing Liu, Chinese Academy of Sciences, China

The increasing interest in advanced and human-like conversational systems, along with the rise of various digital communications channels such as social media, intelligent agents, and chatbots, leads to a pressing need to enhance their capabilities. However, traditional chatbots and virtual assistants have intrinsic limitations in their ability to engage users in natural and intuitive conversations, especially when involving different sources of multimedia and multimodal information. Incorporating multimedia and multimodality, such as visual and audio cues, into such conversational systems can lead to a better understanding of users' needs and intentions and provide a significantly improved user experience.

At the same time, recent advancements in Large Language Models (LLMs), such as OpenAI's ChatGPT, Google's Bard, and Stanford's Alpaca, have opened up new opportunities to improve the performance of these systems by generating more natural and coherent responses. The integration of multimodality in conversational systems, paired with the significant capabilities of LLMs, is also an area of increasing research interest since it allows for more natural, intuitive, and engaging conversations.

This special session aims to present the most recent works and applications for addressing the challenges and opportunities in developing multimedia and multimodality-enabled conversational systems and chatbots. Indicative domains of application include healthcare, education, immigration, customer service, finance and others.

Topics of interest include, but are not limited to, the following:

  • Multimodal and multimedia open-ended chatbots
  • Multimodal and multimedia task-oriented chatbots
  • Novel architectures for multimodal chatbots and conversational systems
  • Context-aware technologies for multimodal chatbots with Transformers and Large Language Models (LLMs)
  • Multimodal query processing and understanding in conversational systems
  • Integration of multimodal knowledge graphs in conversational systems
  • Knowledge distillation techniques for transferring knowledge from LLMs to smaller models for multimodal chatbots
  • Chatbots and conversational systems in healthcare and the medical domain
  • Multimodal data fusion for various applications, including education, immigration, customer service, etc.
  • Evaluation of multimodal conversational systems, including evaluation metrics for measuring the coherence and fluency of the generated responses
  • Evaluation of the user experience, including evaluation metrics for measuring user satisfaction, engagement, and system usability


  • Thanassis Mavropoulos, Centre for Research and Technology Hellas, Information Technologies Institute, Greece
  • Georgios Meditskos, School of Informatics, Aristotle University of Thessaloniki, Greece
  • Stefanos Vrochidis, Centre for Research and Technology Hellas, Information Technologies Institute, Greece

Cultural AI focuses on developing systems that can deal with the complexities of human culture, thereby improving applications to cultural data and enhancing AI systems' ability to deal with cultural complexities. An increasingly clear insight is that many of the complexities of culture are highly contextualised, and that this context often expresses itself in a multimodal manner. For instance, analysis of widely published iconic images (e.g., napalm girl, migrant mother) includes the image itself, the contexts in which it has been published, and how it has been received [1]. Similarly, the prominence of linked data representations of cultural data is opening up new possibilities for enriching datasets and visual content analysis [2]. As such, we argue that a cultural perspective is inherently a multimedia perspective, and whilst applications of multimedia systems to cultural data have always been at home in the multimedia community [3,4], the complexities that come from analysing culture (e.g., [5]) have been insufficiently foregrounded.

With this special session we aim to bring together experts from Cultural AI and Multimedia to discuss the challenges surrounding cultural data as well as the complexities of human culture. Additionally, we aim to demonstrate that culture is more than an aesthetically pleasing testbed for multimedia systems, and that culture offers new challenges that require multimedia solutions. In addition to technical papers, we also explicitly invite high-quality position papers on topics related to Cultural AI, or that highlight cultural challenges which require multimedia solutions. We will prioritise position papers that are supported by recently published results or by preliminary findings that are described for the first time in the submitted paper.


  • Nanne van Noord, University of Amsterdam, Netherlands
  • Melvin Wevers, University of Amsterdam, Netherlands
  • Stuart James, Italian Institute of Technology & UCL Centre for Digital Humanities, Italy
  • Cynthia Liem, TU Delft, Netherlands
  • Victor de Boer, Vrije Universiteit Amsterdam, Netherlands