Vision and Language

Vision-language models (VLMs) are multimodal architectures that jointly comprehend image and text data. Building on computer vision (CV) and natural language processing (NLP) models, these architectures align information, such as embeddings, across the visual and linguistic domains. Our research refines this synergy by integrating scene graphs for contextual depth, deploying captioning techniques for detailed descriptions, advancing visual question answering for nuanced understanding, and enhancing the capabilities of VLMs through instruction tuning and synthetic data augmentation that leverages foundation models.
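The core idea of aligning embeddings across the visual and linguistic domains can be illustrated with a minimal sketch. The snippet below is not our method or any specific model's implementation; it assumes hypothetical pre-computed image and caption embeddings in a shared space and shows how cosine similarity matches each image to its caption, as in contrastive VLMs.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so that dot products
    # equal cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical embeddings: 3 images and 3 captions in a shared 8-d space.
# In a real VLM these would come from an image encoder and a text encoder;
# here we construct them so that image i corresponds to caption i.
base = rng.normal(size=(3, 8))
image_emb = l2_normalize(base + 0.05 * rng.normal(size=(3, 8)))
text_emb = l2_normalize(base + 0.05 * rng.normal(size=(3, 8)))

# Pairwise cosine similarity between every image and every caption.
similarity = image_emb @ text_emb.T

# Image-to-text retrieval: pick the best-matching caption for each image.
best_caption = similarity.argmax(axis=1)
print(best_caption.tolist())  # matched caption indices
```

In trained VLMs, the encoders are optimized (e.g. with a contrastive objective) so that matching image-text pairs end up close in this shared space, which is what makes such retrieval by similarity work.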

Experts

Gopika Sudhakaran, M.Sc.

Jan-Martin Steitz, M.Sc.
+49 6151 16-21424
S2|02 A326

Simone Schaub-Meyer, Dr. sc.
+49 6151 16-25411
S2|02 A306

Prof. Stefan Roth, Ph.D.
+49 6151 16-21425
S2|02 A304