Vision and Language
Vision-language models (VLMs) are multimodal architectures that jointly interpret image and text inputs. Building on computer vision (CV) and natural language processing (NLP) models, they align information from the visual and linguistic domains, typically by mapping both modalities into a shared embedding space. Our research refines this synergy: we integrate scene graphs for contextual depth, apply captioning techniques to produce detailed descriptions, advance visual question answering for more nuanced understanding, and enhance VLM capabilities through instruction tuning and synthetic data augmentation built on foundation models.
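The cross-modal alignment described above can be sketched in miniature. The snippet below is a toy illustration, not any particular model: the embedding vectors are hand-made stand-ins for the outputs of learned image and text encoders, and the dimensionality is far smaller than in practice. It shows the core idea shared by contrastive VLMs such as CLIP: project both modalities into one space and score image-caption pairs by cosine similarity.

```python
import numpy as np

# Hypothetical pre-computed embeddings for 2 images and 2 captions,
# each already projected into a shared 4-dimensional space (real VLMs
# use learned encoders and hundreds of dimensions).
image_emb = np.array([[0.9, 0.1, 0.0, 0.1],
                     [0.0, 0.2, 0.9, 0.1]])
text_emb = np.array([[0.8, 0.2, 0.1, 0.0],    # caption for image 0
                     [0.1, 0.1, 0.8, 0.2]])   # caption for image 1

def l2_normalize(x):
    # Normalize each row so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Similarity matrix: entry (i, j) scores image i against caption j.
sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T

# Retrieval: each image should match its own caption most strongly.
best_caption = sim.argmax(axis=1)
print(best_caption)  # -> [0 1]
```

In a trained VLM the encoders are optimized (e.g. with a contrastive loss) so that matching image-caption pairs land close together in this shared space, which is what makes downstream tasks like captioning and visual question answering possible.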