Vision and Language
Vision-language models (VLMs) are multimodal architectures that jointly interpret image and text inputs. Building on computer vision (CV) and natural language processing (NLP) models, they align information from the visual and linguistic domains, typically by mapping both modalities into a shared embedding space. Our research refines this synergy: we integrate scene graphs for contextual depth, apply captioning techniques to produce detailed descriptions, advance visual question answering for more nuanced understanding, and enhance VLM capabilities through instruction tuning and synthetic data augmentation built on foundation models.
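The cross-modal alignment described above can be sketched in miniature. The snippet below is a toy illustration, not any particular model: the embedding vectors are hand-made stand-ins for the outputs of learned image and text encoders, and the dimensionality is far smaller than in practice. It shows the core idea shared by contrastive VLMs such as CLIP: project both modalities into one space and score image-caption pairs by cosine similarity.

```python
import numpy as np

# Hypothetical pre-computed embeddings for 2 images and 2 captions,
# each already projected into a shared 4-dimensional space (real VLMs
# use learned encoders and hundreds of dimensions).
image_emb = np.array([[0.9, 0.1, 0.0, 0.1],
                     [0.0, 0.2, 0.9, 0.1]])
text_emb = np.array([[0.8, 0.2, 0.1, 0.0],    # caption for image 0
                     [0.1, 0.1, 0.8, 0.2]])   # caption for image 1

def l2_normalize(x):
    # Normalize each row so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Similarity matrix: entry (i, j) scores image i against caption j.
sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T

# Retrieval: each image should match its own caption most strongly.
best_caption = sim.argmax(axis=1)
print(best_caption)  # -> [0 1]
```

In a trained VLM the encoders are optimized (e.g. with a contrastive loss) so that matching image-caption pairs land close together in this shared space, which is what makes downstream tasks like captioning and visual question answering possible.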