I work at the intersection of Computer Vision and Machine Learning, with a focus on learning from limited supervision. My research is grounded in self-supervised learning, 3D vision, and few-shot and zero-shot learning, with the goal of building data-efficient and generalizable visual representations. Within this broader agenda, my recent work focuses on action-conditioned video world models. In particular, I explore how language can act as a structured supervisory signal for video representation learning across three settings: perception (understanding actions and states in videos), prediction (forecasting future visual outcomes conditioned on actions), and planning (inferring the intermediate action sequences required to transition from an initial state to a goal state).