My research focuses on developing Structured Physical Intelligence: enabling agents to perceive, understand, and interact with the physical world by grounding learning in embedded structures and compositional priors that reflect how the real world works, as part of a broader effort to advance Multimodal Foundation Models. This research bridges robotics with core areas of AI, including vision, language, and compositionality. Specifically, I explore how structured and compositional representations can be integrated into learning-based models, such as robotic foundation models, to support generalization, efficient task adaptation, and scalable pretraining, while offering a principled way to overcome the data scarcity and annotation bottlenecks that constrain large-scale robotic learning. This work draws on methods such as vision-language-action models, object-centric representations, affordances, object interactions, compositional reasoning, and particle-based motion representations. Ultimately, my goal is to build physically intelligent systems that can reason and adapt across a wide range of tasks, environments, and embodiments.