I am a Postdoctoral Scholar at the Berkeley AI Research (BAIR) Lab and a Research Scientist at the MIT-IBM Watson AI Lab. I work with Prof. Trevor Darrell on Structured Physical AI and collaborate closely with Profs. Jitendra Malik, Deva Ramanan, and Shankar Sastry.
I design architectures and learning algorithms for Physical AI, grounded in structural priors. My work develops physical inductive biases and structured world representations for general-purpose agents, using robots as the ultimate testbed.
Previously, I received my Ph.D. from Tel Aviv University, advised by Prof. Amir Globerson; I have an Erdős number of 3. I graduated magna cum laude with M.Sc. (CS), B.Sc. (CS), and B.Sc. (Physics).
Award: Received the 2023 IAAI Best PhD Thesis Award for the outstanding thesis in Artificial Intelligence in Israel.
New: Looking for motivated PhD students, Postdocs, RAs, Masters, and Undergrads to collaborate and publish in top-tier conferences on Robotics & AI.
Open: Currently on the academic job market.
Selected recent work in vision and robotics.
Technical Report 2026
We introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks.
CVPR (Findings) 2026
We developed a generalist tracking policy that enables a humanoid to mimic human actions from noisy, generated videos in a zero-shot manner.
ICLR 2026
Training robots on random toys enables zero-shot grasping of real-world objects.
EMNLP 2025
IVA is a unified framework for Vision-Language-Action models that detects false-premise instructions, clarifies them in language, and acts safely — improving robustness while preserving task performance.
ICML 2025
We propose ARM4R, an Autoregressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better robotic model.
ICRA 2025
We introduce RoboPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through in-context learning (ICL), without any training.
CoRL 2024
We propose LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic configurations and introduces the concept of visual traces to further align the vision and action spaces.
Selected recent work in vision and language, with a particular focus on multimodal foundation models.
CVPR 2026
We introduce LIVR (Latent Implicit Visual Reasoning), a framework for performing visual reasoning in the latent representation space of vision-language models.
ICLR 2026
We introduce DAVE, a vision encoder purpose-built for VLMs and tailored for document understanding and web agents.
ICCV 2025
SAVs is a finetuning-free method that leverages sparse attention head activations (fewer than 5% of heads) in LMMs as powerful feature representations for vision-language classification tasks, achieving state-of-the-art performance compared to both few-shot and finetuned baselines.
Technical Report 2025
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding.
NeurIPS 2024
We demonstrate the existence of multimodal task vectors (compact implicit representations of many-shot in-context examples, compressed in the model's attention heads) and leverage them for many-shot in-context learning in LMMs.
EMNLP 2024
We present TraveLER, a modular multi-LMM agent framework for video question-answering that does not require task-specific fine-tuning or annotations. Through interactive question-asking using several agents with different roles, our framework answers questions by collecting relevant information from keyframes.
CVPR 2024
We propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes scene graph representations to extract compositional knowledge from an LMM.
EMNLP 2023
We improve pretrained VLMs, which are usually trained on large-scale image-text pairs, with a specialized model architecture and a new training method that uses a small set of scene graph annotations from the Visual Genome dataset; these annotations are richer and reflect structured visual and textual information.
NeurIPS 2023 (Spotlight)
We propose a fine-tuning approach that automatically addresses two factors limiting VL models' compositional reasoning performance: (i) the caption quality, or "image alignment", of the texts; and (ii) the caption density, which refers to the number of details that appear in the image.
CVPR 2023
We demonstrate language augmentation techniques for teaching language structure to VL models.
NeurIPS 2022
Winner of the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge, 2022
We present SViT (Structured Video Tokens), a model for downstream video tasks that utilizes the structure of a small set of images, from within or outside the domain of interest, that are available only during training.
CVPR 2022
We present ORViT, an object-centric approach that extends video transformer layers with a block that directly incorporates object representations.
ICML 2021
We introduce the formalism of Action Graphs, a natural and convenient structure representing the dynamics of actions between objects over time. We show we can synthesize goal-oriented videos on the CATER and Something-Something datasets and generate novel compositions of unseen actions.
ECCV 2020
We present a novel model that inherently learns canonical graph representations and shows better robustness to graph size, adversarial attacks, and semantic equivalence, generating superior images of complex visual scenes.
CVPR 2020
We propose a novel compositional action recognition task in which the training combinations of verbs and nouns do not overlap with those in the test set. We show the effectiveness of our approach on the proposed compositional task and on a few-shot compositional setting that requires the model to generalize across both object appearance and action category.
ICCV Workshop on Autonomous Driving 2019 (Oral)
We propose a latent inter-object graph representation for activity recognition that explores the visual interaction between the objects in a self-supervised manner.
CVPR 2019
We collect a new SKU-110K dataset which takes detection challenges to unexplored territories, and propose a novel mechanism to learn deep overlap rates for each detection.
NeurIPS 2018
We propose a novel permutation-invariant graph network for mapping images to scene graphs, achieving state-of-the-art results on the Visual Genome dataset.
If you are a student interested in collaborating on research projects, please reach out.
Partial list of invited talks and presentations.