Roei Herzig
Hi there! I am Roei, a Postdoctoral Scholar in the Berkeley AI Research Lab (BAIR) at UC Berkeley,
working with Prof. Trevor Darrell.
Additionally, I recently worked under the supervision of Prof. Amir Globerson at Tel Aviv University between April 2019 and April 2023.
I'm also affiliated as a research scientist at IBM Research AI.
Previously, I graduated magna cum laude from Tel Aviv University
with an MSc (CS), a BSc (CS), and a BSc (Physics), and worked as a Machine Learning & Deep Learning researcher
at Nexar and Trax Image Recognition for 5 years.
I was awarded the 2023 IAAI Best PhD thesis award for the outstanding thesis in the field of Artificial Intelligence in Israel.
I'm looking for strong Master's students and senior undergraduates to collaborate and publish in top-tier conferences
on Vision & Language, Vision & Robotics, and Video Understanding.
Email / Twitter / GitHub / LinkedIn / CV / Google Scholar
|
|
Research
Compositionality is a fundamental aspect of human cognition and language understanding.
At its core, it is the ability to decompose complex entities or concepts into simpler,
more basic components, and then combine these components in various ways in order to create more complex entities.
Despite their impressive capabilities, AI models are limited in this area,
and they have difficulty solving tasks that require generalization beyond the training distribution.
My research goal is to build compositionality into intelligent machines in order to
improve robustness and generalization across a wide range of fields, such as vision, language, and robotics.
|
Personal
I'm a proud father of Adam and Liam and happily married to Esti,
my amazing wife and my partner for life.
When I'm not working, I'm also a history buff and love learning about science, politics, music, and the two World Wars.
|
|
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Brandon Huang*,
Chancharik Mitra*,
Assaf Arbelle, Leonid Karlinsky,
Trevor Darrell,
Roei Herzig
Technical Report, 2024
bibtex
We demonstrate the existence of multimodal task vectors, compact implicit representations of many-shot in-context examples
compressed in the model's attention heads, and leverage them for many-shot in-context learning in LMMs.
|
|
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Chuyi Shang*,
Amos You*,
Sanjay Subramanian,
Trevor Darrell,
Roei Herzig
Technical Report, 2024
bibtex
We present TraveLER, a modular multi-LMM agent framework for video question-answering
that does not require task-specific fine-tuning or annotations.
Through interactive question-asking using several agents with different roles,
our framework aims to answer the question by collecting relevant information from keyframes.
|
|
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu*,
Yuvan Sharma*, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi,
Trevor Darrell†,
Roei Herzig†
Conference on Robot Learning (CoRL), 2024
project page /
code /
bibtex
We propose LLARVA, a model trained with a novel instruction tuning method that
leverages structured prompts to unify a range of robotic configurations
and introduces the concept of visual traces to further align the vision and action spaces.
|
|
Recursive Visual Programming
Jiaxin Ge,
Sanjay Subramanian,
Baifeng Shi,
Roei Herzig,
Trevor Darrell
Proceedings of the European Conference on Computer Vision (ECCV), 2024
code /
bibtex
We present Recursive Visual Programming (RVP), a new approach to visual programming that simplifies generated routines,
allows more efficient problem solving, and manages more complex data structures.
|
|
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra,
Brandon Huang,
Trevor Darrell,
Roei Herzig
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024
code /
bibtex
We propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting
method that utilizes scene graph representations in order to extract compositional knowledge from an LMM.
|
|
U2Seg: Unsupervised Universal Image Segmentation
Dantong Niu*,
Xudong Wang*,
Xinyang Han*,
Long Lian,
Roei Herzig,
Trevor Darrell
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024
code /
bibtex
We present U2Seg, a unified framework for Unsupervised Universal image Segmentation
that consistently outperforms previous state-of-the-art methods.
|
|
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig*,
Alon Mendelson*,
Leonid Karlinsky,
Assaf Arbelle,
Rogerio Feris,
Trevor Darrell,
Amir Globerson
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
project page /
code /
bibtex
We propose to improve pretrained VLMs, which are usually trained on large-scale image-text pairs,
by designing a specialized model architecture and a new training method that utilizes
a small set of scene graph annotations from the Visual Genome dataset
that are richer and reflect structured visual and textual information.
|
|
Dense and Aligned Captions Promote Compositional Reasoning in VL Models
Sivan Doveh,
Assaf Arbelle,
Sivan Harary,
Roei Herzig,
Donghyun Kim,
Paola Cascante-Bonilla,
Amit Alfassy,
Rameswar Panda,
Raja Giryes,
Rogerio Feris,
Shimon Ullman,
Leonid Karlinsky
Advances in Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
project page /
code /
bibtex
We propose a fine-tuning approach for automatically treating two factors limiting VL models’ compositional reasoning performance:
(i) the caption quality, or in other words 'image alignment', of the texts;
and (ii) the level of caption density, which refers to the number of details that appear in the image.
|
|
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig*,
Ofir Abramovich*,
Elad Ben-Avraham,
Assaf Arbelle,
Leonid Karlinsky,
Ariel Shamir,
Trevor Darrell,
Amir Globerson
Winter Conference on Applications of Computer Vision (WACV), 2024
project page /
code /
bibtex
We present PromptonomyViT, a model that leverages a multi-task prompt learning
approach for video transformers, where a shared transformer backbone
is enhanced with task-specific prompts.
|
|
Teaching Structured Vision & Language Concepts to Vision & Language Models
Sivan Doveh,
Assaf Arbelle,
Sivan Harary,
Rameswar Panda,
Roei Herzig,
Eli Schwartz,
Donghyun Kim,
Raja Giryes,
Rogerio Feris,
Shimon Ullman,
Leonid Karlinsky
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023
code /
bibtex
We demonstrate language augmentation techniques for teaching language
structure to VL models.
|
|
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Elad Ben-Avraham,
Roei Herzig,
Karttikeya Mangalam,
Amir Bar,
Anna Rohrbach,
Leonid Karlinsky,
Trevor Darrell,
Amir Globerson
Advances in Neural Information Processing Systems (NeurIPS), 2022
Winner of the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge, 2022
project page /
code /
bibtex
We present SViT (for Structured Video Tokens), a model that utilizes the structure
of a small set of images, whether they are within or outside the domain of interest,
available only during training for a video downstream task.
|
|
FETA: Towards Specializing Foundational Models for Expert Task Applications
Amit Alfassy,
Assaf Arbelle,
Oshri Halimi,
Sivan Harary,
Roei Herzig,
Eli Schwartz,
Rameswar Panda,
Michele Dolfi,
Christoph Auer,
Kate Saenko,
Peter Staar,
Rogerio Feris,
Leonid Karlinsky
NeurIPS Datasets and Benchmarks, 2022
We present FETA, a novel benchmark for evaluating and improving
the performance of foundation vision and language models on expert data tasks,
such as technical document understanding.
|
|
Object-Region Video Transformers
Roei Herzig,
Elad Ben-Avraham,
Karttikeya Mangalam,
Amir Bar,
Gal Chechik,
Anna Rohrbach,
Trevor Darrell,
Amir Globerson
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
project page /
code /
bibtex
We present ORViT, an object-centric approach that extends video transformer layers with a block that
directly incorporates object representations.
|
|
Unsupervised Domain Generalization by Learning a Bridge Across Domains
Sivan Harary*, Eli Schwartz*, Assaf Arbelle*, Peter Staar, Shady Abu-Hussein,
Elad Amrani,
Roei Herzig,
Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi,
Kate Saenko, Rogerio Feris, Leonid Karlinsky
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
We present a novel self-supervised cross-domain learning method based on semantically aligning
all the domains to a common BrAD domain, a learned auxiliary bridge domain represented as an edge map with image-to-image mappings.
|
|
DETReg: Unsupervised Pretraining with Region Priors for Object Detection
Amir Bar,
Xin Wang,
Vadim Kantorov,
Colorado J Reed,
Roei Herzig,
Gal Chechik,
Anna Rohrbach,
Trevor Darrell,
Amir Globerson
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
project page /
code /
video
Pretraining transformers to localize potential objects improves object detection.
|
|
Compositional Video Synthesis with Action Graphs
Amir Bar*,
Roei Herzig*,
Xiaolong Wang,
Anna Rohrbach,
Gal Chechik,
Trevor Darrell,
Amir Globerson
International Conference on Machine Learning (ICML), 2021
project page /
code /
slides /
bibtex
We introduce the formalism of Action Graphs,
a natural and convenient structure representing the dynamics of actions between objects over time.
We show that we can synthesize goal-oriented videos on the CATER and Something-Something datasets
and generate novel compositions of unseen actions.
|
|
Learning Canonical Representations for Scene Graph to Image Generation
Roei Herzig*,
Amir Bar*,
Huijuan Xu,
Gal Chechik,
Trevor Darrell,
Amir Globerson
Proceedings of the European Conference on Computer Vision (ECCV), 2020
project page /
code /
bibtex
We present a novel model that can inherently learn canonical graph representations and show better
robustness to graph size, adversarial attacks, and semantic equivalence,
thus generating superior images of complex visual scenes.
|
|
Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks
Joanna Materzynska,
Tete Xiao,
Roei Herzig,
Huijuan Xu*,
Xiaolong Wang*,
Trevor Darrell*
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020
project page /
code /
dataset /
bibtex
We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set.
We show the effectiveness of our approach on the proposed compositional task and
a few-shot compositional setting which requires the model to generalize across both object appearance and action category.
|
|
Differentiable Scene Graphs
Moshiko Raboh*,
Roei Herzig*,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Winter Conference on Applications of Computer Vision (WACV), 2020
code /
bibtex
We propose an intermediate “graph-like” representation (DSGs) that can be learned in an end-to-end manner
from the supervision for a downstream visual reasoning task, achieving new state-of-the-art results
on the Referring Relationships task.
|
|
Spatio-Temporal Action Graph Networks
Roei Herzig*, Elad Levi*,
Huijuan Xu*,
Hang Gao,
Eli Brosh,
Xiaolong Wang,
Amir Globerson,
Trevor Darrell
Workshop on Autonomous Driving at ICCV, 2019 (Oral)
code /
bibtex
We propose a latent inter-object graph representation for activity recognition
that explores the visual interaction between the objects in a self-supervised manner.
|
|
Precise Detection in Densely Packed Scenes
Eran Goldman*,
Roei Herzig*, Aviv Eisenschtat*,
Jacob Goldberger,
Tal Hassner
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019
code /
dataset /
bibtex
We collect a new SKU-110K dataset which takes detection challenges to unexplored territories,
and propose a novel mechanism to learn deep overlap rates for each detection.
|
|
Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Roei Herzig*, Moshiko Raboh*,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Advances in Neural Information Processing Systems (NeurIPS), 2018
code /
bibtex
We propose a novel invariant graph network for mapping images to scene graphs using the permutation-invariance
property, achieving new state-of-the-art results on the Visual Genome dataset.
|
Towards Compositionality in Large Multimodal Models
(Fall 2023 BAIR Visual AI Workshop), December 2023.
Towards Compositionality in Video Understanding
(Israeli Vision Day), January 2023.
NeurIPS 2022 Highlights
(TAU Fundamentals of AI, Tel Aviv University), December 2022.
Towards Compositionality in Video Understanding
(Vision and AI Seminar, Weizmann Institute of Science), December 2022.
Towards Compositionality in Video Understanding
(Israeli Association for Artificial Intelligence Conference 2022), June 2022.
ORViT: Object-Region Video Transformers
(BAIR Visual Computing Workshop), March 2022.
Towards Compositionality in Video Understanding
(IMVC 2021), Oct 2021.
Towards Compositionality in Video Understanding, presented by Prof. Trevor Darrell
(ICCV21 SRVU Workshop), Oct 2021.
Compositional Video Synthesis with Action Graphs
(Israel Vision Day), Dec 2020.
Learning Canonical Representations for Scene Graph to Image Generation
(BAIR Fall Seminar), Aug 2020.
Compositional Video Synthesis with Action Graphs
(Israeli Geometric Deep Learning), Aug 2020.
Structured Semantic Understanding for Videos and Images
(Advanced Seminar in Computer Graphics at TAU), Jun 2020.
|