Research
My research goal is to build compositionality into intelligent machines to improve robustness and generalization
across multiple domains, such as vision, language, and robotics.
I believe our understanding of the world is naturally hierarchical and structured,
and intelligent machines will need to develop a compositional understanding that is similarly robust and generalizable.
However, many existing vision architectures are not compositional,
so I aim to design compositional models that build structural inductive biases into
our architectures and thereby generalize well across a variety of tasks.
|
Personal
I'm a proud father of Adam and Liam and happily married to Esti,
my amazing wife and my partner for life.
When I'm not working, I'm a history buff and love learning about science, politics, music, and the two World Wars.
|
|
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Elad Ben-Avraham,
Roei Herzig,
Karttikeya Mangalam,
Amir Bar,
Anna Rohrbach,
Leonid Karlinsky,
Trevor Darrell,
Amir Globerson
Tech report, 2022
Winner of the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge, 2022
project page /
code
We present SViT (Structured Video Tokens), a model that leverages the scene structure
of a small set of images, drawn from either within or outside the domain of interest
and available only during training, to improve a downstream video task.
|
|
Object-Region Video Transformers
Roei Herzig,
Elad Ben-Avraham,
Karttikeya Mangalam,
Amir Bar,
Gal Chechik,
Anna Rohrbach,
Trevor Darrell,
Amir Globerson
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
project page /
code /
bibtex
We present ORViT, an object-centric approach that extends video transformer layers with a block that
directly incorporates object representations.
|
|
Unsupervised Domain Generalization by Learning a Bridge Across Domains
Sivan Harary*, Eli Schwartz*, Assaf Arbelle*, Peter Staar, Shady Abu-Hussein,
Elad Amrani,
Roei Herzig,
Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi,
Kate Saenko, Rogerio Feris, Leonid Karlinsky
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
We present a novel self-supervised cross-domain learning method based on semantically aligning
all domains to a common BrAD domain, a learned edge-map-like auxiliary bridge domain, together with image-to-image mappings into it.
|
|
DETReg: Unsupervised Pretraining with Region Priors for Object Detection
Amir Bar,
Xin Wang,
Vadim Kantorov,
Colorado J Reed,
Roei Herzig,
Gal Chechik,
Anna Rohrbach,
Trevor Darrell,
Amir Globerson
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022
project page /
code /
video
Pretraining transformers to localize potential objects improves object detection.
|
|
Compositional Video Synthesis with Action Graphs
Amir Bar*,
Roei Herzig*,
Xiaolong Wang,
Anna Rohrbach,
Gal Chechik,
Trevor Darrell,
Amir Globerson
International Conference on Machine Learning (ICML), 2021
project page /
code /
slides /
bibtex
We introduce the formalism of Action Graphs,
a natural and convenient structure for representing the dynamics of actions between objects over time.
We show that our model can synthesize goal-oriented videos on the CATER and Something-Something datasets
and generate novel compositions of unseen actions.
|
|
Learning Canonical Representations for Scene Graph to Image Generation
Roei Herzig*,
Amir Bar*,
Huijuan Xu,
Gal Chechik,
Trevor Darrell,
Amir Globerson
Proceedings of the European Conference on Computer Vision (ECCV), 2020
project page /
code /
bibtex
We present a novel model that inherently learns canonical graph representations, showing better
robustness to graph size, adversarial attacks, and semantically equivalent graphs,
and thus generating superior images of complex visual scenes.
|
|
Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks
Joanna Materzynska,
Tete Xiao,
Roei Herzig,
Huijuan Xu*,
Xiaolong Wang*,
Trevor Darrell*
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020
project page /
code /
dataset /
bibtex
We propose a novel compositional action recognition task in which the training combinations of verbs and nouns do not overlap with the test set.
We show the effectiveness of our approach on the proposed compositional task and on
a few-shot compositional setting, which requires the model to generalize across both object appearance and action category.
|
|
Differentiable Scene Graphs
Moshiko Raboh*,
Roei Herzig*,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Winter Conference on Applications of Computer Vision (WACV), 2020
code /
bibtex
We propose an intermediate “graph-like” representation (DSGs) that can be learned in an end-to-end manner
from the supervision of a downstream visual reasoning task, and achieves new state-of-the-art results
on the Referring Relationships task.
|
|
Spatio-Temporal Action Graph Networks
Roei Herzig*, Elad Levi*,
Huijuan Xu*,
Hang Gao,
Eli Brosh,
Xiaolong Wang,
Amir Globerson,
Trevor Darrell
Workshop on Autonomous Driving at ICCV, 2019 (Oral)
code /
bibtex
We propose a latent inter-object graph representation for activity recognition
that captures the visual interactions between objects in a self-supervised manner.
|
|
Accurate Visual Localization for Automotive Applications
Eli Brosh*, Matan Friedmann*, Ilan Kadar*, Lev Yitzhak Lavy*, Elad Levi*, Shmuel Rippa*,
Yair Lempert, Bruno Fernandez-Ruiz, Roei Herzig,
Trevor Darrell
Workshop on Autonomous Driving at CVPR, 2019
blog /
code /
dataset /
bibtex
We propose a hybrid coarse-to-fine approach that leverages visual and GPS location cues, along with
a new large-scale driving dataset based on video and GPS data.
|
|
Precise Detection in Densely Packed Scenes
Eran Goldman*,
Roei Herzig*, Aviv Eisenschtat*,
Jacob Goldberger,
Tal Hassner
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019
code /
dataset /
bibtex
We collect the new SKU-110K dataset, which takes detection challenges to unexplored territory,
and propose a novel mechanism that learns deep overlap rates for each detection.
|
|
Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Roei Herzig*, Moshiko Raboh*,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Advances in Neural Information Processing Systems (NeurIPS), 2018
code /
bibtex
We propose a novel invariant graph network for mapping images to scene graphs that leverages permutation
invariance, achieving new state-of-the-art results on the Visual Genome dataset.
|
|