Roei Herzig

Hi there! I'm Roei, a second-year CS Ph.D. student at Tel Aviv University, working with Prof. Amir Globerson and Prof. Trevor Darrell, and a member of the Berkeley AI Research Lab.

I'm also a Machine Learning & Deep Learning researcher; over the last five years I have worked at Nexar and at Trax Image Recognition. Previously, I graduated magna cum laude from Tel Aviv University with an MSc (CS), a BSc (CS), and a BSc (Physics).

I'm looking for strong MSc students who wish to collaborate and publish in top-tier conferences on Geometry and Learning for 3D and Video Understanding.

Email  /  Twitter  /  Github  /  LinkedIn  /  CV  /  Google Scholar

profile photo
Research

I mainly work on structured models in machine learning and deep learning for obtaining a better semantic understanding of videos and images (e.g., structured prediction). I believe our world is compositional, and that humans do not perceive it as raw pixels. Moreover, structured models enjoy generalization and inductive-bias properties that I find critical, especially at the intersections of vision, language, and robotics.

Research Interests:

  • Machine Learning & Deep Learning: Generative Models, Graph Neural Networks, Self-Supervised Learning.
  • Vision & Language: Video Understanding, Scene Understanding, Visual Reasoning.
  • Vision & Robotics: Semantic Understanding, Structured Representation, Transfer Learning.

Personal

I'm a proud father of Adam. When I'm not working, I'm a history buff and love learning about science, politics, the two World Wars, the Cold War, and music.

Publications
Compositional Video Synthesis with Action Graphs
Amir Bar*, Roei Herzig*, Xiaolong Wang, Gal Chechik, Trevor Darrell, Amir Globerson
arXiv preprint, 2020
project page / code / slides / bibtex

We introduce the formalism of Action Graphs, a natural and convenient structure representing the dynamics of actions between objects over time. We show we can synthesize goal-oriented videos on the CATER and Something Something datasets and generate novel compositions of unseen actions.

Learning Canonical Representations for Scene Graph to Image Generation
Roei Herzig*, Amir Bar*, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
Proceedings of the European Conference on Computer Vision (ECCV), 2020
project page / code / bibtex

We present a novel model that inherently learns canonical graph representations and shows better robustness to graph size, adversarial attacks, and semantically equivalent graphs, thus generating superior images of complex visual scenes.

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks
Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu*, Xiaolong Wang*, Trevor Darrell*
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020
project page / code / dataset / bibtex

We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. We show the effectiveness of our approach on the proposed compositional task and a few-shot compositional setting which requires the model to generalize across both object appearance and action category.

Differentiable Scene Graphs
Moshiko Raboh*, Roei Herzig*, Gal Chechik, Jonathan Berant, Amir Globerson
Winter Conference on Applications of Computer Vision (WACV), 2020
code / bibtex

We propose an intermediate “graph-like” representation (DSGs) that can be learned end-to-end from the supervision of a downstream visual reasoning task, achieving new state-of-the-art results on the Referring Relationships task.

Spatio-Temporal Action Graph Networks
Roei Herzig*, Elad Levi*, Huijuan Xu*, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor Darrell
Workshop on Autonomous Driving at ICCV, 2019 (Oral)
code / bibtex

We propose a latent inter-object graph representation for activity recognition that explores the visual interaction between the objects in a self-supervised manner.

Accurate Visual Localization for Automotive Applications
Eli Brosh*, Matan Friedmann*, Ilan Kadar*, Lev Yitzhak Lavy*, Elad Levi*, Shmuel Rippa*, Yair Lempert, Bruno Fernandez-Ruiz, Roei Herzig, Trevor Darrell
Workshop on Autonomous Driving at CVPR, 2019
blog / code / dataset / bibtex

We propose a hybrid coarse-to-fine approach that leverages visual and GPS location cues, along with a new large-scale driving dataset based on video and GPS data.

Precise Detection in Densely Packed Scenes
Eran Goldman*, Roei Herzig*, Aviv Eisenschtat*, Jacob Goldberger, Tal Hassner
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019
code / dataset / bibtex

We collect SKU-110K, a new dataset that takes detection challenges into unexplored territory, and propose a novel mechanism to learn deep overlap rates for each detection.

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Roei Herzig*, Moshiko Raboh*, Gal Chechik, Jonathan Berant, Amir Globerson
Advances in Neural Information Processing Systems (NeurIPS), 2018
code / bibtex

We propose a novel invariant graph network for mapping images to scene graphs using the permutation-invariance property, achieving new state-of-the-art results on the Visual Genome dataset.

Invited Talks
Learning Canonical Representations for Scene Graph to Image Generation (BAIR Fall Seminar, 2020), Slides.
Compositional Video Synthesis with Action Graphs (Israeli Geometric Deep Learning, 2020), Slides.
Structured Semantic Understanding for Videos and Images (Advanced Seminar in Computer Graphics at TAU, 2020), Slides.

Website template credits