Compositional Video Synthesis with Action Graphs

ICML 2021

Amir Bar*1   Roei Herzig*1
Xiaolong Wang2,3   Anna Rohrbach2   Gal Chechik4   Trevor Darrell2   Amir Globerson1  
*Equally contributed

1Tel-Aviv University  2UC Berkeley  3UC San Diego  4Bar-Ilan University, NVIDIA Research



Our main goal is to learn to synthesize videos of actions. More specifically, our model should be able to synthesize videos given multiple (potentially simultaneous) timed actions involving multiple objects. Towards this end, we present the Action Graph (AG) structure, which we use to represent timed actions and objects, and the AG2Vid model, which synthesizes videos conditioned on the actions given in an AG together with an initial image and layout.

Action Graphs

Classic work in cognitive psychology argues that actions (and more generally events) are bounded regions of space-time and are composed of atomic action units (Quine, 1985; Zacks and Tversky, 2001). In a video, multiple actions can be applied to one or more objects, changing the relationships between the object and the subject of an action over time. Based on these observations, we introduce a formalism we call an Action Graph: a graph whose nodes are objects and whose edges are actions, each specified by its start and end time.

The Action Graph is a natural and convenient structure for representing the dynamics of actions between objects over time.
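
To make the definition concrete, here is a minimal Python sketch of an Action Graph as a data structure. It is an illustration under assumptions, not the authors' implementation: the class names, the integer timesteps, and the object and action names in the example are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionEdge:
    """A directed, timed edge: `subject` performs `action` on `target`."""
    subject: str
    action: str
    target: str
    start: int  # first timestep at which the action is active (assumed integer)
    end: int    # last timestep at which the action is active

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("an action cannot end before it starts")


@dataclass
class ActionGraph:
    """Nodes are objects; edges are actions with a start and an end time."""
    objects: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_action(self, subject, action, target, start, end):
        # Adding an edge also registers both endpoints as nodes.
        self.objects.update((subject, target))
        self.edges.append(ActionEdge(subject, action, target, start, end))

    def active_at(self, t):
        """Return the edges whose time window contains timestep t."""
        return [e for e in self.edges if e.start <= t <= e.end]


# Illustrative placeholder objects and actions, chosen only for this example.
graph = ActionGraph()
graph.add_action("cone", "slide", "cube", start=0, end=9)
graph.add_action("ball", "rotate", "ball", start=5, end=14)
```

Because the two time windows overlap between timesteps 5 and 9, `graph.active_at(7)` returns both edges, which is how simultaneous actions appear in this representation.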


We propose the AG2Vid model. Video synthesis proceeds in a coarse-to-fine manner: it starts by scheduling the time-specific actions of the Action Graph, then predicts the next object locations, which are in turn used to synthesize the pixels of the next image v_t. To execute an Action Graph, each action (and its corresponding edge) is assigned a "clock" that marks the progress of the action at a particular timestep. The Action Graph A_t at time t describes the execution stage of each action at time t.
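
The clock idea can be sketched in a few lines of Python. The sketch below assumes that progress grows linearly from 0 at an action's start time to 1 at its end time, and that actions outside their time window are dropped from the graph at time t; both are assumptions made for illustration, and the tuple layout and the names `clock` and `graph_at` are hypothetical.

```python
def clock(start, end, t):
    """Progress of an action at timestep t, or None if it is not active.

    Assumes linear progress from 0.0 at `start` to 1.0 at `end`.
    """
    if not start <= t <= end:
        return None
    if end == start:
        return 1.0  # a one-step action is complete as soon as it runs
    return (t - start) / (end - start)


def graph_at(edges, t):
    """Build the graph at time t: active edges, each carrying its clock."""
    scheduled = []
    for subject, action, target, start, end in edges:
        progress = clock(start, end, t)
        if progress is not None:
            scheduled.append((subject, action, target, progress))
    return scheduled


# Placeholder edges: (subject, action, target, start, end).
edges = [
    ("obj_a", "action_1", "obj_b", 0, 10),
    ("obj_c", "action_2", "obj_a", 5, 15),
]
```

At timestep 5 the first action is halfway through and the second has just started, so `graph_at(edges, 5)` contains both edges with clocks 0.5 and 0.0; at timestep 12 only the second edge remains.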

Using the Action Graph at time t, A_t, together with the previous layout and frame (l_{t-1}, v_{t-1}), the AG2Vid model synthesizes the next frame v_t.
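
The frame-by-frame procedure can be read as an autoregressive loop. The following Python sketch shows only that control flow: the three stages are passed in as callables, their names and exact inputs are assumptions, and the toy stand-ins at the bottom are not real models.

```python
def rollout(schedule, predict_layout, synthesize_frame, layout0, frame0, num_steps):
    """Generate frames one at a time, feeding each output back in.

    schedule(t)                               -> graph at time t
    predict_layout(graph_t, prev_layout)      -> next layout
    synthesize_frame(layout, prev_layout, prev_frame) -> next frame
    (These signatures are hypothetical.)
    """
    layout, frame = layout0, frame0
    frames = []
    for t in range(1, num_steps + 1):
        graph_t = schedule(t)                        # coarse: which actions, at what stage
        next_layout = predict_layout(graph_t, layout)  # finer: where the objects go
        frame = synthesize_frame(next_layout, layout, frame)  # finest: pixels
        layout = next_layout
        frames.append(frame)
    return frames


# Toy stand-ins: the layout is a single x-coordinate and a "frame" is a string.
def toy_schedule(t):
    return ["move right"] if t <= 2 else []


def toy_layout(graph_t, prev_layout):
    return prev_layout + len(graph_t)  # move one unit per active action


def toy_frame(layout, prev_layout, prev_frame):
    return f"x={layout}"


frames = rollout(toy_schedule, toy_layout, toy_frame, layout0=0, frame0="x=0", num_steps=4)
```

With these stand-ins the object moves for two steps and then stays put, so `frames` is `["x=1", "x=2", "x=2", "x=2"]`.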


Qualitative examples

Qualitative examples of action generation on the CATER and Something-Something V2 datasets. We use the AG2Vid model to generate videos of multiple (potentially simultaneous) actions on CATER and of single actions on Something-Something V2.

Zero-shot synthesis of sequential actions

We take the model trained on single actions on the Something-Something V2 dataset and define more complex action graphs that contain sequential actions. In the given example, we define the sequence "Move up", "Move down", and "Move right".
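
A sequential graph of this kind can be written down by giving each action a time window that begins right after the previous one ends. The Python sketch below assumes a fixed duration per action and uses placeholder node names ("hand", "object"); neither is taken from the paper.

```python
def sequential_edges(subject, target, actions, steps_per_action):
    """Chain actions so each one starts right after the previous one ends.

    Returns edges as (subject, action, target, start, end) tuples.
    """
    edges = []
    start = 0
    for action in actions:
        end = start + steps_per_action - 1
        edges.append((subject, action, target, start, end))
        start = end + 1  # the next action begins on the following timestep
    return edges


# The action sequence from the text; duration and node names are assumed.
edges = sequential_edges(
    "hand", "object", ["move up", "move down", "move right"], steps_per_action=8
)
```

Here the three actions occupy timesteps 0 to 7, 8 to 15, and 16 to 23, so no two of them are ever active at the same time.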

Zero-shot synthesis of simultaneous actions

We take the model trained on single actions on the Something-Something V2 dataset and define more complex action graphs that contain simultaneous actions. In the given example, we define the actions "Move right" and "Move left" to occur simultaneously.
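
In graph terms, two actions are simultaneous when the time windows of their edges overlap. As a small illustration, the Python sketch below computes that shared window; the tuple layout, the node names, and the timestep values are hypothetical.

```python
def overlap(edge_a, edge_b):
    """Return the shared (start, end) window of two edges, or None.

    Edges are (subject, action, target, start, end) tuples.
    """
    start = max(edge_a[3], edge_b[3])
    end = min(edge_a[4], edge_b[4])
    return (start, end) if start <= end else None


# Two actions scheduled over the same window, applied to different objects.
move_right = ("hand", "move right", "object_1", 0, 15)
move_left = ("hand", "move left", "object_2", 0, 15)

# For contrast, an action that only starts once the others are finished.
move_up_later = ("hand", "move up", "object_1", 16, 31)
```

`overlap(move_right, move_left)` gives `(0, 15)`, so the two actions run together for the whole clip, while `overlap(move_right, move_up_later)` is `None`.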

Zero-shot synthesis of new action composites

Qualitative examples of generating new action composites on the Something-Something V2 and CATER datasets. We use the Action Graph representation to define four new action composites out of the existing actions, and use our AG2Vid model to generate the corresponding videos.


Amir Bar*, Roei Herzig*, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson
Compositional Video Synthesis with Action Graphs
ICML 2021
Hosted on arXiv

*Equal contribution

Related Works

Our work builds on and borrows code from multiple past works, such as SG2Im, Canonical-SG2IM and Vid2Vid. If you find our work helpful, consider citing these works as well.


We would like to thank Lior Bracha for her help running MTurk experiments, and Haggai Maron and Yuval Atzmon for helpful feedback and discussions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Trevor Darrell's group was supported in part by DoD including DARPA's XAI, LwLL, and/or SemaFor programs, as well as BAIR's industrial alliance programs. Gal Chechik's group was supported by the Israel Science Foundation (ISF 737/2018), and by an equipment grant to Gal Chechik and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). This work was completed in partial fulfillment of the requirements for the Ph.D. degree of Amir Bar.