Object-Region Video Transformers
¹Tel-Aviv University, ²UC Berkeley, ³Bar-Ilan University, NVIDIA Research
Our ORViT model incorporates object information into video transformer layers.
The figure shows the standard (uniformly spaced) transformer patch-tokens in blue,
and object-regions corresponding to detections in orange.
In ORViT any temporal patch-token (e.g., the patch in black at time T)
attends to all patch tokens (blue) and region tokens (orange).
This allows the new patch representation to be informed by the objects.
Our method shows strong performance improvement on multiple video understanding tasks and datasets,
demonstrating the value of a model that incorporates object representations into a transformer architecture.
Recently, video transformers have shown great success in video understanding,
exceeding CNN performance; yet existing video transformer models do not
explicitly model objects, although objects can be essential for recognizing actions.
In this work, we present Object-Region Video Transformers (ORViT), an
object-centric approach that extends video transformer layers with a block
that directly incorporates object representations.
The key idea is to fuse object-centric representations starting from early layers
and propagate them into the transformer layers, thus affecting the spatio-temporal
representations throughout the network.
Our ORViT block consists of two object-level streams: appearance and dynamics.
In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions.
In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information.
We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions,
and show how to integrate the two streams.
We evaluate our model on four tasks and five datasets:
compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48, and Epic-Kitchens100.
We show strong performance improvement across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
Consider the simple action of "Picking up a coffee cup" below.
Intuitively, a human recognizing this action would identify the hand, the coffee cup and the coaster, and perceive the upward movement of the cup.
This highlights three important cues needed for recognizing actions:
What and where are the objects? How do they interact? And how do they move?
Indeed, evidence from cognitive psychology also supports this structure of the action-perception system.
Recently, video transformers have been introduced as powerful video understanding models.
In these models, each video frame is divided into patches, and a spatio-temporal self-attention architecture obtains a context-dependent representation for the patches.
However, there is no explicit representation of objects in this approach, which makes it harder for such models to capture compositionality.
The challenge in building an object-centric architecture is that it needs components for modeling the appearance of objects, the interactions between objects, and the dynamics of the objects (irrespective of their visual appearance).
We would like the objects to influence the representation of the scene throughout the bottom-up process rather than as a post-processing stage.
Our key idea is that object regions can be introduced into transformers in a similar way to that of the regular patches, and dynamics can also be integrated into this framework in a natural way.
We refer to our model as an "Object-Region Video Transformer" (or ORViT).
The ORViT block takes as input patch tokens and outputs refined patch tokens based on object information.
Within the block, it uses a set of object bounding boxes that are predicted using largely class-agnostic off-the-shelf trackers and serve to inform the model which parts of the video contain objects.
This information is then used to generate two separate object-level streams: an "Object-Region Attention" stream that models appearance, and an "Object-Dynamics Module" stream that models trajectories.
We re-integrate the appearance and trajectory streams into a set of refined patch tokens, which have the same dimensionality as the input to our ORViT block.
This means that the ORViT block can be called repeatedly.
The goal of Object-Region Attention is to extract information about each object and use it to refine the patch tokens.
This is done by using the object regions to extract descriptor vectors per region from the input tokens, which we refer to as object tokens.
These vectors are then concatenated with the THW patch tokens and serve as the keys and values, while the queries are only the patch tokens.
Finally, the output of the block is THW patch tokens.
Thus, the key idea is to fuse object-centric information into spatio-temporal representations.
Namely, we inject the TO object-region tokens (O objects in each of T frames) into the THW patch tokens.
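The attention pattern described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single head, no residual connection or normalization, and it pools each object token by averaging the patch tokens inside its box, which is an assumption made here for simplicity. The helper names (`pool_object_tokens`, `object_region_attention`) and the convention that boxes are given in integer patch-grid coordinates are likewise hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_object_tokens(patches, boxes):
    """Build one descriptor per object region.

    patches: (T, H, W, d) patch tokens.
    boxes:   (T, O, 4) integer patch-grid boxes (x0, y0, x1, y1),
             upper bounds exclusive (assumed convention).
    Returns: (T*O, d) object tokens. Mean pooling is a stand-in
             assumption, not necessarily the pooling used by ORViT.
    """
    T, H, W, d = patches.shape
    O = boxes.shape[1]
    tokens = np.zeros((T, O, d))
    for t in range(T):
        for o in range(O):
            x0, y0, x1, y1 = boxes[t, o]
            region = patches[t, y0:y1, x0:x1].reshape(-1, d)
            tokens[t, o] = region.mean(axis=0)
    return tokens.reshape(T * O, d)

def object_region_attention(patches, boxes, Wq, Wk, Wv):
    T, H, W, d = patches.shape
    x = patches.reshape(T * H * W, d)          # THW patch tokens
    obj = pool_object_tokens(patches, boxes)   # TO object tokens
    kv_in = np.concatenate([x, obj], axis=0)   # keys/values: patches + objects
    q = x @ Wq                                 # queries: patch tokens only
    k = kv_in @ Wk
    v = kv_in @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn                      # (THW, d), (THW, THW + TO)

rng = np.random.default_rng(0)
T, H, W, d, O = 2, 4, 4, 8, 3
patches = rng.normal(size=(T, H, W, d))
boxes = np.array([[[0, 0, 2, 2], [1, 1, 4, 4], [2, 0, 4, 3]]] * T)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = object_region_attention(patches, boxes, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Because only the patch tokens act as queries, the output has exactly THW rows regardless of how many objects are supplied, which is what lets the refined tokens replace the input tokens one-for-one.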
To model object dynamics, we introduce the Object-Dynamics Module that only considers the box coordinates.
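As a rough illustration of a stream that sees only box coordinates, the sketch below embeds each box with a linear map and lets the resulting TO trajectory tokens attend to one another. Everything beyond "the input is box coordinates" is an assumption for the sake of the example: the linear embedding, the single-head self-attention, the residual connection, and the function name `object_dynamics` are all hypothetical. How the result is merged back into the patch tokens is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_dynamics(boxes, W_embed, Wq, Wk, Wv):
    """Trajectory tokens computed from box coordinates alone.

    boxes:   (T, O, 4) boxes normalized to [0, 1] as (x0, y0, x1, y1).
    W_embed: (4, d) linear box embedding (assumed).
    Returns: (T*O, d) tokens; no pixel or patch features are used.
    """
    T, O, _ = boxes.shape
    tokens = boxes.reshape(T * O, 4) @ W_embed     # (TO, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return tokens + attn @ v                       # residual update

rng = np.random.default_rng(0)
T, O, d = 4, 2, 8
# Toy trajectories: object 0 drifts right, object 1 stays put.
boxes = np.zeros((T, O, 4))
for t in range(T):
    boxes[t, 0] = [0.1 * t, 0.2, 0.1 * t + 0.2, 0.4]
    boxes[t, 1] = [0.6, 0.6, 0.8, 0.8]
W_embed = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
dyn = object_dynamics(boxes, W_embed, Wq, Wk, Wv)
print(dyn.shape)
```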
We visualize the attention allocated to the object tokens in the ORViT block (red, green, and blue) in each frame
for a video describing "moving two objects away from each other".
It can be seen that each remote object affects the patch-tokens in its region, whereas the hand has a broader map.
Comparison of attention maps between ORViT+Mformer and Mformer on videos from the SSv2 dataset.
The visualization shows the attention maps corresponding to the CLS query.
Object contribution to the patch tokens. For each object token, we plot the attention weight given by the patch tokens, normalized over the patch tokens.
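One plausible reading of this normalization, sketched in NumPy: given an attention matrix whose rows are patch queries and whose last TO columns are object tokens, take each object's column and divide it by its sum over all patches, so every object yields a map that sums to 1. The attention layout (THW patch keys followed by TO object keys) and the function name `object_contribution_maps` are assumptions for illustration, and the random attention matrix stands in for real model output.

```python
import numpy as np

def object_contribution_maps(attn, T, H, W, O):
    """Per-object attention maps, normalized over patch tokens.

    attn: (THW, THW + TO) attention weights, rows are patch queries
          (assumed layout: patch keys first, then object keys).
    Returns: (TO, T, H, W); each map sums to 1 over all patches.
    """
    n_patches = T * H * W
    to_objects = attn[:, n_patches:]                          # (THW, TO)
    normalized = to_objects / to_objects.sum(axis=0, keepdims=True)
    return normalized.T.reshape(T * O, T, H, W)

rng = np.random.default_rng(0)
T, H, W, O = 2, 4, 4, 3
logits = rng.normal(size=(T * H * W, T * H * W + T * O))
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)        # row-wise softmax
maps = object_contribution_maps(weights, T, H, W, O)
print(maps.shape)
```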
Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
Object-Region Video Transformers
Hosted on arXiv
Our work builds and borrows code from multiple past works such as SlowFast, MViT, TimeSformer and MotionFormer. If you found our work helpful, consider citing these works as well.
We would like to thank Tete Xiao, Medhini Narasimhan, Rodolfo (Rudy) Corona, and Colorado Reed for helpful feedback and discussions.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080).
Prof. Darrell's group was supported in part by DoD, including DARPA's XAI and LwLL programs, as well as BAIR's industrial alliance programs.
This work was completed in partial fulfillment for the Ph.D. degree of the first author.