Exploring Long Tail Visual Relationship Recognition
with Large Vocabulary
ICCV 2021

*Denotes Equal Contribution

^Work done while working at KAUST

Abstract


Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we conduct the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long tail (e.g., “rabbit grazing on grass”). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and establish a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existing models, and despite their simplicity, our results show that they can remarkably improve performance, especially on tail classes.

Video


A New Long-Tail Benchmark

To study the problem of long-tail recognition in the VRD (Visual Relationship Detection) setup, we propose two new benchmarks, namely GQA-LT and VG8K-LT, built on top of the existing GQA and VG datasets. Most of the long-tail literature studies class-frequency ranges at a much smaller scale than in our setup. Our benchmarks have the following frequencies: for GQA-LT (1,703 object classes and 310 relation classes), the most frequent object and relationship categories have 374,282 and 1,692,068 examples, and the least frequent have 1 and 2 examples, respectively. This results in imbalance factors of around 300,000+ for objects and around 1.7 million for relations between the most and least frequent classes. For VG8K-LT (5,330 object classes and 2,000 relation classes), the most frequent object and relationship categories have 196,944 and 618,687 examples, and the least frequent have 14 and 18 examples, respectively, which leads to factors of approximately 14,000 for objects and 34,000 for relations. For more details, please refer to the paper and the supplementary material.

Histograms of the subject (sbj), relation (rel), and object (obj) class frequencies for the GQA-LT and VG8K-LT datasets. The figures show the frequency values on a log scale.
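For reference, the snippet below is a minimal, hypothetical sketch of how such a log-scale frequency plot could be reproduced from per-class example counts; the `class_counts` dictionary and its toy values are illustrative and are not part of the released benchmarks.

```python
# Minimal sketch for visualizing long-tail class frequencies on a log scale.
# `class_counts` is a hypothetical dict mapping class names to example counts,
# e.g. built from the benchmark annotation files; it is not part of the release.
import matplotlib.pyplot as plt


def plot_frequencies(class_counts, title):
    # Sort classes from most to least frequent to expose the long tail.
    counts = sorted(class_counts.values(), reverse=True)
    plt.figure(figsize=(6, 3))
    plt.bar(range(len(counts)), counts)
    plt.yscale("log")  # frequencies span several orders of magnitude
    plt.xlabel("class rank")
    plt.ylabel("number of examples (log scale)")
    plt.title(title)
    plt.tight_layout()
    plt.show()


# Example usage with toy counts:
plot_frequencies({"grass": 374282, "rabbit": 1200, "ferret": 1}, "objects (toy)")
```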

GQA-LT

You can download the annotations and splits necessary for the GQA-LT benchmark here. You should see a 'gvqa' folder once unzipped. It contains a seed folder called 'seed0' with the .json annotations suited to the dataloader used in our implementation. Also, download the GQA images from here.

VG8K-LT

You can download the annotations and splits necessary for the VG8K-LT benchmark here. You should see a 'vg8k' folder once unzipped. It contains a seed folder called 'seed3' with the .json annotations suited to the dataloader used in our implementation. Also, download the VG images from here.
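As a quick sanity check after downloading, the snippet below is a minimal sketch that lists the annotation files in a seed folder, assuming the unzipped layout described above; the 'data/vg8k/seed3' path is an assumption, and the actual .json file names should be taken from the folder contents and the dataloader code in the repository.

```python
# Minimal sketch for inspecting the downloaded annotation files.
import json
from pathlib import Path

seed_dir = Path("data/vg8k/seed3")  # adjust to where you unzipped the archive
for ann_file in sorted(seed_dir.glob("*.json")):
    with open(ann_file) as f:
        ann = json.load(f)
    # Print the file name and top-level size to sanity-check each split.
    size = len(ann) if isinstance(ann, (list, dict)) else "n/a"
    print(f"{ann_file.name}: {size} top-level entries")
```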

VilHub & RelMix

To improve classification accuracy on the long-tail spectrum of relationships and objects, we also propose two novel techniques, namely VilHub and RelMix. Both are designed to be used on top of any existing VRD model to improve its overall accuracy, especially on tail classes.


The overall approach and pipeline used for our baseline models.

The VilHub loss takes its inspiration from the hubness problem often discussed in NLP, which arises when some frequent words, called hubs, become indistinguishably close to many other less represented words. In the long-tail VRR context, these hubs are the head classes, which are often over-predicted at the expense of tail classes. To alleviate the hubness phenomenon, we develop a vision & language hubless loss (dubbed VilHub).
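For intuition, the snippet below is a minimal PyTorch sketch of a hubness-penalizing regularizer that discourages a few head classes from absorbing most of the predicted probability mass; it illustrates the general idea rather than the exact VilHub formulation, which is given in the paper.

```python
# Sketch of a hubness-penalizing regularizer (illustrative, not the exact VilHub loss).
import torch
import torch.nn.functional as F


def hubless_penalty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (batch, num_classes) scores from a classifier head."""
    probs = F.softmax(logits, dim=1)
    # Average predicted mass assigned to each class over the mini-batch.
    class_mass = probs.mean(dim=0)  # (num_classes,)
    uniform = torch.full_like(class_mass, 1.0 / class_mass.numel())
    # Penalize deviation from a uniform prior, so head classes cannot absorb
    # most of the probability mass at the expense of tail classes.
    return ((class_mass - uniform) ** 2).sum()


# Hypothetical usage alongside a standard classification loss:
# total_loss = ce_loss + lambda_hubless * hubless_penalty(logits)
```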

RelMix is an augmentation technique inspired by Mixup that helps alleviate the long-tail problem by augmenting more tail features into the training set. Triplets belonging to different parts of the frequency spectrum (i.e., head, medium, and tail) are selected and combined in a mixup fashion, so that the augmented data contain many more tail features. This in turn helps long-tail classification and our overall problem.
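The following is a minimal sketch of mixup-style feature augmentation across frequency groups in the spirit of RelMix; the function name, tensor shapes, and Beta parameter are illustrative assumptions, and the exact mixing scheme is described in the paper.

```python
# Sketch of mixup-style augmentation across frequency groups (illustrative).
import torch


def relmix_features(feat_head, feat_tail, labels_head, labels_tail, alpha=0.4):
    """feat_*: (batch, dim) triplet features; labels_*: (batch, num_classes) one-hot or soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Convexly combine features and labels from different parts of the spectrum
    # (e.g., a head triplet with a tail triplet) to synthesize more tail-like examples.
    mixed_feat = lam * feat_head + (1.0 - lam) * feat_tail
    mixed_labels = lam * labels_head + (1.0 - lam) * labels_tail
    return mixed_feat, mixed_labels
```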

Some qualitative results obtained with our techniques can be viewed below (click on the image for a better view). For more qualitative as well as quantitative results, please refer to the main paper and the supplementary material.


Citation

The website template was borrowed from Michaël Gharbi.