Overview of Multi-label Learning in Computer Vision
- mingyu wang
- Jan 31, 2023
- 9 min read
1 What is Multi-Label Learning?

Figure 1. Binary vs. multi-class vs. multi-label classification
General classification problems mainly focus on binary and multi-class classification (single-label learning, SLL): each object belongs to exactly one category, and the categories are mutually exclusive. But in many applications an object can have multiple labels. For example, a news article may cover several topics at once (politics, economics, diplomacy), and a city photo may contain multiple objects (vehicles, pedestrians, roads, buildings, etc.). Multi-label learning is required in many practical fields, such as multimedia content annotation, text annotation, and genetics. In summary, multi-label learning (MLL) is an extension of the general classification problem: an MLL model aims to predict all the labels that apply to each input instance [1].
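To make the distinction concrete, here is a minimal PyTorch illustration (the label names and scores are made up): a single-label target is one class index, while a multi-label target is a multi-hot vector, typically scored with a per-label sigmoid and binary cross-entropy.

```python
import torch

# Multi-class (single-label): exactly one category is active, so the target
# is a class index and softmax + cross-entropy is the usual loss.
single_label_target = torch.tensor(2)  # e.g. the class "diplomacy"

# Multi-label: any subset of categories may be active, so the target is a
# multi-hot vector and each label is scored independently.
# Labels: [politics, economics, diplomacy, sports]
multi_label_target = torch.tensor([1., 1., 1., 0.])  # an article covering three topics

logits = torch.randn(4)  # raw per-label scores from some model
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, multi_label_target)
```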
2 The key to the MLL problem

Figure 2. Correlation between labels in MLL
The defining characteristic of the MLL problem is that “there is a certain correlation between the labels” [1]. For example, in movie classification, a movie with the label "children" has a high probability of also having the label "family". Conversely, if a movie has the label "horror", it is almost certain not to have the label "children". The most interesting part of multi-label learning is that we can improve models by mining these relationships between labels, as the toy example below illustrates.
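The following sketch estimates conditional label probabilities from a small, made-up multi-hot label matrix; this is the kind of statistic label-correlation models try to exploit.

```python
import numpy as np

# Toy multi-hot label matrix: rows = movies, columns = [children, family, horror]
Y = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

counts = Y.T @ Y                       # co-occurrence counts between label pairs
occur = np.diag(counts).astype(float)  # total occurrences of each label
P = counts / occur[:, None]            # P[i, j] estimates P(label j | label i)

print(P[0, 1])  # P(family | children) = 2/3, high
print(P[0, 2])  # P(horror | children) = 0, the labels never co-occur
```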
3 Deep learning structure of MLL in computer vision

Figure 3. Four deep neural network structures of MLL in CV. a, Shared backbone structure; b, RNN-based structure; c, GNN-based structure; d, Transformer-based structure
The powerful fitting capabilities of deep neural networks allow us to handle increasingly difficult tasks efficiently. As shown in Figure 3, many researchers in Computer Vision (CV) have therefore introduced DNNs to solve the MLL problem in recent years. Based on the building blocks used, such as the Recurrent Neural Network (RNN), the Graph Neural Network (GNN), and the Transformer, the deep network structures for MLL in CV can be divided into the following types:
1) Shared backbone structure.
2) RNN-based structure.
3) GNN-based structure.
4) Transformer-based structure.
Below we will give examples of the above four architectures.
4 Shared backbone structure
The shared backbone structure is the simplest MLL structure. It generally consists of one backbone network and multiple classification heads: each head outputs the predicted value of a different label, and all heads share the features produced by the backbone, as in the sketch below.
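A minimal PyTorch sketch of this structure (a generic illustration, not any specific paper's model):

```python
import torch
import torch.nn as nn
from torchvision import models

class SharedBackboneMLL(nn.Module):
    """One shared feature extractor, one independent binary head per label."""
    def __init__(self, num_labels):
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features   # 2048 for ResNet50
        backbone.fc = nn.Identity()          # drop the original single-label classifier
        self.backbone = backbone
        # The heads interact only implicitly, through the shared features.
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_labels)])

    def forward(self, x):
        feats = self.backbone(x)                                   # (B, 2048)
        return torch.cat([head(feats) for head in self.heads], 1)  # (B, num_labels) logits

model = SharedBackboneMLL(num_labels=6)
logits = model(torch.randn(2, 3, 224, 224))  # sigmoid(logits) -> per-label probabilities
```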
4.1 Shared backbone in medical imaging (2020)
Article link: https://ieeexplore.ieee.org/abstract/document/9016204
Journal or Conference: IEEE access
Open-source code: None
This article builds a multi-label prediction network based on ResNet50 that predicts whether a thyroid nodule in an ultrasound image is benign or malignant while also predicting its individual TI-RADS features (shape, calcification, composition, etc.). Because the TI-RADS characteristics of thyroid nodules have been shown to be strongly correlated with malignancy, this auxiliary information can improve the benign/malignant classification. The diagram is shown in Figure 4. Specifically, the auxiliary labels are: 1) solid; 2) cystic; 3) mixed cystic and solid; 4) macrocalcifications (whether coarse calcification is present); 5) punctate echogenic foci (whether microcalcification is present); 6) smooth or irregular margin.

Figure 4. Overview of the proposed network in Zhang, et al. IEEE Access 8 (2020)
On a dataset of 16,946 images, the test results (Figure 5) show that the proposed method achieves higher accuracy than previous studies and than radiologists. Unfortunately, this study did not conduct ablation experiments comparing multi-label modeling with single-label modeling.

Figure 5. Experimental results in Zhang, et al. IEEE Access 8 (2020)
It should be noted that while this type of structure is simple, it models label correlation only implicitly, so the correlation between labels is not fully exploited. The next three types of models explicitly model the correlation between labels.
5 RNN-based structure
To explicitly model the correlation between multiple labels, more and more researchers have turned to sequence learning: treat the labels as a sequence, learn the relationships between them, and use those relationships to improve multi-label prediction. As a classic sequence-learning structure, the RNN is well suited to this kind of explicit modeling.
5.1 CNN-RNN (2016)
Article link:
Journal or Conference: CVPR
Open-source code: (unofficial)

Figure 6. Overview of the proposed CNN-RNN network in Wang, et al. CVPR. 2016.
The authors' idea is to use a convolutional network to extract image features and an LSTM to guide the CNN: fed a picture alone, the CNN's region of interest may miss some targets, but guidance from the LSTM can refine that region of interest and allow the network to make the correct classifications. As shown in Figure 6, a picture is first passed through a VGG network to extract features. The n labels are embedded into the same embedding space as the image features, forming a label embedding matrix U_l; in Figure 6, red dots represent label embeddings and blue dots represent image embeddings (features). During inference, a zero vector is input as the initial label embedding at the first time step. This embedding is combined with the image features (both are mapped to a common space and added), and a fully connected layer maps the result to a joint embedding (the black dots). The joint embedding is multiplied by the transpose of the label embedding matrix U_l to obtain its distance to each label embedding, and the closest label is predicted. This process is iterated at each subsequent time step: the RNN emits one label per step, first the dominant object (e.g., a ship) and then the others (e.g., the sea). The comparison results are shown in Figure 7; the performance gain from adding the RNN's sequential relationship modeling is quite clear. A simplified sketch of this greedy decoding loop follows.
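This sketch uses hypothetical dimensions and omits the paper's exact projection layers and beam search; it only illustrates the label-at-a-time inference idea.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, num_labels = 64, 128, 5
U_l = torch.randn(num_labels, emb_dim)       # label embedding matrix
lstm = nn.LSTMCell(emb_dim, hid_dim)
proj_img = nn.Linear(2048, emb_dim)          # image features -> joint space
proj_rnn = nn.Linear(hid_dim, emb_dim)       # recurrent state -> joint space

img_feat = proj_img(torch.randn(1, 2048))    # CNN features for one image
label_emb = torch.zeros(1, emb_dim)          # step 0: zero vector as initial label embedding
h, c = torch.zeros(1, hid_dim), torch.zeros(1, hid_dim)

predicted = []
for _ in range(3):                           # greedily decode up to 3 labels
    h, c = lstm(label_emb, (h, c))
    joint = img_feat + proj_rnn(h)           # the joint embedding (the "black dot")
    scores = joint @ U_l.T                   # similarity to every label embedding
    idx = scores.argmax(dim=1).item()        # the closest label is predicted
    predicted.append(idx)
    label_emb = U_l[idx].unsqueeze(0)        # feed the predicted label into the next step
```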

Figure 7. Comparison results in Wang, et al. CVPR. 2016.
However, the limitation of CNN-RNN is that a prior dependency order between labels must be obtained in advance to determine the prediction order, which is cumbersome. Algorithms that predict all labels simultaneously were therefore developed.
6 GNN-based structure
The GNN is a kind of neural network designed for graph data. Since labels can serve as the “nodes” of a graph and the relationships between labels can be represented by its “edges”, the GNN is a natural fit for modeling label dependencies in MLL problems.
6.1 ML-GCN (2019)
Article link:
Journal or Conference: CVPR
Open-source code: https://github.com/megvii-research/ML-GCN

Figure 8. Overview of the proposed ML-GCN network in Chen, et al. CVPR. 2019.
The authors' idea is simple (Figure 8). They use a graph convolutional network (GCN, a kind of GNN) to model the correlation between the labels' word vectors (e.g., GloVe vectors), and take the word vectors output after this correlation interaction as the weights of the classifier. These weights are matrix-multiplied with the CNN's image features to obtain the predicted probability of each label. The key contribution addresses two problems with the label co-occurrence matrix used by traditional MLL algorithms. First, co-occurrence statistics can follow a long-tail distribution, so rarely co-occurring pairs may just be noise. Second, the co-occurrence patterns of the training set and the test set may not be fully consistent, and a correlation matrix overfitted to the training set can seriously hurt generalization. The authors therefore first binarize the co-occurrence matrix, and then, to prevent the features of different categories from becoming indistinguishable after repeated graph convolutions, re-weight the binarized matrix, as in the sketch below. The experimental results in Figure 9 show that this method is very effective and superior to CNN-RNN and other methods.
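The binarization and re-weighting step can be sketched as follows (tau and p are illustrative hyperparameters; the resulting matrix serves as the adjacency for graph convolutions over the label word vectors):

```python
import numpy as np

def ml_gcn_adjacency(P, tau=0.4, p=0.2):
    """Binarize + re-weight a conditional probability matrix P,
    where P[i, j] estimates P(label j | label i)."""
    A = (P >= tau).astype(float)           # 1) binarize: keep only confident correlations
    np.fill_diagonal(A, 0.0)
    row_sum = A.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0            # guard labels with no neighbors
    A = p * A / row_sum                    # 2) neighbors share a total weight of p...
    np.fill_diagonal(A, 1 - p)             # ...and each node keeps 1 - p for itself,
    return A                               # limiting over-smoothing between classes
```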

Figure 9. Comparison results in Chen, et al. CVPR. 2019.
However, this method has a limitation: in the medical field, pre-trained word vectors are rarely available for medical vocabulary (protein names, gene names, etc.), so it cannot be used to train models on some medical data.
6.2 SSGRL (2019)
Article link: https://arxiv.org/pdf/1908.07325.pdf
Journal or Conference: ICCV
Open-source code: (official) https://github.com/HCPLab-SYSU/SSGRL

Figure 10. Overview of the proposed SSGRL network in Chen, et al. ICCV. 2019.
This method is very similar to the ML-GCN structure introduced in Section 6.1. In SSGRL (Figure 10), the image first passes through a CNN to extract a feature f_I; then f_I and each label's word vector go through a semantic decoupling (SD) operation, an attention mechanism that extracts a joint embedding of the label and the image (sketched below). These joint embeddings are fed to a GRU-based GNN (GRU: Gated Recurrent Unit) for further interaction, which finally outputs the features for multi-label classification. The comparison results in Figure 11 show that SSGRL is superior to MLL networks such as CNN-RNN.
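A rough sketch of a per-label attention of this kind, with hypothetical dimensions (SSGRL's exact low-rank formulation is omitted):

```python
import torch
import torch.nn as nn

class SemanticDecoupling(nn.Module):
    """Per-label attention pooling: each label's word vector attends over
    the spatial image features to extract a label-specific representation."""
    def __init__(self, img_dim=2048, word_dim=300, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.word_proj = nn.Linear(word_dim, joint_dim)
        self.att = nn.Linear(joint_dim, 1)

    def forward(self, f_img, words):
        # f_img: (B, HW, img_dim) spatial CNN features; words: (C, word_dim)
        img = self.img_proj(f_img).unsqueeze(1)                # (B, 1, HW, joint)
        wrd = self.word_proj(words).unsqueeze(0).unsqueeze(2)  # (1, C, 1, joint)
        a = self.att(torch.tanh(img * wrd)).softmax(dim=2)     # (B, C, HW, 1) weights
        return (a * f_img.unsqueeze(1)).sum(dim=2)             # (B, C, img_dim)

sd = SemanticDecoupling()
joint = sd(torch.randn(2, 49, 2048), torch.randn(80, 300))  # one vector per label: (2, 80, 2048)
```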

Figure 11. Comparison results in Chen, et al. ICCV. 2019.
7 Transformer-based structure
The Transformer's attention mechanism offers another way to explicitly model the interactions between labels and between labels and image features; the following works are representative.
7.1 C-Tran (2021)
Journal or Conference: CVPR
Open-source code: https://github.com/QData/C-Tran

Figure 12. Overview of the proposed C-Tran network in Lanchantin, et al. CVPR. 2021.
This method (Figure 12) first uses a CNN to extract image features, which are divided along the spatial dimension into multiple patch features z and used as transformer input tokens. To model the correlations between labels and between labels and image features, the labels are embedded in a manner similar to CNN-RNN (subsection 5.1). In addition, state embeddings (indicating whether each label is known to be positive or negative) are added to the label embeddings before they are input to the transformer. After interaction inside the transformer, the output label embeddings are fed into multiple classification heads that output the final multi-label prediction probabilities. A minimal sketch of this input construction follows.
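This sketch uses hypothetical dimensions and assumes a three-state embedding (unknown/negative/positive), matching the paper's label-mask style training where unobserved labels are marked "unknown".

```python
import torch
import torch.nn as nn

B, num_patches, C, d = 2, 49, 20, 256
patch_tokens = torch.randn(B, num_patches, d)  # z: projected CNN feature patches

label_emb = nn.Embedding(C, d)                 # one learned embedding per label
state_emb = nn.Embedding(3, d)                 # states: 0=unknown, 1=negative, 2=positive
states = torch.zeros(B, C, dtype=torch.long)   # at plain inference, every label is "unknown"

# State embeddings are added on top of the label embeddings before the transformer.
label_tokens = label_emb.weight.unsqueeze(0).expand(B, -1, -1) + state_emb(states)
tokens = torch.cat([patch_tokens, label_tokens], dim=1)  # (B, num_patches + C, d)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=3)
out = encoder(tokens)                          # patches and labels self-attend jointly
logits = nn.Linear(d, 1)(out[:, num_patches:]).squeeze(-1)  # one logit per label token
```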
This method combines the transformer and the CNN in a simple way, and the added state embeddings allow it to fully exploit the correlation between labels. It is a strong MLL method, outperforming ML-GCN, CNN-RNN, and other networks (Figure 13). In the original paper the authors also introduce several inference modes, such as using the ground-truth values of some labels to improve the classification of the remaining labels (Figure 14).

Figure 13. Comparison of experimental results in Lanchantin, et al. CVPR. 2021.

Figure 14. Results when assisted by additional labels in Lanchantin, et al. CVPR. 2021.
7.2 Query2Label (2021)
Article link: https://arxiv.org/abs/2107.10834
Journal or Conference: arXiv
Open-source code: (official) https://github.com/SlongLiu/query2labels

Figure 15. Overview of the proposed Query2Label network in Liu, et al. arXiv (2021).
As shown in Figure 15, Query2Label is a two-stage framework. In the first stage, image features are extracted by a backbone network. In the second stage, the image features and label features are sent to a Transformer decoder together: the image features serve as keys and values, and the label features serve as queries. The query features output by the Transformer are used to predict the presence of each label after adaptive feature pooling and linear projection. In the first stage, the backbone is a freely replaceable feature extractor; it can be a CNN-based network, or a transformer-based network such as ViT. In the second stage, the feature vectors that the label queries produce through the Transformer layers are directly projected to obtain the corresponding logits; adaptive pooling and linear projection are common operations. Note that, unlike C-Tran, the label features here attend to the image features via cross-attention (labels as queries, image features as keys and values) rather than being concatenated with them into one self-attention sequence. A minimal sketch of this decoding stage follows.
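This sketch uses hypothetical dimensions; note that a standard PyTorch decoder layer also self-attends among the queries in addition to cross-attending to the image tokens.

```python
import torch
import torch.nn as nn

B, HW, C, d = 2, 49, 80, 256
img_tokens = torch.randn(B, HW, d)               # backbone features (CNN or ViT)
label_queries = nn.Parameter(torch.randn(C, d))  # one learnable query per label

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

queries = label_queries.unsqueeze(0).expand(B, -1, -1)  # (B, C, d)
out = decoder(tgt=queries, memory=img_tokens)           # cross-attention: labels -> image
logits = nn.Linear(d, 1)(out).squeeze(-1)               # (B, C), one logit per label
```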

The comparison results of this method are shown in Figure 16; it outperforms C-Tran and other MLL networks.

Figure 16. Comparison results in Liu, et al. arXiv (2021).
7.3 ML-Decoder (2021)
Article link: https://arxiv.org/abs/2111.12933
Journal or Conference: arXiv
Open-source code: https://github.com/Alibaba-MIIL/ML_Decoder

Figure 17. Overview of the proposed ML-Decoder network in Ridnik, et al. arXiv (2021).
The main improvement of this method is a modified transformer decoder. In a standard decoder (left side of Figure 17), self-attention among the queries is executed by default, and the attention matrix before the final feed-forward has dimension N × D, where N is the number of classes, D is the token dimension, and K denotes the number of group queries. Removing the redundant self-attention block relaxes the quadratic dependence on the number of queries to a linear one while retaining the same expressivity. When group queries with a fixed number K < N are used, ML-Decoder becomes fully scalable, with a spatial pooling cost independent of the number of classes; the sketch below illustrates the recipe. The results (Figure 18) show that this method is superior to the ML-GCN and Q2L (Query2Label) algorithms.
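A sketch of the recipe under hypothetical dimensions (the feed-forward block is omitted): K fixed group queries cross-attend to the image tokens, and a per-group projection expands each query output into its share of the N class logits.

```python
import torch
import torch.nn as nn

B, HW, d, N, K = 2, 49, 256, 1000, 100
img_tokens = torch.randn(B, HW, d)
group_queries = nn.Parameter(torch.randn(K, d))     # K << N keeps the cost linear in K

cross_att = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
group_fc = nn.Parameter(torch.randn(K, d, N // K))  # one projection per group query

q = group_queries.unsqueeze(0).expand(B, -1, -1)    # (B, K, d)
out, _ = cross_att(query=q, key=img_tokens, value=img_tokens)  # cross-attention only,
                                                               # no query self-attention
# Group fully-connected: query k produces the logits for its own N/K classes.
logits = torch.einsum('bkd,kdg->bkg', out, group_fc).reshape(B, N)  # (B, N)
```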

Figure 18. Comparison results in Ridnik, et al. arXiv (2021).
8 Conclusion
As an important machine learning problem, MLL has developed rapidly in recent years. Its network structures have evolved from simple CNNs to combinations with GNNs, transformers, and other architectures, gradually building explicit models of the correlation between labels.
References
[1] Liu, Weiwei, et al. "The emerging trends of multi-label learning." IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[2] Zhang, Shijie, et al. "A novel interpretable computer-aided diagnosis system of thyroid nodules on ultrasound based on clinical experience." IEEE Access 8 (2020): 53223-53231.
[3] Wang, Jiang, et al. "CNN-RNN: A unified framework for multi-label image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[4] Chen, Zhao-Min, et al. "Multi-label image recognition with graph convolutional networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[5] Chen, Tianshui, et al. "Learning semantic-specific graph representation for multi-label image recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[6] Lanchantin, Jack, et al. "General multi-label image classification with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[7] Liu, Shilong, et al. "Query2Label: A simple transformer way to multi-label classification." arXiv preprint arXiv:2107.10834 (2021).
[8] Ridnik, Tal, et al. "ML-Decoder: Scalable and versatile classification head." arXiv preprint arXiv:2111.12933 (2021).