
Multi-label learning in action with ML-GCN

  • Writer: mingyu wang
  • Jan 31, 2023
  • 7 min read

0. Introduction

Large datasets with label information enable deep learning methods to achieve expert-level performance in a variety of medical imaging tasks. The goal of multi-label image recognition is to predict all of the object labels that appear in an image. It is widely used in search engines and recommendation systems, has long been a fundamental problem in computer vision and machine learning, and continues to attract attention from the research community. Since multiple related objects usually appear in an image at the same time, a natural way to improve recognition performance is to address the core problem of multi-label recognition: how to effectively model the interdependencies between labels. In this post, the classic multi-label classification model ML-GCN is applied to the multi-label classification of chest X-rays.

1. Dataset

Many multi-label classification tasks have correlations between labels, and thoracic diseases are no exception. Take the multi-label dataset CheXpert as an example. It is a large-scale chest X-ray dataset published by Andrew Ng's team in 2019, containing 224,316 chest X-ray images from 65,240 patients; the corresponding paper is "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison". The labels in this dataset are produced by a labeler that extracts 14 observations from radiology reports (roughly a text-recognition plus semantic-analysis program); 12 of them are disease categories and 2 are non-disease categories. Each category takes one of 4 values (1: positive, 0: negative, -1: uncertain, blank: the observation is not mentioned in the report). Positive means the category is a positive sample, negative means it is a negative sample, and uncertain means the X-ray image alone cannot determine whether the sample is positive or negative.




Figure 1. Label file of the CheXpert dataset. There is one patient per row and one category per column; each category takes one of 4 values (1: positive, 0: negative, -1: uncertain, blank: the observation is not mentioned in the report).
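For readers who want to poke at the file, here is a minimal sketch of reading it with pandas (the path below is only illustrative; the 14 observation columns follow the first few metadata columns such as Path, Sex and Age):

import pandas as pd

# Illustrative path to the CheXpert label file
df = pd.read_csv('CheXpert-v1.0-small/train.csv')

# The observation columns hold 1 (positive), 0 (negative), -1 (uncertain) or NaN (not mentioned)
label_cols = df.columns[5:]
print(df[label_cols].iloc[0])

# One simple policy: treat "not mentioned" as negative; how to handle -1 is a separate choice
labels = df[label_cols].fillna(0)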


As shown in Figure 2, there is a parent-child relationship between Lung Opacity and diseases such as Consolidation and Atelectasis; that is, if someone has Consolidation, they necessarily also have Lung Opacity. These relationships are not exploited in the officially provided baseline method, so this is something that can be improved.





Figure 2. The relationship between the labels in the CheXpert dataset.


2. ML-GCN

ML-GCN is a multi-label network proposed at CVPR 2019; its source code is at https://github.com/megvii-research/ML-GCN. The network architecture is shown in the figure below. The CNN part extracts the image feature (a D×1 vector), the GCN part learns the association information between labels and outputs a C×D matrix of label classifiers (C is the number of classes), and the final C×1 output is obtained by multiplying the two. It is worth mentioning that this network does not use a fully connected layer at the end. The CNN part is a conventional backbone, so let's focus on the GCN part (the derivation of GCN is not explained here; this blog focuses on how to use it). To use the GCN framework, we only need to know two things: 1. the input of the GCN network; 2. the formula of the GCN network. I will introduce these two parts below.



(Figure: the overall architecture of ML-GCN)
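In terms of shapes, the fusion at the end of the network is just a matrix-vector product; a minimal sketch (the values of D and C below are only illustrative):

import torch

D, C = 2048, 14                      # CNN feature dimension and number of classes (illustrative)
image_feature = torch.randn(D)       # D x 1 image feature from the CNN after global pooling
classifiers = torch.randn(C, D)      # C x D label classifiers produced by the GCN
scores = torch.matmul(classifiers, image_feature)   # C x 1 prediction scores, no FC layer needed
print(scores.shape)                  # torch.Size([14])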


2.1 The input of GCN

The input of the GCN network is the word embedding (word vector) of each category to be classified; for the CheXpert dataset this means word embeddings for the 14 categories. So, first of all, what is a word vector? We usually represent a word in a one-hot way, but the one-hot representation contains no information about the relationships between words. For example, if [1,0,0] means "little brother", [0,1,0] means "little sister" and [0,0,1] means "sky", then the cosine similarity between "little brother" and "little sister" is 0, and the cosine similarity between "little brother" and "sky" is also 0. According to common sense, the correlation between "little brother" and "little sister" should be stronger than that between "little brother" and "sky", but one-hot encoding cannot express this. Word embeddings can capture these correlations, because the vector of each word is obtained by training on a corpus with an appropriate method, so information about related words is encoded in the word vector.
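As a quick numerical check of the one-hot argument above:

import torch

little_brother = torch.tensor([1., 0., 0.])
little_sister = torch.tensor([0., 1., 0.])
sky = torch.tensor([0., 0., 1.])
# Distinct one-hot vectors are orthogonal, so every pairwise similarity is 0
print(torch.cosine_similarity(little_brother, little_sister, dim=0))   # tensor(0.)
print(torch.cosine_similarity(little_brother, sky, dim=0))             # tensor(0.)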


So how do we obtain and use word vectors? The paper uses word vectors trained with the GloVe method; there are also word vectors trained with FastText, GoogleNews (word2vec) and other methods. However, the authors' experiments show that the differences are small, with the GloVe-trained vectors giving slightly better results as GCN input, so I ended up using the GloVe-trained word vectors.


The following code shows how to obtain and use the GloVe-trained word vectors, and finally save them as a .pkl file for direct use during training.



import pickle

import numpy as np
import torch
import torchtext.vocab as vocab


# Compute the cosine similarity between two word vectors
def Cos(x, y):
    cos = torch.matmul(x, y.view((-1,))) / (
        (torch.sum(x * x) + 1e-9).sqrt() * torch.sum(y * y).sqrt())
    return cos


if __name__ == '__main__':
    total = np.array([])
    # Select the set of pre-trained word vectors you need
    glove = vocab.GloVe(name="6B", dim=300)

    # No Finding
    # glove.stoi[word] returns the index of the word
    # glove.vectors[index] returns the word vector at that index
    no = glove.vectors[glove.stoi['no']]
    finding = glove.vectors[glove.stoi['finding']]
    no_finding = no + finding
    total = np.append(total, no_finding.numpy())

    # Lung Opacity
    lung = glove.vectors[glove.stoi['lung']]
    opacity = glove.vectors[glove.stoi['opacity']]
    lung_opacity = lung + opacity
    total = np.append(total, lung_opacity.numpy())

    # Atelectasis
    atelectasis = glove.vectors[glove.stoi['atelectasis']]
    total = np.append(total, atelectasis.numpy())

    # Fracture
    fracture = glove.vectors[glove.stoi['fracture']]
    total = np.append(total, fracture.numpy())

    # ... the remaining categories are built in the same way (omitted here)

    # CheXpert has 14 categories, so reshape to (14, 300)
    total = total.reshape(14, -1)
    # Save the word embedding of each category
    pickle.dump(total, open('./glove_wordEmbedding.pkl', 'wb'), pickle.HIGHEST_PROTOCOL)

    # Print a few cosine similarities as a sanity check
    print("No Finding vs Fracture cos sim:", Cos(no_finding, fracture))
    print("Lung Opacity vs Atelectasis cos sim:", Cos(lung_opacity, atelectasis))
    print("Lung Opacity vs Fracture cos sim:", Cos(lung_opacity, fracture))
    '''
    print:
    No Finding vs Fracture cos sim: tensor(0.0954)
    Lung Opacity vs Atelectasis cos sim: tensor(0.2576)
    Lung Opacity vs Fracture cos sim: tensor(0.2670)
    '''


From the cosine similarities we can see that the correlation between No Finding and Fracture is very small, while Lung Opacity shows a noticeably higher correlation with Atelectasis and Fracture, which is in line with common sense.
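During training, the saved embeddings can simply be loaded back and used as the initial GCN input; a minimal sketch:

import pickle

import torch

with open('./glove_wordEmbedding.pkl', 'rb') as f:
    word_embedding = pickle.load(f)                 # numpy array of shape (14, 300)
inp = torch.from_numpy(word_embedding).float()      # initial GCN input H^(0)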




2.2 The calculation process of the GCN network



The GCN layer update is:

H^(l+1) = h(Â · H^(l) · W^(l))

Here H^(l+1) is the output of the l-th GCN layer, Â is the correlation matrix after preprocessing (normalization), H^(l) is the input of the l-th layer (i.e. the output of layer l-1), and W^(l) (the transition matrix) is the learnable parameter of the l-th layer. As we said above, the initial input of the network, H^(0), is the word embedding obtained with GloVe. W is a learnable parameter that can simply be initialized randomly, and h(·) is the activation function, usually LeakyReLU. The remaining question is how to obtain Â. To get Â we first need A; the code for computing A and Â is shown below.



import pickle

import numpy as np
import torch


# gen_A() builds the correlation matrix A
def gen_A(num_classes, t, adj_file):
    result = pickle.load(open(adj_file, 'rb'))
    _adj = result['adj']
    _nums = result['nums']
    _nums = _nums[:, np.newaxis]
    # conditional probabilities, binarized with the threshold t
    _adj = _adj / _nums
    _adj[_adj < t] = 0
    _adj[_adj >= t] = 1

    # ps: this may not match the formula in the paper exactly, but it is what the
    # official code does, so I follow the code
    _adj = _adj * 0.25 / (_adj.sum(0, keepdims=True) + 1e-6)
    _adj = _adj + np.identity(num_classes, int)
    return _adj


# gen_adj() computes the normalized matrix Â from A
def gen_adj(A):
    D = torch.pow(A.sum(1).float(), -0.5)
    D = torch.diag(D)
    adj = torch.matmul(torch.matmul(A, D).t(), D)
    return adj



As the code above shows, the official repository already provides the functions for building the A matrix. We need to supply three arguments: num_classes, t and adj_file. num_classes is the number of our own categories, t is the binarization threshold (0.3 in the official code), and adj_file has to be generated by ourselves. The adj_file is a dictionary containing 'adj' and 'nums'; the figure below shows the format of the file generated from my local training set. I obtain my own adj_file by counting the number of co-occurrences of each pair of labels in the training data.



(Figure: the saved adj_file, a dict with the keys 'adj' and 'nums')

import pickle

import numpy as np

# load_data() and opt come from my own project: load_data reads the CSV,
# opt holds the configuration (paths, category list, ...)


# This is the code I wrote to generate the adj_file
def make_adj_file():
    # opt.train_csv is the absolute path of my own training set
    dataset = load_data(opt.train_csv)
    # opt.classes is the list of my category fields, 14 categories in total,
    # i.e. the first 5 fields such as Path, Sex and Age are dropped
    dataset = dataset[opt.classes].values
    # co-occurrence matrix, shape (14, 14)
    adj_matrix = np.zeros(shape=(len(opt.classes), len(opt.classes)))
    # total number of occurrences of each category, shape (14,)
    nums_matrix = np.zeros(shape=(len(opt.classes)))

    '''
    Algorithm pipeline: traverse each row of data and
    1. count the co-occurrences of each pair of labels in the row (a label is not
       counted with itself, so adj_matrix is a symmetric matrix with a zero diagonal)
    2. count the number of occurrences of each category
    '''
    for index in range(len(dataset)):
        data = dataset[index]
        for i in range(len(opt.classes)):
            if data[i] == 1:
                nums_matrix[i] += 1
                for j in range(len(opt.classes)):
                    if j != i:
                        if data[j] == 1:
                            adj_matrix[i][j] += 1

    adj = {'adj': adj_matrix,
           'nums': nums_matrix}
    pickle.dump(adj, open('./adj.pkl', 'wb'), pickle.HIGHEST_PROTOCOL)
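Putting the two snippets together, a minimal usage sketch (threshold 0.3 as in the official code, file names as above):

import torch

make_adj_file()                                     # writes ./adj.pkl from the training set
A = gen_A(num_classes=14, t=0.3, adj_file='./adj.pkl')
A = torch.from_numpy(A).float()
adj = gen_adj(A)                                    # the normalized matrix Â fed to the GCN layers
print(adj.shape)                                    # torch.Size([14, 14])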



At this point we have everything the GCN needs: the word embeddings (the input of the GCN network, taken from the GloVe pre-trained word vectors), A (from the gen_A() function), Â (from the gen_adj(A) function), and the transition matrix W (randomly initialized and learned during training).
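With these pieces in hand, one GCN layer is only a few lines. The sketch below follows the GraphConvolution module in the official model.py, slightly simplified (bias omitted, naive initialization):

import torch
import torch.nn as nn


class GraphConvolution(nn.Module):
    """One GCN layer: output = Â · input · W (the activation h() is applied outside)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # W: the learnable transition matrix, randomly initialized
        self.weight = nn.Parameter(torch.randn(in_features, out_features) * 0.01)

    def forward(self, inp, adj):
        support = torch.matmul(inp, self.weight)    # input · W
        return torch.matmul(adj, support)           # Â · (input · W)

Two such layers with a LeakyReLU in between (word embeddings of size 300 in, classifiers of dimension D out) reproduce the GCN branch described above.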


3. Add GCN to your network

To add GCN to our own network, we only need to merge the code from model.py in the GitHub repository into our own network code. The original paper uses ResNet for the CNN part, while I used DenseNet-121, so I made a small tweak. (I only used the GCN network structure code in model.py and the code in util.py that generates A and Â.)
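For reference, here is a hedged sketch of how the pieces can be wired together with DenseNet-121. The class and attribute names are my own illustration, not the official code, and it reuses the GraphConvolution layer sketched above (DenseNet-121's feature dimension is 1024, so the second GCN layer also outputs 1024):

import pickle

import torch
import torch.nn as nn
import torchvision.models as models


class GCNDenseNet(nn.Module):
    def __init__(self, adj, word_emb_file='./glove_wordEmbedding.pkl'):
        super().__init__()
        self.features = models.densenet121(pretrained=True).features    # CNN part
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gc1 = GraphConvolution(300, 1024)
        self.gc2 = GraphConvolution(1024, 1024)      # 1024 = DenseNet-121 feature dimension
        self.relu = nn.LeakyReLU(0.2)
        emb = pickle.load(open(word_emb_file, 'rb'))
        self.register_buffer('inp', torch.from_numpy(emb).float())      # (14, 300) word embeddings
        self.register_buffer('adj', adj)                                 # (14, 14) Â from gen_adj()

    def forward(self, x):
        feature = self.pool(self.features(x)).flatten(1)                 # (N, 1024) image features
        w = self.gc2(self.relu(self.gc1(self.inp, self.adj)), self.adj)  # (14, 1024) classifiers
        return torch.matmul(feature, w.t())                              # (N, 14) class scores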


 
 
 
