Lessons learned from reproducing ResNet and DenseNet on the CIFAR-10 dataset
Deep convolutional neural networks (DCNNs) have shown promising success in image classification, and two recently proposed DCNNs have gained great reputations very quickly. The Residual Network (ResNet), proposed in 2015, reported outstanding classification accuracy and remains one of the best state-of-the-art classifiers on the CIFAR-10 dataset, achieving a small error rate of 6.97% with 56 layers. The Densely Connected Network (DenseNet), proposed in 2016, reduced that error rate by a further 2+% with 110 layers.
As these two papers are written very precisely and there are quite a few existing implementations of both, I thought it would be straightforward to reproduce their results. However, a few tricks kept me from reaching the expected accuracy, and it took me quite a lot of time to figure them out. In this article, the lessons I learned are organized into three aspects: data augmentation, the optimizer used to train the networks, and normalization.
Both papers indicate that basic data augmentation is used: the 32*32 images are first resized to 36*36 and randomly cropped back to 32*32, and the cropped images are then randomly flipped horizontally. However, the size of the augmented dataset is not specified, i.e. how many times the original amount of data is generated by the augmentation.
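This basic augmentation can be sketched in NumPy. A minimal sketch: I approximate the 36*36 enlargement with a 2-pixel reflect padding, so the padding mode, the helper name `augment`, and its defaults are my own choices for illustration, not taken from the papers:

```python
import numpy as np

def augment(img, rng, out_size=32, pad=2):
    # Pad the 32x32 image by 2 pixels on each side (a stand-in for the
    # 36x36 enlargement described above), then take a random 32x32 crop.
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, padded.shape[0] - out_size + 1)
    left = rng.integers(0, padded.shape[1] - out_size + 1)
    crop = padded[top:top + out_size, left:left + out_size]
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))   # a dummy CIFAR-sized image
aug = augment(img, rng)
```

Generating the augmented dataset then just means applying `augment` to each training image the desired number of times (2x, 3x, ...).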
At the beginning, I assumed that the method used in the papers augmented the data to 100,000 images, twice the original size of 50,000, but the best accuracy of ResNet-56 (ResNet with a depth of 56) that I could achieve was around 90.38%, more than 2% worse than the accuracy reported in the paper. I therefore augmented the data to three times the original size, which gave roughly a 2% increase in accuracy.
Fig. 1 shows the test error, training error and training loss across 300 epochs. For both 2x and 3x data augmentation, the training loss and training error follow similar curves, but the test error of 3x augmentation is consistently lower than that of 2x augmentation, demonstrating that 3x augmentation works better. With 3x augmentation, a test accuracy of 92.58% is achieved, very close to the accuracy reported in the paper. For DenseNet, the same change improved the test accuracy from 93.20% to 94.48%. To sum up, both ResNet and DenseNet achieved a clear improvement with 3x data augmentation over the 2x counterpart.
Fig. 1: Comparison of the results of ResNet-56 with 2X and 3X data augmentation. Red lines: the results from 2X augmentation; Blue lines: the results from 3X augmentation.
The optimizer used to train neural networks
In my previous experience with deep neural networks, I had used the Adam optimizer and it worked very well on most occasions, so I assumed that using Adam would not make a big difference from using Stochastic Gradient Descent (SGD), which both the ResNet and DenseNet papers use. That assumption turned out to be a disaster: using SGD with the scheduled learning rate from the papers gives a much better result than Adam for both ResNet and DenseNet.
As shown in Fig. 2, with the Adam optimizer and no manual intervention, training did not converge to zero training loss. By contrast, using SGD with a scheduled learning rate, starting at 0.1, divided by 10 at epoch 90 and by 10 again at epoch 135, the training loss reached approximately zero at around epoch 100, which in turn produced a smaller test error. The same pattern, an error-rate decrease of more than 3%, was observed for DenseNet, so it can be concluded that SGD with a scheduled learning rate plays an important role in reproducing the results of ResNet and DenseNet.
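The step schedule described above (start at 0.1, divide by 10 at epochs 90 and 135) can be written as a small helper; the function name and defaults below are my own, chosen to match the values from the text:

```python
def scheduled_lr(epoch, base_lr=0.1, milestones=(90, 135), gamma=0.1):
    # Multiply the base learning rate by gamma (i.e. divide by 10)
    # once for each milestone epoch that has been passed.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

At each epoch, this value would be assigned to the SGD optimizer before the training step; most frameworks also ship an equivalent built-in step scheduler.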
Fig. 2: Comparison of the results of ResNet-56 obtained by using SGD and Adam Optimizer. Red lines: the results obtained by using Adam Optimizer; Blue lines: the results obtained by using SGD.
The two papers use different normalization methods: per-pixel mean subtraction for ResNet and Global Contrast Normalization (GCN) for DenseNet. To compare the normalization strategies, three experiments were run: no normalization at all, per-pixel mean subtraction, and GCN. No significant difference was observed. I believe the reason is that both ResNet and DenseNet apply batch normalization throughout their layers, so it makes sense that input normalization does not noticeably improve the performance of either network.
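Both normalization methods are simple to state in NumPy. A minimal sketch, assuming images stored as float arrays of shape (N, 32, 32, 3); the function names and the epsilon are my own choices:

```python
import numpy as np

def per_pixel_mean_subtract(train, test):
    # Subtract the mean over the training set at each pixel position
    # and channel; the same training mean is applied to the test set.
    mean = train.mean(axis=0)
    return train - mean, test - mean

def global_contrast_normalize(x, eps=1e-8):
    # Normalize each image individually to zero mean and unit
    # standard deviation (a simple form of GCN).
    flat = x.reshape(x.shape[0], -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    flat = flat / (flat.std(axis=1, keepdims=True) + eps)
    return flat.reshape(x.shape)

rng = np.random.default_rng(1)
train = rng.random((10, 32, 32, 3))
test = rng.random((4, 32, 32, 3))
train_c, test_c = per_pixel_mean_subtract(train, test)
gcn = global_contrast_normalize(train)
```
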
A methodology of gradually improving a CNN
During the process of reproducing the results of ResNet and DenseNet, a methodology for gradually improving a CNN emerged, which can be observed in Fig. 3 to Fig. 5.
The first attempt at improving a CNN can be data augmentation. In Fig. 3, the CNN is first trained on the original dataset without any regularization; the test error rate and training loss are plotted as the two red lines. As the right panel of Fig. 3 shows, training converged quite quickly, so the training itself is complete, and the larger test error is most likely caused by over-fitting. The most straightforward way to address over-fitting is data augmentation, and the blue line in the left panel of Fig. 3 shows the resulting improvement in classification accuracy, confirming that data augmentation is effective at improving accuracy.
Fig. 3: Comparison of the results obtained by using original data and augmented data. Red lines: the results from using original data; Blue lines: the results from using augmented data.
The second bid to improve the performance of the CNN could be applying regularization. Looking at the left panel of Fig. 4, the test error rate is not reduced by regularization alone. However, the right panel shows that the training loss (blue line) is still much bigger than zero, which suggests that if the training loss can be driven down, the test error may decrease further.
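The regularization used here is the L2 norm, which for plain SGD amounts to adding a penalty term proportional to the weights to the gradient (weight decay). A minimal single-step sketch; the helper name and the coefficient 5e-4 are illustrative choices, not values taken from the papers:

```python
import numpy as np

def sgd_step_with_l2(w, grad, lr=0.1, weight_decay=5e-4):
    # L2 regularization adds weight_decay * w to the data gradient,
    # which for vanilla SGD is equivalent to weight decay.
    return w - lr * (grad + weight_decay * w)

w = np.ones(3)
# With a zero data gradient, the penalty alone shrinks the weights.
w_next = sgd_step_with_l2(w, np.zeros(3))
```
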
Fig. 4: Comparison of the results obtained by using regularization or not using regularization. Red lines: the results without regularization; Blue lines: the results obtained by applying L2 norm regularization.
The objective of reducing the training loss can be accomplished by tuning the learning rate: start with an initial learning rate and divide it by 10 after every certain number of epochs. In the example of Fig. 5, the learning rate was divided by 10 after 90 epochs; at around epoch 100 the training loss plunges to almost zero, just as expected, and at the same time the test error rate falls steeply.
Fig. 5: Comparison of the results obtained by tuning and not tuning the learning rate. Red lines: the results without tuning; Blue lines: the results with tuning.
The above three steps can be recursively performed until the acceptable error rate is achieved.
There are a number of lessons to be learned from reproducing the results of ResNet and DenseNet. First of all, when reproducing the results of any paper, we need to read the paper carefully and extract every detail and parameter of the implementation. Taking one of my mistakes as an example, the Adam optimizer I used was not the optimizer used in the papers, so the expected results could not be reached. Secondly, data augmentation and SGD with a scheduled learning rate are two effective ways to improve the classification accuracy of CNNs. Last but not least, a methodology for improving CNNs was learned by observing the process of improving ResNet and DenseNet.
 He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015).
Huang, Gao, et al. "Densely Connected Convolutional Networks." arXiv preprint arXiv:1608.06993 (2016).
About the Author: 王斌 Bin Wang is Head of the Engineering team at Arcanum. He is also completing a PhD in Computer Science at Victoria University in Wellington. Bin has written a number of successful posts on Medium.