This post summarizes the paper "MobiFace: A Lightweight Deep Learning Face Recognition on Mobile Devices", published in November 2018. The authors are from CMU and the University of Arkansas (UARK).
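With the spread of deep convolutional neural networks (DCNNs), fields such as object detection and segmentation have made considerable progress. Behind the high accuracy, however, lie large parameter counts and heavy computation: AlexNet needs 61 million parameters, VGG16 138 million, ResNet-50 25 million, and DenseNet-190 (k=40) 40 million. Although these networks are no longer considered very deep by today's standards, they still require between 200 MB and 500 MB of memory, so such models usually cannot be deployed on mobile or embedded devices. Many compressed models have therefore been proposed recently for image classification and object detection, including pruning [13,14,32], depthwise convolution [18,38], binary networks [3,4,22,36], and mimic networks [31,44]. These networks speed up inference without losing much accuracy. However, they have not been applied to face recognition. Compared with object detection and classification, face recognition usually needs a certain number of layers to extract sufficiently robust and discriminative features, since all faces share the same template (two eyes, one mouth).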
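The authors propose a lightweight yet high-performance deep neural network so that face recognition can be deployed on mobile devices. Compared with other networks, MobiFace has the following advantages:
- It makes the MobileNet architecture even lighter, so the proposed MobiFace model deploys well on mobile devices;
- The proposed MobiFace can be optimized end-to-end;
- MobiFace is compared against mobile-oriented networks and large-scale deep networks on face recognition datasets.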
Many designs for lightweight deep networks exist so far, such as binarized networks, quantized networks, mimicked networks, designed compact modules, and pruned networks. This post focuses on the last two.
Designed compact modules
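Integrating small models or compact modules and layers reduces the number of weights, which helps cut memory use and inference time. MobileNet replaces the conventional convolution layer with a depthwise separable convolution module, which markedly reduces the parameter count. Depthwise convolution first appeared in Sifre's work [41] and was later used in [2,18,38]. In MobileNet [18], the spatial input is convolved with a 3x3 per-channel (depthwise) spatial filter to produce independent features, followed by a pointwise (1x1) convolution that combines them into new features. Replacing conventional convolution with this scheme leaves MobileNet with only 4.2 million parameters and 569 million MAdds, and it reaches 70.6% accuracy on ImageNet (VGG16 reaches 71.5%). To improve MobileNet's performance across multiple tasks and benchmarks, Sandler et al. proposed inverted residuals and linear bottlenecks, known as MobileNet-v2. An inverted residual is similar to the residual bottleneck in [16], except that the intermediate features can be expanded to a specific ratio of the number of input channels. A linear bottleneck is a block without a ReLU layer. MobileNet-v2 raises the accuracy to 72% while needing only 3.4 million parameters and 300 million MAdds. Although depthwise separable convolution has proven effective, [18,38] still take up considerable memory and compute on iPhone and Android devices. At the time the paper was written, the authors also had not found a good CPU implementation of depthwise convolution in any framework (TensorFlow, PyTorch, Caffe, MXNet). To reduce MobileNet's computation, FD-MobileNet introduces a fast downsampling strategy. Inspired by the MobileNet-v2 architecture, MobileFaceNet adopts a similar network structure and replaces the global average pooling layer with a global depthwise convolution layer to reduce the parameter count.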
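To make the saving from depthwise separable convolution concrete, the sketch below counts parameters and multiply-adds for a single layer. It is an illustration under stated assumptions, not a figure from the paper: the layer size (56x56, 128 to 128 channels) is hypothetical, stride is 1 with "same" padding, and biases and BN are ignored.

```python
def standard_conv_cost(h, w, c_in, c_out, k=3):
    """Parameters and multiply-adds of a k x k standard convolution.

    Assumes stride 1 with 'same' padding (output is h x w) and no bias.
    """
    params = k * k * c_in * c_out
    madds = h * w * params
    return params, madds


def depthwise_separable_cost(h, w, c_in, c_out, k=3):
    """Parameters and multiply-adds of a depthwise k x k + pointwise 1 x 1 pair."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    params = depthwise + pointwise
    madds = h * w * params
    return params, madds


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 128 -> 128 channels.
    std_params, std_madds = standard_conv_cost(56, 56, 128, 128)
    sep_params, sep_madds = depthwise_separable_cost(56, 56, 128, 128)
    print("standard:  params =", std_params, "madds =", std_madds)
    print("separable: params =", sep_params, "madds =", sep_madds)
    print("reduction factor =", round(std_params / sep_params, 2))
```

For a 3x3 kernel the reduction factor approaches 9 as the number of output channels grows, because the pointwise term dominates the separable layer's cost.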
Pruned networks
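DNNs have long been burdened by huge parameter counts and heavy memory consumption. [14] proposes a deep compression model that prunes unimportant connections by their absolute value, cutting the parameter counts of AlexNet and VGG16 by 9x and 13x with little loss in accuracy. [32] slims networks using the scaling factors in BN (rather than the absolute values of the weights); these scaling factors are trained to be sparse with an L1 penalty. On VGG16, DenseNet, and ResNet, the slimmed networks [32] achieve better accuracy on the CIFAR datasets than the original networks. However, the indices of the connections in a pruned network must be stored in memory, which slows down both training and testing.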
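The following minimal sketch illustrates magnitude-based pruning and the index overhead mentioned above. It is a one-shot simplification (the method in [14] also retrains after pruning), and the weight values and the helper names `magnitude_prune` and `to_sparse` are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def to_sparse(weights):
    """Store only surviving weights; each one needs its index alongside it."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


if __name__ == "__main__":
    # Hypothetical weights of one small layer.
    weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9]
    pruned = magnitude_prune(weights, sparsity=0.5)
    print(pruned)
    print(to_sparse(pruned))
```

The sparse form keeps half as many values but pairs each with an index, which is the extra bookkeeping the text refers to.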
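Bottleneck residual block with expansion layers
[37] introduces the bottleneck residual block, which contains three main transformations, two linear transformations and one non-linear per-channel transformation, with the following properties:
- the non-linear transformation learns complex mapping functions;
- the number of feature maps is increased in the inner layers;
- the residual is learned through a shortcut connection.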
Given an input \(\mathbf{x}\) of size \(w\times h\times k\), a bottleneck residual block can be expressed as:
\[F(\mathbf{x})=[F_1\cdot F_2 \cdot F_3](\mathbf{x})\]
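where \(F_1:R^{w\times h\times k}\mapsto R^{w\times h\times tk}\) and \(F_3:R^{\frac{w}{s}\times \frac{h}{s}\times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times k_1}\) are both linear functions implemented by 1x1 convolutions, and t is the expansion factor. \(F_2:R^{w\times h \times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times tk}\) is a non-linear mapping implemented as a composition of three operations: ReLU, a 3x3 depthwise convolution (stride=s), and ReLU. The three maps are applied in the order \(F_1\), \(F_2\), \(F_3\), so the input of \(F_3\) is the output of \(F_2\).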
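The bottleneck block uses a residual learning connection to prevent manifold collapse during the transformation and to increase the representational power of the feature embedding [37].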
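As a rough check of the shapes above, the sketch below traces a tensor of size w x h x k through the expansion, depthwise, and projection steps and counts the convolution weights. The concrete numbers (56x56x64 input, t=2, s=2, k1=128) are hypothetical, biases and BN are ignored, and the 3x3 depthwise convolution is assumed to use padding 1.

```python
def ceil_div(a, b):
    """Integer ceiling division, the output size of a padded strided 3x3 conv."""
    return -(-a // b)


def bottleneck_block(w, h, k, t, s, k1):
    """Return output shapes and weight counts of the three transformations."""
    tk = t * k
    shapes = {
        "F1": (w, h, tk),                            # 1x1 expansion
        "F2": (ceil_div(w, s), ceil_div(h, s), tk),  # 3x3 depthwise, stride s
        "F3": (ceil_div(w, s), ceil_div(h, s), k1),  # 1x1 linear projection
    }
    params = {
        "F1": k * tk,    # 1x1 conv: k inputs -> tk outputs
        "F2": 9 * tk,    # one 3x3 filter per channel
        "F3": tk * k1,   # 1x1 conv: tk inputs -> k1 outputs
    }
    # A shortcut is only shape-compatible when input and output sizes match.
    can_add_shortcut = (s == 1 and k1 == k)
    return shapes, params, can_add_shortcut


if __name__ == "__main__":
    shapes, params, shortcut = bottleneck_block(56, 56, 64, t=2, s=2, k1=128)
    print(shapes)
    print(params, "total =", sum(params.values()))
    print("shortcut possible:", shortcut)
```

With stride 1 and an unchanged channel count the same function reports that a shortcut can be added, which matches the stride-1 residual variant of the block.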
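Fast downsampling
With limited computing resources, a compact network should maximize the information carried from the input image to the output features while avoiding expensive computation, such as feature maps with large spatial dimensions (resolution). In large-scale deep networks, information flow is maintained with a slow downsampling strategy, in which the spatial dimensions shrink gradually from layer to layer. A lightweight network cannot afford this.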
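Fast downsampling means applying downsampling steps consecutively at the very beginning of the feature embedding process, to avoid feature maps with large spatial dimensions, and then adding more feature maps in the later stages to keep the information flowing. Note that although adding feature maps raises the number of channels, the resolution of the feature maps is already small, so the extra computational cost is modest.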
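The cost argument can be illustrated with a small calculation. The sketch below compares two hypothetical stage schedules, one that shrinks resolution slowly and one that shrinks it early, using one depthwise separable layer per stage with equal input and output channels. Neither schedule is taken from the paper; they only show how resolution dominates the multiply-add count.

```python
def stage_madds(size, channels):
    """Multiply-adds of one 3x3 depthwise + 1x1 pointwise layer.

    Assumes a square size x size feature map, stride 1,
    and the same number of input and output channels.
    """
    per_position = 9 * channels + channels * channels
    return size * size * per_position


def schedule_madds(schedule):
    """Total multiply-adds over a list of (resolution, channels) stages."""
    return sum(stage_madds(size, channels) for size, channels in schedule)


# Hypothetical schedules with identical channel widths per stage.
SLOW = [(112, 64), (56, 128), (28, 256), (14, 512)]
FAST = [(28, 64), (14, 128), (14, 256), (7, 512)]

if __name__ == "__main__":
    slow, fast = schedule_madds(SLOW), schedule_madds(FAST)
    print("slow downsampling:", slow)
    print("fast downsampling:", fast)
    print("ratio:", round(slow / fast, 2))
```

Both schedules end with the same 512 channels, yet the fast one needs far fewer multiply-adds because its wide layers run on small feature maps.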
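The MobiFace network: given an input face image of size 112x112x3, this lightweight network aims to maximize information flow while reducing computation. Based on the analysis above, the residual bottleneck block with expansion layers serves as the building block of MobiFace. Table 1 gives the main structure of MobiFace, which consists of: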
- a 3x3 convolution layer;
- a 3x3 depthwise separable convolutional layer;
- a series of bottleneck blocks and residual bottleneck blocks;
- a 1x1 convolution layer;
- a fully connected layer.
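A residual bottleneck block is much like a bottleneck block, except that it adds a shortcut connecting the input and output of the 1×1 convolution layers. In addition, bottleneck blocks use stride=2, whereas every layer in a residual bottleneck block uses stride=1.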
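MobiFace adopts the fast downsampling strategy to shrink the spatial dimensions of its layers/blocks quickly. The 112x112x3 input is halved in spatial size within the first two layers and reduced by a further 8x within the following 7 bottleneck blocks. The expansion factor stays at 2, while the number of channels doubles after each bottleneck block.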
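The resolution trace described above can be reproduced with the usual convolution output-size formula. The exact per-layer strides are given in Table 1 of the paper, which is not reproduced here, so the stride lists below are assumptions chosen only to match the stated reductions (2x in the first two layers, then 8x over seven blocks).

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resolution(size, strides):
    """Return the spatial size before and after each layer in turn."""
    sizes = [size]
    for stride in strides:
        size = conv_out_size(size, stride=stride)
        sizes.append(size)
    return sizes


# Assumed strides, not taken from the paper's table.
STEM_STRIDES = [2, 1]                   # first two layers: 112 -> 56
BLOCK_STRIDES = [2, 1, 2, 1, 2, 1, 1]   # seven blocks: 56 -> 7

if __name__ == "__main__":
    stem = trace_resolution(112, STEM_STRIDES)
    blocks = trace_resolution(stem[-1], BLOCK_STRIDES)
    print("stem:  ", stem)
    print("blocks:", blocks)
```

Three stride-2 steps are enough to produce the 8x reduction, which leaves a 7x7 feature map for the final layers under these assumptions.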
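Except for the convolution layers marked "linear", every convolution layer is followed by BN and a non-linear activation. The paper mainly uses PReLU rather than ReLU. In its last layer, MobiFace uses a fully connected layer instead of global average pooling: global average pooling treats every unit equally (whereas units in the central region matter more than those near the edges), while an FC layer can learn different weights for different units and thus embed this information in the final feature vector.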
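The difference between global average pooling and a learned weighting can be seen on a toy 3x3 feature map. This is a single-channel, single-output simplification of an FC layer, and the weight values are hypothetical (hand-picked to favor the center, not learned).

```python
def global_average_pool(fmap):
    """Average every position with the same weight 1/N."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def weighted_pool(fmap, weights):
    """Weighted sum over positions, as one output unit of an FC layer would do."""
    return sum(
        v * w
        for f_row, w_row in zip(fmap, weights)
        for v, w in zip(f_row, w_row)
    )


# Hypothetical weights that emphasize the central position (they sum to 1).
CENTER_WEIGHTS = [
    [0.05, 0.10, 0.05],
    [0.10, 0.40, 0.10],
    [0.05, 0.10, 0.05],
]

if __name__ == "__main__":
    center_response = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    corner_response = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]
    # Average pooling cannot tell the two maps apart.
    print(global_average_pool(center_response), global_average_pool(corner_response))
    # Position-dependent weights give the central response a larger output.
    print(weighted_pool(center_response, CENTER_WEIGHTS),
          weighted_pool(corner_response, CENTER_WEIGHTS))
```

Average pooling is the special case where every weight equals 1/N, so a layer with free weights can represent it but can also learn to favor the face center.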
The model is first trained on the refined MS-Celeb-1M dataset (3.8 million images, 85K identities) and then evaluated on the LFW and MegaFace datasets.
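In preprocessing, the MTCNN model is used for face detection and 5-landmark detection, and faces are aligned to 112x112x3. Pixels are then normalized by subtracting 127.5 and dividing by 128. Training uses SGD with a batch size of 1024 and momentum 0.9. The learning rate is divided by 10 at 40K, 60K, and 80K iterations, for 100K iterations in total.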
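The normalization and the step learning-rate schedule described above can be sketched in a few lines. The initial learning rate is not stated in this summary, so `base_lr=0.1` is an assumed placeholder; the milestones and the divide-by-10 factor follow the text.

```python
def normalize_pixel(value):
    """Map an 8-bit pixel value from [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0


def learning_rate(iteration, base_lr=0.1,
                  milestones=(40_000, 60_000, 80_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone passed.

    base_lr is an assumed value; the milestones follow the text.
    """
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * factor ** drops


if __name__ == "__main__":
    print([normalize_pixel(v) for v in (0, 127.5, 255)])
    for it in (0, 39_999, 40_000, 60_000, 80_000, 99_999):
        print(it, learning_rate(it))
```

In a framework such as PyTorch the same schedule would typically be expressed with a multi-step scheduler, but the plain function keeps the arithmetic visible.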
Table 2 gives the benchmark results on LFW.
References:
[1] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573, 2018.
[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.
[3] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
[4] M. Courbariaux, Y. Bengio, and J. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
[5] J. Deng, W. Dong, R. Socher, L. jia Li, K. Li, and L. Fei-fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[6] C. N. Duong, K. Luu, K. Quach, and T. Bui. Beyond principal components: Deep boltzmann machines for face modeling. In CVPR, 2015.
[7] C. N. Duong, K. Luu, K. Quach, and T. Bui. Longitudinal face modeling via temporal deep restricted boltzmann machines. In CVPR, 2016.
[8] C. N. Duong, K. Luu, K. Quach, and T. Bui. Deep appearance models: A deep boltzmann machine approach for face modeling. Intl Journal of Computer Vision (IJCV), 2018.
[9] C. N. Duong, K. G. Quach, K. Luu, T. H. N. Le, and M. Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. In ICCV, 2017.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014.
[11] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[12] M. S. H. N. Le, R. Gummadi. Deep recurrent level set for segmenting brain tumors. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 646–653. Springer, 2018.
[13] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
[14] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
[21] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: detection, alignment, and recognition, 2008.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456. JMLR.org, 2015.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678. ACM, 2014.
[25] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[28] H. N. Le, C. N. Duong, K. Luu, and M. Savvides. Deep contextual recurrent residual networks for scene labeling. In Journal of Pattern Recognition, 2018.
[29] H. N. Le, K. G. Quach, K. Luu, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. In Trans. on Image Processing (TIP), 2018.
[30] H. N. Le, C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Robust hand detection in vehicles. In Intl. Conf. on Pattern Recognition (ICPR), 2016.
[31] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. 2017 IEEE Conference on CVPR, pages 7341–7349, 2017.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR.
[34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[35] Z. Qin, Z. Zhang, X. Chen, C. Wang, and Y. Peng. Fd-mobilenet: Improved mobilenet with a fast downsampling strategy. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1363–1367. IEEE, 2018.
[36] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[38] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[39] M. W. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, 2007.
[40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[41] L. Sifre. Rigid-motion scattering for image classification, 2014.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[43] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] Y. Wei, X. Pan, H. Qin, and J. Yan. Quantization mimic: Towards very tiny cnn for object detection. CoRR, abs/1805.02152, 2018.
[45] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[46] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[47] Y. Zheng, C. Zhu, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Towards a deep learning framework for unconstrained face detection. In BTAS, 2016.
[48] C. Zhu, Y. Ran, K. Luu, and M. Savvides. Seeing small faces from robust anchor's perspective. In CVPR, 2018.
[49] C. Zhu, Y. Zheng, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Weakly supervised facial analysis with dense hyper-column features. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2016.
[50] C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Enhancing interior and exterior deep facial features for face detection in the wild. In Intl Conf. on Automatic Face and Gesture Recognition (FG), 2018.