Scene Segmentation: MIT Scene Parsing and DilatedNet (Dilated Convolution Networks)

Date: 2019-12-14


Introduction to the MIT Scene Parsing Benchmark

Scene parsing means segmenting an image into regions associated with semantic categories such as sky, road, person, and bed. The MIT Scene Parsing Benchmark (SceneParse150) provides a standard training and evaluation platform for scene parsing algorithms. The data for this benchmark comes from the ADE20K Dataset, which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the benchmark is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included for evaluation, covering stuff classes such as sky, road, and grass as well as discrete objects such as person, car, and bed. Note that objects are non-uniformly distributed across the images, mimicking the natural object occurrence of everyday scenes.

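The benchmark contains 150 object categories. These include typically amorphous classes such as walls, water, floors, and roads, as well as common indoor objects such as windows, tables, chairs, beds, and cups, both attached and free-standing. Together they cover most of the categories in the COCO dataset.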

Home page: http://sceneparsing.csail.mit.edu/

Pretrained models: http://sceneparsing.csail.mit.edu/model/

Model Zoo: https://github.com/CSAILVision/sceneparsing/wiki/Model-Zoo

Some state-of-the-art results: https://drive.google.com/drive/folders/0B9CKOTmy0DyaQ2oxUHdtYUd2Mm8?usp=sharing

Challenge results: http://placeschallenge.csail.mit.edu/results_challenge.html (Face++ is currently ranked first)

1. FCN and deconvolution networks

One use of deconvolution (deconv) is upsampling, i.e. enlarging the image. Dilated convolution does not upsample; it enlarges the receptive field instead.

Reference: "How to understand deconvolution layers in deep learning"

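(1) Stride s > 1: the convolution downsamples at the same time, so the output is smaller than the input.

(2) s = 1: an ordinary convolution with stride 1. For example, with padding=SAME in TensorFlow, the input and output of the convolution have the same size.

(3) 0 < s < 1: a fractionally strided convolution, which is equivalent to upsampling the image. For example, s = 0.5 means inserting one blank pixel between every pair of neighbouring pixels and then convolving with stride 1, so the resulting feature map is twice as large.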
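
The three stride cases can be checked on a 1-D signal. The following is a minimal pure-Python sketch and is not taken from the original post: the helper names `conv1d_same` and `insert_zeros` are hypothetical, and zero "same" padding with an odd-length kernel is assumed.

```python
def conv1d_same(signal, kernel, stride=1):
    """Cross-correlate `signal` with an odd-length `kernel` using zero 'same' padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for start in range(0, len(signal), stride):
        out.append(sum(w * padded[start + j] for j, w in enumerate(kernel)))
    return out


def insert_zeros(signal, factor=2):
    """Place (factor - 1) zeros after each sample (the 'blank pixels' of case 3)."""
    out = []
    for x in signal:
        out.append(x)
        out.extend([0.0] * (factor - 1))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.5, 1.0, 0.5]

down = conv1d_same(signal, kernel, stride=2)                  # case (1): s > 1
same = conv1d_same(signal, kernel, stride=1)                  # case (2): s = 1
up = conv1d_same(insert_zeros(signal, 2), kernel, stride=1)   # case (3): s = 0.5

print(len(down), len(same), len(up))  # 4 8 16
```

With this particular kernel the upsampled output is a linear interpolation of the input (`up` starts 1.0, 1.5, 2.0), which shows why a fractionally strided convolution acts as learned upsampling.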

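Dilated convolution, by contrast, does not insert blank pixels between the existing pixels. It either skips some of the existing pixels or, equivalently, leaves the input unchanged and inserts zero weights into the convolution kernel, so that a single convolution covers a larger spatial extent.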
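
The two views of dilated convolution (skipping input samples versus inserting zero weights into the kernel) produce the same output. A minimal 1-D sketch in pure Python, assuming "valid" convolution with no padding; the helper names `dilated_conv1d` and `dilate_kernel` are hypothetical and chosen only for illustration:

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid cross-correlation where kernel taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def dilate_kernel(kernel, dilation=1):
    """Insert (dilation - 1) zero weights between neighbouring kernel taps."""
    out = []
    for j, w in enumerate(kernel):
        if j > 0:
            out.extend([0.0] * (dilation - 1))
        out.append(w)
    return out


signal = [float(v * v) for v in range(10)]
kernel = [1.0, -2.0, 1.0]

# View 1: keep the 3-tap kernel and skip every other input sample.
skipped = dilated_conv1d(signal, kernel, dilation=2)
# View 2: keep the input dense and convolve with the zero-stuffed 5-tap kernel.
stuffed = dilated_conv1d(signal, dilate_kernel(kernel, 2), dilation=1)

print(dilate_kernel(kernel, 2))  # [1.0, 0.0, -2.0, 0.0, 1.0]
print(skipped == stuffed)        # True
```

Note that the number of learned weights stays at 3 in both views; only the span of input covered by one output grows from 3 to 5 samples.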

2. Dilated (atrous) convolution

Dilated convolution is also known as atrous convolution (in Chinese, 空洞卷积 or 扩张卷积).

Reference: "How to understand dilated convolution networks?" The following passage is excerpted from that article.

Reference: Multi-scale context aggregation by dilated convolutions

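Background: in image segmentation, an image is fed into a CNN (a typical example being FCN [3]). Like a conventional CNN, FCN first convolves and then pools, which shrinks the image while enlarging the receptive field. Because segmentation is a pixel-wise prediction, the smaller feature map left after pooling has to be upsampled back to the original image size before predicting (upsampling is usually done with a deconvolution layer; see the Zhihu answer "How to understand deconvolution networks in deep learning?"), and the earlier pooling lets each pixel's prediction see a larger receptive field. FCN-style segmentation therefore relies on two key steps: pooling, which shrinks the image and enlarges the receptive field, and upsampling, which restores the image size. Some information is inevitably lost in shrinking and then re-enlarging. Can we design an operation that obtains a large receptive field without pooling? The answer is dilated convolution.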

Below is the illustration from the original dilated convolution paper [4]:

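Panel (a) shows a 3x3 1-dilated convolution, which is the same as an ordinary convolution. Panel (b) shows a 3x3 2-dilated convolution: the kernel still has 3x3 weights, but with a hole of 1 between neighbouring taps, so only the 9 red points in the patch are multiplied by the kernel and the remaining points are skipped. Equivalently, it is a 5x5 kernel in which only those 9 weights are non-zero. The 7x7 patch drawn in the figure is the receptive field obtained when this 2-dilated layer follows a 1-dilated layer: each red point is then itself the output of a 1-dilated convolution with a 3x3 receptive field, so the two layers together cover 7x7 even though each kernel has only 3x3 weights. Panel (c) shows a 4-dilated convolution; placed after the 1-dilated and 2-dilated layers, it reaches a 15x15 receptive field. By comparison, three ordinary 3x3 convolutions with stride 1 only reach a receptive field of (kernel-1)*layer+1 = 7, which grows linearly with the number of layers, whereas the receptive field of stacked dilated convolutions grows exponentially.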
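
The receptive-field figures quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes stride-1 layers, where each layer with kernel size k and dilation d widens the receptive field by (k-1)*d; the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (one side, in pixels) after each stacked stride-1 conv layer."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * d pixels
        sizes.append(rf)
    return sizes


print(receptive_field(3, [1, 2, 4]))  # dilated stack:  [3, 7, 15]
print(receptive_field(3, [1, 1, 1]))  # ordinary stack: [3, 5, 7]
```

Doubling the dilation at every layer roughly doubles the receptive field each time, while the ordinary stack gains only 2 pixels per layer.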

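The benefit of dilated convolution is that it enlarges the receptive field without the information loss caused by pooling, so every convolution output carries information from a larger region. It works well for problems where images need global context or where speech and text depend on long sequences, for example image segmentation [3], speech synthesis with WaveNet [2], and machine translation with ByteNet [1].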

The network can be thought of as a pooling-layer interpolation network.

Reference: Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).

Some results obtained with the pretrained model:

The pretrained model does not perform very well; the top-ranked models from the challenge should be used instead.