Feature Pyramid Networks for Object Detection — Reading Notes

Figure 1. Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.

The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multiscale, pyramidal shape.

This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths.

Figure 2. Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.

But to avoid using low-level features, SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then adds several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all
scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale.

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). In contrast, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

Related Work

Methods using multiple layers
There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

Feature Pyramid Networks

Figure 3. A building block illustrating the lateral connection and the top-down pathway, merged by addition.

This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16].

Bottom-up pathway. The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many
layers producing output maps of the same size and we say
these layers are in the same network stage. For our feature
pyramid, we define one pyramid level for each stage. We
choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create
our pyramid. This choice is natural since the deepest layer
of each stage should have the strongest features.
Specifically, for ResNets [16] we use the feature activations output by each stage's last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for the conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 in the pyramid due to its large memory footprint.
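As a small illustration (a hypothetical helper, not code from the paper), the strides quoted above directly determine the spatial sizes of the reference maps {C2, ..., C5} for a given input:

```python
# Sketch: spatial sizes of the reference feature maps {C2, C3, C4, C5}
# given the strides {4, 8, 16, 32} with respect to the input image.
def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Return {name: (h, w)} for C2..C5 for an input of the given size."""
    return {
        f"C{i + 2}": (height // s, width // s)
        for i, s in enumerate(strides)
    }

sizes = feature_map_sizes(800, 1024)
# C2 is the highest-resolution map (stride 4); C5 is the coarsest (stride 32).
```

For an 800×1024 input this gives C2 at 200×256 down to C5 at 25×32, which is why skipping conv1 (stride 2, i.e. a 400×512 map here) saves so much memory.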

Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by
upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are
then enhanced with features from the bottom-up pathway
via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map
is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.
Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.
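The merge step just described can be sketched in NumPy (a minimal sketch with random weights standing in for the learned 1×1 lateral projection; the final 3×3 smoothing conv is omitted; the (C, H, W) layout and the sizes are assumptions for the example):

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbor upsampling by a factor of 2 along H and W."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral_1x1(c, w):
    """A 1x1 conv is a per-pixel channel projection: (d, C) @ (C, H*W)."""
    C, H, W = c.shape
    return (w @ c.reshape(C, H * W)).reshape(w.shape[0], H, W)

def merge(top_down, bottom_up, w_lateral):
    """One FPN building block (Fig. 3), without the 3x3 smoothing conv:
    upsample the coarser top-down map, project the bottom-up map to d
    channels, and merge by element-wise addition."""
    return upsample2x_nearest(top_down) + lateral_1x1(bottom_up, w_lateral)

rng = np.random.default_rng(0)
d = 256
p5 = rng.standard_normal((d, 25, 32))     # coarsest top-down map
c4 = rng.standard_normal((512, 50, 64))   # bottom-up map with 512 channels
w = rng.standard_normal((d, 512)) * 0.01  # stand-in 1x1 lateral conv weights
p4 = merge(p5, c4, w)                     # shape (256, 50, 64)
```

Iterating this block from C5 downward produces P4, P3, and P2 in turn, each at the spatial size of its bottom-up counterpart.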
Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid,
we fix the feature dimension (numbers of channels, denoted
as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel
outputs. There are no non-linearities in these extra layers,
which we have empirically found to have minor impacts.
Simplicity is central to our design and we have found that
our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed
marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple
design described above.

Applications

Feature Pyramid Networks for RPN

We adapt RPN by replacing the single-scale feature map
with our FPN. We attach a head of the same design (3×3
conv and two sibling 1×1 convs) to each level on our feature
pyramid. Because the head slides densely over all locations
in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.
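The scheme above can be enumerated explicitly (a sketch; the convention that an anchor of scale s and ratio r = h/w keeps area s² is an assumption matching common RPN implementations):

```python
# Sketch: (width, height) of every anchor implied by one scale per level
# {P2..P6} and the three aspect ratios {1:2, 1:1, 2:1}.
def anchor_shapes(scales=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Ratio r = h / w with area preserved: w = s / sqrt(r), h = s * sqrt(r)."""
    return {
        level: [(s / r ** 0.5, s * r ** 0.5) for r in ratios]
        for level, s in zip(("P2", "P3", "P4", "P5", "P6"), scales)
    }

shapes = anchor_shapes()
total = sum(len(v) for v in shapes.values())  # 5 levels x 3 ratios = 15
```

Because each level carries exactly one scale, the multi-scale coverage comes from the pyramid itself rather than from stacking anchor scales on a single map.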

Feature Pyramid Networks for Fast R-CNN

Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they
are run on image pyramids. Formally, we assign an RoI of
width w and height h (on the input image to the network) to
the level Pk of our feature pyramid by:

k = ⌊k0 + log2(√(wh) / 224)⌋,

where 224 is the canonical ImageNet pre-training size and k0 is the target level onto which an RoI of size 224² should be mapped (the paper sets k0 = 4).
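This assignment rule can be sketched as follows (the function name and the clamping of k to the available levels P2..P5 are illustrative assumptions; k0 = 4 follows the paper):

```python
import math

# Sketch of the RoI-to-level assignment: an RoI of 224x224 (the ImageNet
# pre-training scale) maps to k0 = 4, i.e. P4; larger RoIs go to coarser
# levels, smaller RoIs to finer (higher-resolution) levels.
def assign_level(w, h, k0=4, k_min=2, k_max=5):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))  # clamp to the available levels P2..P5

assign_level(224, 224)  # -> 4 (P4)
assign_level(112, 112)  # -> 3 (P3, one level finer)
assign_level(448, 448)  # -> 5 (P5, one level coarser)
```

Intuitively, halving the RoI's scale moves it one level down the pyramid, mirroring how region-based detectors pick levels when run on image pyramids.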

We attach predictor heads (in Fast R-CNN the heads are
class-specific classifiers and bounding box regressors) to all
RoIs of all levels. Again, the heads all share parameters,
regardless of their levels. In [16], a ResNet’s conv5 layers (a 9-layer deep subnetwork) are adopted as the head on
top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike
[16], we simply adopt RoI pooling to extract 7×7 features,
and attach two hidden 1,024-d fully-connected (fc) layers
(each followed by ReLU) before the final classification and
bounding box regression layers. These layers are randomly
initialized, as there are no pre-trained fc layers available in
ResNets. Note that compared to the standard conv5 head,
our 2-fc MLP head is lighter weight and faster.

Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid.