The architecture that revolutionized multi-scale detection
Content Outline
- Introduction
- Three primary components of modern neural networks (Backbone, Neck and Head)
- Why Feature Pyramid Networks (The Problem)?
- Feature Pyramid Networks: The Elegant Solution
- Implementation Example Code
- Variations (Evolution of Feature Pyramid Network)
- Limitations and Considerations
- Conclusion
Introduction
In the world of deep learning and computer vision, architectural innovations have played an important role in advancing the field. Among these innovations, Feature Pyramid Networks (FPN) stand out as a fundamental building block that has revolutionized how we handle multi-scale feature representation in neural networks. In this article, we'll explore FPN in detail.
Three primary components (Backbone, Neck and Head)
Before getting into FPN, it’s important to understand the three primary components of modern neural networks in computer vision:
Backbone
The backbone, typically a convolutional neural network like ResNet or VGG, serves as the primary feature extractor. It processes the raw input image and generates hierarchical feature representations at different scales.
Think of it as the foundation that captures basic to complex features, from edges and textures to higher-level semantic information.
Neck
The neck component serves as a feature fusion and enhancement module between the backbone and head networks. Its primary purpose is to process and combine features from different scales or stages of the backbone to generate more discriminative feature representations.
Think of it as a processing plant that takes raw materials (features) from different sources and refines them into more useful products.
The neck can perform various operations like:
- Feature fusion across different scales
- Feature enhancement through additional convolutions
- Information flow management between different network levels
Feature Pyramid Network is one popular implementation of a neck architecture, but others exist, such as Path Aggregation Network (PANet) and High-Resolution Network (HRNet).
Head
The head is the task-specific component that uses the refined features to make final predictions. Different tasks (detection, segmentation, classification) require different head architectures, but they all benefit from well-processed features from the neck.
Why Feature Pyramid Networks?
Because of what is known as:
“The Multi-Scale Challenge”
The multi-scale challenge in computer vision comes from multiple fundamental limitations in traditional CNN architectures:
i. Feature Hierarchy Problem
As we go deeper in a CNN, the spatial resolution decreases while the semantic level increases. For example, in a typical ResNet (verified by the short sketch after this list):
- Early layers (e.g., Conv1) have 1/2 resolution with basic features (edges, textures)
- Middle layers (e.g., Conv3) have 1/8 resolution with mid-level features (parts, patterns)
- Deep layers (e.g., Conv5) have 1/32 resolution with high-level features (objects, scenes)
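We can check these strides directly. The short sketch below (assuming a torchvision ResNet-18 and a 224x224 input; stage names follow the usual C₁–C₅ convention, and exact sizes vary with input resolution) prints the spatial size after each stage:

import torch
import torch.nn as nn
import torchvision.models as models

# Sanity check: push a dummy 224x224 image through ResNet-18 and
# print the spatial size after each stage to see the resolution drop.
resnet = models.resnet18(weights=None)  # weights are irrelevant for shapes
stages = [
    ("Conv1 (1/2)", nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu)),
    ("Conv2/Layer1 (1/4)", nn.Sequential(resnet.maxpool, resnet.layer1)),
    ("Conv3/Layer2 (1/8)", resnet.layer2),
    ("Conv4/Layer3 (1/16)", resnet.layer3),
    ("Conv5/Layer4 (1/32)", resnet.layer4),
]
x = torch.randn(1, 3, 224, 224)  # dummy RGB image
for name, stage in stages:
    x = stage(x)
    print(f"{name}: {tuple(x.shape[-2:])}")
# Prints 112x112, 56x56, 28x28, 14x14, 7x7 for a 224x224 input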
ii. Scale Variance
Objects in natural images appear at vastly different scales. Consider autonomous driving:
- Nearby pedestrians might occupy 300x600 pixels
- Distant vehicles might only occupy 30x60 pixels
- Traffic signs could appear at any size in between
iii. Information Loss
Traditional feature pyramids (like image pyramids) maintain spatial resolution but lack semantic strength at lower levels, making them inefficient for modern deep learning.
These three problems work together to create a major challenge in computer vision ⚠️
Let’s think about a real-world example: a self-driving car trying to detect objects on a street.
The car’s camera sees objects at many different distances — some things are close, others are far away. To spot a distant pedestrian, the system needs to work with high-resolution (detailed) images to see the small details.
However, there’s a problem: the early layers of the network that process these detailed images aren’t very good at understanding what they’re looking at. They might see the basic shape of a person, but can’t tell if it’s actually a person or just a streetlight pole, because they lack deeper understanding.
You might think:
Why not just use traditional methods like image pyramids, where we create copies of the image at different sizes?
Unfortunately, this approach creates another problem — the features we extract from these pyramid images aren’t rich enough in information to be truly useful for modern deep learning.
So we end up stuck between two bad choices:
Either we get good detail but poor understanding, or good understanding but poor detail.
It’s like having to choose between a magnifying glass that shows you every detail but can’t tell you what you’re looking at, and a pair of blurry glasses that can recognize objects but can’t see them clearly.
This frustrating trade-off between “seeing clearly” and “understanding what we’re seeing” is exactly why researchers developed Feature Pyramid Networks — to finally solve this dilemma.
Feature Pyramid Networks (FPN): The Elegant Solution
FPN introduces a sophisticated yet remarkably easy-to-understand architecture that combines low-level and high-level features through three key components:
i. Bottom-up Pathway (Backbone):
- This is the regular ConvNet forward pass
- Features get progressively more semantic but lose spatial resolution
- Each stage outputs feature maps at different scales (C₂, C₃, C₄, C₅)
ii. Top-down Pathway:
- Starts from the deepest layer and progressively upsamples spatially coarser but semantically stronger features
- Creates higher resolution features (P₅, P₄, P₃, P₂)
- Uses nearest neighbor upsampling to increase resolution
iii. Lateral Connections:
- 1x1 convolutions reduce channel dimensions of backbone features
- Element-wise addition merges features from bottom-up and top-down pathways
- 3x3 convolutions smooth the merged features
The technical process works as follows (a runnable sketch using torchvision's built-in FPN module comes after this list):
- Bottom-up features {C₂, C₃, C₄, C₅} are extracted
- Top level feature C₅ is processed by 1x1 conv to create P₅
- P₅ is upsampled and merged with processed C₄ to create P₄
- This process continues until P₂
- Each level in the final pyramid {P₂, P₃, P₄, P₅} contains rich semantic information while maintaining appropriate spatial resolution
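If you want to experiment with this recipe before building it yourself, torchvision ships a ready-made module. Here is a minimal sketch using torchvision.ops.FeaturePyramidNetwork with dummy feature maps shaped like ResNet-18 outputs for a 224x224 input (the shapes and dict keys are illustrative assumptions):

from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

# Dummy backbone features C2..C5, ordered highest resolution first
features = OrderedDict()
features["c2"] = torch.randn(1, 64, 56, 56)
features["c3"] = torch.randn(1, 128, 28, 28)
features["c4"] = torch.randn(1, 256, 14, 14)
features["c5"] = torch.randn(1, 512, 7, 7)

fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 256, 512], out_channels=256)
outputs = fpn(features)  # OrderedDict of P2..P5 under the same keys
for name, p in outputs.items():
    print(name, tuple(p.shape))  # every level now has 256 channels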
Implementation Example Code
Let’s implement a basic FPN with a ResNet-18 backbone for image classification.
While FPN is more commonly used in detection and segmentation tasks, this example demonstrates its core concepts in a simpler classification context.
Here’s the complete code for classification with an FPN neck (👉🏼 Don’t worry, we’ll break it down step by step 😉) :
import torch
import torch.nn as nn
import torchvision.models as models

class FPNNeck(nn.Module):
    def __init__(self, in_channels_list, out_channels):
        super(FPNNeck, self).__init__()
        # Lateral connections (1x1 convolutions)
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, 1)
            for in_channels in in_channels_list
        ])
        # Top-down pathway (upsampling + smoothing)
        self.fpn_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in range(len(in_channels_list))
        ])

    def forward(self, features):
        # features should be ordered from highest resolution to lowest
        laterals = [conv(feature) for feature, conv in zip(features, self.lateral_convs)]
        # Top-down pathway: upsample deeper levels and add them in
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] += nn.functional.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode='nearest'
            )
        # Smoothing
        outputs = [conv(lateral) for lateral, conv in zip(laterals, self.fpn_convs)]
        return outputs

class ResNetFPN(nn.Module):
    def __init__(self, num_classes):
        super(ResNetFPN, self).__init__()
        # Load pretrained ResNet-18 as backbone
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone_layers = nn.ModuleList([
            nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1),
            resnet.layer2,
            resnet.layer3,
            resnet.layer4
        ])
        # FPN neck
        in_channels_list = [64, 128, 256, 512]  # ResNet-18 output channels
        self.fpn = FPNNeck(in_channels_list, out_channels=256)
        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256 * 4, num_classes)  # 4 feature maps from FPN

    def forward(self, x):
        # Extract features from backbone
        features = []
        for layer in self.backbone_layers:
            x = layer(x)
            features.append(x)
        # FPN forward pass
        fpn_features = self.fpn(features)
        # Global average pooling on each FPN level
        pooled_features = []
        for feature in fpn_features:
            pooled = self.avgpool(feature)
            pooled_features.append(pooled.flatten(1))
        # Concatenate all pooled features
        x = torch.cat(pooled_features, dim=1)
        x = self.fc(x)
        return x
Code breakdown :
import torch
import torch.nn as nn
import torchvision.models as models

class FPNNeck(nn.Module):
    def __init__(self, in_channels_list, out_channels):
        super(FPNNeck, self).__init__()
        # Lateral connections (1x1 convolutions)
        self.lateral_convs = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, 1)
            for in_channels in in_channels_list
        ])
        # Top-down pathway (upsampling + smoothing)
        self.fpn_convs = nn.ModuleList([
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in range(len(in_channels_list))
        ])
The FPNNeck class implements the core FPN architecture:
- The lateral_convs create 1x1 convolutions that reduce the channel dimensions of features coming from different levels of the backbone. Think of these as "adapters" that make sure features from different levels can be combined properly (a tiny illustration follows this list).
- The fpn_convs are 3x3 convolutions that smooth the features after we combine them. This helps blend the information from different levels more effectively.
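To make the "adapter" idea concrete, here is a tiny illustration with made-up shapes:

import torch
import torch.nn as nn

# A 1x1 convolution only changes the channel count, never the spatial size
adapter = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1)
c5 = torch.randn(1, 512, 7, 7)  # deepest backbone feature (example shape)
print(adapter(c5).shape)        # torch.Size([1, 256, 7, 7])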
    def forward(self, features):
        # features should be ordered from highest resolution to lowest
        laterals = [conv(feature) for feature, conv in zip(features, self.lateral_convs)]
        # Top-down pathway: upsample deeper levels and add them in
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] += nn.functional.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode='nearest'
            )
        # Smoothing
        outputs = [conv(lateral) for lateral, conv in zip(laterals, self.fpn_convs)]
        return outputs
The forward pass shows how FPN processes features (hand-traced after this list):
- First, it applies the lateral convolutions to all feature levels
- Then, it implements the top-down pathway: starting from the deepest layer, it upsamples features and adds them to the next level up
- Finally, it applies smoothing convolutions to all levels
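For our four feature levels, the top-down loop unrolls as follows (a hand trace of the forward code above; index 0 is the highest-resolution level):

# i = 3: laterals[2] += upsample(laterals[3])  # C5-level semantics flow into the C4 level
# i = 2: laterals[1] += upsample(laterals[2])  # ...then into the C3 level
# i = 1: laterals[0] += upsample(laterals[1])  # ...and finally into the C2 level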
class ResNetFPN(nn.Module):
    def __init__(self, num_classes):
        super(ResNetFPN, self).__init__()
        # Load pretrained ResNet-18 as backbone
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone_layers = nn.ModuleList([
            nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1),
            resnet.layer2,
            resnet.layer3,
            resnet.layer4
        ])
The ResNetFPN class combines everything:
- It starts with a pretrained ResNet-18 as the backbone
- We split it into four stages that will give us features at different resolutions
        # FPN neck
        in_channels_list = [64, 128, 256, 512]  # ResNet-18 output channels
        self.fpn = FPNNeck(in_channels_list, out_channels=256)
        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256 * 4, num_classes)  # 4 feature maps from FPN
For the classification task:
- We create the FPN neck that will process features from all four stages of the ResNet
We add a simple classification head that:
- Pools features from each FPN level
- Concatenates them together
- Makes the final classification
    def forward(self, x):
        # Extract features from backbone
        features = []
        for layer in self.backbone_layers:
            x = layer(x)
            features.append(x)
        # FPN forward pass
        fpn_features = self.fpn(features)
        # Global average pooling on each FPN level
        pooled_features = []
        for feature in fpn_features:
            pooled = self.avgpool(feature)
            pooled_features.append(pooled.flatten(1))
        # Concatenate all pooled features
        x = torch.cat(pooled_features, dim=1)
        x = self.fc(x)
        return x
The forward pass combines everything together (a quick smoke test follows this list):
- The input image goes through the ResNet backbone, collecting features at each stage
- These features go through the FPN neck, which creates our pyramid
- We pool the features from each level of the pyramid
- Finally, we combine all these features to make our classification prediction
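A quick smoke test, with a hypothetical 10-class setup and a dummy batch, confirms the wiring end to end (note that instantiating the model downloads the pretrained ResNet-18 weights on first run):

import torch

model = ResNetFPN(num_classes=10)      # uses the class defined above
dummy = torch.randn(2, 3, 224, 224)    # batch of two 224x224 RGB images
logits = model(dummy)
print(logits.shape)                    # torch.Size([2, 10])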
While this example uses FPN for classification, it’s worth noting that in real-world applications, FPN is more commonly used in detection and segmentation tasks where multi-scale feature representation is crucial.
The concepts shown here — the lateral connections, top-down pathway, and feature fusion — are the same ones that make FPN powerful in those more complex tasks.
Variations (Evolution of Feature Pyramid Network)
Feature Pyramid Network has evolved significantly since its inception, and researchers have introduced various modifications to the vanilla FPN network. Some of these variations are now used in state-of-the-art object detection and segmentation architectures.
Here are a few notable variations:
PANet (Path Aggregation Network)
- Enhances information flow by adding an extra bottom-up path after FPN (sketched after this list)
- Introduces adaptive feature pooling
Used in:
- Mask Scoring R-CNN for instance segmentation
- Thunder-Net for real-time object detection
- VFNet for accurate object detection
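As a rough illustration, here is a hedged sketch of PANet's extra bottom-up path, based on a plain reading of the paper rather than the official implementation. It assumes an FPN output of four 256-channel levels, highest resolution first:

import torch.nn as nn

class BottomUpPath(nn.Module):
    # Sketch of PANet's bottom-up augmentation: N2 = P2, then
    # N_i = smooth(downsample(N_{i-1}) + P_i) for the higher levels.
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.down_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        self.smooth_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, pyramid):  # pyramid: [P2, P3, P4, P5], high-res first
        outputs = [pyramid[0]]   # N2 = P2
        for i, (down, smooth) in enumerate(zip(self.down_convs, self.smooth_convs)):
            outputs.append(smooth(down(outputs[-1]) + pyramid[i + 1]))
        return outputs           # [N2, N3, N4, N5]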
BiFPN (Bidirectional FPN):
- Introduces weighted bidirectional cross-scale connections (see the fusion sketch after this list)
- Removes redundant connections for efficiency
Featured in:
- EfficientDet family of object detectors
- BoTNet for autonomous driving perception
- PE-FPN (Position Enhanced FPN) in retail object detection
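The weighted fusion at BiFPN's core can be sketched like this (a reading of EfficientDet's "fast normalized fusion"; the class name and defaults are illustrative, not the official implementation):

import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    # BiFPN-style fusion: learn one non-negative weight per input feature
    # map and normalize, letting the network decide how much each
    # resolution contributes to the merged output.
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):  # inputs: list of tensors with identical shapes
        w = torch.relu(self.weights)   # clamp weights to be non-negative
        w = w / (w.sum() + self.eps)   # normalize so they sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))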
Recent Models (2023–2024):
- RT-DETR: Uses a deformable transformer-based FPN variant
- DINO-V2: Implements a hybrid FPN-Transformer neck
- YOLOv8: Features a modified CSP-PAN neck inspired by FPN principles
Limitations and Considerations
Although Feature Pyramid Networks (FPN) represent a major achievement in computer vision, they are not suitable for every use case. Here are a few limitations and considerations to keep in mind before applying FPN to your own problem:
i. Computational Overhead
Real-world impact:
In autonomous driving systems, FPN processing can add 20–30ms latency, which might be critical for real-time decision making.
Example: Tesla’s previous vision systems used simplified feature fusion to maintain real-time performance.
ii. Single-Scale Tasks
Examples where FPN might be overkill:
- OCR for standardized documents (fixed text size)
- QR code detection (known scale range)
- Industrial defect detection on production lines (fixed camera distance)
iii. Small Object Detection
Practical limitations in:
- Satellite imagery analysis (small buildings/vehicles)
- Medical image analysis (small lesions)
- Wildlife monitoring (distant animals)
Conclusion
Feature Pyramid Networks represent a major achievement in computer vision architecture design. Their ability to effectively handle multi-scale feature representation while maintaining computational efficiency has made them an essential component in modern computer vision systems. As the field continues to evolve, FPN’s influence can be seen in newer architectures, and its principles continue to inspire innovations in neural network design.
The success of FPN teaches us an important lesson in deep learning architecture design:
sometimes, the most elegant solutions come from carefully considering the fundamental challenges of the problem space and addressing them with simple, well-thought-out mechanisms rather than increasing complexity.
If you enjoyed this article, your applause would be greatly appreciated!