Deep Dive into Single Stage Object Detection

Data Science

Date: 06/28/2024

Explore the evolution of single stage object detection, from R-CNN innovations to modern YOLO and SSD models.

Amey Patil

Data Scientist Consultant

Demystifying Single Stage Object Detection: A Deep Dive

Accurate object detection is key to countless real-world applications, from self-driving cars to medical image analysis. The journey towards today's sophisticated systems began with R-CNN.

R-CNN played a groundbreaking role in object detection, introducing several innovations that paved the way for modern, highly accurate detectors. Its modular pipeline had separate stages for proposal generation, feature extraction, and classification. While later models moved towards unified training, this stage-wise approach provided a foundation for understanding how the object detection problem breaks down.

Let’s walk through some of the key ideas and understand the inner workings of the R-CNN family of detectors.

Object Detection before R-CNN

Before R-CNN, object detection relied primarily on hand-designed detectors that scanned every possible location and scale in an image. Features such as HOG or SIFT, manually crafted to capture shape and texture, were used to identify objects. These approaches were computationally expensive because they had to search the whole image exhaustively across locations and scales.

Viola-Jones Detector

The Viola-Jones algorithm is a classic object detection method famed for its speed and ability to achieve real-time face detection in the early 2000s. Its core innovation lies in the use of Haar-like features. These simple rectangular features capture differences in brightness and can be computed with extreme efficiency using an integral image representation. The algorithm trains classifiers using the AdaBoost algorithm to select the most discriminative Haar-like features.
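
To see why the integral image makes Haar-like features so cheap, here is a minimal NumPy sketch (the function names are illustrative, not from any library): once the integral image is built, the sum over any rectangle takes only four lookups, so a two-rectangle feature costs a handful of additions regardless of its size.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns: ii[y, x] holds the sum of
    all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle using at most 4 lookups,
    regardless of rectangle size."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)

# A two-rectangle Haar-like feature: left half minus right half.
left_sum = rect_sum(ii, 0, 0, 3, 1)
right_sum = rect_sum(ii, 0, 2, 3, 3)
haar_feature = left_sum - right_sum
```

This constant-time rectangle sum is what let Viola-Jones evaluate thousands of features per window fast enough for real-time detection.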

Figure 1: Haar-like Cascades

HOG Detector

The HOG detector was a popular method for object detection that aimed to capture the overall shape and appearance of objects through the distribution of local intensity gradients (edges) within an image. Often combined with a sliding-window search, it was a significant tool for object detection, especially pedestrian detection, before the dominance of deep learning methods.
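
As a rough illustration of the idea (not a faithful reimplementation of the full HOG detector; block normalization and the sliding-window SVM are omitted), the per-cell gradient-orientation histograms at HOG's core can be sketched as:

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Unsigned gradient-orientation histograms per cell: each pixel votes
    into an orientation bin, weighted by its gradient magnitude."""
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            ys = slice(i * cell, (i + 1) * cell)
            xs = slice(j * cell, (j + 1) * cell)
            np.add.at(hist[i, j], bin_idx[ys, xs].ravel(), mag[ys, xs].ravel())
    return hist

# A horizontal intensity ramp: every gradient points along x (bin 0).
img = np.tile(np.arange(16, dtype=np.float64), (16, 1))
hist = hog_cell_histograms(img, cell=8)
```

Concatenating these histograms over a detection window yields the feature vector that a linear SVM then classifies.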

Deformable Part-Based Model

DPMs were a significant advancement in pre-deep learning object detection. They aimed to model objects as flexible combinations of parts to better handle deformations and occlusions compared to rigid object detectors.

The flexible arrangement of parts allowed DPMs to capture moderate pose variations and partial occlusions. The part-based approach provided some intuition on which parts of the model helped with the detection, making them more interpretable. However, since DPMs still use HOG features, their representational power is limited compared to learned features in deep learning. Also, the optimization for finding the best part configuration was computationally demanding.

R-CNN Architecture

The seminal R-CNN paper, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," describes the model in the following manner:

The Regions with CNN Features (R-CNN) model applies high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses Selective Search to identify a number of bounding-box object region candidates ("regions of interest") and then extracts features from each region independently for classification.

To understand this better, let’s break down the problem it’s trying to solve and see how it addresses each piece individually. The main challenge of object detection is two-fold:

  1. Where: Locating objects and their bounding boxes.
  2. What: Identifying the class of the object (e.g., “cat,” “car,” “cycle,” etc.)

R-CNN tackles these problems using a sequence of different techniques.

Region Proposal: Selective Search

To address the “where” side of the problem, we would want to search for “interesting” regions within the image where we might find some of the objects we are trying to detect. To this end, R-CNN leverages Selective Search.

Selective Search aims to group similar regions, starting from a fine-grained over-segmentation and progressing hierarchically. The process can be described in the following steps:

Step I: Initial Over-segmentation
The image is divided into many small, homogeneous regions. This is usually based on pixel similarities involving color, texture, intensity, etc.

Step II: Hierarchical Merging

  1. Similarity Measures: Selective Search uses a combination of similarity metrics:
    • Color similarity (color histograms)
    • Texture similarity (gradients)
    • Size similarity (favors merging smaller regions first)
    • Shape compatibility (how well regions fit together when combined)
  2. Greedy Combination: Starting with the most similar pair of regions, Selective Search iteratively merges them. After each merge, similarities are recalculated for the newly formed, larger region.

Step III: Generating Proposals

As the merging progresses, the remaining regions (usually around 1000-2000) are considered the region proposals.
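
The hierarchical merging above can be sketched as a toy greedy loop. Here regions are plain pixel-coordinate sets and the similarity function is a stand-in; real Selective Search combines the color, texture, size, and shape measures listed in Step II, and every intermediate region becomes a candidate proposal.

```python
def greedy_merge(regions, similarity, stop_at=1):
    """Toy hierarchical grouping: repeatedly merge the most similar pair
    of regions, recording every intermediate region as a proposal."""
    proposals = list(regions)
    regions = list(regions)
    while len(regions) > stop_at:
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                s = similarity(regions[i], regions[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = regions[i] | regions[j]          # union of pixel sets
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)                  # keep every scale as a proposal
    return proposals

# Stand-in similarity: smaller combined regions merge first (size similarity).
def sim(a, b):
    return -len(a | b)

initial = [{(0, 0)}, {(0, 1)}, {(5, 5), (5, 6)}, {(9, 9)}]
props = greedy_merge(initial, sim)
```

Because proposals are collected at every merge step, the output naturally covers objects at multiple scales.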

Selective Search is much faster than exhaustive sliding window methods (used for traditional object detection tasks). It also provides proposals at varying scales, locations, and shapes. This helps the detector handle different object types and has a high recall, meaning it's good at proposing regions that contain objects, even if it suggests some extra regions, too.

Categorizing Regions

Now that we have some candidate regions to investigate, we can shift our focus to what is in these regions. Do these regions contain any of the categories we are searching for? For this, R-CNN proposes a two-step process.

Feature Extraction

Pre-trained CNN: R-CNN takes a pre-trained CNN (like AlexNet) that is already good at classifying images.

Extract Features: Each proposed region is cropped, resized, and fed into the CNN. The CNN is a powerful feature extractor, creating a rich representation of what's inside that region.
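
The crop-and-warp step can be sketched as below; `warp_region` is a made-up helper using nearest-neighbor resampling, whereas the real pipeline warps each proposal (with some surrounding context) to AlexNet's fixed 227x227 input using proper interpolation.

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop a proposal and warp it to the fixed CNN input size using
    nearest-neighbor resampling (a sketch; R-CNN uses real interpolation)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    ys = np.arange(out_size) * h // out_size     # nearest source row per output row
    xs = np.arange(out_size) * w // out_size     # nearest source col per output col
    return crop[ys][:, xs]

image = np.random.default_rng(0).random((480, 640))
patch = warp_region(image, (100, 50, 300, 200))  # any box -> 227x227 CNN input
```

Every proposal, whatever its shape, ends up as the same fixed-size tensor, which is what lets a single pre-trained CNN featurize them all.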

Classification

R-CNN trains a separate Support Vector Machine (SVM) classifier for each object class you want to detect. These SVMs take the CNN's extracted features and output a score indicating how likely the region is to contain that object (e.g., one SVM for "car," another for "person," etc.).

So, when we obtain features for each bounding box, we pass them through each SVM to get different scores across all categories.
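
Because each class's SVM is linear, scoring every proposal against every class is a single matrix multiply over the feature vectors. In this sketch the weights are random stand-ins for the trained per-class SVMs:

```python
import numpy as np

def score_regions(features, svm_weights, svm_biases):
    """Score each region's CNN feature vector against every class's linear
    SVM in one matrix multiply, yielding a (regions x classes) score grid."""
    return features @ svm_weights.T + svm_biases

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096))  # one 4096-d vector per proposal
weights = rng.standard_normal((20, 4096))     # one linear SVM per class (e.g. VOC's 20)
biases = rng.standard_normal(20)

scores = score_regions(features, weights, biases)
best_class = scores.argmax(axis=1)            # highest-scoring class per region
```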

Bounding Box Regression

The paper finally proposes one last step to refine the output: introducing a bounding-box regressor. This regressor takes the region's features and fine-tunes the location of the initially proposed bounding box to better fit the object.

Finally, R-CNN has produced a list of bounding boxes. Each box has:

  • An object category label (e.g., "person")
  • A confidence score from the SVM
  • Refined bounding box coordinates.
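
The refinement step uses the paper's parameterization: the predicted targets (t_x, t_y, t_w, t_h) shift the proposal's center by an offset scaled by its size, and rescale its width and height in log space. A minimal sketch:

```python
import numpy as np

def apply_bbox_regression(proposal, t):
    """Apply R-CNN bounding-box regression targets (t_x, t_y, t_w, t_h)
    to a proposal given as (center_x, center_y, width, height)."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    gx = pw * tx + px        # shift center by a width-scaled offset
    gy = ph * ty + py        # shift center by a height-scaled offset
    gw = pw * np.exp(tw)     # rescale width in log space
    gh = ph * np.exp(th)     # rescale height in log space
    return gx, gy, gw, gh

# Zero targets leave the box unchanged; small targets nudge and rescale it.
box = apply_bbox_regression((100.0, 100.0, 50.0, 80.0), (0.1, -0.05, 0.2, 0.0))
```

The log-space width/height targets keep the predicted dimensions positive and make the regression scale-invariant.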

Limitations

While R-CNN was a breakthrough in object detection, it had certain limitations that paved the way for further improvements:

Slow Speed: 

R-CNN is considered notoriously slow due to several factors:

  • Multiple Stages: Processing each image involved running Selective Search, extracting CNN features for thousands of proposals, and then classifying them separately.
  • Redundant Computations: CNN features were extracted independently for each region proposal, so heavily overlapping regions repeated largely the same computation.
  • Disjointed Training Pipeline: Training R-CNN was a multi-stage process:
    • Pre-training a CNN on a large image classification task.
    • Fine-tuning the CNN on region proposals generated for object detection.
    • Training the SVMs separately.
    • Finally, training the bounding box regressors.

This fragmented process made the pipeline complex to optimize.

Selective Search Bottleneck

R-CNN relied on Selective Search for region proposals. While efficient, Selective Search became a bottleneck for overall speed, and its parameters could affect detection performance.

Fast R-CNN

Fast R-CNN was a significant speed improvement over R-CNN, addressing major computational inefficiencies. It introduced the idea of sharing CNN feature computation and replaced SVMs with a single neural network for classification and bounding box regression, streamlining the process and boosting speed.

The procedure it introduced was as follows:

  1. Single Feature Map: Instead of extracting features from each region proposal separately, Fast R-CNN runs the CNN on the entire input image once and generates a feature map.
  2. ROI Pooling: A new layer called ROI Pooling takes the region proposals and extracts fixed-size feature vectors from the shared feature map for each proposal. This avoids redundant feature computation.
  3. Multi-task Training: Fast R-CNN combines classification and bounding box regression into a single neural network with multi-task loss, eliminating separate training stages.
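
ROI Pooling itself is straightforward to sketch: divide each region of the shared feature map into a fixed grid of sub-windows and max-pool each one, so every proposal, whatever its size, produces an equal-size output. This toy version works in feature-map coordinates and ignores the stride mapping from image to feature map:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a shared feature map into a fixed
    out_size x out_size grid."""
    x1, y1, x2, y2 = roi                    # region in feature-map coordinates
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)   # row boundaries of sub-windows
    xs = np.linspace(0, w, out_size + 1).astype(int)   # column boundaries
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=np.float64).reshape(8, 8)
pooled = roi_pool(fmap, (1, 2, 7, 6), out_size=2)      # a 6x4 region -> 2x2 output
```

Because every ROI is reduced to the same shape, the downstream fully connected layers can process all proposals with one set of weights.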

Limitations:

  • Dependence on Separate Proposal Method: Fast R-CNN still used Selective Search for region proposals, which remained a computational bottleneck.
  • Not Fully Real-Time: Even with shared feature computation, the external proposal step and the per-region fully connected layers kept Fast R-CNN short of real-time performance.

Faster R-CNN

The most significant leap forward came with Faster R-CNN, which introduced the Region Proposal Network (RPN). The RPN is a small network that directly learns to predict object proposals, finally removing the dependency on Selective Search and enabling near real-time object detection.

  • Region Proposal Network (RPN): The core innovation in Faster R-CNN. The RPN is a small neural network that slides over the convolutional feature map from the backbone network, predicting object proposals relative to a set of reference boxes ("anchors") at various scales and aspect ratios.
  • Speed: The RPN shares convolutional computation with the downstream detection network, making proposal generation much faster than Selective Search.
  • Unified Model: Faster R-CNN integrates the RPN, the feature extractor, and the classification/bounding box regression branches into a single, end-to-end trainable model. This reduces complexity and allows for joint optimization.
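
The anchor set can be sketched with the paper's default 3 scales x 3 aspect ratios, using one common parameterization in which anchors at different ratios keep roughly equal area; the RPN evaluates all nine at every feature-map position:

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Base anchor shapes (width, height): 3 scales x 3 aspect ratios,
    each with area roughly scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)   # wider anchor for small ratios
            h = s * np.sqrt(r)         # taller anchor for large ratios
            anchors.append((w, h))
    return anchors

anchors = generate_anchors()   # 9 (w, h) pairs per feature-map position
```

For each anchor at each position, the RPN predicts an objectness score plus box offsets in the same parameterization as R-CNN's bounding-box regressor.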

Limitations

Two-Stage Process: While faster than R-CNN and Fast R-CNN, Faster R-CNN is still a two-stage detector (proposals + classification), which can limit its real-time capability.

RPN Hyperparameters: The performance of the RPN can be sensitive to the choice of anchors and other hyperparameters.

Mask R-CNN

While Faster R-CNN excels at drawing boxes around objects, Mask R-CNN determines the exact outline of each object within those boxes, providing incredibly fine-grained information. It adds a parallel branch (alongside those for classification and bounding box regression) that predicts a pixel-level segmentation mask for each detected object instance.