Object detection is the task of pinpointing where objects are within an image (object localization) and assigning them to specific categories (object classification). It is preferred over image classification in many scenarios because it both identifies objects and locates them precisely. Object detection architectures fall into two main categories: single-stage and two-stage detectors.
Two-stage detectors follow a sequence of steps: first they extract features, then they generate region proposals, and finally they classify each proposal. Single-stage detectors such as YOLO (You Only Look Once), by contrast, perform detection in a single pass, which makes them popular for their speed, lightweight design, and suitability for edge deployment. Among single-stage detectors, YOLO architectures are increasingly favored because they fit industrial requirements well.
Single-stage detection has been improved by using small convolutional filters to predict object categories and bounding box offsets. Separate filters are used for different aspect ratios, and they are applied across multiple feature maps so that objects are detected at different scales. This approach achieves high accuracy even with low-resolution input, which in turn speeds up detection.
SSD: Single Shot MultiBox Detector
SSD is an object detection technique that uses a convolutional neural network (CNN) to detect objects within images. It predicts both the bounding boxes (the boxes that outline the objects) and the class labels (what type of object it is) for each detected object.
SSD Architecture Flow
- Feature Extraction: It starts with a base CNN that extracts features from the input image.
- Multi-scale Feature Maps: These are layers added on top of the base CNN that progressively decrease in size. They allow SSD to detect objects at different scales in the image.
- Convolutional Predictors: Each feature map layer uses small convolutional filters to predict a fixed set of bounding boxes and their corresponding class labels. These filters are like small templates that scan the image and predict whether there's an object present in a specific location.
- Default Boxes: For each location on the feature map, SSD associates a set of default bounding boxes with different sizes and aspect ratios. These default boxes serve as references for predicting the final bounding boxes.
- Matching Ground Truth Boxes: During training, SSD matches these default boxes with the ground truth boxes (the actual objects in the image) based on their overlap. This helps the network learn which default boxes correspond to real objects.
- Training Objective: SSD's training objective combines two components:
  - Localization Loss: This measures how well the predicted bounding boxes (l) match the ground truth boxes (g). It uses a Smooth L1 loss function on the difference between the predicted and ground truth box parameters (center coordinates, width, and height).
  - Confidence Loss: This measures how confident the network is in its predictions of an object's presence and its class label. It uses a softmax loss over the multiple class confidences.
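The matching step and the localization loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not SSD's actual implementation: the function names, the corner-format boxes `(xmin, ymin, xmax, ymax)`, and the 0.5 overlap threshold are assumptions made for the example. It also leaves out the step that guarantees every ground truth box gets at least one default box.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of the best-overlapping
    ground truth box, or None if no overlap exceeds the threshold.
    The threshold value is an assumption for this sketch."""
    matches = []
    for d in default_boxes:
        best_idx, best_iou = None, threshold
        for j, g in enumerate(gt_boxes):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_idx, best_iou = j, overlap
        matches.append(best_idx)
    return matches


def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(pred, target):
    """Sum of Smooth L1 over the box parameters (cx, cy, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))
```

Default boxes that come back as `None` would be treated as background and contribute only to the confidence loss, while matched boxes contribute to both losses.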
YOLO: You Only Look Once
YOLO introduced the idea of overlaying a grid on the image. The image is divided into an S×S grid, and each cell takes responsibility for detecting objects within its boundaries. When the center of an object falls within a specific grid cell, that cell is tasked with identifying and locating the object. Other cells can therefore disregard the object, even if it extends across several grid cells.
Each grid cell predicts B bounding boxes. Each box carries information about the object's dimensions along with a confidence score indicating the likelihood that an object is present within the box.
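To make the "center falls within a cell" rule concrete, here is a small sketch of how the responsible cell could be found for an object. The function name `responsible_cell`, its parameters, and the default grid size `S=7` are assumptions for illustration and are not taken from any particular YOLO codebase.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell that contains the object
    center (cx, cy), given in pixels, for an S x S grid over the image."""
    # Scale the center into grid units, then truncate to a cell index.
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # A center lying exactly on the right or bottom edge is clamped
    # into the last cell.
    return min(row, S - 1), min(col, S - 1)


# Example: an object centered at (224, 100) in a 448 x 448 image.
row, col = responsible_cell(224, 100, 448, 448, S=7)
print(row, col)  # 1 3
```

Only the cell at `(row, col)` is expected to predict this object. Its neighbors are not, even when the object's box overlaps them.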
YOLO Architecture Flow
- Grid Cells: YOLO divides the image into a grid of cells, each responsible for detecting objects within its confines.
- Bounding Boxes: YOLO predicts multiple bounding boxes within each cell, estimating potential object locations and sizes. Each bounding box contains parameters (x, y, width, height) and a confidence score.
- Confidence Score: This score reflects YOLO's confidence in the presence of an object within a bounding box. It ranges from 0 to 1, with higher scores indicating greater confidence.
- Class Prediction: YOLO also predicts the type of object within each box, such as "car," "dog," or "person." This aids in object classification.
- Non-Maximum Suppression (NMS): NMS is employed to address situations where YOLO predicts multiple overlapping bounding boxes for the same object. It removes redundant boxes, retaining only the most confident predictions.
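The NMS step can be sketched as a greedy procedure: visit boxes from most to least confident, and keep a box only if it does not overlap too much with any box already kept. The version below is a minimal plain-Python illustration. The function names, the corner-format boxes `(xmin, ymin, xmax, ymax)`, and the 0.5 IoU threshold are assumptions for the example, and production implementations are usually vectorized and applied per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, most confident first."""
    # Visit boxes in order of decreasing confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it is not a near-duplicate of a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

A lower threshold suppresses more aggressively, which can merge distinct objects that sit close together. A higher threshold lets more duplicate boxes through.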
Case Study: On-Shelf Availability of a Product in Store
The problem is to detect Coke, Sprite, Pepsi, and Mirinda on the shelf.