Object tracking has emerged as a critical component in applications such as surveillance, autonomous driving, and augmented reality. Recent advances in computer vision and deep learning have led to the development of sophisticated object-tracking algorithms, notably Byte-Track and Fair-MOT. Multi-object tracking (MOT) is an essential problem in computer vision, and years of research across these applications have produced a steady progression of increasingly capable trackers. In this blog, we will explore the intricacies of these two state-of-the-art methods, examining their underlying principles, performance, and applications.
First, let us investigate Deep-SORT and the issues in its architecture that led to the development of Fair-MOT.
Deep-SORT, short for Deep Simple Online and Realtime Tracking, is an advanced tracking algorithm designed to track multiple objects in a video sequence with high accuracy and in real time. It is widely used in computer vision applications, especially in object tracking in surveillance, autonomous vehicles, and robotics.
Deep-SORT's Architecture Comprises Several Key Components
Figure 1: Deep-SORT Architecture
Feature Extractor: Deep-SORT first extracts deep appearance features from each detected object in the video frames using a convolutional neural network (CNN); the detections themselves come from an external object detector such as YOLOv4. These features capture the appearance characteristics of each object.
Kalman Filter: After feature extraction, Deep-SORT utilizes a Kalman filter to predict the state (position and velocity) of each object in subsequent frames. The Kalman filter helps to estimate the next state based on the current state and the dynamics of the object's motion.
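To make the prediction step concrete, here is a minimal sketch of a constant-velocity Kalman filter in NumPy. The four-dimensional state and the noise values are simplifying assumptions for illustration; Deep-SORT's actual filter tracks a richer state that also includes bounding-box scale and aspect ratio.

```python
# Minimal constant-velocity Kalman filter sketch (assumed state: [x, y, vx, vy]).
import numpy as np

dt = 1.0  # time step between frames
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)  # constant-velocity motion model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # we only observe position
Q = np.eye(4) * 0.01   # process noise (assumed value)
R = np.eye(2) * 1.0    # measurement noise (assumed value)

def predict(x, P):
    """Project the state and covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a new detection z = [x, y]."""
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```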
Data Association: Deep-SORT employs a sophisticated data association technique, typically using the Hungarian algorithm, to associate detected objects with existing tracks. This step helps maintain tracking consistency by matching detected objects with existing tracks based on their appearance features and predicted states.
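As an illustration, the sketch below uses SciPy's `linear_sum_assignment` (an implementation of the Hungarian method) to match tracks to detections given a cost matrix. The cost values and the rejection threshold here are hypothetical; Deep-SORT builds its costs from appearance and motion cues.

```python
# Hungarian-style assignment sketch; costs and threshold are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost_matrix, max_cost=0.7):
    """Match detections to tracks, rejecting pairs whose cost is too high."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return [(r, c) for r, c in zip(rows, cols) if cost_matrix[r, c] <= max_cost]

# Example: 3 tracks x 2 detections (hypothetical costs, e.g. 1 - IoU).
cost = np.array([[0.10, 0.90],
                 [0.80, 0.20],
                 [0.95, 0.85]])
print(associate(cost))  # [(0, 0), (1, 1)] -- track 2 stays unmatched
```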
Tracker Management: Deep-SORT employs a tracker management module to manage creation, maintenance, and deletion of object tracks. It determines when to create a new track for a newly detected object, update existing tracks, and terminate tracks for objects no longer present in the scene.
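A minimal sketch of the kind of lifecycle bookkeeping such a module performs is shown below. The `Track` class and its `n_init`/`max_age` parameters are simplified assumptions modeled on common Deep-SORT configurations, not the library's exact implementation.

```python
# Simplified track lifecycle: tentative -> confirmed -> deleted.
class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.hits = 1                 # consecutive frames with a match
        self.time_since_update = 0    # frames since the last match
        self.n_init = n_init          # hits needed before confirmation
        self.max_age = max_age        # misses allowed before deletion
        self.state = "tentative"

    def mark_hit(self):
        """Called when a detection is associated with this track."""
        self.hits += 1
        self.time_since_update = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_miss(self):
        """Called when no detection matches this track in a frame."""
        self.time_since_update += 1
        if self.state == "tentative" or self.time_since_update > self.max_age:
            self.state = "deleted"
```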
Embedding Model: A deep neural network calculates similarity scores between object detections and existing tracks. This embedding model learns a metric space where the similarity between feature vectors corresponds to the similarity between objects in appearance and motion.
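For illustration, here is a minimal sketch of computing an appearance-similarity matrix between track and detection embeddings using cosine similarity; the embedding network producing the feature vectors is assumed to be given.

```python
# Cosine similarity between L2-normalized embedding vectors.
import numpy as np

def cosine_similarity_matrix(track_feats, det_feats):
    """Rows: tracks, columns: detections; values in [-1, 1]."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return a @ b.T
```

A cost matrix for the Hungarian step above can then be formed as `1 - similarity`, optionally blended with a motion-based distance.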
Drawbacks
Computationally Intensive: Deep-SORT's reliance on deep learning models and complex algorithms makes it computationally intensive, requiring powerful hardware to achieve real-time performance.
High False Positive Rate: In challenging scenarios such as crowded scenes or occlusions, Deep-SORT may struggle with false positives or incorrect associations, leading to tracking errors.
Limited Generalization: Deep-SORT's performance may degrade when applied to object classes or scenes that differ significantly from the data it was trained on, limiting its generalization capability.
Dependency on Quality of Detection: Deep-SORT heavily relies on the accuracy of the object detection system that provides bounding boxes as input. Inferior quality detections can adversely affect tracking performance.
Vulnerability to Noisy Detections: Deep-SORT's performance can be affected by noisy detections or false positives, especially in cluttered scenes or challenging lighting conditions. Since it relies on feature similarity for association, inaccurate or inconsistent detections can lead to tracking failures.
Single Object Tracking Limitation: While Deep-SORT excels in tracking multiple objects simultaneously, it may face challenges in scenarios where the goal is to track a single object with high precision due to the complexity of its data association mechanisms.
Motivation Behind Fair-MOT
One of the critical drawbacks of Deep-SORT that led to the development of Fair-MOT (Fair Multiple Object Tracking) is its high computational complexity, particularly in real-time applications. Deep-SORT's reliance on deep learning models and complex algorithms can strain computational resources, making real-time performance on standard hardware difficult to achieve.
Addressing Limitations: The limitations of existing MOT algorithms, including Deep-SORT, in handling complex scenarios with occlusions, crowded scenes, and diverse motion patterns motivated the development of Fair-MOT.
Fair-MOT aimed to overcome these limitations by integrating deep association networks with a multi-task learning framework, enabling more robust and accurate object detection and tracking.
Multi-Task Learning Framework: Fair-MOT was built to simultaneously address object detection, feature embedding, and association tasks within a unified framework. Fair-MOT leverages complementary information to enhance tracking performance in challenging scenarios by jointly optimizing these tasks.
Robustness and Accuracy: Fair-MOT integrates deep association networks to refine object associations across frames, leveraging both appearance and motion features. This approach improves the robustness and accuracy of object tracking, particularly in scenarios with occlusions, clutter, and complex motion patterns.
Real-time Performance: Despite its advanced capabilities, Fair-MOT maintains real-time performance, making it suitable for applications requiring low-latency object detection and tracking.
While Deep-SORT offers advanced tracking capabilities, it has limitations related to complexity, generalization, noise tolerance, and real-time constraints. Fair-MOT was developed to address these limitations by integrating deep association networks within a multi-task learning framework, thereby enhancing the robustness, accuracy, and real-time performance of multi-object tracking systems.
Exploring Fair-MOT
Fair-MOT stands for Fair Multi-Object Tracking, where "fair" refers to treating the detection and re-identification tasks equally; it was developed by Yifu Zhang et al. (link). Let's look into the details of the architecture:
Figure 2: Fair-MOT architecture, from the paper
Backbone Network: The backbone network of Fair-MOT is based on ResNet-34, a widely used convolutional neural network architecture chosen for its balance between accuracy and efficiency. The backbone extracts high-level features from the input images, which the subsequent branches use for object detection and re-identification (Re-ID).
Fair-MOT incorporates Deep Layer Aggregation (DLA) to enhance the backbone network's feature representation capability. DLA introduces more skip connections between low-level and high-level features, allowing the network to fuse multi-layer features effectively. This helps capture semantic and spatial information, crucial for accurate object detection and re-ID.
Detection Branch: The detection branch in Fair-MOT is responsible for localizing and classifying objects in the scene. It consists of several heads that predict different components of the object bounding boxes.
Heatmap Head: The heatmap head predicts a heatmap representation of object centers. It generates a heatmap where each pixel value indicates the likelihood of an object center being present at that location. This heatmap is typically of lower resolution than the input image, and it allows for accurate localization of object centers even in densely packed scenes.
Box Offset and Size Heads: These heads predict the offsets and sizes of the object bounding boxes relative to the object centers. Because the heatmap is produced at a lower resolution than the input image, the box offset head predicts small x and y corrections that recover the sub-pixel position of each object center lost to downsampling. The size head predicts the width and height of the bounding box. Together, these predictions accurately localize and size the object bounding boxes.
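To illustrate how these three heads combine at inference time, here is a minimal CenterNet-style decoding sketch in PyTorch. The tensor shapes, the 3x3 max-pooling trick used as a cheap non-maximum suppression, and the top-k value are illustrative assumptions rather than Fair-MOT's exact implementation.

```python
# Decode heatmap peaks + offsets + sizes into scored boxes (assumed shapes).
import torch
import torch.nn.functional as F

def decode_boxes(heatmap, offset, size, k=100):
    """heatmap: (1, H, W); offset, size: (2, H, W). Returns (k, 5) boxes."""
    # Keep only local maxima (a cheap NMS via 3x3 max pooling).
    hmax = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    heatmap = heatmap * (hmax == heatmap).float()
    scores, idx = heatmap.view(-1).topk(k)
    W = heatmap.shape[-1]
    ys, xs = (idx // W).float(), (idx % W).float()
    # Refine the integer peak locations with the predicted sub-pixel offsets.
    xs = xs + offset[0].view(-1)[idx]
    ys = ys + offset[1].view(-1)[idx]
    w, h = size[0].view(-1)[idx], size[1].view(-1)[idx]
    boxes = torch.stack([xs - w / 2, ys - h / 2, xs + w / 2, ys + h / 2], dim=1)
    return torch.cat([boxes, scores[:, None]], dim=1)
```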
Re-ID Branch: The Re-ID branch in Fair-MOT is designed to extract discriminative features for each detected object, enabling their differentiation and tracking across frames. This branch consists of several convolutional layers followed by a feature embedding head.
Re-ID Loss: The Re-ID branch is trained using a combination of identification and triplet losses. The identification loss is calculated as the cross-entropy loss between the objects' predicted and ground truth identities. The triplet loss encourages the network to learn an embedding space where objects with the same identity are closer together than objects with different identities.
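A minimal PyTorch sketch of these two objectives is shown below; the embedding dimension, identity count, and margin value are illustrative assumptions.

```python
# Identification (cross-entropy) + triplet loss sketch for Re-ID training.
import torch
import torch.nn as nn

embed_dim, num_identities = 128, 500               # assumed sizes
classifier = nn.Linear(embed_dim, num_identities)  # identity classifier head
id_loss_fn = nn.CrossEntropyLoss()                 # identification loss
triplet_loss_fn = nn.TripletMarginLoss(margin=0.3) # triplet loss (assumed margin)

def reid_loss(anchor, positive, negative, anchor_ids):
    """anchor/positive share an identity; negative has a different one."""
    id_loss = id_loss_fn(classifier(anchor), anchor_ids)
    tri_loss = triplet_loss_fn(anchor, positive, negative)
    return id_loss + tri_loss
```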
Training Fair-MOT
Fair-MOT is trained end-to-end using a combination of detection and Re-ID losses. The detection loss includes the heatmap loss, box offset loss, and size loss. The heatmap loss is calculated as the focal loss between the predicted and ground truth heatmaps. The box offset and size losses are computed as L1 losses between the predicted and ground truth box offsets and sizes, respectively.
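For the heatmap term, a minimal sketch of the penalty-reduced (CenterNet-style) focal loss is shown below; alpha = 2 and beta = 4 are the values commonly used with such heatmaps, though Fair-MOT's exact settings may differ.

```python
# Penalty-reduced focal loss for Gaussian-splatted center heatmaps.
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """pred, gt: (C, H, W) heatmaps in [0, 1]; gt peaks equal 1 at centers."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()   # exact object centers
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```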
During training, the authors employ several data augmentation techniques, such as random cropping, flipping, and scaling, to improve the model's generalization ability. They also use a hard-example mining strategy to focus on challenging samples during training.
Online Inference: Fair-MOT takes a video frame as input during online inference and passes it through the network to obtain the detection and Re-ID results. The detection branch predicts the object centers, offsets, and sizes, which are then used to generate the final bounding boxes. The Re-ID branch provides a feature embedding for each detected object.
The data association step is then performed to link the detected objects across frames and construct their trajectories. Fair-MOT employs a Kalman Filter for motion modeling and utilizes the Hungarian algorithm to associate the detected objects with existing tracks based on their appearance (Re-ID features) and motion cues.
Fair-MOT's online inference is fast and efficient, making it suitable for various applications, including surveillance, traffic monitoring, and sports analytics.
Exploring Byte-Track
Byte-Track was introduced in the paper "ByteTrack: Multi-Object Tracking by Associating Every Detection Box" (Link), presented by Yifu Zhang et al. at the European Conference on Computer Vision (ECCV) in 2022. It is a simple yet powerful multi-object tracking (MOT) algorithm that associates every detection box, including low-score detections, to improve tracking accuracy and robustness. Byte-Track addresses a limitation of previous MOT methods, which often discard low-score detection boxes, leading to identity switches and fragmentations. The algorithm associates every detection box, regardless of its score, with track-lets using a high-confidence matching strategy. This approach improves tracking performance, especially in crowded and occluded scenes.
Here is a Detailed Explanation of the Byte-Track Algorithm:
Byte-Track consists of three main components: a detection network, a track-let management module, and a data association module.
Detection Network: Byte-Track employs a state-of-the-art object detection model, YOLOX, as its detection network. YOLOX is an anchor-free detector that provides high-quality bounding box predictions and confidence scores. The detection network processes each video frame and generates a set of detection boxes with corresponding scores.
Track-let Management: Byte-Track introduces a track-let management module to maintain historical information about objects. A track-let is defined as a short trajectory of an object within a limited time frame. Byte-Track initiates a new track-let for each detection box that is not matched with any existing track-lets. Track-lets are continuously updated with new detection boxes based on their association scores.
Data Association: The core of Byte-Track lies in its data association module, which associates detection boxes with track-lets. The association is performed in two stages, as sketched in code after this list:
High-Confidence Matching: Byte-Track first associates high-score detection boxes (above a threshold) with track-lets using the Hungarian algorithm. This step ensures that high-confidence detections match the correct track-lets, reducing identity switches.
Low-Score Association: Byte-Track then associates the remaining low-score detection boxes with track-lets based on their similarities. The similarities are computed using intersection over union (IoU) and cosine similarity of appearance features. This step helps to recover missed associations and improve tracking accuracy.
The association process is performed iteratively, considering the historical information of track-lets and detection boxes. Byte-Track also employs a gating mechanism to filter out false positives and suppress redundant detections.
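Here is a simplified sketch of the two-stage association described above, using IoU cost only (no appearance features) for brevity; the score threshold and IoU gate are illustrative assumptions rather than Byte-Track's exact values.

```python
# Simplified BYTE-style two-stage association sketch.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """tracks: (N, 4), dets: (M, 4) boxes as [x1, y1, x2, y2]."""
    ious = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_t = (t[2] - t[0]) * (t[3] - t[1])
            area_d = (d[2] - d[0]) * (d[3] - d[1])
            ious[i, j] = inter / (area_t + area_d - inter + 1e-6)
    return ious

def match(tracks, dets, iou_gate):
    """Hungarian matching on 1 - IoU; returns matches and unmatched tracks."""
    if len(tracks) == 0 or len(dets) == 0:
        return [], list(range(len(tracks)))
    cost = 1.0 - iou_matrix(np.asarray(tracks), np.asarray(dets))
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1 - iou_gate]
    matched = {r for r, _ in matches}
    return matches, [i for i in range(len(tracks)) if i not in matched]

def byte_associate(tracks, boxes, scores, high_thresh=0.6, iou_gate=0.3):
    high, low = scores >= high_thresh, scores < high_thresh
    # Stage 1: match high-score detections to all tracks.
    matches_high, unmatched = match(tracks, boxes[high], iou_gate)
    # Stage 2: match the leftover tracks to low-score detections.
    # (Indices in matches_low are local to the leftover/low-score subsets.)
    matches_low, _ = match([tracks[i] for i in unmatched], boxes[low], iou_gate)
    return matches_high, matches_low
```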
Training: Byte-Track is trained using a combination of detection and Re-ID losses. The detection loss includes the focal loss for classification and the L1 loss for bounding box regression. The Re-ID loss consists of an identification loss and a triplet loss, encouraging the model to learn discriminative features for each object.
Inference: Byte-Track processes each video frame and associates detection boxes with track-lets during inference. The algorithm continuously updates the track-lets' states, including their positions, velocities, and appearance features. Byte-Track effectively handles occlusions and missing detections by utilizing the historical information stored in the track-lets.
Comparison of the Two Tracking Algorithms
Byte-Track and Fair-MOT are state-of-the-art multi-object tracking (MOT) algorithms that perform remarkably well on various benchmarks. Here is a detailed comparison between the two:
Algorithm Overview: Byte-Track is an MOT algorithm that associates every detection box, including low-score detections, to improve tracking accuracy. It is built on top of the YOLOX object detection model and introduces a track-let management module and a two-stage data association process.
Fair-MOT is an MOT approach that treats detection and re-identification (re-ID) tasks equally. It is based on the anchor-free object detection architecture CenterNet and consists of two homogeneous branches for detection and re-ID.
Detection Network: Byte-Track employs YOLOX, a state-of-the-art anchor-free object detection model, as its detection network. YOLOX provides high-quality bounding box predictions and confidence scores.
Fair-MOT utilizes ResNet-34 as its backbone network, incorporating Deep Layer Aggregation (DLA) to fuse multi-layer features effectively. It follows an anchor-free detection approach, similar to CenterNet.
Data Association: Byte-Track performs data association in two stages. First, it associates high-score detection boxes with track-lets using the Hungarian algorithm. Then, it associates low-score detection boxes based on IoU and appearance feature similarities.
Fair-MOT employs a Kalman Filter for motion modeling and utilizes appearance features for data association. It associates detections with existing tracks based on appearance and motion cues using the Hungarian algorithm.
Track-let Management: Byte-Track introduces a track-let management module to maintain historical information. It initiates new track-lets for detections not matched with existing track-lets and continuously updates track-lets with new detections.
Fair-MOT does not explicitly maintain track-lets but relies on the Kalman Filter to estimate the current state of objects based on their previous locations and velocities.
Training: Byte-Track is trained using detection and Re-ID losses. The detection loss includes focal loss for classification and L1 loss for bounding box regression. The Re-ID loss consists of identification and triplet losses.
Fair-MOT is also trained using detection and Re-ID losses. The detection loss includes focal loss for heatmap prediction, L1 loss for box offset prediction, and size loss. The Re-ID loss is based on cross-entropy and triplet losses.
Inference Speed: Byte-Track is designed for real-time inference, achieving 30 FPS on a single GPU during online tracking. Fair-MOT also achieves real-time performance, with a reported speed of 25 FPS on a single GPU.
Theft Detection Using Byte-Track
A video from YouTube, "Videos Show Consistent Theft Problem Faced by Gas Station Chain" (youtube.com), was sampled, along with two shoplifting videos captured in a retail store. These videos were then split into frames, which were subsequently labeled.
import cv2
import os

def split_video_to_frames(video_path, output_folder):
    # Open the video file
    video_capture = cv2.VideoCapture(video_path)
    # Create the output folder if it doesn't exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    # Initialize the frame count
    frame_count = 0
    # Read the first frame
    success, frame = video_capture.read()
    while success:
        # Write the current frame to disk
        frame_path = os.path.join(output_folder, f"frame_{frame_count:06d}.jpg")
        cv2.imwrite(frame_path, frame)
        # Read the next frame
        success, frame = video_capture.read()
        # Increment the frame count
        frame_count += 1
    # Release the video capture object
    video_capture.release()
    print(f"Split {frame_count} frames from video {video_path} to folder {output_folder}")

# Example usage (raw strings avoid backslash-escape issues in Windows paths):
video_path = r'C:\One-Drive\OneDrive - Tredence\Documents\COC-CV\Media2.mp4'
output_folder = r'C:\One-Drive\OneDrive - Tredence\Documents\COC-CV\Video2'
split_video_to_frames(video_path, output_folder)
Using the AnyLabeling tool, arguably the most accessible labeling tool for working with standard dataset formats, the frames were labeled with the following classes:
- person
- banana
- carrot
- cabbage
- book
- fruits
- apple
- oven
- doll
- cell phone
- broccoli
- orange
- chips
- cup
- chocolates
- bottle
- bowl
- box
The frames are labeled, and the annotations are exported from AnyLabeling in COCO format, with separate folders for images and labels. For more robust model training, split the dataset into train and test sets, label the images, and export each split in COCO format; a sketch of such a split follows.
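Below is a minimal sketch of splitting paired image and label files into train and test folders; the file extensions, per-image JSON label naming, and the 80/20 ratio are hypothetical assumptions about the exported layout.

```python
# Split paired image/label files into train and test folders (assumed layout).
import os
import random
import shutil

def split_dataset(image_dir, label_dir, out_dir, train_ratio=0.8, seed=42):
    images = sorted(f for f in os.listdir(image_dir) if f.endswith(".jpg"))
    random.Random(seed).shuffle(images)  # deterministic shuffle
    n_train = int(len(images) * train_ratio)
    for split, names in (("train", images[:n_train]), ("test", images[n_train:])):
        for sub in ("images", "labels"):
            os.makedirs(os.path.join(out_dir, split, sub), exist_ok=True)
        for name in names:
            label = os.path.splitext(name)[0] + ".json"  # assumed label naming
            shutil.copy(os.path.join(image_dir, name),
                        os.path.join(out_dir, split, "images", name))
            shutil.copy(os.path.join(label_dir, label),
                        os.path.join(out_dir, split, "labels", label))
```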