Brief Introduction
• Indicate the "essential difference" between anchor-based and anchor-free detectors
  ◦ How to define positive and negative training samples
• Propose "adaptive training sample selection" (ATSS) to automatically select positive and negative training samples
• Demonstrate that tiling multiple anchors per location is unnecessary
• Achieve SOTA performance on MS COCO without extra overhead
Recent object detectors: anchor-based & anchor-free
• Mainly two kinds of methods for object detection
  ◦ Anchor-based methods: single-stage & two-stage
  ◦ Anchor-free methods: keypoint-based & center-based
Anchor-based
• Two-stage: Faster R-CNN
  ◦ Consists of a Region Proposal Network (RPN) & a region-wise prediction network (R-CNN)
  ◦ Good accuracy thanks to repeated anchor refinement
• Single-stage: Single Shot Detector (SSD)
  ◦ High computational efficiency
Anchor-free
• Keypoint-based methods: CornerNet
  ◦ First locate several pre-defined or self-learned keypoints
  ◦ Then generate bounding boxes from them
• Center-based methods: YOLO, FCOS
  ◦ Regard the center point (or region) of an object as foreground to define positives
  ◦ Then predict the distances from each positive to the four sides of the object bounding box (see the sketch below)
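As an illustration (a minimal sketch with my own helper name, assuming boxes in (x1, y1, x2, y2) format), this is how a center-based detector can encode the four distances for a foreground point:

```python
import numpy as np

def ltrb_targets(point, box):
    """Distances from a foreground point to the four sides of a box.

    point: (x, y) location treated as a positive sample.
    box:   (x1, y1, x2, y2) ground-truth bounding box.
    Returns the (left, top, right, bottom) regression targets.
    """
    x, y = point
    x1, y1, x2, y2 = box
    return np.array([x - x1, y - y1, x2 - x, y2 - y])

# A point at the exact box center yields symmetric distances.
print(ltrb_targets((50, 40), (10, 10, 90, 70)))  # [40. 30. 40. 30.]
```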
Difference Analysis between the Two
• RetinaNet (anchor-based) vs. FCOS (anchor-free)
  ◦ i.e., one-stage anchor-based vs. center-based anchor-free
• Points of comparison
  ◦ The positive/negative sample definition
  ◦ The number of anchors tiled per location
• Dataset: MS COCO (80 object classes)
• RetinaNet (#A=1) → one square anchor box per location (sketched below)
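For context, a hypothetical helper reproducing that anchor layout (the 8S scale, with S the level's stride, matches the paper's RetinaNet (#A=1) setting as far as I can tell; the code itself is only a sketch):

```python
import numpy as np

def square_anchors(feat_h, feat_w, stride, scale=8):
    """One square anchor per location (#A=1); side length = scale * stride."""
    half = scale * stride / 2.0
    ys, xs = np.mgrid[:feat_h, :feat_w]
    cx = (xs + 0.5) * stride  # anchor centers sit at feature-cell centers
    cy = (ys + 0.5) * stride
    boxes = np.stack([cx - half, cy - half, cx + half, cy + half], axis=-1)
    return boxes.reshape(-1, 4)  # (H * W, 4) boxes in (x1, y1, x2, y2)

print(square_anchors(2, 2, stride=8).shape)  # (4, 4)
```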
Inconsistency removal: FCOS's universal improvements (e.g., GroupNorm, GIoU loss, centerness) are first applied to RetinaNet (#A=1) so that only the essential differences remain
Essential Difference (and not)
• The definition of positive and negative samples is the essential difference
• No essential difference regarding the regression starting status (box or point)
• RetinaNet's IoU considers the spatial and scale dimensions simultaneously, whereas FCOS applies a spatial constraint first, then a scale constraint (see the sketch below)
• Using the spatial-then-scale constraint strategy instead of IoU, RetinaNet improves
• The regression starting status of RetinaNet is a box, while that of FCOS is a point
• Table 2 also indicates that the regression starting status is an irrelevant factor
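A compact sketch of the two sample definitions (helper names, the 0.5 threshold, and the scale range are illustrative placeholders, not either detector's exact configuration):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def retinanet_positive(anchor, gt, thr=0.5):
    # Spatial and scale dimensions are judged jointly through one IoU test.
    return iou(anchor, gt) >= thr

def fcos_positive(point, gt, scale_range):
    # Spatial constraint first: the point must lie inside the box ...
    x, y = point
    x1, y1, x2, y2 = gt
    if not (x1 < x < x2 and y1 < y < y2):
        return False
    # ... then the scale constraint: the max regression distance must fall
    # inside the range assigned to this pyramid level.
    lo, hi = scale_range
    return lo <= max(x - x1, y - y1, x2 - x, y2 - y) <= hi
```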
Adaptive Training Sample Selection
• A new way to define positives and negatives based on the previous insights
  ◦ → Better than the traditional IoU-based strategy
• Sensitive hyper-parameters (IoU threshold, scale ranges) in prior strategies lead to very different results
Algorithm details
• For each ground-truth box g, select the k anchor boxes whose centers are closest to the center of g (L2 distance) on each pyramid level
  ◦ With L feature pyramid levels, this yields k × L candidate positive samples
• Then compute the IoU between each candidate and g; the threshold is t_g = m_g + v_g
  ◦ m_g: mean of the candidate IoUs, v_g: their standard deviation
  ◦ Candidates with IoU ≥ t_g whose centers lie inside g become positives; all other anchors are negatives
• If an anchor is assigned to multiple ground-truth boxes, the one with the highest IoU is selected
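Putting these steps together, a minimal NumPy sketch of the assignment for a single ground-truth box (the function and variable names are mine, and the official implementation is vectorized over all boxes):

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = np.maximum(a[:2], b[:2])
    ix2, iy2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def atss_assign(anchors, anchors_per_level, gt, k=9):
    """Return indices of the anchors selected as positives for one box.

    anchors: (N, 4) array over all pyramid levels, (x1, y1, x2, y2).
    anchors_per_level: list of per-level anchor counts (sums to N).
    gt: (4,) ground-truth box.
    """
    centers = (anchors[:, :2] + anchors[:, 2:]) / 2.0
    gt_center = (gt[:2] + gt[2:]) / 2.0
    dist = np.linalg.norm(centers - gt_center, axis=1)

    # Step 1: on each level, keep the k anchors closest to the gt center.
    candidates, start = [], 0
    for n in anchors_per_level:
        topk = np.argsort(dist[start:start + n])[:min(k, n)] + start
        candidates.append(topk)
        start += n
    candidates = np.concatenate(candidates)  # k * L candidates in total

    # Step 2: adaptive IoU threshold t_g = mean + std of the candidate IoUs.
    ious = np.array([iou(anchors[i], gt) for i in candidates])
    t_g = ious.mean() + ious.std()

    # Step 3: keep candidates above t_g whose centers lie inside the box.
    cx, cy = centers[candidates, 0], centers[candidates, 1]
    inside = (gt[0] < cx) & (cx < gt[2]) & (gt[1] < cy) & (cy < gt[3])
    return candidates[(ious >= t_g) & inside]
```

In the paper, k = 9 and the anchor setting is the same as RetinaNet (#A=1), so the only remaining hyper-parameter is k.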
Key design points
• Selecting candidates based on the center distance between the anchor box and the object
  ◦ Both RetinaNet and FCOS show better results when the center of the anchor box/point is closer to the object center
  ◦ Closer centers yield larger IoUs (anchor-based) & higher-quality detections (anchor-free)
• Using the sum of mean and standard deviation as the IoU threshold (worked example after this list)
  ◦ A high mean m_g indicates high-quality candidates, so the threshold should be raised accordingly
  ◦ A high standard deviation v_g means one pyramid level clearly fits the object: adding v_g raises the threshold so positives come only from that level; a low v_g means several levels fit, and the lower threshold keeps appropriate positives across them
• Limiting the positive samples' centers to inside the object
  ◦ An anchor whose center lies outside the object is an obviously poor candidate, as it would be trained with features from outside the object
• Maintaining fairness between objects of different sizes
  ◦ Each object gets roughly the same number of positives (statistically about 0.2 kL) regardless of its scale and aspect ratio
  ◦ By contrast, RetinaNet and FCOS tend to assign far more positives to larger objects
• Keeping the method almost hyperparameter-free → only k remains (and results are insensitive to it)
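A small worked example with made-up candidate IoUs (k = 3, L = 3) showing how the m_g + v_g threshold adapts:

```python
import numpy as np

# Hypothetical candidate IoUs for one object, k = 3 per level, L = 3 levels.
ious = np.array([0.05, 0.10, 0.08,   # level 1: too coarse for this object
                 0.45, 0.60, 0.52,   # level 2: fits the object well
                 0.20, 0.30, 0.25])  # level 3
t_g = ious.mean() + ious.std()
print(f"t_g = {t_g:.2f}, positives = {ious[ious >= t_g]}")
# -> t_g = 0.47, positives = [0.6  0.52]; both survivors come from the
#    level that actually fits, without any hand-tuned IoU threshold.
```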
• ATSS shows better performance without any additional overhead
• Quite insensitive to variations of k
• Quite stable and insensitive to different scales and aspect ratios of the anchor box
ATSS outperforms SOTA without overhead
• The method proposed by this paper is compatible with and complementary to most current techniques
Discussion
• The previous experiments are based on RetinaNet with only one anchor per location
• Under the traditional IoU-based strategy, tiling more anchor boxes per location is effective
• Under the proposed ATSS, however, the number of anchors tiled per location is irrelevant