Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

Hwang Injung

Brief Introduction

Indicate the "essential difference" between anchor-based and anchor-free detectors
How to define positive and negative training samples
Propose an "adaptive training sample selection" to automatically select positive and negative training samples
Demonstrate that tiling multiple anchors per location is unnecessary under the proposed selection
Achieve SOTA performance on MS COCO without overhead

Recent object detectors: anchor-based & anchor-free

Two main families of object detectors
Anchor-based method: single-stage & two-stage
Anchor-free method: keypoint-based & center-based


Two-stage: Faster R-CNN
Consists of a Region Proposal Network (RPN) & a region-wise prediction network (R-CNN)
Good accuracy thanks to repeated anchor refinement
Single-stage: Single Shot MultiBox Detector (SSD)
High computational efficiency


Keypoint-based method: CornerNet
First locates several pre-defined or self-learned keypoints
Then, generates bounding boxes
Center-based method: YOLO
Regards the center of the object as foreground to define positives
Then predicts the distances from each positive to the four sides of the object bounding box
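
As a rough illustration of the center-based regression described above, the sketch below encodes a ground-truth box as four distances from a positive point and decodes it back. The function names ltrb_targets and decode_ltrb are hypothetical, chosen for this note only, and the (x1, y1, x2, y2) box layout is an assumption.

```python
def ltrb_targets(point, box):
    """Distances from a point (x, y) to the left, top, right and bottom
    sides of a box given as (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


def decode_ltrb(point, ltrb):
    """Inverse of ltrb_targets: rebuild the box from a point and four distances."""
    x, y = point
    left, top, right, bottom = ltrb
    return (x - left, y - top, x + right, y + bottom)


box = (10.0, 20.0, 50.0, 80.0)
point = (30.0, 40.0)                  # a location inside the box
targets = ltrb_targets(point, box)    # (20.0, 20.0, 20.0, 40.0)
restored = decode_ltrb(point, targets)
```

All four distances are positive exactly when the point lies inside the box, which is why these methods can use the sign of the targets as a spatial test.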

Difference Analysis between the Two

RetinaNet (anchor-based) vs. FCOS (anchor-free)
one-stage anchor-based & center-based anchor-free
Points of comparison
The positive/negative sample definition
The number of anchors tiled per location
Dataset: MS COCO (80 object classes)
RetinaNet (#A=1) → one square anchor box per location

Inconsistency removal

Essential Difference (and not)

The definition of positive and negative samples is an essential difference
The regression starting status (box or point) makes no obvious difference
RetinaNet uses IoU (spatial and scale dimensions simultaneously), while FCOS applies a spatial constraint first, then a scale constraint
Using the spatial and scale constraint strategy instead of IoU, RetinaNet's AP improves
The regression starting status of RetinaNet is a box, while that of FCOS is a point
Table 2 also indicates that the regression starting status is irrelevant
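
The two definitions of positives compared above can be sketched as follows. This is a simplified illustration, not either detector's actual code: the function names, the 0.5 IoU threshold and the scale ranges passed in the usage lines are assumptions, and RetinaNet's band of ignored anchors is left out.

```python
import numpy as np


def iou(boxes, gt):
    """IoU between each row of boxes (N, 4) and a single box gt (4,),
    all in (x1, y1, x2, y2) form."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter)


def iou_positives(anchors, gt, pos_thr=0.5):
    """Anchor-based style: one IoU test covers position and scale at once.
    pos_thr is an assumed value."""
    return iou(anchors, gt) >= pos_thr


def spatial_scale_positives(points, gt, scale_range):
    """Center-based style: the point must lie inside the box (spatial),
    then its largest regression distance must fall in the level's
    scale range (scale). scale_range is an assumed (low, high) pair."""
    x, y = points[:, 0], points[:, 1]
    ltrb = np.stack([x - gt[0], y - gt[1], gt[2] - x, gt[3] - y], axis=1)
    inside = ltrb.min(axis=1) > 0
    max_dist = ltrb.max(axis=1)
    in_range = (max_dist >= scale_range[0]) & (max_dist <= scale_range[1])
    return inside & in_range


gt = np.array([0.0, 0.0, 100.0, 100.0])
anchors = np.array([[0.0, 0.0, 100.0, 100.0], [100.0, 0.0, 200.0, 100.0]])
points = np.array([[50.0, 50.0], [150.0, 50.0]])
by_iou = iou_positives(anchors, gt)
by_constraint = spatial_scale_positives(points, gt, scale_range=(32, 64))
```

Both functions return a boolean mask over the samples, so the two strategies can be swapped inside the same training loop, which is the kind of controlled comparison the slide refers to.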

Adaptive Training Sample Selection

New way to define positives and negatives based on the previous insights
→ Better than the traditional IoU-based strategy
Sensitive hyper-parameters (IoU threshold, scale ranges) lead to very different results

Algorithm details

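For each ground-truth box g, on each pyramid level select the k anchor boxes whose centers are closest to the center of g (L2 distance)
With L feature pyramid levels, this gives L * k candidate positive samples per ground truth
Then compute the IoU between the candidates and g, and set the threshold $t_g = m_g + v_g$
$m_g$: mean of the candidates' IoUs, $v_g$: their standard deviation
Candidates with IoU ≥ $t_g$ whose center lies inside g become positives; all other anchors are negatives
If an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU is selected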
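
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names pairwise_iou and atss_assign are hypothetical, boxes are assumed to be (x1, y1, x2, y2) arrays, k=9 is an assumed default, and the standard deviation is the population one (NumPy's default).

```python
import numpy as np


def pairwise_iou(anchors, gts):
    """IoU matrix of shape (num_anchors, num_gts), boxes as (x1, y1, x2, y2)."""
    lt = np.maximum(anchors[:, None, :2], gts[None, :, :2])
    rb = np.minimum(anchors[:, None, 2:], gts[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = np.prod(anchors[:, 2:] - anchors[:, :2], axis=1)
    area_g = np.prod(gts[:, 2:] - gts[:, :2], axis=1)
    return inter / (area_a[:, None] + area_g[None, :] - inter)


def atss_assign(anchors_per_level, gts, k=9):
    """For each anchor, return the index of its ground truth, or -1 if negative.

    anchors_per_level: list of (N_l, 4) arrays, one per pyramid level.
    gts: (G, 4) array of ground-truth boxes.
    """
    anchors = np.concatenate(anchors_per_level, axis=0)
    ious = pairwise_iou(anchors, gts)
    a_ctr = (anchors[:, :2] + anchors[:, 2:]) / 2
    g_ctr = (gts[:, :2] + gts[:, 2:]) / 2
    dist = np.linalg.norm(a_ctr[:, None, :] - g_ctr[None, :, :], axis=2)

    is_pos = np.zeros_like(ious, dtype=bool)
    for g in range(len(gts)):
        # Step 1: k closest anchor centers on every level -> L * k candidates.
        cand, start = [], 0
        for level in anchors_per_level:
            n = len(level)
            top = np.argsort(dist[start:start + n, g])[:min(k, n)]
            cand.append(top + start)
            start += n
        cand = np.concatenate(cand)

        # Step 2: threshold t_g = mean + standard deviation of candidate IoUs.
        cand_iou = ious[cand, g]
        t_g = cand_iou.mean() + cand_iou.std()

        # Step 3: keep candidates above the threshold with center inside g.
        inside = ((a_ctr[cand] > gts[g, :2]) & (a_ctr[cand] < gts[g, 2:])).all(axis=1)
        is_pos[cand[(cand_iou >= t_g) & inside], g] = True

    # Step 4: an anchor positive for several boxes goes to the highest IoU.
    masked = np.where(is_pos, ious, -1.0)
    assigned = masked.argmax(axis=1)
    assigned[masked.max(axis=1) < 0] = -1
    return assigned
```

The loop over ground-truth boxes keeps the sketch close to the slide's step order; a production version would likely vectorize the top-k search across boxes.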


Selecting candidates based on the center distance between anchor box and object
Both RetinaNet and FCOS give better results when the anchor box center is closer to the object center
larger IoU (anchor-based) & higher-quality detections (anchor-free)
Using the sum of mean and standard deviation as the IoU threshold
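A high $m_g$ indicates high-quality candidates, so the IoU threshold should be high
A high $v_g$ means one pyramid level fits the object best, so the raised threshold selects positives from that level only
A low $v_g$ means several levels fit the object, so the lower threshold admits appropriate positives from all of them
Limiting the positive samples' centers to lie inside the object
An anchor whose center is outside the object is a poor candidate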
Maintaining fairness between different objects
Each object gets roughly the same number of positive samples regardless of its scale and aspect ratio
In contrast, RetinaNet and FCOS tend to produce many more positive samples for larger objects
Keeping almost hyperparameter-free → Only k (and results are insensitive to it)
ATSS shows better performance without any additional overhead
Quite insensitive to the variations of k
Quite stable and insensitive to different scales and aspect ratios of the anchor box
ATSS outperforms previous SOTA detectors without overhead
The method proposed in this paper is compatible with and complementary to most current techniques
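
To make the mean-plus-standard-deviation threshold discussed in this section more concrete, the sketch below compares two made-up sets of candidate IoUs: one where a single level fits the object well (a few high IoUs, many low ones) and one where all candidates are similar. The IoU values and the helper name atss_threshold are invented for illustration only.

```python
import numpy as np


def atss_threshold(candidate_ious):
    """t_g = m_g + v_g: mean plus (population) standard deviation of the IoUs."""
    ious = np.asarray(candidate_ious, dtype=float)
    return ious.mean() + ious.std()


# Hypothetical candidates: one pyramid level matches the object much better.
peaked = np.array([0.80, 0.75, 0.10, 0.10, 0.05, 0.05])
# Hypothetical candidates: every level matches about equally well.
flat = np.array([0.35, 0.30, 0.30, 0.30, 0.25, 0.30])

t_peaked = atss_threshold(peaked)
t_flat = atss_threshold(flat)

# Positives are the candidates at or above their own threshold.
pos_peaked = peaked[peaked >= t_peaked]
pos_flat = flat[flat >= t_flat]
```

Both sets have a similar mean, but the larger spread of the first set pushes its threshold well above the second one, so only its two high-IoU candidates survive.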


The previous experiments are based on RetinaNet with only one anchor per location
Under the traditional IoU-based strategy, tiling more anchor boxes per location is effective
Under this paper's strategy (ATSS), however, tiling more anchors per location brings no further benefit