On-line Access: 2025-04-17
Received: 2024-10-29
Revision Accepted: 2025-02-09
Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU. End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention[J]. Frontiers of Information Technology & Electronic Engineering, 2025. https://doi.org/10.1631/FITEE.2400960
@article{FITEE.2400960,
title="End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention",
author="Zuyi WANG and Zhimeng ZHENG and Jun MENG and Li XU",
journal="Frontiers of Information Technology \& Electronic Engineering",
year="2025",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2400960"
}
%0 Journal Article
%T End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention
%A Zuyi WANG
%A Zhimeng ZHENG
%A Jun MENG
%A Li XU
%J Frontiers of Information Technology & Electronic Engineering
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2400960
TY  - JOUR
T1  - End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention
A1  - Zuyi WANG
A1  - Zhimeng ZHENG
A1  - Jun MENG
A1  - Li XU
JO  - Frontiers of Information Technology & Electronic Engineering
SN  - 2095-9184
Y1  - 2025
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2400960
ER  -
Abstract: End-to-end object detection methods have attracted extensive interest recently because they reduce the need for complicated human-designed components and simplify the detection pipeline. However, these methods suffer from slower training convergence and inferior detection performance compared with conventional detectors, because their feature fusion and selection processes are constrained by insufficient positive supervision. To address this issue, we introduce a novel query-selection encoder (QSE) designed for end-to-end object detectors to improve training convergence speed and detection accuracy. The QSE is composed of multiple encoder layers stacked on top of the backbone. A lightweight head network is added after each encoder layer to continuously optimize features in a cascading manner, providing more positive supervision for efficient training. Additionally, a hierarchical feature-aware attention (HFA) mechanism is incorporated in each encoder layer, comprising in-level feature attention and cross-level feature attention, to enhance the interaction between features from different levels. HFA can effectively suppress similar feature representations and highlight discriminative ones, thereby accelerating the feature selection process. Our method is versatile and accommodates both CNN-based and transformer-based detectors. Extensive experiments were conducted on the popular benchmark datasets MS COCO, CrowdHuman, and PASCAL VOC to demonstrate the effectiveness of our method. The results show that CNN-based and transformer-based detectors using QSE achieve better end-to-end performance with a shorter training schedule.
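The structure described in the abstract (stacked encoder layers, each applying in-level and then cross-level attention, each followed by a lightweight head) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`attend`, `encoder_layer`, `head`, `run_encoder`), the use of plain scaled dot-product attention without learned projections or normalization, the residual connections, and the sigmoid scoring head are all assumptions made for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys):
    # Scaled dot-product attention; keys double as values (assumption: no learned projections).
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        out.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)])
    return out

def add(xs, ys):
    # Residual connection: element-wise sum of two token lists.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def encoder_layer(levels):
    # In-level attention: tokens attend to tokens of the same feature level.
    levels = [add(lv, attend(lv, lv)) for lv in levels]
    # Cross-level attention: tokens attend to tokens of all other levels.
    fused = []
    for i, lv in enumerate(levels):
        others = [t for j, other in enumerate(levels) if j != i for t in other]
        fused.append(add(lv, attend(lv, others)) if others else lv)
    return fused

def head(levels, w):
    # Hypothetical lightweight head: one sigmoid score per token.
    return [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(t, w)))) for t in lv]
            for lv in levels]

def run_encoder(levels, num_layers, w):
    per_layer_scores = []
    for _ in range(num_layers):
        levels = encoder_layer(levels)
        # One head output per layer, so each layer could receive its own training loss.
        per_layer_scores.append(head(levels, w))
    return levels, per_layer_scores

levels = [
    [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]],  # finer level: 3 tokens of dimension 2
    [[0.4, 0.4]],                            # coarser level: 1 token of dimension 2
]
features, scores = run_encoder(levels, num_layers=3, w=[0.5, -0.5])
```

Because `run_encoder` returns one set of head outputs per layer, a training loop could attach a loss to every entry of `scores` rather than only the last, which is one way to read the "more positive supervision" claim in the abstract.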