Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU. End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400960
Abstract: End-to-end object detection methods have attracted extensive interest recently because they remove the need for complicated hand-designed components and simplify the detection pipeline. However, these methods suffer from slower training convergence and inferior detection performance compared with conventional detectors, as their feature fusion and selection processes are constrained by insufficient positive supervision. To address this issue, we introduce a novel query-selection encoder (QSE) designed for end-to-end object detectors to improve training convergence speed and detection accuracy. The QSE consists of multiple encoder layers stacked on top of the backbone. A lightweight head network is added after each encoder layer to continuously refine features in a cascading manner, providing more positive supervision for efficient training. Additionally, a hierarchical feature-aware attention (HFA) mechanism, comprising in-level feature attention and cross-level feature attention, is incorporated into each encoder layer to enhance the interaction between features from different levels. HFA effectively suppresses similar feature representations and highlights discriminative ones, thereby accelerating the feature selection process. Our method is highly versatile and accommodates both CNN-based and transformer-based detectors. Extensive experiments on the popular benchmark datasets MS COCO, CrowdHuman, and PASCAL VOC demonstrate the effectiveness of our method: both CNN-based and transformer-based detectors equipped with QSE achieve better end-to-end performance under a shorter training schedule.
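To make the cascading-supervision idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a stack of encoder layers over flattened backbone features, each followed by a lightweight auxiliary head whose outputs can be matched against ground truth for extra positive supervision. The hierarchical feature-aware attention (HFA) is stood in for by plain multi-head self-attention, and all names and sizes (e.g., QuerySelectionEncoderSketch, dim=256, three layers) are illustrative assumptions.

```python
# Sketch only: plain self-attention replaces the paper's HFA; sizes are assumptions.
import torch
import torch.nn as nn


class EncoderLayerWithAuxHead(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)
        # Lightweight auxiliary head: per-token class logits and box regression.
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        return x, self.cls_head(x), self.box_head(x)


class QuerySelectionEncoderSketch(nn.Module):
    """Cascade of encoder layers; every layer's auxiliary predictions
    can be supervised during training, refining features layer by layer."""
    def __init__(self, num_layers=3, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderLayerWithAuxHead(dim) for _ in range(num_layers)])

    def forward(self, tokens):
        aux_outputs = []
        for layer in self.layers:
            tokens, cls_logits, boxes = layer(tokens)
            aux_outputs.append((cls_logits, boxes))
        return tokens, aux_outputs


# Usage with flattened multi-level backbone features (batch, tokens, channels).
feats = torch.randn(2, 1000, 256)
refined, aux = QuerySelectionEncoderSketch()(feats)
```

During training, the auxiliary outputs from every layer would each be matched to the ground-truth objects (e.g., via a one-to-one assignment as in end-to-end detectors), which is what supplies the additional positive supervision described in the abstract.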