CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2024-10-29
Revision Accepted: 2025-02-09
Crosschecked: 2025-09-04
Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU. End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(8): 1324-1340.
@article{Wang2025QSE,
title="End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention",
author="Zuyi WANG and Zhimeng ZHENG and Jun MENG and Li XU",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="8",
pages="1324-1340",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400960"
}
Abstract: End-to-end object detection methods have attracted extensive interest recently because they remove the need for complicated hand-designed components and simplify the detection pipeline. However, these methods suffer from slower training convergence and inferior detection performance compared with conventional detectors, because their feature fusion and selection processes receive insufficient positive supervision. To address this issue, we introduce a novel query-selection encoder (QSE) for end-to-end object detectors that improves training convergence speed and detection accuracy. QSE consists of multiple encoder layers stacked on top of the backbone. A lightweight head network is attached after each encoder layer to refine features continuously in a cascading manner, providing more positive supervision for efficient training. In addition, a hierarchical feature-aware attention (HFA) mechanism, comprising in-level and cross-level feature attention, is incorporated into each encoder layer to enhance the interaction between features from different levels. HFA effectively suppresses similar feature representations and highlights discriminative ones, thereby accelerating the feature selection process. Our method is highly versatile and accommodates both CNN- and Transformer-based detectors. Extensive experiments on the popular benchmark datasets MS COCO, CrowdHuman, and PASCAL VOC demonstrate the effectiveness of our method: CNN- and Transformer-based detectors equipped with QSE achieve better end-to-end performance within fewer training epochs.
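To make the architecture described in the abstract concrete, the following minimal PyTorch sketch shows one plausible realization of QSE: a stack of encoder layers, each followed by a lightweight auxiliary head, where every layer applies in-level self-attention within each feature level and then cross-level attention across levels (HFA). All class names, dimensions, and the dense cross-level attention here are our illustrative assumptions, not the paper's actual implementation.

# A minimal sketch of the query-selection encoder (QSE) described in the
# abstract. Names, shapes, and the attention realization are assumptions.
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Lightweight per-layer head (assumed form): class logits + normalized boxes."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)
        self.box = nn.Linear(dim, 4)

    def forward(self, feats):                     # feats: (B, N, dim)
        return self.cls(feats), self.box(feats).sigmoid()

class HFALayer(nn.Module):
    """One encoder layer with hierarchical feature-aware attention:
    in-level self-attention within each pyramid level, then cross-level
    attention letting each level attend to all levels (assumed realization)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.in_level = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_level = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, levels):                    # levels: list of (B, N_l, dim)
        # In-level attention: feature interaction within each level.
        levels = [self.norm1(x + self.in_level(x, x, x)[0]) for x in levels]
        # Cross-level attention: each level queries all levels jointly.
        memory = torch.cat(levels, dim=1)
        return [self.norm2(x + self.cross_level(x, memory, memory)[0])
                for x in levels]

class QSE(nn.Module):
    """Encoder layers stacked on the backbone; an auxiliary head after each
    layer yields per-layer predictions for extra positive supervision."""
    def __init__(self, dim=256, num_layers=6, num_classes=80):
        super().__init__()
        self.layers = nn.ModuleList(HFALayer(dim) for _ in range(num_layers))
        self.heads = nn.ModuleList(AuxHead(dim, num_classes)
                                   for _ in range(num_layers))

    def forward(self, levels):
        aux = []                                  # (logits, boxes) per layer
        for layer, head in zip(self.layers, self.heads):
            levels = layer(levels)
            aux.append(head(torch.cat(levels, dim=1)))
        return levels, aux

# Toy usage: three pyramid levels with 100/50/25 tokens of width 256.
levels = [torch.randn(2, n, 256) for n in (100, 50, 25)]
feats, aux_outputs = QSE()(levels)

Note that attending from each level to the concatenation of all levels is quadratic in the total token count; a practical implementation would more plausibly use sparse or deformable attention for efficiency. The per-layer auxiliary outputs are what supply the extra positive supervision during training; at inference only the final layer's predictions would be used.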