CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2024-10-29
Revision Accepted: 2025-02-09
Crosschecked: 2025-09-04
Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU. End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(8): 1324-1340.
@article{Wang2025QSE,
title="End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention",
author="Zuyi WANG and Zhimeng ZHENG and Jun MENG and Li XU",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="8",
pages="1324-1340",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400960"
}
Abstract: End-to-end object detection methods have attracted extensive interest recently because they remove the need for complicated hand-designed components and simplify the detection pipeline. However, these methods suffer from slower training convergence and inferior detection performance compared with conventional detectors, because their feature fusion and selection processes receive insufficient positive supervision. To address this issue, we introduce a novel query-selection encoder (QSE) for end-to-end object detectors that improves training convergence speed and detection accuracy. QSE consists of multiple encoder layers stacked on top of the backbone. A lightweight head network is attached after each encoder layer to refine features continuously in a cascading manner, providing more positive supervision for efficient training. In addition, a hierarchical feature-aware attention (HFA) mechanism, comprising in-level and cross-level feature attention, is incorporated into each encoder layer to enhance the interaction between features from different levels. HFA effectively suppresses similar feature representations and highlights discriminative ones, thereby accelerating the feature selection process. Our method is highly versatile and accommodates both CNN- and Transformer-based detectors. Extensive experiments on the popular benchmark datasets MS COCO, CrowdHuman, and PASCAL VOC demonstrate the effectiveness of our method: CNN- and Transformer-based detectors equipped with QSE achieve better end-to-end performance within fewer training epochs.
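To make the architecture described in the abstract concrete, the following minimal PyTorch sketch shows one plausible realization of QSE: a stack of encoder layers, each followed by a lightweight auxiliary head, where every layer applies in-level self-attention within each feature level and then cross-level attention across levels (HFA). All class names, dimensions, and the dense cross-level attention here are our illustrative assumptions, not the paper's actual implementation.

# A minimal sketch of the query-selection encoder (QSE) described in the
# abstract. Names, shapes, and the attention realization are assumptions.
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Lightweight per-layer head (assumed form): class logits + normalized boxes."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)
        self.box = nn.Linear(dim, 4)

    def forward(self, feats):                     # feats: (B, N, dim)
        return self.cls(feats), self.box(feats).sigmoid()

class HFALayer(nn.Module):
    """One encoder layer with hierarchical feature-aware attention:
    in-level self-attention within each pyramid level, then cross-level
    attention letting each level attend to all levels (assumed realization)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.in_level = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_level = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, levels):                    # levels: list of (B, N_l, dim)
        # In-level attention: feature interaction within each level.
        levels = [self.norm1(x + self.in_level(x, x, x)[0]) for x in levels]
        # Cross-level attention: each level queries all levels jointly.
        memory = torch.cat(levels, dim=1)
        return [self.norm2(x + self.cross_level(x, memory, memory)[0])
                for x in levels]

class QSE(nn.Module):
    """Encoder layers stacked on the backbone; an auxiliary head after each
    layer yields per-layer predictions for extra positive supervision."""
    def __init__(self, dim=256, num_layers=6, num_classes=80):
        super().__init__()
        self.layers = nn.ModuleList(HFALayer(dim) for _ in range(num_layers))
        self.heads = nn.ModuleList(AuxHead(dim, num_classes)
                                   for _ in range(num_layers))

    def forward(self, levels):
        aux = []                                  # (logits, boxes) per layer
        for layer, head in zip(self.layers, self.heads):
            levels = layer(levels)
            aux.append(head(torch.cat(levels, dim=1)))
        return levels, aux

# Toy usage: three pyramid levels with 100/50/25 tokens of width 256.
levels = [torch.randn(2, n, 256) for n in (100, 50, 25)]
feats, aux_outputs = QSE()(levels)

Note that attending from each level to the concatenation of all levels is quadratic in the total token count; a practical implementation would more plausibly use sparse or deformable attention for efficiency. The per-layer auxiliary outputs are what supply the extra positive supervision during training; at inference only the final layer's predictions would be used.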