JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

2025 Vol.26 No.8 P.1324-1340

End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention

Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU

College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China; Zhejiang University Robotics Institute, Yuyao 315400, China

zuyiwang@zju.edu.cn, zhengzhimengzju@foxmail.com, junmeng@zju.edu.cn, xupower@zju.edu.cn

Abstract: End-to-end object detection methods have attracted extensive interest recently since they alleviate the need for complicated human-designed components and simplify the detection pipeline. However, these methods suffer from slower training convergence and inferior detection performance compared to conventional detectors, as their feature fusion and selection processes are constrained by insufficient positive supervision. To address this issue, we introduce a novel query-selection encoder (QSE) designed for end-to-end object detectors to improve the training convergence speed and detection accuracy. QSE is composed of multiple encoder layers stacked on top of the backbone. A lightweight head network is added after each encoder layer to continuously optimize features in a cascading manner, providing more positive supervision for efficient training. Additionally, a hierarchical feature-aware attention (HFA) mechanism is incorporated in each encoder layer, including in- and cross-level feature attention, to enhance the interaction between features from different levels. HFA can effectively suppress similar feature representations and highlight discriminative ones, thereby accelerating the feature selection process. Our method is highly versatile in accommodating both CNN- and Transformer-based detectors. Extensive experiments were conducted on the popular benchmark datasets MS COCO, CrowdHuman, and PASCAL VOC to demonstrate the effectiveness of our method. The results showed that CNN- and Transformer-based detectors using QSE can achieve better end-to-end performance within fewer training epochs.

Key words: End-to-end object detection; Query-selection encoder; Hierarchical feature-aware attention
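
The cascaded structure described in the abstract (stacked encoder layers, each followed by a lightweight head, with in-level and cross-level attention inside every layer) can be illustrated with a minimal sketch. The abstract does not give the attention formulation, the head design, or the loss, so everything below is an assumption made for illustration: plain scaled dot-product attention without learned projections, a parameter-free sigmoid score standing in for the lightweight head, and the hypothetical names `attend`, `encoder_layer`, `light_head`, and `cascade`. It is not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    # Assumption: no learned query/key/value projections.
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(logits)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def encoder_layer(levels):
    # One hypothetical layer. `levels` is a list of feature levels,
    # each level a list of token vectors.
    new_levels = []
    for i, tokens in enumerate(levels):
        # Tokens from every other level serve as cross-level keys/values.
        others = [t for j, lvl in enumerate(levels) if j != i for t in lvl]
        in_level = attend(tokens, tokens, tokens)
        if others:
            cross = attend(tokens, others, others)
        else:
            cross = [[0.0] * len(t) for t in tokens]
        # Residual sum of the input, in-level, and cross-level terms.
        new_levels.append([[x + a + c for x, a, c in zip(t, il, cr)]
                           for t, il, cr in zip(tokens, in_level, cross)])
    return new_levels

def light_head(levels):
    # Stand-in for the lightweight head: one sigmoid score per token.
    # Assumption: a real head would be learned and would also predict boxes.
    return [[1.0 / (1.0 + math.exp(-sum(t) / len(t))) for t in tokens]
            for tokens in levels]

def cascade(levels, num_layers=3):
    # Apply the layers in sequence and collect the head output after each,
    # which is where per-layer supervision would attach during training.
    per_layer_scores = []
    for _ in range(num_layers):
        levels = encoder_layer(levels)
        per_layer_scores.append(light_head(levels))
    return levels, per_layer_scores

if __name__ == "__main__":
    # Two toy feature levels with 2-dimensional tokens.
    features = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
    refined, scores = cascade(features, num_layers=3)
    print(len(scores), "sets of per-layer scores")
```

In this sketch the per-layer scores are only collected; in a trained detector each set would be matched against ground truth so that every encoder layer receives its own supervision signal.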

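Chinese Summary: End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention

Zuyi WANG¹, Zhimeng ZHENG¹, Jun MENG¹,², Li XU¹,²
¹College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
²Zhejiang University Robotics Institute, Yuyao 315400, China
Abstract: End-to-end object detection methods have attracted wide attention in recent years because they require no complex hand-designed components and simplify the detection pipeline. Compared with conventional detectors, however, they converge more slowly in training and detect less accurately, because feature fusion and selection are limited by insufficient positive-sample supervision. To address this problem, this paper proposes a query-selection encoder (QSE) for end-to-end object detectors that improves training convergence speed and detection accuracy. QSE consists of multiple encoder layers, with a lightweight network added after each encoder layer to continuously optimize features in a cascaded manner, providing more positive-sample supervision for efficient training. In addition, each encoder layer introduces a hierarchical feature-aware attention (HFA) mechanism, comprising in-level and cross-level feature attention, to strengthen interaction and fusion among features at different levels. HFA effectively suppresses similar feature representations and strengthens discriminative ones, thereby accelerating feature selection. The method applies flexibly to both CNN-based and Transformer-based detectors; extensive experiments on the mainstream object detection benchmarks MS COCO, CrowdHuman, and PASCAL VOC show that CNN-based and Transformer-based detectors using QSE achieve better end-to-end detection performance in fewer training epochs.

Key words: End-to-end object detection; Query-selection encoder; Hierarchical feature-aware attention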

DOI: 10.1631/FITEE.2400960

CLC number: TP391.41

On-line Access: 2025-06-04

Received: 2024-10-29

Revision Accepted: 2025-02-09

Crosschecked: 2025-09-04

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE
