
Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Video summarization with a graph convolutional attention network

Abstract: Video summarization has established itself as a fundamental technique for generating compact and concise videos, which eases the management and browsing of large-scale video data. Existing methods fail to fully consider the local and global relations among the frames of a video, which degrades summarization performance. To address this problem, we propose a graph convolutional attention network (GCAN) for video summarization. GCAN consists of two parts, embedding learning and context fusion, where embedding learning comprises a temporal branch and a graph branch. Specifically, GCAN uses dilated temporal convolution to model local cues and temporal self-attention to exploit global cues among video frames. It learns graph embeddings via a multi-layer graph convolutional network to reveal the intrinsic structure of the frame samples. The context fusion part combines the output streams of the temporal and graph branches into a context-aware representation of the frames, on which importance scores are evaluated to select representative frames and generate the video summary. Experiments on two benchmark datasets, SumMe and TVSum, show that the proposed GCAN approach achieves superior performance compared with several state-of-the-art alternatives under three evaluation settings.

Key words: Temporal learning, Self-attention mechanism, Graph convolutional network, Context fusion, Video summarization
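As a rough illustration of the pipeline described in the abstract, the sketch below wires the two embedding branches and the fusion step together in PyTorch. All layer sizes, the cosine-similarity frame graph, and the concatenation-based fusion rule are assumptions made for illustration; they are not details taken from the paper.

```python
# A minimal, illustrative sketch of a GCAN-style model, based only on the
# abstract above. Hidden sizes, depths, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCANSketch(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, num_heads=4, gcn_layers=2):
        super().__init__()
        # Temporal branch: stacked dilated 1D convolutions model local cues...
        self.dilated_convs = nn.ModuleList([
            nn.Conv1d(feat_dim if i == 0 else hidden, hidden,
                      kernel_size=3, dilation=2 ** i, padding=2 ** i)
            for i in range(3)
        ])
        # ...and temporal self-attention exploits global cues across frames.
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Graph branch: a multi-layer GCN over a frame-similarity graph.
        self.gcn_weights = nn.ModuleList([
            nn.Linear(feat_dim if i == 0 else hidden, hidden)
            for i in range(gcn_layers)
        ])
        # Context fusion: merge both streams, then score each frame.
        self.fuse = nn.Linear(2 * hidden, hidden)
        self.score = nn.Linear(hidden, 1)

    def build_adj(self, x):
        # Frame graph from non-negative cosine similarity, row-normalized
        # (an assumption; the paper's graph construction is not given here).
        sim = F.relu(F.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1))
        return sim / sim.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    def forward(self, x):                       # x: (B, T, feat_dim)
        # Temporal branch: local cues, then global self-attention.
        h = x.transpose(1, 2)                   # (B, feat_dim, T)
        for conv in self.dilated_convs:
            h = F.relu(conv(h))
        h = h.transpose(1, 2)                   # (B, T, hidden)
        h, _ = self.attn(h, h, h)
        # Graph branch: neighborhood aggregation over the frame graph.
        adj = self.build_adj(x)
        g = x
        for lin in self.gcn_weights:
            g = F.relu(torch.bmm(adj, lin(g)))
        # Context fusion and per-frame importance scores in [0, 1].
        ctx = F.relu(self.fuse(torch.cat([h, g], dim=-1)))
        return torch.sigmoid(self.score(ctx)).squeeze(-1)   # (B, T)
```

Training such a model would require per-frame importance labels or an unsupervised objective, which the abstract does not specify.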

Chinese Summary (translated): Video summarization with a graph convolutional attention network

Ping Li 1,2, Chao Tang 1, Xianghua Xu 1
1 School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Abstract: Video summarization has become a fundamental technique for generating compact and concise videos, which facilitates the management and browsing of large-scale video data. Existing methods do not fully consider the local and global relations among video frames, which degrades summarization performance. We propose a video summarization method based on a graph convolutional attention network (GCAN). GCAN consists of two parts, embedding learning and context fusion, where embedding learning includes a temporal branch and a graph branch. Specifically, GCAN uses dilated temporal convolution to model local cues and temporal self-attention to exploit the global cues of video frames; meanwhile, it learns graph embeddings via a multi-layer graph convolutional network to reflect the intrinsic structure of the frame samples. The context fusion part merges the output streams of the temporal and graph branches into a context-aware representation of the frames, computes their importance scores, and accordingly selects representative frames to generate the video summary. Experimental results on two benchmark datasets, SumMe and TVSum, show that GCAN achieves superior performance over several state-of-the-art methods under three evaluation settings.

Key words: temporal learning; self-attention mechanism; graph convolutional network; context fusion; video summarization
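The final step of the pipeline, turning per-frame importance scores into a summary, is only named in both abstracts ("selecting representative frames"). A common convention on SumMe and TVSum is to pick whole shots under a summary-length budget; the greedy heuristic, the shot-mean scoring, and the 15% budget below are assumptions for illustration, not the authors' procedure.

```python
# A hedged sketch of score-based summary selection under a length budget.
import numpy as np

def select_summary(scores, shot_bounds, budget_ratio=0.15):
    """scores: (T,) per-frame importance; shot_bounds: list of (start, end),
    end exclusive. Returns a boolean keyframe mask of length T."""
    n_frames = len(scores)
    budget = int(budget_ratio * n_frames)
    # Score each shot by the mean importance of its frames.
    shots = [(scores[s:e].mean(), e - s, i)
             for i, (s, e) in enumerate(shot_bounds)]
    # Greedy knapsack heuristic: take the highest-scoring shots that fit.
    chosen, used = [], 0
    for _, length, i in sorted(shots, key=lambda t: -t[0]):
        if used + length <= budget:
            chosen.append(i)
            used += length
    mask = np.zeros(n_frames, dtype=bool)
    for i in chosen:
        s, e = shot_bounds[i]
        mask[s:e] = True
    return mask
```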


DOI: 10.1631/FITEE.2000429
CLC number: TP391
Downloaded: 8175 (full text); 1434 (summary)
Clicked: 4348
Cited: 0
On-line Access: 2021-07-12
Received: 2020-08-25
Revision Accepted: 2021-01-14
Crosschecked: 2021-04-01
