Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A new focused crawler using an improved tabu search algorithm incorporating ontology and host information

Abstract: Traditional focused crawling methods suffer from incomplete topic description and repeated crawling of visited hyperlinks. To address these problems, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH). A domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results on both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers across different performance metrics.

Key words: Focused crawler; Tabu search algorithm; Ontology; Host information; Priority evaluation
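
The abstract names two mechanisms for avoiding repeated visits: a tabu list and a memory of host information. The following is a minimal sketch of how such a crawl frontier could be organized, not the ITS algorithm of the paper. The class name `TabuFrontier`, the fixed tabu tenure `tabu_size`, and the linear `host_penalty` are all hypothetical, and the `relevance` score is assumed to come from some separate priority evaluation step.

```python
import heapq
from collections import deque
from urllib.parse import urlparse


class TabuFrontier:
    """Toy crawl frontier: best-first selection with a tabu list and per-host memory.

    Illustrative only; the tenure and penalty rules are assumptions, not the paper's.
    """

    def __init__(self, tabu_size=3, host_penalty=0.1):
        self._heap = []                        # entries are (-priority, url)
        self._tabu = deque(maxlen=tabu_size)   # recently selected URLs; oldest expires first
        self._host_visits = {}                 # host -> number of pages selected so far
        self._host_penalty = host_penalty

    def add(self, url, relevance):
        """Queue a URL unless it is currently tabu; demote hosts visited often."""
        if url in self._tabu:
            return False
        host = urlparse(url).netloc
        priority = relevance - self._host_penalty * self._host_visits.get(host, 0)
        heapq.heappush(self._heap, (-priority, url))
        return True

    def pop(self):
        """Return the highest-priority non-tabu URL, or None if nothing is left."""
        while self._heap:
            _, url = heapq.heappop(self._heap)
            if url in self._tabu:
                continue                       # skip entries that became tabu while queued
            self._tabu.append(url)
            host = urlparse(url).netloc
            self._host_visits[host] = self._host_visits.get(host, 0) + 1
            return url
        return None
```

In this sketch a URL stays blocked only while it remains among the last `tabu_size` selections, and the host penalty is applied when a link is queued, so links from a heavily visited host rank lower than equally relevant links from a new host. A real crawler would also need a permanent visited set, which is omitted here.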

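Chinese Summary: A new focused crawler method using an improved tabu search algorithm incorporating ontology and host information

Jingfa Liu1, Zhen Wang1,2, Guo Zhong1, Zhihe Yang1
1School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
2China Unicom Central South Research Institute, Changsha 410000, China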

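Abstract: To address the incomplete topic description and repeated crawling of visited links in traditional focused crawling methods, this paper proposes a new focused crawler method using an improved tabu search algorithm incorporating ontology and host information (FCITS_OH). The method builds a domain ontology based on formal concept analysis (FCA) to describe topics at the semantic and knowledge levels. To avoid repeatedly crawling visited links and to expand the search range, an improved tabu search (ITS) algorithm and a strategy for memorizing host information are proposed. In addition, to improve the evaluation of topic relevance for unvisited links, a comprehensive priority evaluation method based on Web text and link structure is proposed. Experimental results on the tourism and rainstorm disaster topics show that, for different performance metrics, the proposed crawler method outperforms other focused crawling strategies in the literature.

Key words: Focused crawler; Tabu search algorithm; Ontology; Host information; Priority evaluation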


DOI: 10.1631/FITEE.2200315

CLC number: TP39


On-line Access: 2023-07-03

Received: 2022-07-22

Revision Accepted: 2023-01-06

Crosschecked: 2023-07-03

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE