Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2023 Vol.24 No.6 P.859-875
A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
Abstract: Traditional focused crawling methods suffer from incomplete topic description and repeated crawling of visited hyperlinks. To address these problems, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH). A domain ontology is constructed by formal concept analysis (FCA) to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results in both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers on different performance metrics.
Key words: Focused crawler; Tabu search algorithm; Ontology; Host information; Priority evaluation
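The abstract names three ingredients: a tabu memory of visited hyperlinks, a memory of host information, and a priority score that combines Web text with link structure. The sketch below is only an illustration of how such pieces might fit together in a crawl frontier. It is not the ITS algorithm or the priority formula of the paper. The function names, the linear blend in `priority`, the per-host penalty, and the in-memory `graph` and `scores` inputs are all assumptions made for this example.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


def priority(text_score, link_score, weight=0.5):
    """Hypothetical linear blend of text relevance and link-structure relevance."""
    return weight * text_score + (1.0 - weight) * link_score


def crawl(seed, graph, scores, tabu_size=100, max_pages=10):
    """Visit pages in priority order over an in-memory link graph.

    graph:  url -> list of outgoing hyperlinks (stands in for fetching a page)
    scores: url -> (text_score, link_score), assumed to be precomputed
    """
    tabu = deque(maxlen=tabu_size)  # bounded memory of recently visited URLs
    hosts_seen = {}                 # host -> number of pages visited on it
    frontier = [(-1.0, seed)]       # max-heap emulated with negated scores
    order = []

    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in tabu:
            continue                # tabu move: do not revisit this hyperlink
        tabu.append(url)
        host = urlsplit(url).netloc
        hosts_seen[host] = hosts_seen.get(host, 0) + 1
        order.append(url)

        for link in graph.get(url, []):
            if link in tabu:
                continue
            text_score, link_score = scores.get(link, (0.0, 0.0))
            score = priority(text_score, link_score)
            # Assumed heuristic: damp links to hosts that were already visited
            # often, so that the search spreads to other hosts.
            score /= 1 + hosts_seen.get(urlsplit(link).netloc, 0)
            heapq.heappush(frontier, (-score, link))

    return order, hosts_seen
```

Because the tabu list is bounded, a URL can become eligible again once it falls out of the list. A real crawler would pair this short-term memory with a persistent visited set, or size the list to match the crawl budget.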
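1School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
2China Unicom Central South Research Institute, Changsha 410000, China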
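Abstract (translated from Chinese): To address the problems of incomplete topic description and repeated crawling of visited links in traditional focused crawler methods, this paper proposes a new focused crawler method based on an improved tabu search algorithm that incorporates ontology and host information (FCITS_OH). The method constructs a domain ontology based on formal concept analysis (FCA) and describes topics at the semantic and knowledge levels. To avoid repeatedly crawling visited links and to expand the search range, an improved tabu search (ITS) algorithm and a host information memory strategy are proposed. In addition, to improve the evaluation of topic relevance for unvisited links, a comprehensive priority evaluation method based on Web text and link structure is proposed. Experimental results on the tourism and rainstorm disaster topics show that, across different performance metrics, the proposed crawler method outperforms other focused crawler strategies in the literature.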
DOI: 10.1631/FITEE.2200315
CLC number: TP39
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-07-03