CLC number: TP39

On-line Access: 2023-07-03

Received: 2022-07-22

Revision Accepted: 2023-01-06

Crosschecked: 2023-07-03

 ORCID:

Jingfa LIU

https://orcid.org/0000-0002-0407-1522

Zhen WANG

https://orcid.org/0000-0003-4940-2812

Guo ZHONG

https://orcid.org/0000-0002-6428-5645

Zhihe YANG

https://orcid.org/0000-0002-0998-5227

Frontiers of Information Technology & Electronic Engineering  2023 Vol.24 No.6 P.859-875

http://doi.org/10.1631/FITEE.2200315


A new focused crawler using an improved tabu search algorithm incorporating ontology and host information


Author(s):  Jingfa LIU, Zhen WANG, Guo ZHONG, Zhihe YANG

Affiliation(s):  School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China; more

Corresponding email(s):   1007427607@qq.com

Key Words:  Focused crawler, Tabu search algorithm, Ontology, Host information, Priority evaluation


Jingfa LIU, Zhen WANG, Guo ZHONG, Zhihe YANG. A new focused crawler using an improved tabu search algorithm incorporating ontology and host information[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(6): 859-875.

@article{Liu2023FCITSOH,
  title     = "A new focused crawler using an improved tabu search algorithm incorporating ontology and host information",
  author    = "Jingfa LIU and Zhen WANG and Guo ZHONG and Zhihe YANG",
  journal   = "Frontiers of Information Technology \& Electronic Engineering",
  volume    = "24",
  number    = "6",
  pages     = "859--875",
  year      = "2023",
  publisher = "Zhejiang University Press \& Springer",
  doi       = "10.1631/FITEE.2200315"
}

%0 Journal Article
%T A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
%A Jingfa LIU
%A Zhen WANG
%A Guo ZHONG
%A Zhihe YANG
%J Frontiers of Information Technology & Electronic Engineering
%V 24
%N 6
%P 859-875
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2200315

TY  - JOUR
T1  - A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
A1  - Jingfa LIU
A1  - Zhen WANG
A1  - Guo ZHONG
A1  - Zhihe YANG
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 24
IS  - 6
SP  - 859
EP  - 875
SN  - 2095-9184
Y1  - 2023
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2200315
ER  - 


Abstract: 
To address incomplete topic description and repeated crawling of visited hyperlinks in traditional focused crawling methods, we propose a novel focused crawler that uses an improved tabu search algorithm with domain ontology and host information (FCITS_OH). A domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid re-crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results in both the tourism and rainstorm disaster domains show that the proposed focused crawler outperforms traditional focused crawlers across different performance metrics.
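
The general idea of a tabu list and host memory in a crawl frontier can be illustrated with a toy example. The sketch below is not the FCITS_OH algorithm, whose details are not given in this abstract. It is a minimal best-first crawl over an in-memory link graph, and everything in it is an assumption made for illustration: the names `TOPIC_TERMS`, `SAMPLE_WEB`, `text_score`, and `crawl`, the keyword-overlap relevance measure, the equal 0.5 weights, the tabu tenure, and the per-host penalty.

```python
import heapq
from collections import Counter, deque
from urllib.parse import urlparse

# Hypothetical topic vocabulary; a real system would derive terms from an ontology.
TOPIC_TERMS = {"tourism", "hotel", "travel"}

# Hypothetical in-memory "web": url -> (page_text, [(link_url, anchor_text), ...]).
SAMPLE_WEB = {
    "http://a.example/": ("travel portal", [
        ("http://a.example/hotels", "hotel tourism deals"),
        ("http://b.example/news", "sports news"),
        ("http://a.example/", "home"),
    ]),
    "http://a.example/hotels": ("hotel travel tourism", [
        ("http://a.example/", "back home"),
        ("http://b.example/news", "news"),
    ]),
    "http://b.example/news": ("sports news", []),
}


def text_score(text):
    """Fraction of topic terms that occur in the text (toy relevance measure)."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def crawl(web, seeds, max_pages=10, tenure=100, host_penalty=0.1):
    """Best-first crawl with a tabu list of visited URLs and a per-host visit count."""
    tabu = deque(maxlen=tenure)      # visited URLs; the oldest entries expire
    host_visits = Counter()          # host memory: pages fetched per host so far
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    visited_order = []
    while frontier and len(visited_order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in tabu or url not in web:
            continue                 # skip tabu (already visited) or unknown URLs
        tabu.append(url)
        host_visits[urlparse(url).netloc] += 1
        visited_order.append(url)
        page_text, links = web[url]
        parent_score = text_score(page_text)
        for link, anchor in links:
            if link in tabu:
                continue
            host = urlparse(link).netloc
            # Assumed priority: anchor text and parent page relevance,
            # minus a penalty for hosts that have already been visited often.
            priority = (0.5 * text_score(anchor) + 0.5 * parent_score
                        - host_penalty * host_visits[host])
            heapq.heappush(frontier, (-priority, link))
    return visited_order


if __name__ == "__main__":
    print(crawl(SAMPLE_WEB, ["http://a.example/"]))
```

With a finite `tenure`, old entries eventually leave the tabu list and may be revisited, which is the usual behavior of a tabu list. A large tenure makes it act as a plain visited set.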

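A new focused crawler using an improved tabu search algorithm incorporating ontology and host information

Jingfa LIU1, Zhen WANG1,2, Guo ZHONG1, Zhihe YANG1
1School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
2China Unicom Central South Research Institute, Changsha 410000, China

Abstract: To address the incomplete topic description and repeated crawling of visited links found in traditional focused crawling methods, this paper proposes a new focused crawling method based on an improved tabu search algorithm that incorporates ontology and host information (FCITS_OH). The method builds a domain ontology using formal concept analysis (FCA) to describe topics at the semantic and knowledge levels. To avoid re-crawling visited links and to widen the search range, an improved tabu search (ITS) algorithm and a host information memory strategy are proposed. In addition, to improve the evaluation of topic relevance for unvisited links, a comprehensive priority evaluation method based on Web text and link structure is proposed. Experimental results on the tourism and rainstorm disaster topics show that the proposed method outperforms other focused crawling strategies from the literature on different performance metrics.

Key words: Focused crawler; Tabu search algorithm; Ontology; Host information; Priority evaluation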
