CLC number: TP39
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2022-08-29
https://orcid.org/0000-0002-0407-1522
https://orcid.org/0000-0001-7836-0522
Jingfa LIU, Fan LI, Ruoyao DING, Ziang LIU. Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge[J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(8): 1189-1204.
@article{Liu2022FCOSA,
title="Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge",
author="Jingfa LIU and Fan LI and Ruoyao DING and Ziang LIU",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="23",
number="8",
pages="1189--1204",
year="2022",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2100360"
}
%0 Journal Article
%T Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge
%A Jingfa LIU
%A Fan LI
%A Ruoyao DING
%A Ziang LIU
%J Frontiers of Information Technology & Electronic Engineering
%V 23
%N 8
%P 1189-1204
%@ 2095-9184
%D 2022
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2100360
TY  - JOUR
T1  - Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge
A1  - Jingfa LIU
A1  - Fan LI
A1  - Ruoyao DING
A1  - Ziang LIU
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 23
IS  - 8
SP  - 1189
EP  - 1204
SN  - 2095-9184
Y1  - 2022
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2100360
ER  - 
Abstract: Focused crawling is currently a crucial method for obtaining effective domain knowledge from massive heterogeneous networks. Most current focused crawling technologies have difficulty obtaining high-quality crawling results. The main difficulties are establishing topic benchmark models, assessing the topic relevance of hyperlinks, and designing crawling strategies. In this paper, we use a domain ontology to build a topic benchmark model for a specific topic, and propose a novel multiple-filtering strategy based on local ontology and global ontology (MFSLG). A comprehensive priority evaluation method (CPEM) based on web text and link structure is introduced to improve the precision of the topic-relevance computation for unvisited hyperlinks, and a simulated annealing (SA) method is used to keep the focused crawler from falling into local optima of the search. By incorporating SA into a focused crawler with MFSLG and CPEM for the first time, we propose two novel focused crawler strategies based on ontology and SA (FCOSA), namely FCOSA with only global ontology (FCOSA_G) and FCOSA with both local ontology and global ontology (FCOSA_LG), to obtain topic-relevant webpages about rainstorm disasters from the network. Experimental results show that the proposed crawlers outperform the other focused crawling strategies across the performance metrics evaluated.
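The abstract does not spell out how SA keeps the crawler out of local optima, so the following is only a minimal sketch of the generic Metropolis acceptance rule applied to link selection, not the paper's FCOSA algorithm. The function names, the `frontier` dictionary of URL priorities, and the temperature values are all assumptions made for illustration; in the paper the priorities would come from CPEM, which is not reproduced here.

```python
import math
import random


def accept_link(best_priority, candidate_priority, temperature, rng=random):
    """Generic Metropolis rule (assumed, not taken from the paper).

    A candidate at least as good as the current best is always accepted.
    A worse candidate is accepted with probability exp(-delta / T), so a
    high temperature allows exploration and a low one becomes greedy.
    """
    if candidate_priority >= best_priority:
        return True
    if temperature <= 0:
        return False
    delta = best_priority - candidate_priority
    return rng.random() < math.exp(-delta / temperature)


def select_next_link(frontier, temperature, rng=random):
    """Choose the next URL from a hypothetical frontier {url: priority}.

    A random candidate is compared with the highest-priority URL; the
    candidate is crawled if the Metropolis rule accepts it, otherwise
    the crawler falls back to the best-priority URL.
    """
    best_url = max(frontier, key=frontier.get)
    candidate_url = rng.choice(list(frontier))
    if accept_link(frontier[best_url], frontier[candidate_url], temperature, rng):
        return candidate_url
    return best_url


if __name__ == "__main__":
    # Hypothetical priorities standing in for topic-relevance scores.
    frontier = {"url_a": 0.9, "url_b": 0.2, "url_c": 0.1}
    temperature = 1.0
    for _ in range(3):
        print(select_next_link(frontier, temperature))
        temperature *= 0.5  # assumed geometric cooling schedule
```

At temperature zero the selection reduces to a best-first crawl, which is the behavior SA is meant to relax early in the search.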