|
Journal of Zhejiang University SCIENCE A
ISSN 1673-565X(Print), 1862-1775(Online), Monthly
2009 Vol.10 No.8 P.1114-1124
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
Abstract: Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we propose a new frontier prioritizing algorithm, namely the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier. We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared harvest rate and target recall of the four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC) and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms on the average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
Key words: Focused crawlers, Topical crawlers, PageRank, Classifiers, On-line topical importance estimation (OTIE) algorithm
References:
Open peer comments: Debate/Discuss/Question/Opinion
<1>
DOI:
10.1631/jzus.A0820481
CLC number:
TP391.3
Download Full Text:
Downloaded:
4186
Clicked:
5872
Cited:
4
On-line Access:
2024-08-27
Received:
2023-10-17
Revision Accepted:
2024-05-08
Crosschecked:
2009-06-10