CLC number: TP391.3
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2009-06-10
Can WANG, Zi-yu GUAN, Chun CHEN, Jia-jun BU, Jun-feng WANG, Huai-zhong LIN. On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis[J]. Journal of Zhejiang University Science A, 2009, 10(8): 1114-1124.
@article{wang2009otie,
title="On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis",
author="Can WANG and Zi-yu GUAN and Chun CHEN and Jia-jun BU and Jun-feng WANG and Huai-zhong LIN",
journal="Journal of Zhejiang University Science A",
volume="10",
number="8",
pages="1114--1124",
year="2009",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/jzus.A0820481"
}
%0 Journal Article
%T On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
%A Can WANG
%A Zi-yu GUAN
%A Chun CHEN
%A Jia-jun BU
%A Jun-feng WANG
%A Huai-zhong LIN
%J Journal of Zhejiang University SCIENCE A
%V 10
%N 8
%P 1114-1124
%@ 1673-565X
%D 2009
%I Zhejiang University Press & Springer
%R 10.1631/jzus.A0820481
TY - JOUR
T1 - On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
A1 - Can WANG
A1 - Zi-yu GUAN
A1 - Chun CHEN
A1 - Jia-jun BU
A1 - Jun-feng WANG
A1 - Huai-zhong LIN
JO - Journal of Zhejiang University Science A
VL - 10
IS - 8
SP - 1114
EP - 1124
SN - 1673-565X
Y1 - 2009
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.A0820481
ER -
Abstract: Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we propose a new frontier prioritizing algorithm, namely the on-line topical importance estimation (OTIE) algorithm. OTIE combines link- and content-based analysis to evaluate the priority of an uncrawled URL in the frontier. We performed real crawling experiments over 30 topics selected from the Open Directory Project (ODP) and compared harvest rate and target recall of the four crawling algorithms: breadth-first, link-context-prediction, on-line page importance computation (OPIC) and our OTIE. Experimental results showed that OTIE significantly outperforms the other three algorithms on the average target recall while maintaining an acceptable harvest rate. Moreover, OTIE is much faster than the traditional focused crawling algorithm.
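The abstract does not give the OTIE scoring formula, so the following Python sketch is only an illustration of the general idea it describes: ranking uncrawled URLs in the frontier by a score that mixes a link-based signal with a content-based signal, then evaluating the crawl by harvest rate and target recall. The `Frontier` class, the `alpha` weight, and the linear combination are assumptions made for this example and are not taken from the paper. The two metric functions use the common definitions (share of crawled pages that are relevant, and share of known target pages that were reached), which may differ in detail from the paper's.

```python
import heapq


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant (common definition)."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if url in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of known target pages that the crawl reached."""
    if not targets:
        return 0.0
    return len(set(crawled) & set(targets)) / len(targets)


class Frontier:
    """Hypothetical max-priority queue of uncrawled URLs.

    The priority is a linear mix of a link score and a content score.
    This mix is an assumption for illustration, not the OTIE formula.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the link-based score
        self._heap = []         # min-heap of (-priority, url)
        self._seen = set()      # URLs already queued

    def push(self, url, link_score, content_score):
        if url in self._seen:
            return
        self._seen.add(url)
        priority = self.alpha * link_score + (1 - self.alpha) * content_score
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        """Return the (url, priority) pair with the highest priority."""
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


frontier = Frontier(alpha=0.5)
frontier.push("a", link_score=0.2, content_score=0.2)
frontier.push("b", link_score=0.9, content_score=0.5)
frontier.push("c", link_score=0.1, content_score=0.8)
crawl_order = [frontier.pop()[0] for _ in range(3)]
```

With equal weights, the URL with the strongest combined evidence is crawled first. Setting `alpha` to 0 reduces the sketch to a purely content-driven crawler, and setting it to 1 to a purely link-driven one.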