Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2020 Vol.21 No.7 P.995-1004
Web page classification based on heterogeneous features and a combination of multiple classifiers
Abstract: Precise web page classification depends on evaluating the features of web pages, and the structural features of a page effectively complement its textual features. Classifiers differ in their characteristics, so combining several of them lets each compensate for the others' weaknesses. In this study, we propose a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than counting the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. The heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. We propose confidence, obtained by calculating the classification accuracy on a set of samples, as a criterion for comparing the results of different classifiers. Multiple classifiers are then combined based on confidence using different decision strategies (voting, confidence comparison, and direct output) to give the final classification result. Experimental results show that accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon, 7-web-genres, and DMOZ datasets, respectively. Fusing the textual features with the proposed structural features gives a more comprehensive representation and higher accuracy than using textual features alone. Combining multiple classifiers further improves classification accuracy, which exceeds that of related web page classification algorithms.
Key words: Web page classification, Web page features, Combined classifiers
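The confidence-based combination described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything concrete here is an assumption: confidence is taken to be the per-class accuracy of each classifier on a held-out sample set, and the three strategies are applied in the order direct output (all classifiers agree), voting (a strict majority agrees), and confidence comparison (otherwise). The function names `class_confidence` and `combine` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def class_confidence(predictions, labels):
    """Assumed confidence measure: for each predicted class, the fraction of
    held-out samples given that label that truly belong to it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, true in zip(predictions, labels):
        total[pred] += 1
        if pred == true:
            correct[pred] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def combine(predictions, confidences):
    """Combine one prediction per classifier into a final label.

    predictions: list of labels, one per classifier.
    confidences: list of dicts (one per classifier) mapping label -> confidence.
    Returns (label, name of the strategy that decided it).
    The ordering of the strategies is an assumption, not the paper's rule.
    """
    votes = Counter(predictions)
    top_label, top_count = votes.most_common(1)[0]
    if top_count == len(predictions):
        return top_label, "direct output"
    if top_count > len(predictions) / 2:
        return top_label, "voting"
    # No majority: trust the classifier that is most confident in its own label.
    best = max(
        range(len(predictions)),
        key=lambda i: confidences[i].get(predictions[i], 0.0),
    )
    return predictions[best], "confidence comparison"
```

For example, three classifiers that predict three different labels fall through to confidence comparison, and the label from the classifier with the highest held-out accuracy for its predicted class is returned.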
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
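Abstract (translated from the Chinese): Web page features are the key to web page classification, and discriminative features allow web pages to be classified effectively. Structural features of web pages are an effective complement to textual features. Different classifiers have different characteristics, and combining multiple classifiers lets their performance complement one another. A web page classification algorithm based on heterogeneous features and combined classifiers is proposed. Rather than computing the frequency of HTML tags, this paper uses the tree-like distribution of HTML tags to represent the structural features of a web page, and fuses the heterogeneous textual and structural features in vector form. By computing the classification accuracy on a set of samples, the confidence of a classification result is proposed as the criterion for comparing the results of different classifiers. Based on confidence, the decision strategies of voting, magnitude comparison, and direct output are used to obtain the classification result of the combined classifier. Experimental results show that on the Amazon, 7-web-genres, and DMOZ datasets, accuracy increases to 94.2%, 95.4%, and 95.7%, respectively. Classification that fuses textual and structural features is more comprehensive and effective than classification using textual features alone. In addition, combining multiple classifiers improves web page classification accuracy, which is higher than that of comparable combined-classifier web page classification algorithms.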
DOI: 10.1631/FITEE.1900240
CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2020-06-06