Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2015 Vol.16 No.6 P.449-456
A sampling method based on URL clustering for fast web accessibility evaluation
Abstract: When evaluating the accessibility of a large website, we rely on sampling methods to reduce the cost of evaluation. This may lead to a biased evaluation when the distribution of checkpoint violations in a website is skewed and the selected samples do not represent the entire website well. To improve sampling quality, stratified sampling methods first cluster the web pages in a site and then draw samples from each cluster. In existing stratified sampling methods, however, all the pages in a website need to be analyzed for clustering, causing huge I/O and computation costs. To address this issue, we propose a novel page sampling method based on URL clustering for web accessibility evaluation, namely URLSamp. Because it uses only URL information for stratified page sampling, URLSamp can efficiently scale to large websites. Meanwhile, by exploiting similarities in URL patterns, URLSamp clusters pages by their generating scripts and can thus effectively detect accessibility problems that originate in web page templates. We use a data set of 45 websites to validate our method. Experimental results show that URLSamp is both effective and efficient for web accessibility evaluation.
Key words: Page sampling, URL clustering, Web accessibility evaluation
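Innovation: In most websites, page content and URLs are generated from a limited number of templates, so the accessibility problems of these websites can be traced back to the templates. Because pages generated from the same template share a similar structure and URL pattern, pages can be clustered by URL similarity so that URLs from the same template fall into one cluster. The proposed sampling algorithm uses only URL pattern information and does not need to store the full content of every page, which reduces I/O overhead and computation cost and enables fast accessibility detection and evaluation.
Methods: Because template-generated pages have similar URL patterns, URLs are clustered so that pages generated from the same template fall into one cluster. Specifically, the crawled URLs are first parsed to obtain candidate URL tokens and template URL tokens; the URLs are then clustered using the minimum description length principle (Algorithm 1); finally, pages are sampled from each cluster according to the sampling ratio.
Conclusions: Unlike existing stratified sampling algorithms, the proposed sampling algorithm clusters pages using only URL pattern information, which substantially reduces I/O overhead and computation cost.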
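The three steps summarized above (tokenize URLs, cluster them by pattern, sample from each cluster by ratio) can be illustrated with a minimal Python sketch. It is not the minimum description length procedure of Algorithm 1: as a stand-in, it assumes a much simpler heuristic in which any path segment containing a digit and every query value is treated as a template variable. The names `url_pattern`, `cluster_urls`, and `stratified_sample`, as well as the example URLs, are hypothetical and chosen only for illustration.

```python
import math
import random
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern.

    Assumption: this digit-based heuristic replaces the paper's
    MDL-based clustering, which is not reproduced here.
    """
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split("/"):
        if not seg:
            continue
        # A segment containing digits is assumed to be a variable value.
        segments.append("*" if re.search(r"\d", seg) else seg)
    # Keep query parameter names, replace their values with a wildcard.
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    pattern = parts.netloc + "/" + "/".join(segments)
    if keys:
        pattern += "?" + "&".join(f"{k}=*" for k in keys)
    return pattern


def cluster_urls(urls):
    """Group URLs that share the same pattern into one cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


def stratified_sample(clusters, ratio, seed=0):
    """Draw ceil(ratio * size) URLs, and at least one, from every cluster."""
    rng = random.Random(seed)
    sample = []
    for pattern in sorted(clusters):
        members = clusters[pattern]
        k = min(len(members), max(1, math.ceil(ratio * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


if __name__ == "__main__":
    crawled = [
        "http://example.com/news/1.html",
        "http://example.com/news/2.html",
        "http://example.com/news/3.html",
        "http://example.com/news/4.html",
        "http://example.com/item?id=7",
        "http://example.com/item?id=8",
        "http://example.com/about",
    ]
    groups = cluster_urls(crawled)
    for page in stratified_sample(groups, ratio=0.5):
        print(page)
```

Only URL strings are read in this sketch, so no page content has to be fetched or stored before sampling; the sampled pages would then be passed to whatever accessibility checker is in use.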
DOI: 10.1631/FITEE.1400377
CLC number: TP391.3
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2015-05-18