Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

A sampling method based on URL clustering for fast web accessibility evaluation

Abstract: When evaluating the accessibility of a large website, we rely on sampling methods to reduce the cost of evaluation. This may lead to a biased evaluation when the distribution of checkpoint violations in a website is skewed and the selected samples do not provide a good representation of the entire website. To improve sampling quality, stratified sampling methods first cluster web pages in a site and then draw samples from each cluster. In existing stratified sampling methods, however, all the pages in a website need to be analyzed for clustering, causing huge I/O and computation costs. To address this issue, we propose a novel page sampling method based on URL clustering for web accessibility evaluation, namely URLSamp. Using only the URL information for stratified page sampling, URLSamp can efficiently scale to large websites. Meanwhile, by exploiting similarities in URL patterns, URLSamp clusters pages by their generating scripts and can thus effectively detect accessibility problems from web page templates. We use a data set of 45 websites to validate our method. Experimental results show that our URLSamp method is both effective and efficient for web accessibility evaluation.

Key words: Page sampling, URL clustering, Web accessibility evaluation

Chinese Summary: A sampling method based on URL clustering for fast accessibility evaluation

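Purpose: Most people with disabilities encounter various barriers when using the web. To reduce these barriers, websites need to be evaluated for accessibility. Because most websites contain a very large number of pages and some pages require manual checking, sampling algorithms are usually used for website accessibility evaluation. Existing stratified sampling algorithms incur high I/O overhead and computation cost. To address this problem, this paper proposes a sampling algorithm based on URL clustering, which clusters pages using only URL information and then draws samples, enabling fast accessibility checking and evaluation.
Innovation: In most websites, page content and URL information are generated from a limited number of templates, so the accessibility problems of these websites can be traced back to the templates. Because pages generated from the same template have similar structures and URL patterns, pages can be clustered by URL similarity so that URLs from the same template fall into one cluster. The proposed sampling algorithm uses only the URL pattern information of pages and does not need to store the full page content, which reduces I/O overhead and computation cost and enables fast accessibility checking and evaluation.
Methods: Pages generated from a template share similar URL patterns, so URLs are clustered such that pages generated from the same template fall into the same cluster. The procedure is as follows: first, the crawled URLs are parsed to obtain candidate URL tokens and template URL tokens; then, the URLs are clustered using the minimum description length principle (Algorithm 1); finally, pages are sampled from each cluster according to the sampling ratio.
Conclusions: Unlike existing stratified sampling algorithms, the proposed sampling algorithm clusters web pages using only URL pattern information, which substantially reduces I/O overhead and computation cost.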
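
The three-step procedure in the summary above (tokenize URLs, cluster them, sample within each cluster) can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the minimum description length clustering is not given on this page, so the sketch substitutes a much simpler grouping by a coarse URL pattern in which digit runs are treated as wildcards. The function names `url_signature`, `cluster_urls`, and `stratified_sample`, the example URLs, and the 0.5 sampling ratio are all hypothetical and chosen only for illustration.

```python
import random
import re
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_signature(url):
    """Reduce a URL to a coarse pattern (an assumed stand-in for MDL clustering)."""
    parts = urlsplit(url)
    path_tokens = [t for t in parts.path.split("/") if t]
    # Treat digit runs as wildcards so /news/101.html and /news/102.html match.
    generalized = tuple(re.sub(r"\d+", "*", t) for t in path_tokens)
    # Keep only the query parameter names, not their values.
    query_keys = tuple(sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.netloc, generalized, query_keys)


def cluster_urls(urls):
    """Group URLs that share the same signature; no page content is fetched."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


def stratified_sample(clusters, ratio, seed=0):
    """Draw about `ratio` of each cluster, and at least one page per cluster."""
    rng = random.Random(seed)
    sample = []
    for signature in sorted(clusters):
        members = clusters[signature]
        k = min(len(members), max(1, round(len(members) * ratio)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical crawl result: three templates of different sizes.
urls = [
    "http://example.com/news/101.html",
    "http://example.com/news/102.html",
    "http://example.com/news/103.html",
    "http://example.com/news/104.html",
    "http://example.com/item?id=1",
    "http://example.com/item?id=2",
    "http://example.com/about",
]
clusters = cluster_urls(urls)
sample = stratified_sample(clusters, ratio=0.5)
```

Because every cluster contributes at least one page, a rarely used template such as the single `/about` page is still evaluated, which is the property that plain random sampling may lose when violations are unevenly distributed across templates.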

Key words: Page sampling, URL clustering, Accessibility evaluation


DOI: 10.1631/FITEE.1400377
CLC number: TP391.3

On-line Access: 2015-06-04
Received: 2014-11-02
Revision Accepted: 2015-04-21
Crosschecked: 2015-05-18

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE