CLC number: TP391
On-line Access: 2011-08-03
Received: 2010-09-25
Revision Accepted: 2011-03-09
Crosschecked: 2011-07-04
Cited: 4
Clicked: 8268
Wen-hua Xu, Zheng Qin, Yang Chang. Clustering feature decision trees for semi-supervised classification from high-speed data streams[J]. Journal of Zhejiang University Science C, 2011, 12(8): 615-628.
@article{title="Clustering feature decision trees for semi-supervised classification from high-speed data streams",
author="Wen-hua Xu, Zheng Qin, Yang Chang",
journal="Journal of Zhejiang University Science C",
volume="12",
number="8",
pages="615-628",
year="2011",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1000330"
}
%0 Journal Article
%T Clustering feature decision trees for semi-supervised classification from high-speed data streams
%A Wen-hua Xu
%A Zheng Qin
%A Yang Chang
%J Journal of Zhejiang University SCIENCE C
%V 12
%N 8
%P 615-628
%@ 1869-1951
%D 2011
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1000330
TY - JOUR
T1 - Clustering feature decision trees for semi-supervised classification from high-speed data streams
A1 - Wen-hua Xu
A1 - Zheng Qin
A1 - Yang Chang
J0 - Journal of Zhejiang University Science C
VL - 12
IS - 8
SP - 615
EP - 628
%@ 1869-1951
Y1 - 2011
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1000330
Abstract: Most stream data classification algorithms apply the supervised learning strategy which requires massive labeled data. Such approaches are impractical since labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams having both unlabeled and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide the statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while generating high classification accuracy with high speed.
[1]Bifet, A., Kirkby, R., Holmes, G., Pfahringer, B., 2007. MOA: Massive Online Analysis. Available from http://moa.cs.waikato.ac.nz/ [Accessed on Jan. 31, 2010].
[2]Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavalda, R., 2009. New Ensemble Methods for Evolving Data Streams. Proc. 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.139-148.
[3]Bifet, A., Holmes, G., Pfahringer, B., Frank, E., 2010. Fast perceptron decision tree learning from evolving data streams. LNCS, 6119:299-310.
[4]Chapelle, O., Scholkopf, B., Zien, A., 2006. Semi-supervised Learning. MIT Press, Cambridge, USA, p.5.
[5]Domingos, P., Hulten, G., 2000. Mining High-Speed Data Streams. Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.71-80.
[6]Gama, J., Rocha, R., Medas, P., 2003. Accurate Decision Trees for Mining High-Speed Data Streams. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.523-528.
[7]Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W., 1999. BOAT—Optimistic Decision Tree Construction. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.169-180.
[8]Gehrke, J., Ramakrishnan, R., Ganti, V., 2000. RainForest—a framework for fast decision tree construction of large datasets. Data Min. Knowl. Disc., 4(2/3):127-162.
[9]Greenwald, M., Khanna, S., 2001. Space-Efficient Online Computation of Quantile Summaries. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.58-66.
[10]Hulten, G., Domingos, P., 2003. VFML—a Toolkit for Mining High-Speed Time-Changing Data Streams. Available from http://www.cs.washington.edu/dm/vfml [Accessed on Apr. 25, 2010].
[11]Hulten, G., Spencer, L., Domingos, P., 2001. Mining Time-Changing Data Streams. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.97-106.
[12]Jin, W., Tung, A.K.H., Han, J., 2001. Mining Top-n Local Outliers in Large Databases. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.293-298.
[13]Li, P., Wu, X., Hu, X., 2010. Learning from Concept Drifting Data Streams with Unlabeled Data. Proc. 24th AAAI Conf. on Artificial Intelligence, p.1495-1496.
[14]Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B., 2008. A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data. Proc. 8th IEEE Int. Conf. on Data Mining, p.929-934.
[15]Mehta, M., Agrawal, R., Rissanen, J., 1996. SLIQ: a fast scalable classifier for data mining. LNCS, 1057:18-32.
[16]Pfahringer, B., Holmes, G., Kirkby, R., 2007. New options for Hoeffding trees. LNCS, 4830:90-99.
[17]Pfahringer, B., Holmes, G., Krikby, R., 2008. Handling Numeric Attributes in Hoeffding Trees. Proc. 12th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, p.296-307.
[18]Shafer, J.C., Agrawal, R., Mehta, M., 1996. SPRINT: a Scalable Parallel Classifier for Data Mining. Proc. 22nd Int. Conf. on Very Large Data Bases, p.544-555.
[19]Street, W.N., Kim, Y., 2001. A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.377-382.
[20]Wang, H., Fan, W., Yu, P.S., Han, J., 2003. Mining Concept-Drifting Data Streams Using Ensemble Classifiers. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.226-235.
[21]Wu, S., Yang, C., Zhou, J., 2006. Clustering-Training for Data Stream Mining. Proc. 6th IEEE Int. Conf. on Data Mining, p.653-656.
[22]Yu, H., Yang, J., Han, J., 2003. Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.306-315.
[23]Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an Efficient Data Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.103-114.
Open peer comments: Debate/Discuss/Question/Opinion
<1>