Journal of Zhejiang University

Frontiers of Information Technology & Electronic Engineering 2016 Vol.17 No.11 P.1186-1198

Performance analysis of new word weighting procedures for opinion mining

Author(s): G. R. Brindha, P. Swaminathan, B. Santhi
Affiliation(s): 1. School of Computing, SASTRA University, Thanjavur 613401, India
Corresponding email(s): brindha.gr@ict.sastra.edu
Key Words: Inferred word weight, Opinion mining, Supervised classification, Support vector machine (SVM), Machine learning

G. R. Brindha, P. Swaminathan, B. Santhi. Performance analysis of new word weighting procedures for opinion mining[J]. Frontiers of Information Technology & Electronic Engineering, 2016, 17(11): 1186-1198.

@article{title="Performance analysis of new word weighting procedures for opinion mining",
author="G. R. Brindha, P. Swaminathan, B. Santhi",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="17",
number="11",
pages="1186-1198",
year="2016",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1500283"
}

%0 Journal Article
%T Performance analysis of new word weighting procedures for opinion mining
%A G. R. Brindha
%A P. Swaminathan
%A B. Santhi
%J Frontiers of Information Technology & Electronic Engineering
%V 17
%N 11
%P 1186-1198
%@ 2095-9184
%D 2016
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1500283

Abstract: The proliferation of forums and blogs creates both challenges and opportunities for processing large amounts of information. Information shared on various topics often contains opinionated words, which are qualitative in nature. Statistical computation is needed to convert these qualitative words into useful quantitative data, and because the data express opinions, they must be processed with care. Opinion-bearing words differ in the significance of the meaning they convey. To turn the linguistic meaning of words into data and to enhance opinion mining analysis, we propose a novel weighting scheme referred to as inferred word weighting (IWW). IWW is computed from the significance of the word in the document (SWD) and the significance of the word in the expression (SWE), with the aim of enhancing performance. Compared with existing methods, the proposed weighting methods offer an analytic view and assign more appropriate weights to words. In addition to the new weighting methods, we examine how including stop-words affects text classification performance; stop-words are generally removed in text processing. When stop-words are included under both the proposed and the existing weighting methods, two facts are observed: (1) classification performance is enhanced; (2) the difference in outcome between including and excluding stop-words is smaller for the proposed methods and larger for existing methods. The inferences drawn from these observations are discussed. Experimental results on benchmark data sets show the potential enhancement in classification accuracy.
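The abstract does not give the IWW, SWD, or SWE formulas, so the following is only a minimal sketch of the stop-word comparison using a conventional TF-IDF weighting as a stand-in for an "existing" method. The stop-word list, the toy reviews, and the function names are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "a", "this", "not", "was", "it"}


def tokenize(text, keep_stop_words):
    """Lowercase, split on whitespace, and optionally drop stop-words."""
    tokens = text.lower().split()
    if keep_stop_words:
        return tokens
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs, keep_stop_words=True):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d, keep_stop_words) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy reviews (assumed data): the first two differ only by the stop-word "not".
reviews = [
    "this movie is good",
    "this movie is not good",
    "the plot was dull",
]

with_stop = tfidf(reviews, keep_stop_words=True)
without_stop = tfidf(reviews, keep_stop_words=False)

# With stop-words removed, the positive and negative reviews get identical weights.
print(without_stop[0] == without_stop[1])
# With stop-words kept, "not" carries a weight that separates the two reviews.
print(with_stop[1]["not"])
```

In this toy setting, dropping stop-words erases the only feature that distinguishes the two opposing reviews, which is one plausible reason why including stop-words could help opinion classification. It says nothing about how the proposed IWW weights behave.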

Reviewer comment: The paper is good work in which the authors propose a new method for weighting the relevance of terms in polarity classification systems.

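Performance analysis of new word weighting procedures for opinion mining

Summary: The spread of forums and blogs brings both challenges and opportunities for processing large amounts of information. Information on different topics usually contains subjective, qualitative words, which must be converted into usable quantitative data through statistical analysis. If these data are not processed properly, the correct expression of opinions is affected. Each opinion-related word differs in the main meaning it conveys. To convert the semantics of words into data and to strengthen opinion mining analysis, we propose a novel weighting scheme called inferred word weighting (IWW). IWW enhances the algorithm by computing the significance of a word in its context and in the expression. Compared with existing methods, the proposed weighting methods provide suitable weights for words from an analytic perspective. In addition, a study of text classification performance with stop-words included provides another kind of check that complements the proposed weighting methods; such stop-words are normally removed during text processing. Applying this new concept of including stop-words to the proposed and existing weighting methods reveals two phenomena: (1) text classification performance is enhanced; (2) the difference in text processing results caused by including or excluding stop-words is smaller for the proposed methods and larger for existing methods. Inferences are then drawn from these two phenomena. Experimental results on benchmark data sets show that the proposed methods have the potential to improve classification accuracy.

Key words: Inferred word weighting; Opinion mining; Supervised classification; Support vector machine; Machine learning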
