CLC number: TP391

Received: 2005-08-05

Revision Accepted: 2005-09-10

Journal of Zhejiang University SCIENCE A 2005 Vol.6 No.11 P.1297-1305


Optical Character Recognition for printed Tamil text using Unicode

Author(s):  SEETHALAKSHMI R., SREERANJANI T.R., BALACHANDAR T., Abnikant Singh, Markandey Singh, Ritwaj Ratan, Sarvesh Kumar

Affiliation(s):  Shanmugha Arts Science Technology and Research Academy, Thirumalaisamudram, Thanjavur, Tamil Nadu, India

Corresponding email(s):   rseetha123@cse.sastra.edu, trsree@yahoo.com

Key Words:  OCR, Unicode, Features, Support Vector Machine (SVM), Artificial Neural Networks

SEETHALAKSHMI R., SREERANJANI T.R., BALACHANDAR T., Abnikant Singh, Markandey Singh, Ritwaj Ratan, Sarvesh Kumar. Optical Character Recognition for printed Tamil text using Unicode[J]. Journal of Zhejiang University Science A, 2005, 6(11): 1297-1305. DOI: 10.1631/jzus.2005.A1297. ISSN 1673-565X. Publisher: Zhejiang University Press & Springer.

Optical Character Recognition (OCR) here refers to the process of converting printed Tamil text documents into software-translated Unicode Tamil text. Printed documents such as books, papers, and magazines are scanned with standard scanners, which produce an image of each document. In the preprocessing phase, the image is checked for skew; if it is skewed, it is corrected by a simple rotation in the appropriate direction. The image then passes through a noise elimination phase and is binarized. The preprocessed image is segmented by an algorithm that decomposes the scanned text into paragraphs using a special space detection technique, the paragraphs into lines using vertical histograms, the lines into words using horizontal histograms, and the words into character image glyphs using horizontal histograms. Each image glyph comprises 32×32 pixels. The segmentation phase thus yields a database of character image glyphs, all of which are then considered for recognition using Unicode mapping. Each glyph is passed through routines that extract its features. The features considered for classification are the character height, the character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), the horizontally oriented curves, the vertically oriented curves, the number of circles, the number of slope lines, the image centroid, and special dots. The extracted features are passed to a Support Vector Machine (SVM), where the characters are classified by a supervised learning algorithm. The resulting classes are mapped onto Unicode for recognition, and the text is then reconstructed using Unicode fonts.
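
The histogram-based segmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a "vertical histogram" means the ink count of each pixel row and a "horizontal histogram" means the ink count of each pixel column, and that the page is already binarized (1 = ink, 0 = background). The names `ink_runs`, `segment_lines`, `segment_columns` and the `min_gap` parameter are hypothetical and do not come from the paper.

```python
def ink_runs(profile, min_gap=1):
    """Return (start, end) spans of ink in a histogram, end exclusive.

    A span is closed only after at least `min_gap` consecutive empty bins,
    so a larger `min_gap` merges closely spaced blocks (assumed here to
    separate words, while min_gap=1 separates individual glyphs).
    """
    spans = []
    start = None
    gap = 0
    for i, count in enumerate(profile):
        if count:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def segment_lines(page):
    """Split a binary page (list of pixel rows) into text lines via row sums."""
    row_profile = [sum(row) for row in page]
    return [page[top:bottom] for top, bottom in ink_runs(row_profile)]


def segment_columns(line, min_gap=1):
    """Split a text line into blocks (words or glyphs) via column sums."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[left:right] for row in line]
            for left, right in ink_runs(col_profile, min_gap)]
```

Calling `segment_lines` on a page and then `segment_columns` on each line, first with a larger `min_gap` for words and then with `min_gap=1` for glyphs, mirrors the paragraph-to-line-to-word-to-glyph order given in the abstract. Scaling each glyph to 32×32 pixels and the paragraph-level space detection are left out of this sketch.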
