Medical Intelligence and Language Engineering Lab


Online Handwritten Tamil Word Database: 15,000 Handwritten, Isolated Tamil Words on Tablet PC


The online handwritten Tamil word database is an outcome of an ongoing funded project from the Technology Development for Indian Languages (TDIL) program of the Ministry of Information Technology, Government of India. The main aim of the project is to create a writer-independent, open-vocabulary recognition framework suitable for form-filling applications such as census data collection.

For this application, isolated Tamil words have been collected using a custom application running on a tablet PC. We have ensured that all the writers who participated in the data collection are native Tamil speakers who currently write in the language, at least occasionally. The data therefore contains a variety of popular writing styles for Tamil symbols.

The participants were given a graphical interface with large rectangular boxes and were prompted to write Tamil words with minimal constraints, one word in each box of the form. No restrictions were placed on the number of strokes, the shape of the symbols, or the direction of the constituent strokes.

High school and college students from many educational institutions in the Indian state of Tamil Nadu contributed to building the word database of 15,000 samples covering 2,000 distinct words. Trained Tamil natives carefully inspected the collected data and removed very badly written samples (cursive writing, or a single stroke containing multiple characters) and wrong samples (the written word not matching the given word, that is, the ground truth). These 15,000 words contain a total of 80,098 Tamil symbols.

Description of Database

The set of 2,000 distinct words has been divided into 8 sets, each comprising 250 words. Each set is in turn divided into a number of sub-folders. The number of sub-folders varies from 5 (for set03, set06 and set07) to 11 (for set01 and set05).

Because the data set has been designed for writer-independent recognition, the sub-folders within each set have been named purely for convenience, and the names hold no specific significance.

There are 250 word samples in each sub-folder. The samples in each sub-folder are labeled in the format #number t #number .txt. The number to the left of t is the word number in the ground truth (one of the 2,000), and the number to the right is the sample number of that word. Each set contains as many instances of a given word as it has sub-folders.
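As a rough illustration of the naming scheme described above, the Python sketch below splits a sample file name into its word number and sample number. The function name `parse_sample_name` is hypothetical, and the exact spacing inside real file names is an assumption: the pattern tolerates optional spaces around `t` and before `.txt`, since no literal file name is shown here.

```python
import re
from typing import Tuple

# Assumed layout: <word number> t <sample number> .txt, with optional spaces.
_SAMPLE_NAME = re.compile(r"^\s*(\d+)\s*t\s*(\d+)\s*\.txt$", re.IGNORECASE)


def parse_sample_name(name: str) -> Tuple[int, int]:
    """Return (word_number, sample_number) for one sample file name."""
    match = _SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected sample file name: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

For a made-up name such as "12t3.txt", the function returns (12, 3): word number 12 in the ground truth, sample number 3 of that word.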

In each .txt file, ink data is represented in UNIPEN v1.0 format. The channels reported for each ink point are X, Y and T. Files corresponding to some users have valid T (time) values for the first and last points of each stroke, with intermediate values set to 0. For other users, the time channel is set to 0 for all the points.

The online data in each file is divided into a set of strokes. The information between a .PEN_DOWN signal and the following .PEN_UP signal is called a stroke.

When using this database to design a handwriting recognition system, only the stroke information of the form (X, Y) is to be considered.
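The stroke structure described above can be sketched in Python as follows. This is a minimal reader written under stated assumptions, not the project's own tooling: the function name `read_strokes` is hypothetical, each ink point is assumed to sit on its own line with whitespace-separated values in the order X, Y, T, and every line starting with "." other than .PEN_DOWN and .PEN_UP is treated as a header and ignored.

```python
from typing import Iterable, List, Tuple

Stroke = List[Tuple[float, float]]


def read_strokes(lines: Iterable[str]) -> List[Stroke]:
    """Collect the (X, Y) points found between .PEN_DOWN and .PEN_UP."""
    strokes: List[Stroke] = []
    current: Stroke = []
    pen_down = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("."):
            keyword = line.split()[0].upper()
            if keyword == ".PEN_DOWN":
                pen_down, current = True, []
            elif keyword == ".PEN_UP":
                if pen_down and current:
                    strokes.append(current)
                pen_down, current = False, []
            continue  # any other keyword line is skipped as a header
        if pen_down:
            fields = line.split()
            if len(fields) >= 2:
                # Assumed column order X, Y, T; the T value is dropped.
                current.append((float(fields[0]), float(fields[1])))
    if pen_down and current:
        # Keep a final stroke even if the closing .PEN_UP is missing.
        strokes.append(current)
    return strokes
```

A file could be passed in as `read_strokes(open(path))`, where `path` points to one of the .txt samples. Dropping T here follows the recommendation to use only (X, Y), which also sidesteps the files in which the time channel is partly or entirely 0.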

The collected data has been annotated at the symbol level. Each word is represented as a sequence of symbol ids. The ground truth of each of the 2,000 words is given as a sequence of symbol ids in the file 'ground_truth.xls'.

The map of the Tamil symbols and their ids is available in 'SYMBOL_LEVEL_IDS.xls'.

This database is being made available only for research purposes. By downloading this database, you agree to acknowledge the source and also cite the following paper in any publication arising out of using the database:

Suresh Sundaram and A. G. Ramakrishnan, “Attention-feedback based robust segmentation of online handwritten isolated Tamil words,” ACM Transactions on Asian Language Information Processing (TALIP), Vol. 12 (1), March 2013, Article No. 4.

Please click here to download the data set.
© 2010 Medical Intelligence and Language Engineering Lab - IISc Campus, Bangalore.