Medical Intelligence and Language Engineering Lab


Word Image Data Set for 11 Indian Scripts

(for script recognition studies at the word level)

        This data set contains 220,000 word images of printed text from 11 Indian scripts (languages), with 20,000 images per script: Bangla, Devanagari, English, Gujarati, Kannada, Malayalam, Odiya, Punjabi, Tamil, Telugu and Urdu. The images were collected by Dr. Peeta Basa Pati when he was a doctoral student here. For each script, 100 printed pages from different books were scanned at 300 dpi; the words were then segmented, binarized and saved in uncompressed binary TIFF format. To our knowledge, the word images of each script contain most of the graphemes of that script.


        Sample images from the data set (images not reproduced here): Tamil, Telugu, Malayalam, English, Bengali, Devanagari, Urdu, Kannada, Odiya, Gujarati, Punjabi.

        This data has been used to carry out word-level script recognition. For further details on our multi-script recognition research, please refer to the paper cited below, published in Pattern Recognition Letters.

    Download Multi-script Data set

        By downloading and using the data set below (or part of it), you agree to acknowledge its source and cite the paper given below in all your publications using this data.

        Peeta Basa Pati and A. G. Ramakrishnan, “Word Level Multi-script Identification”, Pattern Recognition Letters, 2008, Vol. 29, pp. 1218-1229.


        Please email ppati@lycos.com or ramkiag@ee.iisc.ernet.in to let us know about the usage of the data set.

        We are sure researchers worldwide will find this data set useful. The zip file containing the multi-script word image data set can be downloaded here. This zip file contains a folder with 11 sub-folders, one for each of the 11 scripts (languages).
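
        After extracting the zip file, the folder layout described above (one sub-folder per script) could be indexed roughly as follows. This is only a minimal sketch, not part of the data set distribution: the function names are hypothetical, and it assumes that the word images carry a .tif or .tiff extension and that each sub-folder name identifies its script. The actual folder and file names in the archive may differ.

```python
from pathlib import Path

# Assumption: the TIFF word images use one of these file extensions.
TIFF_SUFFIXES = {".tif", ".tiff"}


def index_word_images(root):
    """Map each script sub-folder name to a sorted list of its TIFF file paths.

    `root` is the extracted top-level folder that holds the per-script
    sub-folders. Files lying directly in `root` are ignored.
    """
    index = {}
    for script_dir in sorted(Path(root).iterdir()):
        if not script_dir.is_dir():
            continue
        index[script_dir.name] = sorted(
            p for p in script_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
        )
    return index


def count_per_script(index):
    """Return the number of word images found for each script."""
    return {script: len(paths) for script, paths in index.items()}


# Example usage (the path below is a placeholder, not the real folder name):
# index = index_word_images("path/to/extracted/folder")
# for script, n in count_per_script(index).items():
#     print(script, n)
```

        If the archive matches the description on this page, the per-script counts should each come to 20,000; a different count would suggest an incomplete download or a different file extension than the one assumed here.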


© 2013 Medical Intelligence and Language Engineering Lab - IISc Campus, Bangalore.