Code and Data (from Pramod Sankar K).

IIIT-H Word Recognition Dataset (Telugu)

Description: The IIIT-H Word Recognition dataset is the most challenging word recognition dataset available, and the largest in any Indian language. Researchers are encouraged to use it to train and evaluate features and classifiers for the task of optical word recognition. The dataset contains word-images from printed document images, corresponding to 1000 distinct Telugu words. The word-images are sampled from 33 Telugu books obtained from the Digital Library of India. The dataset contains a large variety of fonts, font styles, print styles, sizes and degradations (both cuts and merges). The labeling accuracy is high (> 99%); the labels were hand-corrected by Pramod Sankar K.

Dataset Size: 32,773 word-images ranging from 5 to 530 images per word (52MB in TIF format).
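
Because the number of images per word varies so widely (5 to 530), a plain random split can leave rare words out of the test set entirely. The sketch below shows one possible per-word split that keeps every word in both sets. It is only an illustration: the function name split_per_word, the 20% test fraction and the (path, label) pair format are assumptions, and the file names in the usage lines are hypothetical, since the layout of the tarball is not described here.

```python
import random
from collections import defaultdict


def split_per_word(samples, test_fraction=0.2, seed=0):
    """Split (image_path, word_label) pairs so each word lands in both sets.

    `samples` is an assumed input format, not the dataset's own index file.
    """
    by_word = defaultdict(list)
    for path, label in samples:
        by_word[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_word):
        paths = sorted(by_word[label])
        rng.shuffle(paths)
        # Hold out at least one image, but always keep at least one for training.
        n_test = max(1, round(len(paths) * test_fraction))
        n_test = min(n_test, len(paths) - 1)
        test += [(p, label) for p in paths[:n_test]]
        train += [(p, label) for p in paths[n_test:]]
    return train, test


# Hypothetical file names covering the two extremes quoted above.
samples = [(f"word_a/{i}.tif", "word_a") for i in range(5)]
samples += [(f"word_b/{i}.tif", "word_b") for i in range(530)]
train, test = split_per_word(samples)
print(len(train), len(test))
```

With these assumptions, a word with only 5 images still contributes one test image, while a word with 530 images contributes about a fifth of them.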

Examples: See Samples

Download: TARBall

Code: TIFF I/O (README)

Citing: If you use this dataset in your work, please cite our DAS 2010 paper:

Pramod Sankar Kompalli, C.V.Jawahar and R. Manmatha,
Nearest Neighbor based Collection OCR
Proceedings of Ninth IAPR International Workshop on Document Analysis Systems (DAS'10), pp. 207-214, 9-11 June, 2010, Boston, MA, USA.
Queries:


IIIT-H Word Recognition Synthetic Dataset

Description: The IIIT-H Word Recognition Synthetic dataset consists of word-images created synthetically by font-rendering with 28 custom fonts. The Synthetic dataset varies in font type and font style, but is free from size variations and degradations. This provides a more controlled setting in which to test features and classifiers for the task of optical word recognition.

Dataset Size: 28,000 word-images for 1000 unique Telugu words, 28 images per word (268MB in TIF format).
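
As a quick sanity check on the figures quoted on this page, the short sketch below recomputes the images-per-word and average file size for both datasets. The dictionary keys and the summarize helper are illustrative names, and "MB" is assumed to mean 10**6 bytes.

```python
# Figures quoted on this page; "MB" is assumed to mean 10**6 bytes.
datasets = {
    "scanned": {"images": 32_773, "words": 1000, "size_mb": 52},
    "synthetic": {"images": 28_000, "words": 1000, "size_mb": 268},
}


def summarize(d):
    """Return average images per word and average kilobytes per image."""
    return {
        "images_per_word": d["images"] / d["words"],
        "kb_per_image": d["size_mb"] * 1000 / d["images"],
    }


# 28 fonts x 1000 words should account for every synthetic image.
assert 28 * datasets["synthetic"]["words"] == datasets["synthetic"]["images"]

for name, d in datasets.items():
    s = summarize(d)
    print(f"{name}: {s['images_per_word']:.1f} images/word, "
          f"{s['kb_per_image']:.1f} KB/image")
```

Under that assumption, the synthetic word-images average roughly six times the file size of the scanned ones, which is worth keeping in mind when planning storage or loading both sets together.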

Examples: See Samples

Download: TARBall

Code: TIFF I/O (README)

Citing: If you use this dataset in your work, please cite our DAS 2010 paper:

Pramod Sankar Kompalli, C.V.Jawahar and R. Manmatha,
Nearest Neighbor based Collection OCR
Proceedings of Ninth IAPR International Workshop on Document Analysis Systems (DAS'10), pp. 207-214, 9-11 June, 2010, Boston, MA, USA.
Queries:

