IIIT-H Word Recognition Dataset (Telugu)
Description:
The IIIT-H Word Recognition dataset is the most challenging word recognition dataset available, and the largest in any Indian language.
Researchers are encouraged to use this dataset to train and evaluate features and classifiers for the task of optical word recognition.
The dataset contains word-images from printed document images, corresponding to 1000 distinct Telugu words.
The word-images are sampled from 33 Telugu books, obtained from the Digital Library of India.
The dataset contains a large variety of fonts, font styles, print styles, sizes and degradations (both cuts and merges).
The accuracy of the labeling is quite high (> 99%).
The labeling was hand-corrected by Pramod Sankar K.
The IIIT-H Word Recognition Synthetic dataset is a dataset of word-images created synthetically by font rendering with 28 custom fonts.
The Synthetic dataset varies in font type and font style, but is free from size variations and degradations.
This provides a more controlled setting in which to test features and classifiers for the task of optical word recognition.
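As a minimal sketch of how a classifier could be scored on either dataset, the Python snippet below computes overall and per-word recognition accuracy. It assumes each word-image carries an integer class label for one of the 1000 distinct words; that label encoding and the function names are illustrative assumptions, not part of the dataset distribution.

```python
from collections import defaultdict


def word_accuracy(true_labels, predicted_labels):
    """Fraction of word-images whose predicted word matches its label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("no samples to evaluate")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_word_accuracy(true_labels, predicted_labels):
    """Accuracy computed separately for each distinct word class."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        hits[t] += int(t == p)
    return {word: hits[word] / totals[word] for word in totals}


# Hypothetical labels: integer IDs standing in for distinct Telugu words.
truth = [0, 1, 2, 2]
predicted = [0, 1, 2, 3]
overall = word_accuracy(truth, predicted)
by_word = per_word_accuracy(truth, predicted)
```

Running the same scoring on both the real and the Synthetic word-images is one way to separate errors caused by font variation from those caused by size changes and degradations.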