Datasets for unstructured OCR tasks

shewlykhatun708 · Post by **shewlykhatun708** » Wed Dec 04, 2024 6:25 am

There are lots of datasets available in English but it's harder to find datasets for other languages. Different datasets present different tasks to be solved. Here are a few examples of datasets commonly used for machine learning OCR problems.

SVHN dataset
The Street View House Numbers dataset contains 73257 digits for training, 26032 digits for testing, and 531131 additional as extra training data. The dataset includes 10 labels which are the digits 0-9. The dataset differs from MNIST since SVHN has images of house numbers with the house numbers against varying backgrounds. The dataset has bounding boxes around each digit instead of having several images of digits like in MNIST.

Scene Text dataset
This dataset consists of 3000 images in different settings (indoor and outdoor) and lighting conditions (shadow, light, and night), with text in Korean and English. Some images also contain digits.

Devanagri Character dataset
This dataset provides us with 1800 samples usa b2b leads from 36 character classes obtained by 25 different native writers in the devanagri script.
And there are many others like this one for Chinese characters, this one for CAPTCHA, or this one for handwritten words.

Any Typical machine learning OCR pipeline follows the following steps:

Preprocessing
Remove the noise from the image

Remove the complex background from the image

Handle the different lighting conditions in the image

Text Detection

Text detection techniques are required to detect the text in the image and create and bounding box around the portion of the image having text.

Sliding window technique
The bounding box can be created around the text through the sliding window technique. However, this is a computationally expensive task. In this technique, a sliding window passes through the image to detect the text in that window, like a convolutional neural network. We try with different window sizes to not miss the text portion with different sizes. There is a convolutional implementation of the sliding window which can reduce the computational time.

Text Recognition
Once we have detected the bounding boxes having the text, the next step is to recognize the text. There are several techniques for recognizing the text.

CRNN
Convolutional Recurrent Neural Network (CRNN) is a combination of CNN, RNN, and CTC (Connectionist Temporal Classification) loss for image-based sequence recognition tasks, such as scene text recognition and OCR. The network architecture has been taken from this paper published in 2015.