
Handwritten Cyrillic OCR: Technical Challenges and Solutions

Commercial OCR engines handle printed Cyrillic text reasonably well. Handwritten Cyrillic is a different problem entirely. When we started building a handwritten Cyrillic OCR system for a government digitization project, off-the-shelf solutions topped out around 60-65% character accuracy on our dataset. We needed to get to 90%+ to make the system operationally useful. Here's how we got to 92%.

Why handwritten Cyrillic is hard

Cyrillic presents specific challenges that Latin-script OCR systems don't face:

Character ambiguity. Several Cyrillic characters are visually similar in handwriting: ш and щ differ by a single descender stroke that writers often omit, н and и are near-mirrors of each other, and handwritten т frequently looks like a Latin m. These ambiguities compound: a single word might contain three characters where the model is guessing, and getting any one of them wrong can make the whole word unintelligible.

Connected writing patterns. Cyrillic cursive has different connection rules than Latin. Characters like д, з, and у have descenders that interleave with the next character's ascender. Segmenting individual characters from connected writing is unreliable, which is why we moved to a sequence-to-sequence approach early on.

Mongolian Cyrillic additions. Mongolian Cyrillic includes two additional characters (ө and ү) that don't exist in Russian Cyrillic. Training data for these characters is extremely limited, and they're visually similar to о and у respectively. We needed to handle both Russian and Mongolian Cyrillic, which meant the model had to distinguish between very similar characters across two slightly different alphabets.
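To make the "two slightly different alphabets" point concrete, here is a minimal sketch of a shared output vocabulary covering both. The names (RUSSIAN, MONGOLIAN_EXTRA, VOCAB, CHAR_TO_ID) and the choice to reserve id 0 for a blank or padding symbol are illustrative assumptions, not details of the production system.

```python
import unicodedata

# Russian lowercase alphabet: U+0430..U+044F (а..я) plus ё (U+0451), 33 letters.
RUSSIAN = [chr(cp) for cp in range(0x0430, 0x0450)] + ["\u0451"]

# The two Mongolian additions, mapped to the Russian letters they resemble.
MONGOLIAN_EXTRA = {
    "\u04e9": "\u043e",  # ө resembles о
    "\u04af": "\u0443",  # ү resembles у
}

# One shared vocabulary; id 0 is kept free for a blank/padding symbol (assumption).
VOCAB = sorted(set(RUSSIAN) | set(MONGOLIAN_EXTRA))
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(VOCAB)}

for ch, lookalike in MONGOLIAN_EXTRA.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} resembles {lookalike}")
```

Keeping ө and ү as separate classes from о and у means the model is forced to learn the distinction rather than having it patched in afterwards.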

Data augmentation: the biggest lever

Our initial training dataset was around 15,000 handwritten samples — far too small for the variability we needed to handle. We used three augmentation strategies that each contributed measurably to accuracy:

Elastic deformation. Small random distortions that simulate natural handwriting variation. This was the single most effective augmentation, adding roughly 8 percentage points of accuracy. We applied it with carefully tuned parameters — too aggressive and the characters become unrecognizable; too subtle and it doesn't help.
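The idea can be sketched without any dependencies: draw a random displacement field, smooth it so neighbouring pixels move together, scale it, and resample the image. The function names, the box blur standing in for Gaussian smoothing, and the alpha and radius defaults are all assumptions for illustration; they are not the tuned parameters mentioned above, and a real pipeline would normally use an image library instead of nested lists.

```python
import random

def smooth(field, radius):
    """Box-blur a 2D field; a cheap stand-in for Gaussian smoothing."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            out[y][x] = sum(field[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
    return out

def bilinear(img, y, x):
    """Sample img at a fractional position, clamping to the image border."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def elastic_deform(img, alpha=2.0, radius=2, seed=None):
    """Warp img (rows of floats in 0..1) with a smoothed random displacement field.

    alpha scales the displacement in pixels; radius controls how smooth it is.
    """
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    dx = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    dy = smooth([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    return [[bilinear(img, y + alpha * dy[y][x], x + alpha * dx[y][x])
             for x in range(w)] for y in range(h)]

# A 9x9 image with a single vertical stroke, warped reproducibly.
stroke = [[1.0 if x == 4 else 0.0 for x in range(9)] for _ in range(9)]
warped = elastic_deform(stroke, alpha=2.0, radius=2, seed=0)
```

In this sketch, a larger alpha corresponds to the "too aggressive" end of the trade-off and a larger radius gives gentler, more global bending.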

Ink density variation. Real documents have inconsistent ink darkness — some writers press harder, pens run dry mid-word, photocopies introduce noise. We simulated this with variable erosion and dilation kernels plus Gaussian noise. This helped with the long tail of low-quality inputs.
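A rough, dependency-free sketch of this kind of augmentation follows. It treats pixel values as ink intensity (1.0 is full ink, 0.0 is paper), so a max filter thickens strokes and a min filter thins them. The function names, the 3x3 kernel choice, and the noise_sigma default are illustrative assumptions rather than the values used in the project.

```python
import random

def morph(img, radius, pick):
    """Square min/max filter: pick=max dilates ink, pick=min erodes it."""
    h, w = len(img), len(img[0])
    return [[pick(img[j][i]
                  for j in range(max(0, y - radius), min(h, y + radius + 1))
                  for i in range(max(0, x - radius), min(w, x + radius + 1)))
             for x in range(w)] for y in range(h)]

def vary_ink(img, seed=None, noise_sigma=0.05):
    """Randomly thin or thicken strokes, then add clamped Gaussian noise."""
    rng = random.Random(seed)
    radius = rng.choice([0, 1])    # 1x1 kernel (no-op) or 3x3 kernel
    pick = rng.choice([min, max])  # erode (dry pen) or dilate (heavy pen)
    out = morph(img, radius, pick)
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, noise_sigma))) for v in row]
            for row in out]

# A 7x7 image with one vertical stroke, perturbed reproducibly.
stroke = [[1.0 if x == 3 else 0.0 for x in range(7)] for _ in range(7)]
noisy = vary_ink(stroke, seed=1)
```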

Synthetic data generation. We generated synthetic handwritten text using a modified handwriting-synthesis approach with style transfer. The trick was making the synthetic data imperfect enough to be useful: perfectly regular synthetic handwriting actually hurt performance because it shifted the training distribution away from the messiness of real handwriting.

After augmentation, our effective training set was around 200,000 samples, and character accuracy rose from 72% to 84%.

Model architecture decisions

We evaluated three architectures:

CNN + CTC (Connectionist Temporal Classification). Our baseline. Fast inference but struggled with the character ambiguity problem — CTC tends to collapse similar characters when it's uncertain, producing systematic errors on ш/щ pairs.
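For readers unfamiliar with CTC, the standard best-path (greedy) decoding rule is short enough to show in full. This is the textbook rule, not necessarily the decoder we shipped, and the "-" blank symbol is an arbitrary choice for the example. Each frame's label is chosen on its own, with no word-level context, which is why a per-frame toss-up between ш and щ goes straight into the output.

```python
BLANK = "-"  # CTC blank symbol (arbitrary choice for this example)

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeats, then drop blanks (CTC best-path rule)."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_greedy_decode(list("мм-аашш--и")))  # маши
print(ctc_greedy_decode(list("длин-ный")))    # длинный (blank keeps the double н)
print(ctc_greedy_decode(list("длинный")))     # длиный  (repeat collapsed without a blank)
```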

CNN + Attention-based decoder. Better accuracy on ambiguous characters because the attention mechanism can use surrounding context. A character that's ambiguous in isolation often becomes clear in the context of a word. This got us to 88%.

Transformer-based encoder-decoder. Best accuracy (92% once combined with the language model described below) but significantly more compute at inference time. We went with this architecture and optimized inference using ONNX Runtime with quantization, getting per-page processing time down to under 2 seconds on CPU, which is acceptable for the batch processing workflow.

The key architectural insight was adding a language model component. Pure character recognition topped out around 88%. Adding a word-level language model that could score candidate decodings using Mongolian language statistics pushed us to 92%. When the character-level model is uncertain between н and и, the language model resolves it based on which resulting word actually exists in the language.
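A minimal sketch of this kind of rescoring, assuming the recognizer emits a few candidate words with log-probabilities and the language model is a plain word-frequency table. The rescore function, the lm_weight and floor parameters, and the frequency value are all hypothetical; the real system's scoring may differ.

```python
import math

def rescore(candidates, word_freq, lm_weight=0.5, floor=1e-6):
    """Return the candidate word with the best combined score.

    candidates: list of (word, recognizer_log_prob)
    word_freq:  dict of word -> relative frequency (hypothetical statistics)
    floor:      frequency assumed for words missing from the table
    """
    def score(item):
        word, rec_logp = item
        lm_logp = math.log(word_freq.get(word, floor))
        return rec_logp + lm_weight * lm_logp
    return max(candidates, key=score)[0]

# The recognizer slightly prefers и over н in the first position,
# but only "нар" (sun) is a known word, so the language model flips the choice.
candidates = [("иар", math.log(0.55)), ("нар", math.log(0.45))]
word_freq = {"нар": 1e-3}  # made-up frequency for illustration
print(rescore(candidates, word_freq))  # нар
```

Setting lm_weight to 0 recovers the recognizer's own top choice, which makes it easy to measure how much the language model contributes.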

Achieving 92%: the last few percent

Going from 88% to 92% took as long as going from 65% to 88%. The last few percentage points came from:

  • Error analysis and targeted training. We categorized every error the model made on a validation set and found that 40% of errors involved just six character pairs. We created targeted training batches that oversampled these confusing pairs.
  • Document-level preprocessing. Binarization, deskewing, and line segmentation quality directly impact recognition accuracy. We spent two weeks tuning our preprocessing pipeline and gained 1.5 percentage points just from better line segmentation.
  • Ensemble voting. We run three model variants and take the majority vote on uncertain characters. This triples inference time but reduced character error rate by about 15% relative. We use this for high-value documents where accuracy matters more than throughput.
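Two of the steps above are simple enough to sketch: counting which character pairs get confused, and voting across model outputs. Both functions are simplified illustrations with assumed names. They assume the strings being compared are already aligned and of equal length, and the voting function votes on every position rather than only on low-confidence ones.

```python
from collections import Counter

def confusion_pairs(pairs):
    """Count character substitutions over aligned (truth, prediction) strings."""
    counts = Counter()
    for truth, pred in pairs:
        for t, p in zip(truth, pred):
            if t != p:
                # Treat ш->щ and щ->ш as the same confusable pair.
                counts[tuple(sorted((t, p)))] += 1
    return counts

def majority_vote(predictions):
    """Per-position majority vote across equally long model outputs.

    On a tie, the character from the earliest model in the list wins.
    """
    voted = []
    for chars in zip(*predictions):
        voted.append(Counter(chars).most_common(1)[0][0])
    return "".join(voted)

errors = confusion_pairs([("шина", "щина"), ("нар", "иар"), ("ширээ", "щирээ")])
print(errors.most_common(1))                      # [(('ш', 'щ'), 2)]
print(majority_vote(["шина", "щина", "шина"]))    # шина
```

The most frequent pairs from confusion_pairs are the natural candidates for oversampling in targeted training batches.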

92% character accuracy means roughly one error every 12 characters, or about one error every two words for words of around six characters. For the government digitization use case, this was sufficient when combined with a human review interface that highlighted low-confidence regions. The system reduced manual transcription time by roughly 70%, which was the operational metric that actually mattered.
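As a back-of-the-envelope check of what a character accuracy figure implies at the word level, the sketch below assumes a six-character word and independent character errors. Both are simplifying assumptions, not measurements from the project, and error_stats is a name made up for the example.

```python
def error_stats(char_accuracy, word_length):
    """Translate character accuracy into per-word figures."""
    error_rate = 1.0 - char_accuracy
    chars_per_error = 1.0 / error_rate
    errors_per_word = error_rate * word_length
    # Chance that a whole word is error-free, assuming independent errors.
    clean_word_prob = char_accuracy ** word_length
    return chars_per_error, errors_per_word, clean_word_prob

chars_per_error, errors_per_word, clean_word_prob = error_stats(0.92, 6)
print(f"one error every {chars_per_error:.1f} characters")
print(f"{errors_per_word:.2f} expected errors per word")
print(f"{clean_word_prob:.0%} of words fully correct")
```

Under these assumptions roughly 60% of six-letter words come out fully correct, which is why a review interface that highlights low-confidence regions matters so much in practice.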
