A Neural OCR Engine for North Saami

1. Abstract

The DH-LAB at the National Library of Norway is developing an open-source optical character recognition (OCR) engine for North Saami, an under-resourced indigenous minority language recognized by the Norwegian State. The engine is trained with Tesseract by means of cross-lingual model transfer. Evaluated on a held-out portion of the ground truth, the model reaches a bag-of-words F1 measure of 0.98. It will be the first freely available OCR engine for North Saami.
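The abstract does not spell out how the bag-of-words F1 measure is computed. As a minimal sketch, assuming whitespace tokenization and multiset (count-aware) overlap between the ground-truth text and the OCR output, it could look like the following. The function name `bag_of_words_f1` and the tokenization are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def bag_of_words_f1(reference: str, hypothesis: str) -> float:
    """Bag-of-words F1 between a ground-truth text and an OCR output.

    Assumption: tokens are whitespace-separated and word order is ignored;
    repeated words are matched by count (multiset intersection).
    """
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())

    # Number of tokens found in both bags, respecting multiplicity.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative tokens only: one of four words is misrecognized.
score = bag_of_words_f1("a b c d", "a b c x")
print(score)
```

Because word order is ignored, the measure rewards correctly recognized word forms regardless of segmentation or line-order errors, which is one reason it is sometimes preferred over strict character- or word-error rates for noisy scans.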

Andre Kåsen (lars.johnsen@nb.no), National Library of Norway, Håvard Østli, National Library of Norway, Andrea M. Huus, National Library of Norway and Lars Johnsen, National Library of Norway

