Manara - Qatar Research Repository

Arabic OCR Corpus v.2 (2,894 items from QNL Collection)

Version 2 2024-11-12, 20:13
Version 1 2024-09-12, 11:35
dataset
revised on 2024-11-12, 20:12 and posted on 2024-11-12, 20:13 authored by Qatar National Library

Dataset contents

This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library. The works are predominantly in Arabic, although fragments of text in other languages also occur. Basic descriptive metadata is provided for each item alongside the OCR text.

Release note for version 2 of the dataset

The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. Furthermore, every item has undergone a thorough visual quality assurance check conducted using a representative sample of pages. This update has resulted in a significant enhancement of word-level accuracy across the entire dataset, ensuring higher reliability and usability.

The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).
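
The checksum comparison described above can be sketched in Python as follows. This is a minimal sketch, not a QNL tool: it assumes that each checksums.sha256 file uses the common sha256sum layout (one "digest, whitespace, file name" entry per line) and that file names are written the same way in both versions. The function names parse_checksums and diff_checksums are illustrative.

```python
def parse_checksums(text):
    """Parse sha256sum-style lines ('<hex digest>  <file name>') into {name: digest}.

    Assumption: the checksum files follow the sha256sum layout; the dataset
    description does not specify the exact format.
    """
    checksums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the file name.
        checksums[name.lstrip("*")] = digest.lower()
    return checksums


def diff_checksums(old, new):
    """Return sorted lists (changed, added, removed) of file names."""
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return changed, added, removed
```

To use it, read the checksums.sha256 file from each dataset version as text, pass each to parse_checksums, and give the two resulting dictionaries to diff_checksums. Files reported as "changed" have the same name but different content in version 2.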

Dataset structure

The dataset consists of three files:

  • QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection, structured into the following columns:
    • CALL #(ITEM) - Item call number in the QNL catalog
    • RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
    • Repository URL - URL to digitized item content in the QNL repository
    • Catalog URL - URL to the complete item metadata record in the QNL catalog
    • AUTHOR - Main author information for the item
    • ADD AUTHOR - Additional author information for the item
    • PUB INFO - Item publication info
    • TITLE - Item title
    • DESCRIPTION - Item description
    • VOLUME - Item volume information (in case of some serial publications)
  • QNL_ArabicOCR_Corpus-v2.zip contains:
    • 2,894 text files with the following naming pattern: [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with a related metadata record. Each file contains text extracted from a particular item using OCR software.
    • checksums.sha256 - contains SHA256 checksums for all 2,894 text files
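
As an illustration of the structure above, the following Python sketch matches a text file to its metadata row and computes a file's SHA256 digest for comparison with checksums.sha256. It rests on several assumptions that the description does not confirm: that the record number contains no hyphen, that the values in the RECORD #(ITEM) column match the file-name prefix exactly, and that the CSV is UTF-8 encoded. The helper names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def record_number(file_name):
    """Return the record-number part of '[record number]-[repository id].txt'."""
    # Assumption: the record number itself contains no hyphen.
    return Path(file_name).stem.split("-", 1)[0]


def load_metadata(csv_path):
    """Index the metadata rows by the 'RECORD #(ITEM)' column."""
    # utf-8-sig tolerates a byte-order mark; the real encoding is an assumption.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return {row["RECORD #(ITEM)"].strip(): row for row in csv.DictReader(handle)}


def sha256_of(path):
    """Compute the SHA256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the metadata loaded once, load_metadata(...)[record_number(name)] returns the row for a given text file, and sha256_of(path) can be compared with the corresponding entry in checksums.sha256 to confirm that the file was downloaded and extracted intact.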


Language

  • Arabic

Publisher

Qatar National Library

Publication Year

  • 2024

License statement

This dataset consists of the text of out-of-copyright works extracted using OCR software. QNL does not assert copyright claims to scans or other direct reproductions of works from the collection of the Qatar National Library. The associated metadata is released under the CC0 1.0 Universal license.

Institution affiliated with

  • Qatar National Library

Methodology

Content OCRed at Qatar National Library is processed through a continuously evolving text recognition system that achieves up to 95% accuracy at the word level. The workflow begins with binarization, followed by text enhancement to improve clarity for recognition. By combining an in-house machine learning system with proprietary OCR engines, the QNL Digitization Team obtains high-quality results even for complex Arabic scripts. The quality of QNL OCR processing tools has improved significantly over time; some of the texts in the dataset that were processed using older versions of QNL technology may therefore not reflect the improvements made with the latest software.
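
The description does not state how word-level accuracy is measured. One common definition, shown here only as an assumption, is one minus the word-level edit distance between a manually corrected reference and the OCR output, divided by the number of reference words. The sketch below implements that definition with whitespace tokenization; the function names are illustrative.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the OCR output
                current[j - 1] + 1,      # extra word in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, ocr_text):
    """Word accuracy = 1 - (word edit distance / number of reference words)."""
    reference = reference_text.split()
    hypothesis = ocr_text.split()
    if not reference:
        raise ValueError("reference text contains no words")
    return 1 - word_edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, an OCR output with one wrong word out of four reference words scores 0.75. The score can drop below zero when the OCR output contains many spurious words, and it is sensitive to tokenization choices, which matter for Arabic text with attached clitics and diacritics.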

Temporal coverage

Early 20th century and before
