Arabic OCR Corpus v.2 (2,894 items from QNL Collection)
Dataset contents
This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library. The works are predominantly in Arabic, though fragments of text in other languages also occur. In addition to the OCR text, basic descriptive metadata is provided for each item.
Release note for version 2 of the dataset
The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. In addition, every item has undergone a visual quality assurance check based on a representative sample of its pages. This update has significantly improved word-level accuracy across the entire dataset.
The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).
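The comparison described above could be sketched as follows. This is a minimal illustration, not an official QNL tool: it assumes both checksum files use the common `sha256sum` layout (one `<hex digest>  <file name>` pair per line), which the dataset description does not state. The function names and any file paths passed to them are hypothetical.

```python
from pathlib import Path


def parse_checksums(text: str) -> dict[str, str]:
    """Parse sha256sum-style lines ("<hex digest>  <file name>") into {name: digest}.

    Assumption: the checksum file follows the sha256sum output layout.
    """
    checksums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading "*" before the file name.
        checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def diff_checksums(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify file names as added, removed, or changed between two versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def diff_checksum_files(old_path, new_path) -> dict[str, list[str]]:
    """Read two checksum files (hypothetical paths) and compare their contents."""
    old = parse_checksums(Path(old_path).read_text(encoding="utf-8"))
    new = parse_checksums(Path(new_path).read_text(encoding="utf-8"))
    return diff_checksums(old, new)
```

Calling `diff_checksum_files` with the version 1 and version 2 checksum files would return the file names that were added, removed, or changed, provided the file names are stable across the two versions.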
Dataset structure
The dataset consists of three files:
- QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection. Both files are structured into the following columns:
  - CALL #(ITEM) - Item call number in the QNL catalog
  - RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
  - Repository URL - URL to digitized item content in the QNL repository
  - Catalog URL - URL to the complete item metadata record in the QNL catalog
  - AUTHOR - Main author information for the item
  - ADD AUTHOR - Additional author information for the item
  - PUB INFO - Item publication info
  - TITLE - Item title
  - DESCRIPTION - Item description
  - VOLUME - Item volume information (in case of some serial publications)
- QNL_ArabicOCR_Corpus-v2.zip contains:
  - 2,894 text files with the following naming pattern: [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with a related metadata record. Each file contains text extracted from a particular item using OCR software.
  - checksums.sha256 - contains SHA256 checksums for all 2,894 text files
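The matching between text files and metadata records described above might look like the sketch below. It rests on several assumptions not confirmed by the dataset description: the CSV is UTF-8 encoded, the record number in the file name is everything before the first hyphen, and it is written exactly as in the RECORD #(ITEM) column (some normalization may be needed in practice). All function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def record_number_from_filename(filename: str) -> str:
    """Return the part of the file name before the first hyphen.

    Assumption: the record number itself contains no hyphen.
    """
    return Path(filename).stem.split("-", 1)[0]


def load_metadata(csv_path) -> dict[str, dict[str, str]]:
    """Index metadata rows by the "RECORD #(ITEM)" column.

    Assumption: the CSV is UTF-8; "utf-8-sig" also tolerates a leading BOM.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return {row["RECORD #(ITEM)"].strip(): row for row in csv.DictReader(f)}


def match_texts_to_metadata(text_dir, metadata):
    """Map each .txt file name to its metadata row, or None if no row matches."""
    matches = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        matches[path.name] = metadata.get(record_number_from_filename(path.name))
    return matches


def sha256_of_file(path) -> str:
    """Compute the SHA256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After extracting the archive, `match_texts_to_metadata` would pair each text file with its metadata row, and any file mapped to `None` signals that the record number formats differ and need normalizing. The digest returned by `sha256_of_file` can be compared with the corresponding entry in checksums.sha256 to confirm that a file was downloaded intact.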
Language
- Arabic
Publisher
- Qatar National Library
Publication Year
- 2024
License statement
This dataset consists of the text of out-of-copyright works extracted using OCR software. QNL does not assert copyright claims to scans or other direct reproductions of works from the collection of the Qatar National Library. The associated metadata is released under the CC0 1.0 Universal license.
Institution affiliated with
- Qatar National Library