Portfolio

Galool Somali Corpus: Progress Report

Night of Dec 07, 2025


Project Goal

Build a high-quality Somali text corpus to support dictionary development, language research, and NLP applications.


Tonight's Achievements

  • OCR Tools Setup
    • Installed Tesseract-OCR 5.5.0 (Windows) ✅
    • Installed OCRmyPDF 16.12.0 via Python ✅
    • Installed Poppler 25.12.0 for PDF → text conversion (pdftotext) ✅
  • Test OCR Workflow
    • Extracted pages 6–10 from the first Somali literature book using Chrome Print → Save as PDF
    • Ran OCRmyPDF with -l eng (English language data), which covers the Latin script used for Somali
    • Result: Searchable PDF preserves original formatting, layout, and background ✅
  • Mixed Script Handling
    • Tested Arabic headings with -l eng+ara
    • Verified that headings were recognized correctly alongside Somali text ✅
  • Extract Editable Text
    • Converted OCR PDF to UTF-8 text using pdftotext
    • Text is fully editable and forms the base corpus for proofreading and cleaning ✅
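
The two command-line steps above (OCR, then text extraction) could be scripted roughly as follows. This is a minimal sketch, not the exact commands used tonight: the function names, file names, and the `_ocr.pdf` naming scheme are assumptions for illustration. It relies only on the OCRmyPDF `-l` language option and the pdftotext `-enc` option, and assumes both tools are on the PATH.

```python
import subprocess
from pathlib import Path


def build_ocr_command(source_pdf: Path, ocr_pdf: Path, languages: list[str]) -> list[str]:
    """Build the OCRmyPDF call; languages are joined with '+' (e.g. eng+ara)."""
    return ["ocrmypdf", "-l", "+".join(languages), str(source_pdf), str(ocr_pdf)]


def build_text_command(ocr_pdf: Path, text_file: Path) -> list[str]:
    """Build the pdftotext call that writes the extracted text as UTF-8."""
    return ["pdftotext", "-enc", "UTF-8", str(ocr_pdf), str(text_file)]


def run_pipeline(source_pdf: Path, work_dir: Path, languages: list[str]) -> Path:
    """Run OCR, then text extraction; return the path of the text file.

    The output names (<stem>_ocr.pdf, <stem>.txt) are a hypothetical convention.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    ocr_pdf = work_dir / f"{source_pdf.stem}_ocr.pdf"
    text_file = work_dir / f"{source_pdf.stem}.txt"
    # check=True raises CalledProcessError if either tool exits with an error.
    subprocess.run(build_ocr_command(source_pdf, ocr_pdf, languages), check=True)
    subprocess.run(build_text_command(ocr_pdf, text_file), check=True)
    return text_file
```

For pages with Arabic headings, the call would pass both languages, for example `run_pipeline(Path("book1_pages.pdf"), Path("work"), ["eng", "ara"])`, where `book1_pages.pdf` is a placeholder file name.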

Next Steps

  1. Apply the workflow to the full 70+ page book and remaining literature books
  2. Begin manual proofreading to correct OCR errors
  3. Organize files into a structured corpus folder
  4. Plan for future OCR improvements, including Somali-specific Tesseract training

Outcome

A fully functional digitization pipeline is now established:

Original PDFs → Searchable OCR PDFs → Editable UTF-8 text

This is the foundation of the Galool Somali Corpus, paving the way for future dictionary projects, NLP research, and community contributions.

Date: 2025-12-07

Work Summary:

Today, we continued developing the Somali literature corpus project. Key accomplishments:

  1. Text Cleaning and Organization
    • We successfully processed the first book (Book1) containing Somali poems (Gabay, Geeraar), short stories (Sheeko), and wisdom sayings (Curis).
    • Cleaned the OCR text by removing noise such as page numbers, leftover symbols, and extraneous numbers, while keeping all meaningful text intact.
  2. Data Structuring
    • Segmented the book into meaningful sections: poems, stories, and wisdom sayings.
    • Added metadata such as title and author for each segment.
  3. Automated Corpus Generation
    • Converted the cleaned and segmented text into a structured JSONL format suitable for corpus building and future NLP tasks.
    • Prepared the file for easy search, filtering, and analysis of Somali literary works.
  4. Next Steps
    • Fix minor author field inconsistencies.
    • Optionally trim each segment to short previews for quick referencing.
    • Continue processing remaining books in the same automated workflow.
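
As an illustration of the cleaning step in item 1, the sketch below drops lines that contain only a page number or only leftover symbols, and collapses runs of blank lines. The patterns are assumptions about what the OCR noise looks like, not the rules actually applied to Book1, and the sample text in the usage note is invented.

```python
import re

# A line holding nothing but a short number, optionally wrapped in dashes ("12", "- 13 -").
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")
# A non-empty line made up only of punctuation, symbols, and spaces.
SYMBOL_ONLY = re.compile(r"^[\W_]+$")


def clean_ocr_text(raw: str) -> str:
    """Remove page-number and symbol-only lines; keep every other line."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue
        if stripped and SYMBOL_ONLY.match(stripped):
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # Collapse three or more newlines so removed lines do not leave large gaps.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

For example, `clean_ocr_text("Title\n\n12\n\nLine one\n~~ | ~~\n- 13 -\nLine two")` keeps the title and both text lines and discards the rest. Rules like these can remove meaningful lines (a stanza number, for instance), so the output would still need the manual proofreading pass.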

Outcome:
The first book is now fully cleaned, segmented, and structured. This forms the foundation of a high-quality Somali literature corpus for research, NLP, and cultural preservation.
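
A minimal sketch of what the structured JSONL output could look like, with one JSON object per segment. The report only states that each segment carries a title and an author; the `book` and `genre` fields, the `Segment` class, and the helper names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Segment:
    book: str    # hypothetical field, e.g. "Book1"
    genre: str   # hypothetical field, e.g. "gabay", "geeraar", "sheeko"
    title: str
    author: str
    text: str


def to_jsonl(segments: list[Segment]) -> str:
    """Serialize segments as JSON Lines: one object per line."""
    # ensure_ascii=False keeps any non-ASCII characters readable in the file.
    return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in segments) + "\n"


def filter_by_genre(jsonl: str, genre: str) -> list[dict]:
    """Return the records whose genre matches, skipping blank lines."""
    records = (json.loads(line) for line in jsonl.splitlines() if line.strip())
    return [r for r in records if r["genre"] == genre]
```

Because each line is an independent record, the file can be searched or filtered line by line without loading the whole corpus, which is one reason the format suits the search and analysis goals described above.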