Galool

Project Portfolio & Progress Log

This section documents every practical step in building the Somali language corpus and dictionary.
From early experiments to long-term milestones, each entry represents a building block of the Galool project.

Current Active Projects

📌 1. Somali Text Corpus (Phase 1)

Collecting public Somali text
Manual cleaning tests
Designing file structure
Preparing for tokenization and frequency analysis
Status: In progress

📌 2. Somali Dictionary Framework

Choosing initial word list
Drafting definition format
Designing data model (JSON + human-readable format)
Creating example entries
Status: Planning / Testing
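
To make the "JSON + human-readable format" idea concrete, here is a minimal sketch of one entry that can be rendered both ways. The field names (headword, pos, definition_en, examples) and the sample entry are illustrative assumptions, not the project's settled format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entry:
    """One dictionary entry; all field names here are hypothetical."""
    headword: str
    pos: str                      # part of speech, e.g. "noun"
    definition_en: str
    examples: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps any non-ASCII characters readable.
        return json.dumps(asdict(self), ensure_ascii=False)

    def to_text(self) -> str:
        # Human-readable form: "headword (pos): definition" plus examples.
        lines = [f"{self.headword} ({self.pos}): {self.definition_en}"]
        lines += [f"  e.g. {ex}" for ex in self.examples]
        return "\n".join(lines)


# Sample entry for illustration only.
entry = Entry("buug", "noun", "book", ["Buug cusub"])
print(entry.to_json())
print(entry.to_text())
```

Keeping one source object and deriving both views from it means the JSON and the readable text cannot drift apart.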

📌 3. Digitizing Old Somali Materials

Reviewing scanned Somali government textbooks
Preparing OCR test samples
Deciding which books to rewrite manually
Testing correction workflows
Status: Initial experiments

📌 4. Web Infrastructure & Documentation

Setting up Galool.net
Creating subdomains for internal tools
Building documentation style
Structuring long-term archives
Status: Ongoing

Mini-Projects

These are small experiments used to learn skills and test ideas.

🔹 Mini-Project: OCR Accuracy Test

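Testing Tesseract and other OCR engines on Somali text. Standard Somali Latin orthography uses no diacritics, so the cases to watch are doubled vowels, digraphs (dh, kh, sh), and the apostrophe.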
Goal: Evaluate accuracy and decide when manual rewriting is better.
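
One common way to put a number on OCR accuracy is the character error rate (CER): the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The sketch below is an assumption about how this test could be scored, not a metric the project has committed to, and the sample strings are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ocr_text: str, reference: str) -> float:
    """Character error rate relative to the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


reference = "waxbarasho"   # hand-corrected text (invented sample)
ocr_output = "waxbarasno"  # OCR output with one wrong character
print(f"CER: {cer(ocr_output, reference):.2%}")
```

A per-page CER threshold, chosen after a few trial pages, could then serve as the rule for when manual rewriting is the better option.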

🔹 Mini-Project: Text Cleaning Pipeline

Building a simple Python script to remove noise: HTML remnants, inconsistent punctuation, and duplicate lines.
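
A minimal sketch of such a script, assuming line-based input. The regex tag matcher is a deliberate simplification (a real HTML parser would be safer for messy pages), and the sample lines are invented.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher (simplification)
SPACE_RE = re.compile(r"\s+")
# Curly quotes and apostrophes mapped to their plain ASCII forms.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_line(line: str) -> str:
    line = TAG_RE.sub(" ", line)      # drop tags
    line = html.unescape(line)        # decode entities such as &amp;
    line = line.translate(QUOTES)     # normalize quote characters
    return SPACE_RE.sub(" ", line).strip()


def clean_lines(lines):
    """Clean each line, skipping empties and exact duplicates (order kept)."""
    seen = set()
    out = []
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out


sample = ["<p>Nabad  iyo   caano</p>", "Nabad iyo caano", "<br>", "go\u2019aan"]
print(clean_lines(sample))
```

Normalizing the apostrophe matters for Somali in particular, since it is part of the spelling of many words.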

🔹 Mini-Project: Small Demo Corpus

Creating a 50k–100k word mini-corpus to practice tokenization, sorting, and frequency lists.
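
As a starting point for that practice, a tokenizer and frequency list can be sketched in a few lines. The token pattern (letters with optional internal apostrophes) is an assumption about what counts as a Somali word, not a finished tokenization rule.

```python
import re
from collections import Counter

# Runs of letters, allowing an internal apostrophe as in "la'aan".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())


def frequency_list(text: str, top: int = 10):
    """Most common tokens; ties are broken alphabetically."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


text = "Aqoon la'aan waa iftiin la'aan."
for word, n in frequency_list(text):
    print(f"{n}\t{word}")
```

Sorting by count and then alphabetically keeps the output stable between runs, which makes frequency lists easier to compare as the corpus grows.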

🔹 Mini-Project: Community Feedback Form

Creating a page where Somali speakers can suggest words, corrections, or sample sentences.

Long-Term Vision

✔ Full Somali Corpus

Millions of words of cleaned and verified Somali text.

✔ Full Somali–English Dictionary

Modern, comprehensive, open to everyone.

✔ Tools & APIs

A word lookup API, a search engine, teaching resources, and (in the future) a part-of-speech (POS) tagger.

✔ Education & Preservation

Digitized archives, reading materials, and content that future generations can depend on.

Progress Timeline

I will update this timeline weekly, using the template below for each entry.

Month / Year

What was completed
What was learned
What challenges appeared
What’s next

Galool Somali Corpus — Progress Report

Night of Dec 07, 2025

Project Goal

Build a high-quality Somali text corpus to support dictionary development, language research, and NLP applications.

Tonight’s Achievements

OCR Tools Setup
- Installed Tesseract-OCR 5.5.0 (Windows) ✅
- Installed OCRmyPDF 16.12.0 via Python ✅
- Installed Poppler 25.12.0 for PDF → text conversion (pdftotext) ✅
Test OCR Workflow
- Extracted pages 6–10 from the first Somali literature book using Chrome Print → Save as PDF
- Ran OCRmyPDF with -l eng, using the English model to read Somali Latin script
- Result: the searchable PDF preserves the original formatting, layout, and background ✅
Mixed Script Handling
- Tested Arabic headings with -l eng+ara
- Verified that the headings were recognized correctly alongside the Somali text ✅
Extract Editable Text
- Converted OCR PDF to UTF-8 text using pdftotext
- The text is fully editable and forms the base corpus for proofreading and cleaning ✅
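
The two command-line steps above could be scripted roughly as follows. This is a sketch, not the exact commands that were run tonight: the file names are hypothetical, and it assumes ocrmypdf and pdftotext are both available on PATH.

```python
import subprocess
from pathlib import Path


def ocr_command(src: Path, dst: Path, langs: str = "eng") -> list[str]:
    # -l selects Tesseract language models, e.g. "eng" or "eng+ara".
    return ["ocrmypdf", "-l", langs, str(src), str(dst)]


def text_command(pdf: Path, txt: Path) -> list[str]:
    # -enc sets the output text encoding for pdftotext.
    return ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)]


def digitize(src: Path, workdir: Path, langs: str = "eng") -> Path:
    """Run OCR, then text extraction; returns the path of the .txt file."""
    ocr_pdf = workdir / f"{src.stem}.ocr.pdf"
    txt = workdir / f"{src.stem}.txt"
    subprocess.run(ocr_command(src, ocr_pdf, langs), check=True)
    subprocess.run(text_command(ocr_pdf, txt), check=True)
    return txt


# Show the OCR command for a hypothetical page extract with Arabic headings.
print(" ".join(ocr_command(Path("book1_pages.pdf"), Path("book1_pages.ocr.pdf"), "eng+ara")))
```

Building the commands in separate functions keeps them easy to inspect before anything is run, and check=True makes a failed OCR step stop the pipeline instead of passing a missing file along.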

Next Steps

Apply the workflow to the full 70+ page book and the remaining literature books
Begin manual proofreading to correct OCR errors
Organize files into a structured corpus folder
Plan for future OCR improvements, including Somali-specific Tesseract training

Outcome

A fully functional digitization pipeline is now established:

Original PDFs → Searchable OCR PDFs → Editable UTF-8 text

This is the foundation of the Galool Somali Corpus, paving the way for future dictionary projects, NLP research, and community contributions.

Date: 2025-12-07

Work Summary:

Today, we continued developing the Somali literature corpus project. Key accomplishments:

Text Cleaning and Organization
- We successfully processed the first book (Book1) containing Somali poems (Gabay, Geeraar), short stories (Sheeko), and wisdom sayings (Curis).
- Cleaned the OCR text by removing noise such as page numbers, leftover symbols, and extraneous numbers, while keeping all meaningful text intact.
Data Structuring
- Segmented the book into meaningful sections: poems, stories, and wisdom sayings.
- Added metadata such as title and author for each segment.
Automated Corpus Generation
- Converted the cleaned and segmented text into a structured JSONL format suitable for corpus building and future NLP tasks.
- Prepared the file for easy search, filtering, and analysis of Somali literary works.
Next Steps
- Fix minor author field inconsistencies.
- Optionally trim each segment to short previews for quick referencing.
- Continue processing remaining books in the same automated workflow.
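
For reference, the cleaning and JSONL steps described above can be sketched as below. The record fields (genre, title, author, text, preview) and the sample values are placeholders rather than the actual Book1 schema, and the page-number rule would also drop any legitimate line that contains only digits.

```python
import io
import json
import re

PAGE_NUMBER_RE = re.compile(r"^\s*\d+\s*$")   # a line holding only digits


def strip_page_numbers(text: str) -> str:
    kept = [ln for ln in text.splitlines() if not PAGE_NUMBER_RE.match(ln)]
    return "\n".join(kept).strip()


def to_record(genre: str, title: str, author: str, text: str,
              preview_len: int = 40) -> dict:
    body = strip_page_numbers(text)
    return {
        "genre": genre,
        "title": title,
        "author": author.strip(),     # trims stray whitespace only
        "text": body,
        "preview": body[:preview_len],
    }


def write_jsonl(records, fh) -> None:
    # One JSON object per line; ensure_ascii=False keeps text readable.
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Placeholder segment, not taken from the book.
records = [to_record("gabay", "Tusaale", " Qoraa aan la aqoon ",
                     "Sadarka koowaad\n12\nSadarka labaad")]
buf = io.StringIO()
write_jsonl(records, buf)
print(buf.getvalue(), end="")
```

Storing a short preview next to the full text covers the optional trimming step without discarding anything from the segment itself.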

Outcome:
The first book is now fully cleaned, segmented, and structured. This forms the foundation of a high-quality Somali literature corpus for research, NLP, and cultural preservation.