Blog

  • Outreachy Update: Understanding and Improving def-extractor.py

    Outreachy Update: Understanding and Improving def-extractor.py

    Introduction

    Over the past couple of weeks, I have been working on understanding and improving def-extractor.py, a Python script that processes dictionary data from Wiktionary to generate word lists and definitions in structured formats. My main task has been to refactor the script to use configuration files instead of hardcoded values, making it more flexible and maintainable.

    In this blog post, I’ll explain:

    1. What the script does
    2. How it works under the hood
    3. The changes I made to improve it
    4. Why these changes matter

    What Does the Script Do?

    At a high level, this script processes huge JSONL (JSON Lines) dictionary dumps, like the ones from Kaikki.org, and filters them down into clean, usable formats.

    The def-extractor.py script takes raw dictionary data (from Wiktionary) and processes it into structured formats like:

    • Filtered word lists (JSONL)
    • GVariant binary files (for efficient storage)
    • Enum tables (for parts of speech & word tags)

    It was originally designed to work with specific word lists (Wordnik, Broda, and a test list), but my goal is to make it configurable so it can support any word list with a simple config file.

    How It Works (Step by Step)

    1. Loading the Word List

    The script starts by loading a word list (e.g., Wordnik’s list of common English words). It filters out invalid words (too short, contain numbers, etc.) and stores them in a hash table for quick lookup.
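The loading step can be sketched roughly like this. The validation rules and names below (the length bounds, the alphabet check) are illustrative assumptions, not the script's exact logic:

```python
# Hypothetical sketch of step 1: load a word list and keep only valid
# entries in a set for O(1) membership lookups. Rules shown are examples.
ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def load_word_list(lines, min_len=2, max_len=20):
    words = set()
    for line in lines:
        word = line.strip().upper()
        if not (min_len <= len(word) <= max_len):
            continue  # too short or too long
        if not set(word) <= ALPHABET:
            continue  # contains digits, punctuation, etc.
        words.add(word)
    return words

valid = load_word_list(["cat", "a", "Hello", "42nd"])  # keeps CAT and HELLO
```

A Python set (backed by a hash table) makes the later membership checks against millions of Wiktionary entries fast.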

    2. Filtering Raw Wiktionary Data

    Next, it processes a massive raw-wiktextract-data.jsonl file (the Wiktionary dump) and keeps only entries that:

    • Match words from the loaded word list
    • Are in the correct language (e.g., English)
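In sketch form, the filter pass looks something like the following. The field names (word, lang_code) match the wiktextract JSONL format, but treat the details as an illustration rather than the script's exact code:

```python
import json

def filter_entries(jsonl_lines, wordset, lang_code="en"):
    """Yield only entries whose word is in wordset and whose language matches."""
    for line in jsonl_lines:
        entry = json.loads(line)
        if entry.get("lang_code") != lang_code:
            continue  # wrong language
        if entry.get("word", "").upper() not in wordset:
            continue  # not in the loaded word list
        yield entry

raw = [
    '{"word": "cat", "lang_code": "en"}',
    '{"word": "chat", "lang_code": "fr"}',
    '{"word": "zzzz", "lang_code": "en"}',
]
kept = list(filter_entries(raw, {"CAT"}))  # keeps only the "cat" entry
```

Streaming the file line by line like this keeps memory use flat even for multi-gigabyte dumps.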

    3. Generating Structured Outputs

    After filtering, the script creates:

    • Enum tables (JSON files listing parts of speech & word tags)
    • GVariant files (binary files for efficient storage and fast lookup)
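An enum table is essentially a stable mapping from each distinct value (like a part of speech) to a small integer, so the binary files can store an index instead of a string. A minimal sketch, with a JSON layout that may differ from the script's actual output:

```python
import json

def build_enum_table(entries, key="pos"):
    """Assign a stable integer to each distinct value of `key`."""
    values = sorted({e[key] for e in entries if key in e})
    return {value: index for index, value in enumerate(values)}

entries = [{"pos": "noun"}, {"pos": "verb"}, {"pos": "noun"}]
table = build_enum_table(entries)
print(json.dumps(table))  # {"noun": 0, "verb": 1}
```

Sorting before numbering keeps the table deterministic across runs, which matters when the binary files reference these indices.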

    What Changes Have I Made?

    1. Added Configuration Support

    Originally, the script used hardcoded paths and settings. I modified it to read from .conf files, allowing users to define:

    • Source word list file
    • Output directory
    • Word validation rules (min/max length, allowed characters)

    Before (Hardcoded):

    WORDNIK_LIST = "wordlist-20210729.txt"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    After (Configurable):

    [Word List]
    Source = my-wordlist.txt
    MinLength = 2
    MaxLength = 20
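    Reading a file like this is straightforward with Python's configparser. A minimal sketch of how the script can consume it (the section and key names match the example above; the surrounding code is illustrative):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[Word List]
Source = my-wordlist.txt
MinLength = 2
MaxLength = 20
""")

section = config["Word List"]
source = section["Source"]              # "my-wordlist.txt"
min_len = section.getint("MinLength")   # parsed as int, not str
max_len = section.getint("MaxLength")
```

The getint() helper saves a round of manual int() conversions and raises a clear error if a value is not numeric.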

    2. Improved File Path Handling

    Instead of hardcoding paths, the script now constructs them dynamically:

    output_path = os.path.join(config.word_lists_dir, f"{config.id}-filtered.jsonl")

    Why Do These Changes Matter?

    • Flexibility: now supports any word list via config files.
    • Maintainability: no more editing code to change paths or rules.
    • Scalability: easier to add new word lists or languages.
    • Consistency: all settings live in config files.

    Next Steps?

    1. Better Error Handling

    I am working on adding checks for:

    • Missing config fields
    • Invalid word list files
    • Incorrectly formatted data
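A sketch of what such validation could look like. The required keys and the error messages here are hypothetical placeholders, not the final design:

```python
import configparser

# Illustrative set of keys a word-list config is assumed to need.
REQUIRED_KEYS = ("Source", "MinLength", "MaxLength")

def validate_config(config):
    """Raise ValueError with a clear message for common config mistakes."""
    if "Word List" not in config:
        raise ValueError("missing [Word List] section")
    section = config["Word List"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    if section.getint("MinLength") > section.getint("MaxLength"):
        raise ValueError("MinLength must not exceed MaxLength")
```

Failing early with a specific message beats a cryptic KeyError halfway through processing a multi-gigabyte dump.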

    2. Unified Word Loading Logic

    There are currently separate loading functions, load_wordnik() and load_broda().

    I want to merge them into a single load_words(config) that works for any word list.
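The shape I have in mind is roughly this, where every per-list difference comes from the config rather than from a dedicated function (a sketch under that assumption, not the final code):

```python
import configparser

def load_words(config, lines):
    """One loader for any word list: the rules come from the config."""
    section = config["Word List"]
    min_len = section.getint("MinLength", fallback=2)
    max_len = section.getint("MaxLength", fallback=20)
    words = set()
    for line in lines:
        word = line.strip().upper()
        if min_len <= len(word) <= max_len and word.isalpha():
            words.add(word)
    return words

config = configparser.ConfigParser()
config.read_string("[Word List]\nMinLength = 3\nMaxLength = 5\n")
words = load_words(config, ["cat", "hippo", "elephant", "a1"])  # CAT, HIPPO
```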

    3. Refactor legacy code for better structure

    Try It Yourself

    1. Download the script: [wordlist-Gitlab]
    2. Create a .conf config file
    3. Run: python3 def-extractor.py --config my-wordlist.conf filtered-list

    Happy coding!

  • Outreachy Update: Two Weeks of Configs, Word Lists, and GResource Scripting

    Outreachy Update: Two Weeks of Configs, Word Lists, and GResource Scripting

    It has been a busy two weeks of learning as I continued contributing to the GNOME Crosswords project. I have mainly been working on improving how word lists are managed and included using configuration files.

    I started by writing documentation for how to add a new word list to the project by using .conf files. The configuration files define properties like display name, language, and origin of the word list so that contributors can simply add new vocabulary datasets. Each word list can optionally pull in definitions from Wiktionary and parse them, converting them into resource files for use by the game.

    In addition, I wrote a program that takes config file contents and turns them into GResource XML files. This isn’t the bulk of the project, but a useful tool that automates part of the setup and ensures consistency between different word list entries. It takes in a .conf file and outputs a corresponding .gresource.xml.in file, mapping the necessary resources to suitable aliases. This was a good chance for me to learn more about Python’s argparse and configparser modules.
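The core of such a generator can be sketched in a few lines with configparser plus ElementTree. The Name key, the resource prefix, and the alias scheme below are illustrative assumptions, not the project's actual conventions:

```python
import configparser
import xml.etree.ElementTree as ET

def config_to_gresource(conf_text, prefix="/org/gnome/Crosswords/wordlists"):
    """Sketch: turn a word-list .conf into a .gresource.xml.in document."""
    config = configparser.ConfigParser()
    config.read_string(conf_text)
    name = config["Word List"]["Name"]  # hypothetical key

    gresources = ET.Element("gresources")
    gresource = ET.SubElement(gresources, "gresource", prefix=prefix)
    file_el = ET.SubElement(gresource, "file", alias=f"{name}.gvariant")
    file_el.text = f"{name}-filtered.gvariant"
    return ET.tostring(gresources, encoding="unicode")

xml = config_to_gresource("[Word List]\nName = sample\n")
```

Building the XML with ElementTree instead of string templates guarantees the output is well-formed regardless of what characters the config values contain.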

    Beyond scripting, I’ve been in regular communication with my mentor, seeking feedback and guidance to improve both my technical and collaborative skills. One key takeaway has been the importance of sharing smaller, incremental commits rather than submitting a large block of work all at once, a practice that not only helps with clarity but also encourages consistent progress tracking. I was also advised to avoid relying on AI-generated code and instead focus on writing clear, simple, and understandable solutions, which I’ve consciously applied to both my code and documentation.

    Next, I’ll be looking into how definitions are extracted and how importer modules work. Lots more to discover, especially about the innards of the Wiktionary extractor tool.

    Looking forward to sharing more updates as I get deeper into the project!

  • Outreachy Internship: My First Two Weeks with GNOME

    Outreachy Internship: My First Two Weeks with GNOME:


    Diving into Word Scoring for Crosswords

    In my first two weeks as an Outreachy intern with GNOME, I’ve been getting familiar with the project I’ll be contributing to and settling into a rhythm with my mentor, Jonathan Blandford. We’ve agreed to meet every Monday to review the past week and plan goals for the next — something I’ve already found incredibly grounding and helpful.

    What I’m Working On: The Word Score Project

    My project revolves around improving how GNOME’s crossword tools (like GNOME Crosswords) assess and rank words. This is part of a larger effort to support puzzle constructors by helping them pick better words for their grids — ones that are fun, fresh, and fair.

    But what makes a “good” crossword word?

    This is what the Word Score project aims to answer. It proposes a scoring system that assigns numerical values to words based on multiple measurable traits, such as:

    • Lexical interest (e.g. does it contain unusual bigrams/trigrams like “KN” or “OXC”?),
    • Frequency in natural language (based on datasets like Google Ngrams),
    • Familiarity to solvers (which may differ from frequency),
    • Definition count (some words like SET or RUN are goldmines for cryptic clues),
    • Sentiment and appropriateness (nobody wants a vulgar word in a breakfast puzzle).
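To make the first trait concrete, "lexical interest" can be approximated by counting bigrams that are rare in English. The rare-bigram set below is a toy stand-in for statistics you would derive from a real corpus:

```python
# Hypothetical sketch of one scoring trait: count rare bigrams in a word.
RARE_BIGRAMS = {"KN", "XC", "ZV", "QA"}  # toy data, not corpus-derived

def lexical_interest(word):
    word = word.upper()
    bigrams = {word[i:i + 2] for i in range(len(word) - 1)}
    return len(bigrams & RARE_BIGRAMS)

print(lexical_interest("KNOT"))  # 1, for the "KN" bigram
```

A real implementation would weight bigrams by their corpus frequency rather than using a binary rare/common split.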

    The goal is to build a system that supports both the autofill functionality and the word list interface in GNOME Crosswords, giving human setters better tools while respecting editorial judgment. In other words, this project isn’t about replacing setters — it’s about enhancing their toolkit.

    You can read more about the project’s goals and philosophy in our draft document: Thoughts on Scoring Words (final link coming soon).

    Week 1: Building and Breaking Puzzles

    During my first week, I spent time getting familiar with the project environment and experimenting with crossword puzzle generation. I created test puzzles to better understand how word placement, scoring, and validation work under the hood.

    This hands-on experimentation helped me form a clearer mental model of how GNOME Crosswords structures and fills puzzles — and why scoring matters. The way words interact in a grid can make some fills elegant and others feel forced or unplayable.

    Week 2: Wrestling with libipuz and Introspection

    In the second week, my focus shifted to working on libipuz, a C library that parses and exports puzzles using the IPUZ format. But getting libipuz working with GNOME’s introspection system proved more challenging than expected.

    Initially, I tried to use it inside the crosswords container, but it wasn’t cooperating. After some digging (and rebuilding), we decided to create a separate container specifically for libipuz to enable introspection and allow scripting in languages like Python and JavaScript to interact with it.

    This also gave me a deeper understanding of how GNOME handles language bindings via GObject Introspection — something I hadn’t worked with before, but I’m quickly getting the hang of.

    Bonus: Scrabble-Inspired Scoring Script

    As a side exploration, I also wrote a quick Python script that calculates Scrabble-style scores for words. While Scrabble scoring isn’t the same as what we want in crosswords (it values rare letters like Z and Q), it gave me a fun way to experiment with scoring mechanics and visualize how simple rules change the ranking of word lists. This mini-project helped me warm up to the idea of building more complex scoring systems later on.
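A minimal version of that side script looks like this. The letter values are the standard English Scrabble tile values; the ranking step shows how even a crude scorer reorders a word list:

```python
# Scrabble-style word scorer using standard English tile values.
SCORES = {
    **dict.fromkeys("AEILNORSTU", 1),
    **dict.fromkeys("DG", 2),
    **dict.fromkeys("BCMP", 3),
    **dict.fromkeys("FHVWY", 4),
    "K": 5,
    **dict.fromkeys("JX", 8),
    **dict.fromkeys("QZ", 10),
}

def scrabble_score(word):
    return sum(SCORES[letter] for letter in word.upper())

words = ["queen", "set", "jazz"]
ranked = sorted(words, key=scrabble_score, reverse=True)  # jazz, queen, set
```

As the post notes, this rewards rare letters, which is almost the opposite of what crossword fill wants, but the score-then-rank pipeline is the same shape a real trait-based scorer would use.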


    What’s Next?

    In the coming weeks, I’ll continue refining the scoring dimensions, writing more scripts to calculate traits (especially frequency and lexical interest), and exploring how this scoring system can be surfaced in GNOME Crosswords. I’m excited to see how this evolves — and even more excited to share updates as I go.

    Thanks for reading!

