
1. Overview

In this tutorial, we’ll build a simple yet handy command-line database tool called simpledb from scratch, written in C on a Linux environment. Instead of building a complex relational system, we’ll use JSON files as a lightweight data store, allowing us to inspect and manipulate records directly with standard UNIX utilities.

We’ll walk through the high-level rationale, key design decisions, and a thorough testing approach.

The full source code, including the C implementation and a Bash script for in-depth testing, is available in our repository. We’ll see only relevant snippets here to keep the explanation clear and concise.

2. Why a Minimal JSON-Based Database?

MySQL and other relational database engines are powerful solutions for large or highly structured applications, but their hardware requirements and complexity are typically prohibitive for very small or embedded projects.

In contrast, simpledb is extremely lightweight. On our Linux Mint 22 testing environment, its compiled binary occupies only 22KiB on disk, and its database operations consume about 1792KiB of RAM, as measured by /usr/bin/time -v. This negligible footprint makes simpledb suitable for embedded platforms, lightweight development environments, Bash scripting, cron jobs, and prototyping.

Each database table is a standard JSON file that we can parse, inspect, or modify using common command-line tools such as grep, join, sort, and jq. In this way, we follow the UNIX philosophy of “doing one thing well” by delegating more complex functionality – such as sophisticated filtering or joins – to external utilities, thereby keeping the simpledb codebase lean and easy to maintain.

In addition, it’s quite easy to convert simpledb’s data from JSON to CSV, which makes it possible to migrate data from simpledb to other databases. We’ll see some examples of CSV export in the final Bash script.
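As a taste of what this enables, a one-line jq filter can turn a table into CSV. Here’s a sketch with hypothetical field names and data (not taken from the article’s test database):

```shell
# Pipe a hypothetical one-record table through jq's @csv filter:
# numbers are emitted bare, strings are double-quoted
echo '[{"id":100,"name":"John Doe","email":"[email protected]"}]' \
  | jq -r '.[] | [.id, .name, .email] | @csv'
# prints: 100,"John Doe","[email protected]"
```

The -r flag makes jq emit raw strings rather than JSON-encoded ones, which is what we want for a CSV file.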

3. Core Requirements and Architecture

Let’s establish the core requirements and architectural principles that underlie the design of simpledb.

3.1. Data Model

Our JSON-based design strikes a balance between simplicity and human readability. We also avoid the complexity of maintaining a proprietary or binary format, preferring full compatibility with version control systems and existing Linux utilities:

  • A database is represented by a directory on disk
  • Each table is a single JSON file containing an array of objects
  • Each object is a record with one or more fields, such as id, name, email, and age
  • A unique positive integer id serves as the primary key
  • If the id is omitted, simpledb generates one incrementally
  • Data sent to stdout must adhere to the JSON Lines format for easy parsing in scripts
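For instance, a hypothetical users table would live in the database directory as a users.json file holding a plain JSON array (the records below are made-up sample data):

```json
[
  { "id": 1, "name": "Alice", "email": "[email protected]", "age": 30 },
  { "id": 2, "name": "Bob", "email": "[email protected]", "age": 25 }
]
```

When printed by list or get, each record goes to stdout as one JSON object per line, which is what makes the output easy to consume with line-oriented tools.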

In keeping with a minimalist approach, simpledb doesn’t perform extensive data validation beyond what is strictly necessary. This keeps the codebase lean and promotes faster development.

3.2. Command-Line Interface

Whenever we invoke simpledb without valid parameters, or explicitly ask for the parameter syntax with --help, it displays the following instructions:

$ ./simpledb --help
Usage:
  ./simpledb --db-path <PATH> COMMAND [ARGS...]

Commands:
  list <table>
  get <table> field=value
  save <table> field1=value1 [field2=value2 ...]
  delete <table> field=value

Options:
  --db-path <PATH>   Required. Path to the database directory.

Let’s take a closer look at the commands:

  • If a record with the given id already exists, the save command updates its fields
  • If no such record exists, save appends a new record to the table
  • list prints all records
  • get prints the records where field=value
  • delete removes the records where field=value

Many simple and advanced examples of these commands are included in the final Bash script.

The dual behavior of save is known as upsert and is common in many databases and persistence frameworks. It simplifies record management by combining insert and update logic into a single command.

On successful operations, simpledb prints the appropriate JSON records or a success message. On failure due to malformed JSON, a missing --db-path, invalid parameters, no matching records for delete or get, or other unexpected conditions, it prints an error message to stderr and exits with a non-zero status.

3.3. Atomic Writes and JSON Parsing

Each command begins by reading the contents of the relevant JSON file into memory and then parsing it using cJSON. This library provides a straightforward API for parsing and building JSON objects in C, keeping the code simple and free of excessive boilerplate.

We can use this command to install cJSON on a Debian or Ubuntu-based system:

$ sudo apt install libcjson-dev

As changes are made to the in-memory representation, simpledb stores the updated JSON. However, instead of writing directly to the final file, it first writes all changes to a temporary file and then performs an atomic rename operation. This renaming step replaces the old file with the newly written version in one go, preventing partial or corrupt writes in case of unexpected failures.

However, according to POSIX guarantees, the rename() system call is atomic only if both the source and target files reside on the same filesystem. This ensures that no intermediate state of the file is visible to other processes. Atomicity isn’t guaranteed on different mounted filesystems, networked filesystems such as NFS, or non-POSIX compliant filesystems. Simply put, simpledb is intended to run only on a UNIX-compatible filesystem such as ext4, XFS, ZFS, or Btrfs.
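We can observe the same pattern from the shell: on the same filesystem, mv is implemented via rename(), so it replaces the target atomically. Here’s a sketch with hypothetical file names:

```shell
# Write the new table contents to a temporary file first
printf '[{"id":1,"name":"Alice"}]' > users.json.tmp

# Then swap it in with a single atomic rename: readers see either
# the old file or the new one, never a partially written mix
mv users.json.tmp users.json

cat users.json
# prints: [{"id":1,"name":"Alice"}]
```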

4. Implementation Outline

We’ll explore some core components of simpledb, focusing on data flow, command dispatch, JSON handling, and atomic writing.

4.1. Entry Point: main()

The main() function is responsible for:

  • Parsing command-line parameters
  • Choosing which command to execute: list, get, save, or delete
  • Handling missing or invalid arguments and printing usage instructions

The program requires at least four command-line arguments (the program name counts as the first), since this is the expected calling format:

$ ./simpledb --db-path <PATH> command ...

This simplified diagram shows the main() execution flow in an intuitive way:

[Diagram: simpledb main() execution flow]

The code starts by checking for the minimum number of arguments:

[...]
if (argc < 4) {
    print_usage(argv[0]);
    return 1;
}
[...]

We then proceed with a loop to identify the database path and the command:

[...]
int i = 1;
for (; i < argc; i++) {
    if (strcmp(argv[i], "--db-path") == 0 || strcmp(argv[i], "-d") == 0) {
        if (i + 1 < argc) {
            db_path = argv[++i];
            continue;
        } else {
            fprintf(stderr, "Error: --db-path requires an argument\n");
            return 1;
        }
    } else {
        // This is likely the command
        command = argv[i];
        i++;
        break;
    }
}
[...]

The database path must exist and be a directory, which adds an extra layer of filesystem validation:

[...]
struct stat st;
if (stat(db_path, &st) != 0) {
    fprintf(stderr, "Error: Database path '%s' does not exist.\n", db_path);
    return 1;
}
if (!S_ISDIR(st.st_mode)) {
    fprintf(stderr, "Error: '%s' is not a directory.\n", db_path);
    return 1;
}
[...]

Each command has its own requirements for the number and format of arguments. For example, the get and delete commands expect exactly one additional argument of the form field=value, while the save command allows multiple field=value pairs. The command_args array stores these extra arguments so that the underlying command functions can handle them correctly:

[...]
char* command_args[MAX_COMMAND_ARGS];
int command_args_count = 0;
[...]
for (; i < argc && command_args_count < MAX_COMMAND_ARGS; i++) {
    command_args[command_args_count++] = argv[i];
}
[...]

We take care not to exceed MAX_COMMAND_ARGS (the maximum number of command arguments) to avoid buffer overflow problems.

In the get and delete commands, the implementation explicitly searches for the = character to split the field from its corresponding value:

[...]
} else if (strcmp(command, "get") == 0) {
    // Expects: get <table> field=value
    if (command_args_count != 1) {
        print_usage(argv[0]);
        return 1;
    }
    char* eq = strchr(command_args[0], '=');
    if (!eq) {
        fprintf(stderr, "Error: Invalid get argument '%s'. Use field=value.\n", command_args[0]);
        return 1;
    }
    *eq = '\0';
    const char* field = command_args[0];
    const char* value = eq + 1;
    return command_get(db_path, table_name, field, value);
[...]

This is just a taste of the main() function, which acts as a dispatcher, calling the appropriate command function or printing usage instructions if the command isn’t recognized.

4.2. Table Loading and Saving

Our JSON data is stored in individual files. We implement two helper routines:

  • load_table(): Reads the JSON file, parses it using cJSON, and returns a cJSON* array
  • save_table(): Writes the updated JSON array back to disk using an atomic rename operation

In this case, the code is quite compact, so there is no need for diagrams:

/* --------------------------------------------------------------------------
 * Load the JSON array from <table>.json, or create an empty JSON array if file
 * doesn't exist. Return a cJSON pointer, or NULL on error.
 * -------------------------------------------------------------------------- */
static cJSON* load_table(const char* db_path, const char* table_name) {
    // Construct the file path: e.g., db_path/users.json
    char filepath[1024];
    snprintf(filepath, sizeof(filepath), "%s/%s.json", db_path, table_name);

    char* content = read_file(filepath);
    cJSON* root = NULL;

    if (content) {
        // Parse the JSON
        root = cJSON_Parse(content);
        free(content);

        // If parse failed or root is not an array, create a new array
        if (!root || !cJSON_IsArray(root)) {
            if (root) {
                cJSON_Delete(root);
            }
            root = cJSON_CreateArray();
        }
    } else {
        // File not found or not readable; let's assume it's just empty
        root = cJSON_CreateArray();
    }

    return root;
}

/* --------------------------------------------------------------------------
 * Write the JSON array back to <table>.json (atomically).
 * -------------------------------------------------------------------------- */
static int save_table(const char* db_path, const char* table_name, cJSON* root) {
    if (!root) return -1;

    char* print_buffer = cJSON_PrintUnformatted(root);
    if (!print_buffer) {
        return -1;
    }

    char filepath[1024];
    snprintf(filepath, sizeof(filepath), "%s/%s.json", db_path, table_name);

    int ret = write_file_atomic(filepath, print_buffer);
    free(print_buffer);
    return ret;
}

Here, write_file_atomic() ensures that data is never corrupted by partial writes:

/* --------------------------------------------------------------------------
 * Utility: Write a temporary file, then rename it to ensure atomic updates.
 * Returns 0 on success, non-zero on error.
 * -------------------------------------------------------------------------- */
static int write_file_atomic(const char* filename, const char* data) {
    // Create a temp file name
    char temp_filename[1024];
    snprintf(temp_filename, sizeof(temp_filename), "%s.tmp", filename);

    // Write data to temp file
    FILE* fp = fopen(temp_filename, "wb");
    if (!fp) {
        return -1;
    }
    size_t len = strlen(data);
    if (fwrite(data, 1, len, fp) < len) {
        fclose(fp);
        remove(temp_filename);  // don't leave a stale temp file behind
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(temp_filename);
        return -1;
    }

    // Atomically rename the temp file to the actual file
    if (rename(temp_filename, filename) != 0) {
        remove(temp_filename);
        return -1;
    }
    return 0;
}

This straightforward approach combines cJSON parsing with atomic writes to ensure data integrity: an unexpected failure mid-write can’t leave a table file half-written.

4.3. Overall Operation

So far, we’ve seen the implementation of some of the trickier parts of simpledb in detail. We won’t go any further into the code, but rather look at it as a whole:

[Diagram: simpledb overall operation]

Now, let’s move on to testing.

5. Testing With a Bash Script

In the testing folder of our repository, the script test_simpledb.sh provides a comprehensive test set covering these categories:

  • Basic usage and error handling
  • Basic create and list
  • Save and update and then list
  • Get by field
  • Delete
  • Multiple tables in the same DB
  • Second database directory
  • Using grep and jq for advanced filtering
  • Simulating a JOIN using standard UNIX tools
  • Testing automatic ID generation and invalid ID values

There are 45 self-explanatory tests. Here is an example:

[...]
echo "- Adding a new record with id=100 to 'users' in $DB1:"
$SIMPLEDB --db-path "$DB1" save users id=100 name="John Doe" age=30 [email protected]

echo "- Listing 'users' again (should show John):"
$SIMPLEDB --db-path "$DB1" list users

echo "- Updating record with id=100 (changing email):"
$SIMPLEDB --db-path "$DB1" save users id=100 [email protected]

echo "- Listing 'users' again (should show updated email):"
$SIMPLEDB --db-path "$DB1" list users
[...]

The test_simpledb.log file contains the expected output. Here is a small but interesting snippet:

[...]
### 9) Simulating a 'join' between tables...
- Creating an 'orders' table in testdb1...
{"id":"1","order_id":"9001","user_id":"100","product":"Red Book","price":"15.00"}
{"id":"2","order_id":"9002","user_id":"101","product":"Blue Pen","price":"2.50"}
{"id":"3","order_id":"9003","user_id":"999","product":"Green Pencil","price":"1.00"}
- Listing 'orders':
{"id":"1","order_id":"9001","user_id":"100","product":"Red Book","price":"15.00"}
{"id":"2","order_id":"9002","user_id":"101","product":"Blue Pen","price":"2.50"}
{"id":"3","order_id":"9003","user_id":"999","product":"Green Pencil","price":"1.00"}

#### 9a) Converting 'users' to CSV (id,name,email) => users.csv
"100","John Doe","[email protected]"
"101","Alpha Tester","[email protected]"
"102","Beta Tester","[email protected]"

#### 9b) Converting 'orders' to CSV (order_id,user_id,product,price) => orders.csv
"9001","100","Red Book","15.00"
"9002","101","Blue Pen","2.50"
"9003","999","Green Pencil","1.00"

#### 9c) Sort both CSV files by their key for join.
- Sorted 'users':
"100","John Doe","[email protected]"
"101","Alpha Tester","[email protected]"
"102","Beta Tester","[email protected]"
- Sorted 'orders':
"9001","100","Red Book","15.00"
"9002","101","Blue Pen","2.50"
"9003","999","Green Pencil","1.00"

#### 9d) Join by user_id (users.id == orders.user_id)
(We have to specify that for 'users' the join field is column 1, for 'orders' it's column 2)
- Result of join (joined.csv):
"100","John Doe","[email protected]","9001","Red Book","15.00"
"101","Alpha Tester","[email protected]","9002","Blue Pen","2.50"
[...]
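The join in the log can be reproduced with coreutils alone. Below is a self-contained sketch with hypothetical CSV data, where the users key is column 1 and the orders key is column 2, mirroring the note in step 9d:

```shell
# Two small CSV files: users keyed on column 1, orders on column 2
printf '"100","John Doe"\n"101","Alpha Tester"\n' > users.csv
printf '"9001","100","Red Book"\n"9002","101","Blue Pen"\n' > orders.csv

# Both inputs must be sorted on their respective join fields
sort -t, -k1,1 users.csv -o users.csv
sort -t, -k2,2 orders.csv -o orders.csv

# Join users column 1 against orders column 2
join -t, -1 1 -2 2 users.csv orders.csv
# prints:
# "100","John Doe","9001","Red Book"
# "101","Alpha Tester","9002","Blue Pen"
```

Note that join prints the join field first, followed by the remaining fields of the first file, then those of the second.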

Suppose we prefer an automated approach instead of manually inspecting the output. In that case, we can integrate our tests with a framework like Bats (Bash Automated Testing System) or a similar tool that allows us to specify the expected result for each command.

6. Conclusion

In this article, we explored the rationale, design, and implementation of a minimal command-line database tool, simpledb, which stores data in the JSON format in a Linux environment.

By taking a JSON-based approach and using standard UNIX utilities for advanced manipulation, we demonstrated how to maintain a lightweight codebase that remains both highly portable and easy to inspect. With features such as upsert-like save command behavior, atomic file writes, and minimal data validation, simpledb shines as an ideal solution for small, embedded, or scripting scenarios where traditional relational databases may be overkill.

By combining simpledb with common tools like grep, jq, and join, we unleash the power of the broader UNIX ecosystem for filtering, formatting, and data analysis.

We also saw the importance of rigorous testing with a Bash script that demonstrates how simpledb handles everything from basic record creation to more advanced tasks like simulating joins.