logging: implement journald-based local log retention #1511

Merged
brianmcgillion merged 2 commits into tiiuae:main from juliuskoskela:log-retention-pr-2
Oct 30, 2025
Conversation

@juliuskoskela
Contributor

@juliuskoskela juliuskoskela commented Oct 29, 2025

Description of Changes

This PR implements local log retention using systemd-journald instead of running a resource-intensive Loki server in admin-vm.

Changes:

  • Added journalRetention configuration options to ghaf.logging module:
    • enable (default: true) - Enable/disable journald retention
    • maxRetentionDays (default: 30) - Days to retain logs locally
    • maxDiskUsage (default: "500M") - Maximum disk space for logs
  • Applied journald retention configuration to both logging clients and server
  • Configured journald with MaxRetentionSec, SystemMaxUse, SystemMaxFileSize=100M, and Storage=persistent
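
As a rough illustration of how these options map onto journald settings, the sketch below converts a retention period in days to the seconds value used by MaxRetentionSec and prints the resulting [Journal] fragment. The helper names `days_to_seconds` and `render_journald_conf` are hypothetical and are not part of the ghaf.logging module, which generates this configuration from Nix.

```shell
#!/bin/sh
# Hypothetical helper: convert a retention period in days to seconds,
# the unit a bare MaxRetentionSec value is interpreted in.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Hypothetical helper: print a [Journal] fragment for the given
# retention days ($1) and disk usage cap ($2).
render_journald_conf() {
  days="$1"
  max_use="$2"
  cat <<EOF
[Journal]
Storage=persistent
MaxRetentionSec=$(days_to_seconds "$days")
SystemMaxUse=$max_use
SystemMaxFileSize=100M
EOF
}

# Defaults from this PR: 30 days and 500M.
render_journald_conf 30 500M
```

With the defaults, 30 days works out to 2592000 seconds, which matches the value expected in the test steps.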

Rationale:
The initial approach of running a Loki server locally in admin-vm consumed too much CPU and memory. This implementation uses journald's built-in retention capabilities while Alloy continues to forward logs to the remote server. Alloy's WAL (Write-Ahead Log) buffers logs during network outages so that they are still synced once the connection is restored.

Type of Change

  • New Feature
  • Bug Fix
  • Improvement / Refactor

Related Issues / Tickets

Checklist

  • Clear summary in PR description
  • Detailed and meaningful commit message(s)
  • Commits are logically organized and squashed if appropriate
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • Author has run make-checks and it passes
  • All automatic GitHub Action checks pass - see actions
  • Author has added reviewers and removed PR draft status

Testing Instructions

Applicable Targets

  • Orin AGX aarch64
  • Orin NX aarch64
  • Lenovo X1 x86_64
  • Dell Latitude x86_64
  • System 76 x86_64

Installation Method

  • Requires full re-installation
  • Can be updated with nixos-rebuild ... switch
  • Other:

Test Steps To Verify:

On ghaf-host:

  1. Check journald retention configuration is applied:

    cat /etc/systemd/journald.conf

    Expected: Should show MaxRetentionSec=2592000, SystemMaxUse=500M, SystemMaxFileSize=100M, Storage=persistent

  2. Verify disk usage is within limits:

    journalctl --disk-usage

    Expected: Total usage should be under 500M

  3. Verify logs are retained across boots:

    journalctl --list-boots

    Expected: Should show multiple boot entries

  4. Verify Alloy is still reading and forwarding logs:

    systemctl status alloy
    sudo ls -lh /var/lib/alloy/data-alloy/loki.source.journal.journal/

    Expected: Alloy service running, positions.yml file recently modified

  5. Test log retention period:

    journalctl --vacuum-time=30d

    Expected: Should only delete logs older than 30 days

On admin-vm:

  1. Check journald retention configuration:

    cat /etc/systemd/journald.conf
    systemctl status alloy

    Expected: Same retention settings applied, Alloy receiving logs from clients

  2. Verify admin-vm is forwarding to remote Loki server:

    sudo ls -lh /var/lib/alloy/data-alloy/

    Expected: WAL directory present with recent activity
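
The configuration checks in the steps above could also be scripted rather than read by eye. The following is a minimal sketch, assuming the settings appear as exact `Key=Value` lines in the file; `check_journald_conf` is a hypothetical helper, not something shipped with this PR.

```shell
#!/bin/sh
# Hypothetical helper: report whether each expected "Key=Value" line
# is present verbatim in the given journald configuration file.
# Returns non-zero if any expected line is missing.
check_journald_conf() {
  conf="$1"
  shift
  status=0
  for expected in "$@"; do
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    if grep -qxF -- "$expected" "$conf"; then
      echo "ok: $expected"
    else
      echo "missing: $expected"
      status=1
    fi
  done
  return "$status"
}
```

On a target this might be invoked as `check_journald_conf /etc/systemd/journald.conf MaxRetentionSec=2592000 SystemMaxUse=500M SystemMaxFileSize=100M Storage=persistent`.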

Add configurable journal retention settings to reduce resource usage.
Logs are retained locally in systemd journal instead of running Loki.

Signed-off-by: juliuskoskela <julius.koskela@unikie.com>

Admin-vm also generates its own logs and needs local retention.

Signed-off-by: juliuskoskela <julius.koskela@unikie.com>
@brianmcgillion brianmcgillion added the Needs Testing CI Team to pre-verify label Oct 29, 2025
@milva-unikie

Tested on Darter Pro (nixos-rebuild switch)

All good!

  • Was able to complete all test steps
  • Logs are being sent to Grafana

@milva-unikie milva-unikie added Tested on System76 and removed Needs Testing CI Team to pre-verify labels Oct 30, 2025
Contributor

@everton-dematos everton-dematos left a comment

I changed MaxRetentionSec to 300 (5 min) in both the client and the server: MaxRetentionSec=${toString (300)}
https://github.com/juliuskoskela/ghaf/blob/log-retention-pr-2/modules/common/logging/client.nix#L76
https://github.com/juliuskoskela/ghaf/blob/log-retention-pr-2/modules/common/logging/server.nix#L123

However, it seems the logs are still accessible even after the 5 min window:

[ghaf@net-vm:~]$ cat /etc/systemd/journald.conf
[Journal]
Storage=persistent
RateLimitInterval=30s
RateLimitBurst=10000


Audit=
MaxRetentionSec=300
SystemMaxUse=500M
SystemMaxFileSize=100M
Storage=persistent


[ghaf@net-vm:~]$ journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
  0 3022cf1e481c47858d4069dfdb5c36c0 Thu 2025-10-30 09:10:47 UTC Thu 2025-10-30 09:24:34 UTC

[ghaf@net-vm:~]$ journalctl --directory=/var/log/journal/38ef38829fe94ed688be4d5b4c137aaf   --no-pager -o short-iso | head -n1
2025-10-30T09:10:47+00:00 net-vm kernel: rtc_cmos 00:03: registered as rtc0

[ghaf@admin-vm:~]$ cat /etc/systemd/journald.conf
[Journal]
Storage=persistent
RateLimitInterval=30s
RateLimitBurst=10000


Audit=
MaxRetentionSec=300
SystemMaxUse=500M
SystemMaxFileSize=100M
Storage=persistent


[ghaf@admin-vm:~]$ journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
  0 22b4559d156443a09fab3bf0de6fc269 Thu 2025-10-30 09:10:43 UTC Thu 2025-10-30 09:25:16 UTC

[ghaf@admin-vm:~]$ journalctl --directory=/var/log/journal/84a916f36de347dba1326cbf9bae595e/ --no-pager -o short-iso | head -n1
2025-10-30T09:10:43+00:00 admin-vm kernel: Linux version 6.17.3 (nixbld@localhost) (gcc (GCC) 14.3.0, GNU ld (GNU Binutils) 2.44) #1-NixOS SMP PREEMPT_DYNAMIC Wed Oct 15 10:04:23 UTC 2025

Is this expected behavior?

@juliuskoskela
Contributor Author

Great testing, thank you. This does appear to be how journald retention works. There are some subtleties here:

  • Deletion applies only to archived (rotated) journal files. Active journal files that journald is still writing to are typically not considered for deletion until they are closed and rotated.
  • You can trigger a manual rotation (journalctl --rotate) followed by vacuuming (journalctl --vacuum-time=5m) to observe retention more immediately.

So in our case we could set up a cron job that runs when the configured retention time is reached and rotates and then vacuums the logs.
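
The rotate-then-vacuum sequence described above might look like the sketch below. The function name `rotate_and_vacuum` and the `DRY_RUN` switch are assumptions for illustration only; nothing like this is included in the PR, and the real commands need root privileges.

```shell
#!/bin/sh
# Hypothetical helper: rotate the active journal files so they become
# archived, then vacuum archived files older than the given time span.
# With DRY_RUN=1 the commands are printed instead of executed.
rotate_and_vacuum() {
  retention="$1"   # a systemd time span, e.g. 5m or 30d
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "journalctl --rotate"
    echo "journalctl --vacuum-time=$retention"
    return 0
  fi
  # Rotation must come first: vacuuming only removes archived files.
  journalctl --rotate && journalctl --vacuum-time="$retention"
}
```

Scheduling such a call periodically, from a cron job or a systemd timer, would enforce the retention window without waiting for a reboot to rotate the journal.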

Collaborator

@mbssrc mbssrc left a comment

For the log clearing, log rotation happens on (re)boot, so the edge case of needing to vacuum could be added with a systemd timer later?

@juliuskoskela
Contributor Author

For the log clearing, log rotation happens on (re)boot, so the edge case of needing to vacuum could be added with a systemd timer later?

Yeah sure!

@brianmcgillion brianmcgillion merged commit d9a55f8 into tiiuae:main Oct 30, 2025
28 checks passed