
Use iterator for lazy evaluation in journal storage’s read_logs#6144

Merged
c-bata merged 10 commits into optuna:master from kAIto47802:journalstorage-use-iterator
Oct 24, 2025
Conversation

@kAIto47802
Collaborator

@kAIto47802 kAIto47802 commented Jun 11, 2025

Motivation

Currently, the read_logs function in journal storage returns a list, meaning all log entries are loaded into memory at once. This leads to high memory usage, especially when handling a large number of entries.
To address this issue, this PR introduces lazy evaluation by replacing the list with a generator, allowing entries to be processed one by one without loading everything into memory.

Description of the changes

  • Replace the list containing all log entries with a generator.
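The essence of the change can be sketched as follows. This is a simplified illustration, not Optuna's actual `read_logs` implementation: `parse` here is a hypothetical stand-in for the storage's per-line deserialization.

```python
import json


def parse(line: str) -> dict:
    # Hypothetical stand-in for Optuna's per-line log deserialization.
    return json.loads(line)


def read_logs_eager(lines):
    # Eager version: every entry is materialized in memory at once.
    return [parse(line) for line in lines]


def read_logs_lazy(lines):
    # Lazy version: a generator yields one entry at a time, so entries can be
    # applied and discarded without holding the whole list in memory.
    for line in lines:
        yield parse(line)


lines = ['{"op": 0}', '{"op": 1}', '{"op": 2}']
assert read_logs_eager(lines) == list(read_logs_lazy(lines))
```

One caveat of this change: a generator can be consumed only once, so a caller such as `apply_logs` must iterate the result in a single pass.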

Benchmarking

I conducted a benchmark on memory usage to confirm the effectiveness of this PR.

Benchmarking Setup

I created a JournalStorage instance with large log files and profiled the memory usage right after it invoked apply_logs with the result of read_logs in the _sync_with_backend method.

The code I used to create the large log files is as follows:

The code to create the large log files
from pathlib import Path

import optuna


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2


storage = optuna.storages.JournalStorage(
    optuna.storages.journal.JournalFileBackend("./journal_storage.log")
)
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(storage=storage, sampler=sampler)
study.optimize(objective, n_trials=100000)

# Truncate the full log into files of 2, 4, 8, ..., 2^19 lines.
for num in [1 << i for i in range(1, 20)]:
    Path(f"journal_storage{num}.log").write_text(
        "\n".join(Path("journal_storage.log").read_text().splitlines()[:num])
    )

The code I used to profile the memory usage after apply_logs (run once on the master branch and once on this PR's branch), and to visualize the results, is as follows:

The benchmarking code

read_logs.py

import argparse

import optuna


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

def main(args: argparse.Namespace) -> None:
    # Constructing JournalStorage triggers _sync_with_backend, which reads and
    # applies all log entries. The memory-profiling instrumentation that prints
    # the "before,after" pair consumed by run_benchmark.sh is not shown in this
    # excerpt.
    optuna.storages.JournalStorage(
        optuna.storages.journal.JournalFileBackend(f"./journal_storage{args.log_length or ''}.log")
    )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--log_length", type=int)
    args = parser.parse_args()
    main(args)
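The memory-measurement code itself is collapsed in the original comment. As a hypothetical sketch of how the "before,after" pair consumed by run_benchmark.sh could be produced, the standard library's tracemalloc can report traced allocation sizes around the storage construction (note that tracemalloc tracks Python-level allocations, not process RSS, so absolute numbers would differ from a sampling profiler):

```python
# Hypothetical instrumentation, not the author's actual script.
import tracemalloc

tracemalloc.start()
before = tracemalloc.get_traced_memory()[0]  # current traced bytes before loading

# Placeholder for loading the JournalStorage instance; kept alive in `payload`
# so its allocations remain counted in the "after" measurement.
payload = [object() for _ in range(10_000)]

after = tracemalloc.get_traced_memory()[0]  # current traced bytes after loading
print(f"{before},{after}")  # the "before,after" CSV pair the shell script appends
```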

run_benchmark.sh

#!/bin/bash

branch=$(git rev-parse --abbrev-ref HEAD | tr '/' '_')
mkdir -p results
outfile="results/apply_logs_${branch}.csv"
echo "length,run,before,after" > $outfile
for run in {1..5}; do
    for k in {1..19}; do
        length=$((1<<k))
        result=$(python read_logs.py --log_length $length)
        echo "$length,$run,$result" >> $outfile
    done
done
The visualization code
from argparse import ArgumentParser, Namespace

import numpy as np
from matplotlib import font_manager
from matplotlib.figure import Figure
import matplotlib.pyplot as plt
import polars as pl


fp = font_manager.FontProperties(fname="/usr/share/fonts/TTF/Times.TTF")


def _prepare_data(branch: str) -> pl.DataFrame:
    df = pl.read_csv(f"results/apply_logs_{branch}.csv")
    return df.group_by("length").agg(
        [
            pl.col("before").mean().alias("before_mean"),
            (pl.col("before").std() / pl.col("before").count().cast(pl.Float64).sqrt()).alias("before_se"),
            pl.col("after").mean().alias("after_mean"),
            (pl.col("after").std() / pl.col("after").count().cast(pl.Float64).sqrt()).alias("after_se"),
        ]
    ).sort("length")


def plot_results(
    data: dict[str, np.ndarray],
    colors: dict[str, str],
    markers: dict[str, str],
    marker_sizes: dict[str, float],
    xlabel: str,
    ylabel: str,
    xlim: tuple[float, float] | None = None,
    ylim: tuple[float, float] | None = None,
    figsize: tuple[float, float] | None = None,
) -> Figure:
    fig, ax = plt.subplots(figsize=figsize)
    for name, d in data.items():
        x, mean, se = d.T
        ax.plot(
            x,
            mean,
            colors[name],
            label=name,
            marker=markers[name],
            markersize=marker_sizes[name] * 1.2,
        )
        ax.fill_between(
            x,
            (mean - se),
            (mean + se),
            alpha=0.2,
            color=colors[name],
        )
    ax.legend(
        loc="upper left",
        fontsize=12,
        prop=(
            font_manager.FontProperties(fname="/usr/share/fonts/TTF/Times.TTF", size=12)
        ),
    )
    ax.set_xlabel(xlabel, fontsize=13, fontproperties=fp)
    ax.set_ylabel(ylabel, fontsize=13, fontproperties=fp)

    ax.set_xscale("log")

    ax.grid(which="major", color="gray", linestyle="--", linewidth=0.5)
    for lbl in ax.get_xticklabels() + ax.get_yticklabels():
        lbl.set_fontproperties(fp)
    ax.tick_params(labelsize=12)

    if xlim is not None:
        ax.set_xlim(*xlim)
    if ylim is not None:
        ax.set_ylim(*ylim)

    return fig


def main(args: Namespace) -> None:
    dfs = [_prepare_data(branch) for branch in args.branches]
    names = ["Original", "This PR"]
    colors = {
        "Original": "#CC79A7",
        "This PR": "#0072B2",
    }
    markers = {
        "Original": "o",
        "This PR": "*",
    }
    marker_sizes = {
        "Original": 6.0,
        "This PR": 8.0,
    }
    for phase in ["before", "after"]:
        data = {
            name: df.select(
                [
                    pl.col("length").cast(pl.Float64),
                    pl.col(f"{phase}_mean").cast(pl.Float64),
                    pl.col(f"{phase}_se").cast(pl.Float64),
                ]
            ).to_numpy()
            for name, df in zip(names, dfs)
        }
        fig = plot_results(
            data=data,
            colors=colors,
            markers=markers,
            marker_sizes=marker_sizes,
            xlabel="Log file length",
            ylabel="Memory usage / MB",
            figsize=(6, 4),
        )
        fig.savefig(f"results/memory_apply_logs_{phase}.png", dpi=300, bbox_inches="tight")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--branches", type=str, nargs="+")
    args = parser.parse_args()

    main(args)

Result

The result is shown in Figure 1, confirming that this PR effectively reduces the memory usage.

[Image: memory_apply_logs_after.png]

Figure 1. The memory usage after the invocation of apply_logs with large log files. The solid lines denote the mean, and the shaded regions denote the standard error, both computed over five independent runs with different random seeds. Compared to the master branch (shown in pink), this PR (shown in blue) effectively reduces the memory usage.

@kAIto47802 kAIto47802 marked this pull request as draft June 11, 2025 07:36
@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale label (Exempt from stale bot labeling.) Jun 18, 2025
@github-actions
Contributor

github-actions bot commented Jul 2, 2025

This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.

@github-actions github-actions bot closed this Jul 2, 2025
@kAIto47802 kAIto47802 reopened this Oct 8, 2025
@kAIto47802 kAIto47802 marked this pull request as ready for review October 8, 2025 08:32
@github-actions github-actions bot removed the stale label (Exempt from stale bot labeling.) Oct 8, 2025
@c-bata
Member

c-bata commented Oct 10, 2025

@sawa3030 Could you review this PR?

@sawa3030
Collaborator

sawa3030 commented Oct 17, 2025

Though I was initially concerned about runtime, the difference seems to be small.

[Image: runtime comparison of master vs. this PR]

I generated the plot using the following scripts, adapted from the code in the PR description above:

read_logs.py
import argparse

import optuna
import time


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

def main(args: argparse.Namespace) -> None:
    start_time = time.time()
    optuna.storages.JournalStorage(
        optuna.storages.journal.JournalFileBackend(f"./journal_storage{args.log_length or ''}.log")
    )
    print(time.time() - start_time)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--log_length", type=int)
    args = parser.parse_args()
    main(args)
run_benchmark.sh
#!/bin/bash

mkdir -p results
outfile="results/apply_logs_master.csv"
echo "length,run,time" > $outfile

outfile="results/apply_logs_pr.csv"
echo "length,run,time" > $outfile

for run in {1..5}; do
    for branch in master pr; do
        git checkout $branch
        for k in {1..17}; do
            length=$((1<<k))
            result=$(python read_logs.py --log_length $length)
            outfile="results/apply_logs_${branch}.csv"
            echo "$length,$run,$result" >> $outfile
        done
    done
done
visualize.py
from argparse import ArgumentParser, Namespace

import numpy as np
from matplotlib import font_manager
from matplotlib.figure import Figure
import matplotlib.pyplot as plt
import polars as pl


fp = font_manager.FontProperties(family="serif")


def _prepare_data(branch: str) -> pl.DataFrame:
    df = pl.read_csv(f"results/apply_logs_{branch}.csv")
    return df.group_by("length").agg(
        [
            pl.col("time").mean().alias("time_mean"),
            (pl.col("time").std() / pl.col("time").count().cast(pl.Float64).sqrt()).alias("time_se"),
        ]
    ).sort("length")


def plot_results(
    data: dict[str, np.ndarray],
    colors: dict[str, str],
    markers: dict[str, str],
    marker_sizes: dict[str, float],
    xlabel: str,
    ylabel: str,
    xlim: tuple[float, float] | None = None,
    ylim: tuple[float, float] | None = None,
    figsize: tuple[float, float] | None = None,
) -> Figure:
    fig, ax = plt.subplots(figsize=figsize)
    for name, d in data.items():
        x, mean, se = d.T
        ax.plot(
            x,
            mean,
            colors[name],
            label=name,
            marker=markers[name],
            markersize=marker_sizes[name] * 1.2,
        )
        ax.fill_between(
            x,
            (mean - se),
            (mean + se),
            alpha=0.2,
            color=colors[name],
        )
    ax.legend(
        loc="upper left",
        fontsize=12,
        prop=(
            font_manager.FontProperties(family="serif", size=12)
        ),
    )
    ax.set_xlabel(xlabel, fontsize=13, fontproperties=fp)
    ax.set_ylabel(ylabel, fontsize=13, fontproperties=fp)

    ax.set_xscale("log")

    ax.grid(which="major", color="gray", linestyle="--", linewidth=0.5)
    for lbl in ax.get_xticklabels() + ax.get_yticklabels():
        lbl.set_fontproperties(fp)
    ax.tick_params(labelsize=12)

    if xlim is not None:
        ax.set_xlim(*xlim)
    if ylim is not None:
        ax.set_ylim(*ylim)

    return fig


def main(args: Namespace) -> None:
    dfs = [_prepare_data(branch) for branch in args.branches]
    names = ["Original", "This PR"]
    colors = {
        "Original": "#CC79A7",
        "This PR": "#0072B2",
    }
    markers = {
        "Original": "o",
        "This PR": "*",
    }
    marker_sizes = {
        "Original": 6.0,
        "This PR": 8.0,
    }
    for phase in ["time"]:
        data = {
            name: df.select(
                [
                    pl.col("length").cast(pl.Float64),
                    pl.col(f"{phase}_mean").cast(pl.Float64),
                    pl.col(f"{phase}_se").cast(pl.Float64),
                ]
            ).to_numpy()
            for name, df in zip(names, dfs)
        }
        fig = plot_results(
            data=data,
            colors=colors,
            markers=markers,
            marker_sizes=marker_sizes,
            xlabel="Log file length",
            ylabel="Runtime / sec",
            figsize=(6, 4),
        )
        fig.savefig(f"results/runtime_apply_logs_{phase}.png", dpi=300, bbox_inches="tight")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--branches", type=str, nargs="+")
    args = parser.parse_args()

    main(args)

Collaborator

@sawa3030 sawa3030 left a comment


LGTM

@c-bata
Member

c-bata commented Oct 22, 2025

@kAIto47802 The initial impression of this PR looks very good to me. Could you merge the latest master branch to resolve CI issues?

@c-bata c-bata added the enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. label Oct 22, 2025
@c-bata c-bata added this to the v4.6.0 milestone Oct 22, 2025
@codecov

codecov bot commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.13%. Comparing base (7f6c6c3) to head (5310eb0).
⚠️ Report is 427 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6144      +/-   ##
==========================================
- Coverage   89.21%   89.13%   -0.08%     
==========================================
  Files         209      209              
  Lines       13935    13935              
==========================================
- Hits        12432    12421      -11     
- Misses       1503     1514      +11     

☔ View full report in Codecov by Sentry.

@kAIto47802
Collaborator Author

kAIto47802 commented Oct 22, 2025

Thank you for the comment. I've merged the latest master branch, resolving the conflict.

@sawa3030 sawa3030 removed their assignment Oct 23, 2025
Member

@c-bata c-bata left a comment


I confirmed that the memory usage decreased from 632MB to 229MB while loading 100k trials (the journal file contained 500,001 lines). LGTM!

master

$ scalene --memory journal_storage_mem.py
  Memory usage: ▁▁▁▁▁▂▂▃▃▃▃▃▃▄▄▅▅▅▅▅▆▇▇████ (max: 632.315 MB, growth rate:  98%)
% of time = 100.00% (20.507s) out of 20.507s.
(remaining output omitted)

This PR

$ scalene --memory journal_storage_mem.py
  Memory usage: ▁▁▂▂▂▃▃▃▄▄▅▅▅▆▆▇▆▇▇▇███ (max: 229.319 MB, growth rate:  96%)
% of time = 100.00% (19.675s) out of 19.675s.
(remaining output omitted)
journal_storage_mem.py
import optuna
import time
from optuna.storages import JournalStorage
from optuna.storages.journal import JournalFileBackend


study_name = "journal_storage_bench"
journal_file_path = "./journal_storage_bench.log"


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    y = trial.suggest_float("y", -10, 10)
    trial.set_user_attr("dummy_attr", "dummy_value")
    return (x - 2) ** 2 + (y - 3) ** 2


def create_study() -> None:
    storage = JournalStorage(JournalFileBackend(journal_file_path))
    sampler = optuna.samplers.RandomSampler(1)
    study = optuna.create_study(storage=storage, sampler=sampler, direction="minimize", study_name=study_name, load_if_exists=True)
    start = time.time()
    study.optimize(objective, n_trials=100000, n_jobs=10)
    elapsed = time.time() - start
    print(f"Elapsed time: {elapsed:.4f} seconds")



if __name__ == "__main__":
    # create_study()
    JournalStorage(JournalFileBackend(journal_file_path))


@c-bata c-bata merged commit bfa6f8a into optuna:master Oct 24, 2025
14 checks passed
