Mojmir Mutny

How much budget do I need with active learning?

2024-09-27T00:00:00-07:00

TLDR: In this blog post I discuss the expected budget of an experiment design (active learning) campaign.

What is Active Learning?

Active learning (or design of experiment, if you got your education before the 2000s) is a machine learning (statistics) paradigm that helps to gather the most informative subset of a dataset to label (or measure) in order to improve the performance (whether raw or in terms of generalization, etc.) of your machine learning model.

Data is expensive, and accurate data is even more expensive, so usually, if you have some high-impact problem, the data is scarce and you need to plan your budget accordingly. For example, suppose you are working on protein design. You could create completely random mutants of your protein to understand the impact of mutations on its function, but maybe there’s a more efficient way to collect data. Maybe you want to cover a diverse set of sequences instead of sampling randomly from the pool. Maybe you want to minimize duplicates, or better yet, close duplicates in terms of function. Indeed, this is possible, and this is where active learning comes into play. A typical pipeline looks like the following:

1. Suppose your function $f(x)$ has a certain complexity. This means the form of the relationship between $x$ and $y$ is established. We do not know $f$, but we know something about it. Examples include, but are not limited to (and possibly in combination):
   - $f$ is linear
   - $f$ is a shallow 2-layer network
   - $f$ is a Gaussian process (RKHS) with a certain kernel
   - $f$ is positive, monotone, or convex, or has any other shape constraint
   - $f$ is linear on the NN-embeddings $\phi(x)$
   - $f$ is a fine-tuned NN with a couple of gradient steps
2. The user measures $y = f(x) + \epsilon$. Before experimenting, one has to establish the structure of the random noise $\epsilon$, in other words, the error of measurement. Often, in practice, this is close to Gaussian for real-valued signals. For discrete signals, Poisson error might be more appropriate, but this is beyond the scope of this blog post.
3. Calculate a sampling strategy $\pi$.
4. Sample or use the sampling strategy on the search space $\mathcal{X}$. The search space is defined by the user.
5. Use part of the experimental budget and go back to step 1. Usually, you can update the estimate of your relationship, but the complexity remains. In other words, if you believe a neural network explains the relationship, you just update your belief about the parameters, not the actual form of the relationship.

The promise of this scheme is that you would improve your performance much faster as a function of data compared to, say, randomized sampling. Such an example would look like the one below.

Even with random sampling, we reduce error, and we decrease it at the same asymptotic rate! With truly non-parametric models that allow us to change the norm of the function, the rate might differ from the typical $1/N$, but for any model with limited capacity, the only thing that changes is the constant. However, this constant can be very large and might even be unbounded for certain $N_0 < N$! Hence, non-asymptotically, active learning can lead to significant improvements over other methods, such as random sampling, which might have a very large constant. The capacity is limited either because the function has finitely many dimensions or, in functional spaces, because we assume the norm of the true non-parametric $f$ is bounded in a certain functional space. There is a very nice book by Grace Wahba [1] to learn more about non-parametric statistics.

Figure 1: MSE (mean squared error) of active learning vs. random sampling. Notice that eventually we approach the $ 1/N $ rate once the capacity of the model is reached.

How much data do I need?

The above plot is post-hoc: we collected the data and then evaluated the performance. If we want to achieve a fixed performance level, say $\nu$, how many data points do we need? We can read this number off the graph above (see the dashed vertical line). However, the graph is only available after execution. Can we get this number before we start any of it? Well, it turns out this is the core question that many theoretical works in the active learning field analyze.

Practically, there are a couple of answers, but let us summarize them into two ways, one general and one elegant. In both cases, however, we need to understand how to construct instances of our problem, either by sampling or by assuming regularity of our function.

Simulations. We can always try to simulate the above with our guesses of possible $f$ and see what happens. See a couple of simulations below, where each simulation is run with a different $f$. The dashed lines are different instances of $f$. For $\nu = 0.1$ we end up with a budget of about 400-500 data points. We can perform a similar analysis for random sampling as well.
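A minimal sketch of such a simulation for the random-sampling baseline. It assumes a one-dimensional grid as the search space, a squared exponential kernel, and guesses of $f$ drawn from the corresponding Gaussian process. The function names and all parameter values (`gamma`, `lam`, `sigma`, the budgets) are illustrative choices and are not the settings behind the plots discussed here.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    """Squared exponential kernel between two 1D arrays of points."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def simulate_random_sampling(budgets, n_functions=5, sigma=0.05, lam=1.0, seed=0):
    """Mean squared error over the grid for each sampled f and each budget."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 64)
    K_grid = rbf_kernel(grid, grid)
    # jitter keeps the Cholesky factorization numerically stable
    chol = np.linalg.cholesky(K_grid + 1e-6 * np.eye(len(grid)))
    errors = np.zeros((n_functions, len(budgets)))
    for j in range(n_functions):
        f = chol @ rng.standard_normal(len(grid))      # one guess of f
        for b, n in enumerate(budgets):
            idx = rng.integers(0, len(grid), size=n)   # random design
            y = f[idx] + sigma * rng.standard_normal(n)
            K = K_grid[np.ix_(idx, idx)]
            # kernel ridge estimate: k_x^T (K + sigma^2 lam I)^{-1} y
            alpha = np.linalg.solve(K + sigma**2 * lam * np.eye(n), y)
            f_hat = K_grid[:, idx] @ alpha
            errors[j, b] = np.mean((f_hat - f) ** 2)
    return errors

budgets = [5, 10, 20, 40, 80, 160]
errors = simulate_random_sampling(budgets)
mean_error = errors.mean(axis=0)
nu = 0.1
reached = [n for n, e in zip(budgets, mean_error) if e <= nu]
print("mean error per budget:", mean_error)
print("smallest simulated budget reaching nu:", reached[0] if reached else None)
```

Replacing the random design `idx` with the output of an active learning strategy gives the corresponding curve for active learning.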

Kernelized regression (RKHS or the closely related Gaussian processes) with Gaussian likelihood (or alike). For certain likelihoods and model classes, we can in fact calculate the worst-case expected error in advance and in closed form (or even the worst-case maximum error over the dataset, though this is not covered here)!

In fact, it is so simple that we can derive it here. I will use Hilbert space notation, where the kernel is $k(x,y) = \langle\phi(x), \phi(y)\rangle$ and the function is evaluated as $f(x) = \langle f, \phi(x)\rangle$. Since I am lazy, I am just going to use a transpose to indicate the inner product, $\langle f, \phi(x)\rangle = f^\top \phi(x)$.

The idea is simple, we observe corrupted values of $f(x) $ as $y = f(x) + \epsilon$, where $\epsilon$ has a known, in this case, Gaussian likelihood, and $f$ is the Hilbert space element.

In this specific derivation, we are going to calculate the average accuracy over the search space $\mathcal{X}$ as the performance metric we care about. The estimate we use to provide us the predictions and achieves this performance on this metric is the estimate $\hat{f}$ of $f$. The error is its deviation from the true $f$ on the whole search space:

\[E = \frac{1}{|\mathcal{X}|} \mathbb{E}\left[\sum_{x\in \mathcal{X}} (\hat{f}(x) - f(x))^2\right].\]

The expectation is over the random noise realizations $\epsilon$. Let us use a shorthand for all the evaluated measurements, $X = [\phi(x_1), \dots, \phi(x_n)]$, which gives an operator $\mathcal{H} \rightarrow \mathbb{R}^{n}$ with adjoint $X^\top$, and let $K = XX^\top$ denote the kernel matrix of the measured points. In the case of regularized least squares regression, the estimator $\hat{f}$ can be represented as $\hat{f} = X^\top (K+\sigma^2\lambda I)^{-1}y = X^\top (K+\sigma^2 \lambda I)^{-1}(Xf + \epsilon)$. Plugging this estimator in, using the reproducing property, and putting $f$ under the same bracket leads to:

\[E(X) = \frac{1}{|\mathcal{X}|} \sum_{x\in \mathcal{X}} \mathbb{E}\left[\left(\left(\left(X^\top (K+\sigma^2\lambda I)^{-1}X - I\right)f + X^\top (K+\sigma^2\lambda I)^{-1}\epsilon\right)^\top\phi(x)\right)^2\right]\]

Let us take the expectation with simple fixed parameters: $\mathbb{E}[\epsilon] = 0$, $\mathbb{E}[\epsilon^2] = \sigma^2$, and $\|f\|\leq 1/\lambda$, i.e. a bounded norm in the Hilbert space. For a proper generalization to arbitrary noise levels, consult reference [2].

Let us define a shorthand for the average covariance over the whole search space:

\[V_s := \frac{1}{|\mathcal{X}|} \sum_{x\in \mathcal{X}} \phi(x)\phi(x)^\top.\]

Effective Dimension - A metric that can hint.

After using all of this, we will arrive at:

\[E(X) \leq \sigma^2 \operatorname{Trace}\left(\left(\lambda I + \frac{X^\top X}{\sigma^2}\right)^{-1} V_s \right) \|f\|^2.\]

Notice that this simple and elegant expression is able to bound the error. First of all, notice that $I$ and $X^\top X$ are $\mathcal{H} \rightarrow \mathcal{H}$ operators, hence we need to use the matrix inversion lemma to evaluate them. Namely,

\[E(X) \leq \frac{\sigma^2}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \phi(x)^\top \left(\lambda I + \frac{X^\top X}{\sigma^2}\right)^{-1} \phi(x) = \frac{\sigma^2}{\lambda|\mathcal{X}|} \operatorname{Trace}\left(\Phi(\mathcal{X}) \Phi(\mathcal{X})^\top - \Phi(\mathcal{X}) X^\top \left(K + \sigma^2\lambda I\right)^{-1} X \Phi(\mathcal{X})^\top\right),\]

where $\Phi(\mathcal{X})$ is the stacked embeddings of the whole search space (one row per element), $K = XX^\top$, and the design-independent factor $\|f\|^2$ is dropped. This can be conveniently evaluated on a computer, since every entry is a kernel evaluation: $\Phi(\mathcal{X})\Phi(\mathcal{X})^\top$ has entries $k(x_s, x_{s'})$ and $\Phi(\mathcal{X})X^\top$ has entries $k(x_s, x_i)$, where $s$ indexes the search space and $i$ the measured points.
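As a sanity check of the matrix inversion lemma step, the following sketch evaluates the bound once in feature space and once using only kernel evaluations, and compares the two numbers. It assumes finite-dimensional random features standing in for $\phi$, so that both forms can be computed; the sizes and the values of `sigma` and `lam` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 6, 50        # feature dimension, measurements, search space size
sigma, lam = 0.1, 1.0

Phi = rng.standard_normal((m, d))                # embeddings of the search space (rows)
X = Phi[rng.choice(m, size=n, replace=False)]    # embeddings of the measured points

# Feature-space form: sigma^2/|X| * sum_x phi(x)^T (lam I + X^T X / sigma^2)^{-1} phi(x)
M = lam * np.eye(d) + X.T @ X / sigma**2
bound_features = sigma**2 / m * np.trace(Phi @ np.linalg.solve(M, Phi.T))

# Kernel form: only inner products of embeddings (kernel evaluations) are used
K = X @ X.T                                      # n x n kernel matrix
K_cross = Phi @ X.T                              # m x n cross-kernel matrix
k_diag = np.sum(Phi * Phi, axis=1)               # k(x, x) on the search space
solved = np.linalg.solve(K + sigma**2 * lam * np.eye(n), K_cross.T)   # n x m
reduction = np.sum(K_cross * solved.T, axis=1)   # k_x^T (K + sigma^2 lam I)^{-1} k_x
bound_kernel = sigma**2 / (lam * m) * np.sum(k_diag - reduction)

print(bound_features, bound_kernel)
```

The kernel form never builds a $d \times d$ matrix, which is what makes it usable when the embedding is infinite-dimensional.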

Now we come to the center of the derivation. We can in fact ask what the lowest possible value of $E$ is given a fixed budget $n$: how much can I optimize this? In particular,

\[\min_{X, |X|\leq n} E(X)?\]

This is a challenging discrete optimization problem. It is in fact known to be NP-hard in its simplest variant [3]. However, upon performing a probabilistic relaxation, where the inclusion of $x \in \mathcal{X}$ is replaced by the probability $\eta(x)$ that $x$ is selected, we can reformulate the objective as:

\[E(\eta) = \frac{\sigma^2}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \phi(x)^\top \left(\lambda I + \frac{n\sum_{x' \in \mathcal{X}}\eta(x')\phi(x')\phi(x')^\top }{\sigma^2}\right)^{-1} \phi(x).\]

Minimizing $E(\eta)$ over probability distributions $\eta$ is a convex optimization problem that can be easily solved using interior point methods, mirror descent, or Frank-Wolfe. This quantity is often referred to as the effective dimension of the problem.
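A minimal Frank-Wolfe sketch for the relaxed objective $E(\eta)$, written with plain NumPy rather than the libraries used later in this post. It assumes finite-dimensional features, starts from the uniform design, and replaces an exact line search by a search over a fixed grid of step sizes (a simplification chosen here so that the objective never increases). The function names are hypothetical.

```python
import numpy as np

def relaxed_error(eta, Phi, n, sigma, lam):
    """E(eta) = sigma^2/|X| * sum_x phi(x)^T M(eta)^{-1} phi(x)."""
    m, d = Phi.shape
    M = lam * np.eye(d) + n * (Phi.T * eta) @ Phi / sigma**2
    return sigma**2 / m * np.trace(Phi @ np.linalg.solve(M, Phi.T))

def relaxed_error_gradient(eta, Phi, n, sigma, lam):
    """dE/d eta_i = -(n/|X|) * || Phi M^{-1} phi_i ||^2."""
    m, d = Phi.shape
    M = lam * np.eye(d) + n * (Phi.T * eta) @ Phi / sigma**2
    B = np.linalg.solve(M, Phi.T)          # column i is M^{-1} phi_i
    return -n / m * np.sum((Phi @ B) ** 2, axis=0)

def frank_wolfe_design(Phi, n, sigma=0.1, lam=1.0, iterations=200):
    m = Phi.shape[0]
    eta = np.full(m, 1.0 / m)              # start from the uniform design
    steps = np.linspace(0.0, 1.0, 51)      # step grid, includes the zero step
    for _ in range(iterations):
        grad = relaxed_error_gradient(eta, Phi, n, sigma, lam)
        vertex = np.zeros(m)
        vertex[np.argmin(grad)] = 1.0      # linear minimization over the simplex
        candidates = [(1 - a) * eta + a * vertex for a in steps]
        values = [relaxed_error(c, Phi, n, sigma, lam) for c in candidates]
        eta = candidates[int(np.argmin(values))]
    return eta

rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 3))         # random stand-in for the embeddings
n = 50
eta_uniform = np.full(40, 1.0 / 40)
eta_fw = frank_wolfe_design(Phi, n)
print("uniform design:", relaxed_error(eta_uniform, Phi, n, 0.1, 1.0))
print("Frank-Wolfe design:", relaxed_error(eta_fw, Phi, n, 0.1, 1.0))
```

Because the zero step is always among the candidates, the value of the returned design is never worse than that of the uniform starting point.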

Notice that this number increases with $\sigma^2$ and decreases with $\lambda$. The more regularized the problem, the lower the complexity, and the more noise, the less recovery is possible.

As a sidenote, in the statistical literature the above quantity appears in generalization bounds most commonly. In fact it arises when evaluating the value of $E$ when sampling iid data evaluated on the same iid distribution in expectation (generalization). This quantity hence mostly features in works that characterize complexity of generalization of non-parametric models in machine learning.

Practical Note and Example

In absolute terms, this number does not represent anything practically relevant, since we need either the regularization value $\lambda$ (and hence a bound on $f$) or a bound on $\|f\|$ (which implies the regularization). Even without knowing $\lambda$ precisely, what it allows us to do is understand relative budgetary constraints. Namely, to be at accuracy $\nu$, I need $n(\nu)$ data points, the smallest $n$ such that $\min_\eta E_n(\eta)-\nu < 0$. The experimental effort needed for the accuracy level $2\nu$ is $n(2\nu)$, and comparing these numbers is informative: if I want to change the accuracy by a factor of two, we can see how many more data points we need.
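The relative budgets can be read off numerically. The sketch below assumes a fixed design $\eta$ (uniform here, which only upper-bounds the minimum over $\eta$) and uses the fact that, for a fixed design, the bound decreases as $n$ grows, so the smallest sufficient $n$ can be found by doubling and bisection. The helper names and parameter values are illustrative.

```python
import numpy as np

def bound(n, eta, Phi, sigma, lam):
    """Relaxed bound E_n(eta) for a budget of n measurements."""
    m, d = Phi.shape
    M = lam * np.eye(d) + n * (Phi.T * eta) @ Phi / sigma**2
    return sigma**2 / m * np.trace(Phi @ np.linalg.solve(M, Phi.T))

def required_budget(nu, eta, Phi, sigma=1.0, lam=1.0, n_max=10**9):
    """Smallest integer n with bound(n) <= nu, by doubling and bisection."""
    if bound(0, eta, Phi, sigma, lam) <= nu:
        return 0
    hi = 1
    while bound(hi, eta, Phi, sigma, lam) > nu:
        hi *= 2
        if hi > n_max:
            raise ValueError("target accuracy not reachable within n_max")
    lo = hi // 2                       # bound(lo) > nu and bound(hi) <= nu
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bound(mid, eta, Phi, sigma, lam) <= nu:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 3))     # random stand-in for the embeddings
eta = np.full(40, 1.0 / 40)            # uniform design (an assumption)
nu = 0.1
n_nu = required_budget(nu, eta, Phi)
n_half = required_budget(nu / 2, eta, Phi)
print("budget for nu:", n_nu, "budget for nu/2:", n_half)
```

The ratio of the two budgets is the informative quantity here; the absolute values depend on the assumed $\lambda$ and $\sigma$.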

Let us take a practical example. We have neural network embeddings from a trained foundation model, in this case ESM2 [4], a versatile model for protein embeddings based on sequence information. In goes a sequence $x$, and out comes the embedding $\phi(x)$, a fixed-length vector.

Now we will use this to define a kernel as $k(x,y) = \exp(-\sum_i\gamma_i(\phi_i(x)-\phi_i(y))^2)$.
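For concreteness, this kernel can be evaluated with a few lines of NumPy. The sketch assumes the embeddings are already available as rows of an array; the random array below is only a stand-in for ESM2 embeddings, and the function name and the value of `gamma` are illustrative.

```python
import numpy as np

def ard_rbf_kernel(A, B, gamma):
    """k(x, y) = exp(-sum_i gamma_i (phi_i(x) - phi_i(y))^2) for rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    diff = A[:, None, :] - B[None, :, :]     # shape (len(A), len(B), dim)
    return np.exp(-np.sum(gamma * diff**2, axis=-1))

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((5, 8))     # stand-in for phi(x) of 5 sequences
gamma = np.full(8, 0.1)                      # one length-scale parameter per dimension
K = ard_rbf_kernel(embeddings, embeddings, gamma)
print(K.round(3))
```

For a pool of a few thousand sequences with high-dimensional embeddings, the intermediate `diff` array becomes large, so a blockwise evaluation may be preferable.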

We will use a dataset from our work [5] on designing a novel metalloenzyme. This dataset contains 3k+ sequences. We want to understand how much the average error improves as we increase the sample size by following the optimal active learning sampling strategy. See the worst-case calculation of $E$ as the theoretical error:

Figure: Expected worst-case (for the worst function) mean squared error on a search space of streptavidin variants.

If we look at these plots of the above quantity along with the simulations, we see the worst case performance is not far away from the simulated one.

Code

The theoretical capacity of your model and/or embeddings can be calculated with little effort using my libraries stpy and doexpy, as you can see below. These packages can be found on my GitHub.

from stpy.continuous_processes.nystrom_fea import NystromFeatures
from stpy.kernels import KernelFunction
from stpy.helpers.helper import interval_torch
from mdpexplore.env.bandits import Bandits
from mdpexplore.functionals.doe_static_functionals import DesignA
from mdpexplore.convex_solvers.frank_wolfe import FrankWolfe
from mdpexplore.feedback.feedback_base import EmptyFeedback
from mdpexplore.solvers.dp import DP
from mdpexplore.policies.summary_policies.density_policy import DensityPolicy
from mdpexplore.mdpexplore import MdpExplore
import torch 
import matplotlib.pyplot as plt

# list of kernels
kernels = [KernelFunction(kernel_name="squared_exponential", gamma=0.1), 
           KernelFunction(kernel_name="squared_exponential", gamma=0.5)]
# names of the kernels (one label per kernel above, in the same order)
names = [r"RBF $\gamma = 0.1$", r"RBF $\gamma = 0.5$"]

# define the discretized interval [-1,1]
x = interval_torch(128,d = 1)*2

# We calculate finite dimensional embeddings for the interval for easier calculation
# We use Nystrom features with a threshold on explained variance
embeddings = []
for k in kernels:
    Nystrom = NystromFeatures(k, m = None, approx = 'svd-explained')
    Nystrom.fit_gp(x, None, explained_variance = 0.999)
    embeddings.append(Nystrom.embed)

# define the budget of the experiment 
Ts = torch.logspace(1,10,10,base=2)
sigma = 0.05
for name, emb in zip(names,embeddings):
    values = []
    
    for T in Ts: 
        phi = emb(x)
        
        # define the environment
        env = Bandits(
            action_space=phi
        )
        
        # define the problem
        design = DesignA(
            env=env,
            lambd=1.,
            dim = 1,
            V = phi.T@phi/(sigma**2))
        
        # define the convex solver
        convex_solver = FrankWolfe(env,
                                   objective=design,
                                   num_components=10*phi.size()[1],
                                   solver=DP,
                                   step="line-search",
                                   SummarizedPolicyType=DensityPolicy,
                                   accuracy=1e-5)
        
        # define the feedback class
        feedback = EmptyFeedback(env, design)
               
        # Bandit environment 
        env = Bandits(action_space=phi)
        
        me = MdpExplore(
            env=env,
            objective=design,
            convex_solver=convex_solver,
            verbosity=0,
            feedback=feedback,
            general_policy='markovian')
        
        val, opt_val = me.run(
            episodes=int(T))
        
        values.append(-sigma**2*opt_val/T)
    plt.loglog(Ts, values,'o-', label = name)
#plt.loglog(Ts,1/Ts,'k--')
plt.xlabel("Number of Experiments")
plt.ylabel("Achievable Accuracy")
plt.legend()

References

[1] Grace Wahba, Spline Models for Observational Data, 1990, CBMS-NSF Regional Conference Series in Applied Mathematics.
[2] Experimental Design for Linear Functionals in Reproducing Kernel Hilbert Spaces, NeurIPS 2022.
[3] Cerny, Hladik, Two complexity results on c-optimality in experimental design, 2012, Computational Optimization and Applications 51(3):1397-1408.
[4] Lin et al., Language models of protein sequences at the scale of evolution enable accurate structure prediction, bioRxiv https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1
[5] Vornholt, T., Mutný, M., Schmidt, G. W., Schellhaas, C., Tachibana, R., Panke, S., Ward, T. R., Krause, A., & Jeschek, M. (2024). Enhanced sequence-activity mapping and evolution of artificial metalloenzymes by active learning. ACS Central Science. https://doi.org/10.1021/acscentsci.4c00456
[6] Mojmir Mutny, Modern Adaptive Experiment Design: Machine Learning Perspective, 2024. PhD Thesis ETH Zurich.

Greedy algorithm and Frank-Wolfe on a convex relaxation: relationships

2024-04-22T00:00:00-07:00

In this blog post I discuss the relation between the greedy algorithm and Frank-Wolfe on a convex relaxation in the context of optimal experiment design objectives.

Background

This is my first blog post. I am still experimenting with this format, so I apologize if the format or pacing is off. Until recently, I didn’t feel that this medium was suitable for communicating results. However, I have noticed that there are many small or speculative results I would like to share with others, beyond the classical publication scheme. My goal is not to attract a more general audience, but rather to address the same scientific/engineering audience by providing quick, small results that might not be suitable for a traditional scientific paper.

Experiment Design Problem

For the first blog post I chose to write about the greedy algorithm and its role in experiment design. It also relates to a question I got after a talk I gave recently; I will elaborate on this later. The classical setup of experiment design in terms of discrete sets goes like this.

Suppose there is a set of experiments $V$, where $v \in V$ is an experiment. Additionally, there is a budget $T$, and we would like to select a subset $S\subset V$ with $|S| \leq T$ that maximizes a utility $E$:

\[S^* = \arg\max_{S \subset V, |S| \leq T} E(S)\]

We will focus on the class of problems that arise when fitting linear regression on the set of experiments in $S$, represented by covariates $\{\Phi_i\}$. These objectives usually have the form:

\[E(S) = s\left(\sum_{i \in S} \Phi_i \Phi_i^\top + \mathbf{I} \lambda\right),\]

where $\lambda$ is due to regularization and $s$ is the so-called scalarization function. Famous examples are $s(A) = \log\det(A)$ and $s(A) = 1/\operatorname{Tr}(A^{-1})$. The role of the scalarization is to ensure a total order in the space of utilities. For more details and a discussion of different scalarizations, see the references below.

Greedy Algorithm – Discrete Gradient

A greedy algorithm solution to this problem involves selecting the next element according to the rule:

\[\arg\max_{i} E(S \cup \{ i \} ).\]

The final solution is grown incrementally. This algorithm is immensely practical and forms the backbone of many applications. The greedy algorithm provably approximates the maximum if the objective happens to be submodular. This is a form of regularity of set function capturing diminishing returns. Naturally, as the experiment design objectives capture information they are either submodular or close to submodular. In some cases, the objective is submodular and we can prove constant-factor approximation guarantees.
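A minimal sketch of this greedy rule for the $\log\det$ scalarization, assuming the covariates $\Phi_i$ are rows of an array and each experiment can be selected at most once. The function names and the random covariates are illustrative.

```python
import numpy as np

def logdet_objective(Phi, S, lam):
    """E(S) = logdet(sum_{i in S} Phi_i Phi_i^T + lam * I)."""
    d = Phi.shape[1]
    Phi_S = Phi[np.asarray(S, dtype=int)]
    return np.linalg.slogdet(Phi_S.T @ Phi_S + lam * np.eye(d))[1]

def greedy_design(Phi, T, lam=1.0):
    """Grow S one element at a time, always adding argmax_i E(S + {i})."""
    selected, values = [], []
    remaining = list(range(Phi.shape[0]))
    for _ in range(T):
        gains = [logdet_objective(Phi, selected + [i], lam) for i in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
        values.append(logdet_objective(Phi, selected, lam))
    return selected, values

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 4))     # 30 candidate experiments, 4 features
selected, values = greedy_design(Phi, T=8)
print("selected experiments:", selected)
print("objective after each step:", np.round(values, 3))
```

Re-evaluating the objective for every candidate keeps the sketch close to the rule as stated; rank-one update formulas would make each step cheaper.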

Convex Relaxation – Continuous Gradient

Classical experiment design does not solve this problem by applying a greedy algorithm, at least not in its motivation. Instead, it introduces a relaxation of the objective, where the optimization problem is continuous. The textbook relaxation introduces:

\[\bar{E}(\eta) = s\left(T\sum_{i \in V}\eta_i \Phi_i \Phi_i^\top + \mathbf{I} \lambda\right),\]

where we sum over the ground set, and $\eta_i$ represents a normalized indicator of whether the element is selected. Normalized, because $\sum_i \eta_i = 1$. The relaxation proceeds by allowing $\eta_i \in [0,1]$. Then, maximizing $\bar{E}$ over the simplex $\Delta_V$ is in fact a convex problem. A way to view this is that the objective is defined on the space of psd matrices and restricted to the convex hull of the atoms formed by the positive semi-definite matrices $\{ \Phi_i\Phi_i^\top \}$. In fact, we can view the solution as being supported on $\{ \Phi_i\Phi_i^\top \}\cup \{\lambda\mathbf{I} \}$, and accept the regularization as one of the atoms on which the solution is supported. With an overloading of the notation, the utility can be seen as a function on positive definite matrices:

\[\max_{\mathbf{\Sigma} \in \operatorname{conv}( \{ \Phi_i\Phi_i^\top \} )} \bar{E}(\mathbf{\Sigma})\]

Clearly, we are not selecting the regularization through the optimization, but we can force it to be present at all times through the form of the update and initial point in the algorithm. Namely, we will define as a starting point of this optimization algorithm to be equal to $\mathbf{\Sigma}_0 = \mathbf{I}\lambda$.

We apply an optimization procedure that works on the constrained space of atoms and respects the constraint $\mathbf{\Sigma} \in \operatorname{conv}(\{\Phi_i\Phi_i^\top \})$, the convex hull of the matrices: Frank-Wolfe. Frank-Wolfe proceeds by constructing a series of linearizations of the objective $\bar{E}$ at the iterates, given by the gradient $\nabla \bar{E}$, and moves by taking a convex combination of the maximizer of the linearization and the current iterate. There are wonderful resources on the Frank-Wolfe algorithm, e.g. the short introduction by Pokutta listed in the references. Assuming you are familiar with Frank-Wolfe, let us go to the algorithm directly. Starting with $\mathbf{\Sigma}_0=\lambda \mathbf{I}$, we iterate:

\[v_t = \arg\max_{v \in V} \operatorname{Tr}(\nabla\bar{E}(\mathbf{\Sigma}_t) \Phi_v\Phi_v^\top )\]
\[\mathbf{\Sigma}_{t+1} = \alpha_t \Phi_{v_t} \Phi_{v_t}^\top + (1-\alpha_t)\mathbf{\Sigma}_t\]

Notice that $\alpha_t \in (0,1)$ is the step size of the convex combination. The optimization literature suggests picking this number using a line search to converge as fast as possible. In this blog post we will use $\alpha_t = \frac{1}{t+1}$, for two reasons.

First, this rule ensures that at any iteration $t$ the design is integral; in other words, $t\mathbf{\Sigma}_t$ is an integral combination of elements from the base set. This way the relaxation is exact and we always work on the lattice of the relaxation, utilizing the continuous properties only to define a gradient.
Second, with this step size the absolute contribution of the regularization to $t\mathbf{\Sigma}_t$ stays constant, so its relative contribution decreases as more data points are added.

This way, the update looks very similar to the greedy algorithm only with the difference we use the gradient of the objective for the greedy step instead of the actual change in the discrete gradient.

Connection: When are the two the same?

The linearized and discrete gradient algorithms each construct a surrogate objective that is being maximized. The two surrogates are different in general, but when are they the same? The answer is when the discrete gradient and the gradient of the continuous relaxation have the same extreme point. This is implied when, for each vertex $\delta_x$ of $\Delta_V$,

\[\underbrace{F\left(\frac{t}{t+1}\eta_{t}+ \frac{1}{1+t} \delta_x\right) - F(\eta_t)}_{\text{Discrete gradient}} = \rho_t(\nabla F(\eta_t)^\top \delta_x) +C_t\]

where $C_t \in \mathbb{R}$ is a constant independent of $x$ and $\rho_t$ is monotone non-decreasing. The constant and the function $\rho_t$ can change with time $t$. Remarkably, there are many important problems when the two coincide. Let us prove this property for the prominent $\log\det$. In order to state the result, we work on the augmented simplex where $\eta(0)$ corresponds to the $\lambda \mathbf{I}$ just to keep track of it.

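Proposition. Let us consider the objective $F(\eta) = \log\det(\sum_{i=1}^n \eta(i) \Phi(x_i)\Phi(x_i)^\top + \mathbf{I} \eta(0))$ on the augmented simplex, starting with the value $\eta_0$ s.t. $\eta_0(0) = 1$. Then the condition above holds, so the greedy step and the Frank-Wolfe step with $\alpha_t = \frac{1}{t+1}$ select the same element.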

Proof

Let us use the shorthand $\mathbf{V}(\eta) = \sum_{i=1}^n \eta(i) \Phi(x_i)\Phi(x_i)^\top + \mathbf{I} \eta(0)$. The gradient is equal to:

\[\nabla F(\eta)_k = \Phi(x_k)^\top\mathbf{V}(\eta)^{-1}\Phi(x_k) ~ \text{for} ~ k \in \{1,\dots, n\},\]

while the discrete gradient, using the shorthand $C_t^\prime = \log\det(\mathbf{V}(\eta_t))$, is

\[\begin{aligned} dF_k &= \log\det\left(\frac{t}{t+1}\mathbf{V}(\eta_t) + \frac{1}{1+t}\Phi(x_k)\Phi(x_k)^\top \right) - C_t^\prime \\ &= \log\left(\det\left(\frac{t}{t+1}\mathbf{V}(\eta_t)\right)\left(1 + \frac{1}{t}\Phi(x_k)^\top\mathbf{V}(\eta_t)^{-1}\Phi(x_k) \right)\right) - C_t^\prime \\ &= \log\det\left(\frac{t}{t+1}\mathbf{V}(\eta_t)\right) + \log\left(1 + \frac{1}{t} \nabla F(\eta_t)_k\right) - C_t^\prime \\ &= C_t + \rho_t(\nabla F(\eta_t)_k), \end{aligned}\]

where we have identified the constant $C_t = \log\det\left(\frac{t}{t+1}\mathbf{V}(\eta_t)\right) - C_t^\prime$ and the monotone function $\rho_t(z) = \log\left(1 + \frac{z}{t}\right)$.
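The statement can also be checked numerically. The sketch below runs the greedy rule and Frank-Wolfe with step size $\frac{1}{t+1}$ on the same random covariates for the $\log\det$ objective and compares the selected elements. Two assumptions are made for the comparison: elements may be selected repeatedly in both algorithms, and iterations are counted from $t = 1$, so that the first step size is $1/2$ and $t\mathbf{\Sigma}_t$ equals $\lambda\mathbf{I}$ plus the sum of the selected outer products. The function names are hypothetical.

```python
import numpy as np

def greedy_with_repetition(Phi, T, lam=1.0):
    """Pick argmax_k logdet(A + phi_k phi_k^T), A = lam*I + selected outer products."""
    d = Phi.shape[1]
    A = lam * np.eye(d)
    picks = []
    for _ in range(T):
        gains = [np.linalg.slogdet(A + np.outer(p, p))[1] for p in Phi]
        k = int(np.argmax(gains))
        picks.append(k)
        A = A + np.outer(Phi[k], Phi[k])
    return picks

def frank_wolfe_harmonic(Phi, T, lam=1.0):
    """Frank-Wolfe on logdet with steps 1/(t+1), t = 1, 2, ..., from Sigma_1 = lam*I."""
    d = Phi.shape[1]
    Sigma = lam * np.eye(d)
    picks = []
    for t in range(1, T + 1):
        # gradient of logdet at Sigma is Sigma^{-1}; score_k = phi_k^T Sigma^{-1} phi_k
        scores = np.sum(Phi * np.linalg.solve(Sigma, Phi.T).T, axis=1)
        k = int(np.argmax(scores))
        picks.append(k)
        alpha = 1.0 / (t + 1)
        Sigma = alpha * np.outer(Phi[k], Phi[k]) + (1 - alpha) * Sigma
    return picks

rng = np.random.default_rng(0)
Phi = rng.standard_normal((25, 4))
greedy_picks = greedy_with_repetition(Phi, T=12)
fw_picks = frank_wolfe_harmonic(Phi, T=12)
print("greedy:     ", greedy_picks)
print("Frank-Wolfe:", fw_picks)
```

The two sequences coincide because the Frank-Wolfe score is $t$ times $\Phi_k^\top \mathbf{A}^{-1}\Phi_k$, while the greedy gain is $\log(1 + \Phi_k^\top \mathbf{A}^{-1}\Phi_k)$, a monotone function of the same quantity.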

The relationship between greedy and Frank-Wolfe in this form gives us the ability to prove a suboptimality result of the greedy algorithm as:

\[\bar{E}(S^*) - E(S_T) \leq \frac{\bar{E}(\mathbf{\Sigma}^*) - E(\mathbf{\Sigma}_0)}{T} + L \frac{\log T}{T}\]

where $L$ is the Lipschitz constant of $\bar{E}$. The result follows by an application of the master theorem from Mutny (2024). This suggests that by following the greedy algorithm we eventually converge to the optimal proportions of the experiment allocation. Note that this is different from proving classical submodular guarantees such as

\[E(S_T) \geq (1-e^{-T/\tau}) E(S^*_\tau),\]

where we compare to $S^*_\tau$, which is the best solution with budget $\tau$. The first guarantee has a consistency flavor, whereas the other is an approximation guarantee.

While the discrete algorithm seems appealing, it has limited applicability in more complicated domains beyond simple ground sets $V$. This difficulty is at the core of my paper about experiment design in Markov chains (Mutný et al. (2023)).

Relationships

Indeed, the relationship between the greedy algorithm and this convex relaxation has been known at least since the 1970s in the experimental design literature. See, for example, Whittle's (1973) paper on this topic, among others. In fact, the sentiment presented in that paper was that by following greedy you are a lot more suboptimal than by just optimizing quickly and then rounding. If experiments are no longer deterministic but instead follow a stochastic policy, the perspective with rounding is hard to execute. We will look into this in the next blog post.

It is remarkable that exactly for the case of submodular $E$ we can show that greedy is optimal. This perhaps hints at a deeper connection between submodularity, greedy, and Frank-Wolfe with this choice of step size.

Let me return to the question I got after a talk. Notice that the relaxation depends on $T$,

\[\bar{E}(\eta) = s\left(T\sum_{i \in V}\eta_i \Phi_i \Phi_i^\top + \mathbf{I} \lambda\right),\]

which is not surprising, but it is perhaps unclear how to construct an anytime result. The greedy algorithm does not seem to care about the total budget $T$, whereas in order to solve the optimal experiment design we need to know $T$ in advance. This opens the question of how to convert the procedure to have an anytime guarantee, or at least to be executed like the greedy algorithm. In fact, the answer is to do the same thing as greedy and view the regularization as the initial point instead of as part of the objective. This way we obtain an anytime result. The only thing we lose is the optimality for the fixed $T$, but this is to be expected given the motivation.

Citing and References

If you would like to know more about results like this, please consult my thesis. Likewise, if you would like to cite this result, the thesis is more appropriate than this blog. The essence of this discussion appears in my PhD thesis in Section 3.4.6, albeit with some errata that are corrected here, namely one inequality should be an equality. I do not claim to be the first to notice this; I know it from Whittle (1973).

Mojmir Mutny, Modern Adaptive Experiment Design: Machine Learning Perspective, 2024. PhD Thesis ETH Zurich.

References

Nemhauser, George L.; Wolsey, Laurence A.; Fisher, Marshall L., “An analysis of approximations for maximizing submodular set functions—I”, 1978, Mathematical Programming.
Krause, Andreas; Guestrin, Carlos, “Nonmyopic active learning of gaussian processes: an exploration-exploitation approach”, 2007, Proceedings of the 24th International Conference on Machine Learning.
Krause, Andreas; Guestrin, Carlos, “Optimal nonmyopic value of information in graphical models: efficient algorithms and theoretical limits”, 2005.
Mutný, Mojmír; Janik, Tadeusz; Krause, Andreas, “Active Exploration via Experiment Design in Markov Chains”, 2023, Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS).
Whittle, Peter, “Some general points in the theory of optimal experimental design”, 1973, Journal of the Royal Statistical Society: Series B (Methodological)
Pokutta Sebastian, “The Frank-Wolfe algorithm: a short introduction”, 2023, arxiv 2311.05313