<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://alexali04.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://alexali04.github.io/" rel="alternate" type="text/html" /><updated>2026-01-31T06:23:09+00:00</updated><id>https://alexali04.github.io/feed.xml</id><title type="html">Alex Ali</title><subtitle>Your Name&apos;s academic portfolio</subtitle><author><name>Alex Ali</name><email>alexali000@gmail.com</email></author><entry><title type="html">Expectation Maximization and Gaussian Mixture Models</title><link href="https://alexali04.github.io/em-gmm/" rel="alternate" type="text/html" title="Expectation Maximization and Gaussian Mixture Models" /><published>2025-02-27T00:00:00+00:00</published><updated>2025-02-27T00:00:00+00:00</updated><id>https://alexali04.github.io/em-gmm</id><content type="html" xml:base="https://alexali04.github.io/em-gmm/"><![CDATA[<h1 id="1-introduction">1. <strong>Introduction</strong></h1>

<hr />

<p>This is my attempt to work through the math of Gaussian Mixture
Models in full. I also discuss why alternating the \(E\) and \(M\) steps
converges to a good solution, pathological solutions, and some matrix
algebra.</p>

<h1 id="2-problem-setup">2. <strong>Problem Setup</strong></h1>

<hr />

<p>Let’s begin by specifying our data and task. Given
\(\{x_n\}_{n = 1}^N = X \in \mathbb{R}^{N \times d}\), each data point
\(x_n\) has an associated latent indicator variable
\(z_n \in \{0, 1\}^K\) where \(z_n\) is a one-hot vector,
i.e. \(z_{nk} = 1 \implies x_n \in C_k\) where \(C_k\) is the \(k\)’th
component.</p>

<p>We’ll additionally assume each point \(x_n\) was generated by a single
Gaussian so \(C_k\) could be thought of as all the points generated by the
\(k\)’th multivariate Gaussian.</p>

<p>Our task is to figure out \(p(z_{nk} \mid x_n)\) - this is the probability
that a point \(x_n\) belongs to the \(k\)’th MVG.</p>

<p>For a concrete motivating example, say we’re tracking several herds of
animals and we’ve recorded the locations of various sightings of said
animals. The herds are distinct but occasionally animals from
different herds will be sighted in the same place. Our job is to figure
out, given a sighting at a specific location, which herd that animal
belongs to: in other words, we want the posterior distribution
\(p(z_{nk} \mid x_n)\). We have the constraint that
\(\sum_{k = 1}^K p(z_{nk}) = 1\).</p>

<p>Let’s define \(\pi_k = p(z_{nk} = 1)\) (written \(p(z_{nk})\) for short) - this is the prior distribution over the
\(K\) components, independent of the location. So if we know one
of the herds is much larger than the others, we’d give it a larger
<strong>mixture coefficient</strong> (we’re modeling our data as a mixture of Gaussians).</p>

<p>So we can write:</p>

\[\begin{aligned}
    p(x_n) &amp;= \sum_{k = 1}^K p(x_n, z_{nk}) \\
           &amp;= \sum_{k = 1}^K p(x_n \mid z_{nk}) p(z_{nk})
\end{aligned}\]

<p>The probability of a single point \(x_n\) given \(z_{nk} = 1\) means that we
are conditioning on the possibility that \(x_n\) is generated from the
\(k\)’th multivariate Gaussian. So,</p>

\[\begin{aligned}
    p(x_n) = \sum_{k = 1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \pi_k
\end{aligned}\]
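<p>As a quick sanity check, here is a minimal sketch of this mixture density for the one-dimensional case, where each \(\Sigma_k\) collapses to a scalar variance. The function names and the two-component parameters are made up for illustration.</p>

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, var_k)."""
    return sum(pi * normal_pdf(x, mu, var) for pi, mu, var in zip(pis, mus, variances))

# Hypothetical two-component mixture (illustrative values only).
pis = [0.3, 0.7]
mus = [-2.0, 1.0]
variances = [0.5, 1.5]

# Crude Riemann sum on [-15, 15]: a valid density should integrate to roughly 1.
step = 0.01
area = sum(mixture_pdf(-15.0 + i * step, pis, mus, variances) * step for i in range(3000))
print(round(area, 3))
```

<p>Because the mixing coefficients sum to \(1\) and each Gaussian integrates to \(1\), the mixture is itself a valid density.</p>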

<p>Next, let’s also establish some other basic representations.</p>

\[p(z_n) = \prod_{k = 1}^K \pi_k^{z_{nk}}\]

<p>We’re taking advantage of the fact that \(z_{nk}\) is \(1\) for exactly one \(k\) - so this
reduces to \(p(z_n) = \pi_k\) for that \(k\). We’ll do the same for
expressing the conditional:</p>

\[p(x_n \mid z_n) = \prod_{k = 1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}\]

<h1 id="3-mle">3. <strong>MLE</strong></h1>

<hr />

<p>Now, let’s try MLE on the dataset. We define our parameters:</p>

\[\Theta = \{\mu, \Sigma, \pi\}\]

<p>as the means, covariance matrices, and mixing coefficients
of the different components.</p>

\[\begin{aligned}
    L(\Theta \mid X) &amp;= p(X \mid \Theta) \\
    &amp;= \prod_{n = 1}^N p(x_n \mid \Theta) \\
    &amp;= \prod_{n = 1}^N \sum_{k = 1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \pi_k \\
    l(\Theta \mid X) &amp;= \sum_{n = 1}^N \ln \left \{ 
    \sum_{k = 1}^K \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
\right \} \label{eq:hard}
\end{aligned}\]

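<p>This objective is hard to optimize because the logarithm acts on a sum:
it can’t be pushed inside to act on each Gaussian, so setting the derivatives to
zero doesn’t give a closed-form solution. Fortunately, there’s still a way to solve
this problem.</p>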
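<p>Note that <em>evaluating</em> this log-likelihood is easy even though maximizing it in closed form is not. Below is a small one-dimensional sketch. The log-sum-exp trick is my own addition for numerical stability, and the data and parameter values are hypothetical.</p>

```python
import math

def log_normal_pdf(x, mu, var):
    """ln N(x | mu, var) for a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def gmm_log_likelihood(xs, pis, mus, variances):
    """l(Theta | X) = sum_n ln sum_k pi_k N(x_n | mu_k, var_k)."""
    total = 0.0
    for x in xs:
        # Work with ln(pi_k) + ln N(...) and combine them with log-sum-exp.
        terms = [math.log(pi) + log_normal_pdf(x, mu, var)
                 for pi, mu, var in zip(pis, mus, variances)]
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

# Hypothetical data and parameters (illustrative values only).
xs = [-2.1, -1.9, 0.8, 1.2, 1.0]
ll = gmm_log_likelihood(xs, pis=[0.4, 0.6], mus=[-2.0, 1.0], variances=[0.25, 0.25])
print(ll)
```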

<h1 id="4-expectation-maximization">4. <strong>Expectation Maximization</strong></h1>

<h2 id="41-e-step">4.1 E-step</h2>

<hr />

<p>The algorithm used to solve this problem is called <strong>Expectation
Maximization</strong>. Pretend for a second we knew the ground truth and had
access to the <strong>complete dataset</strong>, \(\{X, Z\}\). Then we could write the
likelihood of the complete dataset:</p>

\[\begin{aligned}
    L(\Theta \mid X, Z) &amp;= p(X, Z \mid \Theta) \\
                &amp;= p(X \mid Z, \Theta) p(Z \mid \Theta) \\
                &amp;= \prod_{n = 1}^N p(x_n \mid z_n, \Theta) \prod_{n = 1}^N p(z_n \mid \Theta) \\
                &amp;= \prod_{n = 1}^N p(x_n \mid z_n, \Theta) p(z_n \mid \Theta) \\
                &amp;= \prod_{n = 1}^N \prod_{k = 1}^K \left \{
                \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
            \right \}^{z_{nk}} \\
    l(\Theta \mid X, Z) &amp;= \sum_{n = 1}^N \sum_{k = 1}^K z_{nk} \left [
    \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
    \right ]
\end{aligned}\]

<p>The log-likelihood of the complete dataset has a much nicer form than
the log-likelihood of the incomplete dataset. However, latent variables
are well... latent. If we had access to component ground truths, we
wouldn’t need to do density estimation. So instead, we can optimize the
expected value of the log-likelihood of the complete dataset under the
posterior distribution \(p(z \mid x)\).</p>

<p>Just by using Bayes Rule we can write:</p>

\[\gamma(z_{nk}) = p(z_{nk} \mid x_n) = \frac{p(z_{nk}, x_n)}{p(x_n)} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
    {\sum_{j = 1}^K \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}\]

<p>Since \(z_{nk}\) is an indicator variable, \(\gamma(z_{nk})\) is the
posterior probability that \(z_{nk} = 1\). The probability that an indicator
variable is \(1\) is its expected value, i.e.
\(\mathbb{E}[z_{nk}] = \gamma(z_{nk})\).</p>
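<p>To make the responsibilities concrete, here is a minimal one-dimensional sketch of the Bayes’ rule computation above. The two "herds" and their parameters are hypothetical, and the helper names are my own.</p>

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, pis, mus, variances):
    """gamma(z_nk) = pi_k N(x | mu_k, var_k) / sum_j pi_j N(x | mu_j, var_j)."""
    weighted = [pi * normal_pdf(x, mu, var) for pi, mu, var in zip(pis, mus, variances)]
    evidence = sum(weighted)  # p(x), the denominator in Bayes' rule
    return [w / evidence for w in weighted]

# Hypothetical example: herd 0 lives around -2, herd 1 around +1.
pis, mus, variances = [0.5, 0.5], [-2.0, 1.0], [1.0, 1.0]
for x in (-2.0, -0.5, 1.0):
    gamma = responsibilities(x, pis, mus, variances)
    print(x, [round(g, 3) for g in gamma])
```

<p>A sighting at the midpoint \(x = -0.5\) is equally likely to come from either herd, while sightings near a herd’s mean are assigned almost entirely to that herd.</p>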

<p>So we have:</p>

\[\mathbb{E}_{z \mid x}[ \ln p(X, Z \mid \Theta)] = \mathbb{E}_{z \mid x} \left [ \sum_{n = 1}^N \sum_{k = 1}^K z_{nk} \left [
 \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)  \right ]  \right ]\]

<p>By linearity of expectation, and since \(\pi_k\) and the normal density don’t depend on \(z\), we
have:</p>

\[\mathbb{E}_{z \mid x}[ \ln p(X, Z \mid \Theta)] = \sum_{n = 1}^N \sum_{k = 1}^K \mathbb{E}[z_{nk}] \left [  \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)  \right ]\]

\[\mathbb{E}_{z \mid x}[ \ln p(X, Z \mid \Theta)] = \sum_{n = 1}^N \sum_{k = 1}^K \gamma(z_{nk}) \left [  \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)  \right ]\]

<h2 id="42-m-step">4.2 M-Step</h2>

<hr />

<p>At this point, we maximize that expectation w.r.t. \(\Theta\). Specifically, we
have:</p>

\[\Theta^\text{new} = \mathop{\mathrm{arg\,max}}_{\Theta} \ \mathbb{E}_{z \mid x}[ \ln p(X, Z \mid \Theta)]\]

<p>Recall our parameters are the mixture coefficients, covariance matrices,
and means. So we’ll do the following derivations.</p>

<h3 id="421-means">4.2.1 <em>Means</em></h3>

<hr />

<p>First, let’s compute the derivative w.r.t. the <strong>means</strong>. Since the
derivative is linear, we’ll temporarily drop the sums and the weights
\(\gamma(z_{nk})\) and differentiate a single term.</p>

\[\frac{\partial E}{\partial \mu_k} = \frac{\partial}{\partial \mu_k}  \left \{
\ln (\pi_k) + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
\right \}\]

<p>Let’s expand the logarithms:</p>

\[= \frac{\partial}{\partial \mu_k} \left \{
\ln (\pi_k) - \frac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) - \frac{d}{2} \ln(2 \pi) - \frac{1}{2} \ln \det(\Sigma_k) \right \}\]

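<p>Only the quadratic term depends on \(\mu_k\), and the sum over components goes away since we differentiate w.r.t. one specific mean. Restoring the sum over \(n\) and the weights \(\gamma(z_{nk})\), we
have:</p>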

\[\frac{\partial E}{\partial \mu_k}  = - \frac{1}{2} \sum_{n = 1}^N \gamma(z_{nk}) \frac{\partial}{\partial \mu_k} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\]

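<p>The derivative of \(x^T A x\) w.r.t. \(x\) is \((A + A^T) x\), which is \(2 A x\) here
since \(\Sigma_k^{-1}\) is symmetric (as \(\Sigma_k\) is a symmetric matrix). The chain rule contributes a factor of \(-1\) from \(x_n - \mu_k\), so setting the derivative to zero gives:</p>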

\[0 = \sum_{n = 1}^N \gamma(z_{nk}) \Sigma_k^{-1} (x_n - \mu_k)\]

<p>Left-multiply both sides by \(\Sigma_k\) and we get:</p>

\[0 = \sum_{n = 1}^N \gamma(z_{nk}) x_n - \sum_{n = 1}^N \gamma(z_{nk}) \mu_k\]

<p>Define \(N_k = \sum_{n = 1}^N \gamma(z_{nk})\). So we have:</p>

\[\mu_k^* = \frac{1}{N_k} \sum_{n = 1}^N \gamma(z_{nk}) x_n\]

<p>So the MLE of the \(k\)’th mean is a weighted average of the data points, where each point is weighted by its posterior probability of belonging
to the \(k\)’th component and the sum is normalized by
the effective number of points assigned to cluster \(k\).</p>
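<p>Here is a tiny worked example of the mean update in one dimension. The three points and their responsibilities are hypothetical numbers chosen to make the arithmetic easy to follow.</p>

```python
# Three 1-D points with hypothetical responsibilities for component k.
xs = [0.0, 1.0, 4.0]
gammas = [0.9, 0.8, 0.1]

N_k = sum(gammas)                                      # effective number of points
mu_k = sum(g * x for g, x in zip(gammas, xs)) / N_k    # responsibility-weighted mean

# The stationarity condition sum_n gamma_nk (x_n - mu_k) = 0 should hold
# (in 1-D the Sigma_k^{-1} factor is a positive scalar and drops out).
gradient = sum(g * (x - mu_k) for g, x in zip(gammas, xs))
print(N_k, mu_k, gradient)
```

<p>The point at \(4\) barely moves the mean because the component takes little responsibility for it.</p>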

<h3 id="422-covariances">4.2.2 <em>Covariances</em></h3>

<hr />

<p>Next, we’ll do the same for the covariance matrix. First,
\(\frac{\partial \det(X)}{\partial X} = \det(X) (X^{-1})^T\) so
\(\frac{\partial \ln \det(X)}{\partial X} = (X^{-1})^T\) (equation 49 in the Matrix
Cookbook). As before, we differentiate a single term of the sum:</p>

\[= \frac{\partial}{\partial \Sigma_k} \left \{
\ln (\pi_k) - \frac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) - \frac{d}{2} \ln(2 \pi) - \frac{1}{2} \ln \det(\Sigma_k) \right \}\]

<p>The terms that don’t depend on \(\Sigma_k\) go away, and \(\Sigma_k^{-T} = \Sigma_k^{-1}\) by symmetry. So we have:</p>

\[= - \frac{1}{2} \Sigma_k^{-1} - \frac{1}{2} \frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\]

<p>By (61) in the Matrix Cookbook, we have:</p>

\[\frac{\partial}{\partial X} a^T X^{-1} a = - X^{-T} aa^T X^{-T}\]

\[\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) = - \Sigma_k ^{-1} (x_n - \mu_k) (x_n - \mu_k)^T \Sigma_k^{-1}\]
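<p>If you don’t want to take the Matrix Cookbook identity on faith, it can be checked numerically. The sketch below compares the closed form against central finite differences on a random symmetric positive-definite matrix. The matrix, vector, and step size are arbitrary choices of mine.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
a = rng.normal(size=d)
B = rng.normal(size=(d, d))
S = B @ B.T + d * np.eye(d)          # a random symmetric positive-definite matrix

def f(M):
    """f(M) = a^T M^{-1} a."""
    return a @ np.linalg.solve(M, a)

# Closed form from the identity: -M^{-T} a a^T M^{-T}.
S_inv = np.linalg.inv(S)
analytic = -S_inv.T @ np.outer(a, a) @ S_inv.T

# Central finite differences, perturbing one entry of M at a time.
eps = 1e-6
numeric = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        numeric[i, j] = (f(S + E) - f(S - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```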

<p>So we have (since again, the inner sum disappears):</p>

\[\frac{\partial E}{\partial \Sigma_k} = - \frac{1}{2} \sum_{n = 1}^N \gamma(z_{nk}) \left \{ \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k) (x_n - \mu_k)^T \Sigma_k^{-1} \right \}\]

<p>We can write this:</p>

\[\begin{aligned}
0 &amp;= - \frac{1}{2} \Sigma_k^{-1} \left [ N_k I - \sum_{n = 1}^N \gamma(z_{nk}) (x_n - \mu_k) (x_n - \mu_k)^T \Sigma_k^{-1}
\right ] \\
&amp;= N_k I - \sum_{n = 1}^N \gamma(z_{nk})  (x_n - \mu_k) (x_n - \mu_k)^T \Sigma_k^{-1} \\ 
N_k I &amp;= 
\sum_{n = 1}^N \gamma(z_{nk})  (x_n - \mu_k) (x_n - \mu_k)^T \Sigma_k^{-1} \\
N_k \Sigma_k &amp;= \sum_{n = 1}^N  \gamma(z_{nk}) (x_n - \mu_k) (x_n - \mu_k)^T \\
\Sigma_k^* &amp;= \frac{1}{N_k} \sum_{n = 1}^N  \gamma(z_{nk}) (x_n - \mu_k) (x_n - \mu_k)^T
\end{aligned}\]

<p>So the MLE for the covariance matrix of the \(k\)’th component is a
weighted covariance matrix (weighted by the responsibilities \(\gamma(z_{nk})\))
normalized by the effective number of data points belonging to the \(k\)’th cluster.</p>

<h3 id="423-mixing-coefficients">4.2.3. <em>Mixing Coefficients</em></h3>

<hr />

<p>Finding the MLEs of the mixing coefficients involves a Lagrange
multiplier since the coefficients sum to \(1\) (strictly, the other derivations carry this term too, but
its derivative w.r.t. \(\mu_k\) and \(\Sigma_k\) is \(0\)).</p>

<p>So we optimize \(E' = E - \lambda (\sum_{k = 1}^K \pi_k - 1)\). We differentiate w.r.t.
the mixture coefficients.</p>

\[0 = \frac{\partial E'}{\partial \pi_k} = \sum_{n = 1}^N \frac{\gamma(z_{nk})}{\pi_k} - \lambda\]

\[\lambda \pi_k = \sum_{n = 1}^N \gamma(z_{nk})\]

<p>Now, let’s sum both sides over \(k\). The mixing coefficients sum to \(1\), so the left side is just \(\lambda\):</p>

\[\lambda = \sum_{k = 1}^K \sum_{n = 1}^N \gamma(z_{nk}) = N\]

<p>So we have:</p>

\[\pi_k^* = \frac{N_k}{N}\]

<p>The mixing coefficient for the \(k\)’th multivariate Gaussian is the
average responsibility it takes over the whole dataset - the effective number of points
it explains over the total number of points.</p>

<p>So in short, our MLEs are:</p>

\[\pi_k^* = \frac{N_k}{N}\]

\[\Sigma_k^* = \frac{1}{N_k} \sum_{n = 1}^N  \gamma(z_{nk}) (x_n - \mu_k) (x_n - \mu_k)^T\]

\[\mu_k^* = \frac{1}{N_k} \sum_{n = 1}^N \gamma(z_{nk}) x_n\]

<h4 id="424-some-code">4.2.4 <em>Some Code</em></h4>

<p>Next, we have some code written up.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# compute_mvg_density and make_cov_matrix are helper functions that are not
# defined in this snippet; they are assumed to come from the linked repository.


def E_step(X, pi_ks, means, covs):
    """Return the (N, k) matrix of responsibilities gamma(z_nk)."""
    N = X.shape[0]
    k = means.shape[0]
    Gamma = np.zeros((N, k))

    for i in range(N):
        for j in range(k):
            # Numerator of Bayes' rule: pi_k * N(x_n | mu_k, Sigma_k)
            Gamma[i, j] = pi_ks[j] * compute_mvg_density(mu=means[j], Sigma=covs[j], x=X[i], batch=False)
        # Divide by p(x_n) = sum_k pi_k * N(x_n | mu_k, Sigma_k)
        Gamma[i] /= np.sum(Gamma[i])

    return Gamma


def M_step(X, pi_ks, means, covs, Gamma):
    """Update pi_ks, means and covs in place from the responsibilities."""
    N = X.shape[0]
    k = means.shape[0]

    for j in range(k):
        N_k = np.sum(Gamma[:, j]) # N_k = sum_n gamma_nk
        pi_ks[j] = N_k / N        # pi_k = N_k / N

        # mu_k = (sum_n gamma_nk * x_n) / N_k
        # (built-in sum: np.sum over a generator is deprecated)
        means[j] = sum(Gamma[n, j] * X[n] for n in range(N)) / N_k

        # sigma_k = (sum_n gamma_nk * (x_n - mu_k)(x_n - mu_k).T) / N_k
        covs[j] = sum(Gamma[n, j] * np.outer(X[n] - means[j], X[n] - means[j]) for n in range(N)) / N_k


def fit_gmm(X, k, steps):
    """
    Fit a k-component GMM to 2-D data X by running `steps` EM iterations.
    """
    covs = np.zeros((k, 2, 2))
    for i in range(k):
        covs[i] = make_cov_matrix(np.array([1, 1]))

    mean_point = np.average(X, axis=0)
    rands = np.random.randn(k, 2)
    means = mean_point + rands * 0.1

    pi_ks = np.random.rand(k)
    pi_ks /= np.sum(pi_ks)  # mixing coefficients must sum to 1

    for i in range(steps):
        Gamma = E_step(X, pi_ks, means, covs)
        M_step(X, pi_ks, means, covs, Gamma)

    return Gamma, pi_ks, means, covs
</code></pre></div></div>

<p><img src="/images/GMM.gif" alt="GMM Fitting Gif" /></p>

<p>Currently, my code for fitting GMMs is maybe 4x slower than sklearn? But they both yield the same results which is pretty cool. In practice, the E-step means computing posterior probabilities while the M-step is a parameter update. There’s some more plotting code which was tough to figure out - it can all be found <a href="https://github.com/alexali04/result_replication/tree/main/EM">here</a>. You can also use the negative log-likelihood as a proxy for how well your clustering algorithm fits the data. However, be cautious: a higher likelihood doesn’t necessarily mean a better result for dimensionality reduction or for clustering with mixture models.</p>

<h1 id="5-generalized-em">5. <strong>Generalized EM</strong></h1>

<hr />

<p>We present the EM algorithm below:</p>

<ol>
  <li>
    <p>Choose initial \(\Theta^\text{old}\).</p>
  </li>
  <li>
    <p>E step. Evaluate \(p(Z \mid X, \Theta^\text{old})\).</p>
  </li>
  <li>
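    <p>M step. Evaluate
\(\Theta^\text{new} = \mathop{\mathrm{arg\,max}}_{\Theta} Q(\Theta, \Theta^\text{old})\), where
\(Q(\Theta, \Theta^\text{old}) = \mathbb{E}_{Z \mid X, \Theta^\text{old}}[\ln p(X, Z \mid \Theta)]\).</p>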
  </li>
  <li>
    <p>Check for convergence of the log-likelihood or the parameter values. If not converged, set
\(\Theta^\text{old} \leftarrow \Theta^\text{new}\) and return to step
\(2\).</p>
  </li>
</ol>
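<p>The four steps above can be sketched end to end for a one-dimensional GMM. This is a simplified illustration rather than the implementation used for the plots in this post: the function names, the synthetic data, the initial parameters, and the tolerance are all assumptions of mine.</p>

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
               for x in xs)

def em_1d(xs, pis, mus, variances, max_steps=200, tol=1e-8):
    """Run EM on a 1-D GMM; returns the parameters and the log-likelihood trace."""
    K, N = len(pis), len(xs)
    trace = [log_likelihood(xs, pis, mus, variances)]
    for _ in range(max_steps):
        # Step 2 (E step): responsibilities under the current (old) parameters.
        gammas = []
        for x in xs:
            w = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(w)
            gammas.append([wk / total for wk in w])
        # Step 3 (M step): the closed-form updates from section 4.2.
        for k in range(K):
            N_k = sum(g[k] for g in gammas)
            mus[k] = sum(g[k] * x for g, x in zip(gammas, xs)) / N_k
            variances[k] = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / N_k
            pis[k] = N_k / N
        # Step 4: stop once the log-likelihood no longer changes.
        trace.append(log_likelihood(xs, pis, mus, variances))
        if math.isclose(trace[-1], trace[-2], rel_tol=0.0, abs_tol=tol):
            break
    return pis, mus, variances, trace

# Synthetic data from two hypothetical clusters centered at -2 and 3.
random.seed(0)
xs = [random.gauss(-2.0, 0.5) for _ in range(100)] + [random.gauss(3.0, 1.0) for _ in range(100)]
pis, mus, variances, trace = em_1d(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(mus, len(trace))
```

<p>The log-likelihood trace should never decrease from one iteration to the next, which is exactly the property the next section argues for.</p>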

<p>EM is a much more general algorithm than just fitting mixture models -
it finds ML or MAP parameter estimates when the model depends on latent
variables and optimizing the complete log-likelihood is easier than
optimizing the log likelihood.</p>

<h2 id="51-intuition">5.1. Intuition</h2>

<hr />

<p>We’ve done a lot of math but it’s not exactly clear at this point why
alternating between the E and M steps produces a better estimate of the
parameters and posterior probabilities. The E-step is about computing
probabilistic cluster assignments. In \(k\)-means, the analogue is deciding which
points belong to which clusters - the assignment step. The M-step is
about updating \(\Theta\), the cluster properties. The idea behind EM is to
find a lower bound on the log-likelihood \(l(\Theta \mid X)\). Maximizing this
lower bound leads to higher likelihood values. Starting with
\(\Theta^\text{old}\) we construct the surrogate
\(Q(\Theta, \Theta^\text{old})\), which is the lower bound up to a term that doesn’t depend on \(\Theta\). Then, we find
\(\Theta^\text{new} = \mathop{\mathrm{arg\,max}}_{\Theta} Q(\Theta, \Theta^\text{old})\).
We repeat, constructing another surrogate.</p>

<p>Recall our log-likelihood quantity:</p>

\[l(\Theta \mid X) = \sum_{n = 1}^N \log \sum_{k = 1}^K p(x_n, z_{nk} \mid \Theta)\]

<p>We can re-write this:</p>

\[= \sum_{n = 1}^N \log \sum_{k = 1}^K p(z_{nk} \mid x_n, \Theta) \frac{p(x_n, z_{nk} \mid \Theta)}{p(z_{nk} \mid x_n, \Theta)}\]

<p>The weighted sum \(\sum_{k = 1}^K p(z_{nk} \mid x_n, \Theta) \, (\cdot)\) is an
expectation over \(z_{nk} \mid x_n\). So we can write:</p>

\[l(\Theta \mid X) = \sum_{n = 1}^N \log \mathbb{E}_{z \mid x} \left [ \frac{p(x_n, z_{nk} \mid \Theta)}{p(z_{nk} \mid x_n, \Theta)}    \right ]\]

<p>We can take advantage of <strong>Jensen’s Inequality</strong>. For linear functions,
we have \(\mathbb{E}[f(x)] = f(\mathbb{E}[x])\). For convex functions,
which curve upwards, the
average output is at least as large as the output of the average input, or
\(f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)]\).</p>
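<p>Jensen’s inequality is easy to check numerically. The sketch below uses the convex function \(f(x) = -\log x\), which is equivalent to checking \(\log \mathbb{E}[X] \geq \mathbb{E}[\log X]\) for the concave logarithm. The uniform distribution is an arbitrary choice of mine; any positive random variable works.</p>

```python
import math
import random

random.seed(0)
# Any positive random variable works; a uniform on [0.1, 5] is an arbitrary choice.
samples = [random.uniform(0.1, 5.0) for _ in range(10_000)]

log_of_mean = math.log(sum(samples) / len(samples))
mean_of_log = sum(math.log(s) for s in samples) / len(samples)

# Jensen for the concave logarithm: log(E[X]) is at least E[log X].
print(log_of_mean, mean_of_log)
```

<p>The inequality holds exactly for the empirical distribution of the samples, so the first number is never smaller than the second, whatever the seed.</p>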

<p>The logarithm is a concave function, so the inequality is reversed: applying the convex case to \(f(x) = - \log x\) gives
\(\log (\mathbb{E}X) \geq \mathbb{E} \log X\). So we have:</p>

\[l(\Theta \mid X) \geq \sum_{n = 1}^N \mathbb{E}_{z \mid x} \ln \frac{p(x_n, z_{nk} \mid \Theta)}{p(z_{nk} \mid x_n, \Theta)}\]

\[= \sum_{n = 1}^N \mathbb{E}_{z \mid x} \left [ \ln p(x_n, z_{nk} \mid \Theta) - \ln p(z_{nk} \mid x_n, \Theta) \right ]\]

<p>The term within the sum becomes:</p>

\[= \mathbb{E}_{z \mid x} [\ln p(x_n, z_{nk} \mid \Theta)] - \mathbb{E}_{z \mid x} [\log p(z_{nk} \mid x_n, \Theta)]\]

<p>This is the expected complete log-likelihood plus the entropy of the conditional
distribution \(z \mid x\). It can actually be shown that because we multiplied by
one using the exact posterior \(p(z \mid x, \Theta)\), the lower bound is tight (it holds with equality).</p>
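<p>The tightness claim can be verified on a single data point: with the exact posterior as the weighting distribution, the expected complete log-likelihood plus the entropy equals \(\ln p(x)\). The observation and the two-component parameters below are hypothetical.</p>

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical single observation and two-component parameters.
x = 0.3
pis, mus, variances = [0.4, 0.6], [-1.0, 2.0], [1.0, 0.5]

joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]  # p(x, z_k = 1)
evidence = sum(joint)                                                     # p(x)
posterior = [j / evidence for j in joint]                                 # p(z_k = 1 | x)

expected_complete = sum(q * math.log(j) for q, j in zip(posterior, joint))
entropy = -sum(q * math.log(q) for q in posterior)

# With the exact posterior as the weighting distribution, the bound equals ln p(x).
print(expected_complete + entropy, math.log(evidence))
```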

<p>Let’s call the expected value of the complete log-likelihood
\(Q(\Theta \mid \Theta^\text{old})\) and the entropy term
\(H(\Theta \mid \Theta^{\text{old}})\), where in both the expectation is taken under the posterior at \(\Theta^\text{old}\) and the log-probabilities are evaluated at \(\Theta\). Recall that entropy is
non-negative (you cannot have negative surprisal). In some sense, we are
almost done. We compute a lower-bound on the likelihood of the data
given our existing parameters (E-step) and then maximize that
lower-bound (M-step). However, how do we know that maximizing our
lower-bound will improve our likelihood? Gibbs’ inequality states:</p>

\[H(\Theta \mid \Theta^\text{old}) \geq H(\Theta^\text{old} \mid \Theta^\text{old})\]

<p>Cross-entropy is always at least as large as entropy. So we can write:</p>

\[\begin{aligned}
    \ln p(x \mid \Theta) - \ln p(x \mid \Theta^\text{old}) &amp;= Q(\Theta \mid \Theta^\text{old}) - Q(\Theta^\text{old} \mid \Theta^\text{old}) + H(\Theta \mid \Theta^\text{old}) - H(\Theta^\text{old} \mid \Theta^\text{old}) \\
                                   &amp;\geq Q(\Theta \mid \Theta^\text{old}) - Q(\Theta^\text{old} \mid \Theta^\text{old})
\end{aligned}\]

<p>In other words, if we choose \(\Theta\) to increase
\(Q(\Theta \mid \Theta^\text{old})\) above \(Q(\Theta^\text{old} \mid \Theta^\text{old})\), then the
log-likelihood of the data increases by at least the same amount.</p>

<h2 id="52-pathological-solution">5.2 Pathological Solution</h2>

<hr />

<p>We don’t actually want to maximize the likelihood of the data with a
Gaussian Mixture Model without any constraints. If we shape one MVG as a point-mass on a single
data point and let another MVG give non-zero density to the other data
points, then the likelihood goes to infinity. To see
this, observe:</p>

\[L(\Theta \mid X) = \prod_{n = 1}^N \left [ \frac{\pi_1}{\sqrt{2 \pi \sigma_1^2}} \exp \left \{ - \frac{1}{2 \sigma_1^2} (x_n - \mu_1)^2  \right \} + \frac{\pi_2}{\sqrt{2 \pi \sigma_2^2}} \exp \left \{ - \frac{1}{2 \sigma_2^2 }  (x_n - \mu_2)^2 \right \} \right ]\]

<p>By setting \(\mu_1 = x_n\) and letting \(\sigma_1^2 \to 0\), we can get
infinite likelihood for a single data point \(x_n\). The exponent
is exactly \(0\) at \(x_n = \mu_1\), so the exponential term equals \(1\).
Meanwhile, the normalizing factor in front of it becomes really large because its
denominator goes to \(0\). The other MVG gives non-zero density to the other
points so we have the product of a very large number with a bunch of
other non-zero numbers.</p>
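<p>The blow-up is easy to reproduce. In the sketch below (with made-up one-dimensional data), the first component is centered exactly on one data point and its variance is shrunk step by step while the second component stays broad.</p>

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
               for x in xs)

xs = [-1.0, 0.0, 0.5, 2.0]           # hypothetical 1-D data
lls = []
for sigma2 in (1.0, 1e-2, 1e-4, 1e-8):
    # Component 1 sits exactly on xs[0] and shrinks; component 2 stays broad.
    ll = log_likelihood(xs, pis=[0.5, 0.5], mus=[xs[0], 0.5], variances=[sigma2, 1.0])
    lls.append(ll)
    print(sigma2, ll)
```

<p>The log-likelihood keeps growing as the variance shrinks, even though the fit to the remaining points is no better.</p>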

<p>This is an example of MLE overfitting - it suggests that MLE is not a
good objective function at all for this sort of problem and we need to
introduce some sort of prior or constraint on our parameters.</p>

<h1 id="6-conclusion">6. <strong>Conclusion</strong></h1>

<hr />

<p>This view of EM leads to the generalized paradigm of variational
inference (minimizing KL-divergence between some parameterized proposal
dist. and the true distribution through maximizing lower bound on
marginal likelihood). The opposite view leads to expectation
propagation. All very interesting stuff. Recently, I’ve been reading up
on information theory. I also aim for my next post to be on diffusion
models and various other generative models. Also at some point I want to
code up a diffusion model.</p>

<h1 id="7-references">7. <strong>References</strong></h1>

<hr />

<p>Gregory Gundersen, Expectation–Maximization, 10 Nov. 2019.
https://gregorygundersen.com/blog/2019/11/10/em</p>

<p>Bishop, Christopher M. Pattern Recognition and Machine Learning. New
York: Springer, 2006. Chapter 9</p>

<p>Rudin, Cynthia. Intuition for the Algorithms of Machine Learning,
Self-pub, eBook, 2020 Chapter 11.
https://users.cs.duke.edu/~cdr42/teaching.html</p>

<p>Petersen, K. B., &amp; Pedersen, M. S. (2006). The Matrix Cookbook.
Technical University of Denmark. http://www2.imm.dtu.dk/pubdb/p.php?3274</p>]]></content><author><name>Alex Ali</name><email>alexali000@gmail.com</email></author><summary type="html"><![CDATA[I keep forgetting GMM math so I'm writing it down.]]></summary></entry><entry><title type="html">An Anthology for Linear Regression</title><link href="https://alexali04.github.io/posts/2024/11/bayesian-linear-regression/" rel="alternate" type="text/html" title="An Anthology for Linear Regression" /><published>2024-11-29T00:00:00+00:00</published><updated>2024-11-29T00:00:00+00:00</updated><id>https://alexali04.github.io/posts/2024/11/bayesian-linear-regression</id><content type="html" xml:base="https://alexali04.github.io/posts/2024/11/bayesian-linear-regression/"><![CDATA[<h1 id="1-introduction">1. <strong>Introduction</strong></h1>

<hr />

<p>Linear regression is interpretable and well-studied. I present to you a
rough anthology of linear regression.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> By this, I mean several
different approaches to linear regression. We’ll start off with
minimizing MSE and end up with a Bayesian approach with hopefully a
(semi-)rigorous approach throughout. Ideally, everything should <em>feel</em>
motivated.</p>

<p>There are several computational tricks / alternative derivations which
are shorter and easier. I’ll go through some of them at the end of the
article but it’s probably valuable to do it the painful way at least
once.</p>

<h1 id="2-problem-set-up">2. <strong>Problem Set up</strong></h1>

<hr />

<p>We begin with a dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i= 1}^n\). Let
\(y_i \in \mathbb{R}, x_i \in \mathbb{R}^d\). We want to find a linear
relationship between features \(x_i\) and noisy target values \(y_i\), i.e.:</p>

\[y_i = f(x_i) + \epsilon_x\]

<p>If we believe the underlying relationship between \(y\) and \(x\) is in fact
linear, we set</p>

\[f(x_i) = w^T x_i\]

<p>where \(w\) represents the weights or scaling coefficients.</p>

<p>We can collect the target values \(y_1, \dots, y_n\) into a vector \(y\) and
the feature vectors \(x_1, \dots, x_n\) into a <strong>design matrix</strong> \(X\) whose
\(i\)’th row is \(x_i^T\). This ensures each \(y_i = x_i^T w\) up to noise.</p>

\[y = X w + \epsilon_x\]

<p>Note that this represents \(y\) as a column vector.</p>

<h1 id="3-minimize-mse">3. <strong>Minimize MSE</strong></h1>

<hr />

<p>The first perspective on linear regression is just minimizing the error
between the targets \(y\) and the predictions \(Xw\). We’ll derive this with
matrix calculus<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. The residual (error between \(Xw\) and \(y\)) is defined
as:</p>

\[e = y - Xw\]

<p>We want to find the \(\hat{w}\) that minimizes the <strong>mean-squared error</strong> (we drop the \(\frac{1}{n}\) factor since it doesn’t change the minimizer):</p>

\[\hat{w} = \operatorname{argmin}_{w} \ e^T e = \operatorname{argmin}_{w} \ (y - Xw)^T (y - Xw)\]

<p>For optimization problems like this, we can just take the first
derivative w.r.t \(w\) and set to \(0\).</p>

\[0 = \frac{\partial e^T e}{\partial w} = \frac{\partial}{\partial w} (y - Xw)^T (y - Xw)\]

\[= \frac{\partial}{\partial w} (y^T y + w^T X^T X w - 2 w^T X^T y)\]

<p>There are two ways to treat matrix derivatives: numerator layout
(treating the gradient as a row vector) and denominator layout (treating
the gradient as a column vector). This actually caused me great grief
when I was learning this on my own but I’ve kinda gotten over it. So
both views are presented here.</p>

<p>Here are the rules we will use (treating the gradient as a row vector).
To treat the gradient as a column vector, take the transpose.</p>

\[\frac{\partial}{\partial x} x^T a = \frac{\partial}{\partial x} a^T x = a^T\]

\[\frac{\partial}{\partial x} x^T A x = x^T (A + A^T)\]

<p>To verify these rules for yourself, you can just convert these
matrix-vector products into sums.</p>
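<p>Another way to convince yourself is a numerical check. The sketch below compares the row-vector rule for \(x^T A x\) against central finite differences. The random matrix, vector, and step size are arbitrary choices of mine, and \(A\) is deliberately not symmetric.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))          # deliberately not symmetric
x = rng.normal(size=4)

analytic = x @ (A + A.T)             # x^T (A + A^T), a row vector

# Central finite differences along each coordinate direction e.
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.max(np.abs(analytic - numeric)))
```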

<h2 id="31-gradient-is-a-row-vector">3.1 <strong>Gradient is a Row Vector</strong></h2>

<hr />

<p>The first term (\(y^T y\)) is not dependent on \(w\) so its derivative is zero. The
second term becomes \(2 w^T X^T X\). The third term becomes \(- 2 y^T X\). So
we have:</p>

\[0 = 2 w^T X^T X - 2 y^T X\]

\[y^T X = w^T X^T X\]

\[w^T = y^T X (X^T X)^{-1}\]

\[w = (X^T X)^{-1} X^T y\]

<p>The last step takes the transpose of both sides and uses the fact that \(X^T X\) is symmetric: if a symmetric matrix is non-singular, its inverse is also symmetric.</p>

<h2 id="32-gradient-is-a-column-vector">3.2 <strong>Gradient is a Column Vector</strong></h2>

<hr />

<p>The second term becomes \(2 X^T X w\) and the third becomes \(- 2 X^T y\).</p>

\[0 = 2 X^T X w - 2 X^T y\]

<p>Skipping ahead…</p>

\[w = (X^T X)^{-1} X^T y\]

<p>Okay, this was all fairly simple. For centered data, we can also read this as
\(w = \frac{\text{Cov}(X, y)}{\text{Var}(X)}\) to view this from another,
probabilistic angle.</p>

<p>I hope everything seems relatively well-motivated thus far. However,
there are certain assumptions that have already been made as well as
other issues that have been glossed over.</p>

<p>Prominently, we defined the loss function to be the mean-squared error
\(\text{MSE}(y, Xw)\). This is equivalent to minimizing the \(L_2\) norm of
the residuals. But <em>why</em> use mean-squared error? Why not mean absolute
error (MAE) or mean-cubic error?</p>

<p>There are many reasons why MSE is an empirically nice loss function to
use. It’s differentiable everywhere, unlike MAE, and doesn’t place as much weight on
outliers as mean errors of higher order do. \(L_2\) is also a
Hilbert space while \(L_p\) for \(p \geq 1, p \neq 2\) is not (whatever that
means). But we still <em>chose</em> MSE. These are justifications, not
derivations from first principles.</p>

<p>I find that everything fits better in my head when we don’t choose an
arbitrary loss function but are <em>forced</em> into that choice from the
assumptions we make. My professor once said (paraphrasing) being
Bayesian is all about being up-front and honest with our beliefs. We
want to honestly represent our assumptions here. A good first step
towards that is to do linear regression probabilistically (we’ll get to
the Bayesian bits eventually).</p>

<h1 id="4-maximum-likelihood-estimation">4. <strong>Maximum Likelihood Estimation</strong></h1>

<hr />

<p>Let’s begin by placing a distribution over the noise. Say, for example,
we choose \(\epsilon_x \sim \mathcal{N}(0, \sigma^2)\). So we have:</p>

\[\epsilon_x \sim \mathcal{N}(0, \sigma^2)\]

\[y_i \sim \mathcal{N}(w^T x_i, \sigma^2)\]

<p>The <strong>likelihood function</strong> is:</p>

\[\mathcal{L}(w \mid \mathcal{D}) = p(\mathcal{D} \mid w)\]

<p>We want to maximize the likelihood of observing the data (what does this
mean?). The noise terms are i.i.d., so each \(y_i\) is drawn from
\(\mathcal{N}(w^T x_i, \sigma^2)\) and the \(y_i\) are conditionally
independent given \(w\), the \(x_i\), and \(\sigma^2\). So we have:</p>

\[p(\mathcal{D} \mid w) = p(y_1, \dots, y_n \mid w, x_1, \dots, x_n, \sigma^2)\]

\[= \prod_{i = 1}^n \mathcal{N}(y_i ; w^T x_i, \sigma^2)\]

\[= \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left \{ - \frac{1}{2 \sigma^2} (y_i - w^T x_i)^2    \right \}\]

<p>The product of exponentials becomes a sum within the exponential. So we
have:</p>

\[= \frac{1}{(2 \pi \sigma^2)^\frac{n}{2}} \exp \left \{ - \frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - w^T x_i)^2    \right \}\]

<p>We can re-write the sum of squares as an inner product and save ourselves a lot of
mental load. My friend Clark actually pointed
this out to me and I never looked back (thank you!)</p>

\[= \frac{1}{(2 \pi \sigma^2)^\frac{n}{2}} \exp \left \{ - \frac{1}{2 \sigma^2} (y - Xw)^T (y - Xw)    \right \}\]

<p>Now, we want to maximize this likelihood with respect to \(w\). Since the
logarithm is a monotonic transformation, the log-likelihood and the
likelihood have the same extrema. So, we have:</p>

\[w_{\text{MLE}} = \operatorname{argmax}_w \ p(\mathcal{D} \mid w) = \operatorname{argmax}_w \ \log p(\mathcal{D} \mid w)\]

<p>So we have:</p>

\[\log p(\mathcal{D} \mid w) = - \frac{1}{2 \sigma^2} (y^Ty + w^T X^T X w - 2 y^T X w) - \frac{n}{2} \log(2 \pi \sigma^2)\]

\[0 = \frac{\partial}{\partial w} \log p(D \mid w) = - \frac{1}{2 \sigma^2} \frac{\partial}{\partial w} (y^T y + w^T X^T X w - 2 y^T X w)\]

<p>For convenience’s sake, I’ll treat the gradient as a column vector
although I generally prefer the other way.</p>

<p>Multiplying both sides by \(- 2 \sigma^2\), we have:</p>

\[0 = 2 X^T X w - 2 X^T y\]

\[X^T y = X^T X w\]

\[w_{\text{MLE}} = (X^T X)^{-1} X^T y\]

<p>This is no different from what we’ve already seen. But we can also
derive a maximum likelihood estimate for the noise variance.</p>

\[0 = \frac{\partial}{\partial \sigma^2} \log p(\mathcal{D} \mid w) = \frac{\partial}{\partial \sigma^2} \left [ - \frac{1}{2 \sigma^2} (y^Ty + w^T X^T X w - 2 y^T X w) - \frac{n}{2} \log(2 \pi \sigma^2) \right ]\]

\[0 = - \frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} (y - Xw)^T (y - Xw)\]

\[\frac{n}{\sigma^2} = \frac{1}{\sigma^4} (y - Xw)^T (y - Xw)\]

\[\sigma^2_{\text{MLE}} = \frac{1}{n} (y - Xw)^T (y - Xw)\]

<p>This is a little better - it’s nice to be able to estimate the variance
of the noise as well.</p>

<p>Math is nice but we need to check whether or not this formula actually
works. We implement these basic formulas to generate the following
image.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import matplotlib.pyplot as plt

torch.manual_seed(42)

x = 30.0 * torch.rand(100)
y = 3.0 * x + 40.0 + torch.randn(100) * torch.sqrt(torch.tensor(5.0))

X = torch.stack([torch.ones(100), x], dim=1)

w_MLE = torch.linalg.inv(X.T @ X) @ X.T @ y
y_pred = X @ w_MLE

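# Dividing by n - p = 100 - 2 gives the bias-corrected estimate; the MLE itself divides by n = 100.
residual = y - y_pred
sigma_2_mle = (residual @ residual) / (100 - 2)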
slope_str = f"Estimated Slope, Intercept, Noise Variance: {w_MLE[1]:.4f}, {w_MLE[0]:.4f}, {sigma_2_mle:.4f}, True Noise Variance: {5.0:.4f}"
# Estimated Slope: 2.9656, Intercept: 40.5160, Noise Variance: 3.3836, True Noise Variance: 5.0000

plt.title(slope_str, fontsize=10, color="black")
plt.plot(x, y, "o", label="Noisy Targets", color="blue")
plt.plot(x, y_pred, "-*",label="Predictions", color="red")
plt.legend()
plt.show()
</code></pre></div></div>

<p><img src="/images/MLE_linear_regression.png" alt="MLE Linear Regression" /></p>

<p>However, the MLE of the variance is a biased estimator. <strong>Bessel’s
correction</strong> divides \((y - Xw)^T (y - Xw)\) by \((n - p)\), where \(p\) is the
number of fitted parameters (the <strong>degrees of freedom</strong> used up by the fit), to get an unbiased estimator (in the
above case, \(p = 2\) - slope, intercept). The following GIF shows how
different estimators of the variance behave as the true noise variance changes.
To be honest, I initially found that the MLE / Bessel estimators were
overestimating the variance but figured out this was because of
the random seed. Over different random seeds, the MLE estimator tended
to generally underestimate the true noise variance.</p>

<p><img src="/images/lin_reg_mle.gif" alt="Variance Estimator GIF" /></p>

<p>The code for that can be found
<a href="https://github.com/alexali04/function_fitting/blob/main/func_learning/experiments/bayesian_lr/lin_reg.py">here</a>.</p>
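<p>The seed-to-seed behaviour can also be checked with a small simulation. The sketch below is not the linked script: the sample size \(n = 20\), the number of trials, and all variable names are assumptions made for illustration. It refits the same line on fresh noise many times and averages both estimators. Since \(E[(y - Xw)^T (y - Xw)] = \sigma^2 (n - p)\) at the least-squares fit, the MLE should average roughly \(\sigma^2 (n - p) / n\) and the corrected estimator roughly \(\sigma^2\).</p>

```python
import torch

torch.manual_seed(0)

n, p = 20, 2          # hypothetical sample size; p = slope + intercept
true_var = 5.0        # true noise variance
trials = 2000         # number of independent datasets

mle_estimates, bessel_estimates = [], []
for _ in range(trials):
    x = 30.0 * torch.rand(n, dtype=torch.float64)
    X = torch.stack([torch.ones(n, dtype=torch.float64), x], dim=1)
    y = 3.0 * x + 40.0 + torch.randn(n, dtype=torch.float64) * true_var ** 0.5

    # Least-squares fit, then the residual sum of squares.
    w = torch.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ w
    rss = residual @ residual

    mle_estimates.append(rss / n)            # MLE: divide by n
    bessel_estimates.append(rss / (n - p))   # corrected: divide by n - p

mle_mean = torch.stack(mle_estimates).mean().item()
bessel_mean = torch.stack(bessel_estimates).mean().item()
print(f"mean MLE estimate:       {mle_mean:.3f}")
print(f"mean corrected estimate: {bessel_mean:.3f}")
print(f"true noise variance:     {true_var:.3f}")
```

<p>Any single seed can land above or below the truth, which is why one run is not enough to see the bias.</p>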

<h2 id="41-issues-with-mle">4.1 <strong>Issues with MLE</strong></h2>

<hr />

<p>There are several issues with maximum likelihood estimation. But the
most glaring one, to me at least, is that it fundamentally answers the
wrong question. Do we <em>really</em> care about the probability of the data
given some parameter setting? I think the more natural question is the
probability of some parameter setting given the data.</p>

<h1 id="5-map">5. <strong>MAP</strong></h1>

<hr />

<p>One way to do this is with <strong>MAP</strong> or <strong>Maximum a Posteriori</strong>
estimation.</p>

<p>Bayes Rule states:</p>

\[p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w) p(w)}{p(\mathcal{D})}\]

<p>The left-hand side is called the <strong>posterior distribution</strong>. It’s
proportional to the likelihood multiplied with the prior distribution.
The prior represents our beliefs about the parameters before we have observed
the data - the posterior represents our updated beliefs after having
observed the data (likelihood).</p>

<p>To do this, we need to place a prior distribution on \(w\). We can specify:</p>

\[w \sim \mathcal{N}(0, \alpha^2 I)\]

\[\epsilon_x \sim \mathcal{N}(0, \sigma^2)\]

<p>With this noise model, the likelihood of all \(n\) targets can be written as a single multivariate Gaussian.</p>

\[y \mid w, X, \sigma^2 \sim \mathcal{N}(Xw, \sigma^2 I)\]

<p>The MAP estimate \(w_{\text{MAP}}\) is the parameter setting which
maximizes the probability of the parameter given the data. So we want:</p>

\[w_{\text{MAP}} = \operatorname{argmax}_{w} \ p(w \mid \mathcal{D})\]

<p>Again, we can take the logarithm, so we have:</p>

\[w_{\text{MAP}} = \operatorname{argmax}_{w} \ \log p(w \mid \mathcal{D})\]

\[= \operatorname{argmax}_{w} \ \log p(\mathcal{D} \mid w) + \log p(w) - \log p(\mathcal{D})\]

<p>The denominator of Bayes rule is called the marginal likelihood or the
partition function or the evidence. Since it doesn’t depend on \(w\), we
can just ignore it in finding \(w_{\text{MAP}}\). So we have:</p>

\[0 = \frac{\partial}{\partial w} \left [ \log p(\mathcal{D} \mid w) + \log p(w) \right ]\]

<p>Taking the logarithm of normal distributions,</p>

\[= \frac{\partial}{\partial w} \left [ 
- \frac{1}{2 \sigma^2} (y - Xw)^T (y - Xw) - \frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \alpha^2} w^T w - \frac{d}{2} \log (2 \pi \alpha^2)
\right ]\]

<p>We’ve performed bits and pieces of this derivative above. So we get</p>

\[0 = \frac{X^T y}{\sigma^2} - \frac{w}{\alpha^2} - \frac{X^T X w}{\sigma^2}\]

\[0 = X^T y - \frac{\sigma^2 w}{\alpha^2} - X^T X w\]

<p>Define \(\lambda = \frac{\sigma^2}{\alpha^2}\). Then, we have:</p>

\[0 = X^T y - \lambda w - X^T X w\]

\[(X^T X + \lambda I)w = X^T y\]

\[w_{\text{MAP}} = (X^T X + \lambda I)^{-1} X^T y\]

<p>We can reach the same formula by solving:</p>

\[w_{\text{MAP}} = \operatorname{argmin}_{w} (y - Xw)^T (y - Xw) + \lambda w^T w\]

<p>In other words, we can recover \(L_2\) regularization by assuming Gaussian
noise and a Gaussian prior. I like this probabilistic approach much
better because it all feels very motivated from our assumptions.</p>
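<p>As a quick numerical check of this equivalence, here is a minimal sketch on synthetic data. The dimensions, the variances \(\sigma^2\) and \(\alpha^2\), and the variable names are hypothetical values chosen for illustration. It computes \(w_{\text{MAP}}\) from the closed form and then uses autograd to confirm that the gradient of the \(L_2\)-regularized objective vanishes there.</p>

```python
import torch

torch.manual_seed(0)

n, d = 50, 3
sigma2, alpha2 = 0.5, 2.0            # hypothetical noise and prior variances
lam = sigma2 / alpha2                # lambda = sigma^2 / alpha^2

X = torch.randn(n, d, dtype=torch.float64)
w_true = torch.tensor([1.0, -2.0, 0.5], dtype=torch.float64)
y = X @ w_true + sigma2 ** 0.5 * torch.randn(n, dtype=torch.float64)

# Closed-form MAP estimate: (X^T X + lambda I)^{-1} X^T y
eye = torch.eye(d, dtype=torch.float64)
w_map = torch.linalg.solve(X.T @ X + lam * eye, X.T @ y)

# Gradient of the ridge objective evaluated at w_map, via autograd.
w = w_map.clone().requires_grad_(True)
objective = (y - X @ w) @ (y - X @ w) + lam * (w @ w)
objective.backward()
grad_norm = w.grad.norm().item()

print(f"w_MAP = {w_map.tolist()}")
print(f"gradient norm at w_MAP = {grad_norm:.2e}")
```

<p>Because the objective is strictly convex for \(\lambda > 0\), a vanishing gradient is enough to identify the unique minimizer.</p>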

<p>We can also derive MAP solutions for the noise and weight variances. The
MAP solution for the noise variance is the same as the MLE solution
because it’s not affected by the prior.</p>

\[0 = \frac{\partial}{\partial \alpha^2} \left [ 
 - \frac{1}{2 \alpha^2} w^T w - \frac{d}{2} \log (2 \pi \alpha^2)
\right ]\]

\[= \frac{1}{2 \alpha^4} w^T w - \frac{d}{2 \alpha^2}\]

\[\alpha^2_{\text{MAP}} = \frac{1}{d} w_{\text{MAP}}^T w_{\text{MAP}}\]

<p>Let’s gut-check. If we send the prior variance of the weight to \(0\),
i.e. \(\alpha^2 \to 0\), then the regularization coefficient \(\lambda\)
grows very large so the weights move towards \(0\). Similarly, if we send
the noise variance \(\sigma^2 \to \infty\),
we will see a similar result.</p>
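<p>The same gut-check can be run numerically. This is only a sketch under assumed values: the data, the grid of \(\lambda\) values, and the helper name <code>w_map</code> are all made up for illustration. It prints the norm of \(w_{\text{MAP}}\) as \(\lambda = \sigma^2 / \alpha^2\) grows, which happens either when \(\alpha^2 \to 0\) or when \(\sigma^2 \to \infty\).</p>

```python
import torch

torch.manual_seed(0)

X = torch.randn(40, 2, dtype=torch.float64)
w_true = torch.tensor([3.0, -1.0], dtype=torch.float64)   # hypothetical weights
y = X @ w_true + 0.1 * torch.randn(40, dtype=torch.float64)


def w_map(lam):
    """Closed-form MAP / ridge solution for a given lambda = sigma^2 / alpha^2."""
    eye = torch.eye(X.shape[1], dtype=torch.float64)
    return torch.linalg.solve(X.T @ X + lam * eye, X.T @ y)


lams = [0.0, 1.0, 100.0, 1e4, 1e6]
norms = [w_map(lam).norm().item() for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda = {lam:11.1f}   ||w_MAP|| = {norm:.6f}")
```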

<h1 id="6-bayesian-linear-regression">6. <strong>Bayesian Linear Regression</strong></h1>

<h2 id="61-marginal-likelihood">6.1 <strong>Marginal Likelihood</strong></h2>

<p>I want to talk briefly about the marginal likelihood (evidence,
partition function, normalizing constant) which is the denominator in
Bayes rule. By the sum and product rules of probability, we have:</p>

\[p(\mathcal{D}) = \int p(\mathcal{D} \mid w) \ p(w) \ dw\]

<p>By marginalizing out \(w\) (hence the name <strong>marginal</strong> likelihood), we
get a term which tells us the likelihood of the data, conditioned on
hyperparameters. This also lets us directly perform hyperparameter
optimization. Contrast optimizing this quantity with other
hyperparameter optimization techniques like grid search, whose cost is
exponential in the number of hyperparameters, or random
search which is … random.</p>

<p>Of course, the downside of this method (<strong>marginal likelihood
optimization</strong> or <strong>Type 2 MLE</strong>) is that it involves an integral which
is often intractable.</p>

<p>We’ll use the same prior and noise distribution as before and compute
the marginal likelihood. Buckle up.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

\[p(\mathcal{D}) = \int \mathcal{N}(y ; Xw, \sigma^2 I) \ \mathcal{N}(w ; 0, \alpha^2 I) \ dw\]

\[= c \int \exp \left \{
- \frac{1}{2} \left [
    \frac{(y - Xw)^T (y - Xw)}{\sigma^2} + \frac{w^T w}{\alpha^2}
    \right ]
\right \} \ dw\]

\[= c \int \exp \left \{
- \frac{1}{2} \left [
    \frac{y^T y + w^T X^T X w - 2 w^T X^T y}{\sigma^2} + \frac{w^T w}{\alpha^2}
    \right ]
\right \} dw\]

<p>The strategy here is to transform the exponential inside the integrand
into an unnormalized probability distribution over \(w\), namely the unnormalized posterior,
\(p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w) p(w)\). Moving the
\(y^T y\) term out of the integral (since it doesn’t depend on \(w\)), we
get:</p>

\[= c' \int \exp \left \{
- \frac{1}{2} \left [
    \frac{w^T X^T X w - 2 w^T X^T y}{\sigma^2} + \frac{w^T w}{\alpha^2}
    \right ]
\right \} dw\]

\[= c' \int \exp \left \{
- \frac{1}{2} \left [
    w^T \left ( \frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2} \right ) w - \frac{2 w^T X^T y}{\sigma^2}
    \right ]
\right \} dw\]

<p>We can infer the precision matrix
\(\Lambda = \frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2}\). We want the
exponent of the integrand (ignoring the factor of \(-\frac{1}{2}\)) to be of the form
\((w - \mu)^T \Lambda (w - \mu)\). We have:</p>

\[(w - \mu)^T \Lambda (w - \mu) = w^T \Lambda w + \mu^T \Lambda \mu - 2 w^T \Lambda \mu\]

<p>By inspection, we know that
\(2 w^T \Lambda \mu = \frac{2 w^T X^T y}{\sigma^2}\). So we have
\(\Lambda \mu = \frac{X^T y}{\sigma^2}, \mu = \Lambda^{-1} \frac{X^T y}{\sigma^2}\).</p>

<p>So we have:</p>

\[w^T \Lambda w - 2 w^T \Lambda \mu = (w - \mu)^T \Lambda (w - \mu) - \mu^T \Lambda \mu\]

<p>So we have:</p>

\[p(\mathcal{D}) = c' \int \exp \left \{ - \frac{1}{2} \left [
(w - \mu)^T \Lambda (w - \mu) - \mu^T \Lambda \mu
\right ]
\right \} \ dw\]

<p>Finally, we can take out the term \(\mu^T \Lambda \mu\) to get:</p>

\[p(\mathcal{D}) = c'' \int \exp \left \{ - \frac{1}{2} 
(w - \mu)^T \Lambda (w - \mu)
\right \} \ dw\]

<p>There’s a nice integral trick to know here. If we have a probability
distribution \(f(x) = \frac{\hat{f}(x)}{Z}\) where \(Z\) is the normalizing
constant, then since \(f\) integrates to \(1\), the integral of the unnormalized probability
distribution is \(\int \hat{f}(x) \ dx = Z\).</p>

<p>The integrand corresponds to an unnormalized normal distribution
\(\mathcal{UN}(w; \mu, \Lambda^{-1})\) (my own notation :P). So we have:</p>

\[\int \exp \left \{ - \frac{1}{2} 
(w - \mu)^T \Lambda (w - \mu)
\right \} \ dw = (2 \pi )^{\frac{d}{2}} \det(\Lambda^{-1})^{\frac{1}{2}}\]

\[= \frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]
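<p>This Gaussian integral is easy to sanity-check numerically in two dimensions. The sketch below uses a made-up precision matrix and mean (both hypothetical) and compares a brute-force Riemann sum over a wide grid against \((2 \pi)^{\frac{d}{2}} / \det(\Lambda)^{\frac{1}{2}}\).</p>

```python
import math
import torch

# Hypothetical precision matrix and mean, chosen only for illustration.
Lam = torch.tensor([[2.0, 0.5], [0.5, 1.0]], dtype=torch.float64)
mu = torch.tensor([0.3, -0.2], dtype=torch.float64)
d = 2

# Riemann sum of exp(-1/2 (w - mu)^T Lam (w - mu)) over a wide 2-D grid.
grid = torch.linspace(-10.0, 10.0, 801, dtype=torch.float64)
step = (grid[1] - grid[0]).item()
w1, w2 = torch.meshgrid(grid, grid, indexing="ij")
diff = torch.stack([w1.reshape(-1), w2.reshape(-1)], dim=1) - mu
quad_form = ((diff @ Lam) * diff).sum(dim=1)
numeric = (torch.exp(-0.5 * quad_form).sum() * step ** 2).item()

# Closed form: (2 pi)^{d/2} / det(Lam)^{1/2}
closed_form = (2 * math.pi) ** (d / 2) / math.sqrt(torch.linalg.det(Lam).item())

print(f"numeric integral: {numeric:.6f}")
print(f"closed form:      {closed_form:.6f}")
```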

<p>Finally, we’ve reached a closed form solution.</p>

\[p(\mathcal{D}) = c'' \frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]

\[= \frac{\exp \left \{ - \frac{1}{2} \left [ \frac{y^T y}{\sigma^2} - \mu^T \Lambda \mu  \right ]   \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (2 \pi \alpha^2)^{\frac{d}{2}}}
\frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]

<p>However, we can, again, rewrite this slightly. First of all, observe
that:</p>

\[\mu^T \Lambda \mu = \frac{y^T X}{\sigma^2} \Lambda^{-1} \Lambda \Lambda^{-1} \frac{X^T y}{\sigma^2}\]

\[= \frac{y^T X \Lambda^{-1} X^T y}{\sigma^4}\]

<p>So let’s rewrite the term in the exponential as:</p>

\[p(\mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} y^T \left [ \frac{I}{\sigma^2} - \frac{X \Lambda^{-1} X^T}{\sigma^4}  \right ] y   \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (2 \pi \alpha^2)^{\frac{d}{2}}}
\frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]

<p>Define
\(\Lambda_0 = \frac{I}{\sigma^2} - \frac{X \Lambda^{-1} X^T}{\sigma^4}\).
So we have</p>

\[p(\mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} y^T \Lambda_0 y   \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (2 \pi \alpha^2)^{\frac{d}{2}}}
\frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]

<p>We can honestly leave it here. But we won’t because we know there is a
Gaussian distribution hiding <em>somewhere</em> in there. I’m gonna go ahead
and take a shot in the dark that
\(p(\mathcal{D}) = \mathcal{N}(0, \Lambda_0^{-1})\). Let’s test that
hypothesis.</p>

<p>I’ll introduce two lemmas which are generally quite useful in Bayesian
statistics. The first one is the <strong>Woodbury Matrix Identity</strong>. It
states:</p>

\[(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}\]

<p>Define
\(C = \frac{\Lambda^{-1}}{\sigma^4}, U = -X, V = X^T, A = \frac{I}{\sigma^2}\).
Then, we have:</p>

\[\Lambda_0^{-1} = \sigma^2 I + \sigma^2 I X (\sigma^4 \Lambda - \sigma^2 X^T X)^{-1} X^T \sigma^2 I\]

<p>Substitute in \(\Lambda = \frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2}\).</p>

\[\Lambda_0^{-1} = \sigma^2 I + \sigma^4 X \left (\sigma^4 \left ( \frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2} \right ) - \sigma^2 X^T X \right )^{-1} X^T\]

\[\Lambda_0^{-1} = \sigma^2 I + \sigma^4 X \left ( 
    \sigma^2 X^T X + \frac{\sigma^4}{\alpha^2} I - \sigma^2 X^T X
    \right )^{-1} X^T\]

\[= \sigma^2 I + \sigma^4 X \left ( 
    \frac{\sigma^4}{\alpha^2} I 
    \right )^{-1} X^T\]

\[= \sigma^2 I + \alpha^2 X X^T\]
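<p>Since sign slips are easy to make here, it is worth checking the result numerically. A minimal sketch with random \(X\) and hypothetical variances follows. It builds \(\Lambda_0 = \frac{I}{\sigma^2} - \frac{X \Lambda^{-1} X^T}{\sigma^4}\) (the minus sign is what the choice \(U = -X\) encodes), inverts it by brute force, and compares against \(\sigma^2 I + \alpha^2 X X^T\).</p>

```python
import torch

torch.manual_seed(0)

n, d = 6, 3
sigma2, alpha2 = 0.7, 1.5   # hypothetical noise and prior variances
X = torch.randn(n, d, dtype=torch.float64)
I_n = torch.eye(n, dtype=torch.float64)
I_d = torch.eye(d, dtype=torch.float64)

# Precision matrix Lambda, as defined in the text.
Lam = X.T @ X / sigma2 + I_d / alpha2

# Lambda_0 = I / sigma^2 - X Lam^{-1} X^T / sigma^4
Lam0 = I_n / sigma2 - X @ torch.linalg.inv(Lam) @ X.T / sigma2 ** 2

lhs = torch.linalg.inv(Lam0)               # brute-force inverse
rhs = sigma2 * I_n + alpha2 * X @ X.T      # Woodbury result
max_abs_diff = (lhs - rhs).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```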

<p>When I derived this for the first time, I was actually surprised by how
nice this matrix inverse is.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>Next, let’s compute the determinant of this. What I’m hoping for is that
the determinant of the covariance matrix will absorb some of the
straggling terms in the normalizing coefficient(s). Recall, we have:</p>

\[p(\mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} y^T \Lambda_0 y   \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (2 \pi \alpha^2)^{\frac{d}{2}}}
\frac{(2 \pi )^{\frac{d}{2}}}{\det(\Lambda)^{\frac{1}{2}}}\]

<p>First, we can immediately cancel out \((2 \pi)^{\frac{d}{2}}\).</p>

\[p(\mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} y^T \Lambda_0 y   \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (\alpha^2)^{\frac{d}{2}} \det(\Lambda)^{\frac{1}{2}}}\]

<p>So now, we’ll use the <strong>determinant lemma</strong> to compute
\(\det(\Lambda_0^{-1})\). The determinant lemma states:</p>

\[\det(A + UWV^T) = \det(W^{-1} + V^T A^{-1} U) \det(W) \det(A)\]

\[\det(\sigma^2 I + X \alpha^2 I X^T) = \det(\frac{I}{\alpha^2} + X^T \frac{I}{\sigma^2} X) \det(\alpha^2 I) \det (\sigma^2 I)\]

<p>Recall that \(\Lambda = \frac{I}{\alpha^2} + \frac{X^T X}{\sigma^2}\). So
we have:</p>

\[\det(\Lambda_0^{-1}) = \det(\Lambda) (\alpha^2)^{d} (\sigma^2)^{n}\]

\[\det(\Lambda_0^{-1})^{\frac{1}{2}} = \det(\Lambda)^{\frac{1}{2}} (\alpha^2)^{\frac{d}{2}} (\sigma^2)^{\frac{n}{2}}\]

<p>So in conclusion, we have:</p>

\[p(\mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} y^T \Lambda_0 y \right \}}{(2 \pi)^{\frac{n}{2}} \det(\Lambda_0^{-1})^{\frac{1}{2}}}\]

\[\boxed{p(\mathcal{D}) = \mathcal{N}(y; 0, \sigma^2 I + \alpha^2 X X^T)}\]
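<p>To double-check the whole chain, we can evaluate the log evidence in two ways and confirm they agree. This is a sketch on random data with hypothetical variances. Route 1 uses the completed-square form from before the Woodbury step, with \(\mu^T \Lambda \mu\) entering with a minus sign inside the bracket. Route 2 evaluates the boxed Gaussian using <code>torch.distributions.MultivariateNormal</code>.</p>

```python
import math
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)

n, d = 8, 3
sigma2, alpha2 = 0.7, 1.5   # hypothetical noise and prior variances
X = torch.randn(n, d, dtype=torch.float64)
y = torch.randn(n, dtype=torch.float64)

Lam = X.T @ X / sigma2 + torch.eye(d, dtype=torch.float64) / alpha2
mu = torch.linalg.solve(Lam, X.T @ y / sigma2)

# Route 1: completed-square form, before Woodbury and the determinant lemma.
log_ev_direct = (
    -0.5 * (y @ y / sigma2 - mu @ Lam @ mu)
    - 0.5 * n * math.log(2 * math.pi * sigma2)
    - 0.5 * d * math.log(alpha2)
    - 0.5 * torch.logdet(Lam)
).item()

# Route 2: the boxed Gaussian N(y; 0, sigma^2 I + alpha^2 X X^T).
cov = sigma2 * torch.eye(n, dtype=torch.float64) + alpha2 * X @ X.T
zero_mean = torch.zeros(n, dtype=torch.float64)
log_ev_gauss = MultivariateNormal(zero_mean, covariance_matrix=cov).log_prob(y).item()

print(f"completed-square form: {log_ev_direct:.8f}")
print(f"boxed Gaussian:        {log_ev_gauss:.8f}")
```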

<p>Phew. In general, the evidence is very hard to compute, so we try to avoid
it as best we can. But with nice models like linear-Gaussian, we can do it.</p>
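<p>Since the evidence is available in closed form here, marginal likelihood optimization can be sketched in a few lines. Everything below is an assumption made for illustration rather than a recommended recipe: the synthetic data, the log-variance parameterization, the Adam optimizer, its learning rate, and the step count. The helper <code>log_evidence</code> is a hypothetical name. The code maximizes \(\log \mathcal{N}(y; 0, \sigma^2 I + \alpha^2 X X^T)\) with respect to \(\sigma^2\) and \(\alpha^2\) using autograd.</p>

```python
import math
import torch

torch.manual_seed(0)

n, d = 200, 3
X = torch.randn(n, d, dtype=torch.float64)
w_true = 2.0 * torch.randn(d, dtype=torch.float64)          # hypothetical weights
y = X @ w_true + 0.5 * torch.randn(n, dtype=torch.float64)  # true noise variance 0.25


def log_evidence(log_sigma2, log_alpha2):
    """log N(y; 0, sigma^2 I + alpha^2 X X^T), parameterised by log-variances."""
    eye = torch.eye(n, dtype=torch.float64)
    cov = torch.exp(log_sigma2) * eye + torch.exp(log_alpha2) * X @ X.T
    chol = torch.linalg.cholesky(cov)
    # cov^{-1} y via the Cholesky factor; log det(cov) = 2 * sum(log diag(chol)).
    solved = torch.cholesky_solve(y.unsqueeze(1), chol).squeeze(1)
    return -0.5 * y @ solved - chol.diagonal().log().sum() - 0.5 * n * math.log(2 * math.pi)


# Optimise log-variances so that both variances stay positive.
log_sigma2 = torch.zeros((), dtype=torch.float64, requires_grad=True)
log_alpha2 = torch.zeros((), dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.Adam([log_sigma2, log_alpha2], lr=0.05)

initial = log_evidence(log_sigma2, log_alpha2).item()
for _ in range(500):
    optimizer.zero_grad()
    loss = -log_evidence(log_sigma2, log_alpha2)
    loss.backward()
    optimizer.step()
final = log_evidence(log_sigma2, log_alpha2).item()

sigma2_hat = torch.exp(log_sigma2).item()
alpha2_hat = torch.exp(log_alpha2).item()
print(f"log evidence: {initial:.2f} (initial), {final:.2f} (optimised)")
print(f"estimated sigma^2 = {sigma2_hat:.3f}, estimated alpha^2 = {alpha2_hat:.3f}")
```

<p>With only \(d = 3\) weights, the estimate of \(\alpha^2\) is based on very little information, so it should be read loosely.</p>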

<p>Fun fact: the marginal likelihood is often written with a \(Z\) because it
stands for <em>Zustandssumme</em> or “sum over states” in German.</p>

<h2 id="62-posterior-distribution">6.2 <strong>Posterior Distribution</strong></h2>

<p>Finally, let’s compute the normalized posterior distribution.
Specifically, we want \(p(w \mid \mathcal{D})\). Good news - we’ve already
done this! The integrand of the marginal likelihood is our answer: we
have,</p>

\[p(w \mid \mathcal{D}) = \mathcal{N}(\mu, \Lambda^{-1})\]

\[\Lambda = \frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2}\]

\[\mu = \Lambda^{-1} \frac{X^T y}{\sigma^2}\]

<p>There are very simple ways to derive this from Gaussian identities.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>
But since most of the work is already kind of done for us, let’s derive
it the “hard” way.</p>

\[p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w) p(w)}{p(\mathcal{D})}\]

\[= \frac{\mathcal{N}(y; Xw, \sigma^2 I) \mathcal{N}(w; 0, \alpha^2 I)}{\mathcal{N}(y; 0, \alpha^2 X X^T + \sigma^2 I)}\]

<p>From the previous section, we can write the denominator:</p>

\[D_e = \frac{\exp \left \{ - \frac{1}{2} \left [ \frac{y^T y}{\sigma^2} - \mu^T \Lambda \mu \right ] \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (\alpha^2)^{\frac{d}{2}} \det(\Lambda)^{\frac{1}{2}}}\]

<p>and the numerator:</p>

\[N_u = \frac{\exp \left \{  - \frac{1}{2 \sigma^2} (y - Xw)^T (y - Xw) - \frac{1}{2 \alpha^2} w^T w  \right \}}{(2 \pi \sigma^2)^{\frac{n}{2}} (2 \pi \alpha^2)^{\frac{d}{2}}}\]

<p>Now, most of the constants cancel out except for \((2 \pi)^{\frac{d}{2}}\)
and \(\det(\Lambda)^{\frac{1}{2}}\). Let \(E(w)\) be the combined argument of the
exponentials.</p>

\[\frac{N_u}{D_e} = \frac{\exp \{ E(w) \}}{\sqrt{(2 \pi)^d \det(\Lambda^{-1})}}\]

<p>Finally, let’s put an end to this long fight and compute \(E(w)\). Just be
careful about signs.</p>

\[E(w) = - \frac{1}{2} \left [   
\frac{(y - Xw)^T(y - Xw)}{\sigma^2} + \frac{w^T w}{\alpha^2} + \mu^T \Lambda \mu - \frac{y^T y}{\sigma^2}
\right ]\]

\[E(w) = - \frac{1}{2} \left [   
\frac{y^T y + w^T X^T X w - 2 w^T X^T y}{\sigma^2} + \frac{w^T w}{\alpha^2} + \mu^T \Lambda \mu - \frac{y^T y}{\sigma^2}
\right ]\]

<p>Canceling out the \(\frac{y^T y}{\sigma^2}\) terms,</p>

\[= - \frac{1}{2} \left [   
    w^T \left (\frac{X^T X}{\sigma^2} + \frac{I}{\alpha^2} \right ) w - 2
\frac{w^T X^T y}{\sigma^2} + \mu^T \Lambda \mu 
\right ]\]

\[= - \frac{1}{2} \left [   
    w^T \Lambda w - 2 w^T \Lambda \mu + \mu^T \Lambda \mu 
\right ]\]

<p>In completing the square in the previous section, we found this to be
equal to:</p>

\[= - \frac{1}{2} \left [   
    (w - \mu)^T \Lambda (w - \mu) - \mu^T \Lambda \mu + \mu^T \Lambda \mu 
\right ]\]

\[= - \frac{1}{2} \left [   
    (w - \mu)^T \Lambda (w - \mu)
\right ]\]

<p>So we have:</p>

\[p(w \mid \mathcal{D}) = \frac{\exp \left \{ - \frac{1}{2} (w - \mu)^T \Lambda (w - \mu) \right \}}{\sqrt{(2 \pi)^{d} \det(\Lambda^{-1})}}\]

\[\boxed{w \mid \mathcal{D} \sim \mathcal{N}(\mu, \Lambda^{-1})}\]
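<p>In code, the posterior is just a couple of matrix operations. Here is a minimal sketch on synthetic data, with hypothetical values for \(\sigma^2\) and \(\alpha^2\). It computes \(\Lambda\), \(\mu\), and \(\Lambda^{-1}\), and checks that the posterior mean coincides with \(w_{\text{MAP}}\) for \(\lambda = \frac{\sigma^2}{\alpha^2}\), as it should for a Gaussian whose mean equals its mode.</p>

```python
import torch

torch.manual_seed(0)

n, d = 30, 2
sigma2, alpha2 = 5.0, 10.0   # hypothetical noise and prior variances
x = 3.0 * torch.rand(n, dtype=torch.float64)
X = torch.stack([torch.ones(n, dtype=torch.float64), x], dim=1)
y = 3.0 * x + 4.0 + sigma2 ** 0.5 * torch.randn(n, dtype=torch.float64)

I_d = torch.eye(d, dtype=torch.float64)
Lam = X.T @ X / sigma2 + I_d / alpha2          # posterior precision
post_cov = torch.linalg.inv(Lam)               # posterior covariance
post_mean = post_cov @ X.T @ y / sigma2        # posterior mean

# The posterior mean should coincide with w_MAP for lambda = sigma^2 / alpha^2.
w_map = torch.linalg.solve(X.T @ X + (sigma2 / alpha2) * I_d, X.T @ y)

print("posterior mean:     ", post_mean.tolist())
print("w_MAP:              ", w_map.tolist())
print("posterior std devs: ", post_cov.diag().sqrt().tolist())
```

<p>The extra information compared to MAP is the covariance \(\Lambda^{-1}\), which says how uncertain each weight is.</p>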

<p>So we have a full posterior distribution over \(w\). But how do we actually make
predictions?</p>

<h2 id="63-predictive-distribution">6.3 <strong>Predictive Distribution</strong></h2>

<p>The predictive distribution is pretty simple.</p>

\[p(y_* \mid X_*, \alpha^2, \sigma^2, \mathcal{D}) = \int p(y_* \mid X_*, w, \sigma^2) p(w \mid \mathcal{D}, \alpha^2) dw\]

<p>This is called the <strong>Bayesian Model Average</strong>. It is more robust to
overfitting than a single point estimate and I like it very much. It averages the predictions
for the output \(y_*\) made by every setting of \(w\), weighting each by the posterior probability of that
\(w\).</p>
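<p>For this linear-Gaussian model the integral has a standard closed form, which is not derived in this post: for a single test input \(x_*\), \(y_* \mid x_*, \mathcal{D} \sim \mathcal{N}(x_*^T \mu, \ \sigma^2 + x_*^T \Lambda^{-1} x_*)\). The sketch below compares that closed form against a Monte Carlo version of the integral, which samples \(w\) from the posterior and then adds noise. The data, the variances, and the test input are all hypothetical.</p>

```python
import torch

torch.manual_seed(0)

n, d = 30, 2
sigma2, alpha2 = 5.0, 10.0   # hypothetical noise and prior variances
x = 3.0 * torch.rand(n, dtype=torch.float64)
X = torch.stack([torch.ones(n, dtype=torch.float64), x], dim=1)
y = 3.0 * x + 4.0 + sigma2 ** 0.5 * torch.randn(n, dtype=torch.float64)

Lam = X.T @ X / sigma2 + torch.eye(d, dtype=torch.float64) / alpha2
post_cov = torch.linalg.inv(Lam)
post_mean = post_cov @ X.T @ y / sigma2

# Hypothetical test input: bias feature 1 and x = 2.
x_star = torch.tensor([1.0, 2.0], dtype=torch.float64)

# Closed-form predictive mean and variance.
pred_mean = (x_star @ post_mean).item()
pred_var = (sigma2 + x_star @ post_cov @ x_star).item()

# Monte Carlo version: w = mean + L z with L L^T = posterior covariance, then add noise.
num_samples = 200_000
chol = torch.linalg.cholesky(post_cov)
w_samples = post_mean + torch.randn(num_samples, d, dtype=torch.float64) @ chol.T
noise = sigma2 ** 0.5 * torch.randn(num_samples, dtype=torch.float64)
y_samples = w_samples @ x_star + noise
mc_mean, mc_var = y_samples.mean().item(), y_samples.var().item()

print(f"closed form: mean = {pred_mean:.3f}, variance = {pred_var:.3f}")
print(f"Monte Carlo: mean = {mc_mean:.3f}, variance = {mc_var:.3f}")
```

<p>The predictive variance has two parts: the irreducible noise \(\sigma^2\) and the term \(x_*^T \Lambda^{-1} x_*\), which reflects remaining uncertainty about \(w\).</p>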

<h1 id="7-conclusion">7. <strong>Conclusion</strong></h1>

<p>This post is long with a lot of tedious math so if you got through it, I
think you should feel proud. We went over linear regression with MSE,
MAP, and the BMA which is quite a lot. I hope you learned something too!
I would estimate that the knowledge in this post probably took around 9
months to accumulate? Obviously, this is not the best metric for how
<em>long</em> it takes to learn these different perspectives because I wasn’t
actively seeking out said perspectives during those 9 months. I learned
the standard MSE perspective in my first machine learning class and then
MAP and the BMA I learned in a Bayesian machine learning class. Very
interesting stuff.</p>

<p>Just a note on marginal likelihood optimization (also called empirical
Bayes or Type 2 MLE). The marginal likelihood answers a subtly different
question than what one might want. I <em>believe</em> it relates to the
likelihood of the training data under a prior model rather than
questions of which hyperparameters generalize best. Still, it’s often
treated as a proxy to such.</p>

<h1 id="8-footnotes">8. <strong>Footnotes</strong></h1>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>All of these examples work with feature transformations on \(X\),
i.e. replacing \(X\) with \(\Phi\). Linear regression is super
powerful… when it can be made non-linear! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Linear algebra is your friend. Probably your best friend,
actually. It allows a very easy and simple derivation of linear
regression.</p>

      <p>We know that \(y\) is a vector and the solution \(X w\) is the vector
in \(\text{span}(X)\) which is closest to \(y\). In other words, \(Xw\) is the orthogonal
projection of \(y\) onto \(\text{span}(X)\), the column space of \(X\). This means the error vector
\(e = y - Xw\) is orthogonal to every column of \(X\). So we have:</p>

\[X^T e = 0\]

\[X^T (y - Xw) = 0\]

\[X^T y - X^T X w = 0\]

\[w = (X^T X)^{-1} X^T y\]
      <p><a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Bayesian statistics math is notoriously tedious. There’s a much
easier way to do this by treating it as a Gaussian Process. I’ll
derive it quickly here. Suppose
\(w \sim \mathcal{N}(0, \alpha^2 I), \epsilon_x \sim \mathcal{N}(0, \sigma^2 I)\).</p>

\[y = Xw + \epsilon_x\]

\[E[y]= E[Xw] + E[\epsilon_x]\]

      <p>\(E[\epsilon_x] = 0\) since it has mean \(0\). For the same reason,</p>

\[E[Xw] = X E[w] = 0\]

      <p>So \(E[y] = 0\). Next, we’ll compute the covariance matrix.</p>

\[E[y y^T] = E[(Xw + \epsilon_x)(Xw + \epsilon_x)^T]\]

      <p>Order matters here. These are all matrices so we cannot interchange
the order of multiplication between \(Xw\) and \(w^T X^T\).</p>

\[= E[Xw w^T X^T + \epsilon_x \epsilon_x^T + \epsilon_x w^T X^T + X w \epsilon_x^T ]\]

      <p>The cross terms go to \(0\). Since \(\epsilon_x\) and \(w\) are
independent, \(E[\epsilon_x w^T] = E[\epsilon_x] E[w]^T = 0\).</p>

\[= X E[ww^T] X^T + E[\epsilon_x \epsilon_x^T]\]

\[= \sigma^2 I + \alpha^2 X X^T\]

      <p>Much easier… : ) <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Sometimes it’s a lot nicer to use the precision matrix (inverse of
covariance matrix). I kind of want to make a post later about what
the precision matrix <em>actually</em> means - it has something to do with
partial correlations. Here, the precision matrix is quite ugly but
the covariance matrix is really nice. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Once we know the mean \(\mu\) and precision matrix \(\Lambda\), we can
kind of just infer that
\(w \mid \mathcal{D} \sim \mathcal{N}(\mu, \Lambda^{-1})\). I made a
mistake which confused me for a few days - I thought
\(p(w \mid \mathcal{D}) = \frac{\mathcal{N}(\mu, \Lambda)}{\mathcal{N}(0, \alpha^2 XX^T + \sigma^2 I)}\).
But this isn’t true - we need to integrate the unnormalized
\(p(\mathcal{D} \mid w) p(w)\) to get the marginal likelihood. I guess
what I’m trying to get at is… Bayes’ formula never lies?</p>

      <p>With normal-normal models, we can also just take advantage of some
nice conditional Gaussian identities.</p>

      <p>Suppose</p>

\[p(x) = \mathcal{N}(x ; \mu, \Lambda^{-1})\]

\[p(y \mid x) = \mathcal{N}(y; Ax + b, L^{-1})\]

      <p>Then,</p>

\[p(x \mid y) = \mathcal{N}(x; \Sigma(A^T L(y - b) + \Lambda \mu), \Sigma)\]

\[\Sigma = (\Lambda + A^T L A)^{-1}\]

      <p>This will naturally give us our result. But, it’s good for our
confidence to do it the hard (redundant, inefficient, not-clever)
way at least once. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Alex Ali</name><email>alexali000@gmail.com</email></author><summary type="html"><![CDATA[Comprehensive overview of linear regression.]]></summary></entry></feed>