<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://elena-orlova.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://elena-orlova.github.io/" rel="alternate" type="text/html" /><updated>2025-12-18T12:13:37-08:00</updated><id>https://elena-orlova.github.io/feed.xml</id><title type="html">Elena Orlova</title><subtitle>personal description</subtitle><author><name>Elena Orlova</name><email>egorlova68@gmail.com</email></author><entry><title type="html">Deep Stochastic Mechanics</title><link href="https://elena-orlova.github.io/dsm/" rel="alternate" type="text/html" title="Deep Stochastic Mechanics" /><published>2024-06-01T00:00:00-07:00</published><updated>2024-06-01T00:00:00-07:00</updated><id>https://elena-orlova.github.io/dsm</id><content type="html" xml:base="https://elena-orlova.github.io/dsm/"><![CDATA[<p>This post is based on <a href="https://proceedings.mlr.press/v235/orlova24a.html">“Deep Stochastic Mechanics”</a> paper. Here, we’d like to explain the main ideas and show some results from this paper.</p>

<p>In quantum physics, accurately simulating and predicting the behavior of particles is a computationally challenging task due to the curse of dimensionality. The computational complexity grows exponentially as the number of particles in the system increases, making it difficult to study large-scale quantum systems using traditional methods.</p>

<p>Enter Deep Stochastic Mechanics (DSM), a novel approach that leverages deep learning to simulate quantum dynamics efficiently. It is a neural network (NN)-based method that directly samples from the probability density of the wave function, bypassing the need to estimate the wave function itself explicitly.</p>

<h1 id="solving-schrödinger-equation">Solving Schrödinger equation</h1>

<p>At the heart of quantum mechanics lies the <strong>Schrödinger equation</strong> (SE) for $0 &lt; t \le T$ and $\forall x\in \mathbb{R}^d$:</p>

\[i \hbar \partial_{t} \psi (x, t) = \Big[-\frac{\hbar^2}{2m} \frac{\partial^2}{\partial x^2} + V(x, t)\Big] \psi(x, t),\]

<p>given an initial condition</p>

\[\psi(x, 0) = \psi_{0}(x),\]

<p>where $m$ is the particle’s mass, $V(x, t)$ is the potential function describing the physics of the system, and $\psi(x, t): \mathbb{R}^d \times [0, T]\rightarrow \mathbb{C}$ is the <strong>wave function</strong>.</p>

<p>The <strong>probability density</strong> of finding a particle at position $x$ at time $t$ is</p>

\[\rho(x,t) = |\psi (x, t)|^2.\]

<hr />

<h2 id="goal-given-an-initial-wave-function--psi_0x-draw-samples-from-psi-x-t2-for-t-in-0t">Goal: given an initial wave function  $\psi_0(x)$, draw samples from $|\psi (x, t)|^2$ for $t \in (0,T]$.</h2>

<p>One possible solution is to directly solve the SE for $\psi (x, t)$ using, for example, finite difference methods. Another is Monte Carlo methods, which rely on random sampling: they use a variational ansatz (a parametrized wave function) to approximate the true wave function. Existing methods for solving the time-dependent SE face significant challenges:</p>
<ul>
  <li><em>Classical numerical solvers</em> require discretizing the problem on a grid, leading to an <em>exponential growth</em> in computational complexity as the dimensionality increases.</li>
  <li><em>Physics-informed neural networks (PINNs)</em> <a href="https://www.sciencedirect.com/science/article/pii/S0021999118307125">[Raissi, 2017]</a> are an NN-based version of numerical solver that also suffer from an <em>exponential growth</em> of collocation points.</li>
  <li>Variational methods like <em>time-dependent Variational Monte Carlo (t-VMC)</em> <a href="https://journals.aps.org/prx/abstract/10.1103/PhysRevX.7.031026">[Carleo, 2017]</a> can bypass the curse of dimensionality. However, their accuracy heavily depends on choosing a suitable ansatz (i.e., good priors on $\psi$). Additionally, the optimization process used to find the optimal ansatz parameters may suffer from numerical instabilities, depending on the method and initial conditions.</li>
</ul>

<blockquote>
  <p>What if we can directly sample from the density $\vert \psi (x, t)\vert^2$ without estimating the wave function $\psi(x, t)$?</p>
</blockquote>

<h1 id="dsm-method">DSM method</h1>

<p>DSM takes a different approach by leveraging Nelson’s stochastic mechanics <a href="https://journals.aps.org/pr/abstract/10.1103/PhysRev.150.1079">[Nelson, 1966]</a>, which establishes an equivalence between the time-dependent Schrödinger equation and a diffusion process. Assuming $\psi (x, t) = \sqrt{\rho(x, t)}e^{iS(x, t)}$, we define</p>

\[\begin{align*}
\text{ current velocity: } v(x, t) &amp;= \frac{\hbar}{m} \nabla S(x, t), \\
\text{ osmotic velocity: } u(x, t) &amp;= \frac{\hbar}{2m} \nabla \log \rho(x, t).
\end{align*}\]

<p>Our method relies on the following stochastic process:</p>

\[\mathrm{d}{\color{2D9090}X(t)} = \Big( {\color{982715}v} \big( {\color{2D9090}X(t)}, t \big)+ {\color{982715}u} \big({\color{2D9090}X(t)}, t \big) \Big)\mathrm{d}t + \sqrt{\frac{ \hbar}{m} }\mathrm{d} W, \qquad {\color{2D9090}X(0)} \sim \big|\psi_{0}\big|^2,\]

<p>which <strong>corresponds to sampling from</strong> $\rho = \vert \psi (x, t)\vert^2$, where $u$ is the osmotic velocity, $v$ is the current velocity, and $W$ is a standard (forward) Wiener process. The process $X(t)$ is called the <em>Nelsonian process</em>.</p>

<p>We parametrize the velocities $u, v$ via NNs, yielding a new process ${\color{2D9090}X^\theta(t)} \in \mathbb{R}^d$ that approximates the true process $X(t)$:</p>

\[\mathrm{d}{\color{2D9090}X^\theta(t)} = \Big({\color{982715}v_{\theta}} \big({\color{2D9090}X^\theta(t)}, t \big)+ {\color{982715}u_{\theta} }\big({\color{2D9090}X^\theta(t)}, t \big) \Big)\mathrm{d}t + \sqrt{\frac{ \hbar}{m} }\mathrm{d} {W}.\]

<p>Discretizing this process in time, we get</p>

\[{\color{2D9090}X^\theta_{i+1}} = {\color{2D9090}X^\theta_{i}} + \big({\color{982715}v_{\theta}}({\color{2D9090}X^\theta_{i}}, t_{i})+ {\color{982715}u_{\theta}}({\color{2D9090}X^\theta_{i}}, t_{i}) \big)\epsilon + z,\]

<p>where $\epsilon &gt; 0$ is a time step size, $0 \le i &lt; \frac{T}{\epsilon}$, and $z \sim \mathcal{N}\big(0, \frac{\hbar}{m}  \epsilon I_{d}\big)$.</p>
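<p>To make the recursion concrete, here is a minimal sketch of this sampling loop (an illustration, not the paper’s implementation); it assumes the velocities are given as plain callables and that $\vert\psi_0\vert^2$ is a standard Gaussian:</p>

```python
import numpy as np

def sample_trajectories(u, v, x0, T=1.0, eps=0.01, hbar=1.0, m=1.0):
    """Euler-Maruyama rollout of the Nelsonian process.

    u, v : callables (x, t) -> velocity, each returning an array shaped like x
    x0   : initial samples drawn from |psi_0|^2, shape (batch, d)
    Returns an array of shape (n_steps + 1, batch, d).
    """
    n_steps = int(T / eps)
    xs, x = [x0], x0
    for i in range(n_steps):
        t = i * eps
        z = np.sqrt(hbar / m * eps) * np.random.randn(*x.shape)  # z ~ N(0, (hbar/m) eps I)
        x = x + (v(x, t) + u(x, t)) * eps + z                    # drift step plus noise
        xs.append(x)
    return np.stack(xs)

# Toy check: with zero velocities the process reduces to Brownian motion.
x0 = np.random.randn(512, 1)  # pretend |psi_0|^2 is a standard Gaussian
traj = sample_trajectories(lambda x, t: 0 * x, lambda x, t: 0 * x, x0)
print(traj.shape)  # (101, 512, 1)
```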

<hr />

<p>Given trained velocities $u_\theta, v_\theta$, and the initial condition $X_0 \sim \vert \psi_{0}\vert^2$, we can produce samples from $\rho$.</p>

<hr />

<h3 id="how-to-train-velocities-u_theta-v_theta">How to train velocities $u_\theta, v_\theta$?</h3>

<p>The Schrödinger equation tells us the velocities should satisfy</p>

\[\begin{align}
\partial_{t} v_\theta &amp;= -\frac{1}{m} \nabla V + \langle u_\theta, \nabla u_\theta \rangle - \langle v_\theta, \nabla v_\theta \rangle + \frac{\hbar}{2m} \nabla \big(\text{div }  u_\theta \big), \label{eq1}\\
\partial_{t} u_\theta &amp;=  - \nabla \langle v_\theta, u_\theta\rangle - \frac{\hbar}{2m} \nabla \big(\text{div }  v_\theta \big), \label{eq2}
\end{align}\]

<p>where $\nabla = \Big(\frac{\partial}{\partial x_{1}} , \ldots,\frac{\partial}{\partial x_{d}} \Big)$ is the gradient, $\langle \cdot , \cdot  \rangle$ is the scalar product, and $\text{div } f(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i}f(x)$ is the divergence operator.</p>

<p>Additionally, the initial velocities should follow the initial conditions</p>

\[v_\theta(x, 0) = \frac{\hbar}{m}\nabla S_0(x) \quad \text{and} \quad u_\theta(x, 0) = \frac{\hbar}{2m}  \nabla \log \rho_0(x) \label{eq:ic}\]

<p>Equations (\ref{eq1}), (\ref{eq2}), and (\ref{eq:ic}) define the loss terms</p>

\[\begin{align}
  \mathcal{L}_1 (v_{\theta}, u_{\theta}) &amp;= \Big\| \partial_{t} v_\theta +\frac{1}{m} \nabla V - \langle u_\theta, \nabla u_\theta\rangle + \langle v_\theta, \nabla v_\theta\rangle - \frac{\hbar}{2m} \nabla \big(\text{div }  u_\theta \big) \Big\|_2, \\
  \mathcal{L}_2 (v_{\theta}, u_{\theta}) &amp;= \Big \| \partial_{t} u_\theta + \nabla \langle v_\theta, u_\theta\rangle + \frac{\hbar}{2m} \nabla \big(\text{div }  v_\theta \big) \Big \|_2,\\
  \mathcal{L}_3 (v_{\theta}, u_{\theta}) &amp;= \| u_\theta (x, 0) - u_0(x) \|_2 + \| v_\theta (x, 0) - v_0(x) \|_2
\end{align}\]

<p>Then, our loss function to minimize is</p>

\[\mathcal{L} (v_{\theta}, u_{\theta}) = \sum_{i=1}^3 \mathcal{L}_i (v_{\theta}, u_{\theta}).\]

<ul>
  <li>The trajectories are generated iteratively in time: $X^\theta_{i+1} = X^\theta_{i} + \big(v_{\theta}(X^\theta_{i}, t_{i})+ u_{\theta}(X^\theta_{i}, t_{i}) \big)\epsilon + z$.</li>
  <li>At every epoch, we generate a batch of trajectories $\{ X^\theta_{i, j} \}$, where $i$ indexes the time step and $j$ indexes the sample.</li>
  <li>These trajectories are used to evaluate the loss function and update the models’ weights.</li>
</ul>
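<p>The steps above can be sketched in code. This is a condensed, one-dimensional illustration rather than the paper’s implementation: it assumes a harmonic potential $V = m\omega^2 x^2/2$ with $\omega = 1$, uses squared PDE residuals as the loss, and omits the initial-condition term $\mathcal{L}_3$ for brevity:</p>

```python
import torch

torch.manual_seed(0)
d, hbar, m, eps, n_steps, batch = 1, 1.0, 1.0, 0.05, 20, 64

def mlp():  # small velocity network taking (x, t)
    return torch.nn.Sequential(
        torch.nn.Linear(d + 1, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, d))

u_net, v_net = mlp(), mlp()
opt = torch.optim.Adam(list(u_net.parameters()) + list(v_net.parameters()), lr=1e-3)

def dx(f, x):  # derivative via autograd (valid for d = 1)
    return torch.autograd.grad(f.sum(), x, create_graph=True)[0]

def residuals(x, t):
    x, t = x.requires_grad_(True), t.requires_grad_(True)
    u = u_net(torch.cat([x, t], 1))
    v = v_net(torch.cat([x, t], 1))
    grad_V = m * x  # gradient of the assumed harmonic potential
    u_x, v_x = dx(u, x), dx(v, x)
    r1 = dx(v, t) + grad_V / m - u * u_x + v * v_x - hbar / (2 * m) * dx(u_x, x)
    r2 = dx(u, t) + dx(u * v, x) + hbar / (2 * m) * dx(v_x, x)
    return (r1 ** 2).mean() + (r2 ** 2).mean()

for epoch in range(2):  # a couple of epochs, just to show the loop
    xs, ts = [], []
    x = torch.randn(batch, d)  # X_0 ~ |psi_0|^2 (standard Gaussian here)
    for i in range(n_steps):   # 1) roll out trajectories with the current networks
        t = torch.full((batch, 1), i * eps)
        xs.append(x); ts.append(t)
        with torch.no_grad():
            drift = v_net(torch.cat([x, t], 1)) + u_net(torch.cat([x, t], 1))
        x = x + drift * eps + (hbar / m * eps) ** 0.5 * torch.randn_like(x)
    # 2) evaluate the PDE residuals on the sampled points and update the weights
    loss = residuals(torch.cat(xs), torch.cat(ts))
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```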

<pre>

</pre>

<p><img src="../../../../images/DSM_plot.pdf?raw=true | width=10" alt="F1" /></p>
<div align="center">(a) DSM training scheme: at every epoch $\tau$, we generate $B$ full trajectories $\{ X_{ij}\}_{ij}$, $i=0, ..., N$, $j=1, ..., B$. Then, we update the weights of our NNs.  (b) An illustration of sampled trajectories at the early epoch. (c) An illustration of sampled trajectories at the final epoch.  (d) Collocation points for a grid-based solver where it should predict values of $\psi(x, t)$. Blue regions in the plots correspond to higher-density regions. </div>

<pre>

</pre>

<hr />

<h2 id="instead-of-explicitly-estimating-the-wave-function-psix-t-dsm-directly-samples-from-the-corresponding-probability-density-vert-psix-tvert2-by-parametrizing-the-velocities-of-the-diffusion-process-using-neural-networks">Instead of explicitly estimating the wave function $\psi(x, t)$, DSM directly samples from the corresponding probability density $\vert \psi(x, t)\vert^2$ by parametrizing the velocities of the diffusion process using neural networks.</h2>

<h3 id="theoretical-guarantee">Theoretical guarantee</h3>

<p><strong>Theorem (Strong convergence bound).</strong> We have the following bound between the processes $X$ (the Nelsonian process) and $X^\theta$ (its approximation with $u_\theta, v_\theta$):</p>

\[\begin{align*}
    \sup_{t\le T} \mathbb{E}\|X(t) - X^\theta(t)\|^2 \le C_{T} \mathcal{L}(v_{\theta}, u_{\theta}),
  \end{align*}\]

<p>where the constant $C_T$ depends on the time horizon $T$ and the Lipschitz constants of $u, v, u_\theta, v_\theta$.</p>

<p>This theorem means that minimizing the loss drives the neural process $X^\theta$ toward the Nelsonian process $X$: the loss value directly bounds the error between the two processes.</p>

<h1 id="experimental-results">Experimental results</h1>

<p>Interacting bosons in a harmonic potential:</p>

\[\begin{align*}
  V(x, t) = \sum_i \frac{1}{2} m \omega^2 x_i^2 + \frac{1}{2} g \sum_{i, j} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x_i - x_j)^2 / 2 \sigma^2},
\end{align*}\]

<p>with an initial condition</p>

\[\begin{align*}
  \psi(x, 0) = e^{-\omega^2x^2/(2\hbar)},
\end{align*}\]

<p>where $g$ controls the interaction strength.</p>
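<p>For concreteness, the potential can be written as a short function. The parameter values below ($m$, $\omega$, $g$, $\sigma$) are placeholders, and the pairwise sum runs over all index pairs exactly as written above:</p>

```python
import numpy as np

def potential(x, m=1.0, omega=1.0, g=1.0, sigma=0.1):
    """Harmonic trap plus pairwise Gaussian interaction for d bosons.

    x : positions of the d particles, shape (d,)
    """
    harmonic = 0.5 * m * omega ** 2 * np.sum(x ** 2)
    diffs = x[:, None] - x[None, :]  # matrix of pairwise differences x_i - x_j
    interaction = 0.5 * g * np.sum(
        np.exp(-diffs ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2))
    return harmonic + interaction

print(potential(np.array([1.0, -1.0])))  # two well-separated particles
```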

<ul>
  <li>A numerical solution (Crank-Nicolson method) serves as the baseline; we also compare with PINNs and t-VMC.</li>
  <li>Comparing density and some statistics (mean and variance of coordinate as a function of time).</li>
  <li>An NN architecture for DSM/PINN: a feed-forward linear model with skip connections and tanh activations.</li>
  <li>A t-VMC ansatz representation: Hermite polynomials with two-body interaction terms that inherently incorporate knowledge about the ground truth solution. NN ansatz parameterization did not yield satisfactory results.</li>
</ul>

<pre>

</pre>

<p><img src="../../../../images/tVMC-plots-camera-copy.pdf?raw=true | width=10" alt="F2" /></p>
<div align="center"> Simulation results for two interacting bosons in the harmonic oscillator. </div>

<pre>

</pre>

<p>Let’s try to run the simulation for more particles:</p>
<ul>
  <li>The proposed DSM approach demonstrates robust performance, accurately following the ground truth and providing reasonable predictions for $d = 3, 4, 5$ interacting bosons.</li>
  <li>Our findings indicate that the t-VMC method can perform reasonably for low-dimensional systems, but its performance degrades as the number of interacting particles increases. This highlights the need for a scalable and carefully designed ansatz representation capable of capturing the complex behavior of particles in high-dimensional quantum systems.</li>
</ul>

<pre>

</pre>

<p><img src="../../../../images/densities_compare_high_d.pdf?raw=true | width=10" alt="F3" /></p>
<div align="center"> Probability density plots for different numbers of interacting particles $d$. For five particles, running the Crank-Nicolson solver was infeasible on our hardware. </div>

<pre>

</pre>

<p>There are more experiments, including scaling studies, in our <a href="https://proceedings.mlr.press/v235/orlova24a.html">full paper</a>.</p>

<h2 id="conclusions">Conclusions</h2>


<ul>
  <li>We developed a <strong>new, efficient computational method</strong> for simulating quantum dynamics based on <strong>Nelson’s stochastic mechanics</strong>.</li>
  <li>It relies on <em>Markovian diffusion</em> and <em>does not require training data</em>.</li>
  <li>It is <em>adaptive</em> to a latent low-dimensional support of the density.</li>
  <li>We provide <strong>theoretical guarantees</strong> for our DSM method.</li>
  <li>Experiments show <em>better performance</em> than numerical solvers, PINNs, and t-VMC in both prediction quality and computation time.</li>
</ul>

<hr />
<p>Our DSM algorithm is a new approach to simulating quantum dynamics (solving the time-dependent Schrödinger equation) and a potential alternative to t-VMC methods, so some <strong>challenges remain to be resolved</strong>. For example:</p>
<ul>
  <li>We studied relatively simple bosonic systems (with which existing methods nevertheless struggle). How can the approach be extended to fermions?</li>
  <li>We considered a linear, spinless SE on a flat manifold with a smooth potential; relaxing these assumptions remains open.</li>
  <li>The algorithm itself deserves a more detailed study, including more precise error bounds.</li>
</ul>

<h2 id="references">References</h2>

<ol>
  <li>
    <p>Nelson, Edward. “Derivation of the Schrödinger equation from Newtonian mechanics.” Physical review 150.4 (1966): 1079.</p>
  </li>
  <li>
    <p>Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. “Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.” Journal of Computational physics 378 (2019): 686-707.</p>
  </li>
  <li>
    <p>Carleo, Giuseppe, et al. “Unitary dynamics of strongly interacting bose gases with the time-dependent variational monte carlo method in continuous space.” Physical Review X 7.3 (2017): 031026.</p>
  </li>
</ol>]]></content><author><name>Elena Orlova</name><email>egorlova68@gmail.com</email></author><category term="quantum_mechanics" /><category term="ML" /><category term="generative_models" /><summary type="html"><![CDATA[This post is based on “Deep Stochastic Mechanics” paper. Here, we’d like to explain the main ideas and show some results from this paper.]]></summary></entry><entry><title type="html">Tensor Train and Tensor Ring Decompositions for Neural Networks Compression</title><link href="https://elena-orlova.github.io/posts/TT_TR_NN_compression/" rel="alternate" type="text/html" title="Tensor Train and Tensor Ring Decompositions for Neural Networks Compression" /><published>2023-01-27T00:00:00-08:00</published><updated>2023-01-27T00:00:00-08:00</updated><id>https://elena-orlova.github.io/posts/TT_TR_NN_compression</id><content type="html" xml:base="https://elena-orlova.github.io/posts/TT_TR_NN_compression/"><![CDATA[<p>This post is based on <a href="https://arxiv.org/pdf/1901.10787.pdf">“Tensorized Embedding Layers” paper</a>. Here, I’d like to explain the main ideas from this paper and show some results.</p>

<p>One of the key components of natural language processing (NLP) models is the embedding layer, which transforms input words into real vectors and can be represented as a lookup table (a matrix). A large vocabulary leads to enormous weight matrices: state-of-the-art NLP networks have millions to billions of parameters. However, computational resources are often limited, which is an essential problem in NLP research. What can we do about that?</p>

<blockquote>
  <p>The purpose of tensor decompositions is to represent a given tensor as a product of smaller tensors called cores with fewer parameters while preserving important information.</p>
</blockquote>

<p>Tensor decompositions, such as Tucker decomposition, canonical decomposition, and Tensor Train (TT) decomposition <a href="https://doi.org/10.1137/090752286">(1)</a>, can be applied for dimensionality reduction in a variety of tasks, for instance signal and data compression or compression of neural network layers. In the latter case, model parameters are factorized into smaller cores of the corresponding tensor decomposition. For example, TT decomposition was used to compress a linear layer <a href="https://proceedings.neurips.cc/paper/2015/hash/6855456e2fe46a9d49d3d3af4f57443d-Abstract.html">(2)</a>, which was later extended to compressing convolutional layers with canonical decomposition <a href="https://arxiv.org/pdf/1412.6553">(3)</a>. The same holds for Tensor Ring (TR) decomposition <a href="https://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Wide_Compression_Tensor_CVPR_2018_paper.pdf">(4)</a>.</p>

<p>Here, I’d like to show how TT and TR decompositions can be used to compress the embedding layer. 
$\def\uuX{\underline{\bf X}}$
$\def\uuG{\underline{\bf G}}$
$\newcommand\R{\mathbb{R}}$
$\newcommand\bG{\bf G}$
$\newcommand\bX{\bf X}$
$\newcommand\bU{\bf U}$
$\newcommand\bV{\bf V}$</p>

<h1 id="tensor-train-decomposition">Tensor Train decomposition</h1>

<p>Suppose we have an $N$th-order tensor $\uuX \in \R^{I_1 \times I_2 \times \dots \times I_N}$. The TT representation of $\uuX$ is given as</p>

\[x_{i_1, i_2, \dots, i_N} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \dots \sum_{r_{N-1}=1}^{R_{N-1}} g^{(1)}_{1, i_1, r_1} \cdot g^{(2)}_{r_1, i_2, r_2} \cdot \dots \cdot g^{(N)}_{r_{N-1}, i_N, 1},\]

<p>or, equivalently,</p>

\[x_{i_1, i_2, \dots, i_N} =  {\bG}^{(1)}_{i_1} \cdot {\bG}^{(2)}_{i_2} \cdot ... \cdot {\bG}^{(N)}_{i_N},\]

<p>where the slice matrices are defined as
\({\bG}_{i_n}^{(n)} = \uuG^{(n)}(:, i_n, :) \in \mathbb{R}^{R_{n-1} \times R_n}\), $i_n = 1, 2, \dots, I_n$,
i.e., ${\bG}_{i_n}^{(n)}$ is the $i_n$th lateral slice of the core tensor $\uuG^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n \times R_n}$, $n=1, 2, \dots,N$, with $R_0 = R_N = 1$ by definition.</p>

<p>The key idea of TT decomposition is illustrated in the next figure. The minimal values of $\{R_k\}_{k=1}^{N-1}$ for which the TT–decomposition exists are called TT–ranks.</p>
<pre>

</pre>

<p><img src="../../../../images/TT.png?raw=true | width=10" alt="TT" /></p>
<div align="center">TT decomposition illustration</div>

<pre>

</pre>

<p>The total number of parameters in TT decomposition is $\sum_{k=1}^N R_{k-1} I_k R_{k}$. Hence, <strong>if the core tensors have small ranks, the total number of elements required to represent a given tensor in TT–format is significantly smaller than the number of elements in the full tensor, $\prod_{k=1}^N I_k$.</strong> This makes TT decomposition appealing in many problems involving extremely large data.</p>
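<p>A small NumPy sketch (with arbitrary example shapes and ranks) shows both the reconstruction of a full tensor from TT cores and the parameter savings:</p>

```python
import numpy as np

def tt_reconstruct(cores):
    """Contract TT cores G^(n) of shape (R_{n-1}, I_n, R_n) into the full tensor."""
    res = cores[0]  # (1, I_1, R_1)
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=([-1], [0]))  # contract the shared rank index
    return res.squeeze(axis=0).squeeze(axis=-1)          # drop boundary ranks R_0 = R_N = 1

# A random TT tensor of shape 8 x 8 x 8 with TT-ranks (1, 3, 3, 1).
shapes, ranks = [8, 8, 8], [1, 3, 3, 1]
cores = [np.random.rand(ranks[k], shapes[k], ranks[k + 1]) for k in range(3)]
full = tt_reconstruct(cores)
print(full.shape)                             # (8, 8, 8)
print(sum(c.size for c in cores), full.size)  # 120 parameters vs 512 entries
```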

<h1 id="tensor-ring-decomposition">Tensor Ring decomposition</h1>
<p>The tensor ring format of a tensor $\uuX \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is defined as</p>

\[x_{i_1, i_2, \dots, i_N} = \text{Trace}\left( \bG^{(1)}_{i_1} \cdot \ldots \cdot \bG^{(N)}_{i_N} \right),\]

<p>or in index-form</p>

\[x_{i_1, i_2, \dots, i_N} = \sum_{r_0 = 1 }^{R_{0}} \cdots  \sum_{r_{N-1} = 1 }^{R_{N-1}} g^{(1)}_{r_0, i_1, r_1} \cdot \ldots \cdot g^{(N)}_{r_{N-1}, i_N, r_0},\]

<p>where \({\bG}^{(n)}_{i_n}\) is the $i_n$th slice matrix of the core tensor $\uuG^{(n)} \in \R^{R_{n-1}\times I_n \times R_n}$. The last core tensor $\uuG^{(N)}$ is of size $R_{N-1} \times I_N \times R_0$, i.e., $R_{N} = R_0$.</p>

<p>The TR-format can be seen as a natural generalization of the TT decomposition, which is recovered when $R_0=R_N=1$. The TR-format is illustrated in the next figure.</p>

<p><img src="../../../../images/TR.png?raw=true" alt="TR" /></p>
<div align="center">TR decomposition illustration</div>

<pre>

</pre>

<p>However, the TR-format is known to have theoretical drawbacks compared to TT decomposition <a href="https://arxiv.org/pdf/1302.7121">(5)</a>. For example, in the case of TR decomposition, the minimal TR-ranks of a tensor need not be unique <a href="https://arxiv.org/pdf/1801.02662.pdf">(6)</a> (not even up to permutation of the indices $i_1, \dots , i_N$), which makes them hard to estimate. On the other hand, numerical experiments show that the TR-format leads to lower ranks of the core tensors compared to the TT-format <a href="https://arxiv.org/pdf/1907.01011">(7)</a>, which means higher compression ratios and lower storage costs.</p>
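<p>The TR contraction is almost identical to the TT case, except that the boundary ranks match ($R_0 = R_N$) and a trace closes the ring. A sketch with arbitrary example shapes and ranks:</p>

```python
import numpy as np

def tr_reconstruct(cores):
    """Contract TR cores G^(n) of shape (R_{n-1}, I_n, R_n), closing the ring by a trace."""
    res = cores[0]
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=([-1], [0]))
    # res now has shape (R_0, I_1, ..., I_N, R_0): trace out the ring index
    return np.trace(res, axis1=0, axis2=-1)

# A random TR tensor of shape 8 x 8 x 8 with TR-ranks (2, 3, 3, 2); note R_0 = R_3 = 2.
shapes, ranks = [8, 8, 8], [2, 3, 3, 2]
cores = [np.random.rand(ranks[k], shapes[k], ranks[k + 1]) for k in range(3)]
full = tr_reconstruct(cores)
print(full.shape)  # (8, 8, 8)
```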

<h1 id="tt-and-tr-embeddings">TT and TR embeddings</h1>

<p>We aim to replace a regular embedding matrix with a more compact, yet powerful and trainable, format which would allow us to efficiently transform input words into vector representations.</p>

<p>Let $\bX \in \mathbb{R}^{I \times J}$ be a matrix of size $I \times J$. The idea is to factor its dimensions as $I = \prod_{n=1}^N I_n$ and $J = \prod_{n=1}^N J_n$ and reshape the matrix into an $N$th-order tensor $\uuX \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$ whose $n$th dimension is of length $I_n J_n$ and is indexed by the tuple
$(i_n , j_n)$. Equivalently, this procedure is a bijection that maps rows and columns of the original matrix to $N$-dimensional vector indices. TT decomposition is then applied to this tensor to obtain a compact representation:</p>

\[\uuX((i_1, j_1), (i_2, j_2), \dots, (i_N, j_N)) =  \uuG^{(1)}((i_1, j_1), :)  \ldots \uuG^{(N)}(:, (i_N, j_N)).\]

<p>The described representation of a matrix in the TT–format is called a TT–matrix. The obtained factorizations $(I_1, I_2, \dots I_N ) \times (J_1,J_2, \dots J_N)$ are treated as the shapes of the TT–matrix, or TT–shapes. The construction of a TT–matrix from a given matrix is shown in the next figure for a 3-dimensional tensor.</p>

<p><img src="../../../../images/tt_matrix.png?raw=true" alt="TT-matrix" /></p>
<div align="center">TT compression of an embedding layer: reshaping a matrix into a tensor, then using TT decomposition</div>
<pre>

</pre>

<p>Similarly, we can define a TR-matrix by reshaping a given matrix $\bX$ into a tensor $\uuX \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \dots \times I_N J_N}$:</p>

\[\uuX((i_1, j_1), (i_2, j_2), \dots, (i_N, j_N)) =  \text{Trace}(\uuG^{(1)}(:, (i_1, j_1), :)  \ldots \uuG^{(N)}(:, (i_N, j_N), :)).\]

<p>The construction of a TR–matrix from a given matrix is shown in the next figure for a
3-dimensional tensor.</p>

<p><img src="../../../../images/tr_matrix.png?raw=true" alt="TT-matrix" /></p>
<div align="center">TR compression of an embedding layer: reshaping a matrix into a tensor, then using TR decomposition</div>
<pre>

</pre>

<p>Now we can introduce a concept of a tensorized embedding layer:</p>

<blockquote>
  <p>A TT/TR-embedding layer is a layer whose TT/TR–cores are trainable parameters; together they form a TT/TR–matrix that can be transformed back into an embedding layer $\bX \in \mathbb{R}^{I \times J}$. The algorithm requires setting the ranks in advance to define the core sizes; they are hyperparameters of the layer. The rank values are crucial since they determine and control the compression ratio.</p>
</blockquote>

<p>To obtain the embedding of a word with index $i$ in the vocabulary, we transform the row index $i$ into an $N$-dimensional vector index $(i_1, \dots, i_N)$ and compute the components of the TT or TR embedding. Note that evaluating all of its components amounts to choosing specific slices and running a sequence of matrix multiplications, which is implemented efficiently in modern linear algebra libraries.</p>
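<p>This lookup can be sketched in NumPy. The vocabulary and embedding-size factorizations, the ranks, and the C-order index bijection below are all hypothetical choices for illustration:</p>

```python
import numpy as np

def tt_embedding_row(cores, i, row_shape):
    """Embedding of word i from a TT-matrix, without materializing the full I x J matrix.

    cores[n] has shape (R_{n-1}, I_n * J_n, R_n); its middle axis is indexed by
    the pair (i_n, j_n). row_shape = (I_1, ..., I_N); the J_n are inferred.
    """
    multi_i = np.unravel_index(i, row_shape)  # row index -> (i_1, ..., i_N)
    res = None
    for core, i_n, I_n in zip(cores, multi_i, row_shape):
        R_prev, IJ, R_next = core.shape
        # slice out the matrices G^(n)[:, (i_n, j_n), :] for this fixed i_n
        block = core.reshape(R_prev, I_n, IJ // I_n, R_next)[:, i_n]  # (R_prev, J_n, R_next)
        res = block if res is None else np.tensordot(res, block, axes=([-1], [0]))
    return res.reshape(-1)  # the J = prod(J_n)-dimensional embedding of word i

# A 25,000-word vocabulary factored as 25 x 40 x 25; embedding size 256 = 8 x 8 x 4.
I_fac, J_fac = (25, 40, 25), (8, 8, 4)
ranks = (1, 8, 8, 1)
cores = [np.random.rand(ranks[n], I_fac[n] * J_fac[n], ranks[n + 1]) for n in range(3)]
emb = tt_embedding_row(cores, 123, I_fac)
# 22,880 TT parameters instead of 25,000 * 256 = 6,400,000 for the full matrix
print(emb.shape, sum(c.size for c in cores))
```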

<h1 id="results">Results</h1>

<p>Let me show results on a simple task, sentiment analysis: predicting the polarity of a sentence.</p>

<p>The proposed approach is compared with the following baselines:</p>
<ul>
  <li>Standard embedding layer with the baseline compression ratio 1.</li>
  <li>An embedding layer parametrized by two matrices, $\bX = \bU \bV^T$, where $\bU \in \R^{I\times R}$  and $\bV \in \R^{J\times R}$. The compression ratio is then $\frac{IJ}
{(I+J)R} \sim \frac{J}{R}$.</li>
</ul>

<!-- Transformers in language modeling and machine translation tasks ulilize the same weight
matrix for their embedding and softmax layers which already significantly reduces model
size. So, in our tests, we apply two separate TT/TR decompositions of the same shape for embedding and softmax layers and report the compression ratio as $\frac{|V|d_\text{model}}{2 \times \text{nb. of TT params }}$. -->

<p>We test our approach on popular datasets: the IMDB dataset with two classes and the Stanford Sentiment Treebank (SST) with five classes. Our model is a standard bidirectional two-layer LSTM with a hidden size of 128 and a dropout rate of 0.5. For the embedding layer, we used the most frequent 25,000 words for IMDB and 17,200 for SST, mapping them into a $J$-dimensional space with either a regular embedding layer or a TT/TR embedding layer.</p>

<p><img src="../../../../images/red_imbdb.png?raw=true" alt="res1" /></p>
<div align="center">Sentiment analysis, LSTM with either TT-embedding or TR-embedding on IMDB dataset.
The model is trained for 10 epochs. Embedding compression is calculated as the fraction between
the number of parameters in the full embedding layer and TT/TR–embedding layer.</div>
<pre>

</pre>

<p><img src="../../../../images/res_sst.png?raw=true" alt="res2" /></p>
<div align="center">Sentiment analysis, LSTM with either TT-embedding or TR-embedding on SST dataset.
The model is trained for 10 epochs. Ranks were set to 8 or 16.</div>
<pre>

</pre>

<p>The results of our experiments reveal that the models with the compressed embedding layer performed similarly to, or even better than, the models with standard embedding layers. For example, on the IMDB dataset, the TT embedding layer with a rank of 16 reached a test accuracy of 89.7%, outperforming our baseline model with a test accuracy of 88.6%. Furthermore, the compressed model had significantly fewer parameters than the full model (under one million vs. 7.19 million). Similarly, on the SST dataset, the model with the TR-embedding layer outperformed both the model with the regular embedding layer and the TT layer. With matrix low-rank factorization, we would obtain compression ratios $\frac{J}{R} = \frac{256}{8} =32$ or $\frac{256}{16}= 16$, which are clearly worse than those of the tensor factorization techniques.</p>

<p>The slightly better test accuracy of the models with tensorized embedding layers suggests that imposing a specific tensorial low-rank structure on the embedding matrix can be viewed as a form of regularization, so the model potentially generalizes better.</p>

<h1 id="conclusion">Conclusion</h1>

<p>To conclude, TT and TR decompositions can be used to compress neural networks. We use them to compress embedding layers in NLP models. This method can be easily integrated into any deep learning framework and trained via backpropagation, while capitalizing on reduced memory requirements and increased training batch size. More details can be found in the <a href="https://arxiv.org/pdf/1901.10787.pdf">paper</a> and code is available <a href="https://github.com/tt-embedding/tt-embeddings">here</a>.</p>

<h2 id="references">References</h2>
<ol>
  <li>Oseledets, Ivan V. “Tensor-train decomposition.” <em>SIAM Journal on Scientific Computing</em> 33.5 (2011): 2295-2317.</li>
  <li>Novikov, Alexander, et al. “Tensorizing neural networks.” <em>Advances in neural information processing systems</em> 28 (2015).</li>
  <li>Lebedev, Vadim, et al. “Speeding-up convolutional neural networks using fine-tuned cp-decomposition.” <em>arXiv preprint arXiv:1412.6553 (2014)</em>.</li>
  <li>Wang, Wenqi, et al. “Wide compression: Tensor ring nets.” <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.</em> 2018.</li>
  <li>Grasedyck, Lars, Daniel Kressner, and Christine Tobler. “A literature survey of low‐rank tensor approximation techniques.” <em>GAMM‐Mitteilungen</em> 36.1 (2013): 53-78.</li>
  <li>Ye, Ke, and Lek-Heng Lim. “Tensor network ranks.” a<em>rXiv preprint arXiv:1801.02662 (2018).</em></li>
  <li>Zhao, Qibin, et al. “Learning efficient tensor representations with ring-structured networks.” <em>ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP).</em> IEEE, 2019.</li>
</ol>]]></content><author><name>Elena Orlova</name><email>egorlova68@gmail.com</email></author><category term="tensor_networks" /><category term="compression" /><category term="TT_decomposition" /><category term="TR_decomposition" /><category term="ML" /><category term="NLP" /><summary type="html"><![CDATA[This post is based on “Tensorized Embedding Layers” paper. Here, I’d like to explain the main ideas from this paper and show some results.]]></summary></entry></feed>