<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="https://b-mu.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://b-mu.github.io//" rel="alternate" type="text/html" /><updated>2022-07-23T21:47:14+00:00</updated><id>https://b-mu.github.io//feed.xml</id><title type="html">Baorun (Lauren) Mu</title><subtitle>&quot;Study hard what interests you the most in the most undisciplined, irreverent and original manner possible.&quot; -- Richard Feynman</subtitle><entry><title type="html">Mount a remote file system on Mac via sshfs</title><link href="https://b-mu.github.io//jekyll/update/2022/07/08/mac-sshfs.html" rel="alternate" type="text/html" title="Mount a remote file system on Mac via sshfs" /><published>2022-07-08T04:21:00+00:00</published><updated>2022-07-08T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2022/07/08/mac-sshfs</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2022/07/08/mac-sshfs.html">&lt;ol&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://www.macports.org/install.php&quot;&gt;MacPorts&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;e.g. for Monterey (v12), simply download the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.pkg&lt;/code&gt; installer and run it&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://ports.macports.org/port/sshfs/&quot;&gt;sshfs&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo port install sshfs&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Install &lt;a href=&quot;https://osxfuse.github.io/&quot;&gt;macFuse&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;in System Preferences, allow the system software from developer Benjamin Fleischer so that the extension can load&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Create a directory and mount the remote file system
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkdir local_fs&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sshfs $USERNAME@HOST:DIR local_fs&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;if the server requires key authentication, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sshfs $USERNAME@HOST:DIR local_fs -o IdentityFile=PATH_TO_KEY&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;(note) If the connection breaks, the local directory where the remote file system is mounted may show as busy and can no longer be read or written. Force-unmount it with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;umount -f local_fs&lt;/code&gt; and mount it at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;local_fs&lt;/code&gt; again (as in step 4)&lt;/li&gt;
&lt;/ol&gt;
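&lt;p&gt;The mount commands from step 4 can be collected into one snippet. This is a dry-run sketch: the username, host, remote directory, and key path below are placeholders, and the echo only prints the sshfs command; drop the echo (and substitute your own values) to mount for real.&lt;/p&gt;

```shell
# Dry-run sketch of step 4 (placeholders: alice, server.example.com,
# /home/alice, and the key path are examples, not values from this post).
USERNAME=alice
HOST=server.example.com
DIR=/home/alice
mkdir -p local_fs    # create the mount point (step 4)
# print the mount command; remove the echo to actually run sshfs
echo sshfs "$USERNAME@$HOST:$DIR" local_fs -o IdentityFile="$HOME/.ssh/id_ed25519"
```

&lt;p&gt;If the mount later goes stale, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;umount -f local_fs&lt;/code&gt; followed by the same sshfs command recovers it.&lt;/p&gt;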

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/37458814/how-to-open-remote-files-in-sublime-text-3&quot;&gt;Stackoverflow: How to open remote files in sublime text 3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/14057830/unmount-the-directory-which-is-mounted-by-sshfs-in-mac&quot;&gt;Stackoverflow: Unmount the directory which is mounted by sshfs in Mac&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://askubuntu.com/questions/777116/sshfs-giving-remote-host-has-disconnected&quot;&gt;Ask Ubuntu: sshfs giving “remote host has disconnected”&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">Install MacPorts e.g. for Monterey v12, simply download the .pkg and install Install sshfs sudo port install sshfs Install macFuse need to allow system software from Benjamin Fleischer in System Preferences to use this extension Create a directory and mount the remote file system mkdir local_fs sshfs $USERNAME@HOST:DIR local_fs if the server requires key authentication, use sshfs $USERNAME@HOST:DIR local_fs -o IdentityFile=PATH_TO_KEY (note) If the connection is broken, the local directory that the remote system is mounted might show as busy but cannot be read or written to. One can unmount the fs by umount -f local_fs and mount at local_fs again (as in step 4)</summary></entry><entry><title type="html">Get Started with SciNet Mist</title><link href="https://b-mu.github.io//jekyll/update/2022/03/02/scinet-mist-get-started.html" rel="alternate" type="text/html" title="Get Started with SciNet Mist" /><published>2022-03-02T04:21:00+00:00</published><updated>2022-03-02T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2022/03/02/scinet-mist-get-started</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2022/03/02/scinet-mist-get-started.html">&lt;p&gt;This is a tutorial for those who are new to Mist, a GPU cluster in the SciNet supercomputer center.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Register a &lt;a href=&quot;https://ccdb.computecanada.ca/&quot;&gt;Compute Canada Database (CCDB)&lt;/a&gt; account
    &lt;ul&gt;
      &lt;li&gt;an account can have one of several roles; for a student researcher to be added to a group, a sponsor code is needed&lt;/li&gt;
      &lt;li&gt;it takes a few hours for the account to be activated by the consortium&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Request access to Niagara and Mist on &lt;a href=&quot;https://ccdb.computecanada.ca/services/opt_in?&quot;&gt;Compute Canada&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;approval takes a few hours; a notification email will be sent&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Set up ssh public/private key:
    &lt;ul&gt;
      &lt;li&gt;generate a key via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh-keygen -t $TYPE -f $KEY_NAME&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$TYPE&lt;/code&gt; can be any supported public key algorithm: rsa, ecdsa, or ed25519 (ed25519 is a good default; dsa is deprecated in recent OpenSSH releases)&lt;/li&gt;
      &lt;li&gt;copy the public key (i.e. the file with the .pub extension) to Compute Canada’s webpage: select My Account -&amp;gt; Manage SSH Keys&lt;/li&gt;
      &lt;li&gt;give the key a name, then click add key&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;ssh to Mist
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -i $PATH_TO_KEY -Y $USERNAME@mist.scinet.utoronto.ca&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$PATH_TO_KEY&lt;/code&gt; is the private key generated in step 3 and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$USERNAME&lt;/code&gt; is that of the CCDB account registered in step 1&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Load/Install software modules
    &lt;ul&gt;
      &lt;li&gt;load anaconda: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;module load anaconda3&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;create a virtual environment: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda create -n $ENV_NAME python=$PYTHON_VERSION&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;install any requirements in IBM Open-CE Conda Channel:
        &lt;ul&gt;
          &lt;li&gt;e.g. PyTorch, CUDAToolkit: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda install -c /scinet/mist/ibm/open-ce pytorch=1.10.2 cudatoolkit=11.2&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;(Heads up) Large datasets like ImageNet can exhaust the disk quota of the personal directory under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home ($HOME)&lt;/code&gt;; download and store them in the personal directory under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/scratch ($SCRATCH)&lt;/code&gt; instead&lt;/li&gt;
  &lt;li&gt;(Optional, but recommended) Request a debugjob via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;debugjob --clean -g $NUM_GPUS&lt;/code&gt; and test the code on a small-scale experiment first. This command gives an interactive session with one hour of compute&lt;/li&gt;
&lt;/ol&gt;
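&lt;p&gt;Step 3 in a single runnable command. The key type and file name here are just reasonable defaults, not values mandated by SciNet; the empty passphrase keeps the demo non-interactive, though a real key may deserve one.&lt;/p&gt;

```shell
# Generate an ed25519 keypair non-interactively ("scinet_key" is an
# arbitrary example name; -N '' sets an empty passphrase for the demo).
ssh-keygen -q -t ed25519 -N '' -f ./scinet_key
ls scinet_key scinet_key.pub   # the .pub file is what gets pasted into CCDB
```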

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.computecanada.ca/wiki/Compute_Canada_Documentation&quot;&gt;Compute Canada Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.scinethpc.ca/getting-a-scinet-account/&quot;&gt;Getting Access to SciNet Systems and Services&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.ssh.com/academy/ssh/keygen&quot;&gt;Keygen (SSH Academy)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.scinet.utoronto.ca/index.php/Mist&quot;&gt;Mist Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This is a tutorial for those who are new to Mist, a GPU cluster in the SciNet supercomputer center.</summary></entry><entry><title type="html">Multi-Node Distributed Neural Network Training on AWS</title><link href="https://b-mu.github.io//jekyll/update/2022/02/13/aws-multinode.html" rel="alternate" type="text/html" title="Multi-Node Distributed Neural Network Training on AWS" /><published>2022-02-13T04:21:00+00:00</published><updated>2022-02-13T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2022/02/13/aws-multinode</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2022/02/13/aws-multinode.html">&lt;p&gt;note: the following tutorial uses an example with two p3.16xlarge instances (8 V100 GPUs each, i.e. 16 GPUs on 2 nodes), but it is easy to generalize to more instances (if AWS’s usage limits and capacity availability allow)&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Launch multiple instances at once in AWS management console
    &lt;ul&gt;
      &lt;li&gt;click &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Launch Instances&lt;/code&gt; button&lt;/li&gt;
      &lt;li&gt;change the number of instances to 2 in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Step 3: Configure Instance Details&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Allow all traffic (by changing the inbound/outbound rules) in the security group shared by all instances&lt;/li&gt;
  &lt;li&gt;ssh to instances (in different terminals)
    &lt;ul&gt;
      &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ssh -i {PRIVATE KEY} {USERNAME}@{PUBLIC IPv4 ADDRESS}&lt;/code&gt;, where the username depends on the AMI (e.g. ubuntu for Ubuntu images)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Config network interface for every node
    &lt;ul&gt;
      &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ifconfig&lt;/code&gt; to find the network interface name (e.g. ens3)&lt;/li&gt;
      &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export NCCL_SOCKET_IFNAME=ens3&lt;/code&gt; to set the environment variable&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Add multi-node support in python script
    &lt;ul&gt;
      &lt;li&gt;in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;init_process_group&lt;/code&gt;, we should provide:
        &lt;ul&gt;
          &lt;li&gt;url: one can use the private IPv4 address of node 0 and a free port, e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tcp://172.31.22.234:23456&lt;/code&gt;
            &lt;ul&gt;
              &lt;li&gt;this is one of several initialization methods (another option is a shared file system, which needs more configuration)&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
          &lt;li&gt;node index (e.g. 0 and 1 with two nodes)&lt;/li&gt;
          &lt;li&gt;number of processes per node (here, GPUs per node): used to calculate the global rank of a GPU&lt;/li&gt;
          &lt;li&gt;world size: total number of GPUs to be used in all nodes&lt;/li&gt;
          &lt;li&gt;global rank: calculated from the above parameters and local rank of a GPU&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;note: the local rank of a GPU will be set automatically by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.distributed.launch&lt;/code&gt; (see how to use it in the next step)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.distributed.launch&lt;/code&gt; to run the training script in a distributed setting
    &lt;ul&gt;
      &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 {PYTHON_FILE_NAME} {ARGS OF PYTHON SCRIPT}&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Go fast girl!&lt;/li&gt;
&lt;/ol&gt;
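&lt;p&gt;The steps above end in one launch command per node. The snippet below prints that command for both nodes as a dry run (remove the echo to actually launch); train.py, its tcp argument, and the node-0 address 172.31.22.234 are placeholders following the example in step 5, and --node_rank is the launcher flag that distinguishes the two nodes.&lt;/p&gt;

```shell
# Print the per-node launch command (dry run; drop the echo to launch).
# train.py and the tcp:// URL are placeholders for your script and the
# private IPv4 address of node 0 plus a free port.
for NODE_RANK in 0 1; do
  CMD="python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK train.py tcp://172.31.22.234:23456"
  echo "$CMD"
done
```

&lt;p&gt;Run the node_rank=0 command on node 0 and the node_rank=1 command on node 1, each in its own terminal from step 3.&lt;/p&gt;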

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/inkawhich/pt-distributed-tutorial/blob/master/pytorch-aws-distributed-tutorial.py&quot;&gt;pt-distributed-tutorial by Nathan Inkawhich&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">note: the following tutorial will use an example with two p3.16 instance (16 V100 GPUs on 2 nodes), but it is easy to generalize to more instances (if AWS’s use limit and capacity availability agrees)</summary></entry><entry><title type="html">Download ImageNet-1k Dataset on AWS</title><link href="https://b-mu.github.io//jekyll/update/2022/02/12/aws-imagenet.html" rel="alternate" type="text/html" title="Download ImageNet-1k Dataset on AWS" /><published>2022-02-12T04:21:00+00:00</published><updated>2022-02-12T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2022/02/12/aws-imagenet</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2022/02/12/aws-imagenet.html">&lt;ol&gt;
  &lt;li&gt;Launch a t2.large instance&lt;/li&gt;
  &lt;li&gt;Create a 1000 GB gp2 volume on AWS (it has to be in the same availability zone as the instance)&lt;/li&gt;
  &lt;li&gt;Attach the volume created in step 2 to the instance&lt;/li&gt;
  &lt;li&gt;Format the volume and mount in the system
    &lt;ul&gt;
      &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lsblk&lt;/code&gt; to check the device name of the volume created in step 2
        &lt;ul&gt;
          &lt;li&gt;e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/xvdf&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;format the volume: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo mkfs -t ext4 $DEVICE_NAME&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;mount the volume:
        &lt;ul&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo mkdir imagenet&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo mount $DEVICE_NAME imagenet&lt;/code&gt; where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$DEVICE_NAME&lt;/code&gt; should be replaced by the device name found above&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Sign up on &lt;a href=&quot;https://image-net.org/index.php&quot;&gt;image-net.org&lt;/a&gt; (if one does not have an account)&lt;/li&gt;
  &lt;li&gt;Download &lt;a href=&quot;https://image-net.org/challenges/LSVRC/2012/2012-downloads.php&quot;&gt;ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)
&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;Training data
        &lt;ul&gt;
          &lt;li&gt;right click Training images (Task 1 &amp;amp; 2) on ImageNet’s website and copy link address&lt;/li&gt;
          &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo nohup wget $LINK&lt;/code&gt; where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$LINK&lt;/code&gt; should be replaced by the link address copied from the website (takes hours!)&lt;/li&gt;
          &lt;li&gt;extract: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tar -xf ILSVRC2012_img_train.tar&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;create directories &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find . -name &quot;*.tar&quot; | while read NAME ; do mkdir -p &quot;${NAME%.tar}&quot;; tar -xvf &quot;${NAME}&quot; -C &quot;${NAME%.tar}&quot;; rm -f &quot;${NAME}&quot;; done&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Validation data
        &lt;ul&gt;
          &lt;li&gt;right click Validation images (all tasks) on ImageNet’s website and copy link address&lt;/li&gt;
          &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo nohup wget $LINK&lt;/code&gt; where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$LINK&lt;/code&gt; should be replaced by the link address copied from the website&lt;/li&gt;
          &lt;li&gt;download &lt;a href=&quot;https://github.com/juliensimon/aws/blob/master/mxnet/imagenet/build_validation_tree.sh&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build_validation_tree.sh&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;create directories by running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sh build_validation_tree.sh&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;note:
        &lt;ul&gt;
          &lt;li&gt;the training and validation datasets should each contain 1000 class directories, named n01440764 through n15075141&lt;/li&gt;
          &lt;li&gt;make sure only the class directories are inside the “train” and “validation” directories; otherwise, the targets can be wrong during training&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Create a snapshot from the volume that contains the dataset in AWS EC2 management console&lt;/li&gt;
  &lt;li&gt;Terminate the instance used to download the dataset. Launch a new instance for training. Attach the volume to the new instance and mount it as in step 4.&lt;/li&gt;
  &lt;li&gt;Load the dataset to the training program with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torchvision.datasets.ImageFolder&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Start training!&lt;/li&gt;
&lt;/ol&gt;
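&lt;p&gt;The per-class extraction loop in step 6 can be rehearsed safely on synthetic data first. The snippet below fabricates two tiny class tarballs (stand-ins for the real n01440764.tar and friends; no ImageNet download involved), runs the exact loop from the post, and leaves one directory per class with the tarballs removed.&lt;/p&gt;

```shell
# Rehearse the step-6 extraction loop on fabricated class tarballs.
workdir=$(mktemp -d)
cd "$workdir"
for class in n01440764 n01443537; do      # two stand-in WordNet IDs
  mkdir "${class}_src"
  touch "${class}_src/img1.JPEG"
  tar -cf "${class}.tar" -C "${class}_src" .
  rm -r "${class}_src"
done
# the loop from the post: one directory per class tar, tarball deleted after
find . -name "*.tar" | while read NAME ; do
  mkdir -p "${NAME%.tar}"
  tar -xf "${NAME}" -C "${NAME%.tar}"
  rm -f "${NAME}"
done
```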

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://housekdk.gitbook.io/ml/ml/cv/imagenet-horovod&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[Hands-on]&lt;/code&gt; Fast Training ImageNet on on-demand EC2 GPU instances with Horovod&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://blog.slavv.com/learning-machine-learning-on-the-cheap-persistent-aws-spot-instances-668e7294b6d8&quot;&gt;Learning Machine Learning on the cheap: Persistent AWS Spot Instances&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">Launch a t2.large instance Create a 1000G GP2 volume on AWS (has to be in the same zone with the instance) Attach the volume created in step 1 to the instance Format the volume and mount in the system run lsblk to check the device name of the volume created in step 1 e.g. /dev/xvdf format the volume: sudo mkfs -t ext4 $DEVICE_NAME mount the volume: sudo mkdir imagenet sudo mount $DEVICE_NAME imagenet where DEVICE_NAME should be replaced by that found above Sign up on image-net.org (if one does not have an account) Download ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) Training data right click Training images (Task 1 &amp;amp; 2) on ImageNet’s website and copy link address run sudo nohup wget $LINK where $Link should be replaced by the copied link address from website (takes hours!) extract: tar -xf ILSVRC2012_img_train.tar create directories find . -name &quot;*.tar&quot; | while read NAME ; do mkdir -p &quot;${NAME%.tar}&quot;; tar -xvf &quot;${NAME}&quot; -C &quot;${NAME%.tar}&quot;; rm -f &quot;${NAME}&quot;; done Validation data right click Validation images (all tasks) on ImageNet’s website and copy link address run sudo nohup wget $LINK where $Link should be replaced by the copied link address from website download build_validation_tree.sh create directories by running sh build_validation_tree.sh note: both training and validation dataset should have 1000 directories in total, from n01440764 to n15075141 make sure only directories with training images are in the “train” and “validation” directories, otherwise, the target can be wrong at training Create a snapshot from the volume that contains the dataset in AWS EC2 management console Terminate the instance used to download the dataset. Launch a new instance for training. Attach the volume to the new instance for training and mount the volume as in step 3. 
Load the dataset to the training program with torchvision.datasets.ImageFolder Start training!</summary></entry><entry><title type="html">Truncated Backpropagation for Bilevel Optimization</title><link href="https://b-mu.github.io//jekyll/update/2021/12/24/tbp-for-bo.html" rel="alternate" type="text/html" title="Truncated Backpropagation for Bilevel Optimization" /><published>2021-12-24T04:21:00+00:00</published><updated>2021-12-24T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2021/12/24/tbp-for-bo</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2021/12/24/tbp-for-bo.html">&lt;p&gt;This post is a paper-reading note. Paper: &lt;a href=&quot;https://arxiv.org/abs/1810.10667&quot;&gt;Truncated Backpropagation for Bilevel Optimization, Amirreza Shaban &amp;amp; Ching-An Cheng et al. AISTATS 2018&lt;/a&gt;.&lt;/p&gt;

&lt;h1 id=&quot;bilevel-optimization-bo&quot;&gt;Bilevel Optimization (BO)&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;one application of BO in Machine Learning is &lt;strong&gt;Hyperparameter Optimization&lt;/strong&gt; (HO)
\[ \text{min}_{\lambda} f(\hat{w}, \lambda) \text{ s.t. } \hat{w} \approx \text{argmin}_w g(w, \lambda) \]
    &lt;ul&gt;
      &lt;li&gt;denote parameter and hyperparameter by $w$ and $\lambda$, the outer and inner objective function by $f$ (validation loss function) and $g$ (training loss function)&lt;/li&gt;
      &lt;li&gt;for this setup, the outer objective $f$ (the validation loss) does not directly depend on the hyperparameter $\lambda$ (only through $\hat{w}$), so the direct gradient $\frac{\partial f}{\partial \lambda} = 0$&lt;/li&gt;
      &lt;li&gt;note: we follow &lt;a href=&quot;https://arxiv.org/abs/1810.10667&quot;&gt;Shaban et al. 2018&lt;/a&gt;, which includes the algorithm that solves the inner objective as part of the problem’s formulation. Thus, we use $\hat{w}$ to denote the solution given by the solver instead of the exact minimizer $w^{*}$&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;challenges:
    &lt;ul&gt;
      &lt;li&gt;dependency of the optimization of $\lambda$ on the inner problem is complicated, so that evaluating &lt;strong&gt;exact gradients&lt;/strong&gt; is not scalable for high-dimensional problems&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;prior works:
    &lt;ul&gt;
      &lt;li&gt;other approaches for HO: Grid Search (black box, run the training procedure many times), Random Search, Bayesian Optimization, hypernetwork (&lt;a href=&quot;https://arxiv.org/abs/1903.03088&quot;&gt;MacKay &amp;amp; Vicol et al. 2019&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2010.13514&quot;&gt;Bae et al. 2020&lt;/a&gt;)&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Implicit Differentiation&lt;/strong&gt;: rely on estimate of Jacobian $J_{\lambda} \hat{w} = \frac{\partial \hat{w}}{\partial \lambda}$
        &lt;ul&gt;
          &lt;li&gt;note: this method relies on the assumption that $\hat{w} = w^{*}$, i.e. optimality of the approximate solution&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Dynamical System perspective&lt;/strong&gt;: treat the iterative algorithm to solve the inner problem as a dynamical system and apply backpropagation through the system&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
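&lt;p&gt;As a concrete instance of the setup above (an illustrative example, not one from the paper), take ridge regression: the inner problem fits the weights on the training split with an L2 penalty whose strength is the hyperparameter, and the outer objective is the unregularized validation loss.&lt;/p&gt;

```latex
% Illustrative HO instance (not from the paper): ridge regression.
% The inner objective g is strongly convex in w whenever lambda > 0,
% matching the local strong-convexity assumption used later.
g(w, \lambda) = \lVert X_{\mathrm{tr}} w - y_{\mathrm{tr}} \rVert^2
              + \lambda \lVert w \rVert^2,
\qquad
f(\hat{w}, \lambda) = \lVert X_{\mathrm{val}} \hat{w} - y_{\mathrm{val}} \rVert^2
```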

&lt;h1 id=&quot;truncated-backpropagation-for-bilevel-optimization&quot;&gt;Truncated Backpropagation for Bilevel Optimization&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;idea: &lt;strong&gt;approximate gradients&lt;/strong&gt; using Truncated Backpropagation (TBP) through “time” (as in a previous post, TBP reduces time and space complexities by removing long term dependencies)
    &lt;ul&gt;
      &lt;li&gt;note: here “time” refers to the optimization steps performed to solve the inner problem&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;hypergradient&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;following the Dynamic System perspective, the iterative optimization algorithm that solves inner problem is viewed as a dynamical system: denote the transition function by $\Xi_t$ and the number of iterations by $T$
  \[ w_{t + 1} =  \Xi_{t + 1} (w_t, \lambda), ~~~ w_0 = \Xi_{0} (\lambda) \text{ at } t = 0, ~~~ \hat{w} = w_T\]&lt;/li&gt;
      &lt;li&gt;unrolling the computation graph:
  \[ \frac{df}{d\lambda} = \frac{\partial f}{\partial \lambda} + \sum_{t = 0}^T B_t A_{t + 1} … A_T \frac{\partial f}{\partial \hat{w}} ~~~ \text{# by the Chain Rule}\]
  \[ \text{where } A_{t + 1} = \frac{\partial \Xi_{t + 1}(w_t, \lambda)}{\partial w_t} = \frac{\partial w_{t + 1}}{\partial w_t}, B_{t + 1} = \frac{\partial \Xi_{t + 1}(w_t, \lambda)}{\partial \lambda} = \frac{\partial w_{t + 1}}{\partial \lambda} \text{ for } t \geq 0, B_0 = \frac{d \Xi_{0} (\lambda)}{d \lambda}\]&lt;/li&gt;
      &lt;li&gt;dimensions:
        &lt;ul&gt;
          &lt;li&gt;denote $w_t \in \mathbb{R}^M$ and $\lambda \in \mathbb{R}^N$, then $A_t \in \mathbb{R}^{M \times M}$, $B_t \in \mathbb{R}^{N \times M}$&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;reverse mode differentiation&lt;/strong&gt; (RMD):
  \[ \frac{d f}{d \lambda} = h_{-1}, ~~~ \alpha_T = \frac{\partial f}{\partial \hat{w}}, ~~~ h_T = \frac{\partial f}{\partial \lambda}\]
  \[ h_{t - 1} = h_t + B_t \alpha_t, ~~~ \alpha_{t - 1} = A_t\alpha_t, ~~~ t = T, \dots, 0 \]
        &lt;ul&gt;
          &lt;li&gt;need to store intermediate values $\{w_t \in \mathbb{R}^M\}_{t = 1}^T$, then the space requirement is $O(MT)$&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;forward mode differentiation&lt;/strong&gt; (FMD):
  \[ \frac{d f}{d \lambda} = Z_T \frac{\partial f}{\partial \hat{w}} + \frac{\partial f}{\partial \lambda}, ~~~ Z_0 = B_0\]
  \[ Z_{t + 1} = Z_t A_{t + 1} + B_{t + 1}, ~~~ t = 0, …, T - 1\]
        &lt;ul&gt;
          &lt;li&gt;need to propagate the matrices $Z_t \in \mathbb{R}^{N \times M}$, so the time complexity is $N$ times that of RMD (matrix-matrix vs. matrix-vector multiplications), but there is no need to store the intermediate values $w_t$ from the forward pass&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;K-step &lt;strong&gt;truncated backpropagation&lt;/strong&gt; (K-RMD):
  \[ h_{T - K} = \frac{\partial f}{\partial \lambda} + \sum_{t = T - K + 1}^T B_t A_{t + 1} … A_T \frac{\partial f}{\partial \hat{w}} \]
        &lt;ul&gt;
          &lt;li&gt;only need to store $\{w_t\}_{t = T - K + 1}^T$, so the space requirement is $O(MK)$&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;main &lt;strong&gt;theoretical results&lt;/strong&gt; (informal):
    &lt;ol&gt;
      &lt;li&gt;(Accuracy) if the inner problem is locally strongly convex around $\hat{w}$, then the bias of $h_{T - K}$ decays exponentially in K.&lt;/li&gt;
      &lt;li&gt;(Sufficient Descent) if the inner problem $g$ is second-order continuously differentiable, then $-h_{T - K}$ is a sufficient descent direction.&lt;/li&gt;
      &lt;li&gt;(Convergence) following the above results, we have that on-average convergence to $\epsilon$-approximate stationary point is guaranteed by $O(\text{log } 1 / \epsilon)$-step truncated backpropagation&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;relation with &lt;strong&gt;Implicit Differentiation&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;in the limit where $\hat{w}$ converges to $w^{*}$, $h_{T - K}$ can be viewed as an order-K (i.e. first K terms) Taylor series approximating the matrix inverse in the total derivative, the residual term has an upper bound
  \[ \frac{df}{d\lambda} = \frac{\partial f}{\partial \lambda} - \frac{\partial^2 g}{\partial \lambda \partial w} \bigg( \frac{\partial^2 g}{\partial w^2} \bigg)^{-1} \frac{\partial f}{\partial \hat{w}}\]
        &lt;ul&gt;
          &lt;li&gt;note: the above equation relies on the assumptions that (1) g is second-order continuously differentiable (2) there exists a unique optimal solution $w^{*}$ and all the derivatives are evaluated at $w^{*}$&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;experiment: compare K-step truncated RMD to K-step &lt;strong&gt;Conjugate Gradient&lt;/strong&gt; (CG)
        &lt;ul&gt;
          &lt;li&gt;both require local strong-convexity to ensure a good approximation&lt;/li&gt;
          &lt;li&gt;if $w^{*}$ is available, then CG gives a smaller bias&lt;/li&gt;
          &lt;li&gt;in practice, $w^{*}$ is usually unknown, K-step truncated RMD has a weaker assumption&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This post is a paper-reading note. Paper: Truncated Backpropagation for Bilevel Optimization, Amirreza Shaban &amp;amp; Ching-An Cheng et al. AISTATS 2018.</summary></entry><entry><title type="html">Variational Inference: ELBO and reparameterization trick</title><link href="https://b-mu.github.io//jekyll/update/2021/12/23/elbo.html" rel="alternate" type="text/html" title="Variational Inference: ELBO and reparameterization trick" /><published>2021-12-23T04:21:00+00:00</published><updated>2021-12-23T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2021/12/23/elbo</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2021/12/23/elbo.html">&lt;p&gt;This post is a short review of the Evidence Lower Bound (ELBO), which is the standard objective function to be optimized in Variational Inference.&lt;/p&gt;

&lt;h1 id=&quot;variational-inference&quot;&gt;Variational Inference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;latent variables&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;latent/hidden variable: a random variable that is not observed, so it cannot be conditioned on directly during inference&lt;/li&gt;
      &lt;li&gt;Let the latent r.v. $\mathbf{Z}$ have distribution $p_{\theta^*}$ and the variable $\mathbf{X}$ have conditional distribution $p_{\theta^*} (x \vert z)$&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;objective&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;get a maximum likelihood estimate for $\theta$, denoted $\theta_{MLE}$, so that we can estimate the conditional distribution of a variable given the latent variable, $p_{\theta_{MLE}}(x \vert z)$, and the marginal likelihood of a variable, $p_{\theta_{MLE}}(x)$&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;maximum likelihood estimate&lt;/strong&gt; $\theta_{MLE}$: by maximizing the marginal likelihood
  \[p_{\theta}(x) = \int p_{\theta} (x \vert z) p_{\theta}(z) dz\]
        &lt;ul&gt;
          &lt;li&gt;suppose $p_{\theta} (x)$ and $p_{\theta} (z \vert x)$ are &lt;em&gt;intractable&lt;/em&gt;&lt;/li&gt;
          &lt;li&gt;&lt;strong&gt;Variational Inference&lt;/strong&gt;: approximate the posterior $p_{\theta}(z \vert x)$ then use it to estimate a lower bound on $\text{log } p_{\theta}(x)$ to update $\theta$
            &lt;ul&gt;
              &lt;li&gt;Let $q_{\phi} (z \vert x)$ be an approximating distribution for $p_{\theta} (z \vert x)$&lt;/li&gt;
              &lt;li&gt;$q_{\phi}$ is fit to $p_{\theta}$ by minimizing the &lt;strong&gt;Kullback-Leibler (KL) divergence&lt;/strong&gt; \[D_{KL}(q_{\phi} (z \vert x) ~\Vert~ p_{\theta} (z \vert x))\]&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
          &lt;li&gt;the idea of using the posterior $p_{\theta} (z \vert x)$ to estimate the marginal likelihood is also used in the EM algorithm&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;evidence lower bound (ELBO)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;derivation:
  \(\begin{align}
  &amp;amp;p_{\theta}(z \vert x) = \frac{p_{\theta} (x, z)}{p_{\theta} (x)} ~~~ \# \text{ by definition of conditional probability}\\
  \Rightarrow &amp;amp;\text{log } p_{\theta} (x) = - \text{log } p_{\theta} (z \vert x) + \text{log } p_{\theta} (x, z) ~~~ \# \text{ take log}\\
  \Rightarrow &amp;amp;\text{log } p_{\theta} (x) = - \text{log } p_{\theta} (z \vert x) + \text{log } p_{\theta} (x, z) + \text{log } q_{\phi} (z \vert x) - \text{log } q_{\phi} (z \vert x) ~~~ \# \text{ add and subtract}\\
  \Rightarrow &amp;amp;\text{log } p_{\theta} (x) = \text{log } \frac{q_{\phi} (z \vert x)}{p_{\theta} (z \vert x)} + \text{log } \frac{p_{\theta} (x, z)}{q_{\phi} (z \vert x)} ~~~ \# \text{ rearrange}\\
  \Rightarrow &amp;amp;\text{log } p_{\theta} (x) = \underbrace{E_{z \sim q_{\phi}}\bigg[ \text{log } \frac{q_{\phi}(z \vert x)}{p_{\theta}(z \vert x)}\bigg]}_{D_{KL}(q_{\phi} \Vert p_{\theta})} + E_{z \sim q_{\phi}} [\text{log } p_{\theta}(x, z) - \text{log } q_{\phi} (z \vert x)] ~~~\# \text{ take expectation w.r.t. } z
  \end{align}\)
        &lt;ul&gt;
          &lt;li&gt;Holding $\theta$ fixed, the LHS is fixed. If we increase $E_{z \sim q_{\phi}} [\text{log } p_{\theta} (x, z) - \text{log } q_{\phi}(z \vert x)]$ w.r.t. $\phi$, then the KL divergence decreases and the approximating distribution improves.&lt;/li&gt;
          &lt;li&gt;Since the KL divergence is non-negative, we have \[\text{log } p_{\theta} (x) \geq E_{z \sim q_{\phi}} \bigg[ \text{log } p_{\theta} (x, z) - \text{log } q_{\phi} (z \vert x) \bigg]\]&lt;/li&gt;
          &lt;li&gt;Denote the lower bound (ELBO) on the RHS by $\mathcal{L}(\theta, \phi; x)$. It is the objective function maximized in Variational Inference.&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;an analytical form may not be available for ELBO, use a Monte Carlo estimate of the expectation instead:
        &lt;ul&gt;
          &lt;li&gt;e.g. sample $z_i ~ (i = 1, \dots, N)$ from $q_{\phi} (z \vert x)$, then
  \[\hat{\mathcal{L}}(\theta, \phi; x) = \frac{1}{N} \sum_{i = 1}^N \bigg[\text{log } p_{\theta} (x, z_i) - \text{log } q_{\phi} (z_i \vert x) \bigg] \]
            &lt;ul&gt;
              &lt;li&gt;note: gradient-based methods are not directly applicable to updating $\phi$, since sampling $z_i$ from $q_{\phi}$ is not a differentiable operation. This motivates the next section.&lt;/li&gt;
            &lt;/ul&gt;
          &lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;reparameterization trick&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;TODO(bmu)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
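&lt;p&gt;The Monte Carlo ELBO estimate above can be sanity-checked on a toy model. The sketch below is an illustration added here, not part of the original derivation: it uses a conjugate Gaussian model with prior $z \sim N(0, 1)$ and likelihood $x \vert z \sim N(z, 1)$, where $\text{log } p_{\theta}(x)$ is tractable, so the bound can be verified directly; all variable names are assumptions of the sketch.&lt;/p&gt;

```python
# Hedged sketch: Monte Carlo estimate of the ELBO for a toy conjugate
# Gaussian model where log p(x) is tractable, so the bound can be checked.
# The names (mu_q, sigma_q, n_samples) are illustrative, not from the post.
import math
import random

def log_normal(v, mean, var):
    # log density of N(mean, var) evaluated at v
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def elbo_hat(x, mu_q, sigma_q, n_samples, rng):
    # hat(L) = (1/N) sum_i [ log p(x, z_i) - log q(z_i | x) ],  z_i sampled from q
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
        log_q = log_normal(z, mu_q, sigma_q ** 2)
        total += log_joint - log_q
    return total / n_samples

rng = random.Random(0)
x = 1.5
# for this model the exact posterior is N(x/2, 1/2) and the marginal is N(0, 2)
log_px = log_normal(x, 0.0, 2.0)
tight = elbo_hat(x, x / 2, math.sqrt(0.5), 20000, rng)   # q equals the true posterior
loose = elbo_hat(x, 0.0, 1.0, 20000, rng)                # a worse q (the prior)
# with q equal to the posterior the bound is tight; otherwise it sits below log p(x)
```

When $q_{\phi}$ equals the true posterior, every sample of $\text{log } p_{\theta}(x, z_i) - \text{log } q_{\phi}(z_i \vert x)$ equals $\text{log } p_{\theta}(x)$ exactly, so the estimate has zero variance; a mismatched $q_{\phi}$ leaves a gap equal to the KL divergence.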

&lt;h1 id=&quot;reference&quot;&gt;Reference:&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1312.6114&quot;&gt;Auto-Encoding Variational Bayes, Diederik P. Kingma et al. ICLR 2014&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This post is a short review of the Evidence Lower Bound (ELBO), which is the standard objective function optimized in Variational Inference.</summary></entry><entry><title type="html">Truncated Backpropagation through Time</title><link href="https://b-mu.github.io//jekyll/update/2021/12/23/truncated-bp-through-time.html" rel="alternate" type="text/html" title="Truncated Backpropagation through Time" /><published>2021-12-23T04:21:00+00:00</published><updated>2021-12-23T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2021/12/23/truncated-bp-through-time</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2021/12/23/truncated-bp-through-time.html">&lt;p&gt;This post is a short review of the Backpropagation Through Time and Truncated Backpropagation Through Time algorithms with a naive RNN model.&lt;/p&gt;

&lt;h1 id=&quot;recurrent-neural-network&quot;&gt;Recurrent Neural Network&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;motivation&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;handle input sequences of varying length&lt;/li&gt;
      &lt;li&gt;want to share features learned across different positions of sequence data&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;forward propagation&lt;/strong&gt;:
  \[ \mathbf{a}^{(t)} = g_a(\mathbf{W_{aa}} \mathbf{a}^{(t - 1)} + \mathbf{W_{ax}} \mathbf{x}^{(t)} + \mathbf{b_a}), ~~~ \mathbf{y}^{(t)} = g_y(\mathbf{W_{ya}} \mathbf{a}^{(t)} + \mathbf{b_y}) \]
    &lt;ul&gt;
      &lt;li&gt;note:
        &lt;ul&gt;
          &lt;li&gt;this is a naive RNN model with the simplest architecture&lt;/li&gt;
          &lt;li&gt;the parameters are &lt;em&gt;shared&lt;/em&gt; across the time steps&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;backpropagation through time&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;loss
  \[ \mathcal{L}^{(t)} (\mathbf{\hat{y}}^{(t)}, \mathbf{y}^{(t)}) = -\mathbf{y}^{(t)} \text{ log } \mathbf{\hat{y}}^{(t)} - (1 - \mathbf{y}^{(t)}) \text{ log } (1 - \mathbf{\hat{y}}^{(t)}), ~~~ \mathcal{L} = \sum_{t = 1}^T \mathcal{L}^{(t)} (\mathbf{\hat{y}}^{(t)}, \mathbf{y}^{(t)})\]&lt;/li&gt;
      &lt;li&gt;gradient descent on parameters &lt;img src=&quot;/assets/bptt.jpeg&quot; alt=&quot;BPTT&quot; /&gt;&lt;/li&gt;
      &lt;li&gt;heavy computational and memory cost:
        &lt;ul&gt;
          &lt;li&gt;need to store hidden states $\{\mathbf{a}^{(t)}\}_{t = 1}^T$&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
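&lt;p&gt;The forward propagation above can be illustrated with a minimal sketch in pure Python. Scalar states (instead of vectors and matrices) and the tanh/sigmoid choices for $g_a$ and $g_y$ are assumptions made for readability, not prescribed by the equations.&lt;/p&gt;

```python
# Hedged sketch of the naive RNN forward pass, with scalar states for
# readability; the weights and activation choices are illustrative.
import math

def rnn_forward(xs, w_aa, w_ax, b_a, w_ya, b_y, a0):
    # a(t) = tanh(w_aa * a(t-1) + w_ax * x(t) + b_a)
    # y(t) = sigmoid(w_ya * a(t) + b_y)
    a = a0
    states, outputs = [], []
    for x in xs:
        a = math.tanh(w_aa * a + w_ax * x + b_a)   # the same weights at every step
        y = 1.0 / (1.0 + math.exp(-(w_ya * a + b_y)))
        states.append(a)
        outputs.append(y)
    return states, outputs

states, outputs = rnn_forward([1.0, -0.5, 0.2], 0.5, 1.0, 0.0, 2.0, 0.0, 0.0)
# BPTT would keep every entry of `states` around for the backward pass,
# which is exactly the memory cost noted above.
```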

&lt;h1 id=&quot;truncated-backpropagation-through-time&quot;&gt;Truncated Backpropagation through Time&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;evenly split a long sequence into short subsequences: after every $k_1$ forward steps, perform one backward pass over the latest $k_2 ~ (\geq k_1)$ steps; repeat until the end of the sequence is reached &lt;img src=&quot;/assets/tbptt.jpeg&quot; alt=&quot;TBPTT&quot; /&gt;&lt;/li&gt;
  &lt;li&gt;a practical method to reduce computational and memory cost, but it loses long-term dependencies and yields a biased gradient estimate&lt;/li&gt;
&lt;/ul&gt;
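&lt;p&gt;The $(k_1, k_2)$ schedule described above can be sketched without any actual gradient computation. The sketch below only enumerates which steps each backward pass covers; the sequence length and the $k_1, k_2$ values are illustrative, and any remainder at the end of the sequence is ignored for simplicity.&lt;/p&gt;

```python
# Hedged sketch of the TBPTT(k1, k2) schedule only, no gradients:
# every k1 forward steps, a backward pass runs over the latest k2 steps.
def tbptt_schedule(seq_len, k1, k2):
    passes = []
    for t in range(1, seq_len + 1):
        if t % k1 == 0:
            start = max(1, t - k2 + 1)   # truncate: only the latest k2 steps
            passes.append((start, t))    # backward pass over steps start..t
    return passes

# e.g. a sequence of 10 steps with k1 = 3, k2 = 5
passes = tbptt_schedule(10, 3, 5)
# each tuple is the (first, last) step a backward pass touches; steps earlier
# than `first` never receive gradient, which is where long-term dependencies
# are lost and the bias in the gradient estimate comes from
```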

&lt;h1 id=&quot;anticipated-reweighted-truncated-backpropagation-artbp&quot;&gt;Anticipated Reweighted Truncated Backpropagation (ARTBP)&lt;/h1&gt;
&lt;p&gt;&lt;img src=&quot;/assets/artbp.png&quot; alt=&quot;TBP&quot; /&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;TODO(bmu)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;reference&quot;&gt;Reference&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.researchgate.net/publication/2343555_An_Efficient_Gradient-Based_Algorithm_for_On-Line_Training_of_Recurrent_Network_Trajectories&quot;&gt;An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories, Ronald J. Williams et al. 1990&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf&quot;&gt;Training Recurrent Neural Networks, Ilya Sutskever 2013&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1705.08209&quot;&gt;Unbiasing Truncated Backpropagation Through Time, Corentin Tallec et al. arxiv 2017&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">This post is a short review of Backpropagation Through Time and Truncated Backpropagation Through Time algorithms with a naive RNN model.</summary></entry><entry><title type="html">How to enable MathJax in Jekyll minima theme</title><link href="https://b-mu.github.io//jekyll/update/2021/12/22/enable-mathjax-in-jekyll.html" rel="alternate" type="text/html" title="How to enable MathJax in Jekyll minima theme" /><published>2021-12-22T04:21:00+00:00</published><updated>2021-12-22T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2021/12/22/enable-mathjax-in-jekyll</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2021/12/22/enable-mathjax-in-jekyll.html">&lt;p&gt;1: Find the minima bundle&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;bundle show minima
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;2: Inside the bundle (given by step 1), add the code block below to the end of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_layouts/default.html&lt;/code&gt; (i.e. after the outermost &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;html&amp;gt; ... &amp;lt;/html&amp;gt;&lt;/code&gt;)&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;script &lt;/span&gt;&lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text/x-mathjax-config&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;MathJax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;Hub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;Config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;extensions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;tex2jax.js&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;input/TeX&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;output/HTML-CSS&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;tex2jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;inlineMath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;displayMath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;processEscapes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;HTML-CSS&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;fonts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;TeX&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;});&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;script &lt;/span&gt;&lt;span class=&quot;na&quot;&gt;src=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text/javascript&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;note:
    &lt;ul&gt;
      &lt;li&gt;this step enables MathJax (with both inline and display mode)
        &lt;ul&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$$...$$&lt;/code&gt; is only rendered as display math if the lines above and below it are blank&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;the above code loads MathJax from the Content Delivery Network &lt;a href=&quot;https://cdnjs.com/&quot;&gt;cdnjs&lt;/a&gt;, which requires network access. Alternatively, download and install a local copy of MathJax on the server or hard disk.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3: Create a local copy of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_layouts&lt;/code&gt; and save the change&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cp -r {path to _layouts in the bundle} {path to repo}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h1 id=&quot;reference&quot;&gt;Reference:&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.mathjax.org/en/v2.7-latest/configuration.html&quot;&gt;MathJax v2.7 docs: Loading and Configuring MathJax&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://zjuwhw.github.io/2017/06/04/MathJax.html&quot;&gt;Blog: Use MathJax to write Equations in Jekyll blogs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">1: Find the minima bundle bundle show minima</summary></entry><entry><title type="html">mac os 10.15 + bootcamp(win 10) + nvidia eGPU</title><link href="https://b-mu.github.io//jekyll/update/2020/09/15/mac-nvidia-egpu.html" rel="alternate" type="text/html" title="mac os 10.15 + bootcamp(win 10) + nvidia eGPU" /><published>2020-09-15T04:21:00+00:00</published><updated>2020-09-15T04:21:00+00:00</updated><id>https://b-mu.github.io//jekyll/update/2020/09/15/mac-nvidia-egpu</id><content type="html" xml:base="https://b-mu.github.io//jekyll/update/2020/09/15/mac-nvidia-egpu.html">&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;back up mac&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;bootcamp win 10
    &lt;ul&gt;
      &lt;li&gt;download the &lt;a href=&quot;https://www.microsoft.com/en-ca/software-download/&quot;&gt;win 10 iso&lt;/a&gt; (~5.7 GB)&lt;/li&gt;
      &lt;li&gt;launch bootcamp, select a partition size, install windows (&lt;a href=&quot;https://support.apple.com/en-ca/HT201468&quot;&gt;Apple’s instruction&lt;/a&gt;)&lt;/li&gt;
      &lt;li&gt;system will reboot in windows&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;nvidia drivers
    &lt;ul&gt;
      &lt;li&gt;plug in the gpu and turn on its power supply (the msi RTX2080 super needs both 8-pin power connectors)&lt;/li&gt;
      &lt;li&gt;install &lt;a href=&quot;https://www.nvidia.com/en-us/geforce/geforce-experience/&quot;&gt;geforce experience&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;check drivers update in geforce experience&lt;/li&gt;
      &lt;li&gt;if you need the nvidia control panel but it is missing, try the &lt;a href=&quot;https://www.youtube.com/watch?v=Ytnv8XJ_hV4&quot;&gt;standard driver instead of the DCH driver&lt;/a&gt; via the &lt;a href=&quot;https://www.nvidia.com/Download/Find.aspx&quot;&gt;advanced driver search&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;anaconda
    &lt;ul&gt;
      &lt;li&gt;install &lt;a href=&quot;https://www.anaconda.com/products/individual&quot;&gt;anaconda&lt;/a&gt;
        &lt;ul&gt;
          &lt;li&gt;tick “add anaconda to the Path variable” during installation&lt;/li&gt;
          &lt;li&gt;or do it manually: start -&amp;gt; type “env var” -&amp;gt; click “edit the system environment variables” -&amp;gt; click “environment variables” -&amp;gt; select “Path” under User variables and click edit -&amp;gt; add “C:\Users\[user_name]\anaconda3; C:\Users\[user_name]\anaconda3\Scripts;”&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;create env &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda create --name env_gpu&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;install tensorflow-gpu 1: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conda install tensorflow-gpu=1.15&lt;/code&gt; (this automatically installs cuda 10.0 and cudnn 7.6, &lt;a href=&quot;https://www.tensorflow.org/install/source&quot;&gt;compatible&lt;/a&gt; with tensorflow 1.15)&lt;/li&gt;
      &lt;li&gt;check if gpu is visible: &lt;a href=&quot;https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell/38019608&quot;&gt;several methods&lt;/a&gt;, alternatively, run the command: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-smi&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;git-bash
    &lt;ul&gt;
      &lt;li&gt;install &lt;a href=&quot;https://git-scm.com/downloads&quot;&gt;git&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;if anaconda is successfully added to Path variable, then the command “conda activate env_gpu” should be recognized and the environment should be activated&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, python scripts should run in this environment. Since git-bash provides a bash shell, .sh scripts should also work.&lt;/p&gt;</content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html">back up mac</summary></entry></feed>