<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="https://nadavb.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nadavb.com/" rel="alternate" type="text/html" /><updated>2025-09-24T20:12:47+03:00</updated><id>https://nadavb.com/feed.xml</id><title type="html">Nadav Benedek</title><entry><title type="html">Label Shift and Domain Adaptation in Machine Learning</title><link href="https://nadavb.com/Label-Shift-and-Domain-Adaptation-in-Machine-Learning/" rel="alternate" type="text/html" title="Label Shift and Domain Adaptation in Machine Learning" /><published>2024-10-05T11:27:00+03:00</published><updated>2024-10-05T11:27:00+03:00</updated><id>https://nadavb.com/Label%20Shift%20and%20Domain%20Adaptation%20in%20Machine%20Learning</id><content type="html" xml:base="https://nadavb.com/Label-Shift-and-Domain-Adaptation-in-Machine-Learning/"><![CDATA[<p>TL;DR: If you want the best <em>accuracy</em> on the target domain, you have to match the class frequency in the training set. You cannot affect the 
ROC AUC nor the PR AUC, but you can affect the accuracy. If you don’t know the target distribution at training time, you can estimate it at test time using only the features, compute a correction weight for each training class, and retrain the model so that it matches the newly detected distribution (see <a href="https://arxiv.org/abs/1802.03916" target="_blank">this</a> paper). Sometimes the future target distribution is known in advance. For example, if you build a dice image classifier, you can expect the rolled dice to be uniformly distributed, so you can balance the classes at training time.</p>
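<p>The weight-correction idea can be sketched as follows. Under label shift, the target label distribution $q(y)$ is recoverable from quantities that need no target labels: the model’s confusion matrix on held-out source data, and the distribution of its predictions on target features. This is the essence of the linked paper; the function names below are illustrative, not taken from its code:</p>

```python
import numpy as np

def estimate_target_prior(confusion, target_pred_dist):
    """Estimate the target label distribution q(y) under label shift.

    confusion[i, j] = p(model predicts i | true class j), estimated on
    held-out *source* data; target_pred_dist[i] = fraction of *target*
    samples the model predicts as class i. Under label shift the
    prediction distribution satisfies confusion @ q = target_pred_dist,
    so we solve that linear system for q.
    """
    q = np.linalg.solve(confusion, target_pred_dist)
    q = np.clip(q, 0.0, None)   # estimation noise can push entries below 0
    return q / q.sum()

def class_reweights(source_prior, target_prior):
    # Importance weight q(y)/p(y) to apply to each training class.
    return np.asarray(target_prior) / np.asarray(source_prior)
```

The resulting per-class weights can then be used (e.g., as sample weights) when retraining on the source data.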

<h4 id="what-is-a-label-shift">What is a Label Shift?</h4>

<p>Label shift occurs when the class/label/target distribution at deployment time (test time) differs from the one at training time. For example, you train a cat/dog classifier using 1000 images of dogs and 4000 images of cats, because that’s the distribution of pets people have at home in France. However, when you deploy the model in Germany, where 50% of the people have cats and 50% have dogs, that’s a label shift.
In label shift, the target distribution is different, but the manifestation of targets as features remains the same. That means dogs in Germany look the same as dogs in France: you only have more dogs, but they are the same kind of dogs. If dogs in Germany looked different from dogs in France, that would be a different phenomenon, not label shift.
More formally, if the source distribution is $p$ and the target distribution is $q$, the class-conditional feature distribution remains the same, $p(\boldsymbol{x}|y)=q(\boldsymbol{x}|y)$, while the label distribution changes, $p(y) \neq q(y)$.</p>
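<p>As a quick sanity check of this condition, here is a small simulation (with made-up numbers): two populations with different class priors but identical class-conditional feature distributions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_population(n, p_class0):
    # Label shift: the prior p(y) differs between populations,
    # but p(x|y) is identical (same per-class Gaussians).
    y = (rng.random(n) >= p_class0).astype(int)       # label 0 or 1
    x = np.where(y == 0,
                 rng.normal(0.0, 1.0, n),             # class-0 features
                 rng.normal(3.0, 1.0, n))             # class-1 features
    return x, y

x_a, y_a = sample_population(100_000, 0.7)   # source-like prior
x_b, y_b = sample_population(100_000, 0.5)   # target-like prior
```

The class frequencies differ between the two samples, yet the per-class feature means (and spreads) agree, which is exactly the label-shift setting.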

<h4 id="live-example">Live example</h4>

<p>Let’s see what happens under label shift: we train a classifier on a source distribution, then check its performance both when the distribution stays the same and when the distribution of true labels changes.</p>

<p>Let’s say a basketball player’s average height is 180 cm, with a stdev of 10 cm, and a football player’s average height is 170 cm, also with a stdev of 10 cm. Assume this is true globally (in every country).</p>

<p>We collect a dataset of players in France, where our dataset contains 70% basketball players and 30% football players. We will train the model and check its performance in France. Then, we will check its performance when we deploy the model in Germany, where we have 50% basketball players and 50% football players.</p>

<p>Let’s write some code. First, a few helper functions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># @title Click to Expand/Collapse
</span><span class="n">basketball_height</span><span class="p">,</span> <span class="n">basket_std</span> <span class="o">=</span> <span class="mi">180</span><span class="p">,</span> <span class="mi">10</span>
<span class="n">football_height</span><span class="p">,</span> <span class="n">football_std</span> <span class="o">=</span> <span class="mi">170</span><span class="p">,</span> <span class="mi">10</span>

<span class="n">dataset_length</span> <span class="o">=</span> <span class="mi">20000</span>
<span class="n">test_set_portion</span> <span class="o">=</span> <span class="mf">0.4</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">classification_report</span><span class="p">,</span> <span class="n">roc_auc_score</span><span class="p">,</span> <span class="n">average_precision_score</span><span class="p">,</span> <span class="n">confusion_matrix</span>

<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">44</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">generate_dataset</span><span class="p">(</span><span class="n">dataset_length</span><span class="p">,</span> <span class="n">probability_of_0</span><span class="p">):</span>
  <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="n">dataset_length</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="p">[</span><span class="n">probability_of_0</span><span class="p">,</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">probability_of_0</span><span class="p">])</span>
  <span class="n">num_ones</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
  <span class="c1"># print(f"#[0] == {len(y)-num_ones}, #[1] == {num_ones}")
</span>  <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">(</span><span class="n">dataset_length</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">dataset_length</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
      <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">basketball_height</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">basket_std</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
      <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">football_height</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">football_std</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
  <span class="n">X</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># make X a matrix and not a vector
</span>  <span class="k">return</span> <span class="n">X</span><span class="p">,</span><span class="n">y</span>


<span class="k">def</span> <span class="nf">print_metrics</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_proba</span><span class="p">,</span> <span class="n">threshold</span><span class="p">):</span>

    <span class="n">y_pred</span> <span class="o">=</span> <span class="p">(</span><span class="n">y_proba</span> <span class="o">&gt;=</span> <span class="n">threshold</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Confusion Matrix - each row true class (percentage), at threshold of </span><span class="si">{</span><span class="n">threshold</span><span class="si">}</span><span class="s">:"</span><span class="p">)</span>
    <span class="n">cm</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
    <span class="n">cm_percentage</span> <span class="o">=</span> <span class="p">(</span><span class="n">cm</span> <span class="o">/</span> <span class="n">cm</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span> <span class="o">*</span> <span class="mi">100</span>  <span class="c1"># Normalize by the total number of samples
</span>    <span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">cm_percentage</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>  <span class="c1"># Print with two decimal places
</span>
    <span class="c1"># print(f"Model accuracy: {accuracy_score(y_test, y_pred):.2f}")
</span>
    <span class="c1"># Print classification report
</span>    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Classification Report at treshold </span><span class="si">{</span><span class="n">threshold</span><span class="si">}</span><span class="s">:"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>

    <span class="c1"># Calculate and print ROC AUC
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">y_test</span><span class="p">))</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>  <span class="c1"># Ensure it's binary classification
</span>        <span class="n">roc_auc</span> <span class="o">=</span> <span class="n">roc_auc_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_proba</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">ROC AUC: </span><span class="si">{</span><span class="n">roc_auc</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Calculate and print PR AUC
</span>        <span class="n">pr_auc</span> <span class="o">=</span> <span class="n">average_precision_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_proba</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"PR AUC (Precision-Recall AUC): </span><span class="si">{</span><span class="n">pr_auc</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">ROC AUC: Not applicable for multi-class classification"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">test_on_country_with_this_class_0_prob</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">probability_of_0</span><span class="p">,</span> <span class="n">threshold</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
  <span class="n">X</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">generate_dataset</span><span class="p">(</span><span class="n">dataset_length</span><span class="p">,</span> <span class="n">probability_of_0</span><span class="p">)</span>
  <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span>  <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="n">test_set_portion</span><span class="p">)</span>
  <span class="k">if</span> <span class="n">train</span><span class="p">:</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span>
    <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>

  <span class="n">y_proba</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_test</span><span class="p">)[:,</span> <span class="mi">1</span><span class="p">]</span>

  <span class="n">print_metrics</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_proba</span><span class="p">,</span> <span class="n">threshold</span><span class="p">)</span>
  <span class="k">if</span> <span class="n">train</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">model</span>


<span class="k">def</span> <span class="nf">train_test_on_source_then_test_on_two_more_countries</span><span class="p">(</span><span class="n">source_prob</span><span class="p">,</span> <span class="n">target_probabilities</span><span class="p">,</span> <span class="n">threshold</span><span class="p">):</span>
  <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">**** country we train+test at class 0 ratio of </span><span class="si">{</span><span class="n">source_prob</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
  <span class="n">model</span> <span class="o">=</span> <span class="n">test_on_country_with_this_class_0_prob</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">probability_of_0</span> <span class="o">=</span> <span class="n">source_prob</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="n">threshold</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">target_prob</span> <span class="ow">in</span> <span class="n">target_probabilities</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">**** Test in a country with class 0 ratio of </span><span class="si">{</span><span class="n">target_prob</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">test_on_country_with_this_class_0_prob</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">target_prob</span><span class="p">,</span> <span class="n">threshold</span><span class="p">)</span>
</code></pre></div></div>

<p>Now train + test on the SOURCE country, where class 0 has a population frequency of 0.7, and also test the model on countries with frequencies 0.9 and 0.5:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_test_on_source_then_test_on_two_more_countries</span><span class="p">(</span><span class="n">source_prob</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">target_probabilities</span><span class="o">=</span><span class="p">[</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">],</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
</code></pre></div></div>

<p>And the results:</p>

<pre><code class="language-python2">**** country we train+test at class 0 ratio of 0.7

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[63.3   6.42]
 [19.31 10.96]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.77      0.91      0.83      5578
           1       0.63      0.36      0.46      2422

    accuracy                           0.74      8000
   macro avg       0.70      0.63      0.65      8000
weighted avg       0.73      0.74      0.72      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.58

**** Test in a country with class 0 ratio of 0.9

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[81.7   8.15]
 [ 6.68  3.48]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.92      0.91      0.92      7188
           1       0.30      0.34      0.32       812

    accuracy                           0.85      8000
   macro avg       0.61      0.63      0.62      8000
weighted avg       0.86      0.85      0.86      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.30

**** Test in a country with class 0 ratio of 0.5

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[45.5   4.75]
 [31.15 18.6 ]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.59      0.91      0.72      4020
           1       0.80      0.37      0.51      3980

    accuracy                           0.64      8000
   macro avg       0.70      0.64      0.61      8000
weighted avg       0.69      0.64      0.61      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.75
</code></pre>

<p>In the country with a 0.9 ratio of basketball players, the accuracy <em>increased</em> (74%-&gt;85%), the ROC AUC remained the same, but the PR AUC decreased.</p>

<p>In the target country with an equal ratio of players (0.5), the accuracy is <em>lower</em> than in the source country (74%-&gt;64%), but the ROC AUC is the same. The ROC AUC remains the same no matter which threshold we choose and no matter which class-0 probability the target has. Also, the PR AUC increased.</p>

<p>Why does accuracy change under label shift? When we train a classifier, it learns not only the relation between the features and the label, but also the class proportions of the training distribution; its decision rule implicitly encodes the training prior.</p>
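<p>For a probabilistic classifier there is in fact a closed-form correction: by Bayes’ rule, under label shift the target posterior is proportional to the source posterior rescaled by the prior ratio, $q(y|\boldsymbol{x}) \propto p(y|\boldsymbol{x}) \, q(y)/p(y)$. A minimal sketch (the function name is mine):</p>

```python
import numpy as np

def adjust_posterior(proba, source_prior, target_prior):
    """Rescale class probabilities p(y|x) for a shifted label prior q(y).

    proba: (n, k) array of source-model probabilities.
    Returns q(y|x) proportional to p(y|x) * q(y) / p(y), renormalized per row.
    """
    w = np.asarray(target_prior, dtype=float) / np.asarray(source_prior, dtype=float)
    adjusted = proba * w                 # rescale each class column
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

For a well-calibrated model, thresholding these adjusted probabilities approximates retraining on the target proportions, without actually retraining.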

<p>Now let’s try to <em>adapt</em> the model to the 0.9 target country:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_test_on_source_then_test_on_two_more_countries</span><span class="p">(</span><span class="n">source_prob</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">target_probabilities</span><span class="o">=</span><span class="p">[</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">],</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
</code></pre></div></div>

<pre><code class="language-python2">**** country we train+test at class 0 ratio of 0.9

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[89.46  0.32]
 [ 9.69  0.52]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      7183
           1       0.62      0.05      0.09       817

    accuracy                           0.90      8000
   macro avg       0.76      0.52      0.52      8000
weighted avg       0.87      0.90      0.86      8000


ROC AUC: 0.78
PR AUC (Precision-Recall AUC): 0.31

**** Test in a country with class 0 ratio of 0.9

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[89.64  0.26]
 [ 9.6   0.5 ]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      7192
           1       0.66      0.05      0.09       808

    accuracy                           0.90      8000
   macro avg       0.78      0.52      0.52      8000
weighted avg       0.88      0.90      0.86      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.31

**** Test in a country with class 0 ratio of 0.5

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[49.74  0.15]
 [47.78  2.34]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.51      1.00      0.67      3991
           1       0.94      0.05      0.09      4009

    accuracy                           0.52      8000
   macro avg       0.72      0.52      0.38      8000
weighted avg       0.73      0.52      0.38      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.76
</code></pre>

<p>We can see that instead of the previous 85% accuracy, we now get 90% accuracy. Our model performs better at deployment time because we adapted it to the new distribution. Matching the training label proportions to the target yields the highest accuracy on the target domain.</p>
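<p>When regenerating training data at the target proportions is not possible, a common alternative (a sketch, not the code used above) is to keep the source data and reweight each training class by $q(y)/p(y)$, for example via scikit-learn’s <code>class_weight</code> parameter:</p>

```python
from sklearn.linear_model import LogisticRegression

def target_matched_model(source_prior, target_prior):
    # Weight class c by q(c)/p(c), so the effective training
    # distribution matches the target label distribution.
    weights = {c: target_prior[c] / source_prior[c]
               for c in range(len(source_prior))}
    return LogisticRegression(class_weight=weights)

# e.g. source is 70%/30% class 0/1, target is 90%/10%:
model = target_matched_model([0.7, 0.3], [0.9, 0.1])
# model.fit(X_train, y_train) then behaves approximately as if
# trained on data drawn at the target proportions.
```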

<p>What will happen if we train our model using balanced classes (0.5)?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_test_on_source_then_test_on_two_more_countries</span><span class="p">(</span><span class="n">source_prob</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">target_probabilities</span><span class="o">=</span><span class="p">[</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">],</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
</code></pre></div></div>

<p>And the results:</p>

<pre><code class="language-python2">**** country we train+test at class 0 ratio of 0.5

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[34.35 14.96]
 [16.14 34.55]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.68      0.70      0.69      3945
           1       0.70      0.68      0.69      4055

    accuracy                           0.69      8000
   macro avg       0.69      0.69      0.69      8000
weighted avg       0.69      0.69      0.69      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.76

**** Test in a country with class 0 ratio of 0.9

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[62.12 27.86]
 [ 3.16  6.85]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.95      0.69      0.80      7199
           1       0.20      0.68      0.31       801

    accuracy                           0.69      8000
   macro avg       0.57      0.69      0.55      8000
weighted avg       0.88      0.69      0.75      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.30

**** Test in a country with class 0 ratio of 0.5

Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[35.5  15.09]
 [15.18 34.24]]

Classification Report at threshold 0.5:
              precision    recall  f1-score   support

           0       0.70      0.70      0.70      4047
           1       0.69      0.69      0.69      3953

    accuracy                           0.70      8000
   macro avg       0.70      0.70      0.70      8000
weighted avg       0.70      0.70      0.70      8000


ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.75
</code></pre>

<p>We can see that <em>both</em> the accuracy and the ROC AUC remained the same in the two different target distributions (~0.69), but the PR AUC changed.</p>

<p>That means that if you train your model on balanced classes, the model will perform with the same accuracy on any future target distribution. Not necessarily optimal accuracy, but constant.</p>

<h4 id="final-conclusions">Final Conclusions</h4>

<ol>
  <li>
<p>If you want the accuracy (the trace of the normalized confusion matrix) not to change under label shift, train with balanced classes. However, this constant accuracy comes at a price: it will be lower than if you had trained with the class proportions of the target domain.
If you want to achieve the highest accuracy on the target domain, you should train (in the source domain) with the same class proportions as in the target domain. This is a form of domain adaptation.</p>
  </li>
  <li>
<p>It is interesting that the PR AUC in a target domain depends only on the class ratio in that domain; it does <em>not</em> depend on the training ratio in the source domain. So the PR AUC is not affected by class balancing during training; you cannot change it that way.</p>
  </li>
  <li>
    <p>ROC AUC does not change during label-shift, no matter what your training distribution is.</p>
  </li>
  <li>
<p>When you move from domain A to domain B (under the label-shift assumption), the ROC AUC stays the same, while the accuracy and the PR AUC may each improve or worsen. <em>But</em> while you cannot affect the ROC AUC and PR AUC in the target domain, you <em>can</em> affect the accuracy, by retraining the classifier with the right class proportions. Even if accuracy increases when moving from A to B, you can make it higher still by matching the label proportions.</p>
  </li>
</ol>]]></content><author><name>Nadav Benedek</name></author><summary type="html"><![CDATA[TL;DR: If you want the best accuracy on the target domain, you have to match the class frequency in the training set. You cannot affect the ROCAUC nor the PRAUC, but you can affect the accuracy. If you don’t know the target distribution at training time, you can measure the distribution during test time using only the features, calculate a weight-correction for every training example class, and retrain the model so it will match the newly detected distribution (See this paper). Sometimes you can know the future target distribution. For example, if you predict a dice image classifier, you can expect the rolled dice to have a uniform distribution, and so you can balance the classes at training time.]]></summary></entry><entry><title type="html">Memory Footprint of a Neural Net During Backpropagation</title><link href="https://nadavb.com/Memory-Footprint-of-Neural-Net/" rel="alternate" type="text/html" title="Memory Footprint of a Neural Net During Backpropagation" /><published>2024-04-11T14:22:00+03:00</published><updated>2024-04-11T14:22:00+03:00</updated><id>https://nadavb.com/Memory%20Footprint%20of%20Neural%20Net</id><content type="html" xml:base="https://nadavb.com/Memory-Footprint-of-Neural-Net/"><![CDATA[<p>In this article we discuss the memory footprint of a neural network during backpropagation, how backprop works, what affects the memory footprint, code to demonstrate the memory footprint, and more.</p>

<h2 id="backpropagation">Backpropagation</h2>

<p>Let’s have a look at backpropagation in a three-layer network. Let’s denote by $f(\cdot)$ a general function of the inner parameters, by $x_1$ the input to the first layer, by $x_2$ the input to the second layer (and the output of the first layer), and by $x_4$ the output of the network. So we have:
 $x_2=f(x_1=\text{input}, w_1) \quad|\quad  x_3=f(x_2,w_2)  \quad|\quad   x_4=f(x_3,w_3)  \quad|\quad L=f(x_4) $</p>

<p>Now, if we want to find the gradient of $w_1$ we have: $ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial x_4} \frac{\partial x_4}{\partial x_3}   \frac{\partial x_3}{\partial x_2}   \frac{\partial x_2}{\partial w_1}$</p>

<p>Let’s mark the parameters of $f(\cdot)$ in bold when they are essential in the general case, and in light font when they are only sometimes needed, depending on the actual function. For example, if $x_4=x_3 w_3$ then $ \frac{\partial x_4}{\partial x_3}=w_3$, so the derivative is <strong>only</strong> a function of $w_3$ and <strong>does not depend on the layer input activation $x_3$</strong>. However, if $x_4=\sigma (x_3 w_3)$, where $\sigma$ is a sigmoid or ReLU, then the derivative depends on both variables. That’s what makes nonlinearity non-linear, after all. So, let’s look at all the terms in the chain rule, and for each of them, analyze whether it always depends on the input to the layer ($x$) or only sometimes:</p>

\[\newcommand{\mb}[1]{\mathbf{#1}}
\newcommand{\mi}[1]{\textit{#1}}

\frac{\partial L}{\partial w_1} =  \underbrace{  \frac{\partial L}{\partial x_4}}_{f(\mathbf{x_4})}     \underbrace{ \frac{\partial x_4}{\partial x_3} }_{f(x_3, \mb{w_3})}       \underbrace{\frac{\partial x_3}{\partial x_2} }_{f(x_2, \mb{w_2})}     \underbrace{ \frac{\partial x_2}{\partial w_1}}_{f(\mb{x_1}, w_1)}\]

<p><u>Observation 1</u>:  We can see that the first term (which represents the <strong>last</strong> layer), the impact of the network output on the loss, always depends on the output of the network, which is obvious. In the last term, which is the layer we want to optimize, the <strong>input to the layer is mandatory</strong>, unless the weight we’re interested in is, for example, the bias, in which case the derivative does not depend on the input.</p>

<p><u>Observation 2</u>:  What information do we need to store, before the backprop starts, in order to update $w_1$? We can see that in the <strong>general</strong> case, we need all layer inputs/outputs, meaning all $x_1 .. x_4$. That means that after the <strong>forward()</strong> pass, we must store all activations of the network: the input, the hidden representations, and the output (which is the input to the next layer). <strong>However</strong> in some cases, for example if one of the <strong>inner</strong> layers is a <strong>linear layer</strong> with no nonlinearity, we do not need to store the input for the layer during the forward pass. This will be demonstrated using the code below.</p>

<p>Combining the two observations, we can conclude that in the special case where (1) we have a layer which is <strong>frozen</strong> (meaning that we do not want to optimize its weights) and (2) the frozen layer is a <strong>linear layer</strong>, we can choose <strong>not</strong> to store its input activation during the forward pass, since it is not needed for optimizing the layer itself (it’s frozen) and not needed for updating upstream weights in the DAG/computation graph.</p>

<p>Furthermore, for efficiency, the <strong>backprop</strong> starts from the end. Here are the steps:</p>

<ol>
  <li>
<p>We first compute and hold as state $s_4 =  \underbrace{  \frac{\partial L}{\partial x_4}}_{f(\mathbf{x_4})} $. Reminder: $x_4$ is the output of the network, so what we are calculating is the derivative of the loss function with respect to the model output. We need $x_4$ to calculate this gradient, but once it is computed, we can release the activation $x_4$ from memory as we will not use it anymore.</p>
  </li>
  <li>
<p>If the third layer is unfrozen, update the (last in the chain) weight \(\nabla w_3 = s_4  \underbrace{  \frac{\partial x_4}{\partial w_3}}_{f(\mb{x_3},w_3)}\), then compute $s_3 = s_4  \underbrace{ \frac{\partial x_4}{\partial x_3} }_{f(x_3, \mb{w_3})}  $ , and now we can release the activation $x_3$ from memory. We can see that if a layer is both frozen and linear, we do not use the activation $x_3$ at all, and in this case we do not need to store it in the first place.</p>
  </li>
  <li>
<p>If the second layer is unfrozen, update the weight \(\nabla w_2 = s_3 \underbrace{  \frac{\partial x_3}{\partial w_2}}_{f(\mb{x_2},w_2)}\). Note that this calculation needs both the temporary gradient flow that arrived backward from the next layer and the input activation to this layer. You can think of it as follows: we need information from both sides, the input information and the feedback from the output channel.
Compute \(s_2 = s_3  \underbrace{\frac{\partial x_3}{\partial x_2} }_{f(x_2, \mb{w_2})}\), and now we can release the activation $x_2$ from memory.</p>
  </li>
  <li>
    <p>If the first layer is unfrozen, update the weight \(\nabla w_1 = s_2  \underbrace{ \frac{\partial x_2}{\partial w_1}}_{f(\mb{x_1}, w_1)}\).  We can release the input to the network $x_1$ and we’re done.</p>
  </li>
</ol>
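<p>The four steps above can be sketched in a few lines of plain Python. This is an illustrative toy with scalar sigmoid layers (the numbers and names are made up): one running state $s$ flows backward, each stored activation is used once and then popped (released), and the quantities of the form $f(x_i, w_i)$ are recomputed from the layer input and weight alone:</p>

```python
import math

def sigma(z):  # the nonlinearity; sigma'(z) = sigma(z) * (1 - sigma(z))
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass with layers x_{i+1} = sigma(x_i * w_i): store every activation
x1, w1, w2, w3 = 0.5, 0.3, -0.7, 1.2
acts = {"x1": x1}
acts["x2"] = sigma(acts["x1"] * w1)
acts["x3"] = sigma(acts["x2"] * w2)
acts["x4"] = sigma(acts["x3"] * w3)

# Backward pass, mirroring steps 1-4: keep one running state s, use each
# stored activation once, then pop (release) it from the activation store.
x4 = acts.pop("x4")
s = 2.0 * (x4 - 1.0)                    # step 1: s4 = dL/dx4 for L = (x4-1)^2

x3 = acts.pop("x3")
a = sigma(x3 * w3)                      # recomputed from (x3, w3) only
grad_w3 = s * a * (1.0 - a) * x3        # step 2: s4 * dx4/dw3
s = s * a * (1.0 - a) * w3              # s3 = s4 * dx4/dx3; x3 now released

x2 = acts.pop("x2")
a = sigma(x2 * w2)
grad_w2 = s * a * (1.0 - a) * x2        # step 3: s3 * dx3/dw2
s = s * a * (1.0 - a) * w2              # s2

x1_in = acts.pop("x1")
a = sigma(x1_in * w1)
grad_w1 = s * a * (1.0 - a) * x1_in     # step 4: s2 * dx2/dw1
assert not acts                         # every activation has been released

# Cross-check grad_w1 with a central finite difference
def loss(w1_):
    x2_ = sigma(x1 * w1_); x3_ = sigma(x2_ * w2); x4_ = sigma(x3_ * w3)
    return (x4_ - 1.0) ** 2

eps = 1e-6
fd = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(grad_w1 - fd) < 1e-8
```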

<p><u>Observation 3</u>: We do not need to hold the input activation to a layer which is frozen and for which all the upstream weights (the ancestor weights in the DAG) are frozen too, since nothing will use the computation of $s$ there.</p>

<p><strong>To conclude</strong>:</p>

<p>Activation memory allocation: when a layer is frozen AND (the derivative of its output w.r.t. its input does not depend on the input, as in the linear case, OR all upstream dependent weights are frozen too), we can save memory and not store the input activation.</p>

<p>Activation memory deallocation: during backprop, we can release the activations we’ve already used, to free memory.</p>

<h2 id="network-memory-footprint">Network Memory Footprint</h2>

<p>In a PyTorch <strong>training loop</strong>, we have five basic steps:</p>

<ol>
  <li>
    <p><strong>Load</strong> the network to memory. If the network has 100M parameters and we use 32bit float per parameter, it will take 400MB.</p>
  </li>
  <li>
<p>Compute the <strong>$\mi{forward()}$</strong> pass, and store some activations, depending on the conclusions above. Activations are stored for each sample in a batch, so the memory footprint depends on the <strong>batch size</strong>. If we train with mixed precision, the forward activations are kept in 16 bits instead of 32, so the footprint is halved.</p>
  </li>
  <li>
<p>Compute the $\mi{backward()}$ pass: allocate gradient storage for the unfrozen parameters, use the activations we’ve stored to compute the gradients of unfrozen layers, and <strong>release</strong> unneeded activations as we go backward. In this process we calculate two types of gradients: gradients w.r.t. the weights, which are stored in param.grad, and intermediate gradients w.r.t. the input activations, which are needed internally to continue the backprop and are freed when no longer needed.</p>
  </li>
  <li>
<p>Run the optimizer for the unfrozen layers, $\mi{optimizer.step()}$: it uses the gradients we calculated, and stores and updates internal moments/optimizer state only for unfrozen layers. Batch size does not affect the memory allocation of the optimizer, since all gradients are summed in place, and when a GPU is used, cores work in parallel to update the .grad of the tensors. If we use the Adam optimizer, two moments are kept for each parameter.</p>
  </li>
  <li>
<p>Release the gradients we’ve accumulated using $\mi{zero\_grad()}$.</p>
  </li>
</ol>
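<p>As a rough sketch, the five steps above can be turned into a back-of-the-envelope byte count. The helper below is illustrative, not a real profiler; the activations-per-sample figure is an arbitrary made-up number:</p>

```python
# Rough per-step footprint for the training loop above. Assumes float32,
# every parameter unfrozen, and Adam (two moments per parameter).
def training_memory_bytes(n_params, acts_per_sample, batch_size,
                          bytes_per_float=4, adam_moments=2):
    return {
        "model": n_params * bytes_per_float,                            # step 1
        "activations": batch_size * acts_per_sample * bytes_per_float,  # step 2
        "gradients": n_params * bytes_per_float,                        # step 3
        "optimizer_state": adam_moments * n_params * bytes_per_float,   # step 4
    }

# The 100M-parameter example from step 1 (acts_per_sample is made up):
est = training_memory_bytes(100_000_000, acts_per_sample=1_000_000, batch_size=8)
assert est["model"] == 400_000_000            # the 400MB from step 1
assert est["optimizer_state"] == 800_000_000  # two Adam moments
```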

<h2 id="demonstration-code">Demonstration Code</h2>

<p>Run this code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="k">def</span> <span class="nf">test_memory</span><span class="p">(</span><span class="n">in_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">out_size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">freeze_start</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">freeze_end</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
                <span class="n">hidden_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">optimizer_type</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
                <span class="n">device</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">add_relu</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>

  <span class="n">sample_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">in_size</span><span class="p">)</span>

  <span class="n">layers</span> <span class="o">=</span> <span class="p">[</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)]</span>
  <span class="k">for</span> <span class="n">layer_index</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
    <span class="n">layers_to_append</span> <span class="o">=</span> <span class="p">[</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)]</span>
    <span class="k">if</span> <span class="n">add_relu</span><span class="p">:</span>
      <span class="n">layers_to_append</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">())</span>

    <span class="c1"># Selectively freeze some layers
</span>    <span class="k">if</span> <span class="n">freeze_start</span> <span class="o">&lt;=</span> <span class="n">layer_index</span> <span class="o">&lt;</span> <span class="n">freeze_end</span><span class="p">:</span>
      <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">layers_to_append</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">layer</span><span class="p">.</span><span class="n">parameters</span><span class="p">():</span>
          <span class="n">param</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">False</span>

    <span class="n">layers</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">layers_to_append</span><span class="p">)</span>

  <span class="n">layers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">out_size</span><span class="p">))</span>
  <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"number of layers: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">layers</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
  <span class="n">model</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">)</span>

  <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optimizer_type</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="p">.</span><span class="mi">001</span><span class="p">)</span>
  <span class="n">start</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
  <span class="k">print</span><span class="p">(</span><span class="s">"Starting at 0 memory usage as baseline."</span><span class="p">)</span>
  <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
  <span class="n">after_model</span> <span class="o">=</span>  <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="o">-</span> <span class="n">start</span>
  <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"1: After model to device: </span><span class="si">{</span><span class="n">after_model</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
  <span class="k">print</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Iteration"</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>

    <span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>  <span class="o">-</span> <span class="n">start</span>

    <span class="c1"># Running the forward pass. Here all activations will be saved, 
</span>    <span class="c1"># per every sample in batch
</span>    <span class="n">out</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">sample_input</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)).</span><span class="nb">sum</span><span class="p">()</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="o">-</span> <span class="n">start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"2: Memory consumed after forward pass (activations stored, depends on batch size): </span><span class="si">{</span><span class="n">b</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> change: "</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">b</span> <span class="o">-</span> <span class="n">a</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">'</span> <span class="p">)</span>  <span class="c1"># batch * num layers * hidden_size * 4 bytes per float
</span>
    <span class="c1"># Backward step: Here we allocate (unless already allocated) 
</span>    <span class="c1"># and store the gradient of each non-frozen parameter,
</span>    <span class="c1"># and we release/discard the activations which are descendants in the DAG as we go.
</span>    <span class="c1"># So at the end the change in memory = +non-frozen parameters (if was unallocated) - non-degenerate activations
</span>    <span class="c1"># gradients are accumulated in place in the .grad attribute of the tensors 
</span>    <span class="c1"># for which gradients are being computed. Each GPU core works on a different
</span>    <span class="c1"># part of the .grad tensor, so they can all work in parallel
</span>    <span class="n">out</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="o">-</span> <span class="n">start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"3: After backward pass (activations released, grad stored) </span><span class="si">{</span><span class="n">c</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> change: </span><span class="si">{</span><span class="n">c</span><span class="o">-</span><span class="n">b</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="c1"># Running the optimizer, at the first time, will store 2 moments for each non-frozen parameter (if using Adam), which will be kept throughout the training
</span>    <span class="c1"># So change in memory, in the first time = 2 * non-frozen parameters
</span>    <span class="c1"># optimizer changes the model parameters in place
</span>    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="n">d</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>  <span class="o">-</span> <span class="n">start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"4: After optimizer step (moments stored at first time): </span><span class="si">{</span><span class="n">d</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> change: </span><span class="si">{</span><span class="n">d</span><span class="o">-</span><span class="n">c</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> "</span> <span class="p">)</span>

    <span class="c1"># zero_grad = Reset and release gradients tensors created in .backward()
</span>    <span class="n">model</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>  <span class="o">-</span> <span class="n">start</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"5: After zero_grad step (grads released): </span><span class="si">{</span><span class="n">e</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> change: </span><span class="si">{</span><span class="n">e</span><span class="o">-</span><span class="n">d</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> "</span> <span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>

<span class="n">test_memory</span><span class="p">(</span><span class="n">optimizer_type</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span> <span class="n">freeze_start</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">freeze_end</span><span class="o">=</span><span class="mi">0</span>
            <span class="p">,</span> <span class="n">add_relu</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>Let’s have a look at the second iteration, for example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">Iteration</span> <span class="mi">2</span>
<span class="mi">2</span><span class="p">:</span> <span class="n">Memory</span> <span class="n">consumed</span> <span class="n">after</span> <span class="n">forward</span> <span class="k">pass</span> <span class="p">(</span><span class="n">activations</span> <span class="n">stored</span><span class="p">,</span> <span class="n">depends</span> <span class="ow">on</span> <span class="n">batch</span> <span class="n">size</span><span class="p">):</span> <span class="mi">46</span><span class="p">,</span><span class="mi">616</span><span class="p">,</span><span class="mi">576</span> <span class="n">change</span><span class="p">:</span>  <span class="mi">5</span><span class="p">,</span><span class="mi">171</span><span class="p">,</span><span class="mi">200</span>
<span class="mi">3</span><span class="p">:</span> <span class="n">After</span> <span class="n">backward</span> <span class="k">pass</span> <span class="p">(</span><span class="n">activations</span> <span class="n">released</span><span class="p">,</span> <span class="n">grad</span> <span class="n">stored</span><span class="p">)</span> <span class="mi">49</span><span class="p">,</span><span class="mi">580</span><span class="p">,</span><span class="mi">544</span> <span class="n">change</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span><span class="mi">963</span><span class="p">,</span><span class="mi">968</span>
<span class="mi">4</span><span class="p">:</span> <span class="n">After</span> <span class="n">optimizer</span> <span class="n">step</span> <span class="p">(</span><span class="n">moments</span> <span class="n">stored</span> <span class="n">at</span> <span class="n">first</span> <span class="n">time</span><span class="p">):</span> <span class="mi">49</span><span class="p">,</span><span class="mi">580</span><span class="p">,</span><span class="mi">544</span> <span class="n">change</span><span class="p">:</span> <span class="mi">0</span> 
<span class="mi">5</span><span class="p">:</span> <span class="n">After</span> <span class="n">zero_grad</span> <span class="n">step</span> <span class="p">(</span><span class="n">grads</span> <span class="n">released</span><span class="p">):</span> <span class="mi">41</span><span class="p">,</span><span class="mi">445</span><span class="p">,</span><span class="mi">376</span> <span class="n">change</span><span class="p">:</span> <span class="o">-</span><span class="mi">8</span><span class="p">,</span><span class="mi">135</span><span class="p">,</span><span class="mi">168</span> 

</code></pre></div></div>

<p>What’s going on? The forward() pass allocated 5M of activations memory. You can play with batch_size and see how it affects the activation memory size. You can set freeze_end=200 and watch the activation memory drop; however, if you then set add_relu=True, the memory footprint goes up again, since the layers are no longer linear, as we showed above.</p>

<p>The backward() allocated 8M for the gradients (the size of the unfrozen model), but released 5M of activations, so at the end we see a net increase of 3M.</p>

<p>The optimizer step allocated nothing in this iteration, since it already allocated 16M at the first iteration, which is exactly two moments per each parameter in the model.</p>

<p>And the zero_grad released the 8M of gradients.</p>
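<p>These numbers can be reproduced with a little arithmetic, assuming that the CUDA caching allocator rounds every tensor up to a 512-byte block:</p>

```python
def round512(nbytes):
    # The CUDA caching allocator rounds each tensor up to a 512-byte block
    return -(-nbytes // 512) * 512  # ceiling division, then scale back up

batch, hidden = 64, 100

# Activations: the float32 input to each of the 202 linear layers
# (first layer + 200 hidden layers + last layer), one (64, 100) tensor each
activations = 202 * round512(batch * hidden * 4)
assert activations == 5_171_200          # the forward-pass change above

# Gradients: one .grad tensor per parameter tensor
first = round512(100 * 100 * 4) + round512(100 * 4)    # weight + bias
hidden_grads = 200 * round512(hidden * hidden * 4)     # bias=False layers
last = round512(hidden * 10 * 4) + round512(10 * 4)    # weight + bias
gradients = first + hidden_grads + last
assert gradients == 8_135_168            # the amount zero_grad released
```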

<h2 id="peak-memory-consumption">Peak Memory Consumption</h2>

<p>So what is the peak memory consumption for a network? We have two places where we could potentially reach peak memory consumption.</p>

<ol>
  <li>After forward(): model + 2 x model (for Adam optimizer) + activations (batch dependent)</li>
  <li>After backward(): model + 2 x model (for Adam optimizer) + gradients (non-frozen model params)</li>
</ol>

<p>So the answer depends on which component dominates in the specific network: the activations or the gradients. For example, in CNNs we may have a large activation space even with few parameters. In Transformers, the activations depend on the sequence length. Activations are batch dependent, while the gradient memory footprint depends only on the size of the weights.</p>

<p>So we can phrase the peak memory consumption as:</p>

<p>model + 2 x model (for the Adam optimizer) + MAX( gradients (non-frozen model params; can be doubled if more than one GPU is used in training, for accumulation), activations [batch size * activations_per_sample * activation_precision] )</p>
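<p>That expression, as a tiny helper (a sketch; all sizes in bytes, float32 model, Adam with two moments, and the multi-GPU doubling exposed as a parameter):</p>

```python
# Peak memory per the formula above: weights + Adam state + the larger of
# (gradient storage) and (activation storage).
def peak_memory(model_bytes, grad_bytes, activation_bytes, grad_copies=1):
    optimizer_state = 2 * model_bytes               # Adam's two moments
    return model_bytes + optimizer_state + max(grad_copies * grad_bytes,
                                               activation_bytes)

# Activation-dominated regime (e.g. a CNN with big feature maps):
assert peak_memory(400, grad_bytes=400, activation_bytes=6000) == 7200
# Gradient-dominated regime, doubled for multi-GPU accumulation:
assert peak_memory(400, grad_bytes=400, activation_bytes=100,
                   grad_copies=2) == 2000
```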

<h2 id="methods-to-reduce-memory-footprint">Methods to reduce memory footprint</h2>

<h3 id="gradient-accumulation">Gradient Accumulation</h3>

<p>We can run the optimizer.step() and optimizer.zero_grad() steps only once in a while, essentially splitting a batch into sub-batches. As seen above, this reduces the activation memory allocation, but does not reduce the gradient memory allocation.</p>
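<p>A plain-Python sanity check of why this works (toy data, hypothetical names): for a loss that sums over samples, sub-batch gradients add up to the full-batch gradient, so stepping once after several sub-batches is equivalent, while only one sub-batch’s activations need to be alive at a time:</p>

```python
# Gradient of a summed MSE loss for a 1-parameter linear model y_hat = w * x
def grad_mse(w, xs, ys):            # d/dw sum_i (w*x_i - y_i)^2
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.7
xs = [0.1, 0.4, -0.3, 0.9, 1.1, -0.2]
ys = [0.2, 0.5, -0.1, 1.0, 0.9, 0.0]

full = grad_mse(w, xs, ys)          # one big batch: all activations at once

accum, sub = 0.0, 2                 # accumulate over sub-batches of size 2
for i in range(0, len(xs), sub):
    accum += grad_mse(w, xs[i:i+sub], ys[i:i+sub])

assert abs(full - accum) < 1e-12    # same gradient, smaller live working set
```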

<h3 id="gradient-checkpointing">Gradient Checkpointing</h3>

<p>This really should have been called Activation Checkpointing. Instead of storing all the activations computed in the forward() pass, we store only a <strong>subset</strong> of them, reducing the memory footprint, and re-compute the missing activations on the fly, only when needed during the backward() computation.</p>
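<p>A toy sketch of the idea for a chain of sigmoid layers (PyTorch’s real implementation lives in torch.utils.checkpoint; everything below is illustrative): store only every k-th activation during forward, then recompute any missing activation from the nearest earlier checkpoint during backward:</p>

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

ws = [0.9, -0.4, 0.7, 1.1, -0.6, 0.5]    # illustrative layer weights

def forward(x, store_every):
    saved = {0: x}                       # x_0, the network input
    for i, w in enumerate(ws):
        x = sigma(x * w)
        if (i + 1) % store_every == 0:
            saved[i + 1] = x             # checkpoint x_{i+1}
    return x, saved

x0 = 0.5
out, saved = forward(x0, store_every=3)  # checkpoints at x_0, x_3, x_6
assert len(saved) == 3                   # instead of all 7 activations

def activation(i):
    """Recompute x_i on the fly from the nearest earlier checkpoint."""
    j = max(c for c in saved if c <= i)
    x = saved[j]
    for w in ws[j:i]:                    # re-run the forward segment
        x = sigma(x * w)
    return x

# Every recomputed activation matches the fully-stored reference
_, ref = forward(x0, store_every=1)
assert all(abs(activation(i) - ref[i]) < 1e-12 for i in range(len(ws) + 1))
```

<p>The usual trade-off applies: memory drops roughly by the checkpoint spacing, at the price of one extra partial forward pass during backprop.</p>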

\[\square\]]]></content><author><name>Nadav Benedek</name></author><category term="training,memory,neural" /><category term="net," /><category term="dnn," /><category term="backward," /><category term="batch" /><summary type="html"><![CDATA[In this article we discuss the memory footprint of a neural network during backpropagation, how backprop works, what affects the memory footprint, code to demonstrate the memory footprint, and more.]]></summary></entry><entry><title type="html">PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation</title><link href="https://nadavb.com/PRILoRA/" rel="alternate" type="text/html" title="PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation" /><published>2024-01-21T05:20:00+02:00</published><updated>2024-01-21T05:20:00+02:00</updated><id>https://nadavb.com/PRILoRA</id><content type="html" xml:base="https://nadavb.com/PRILoRA/"><![CDATA[<p>With the proliferation of large pre-trained language models (PLMs), fine-tuning all model parameters becomes increasingly inefficient, particularly when dealing with numerous downstream tasks that entail substantial training and storage costs. Several approaches aimed at achieving parameter-efficient fine-tuning (PEFT) have been proposed. Among them, Low-Rank Adaptation (LoRA) stands out as an archetypal method, incorporating trainable rank decomposition matrices into each target module. Nevertheless, LoRA does not consider the varying importance of each layer. To address these challenges, we introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process, considering both the temporary magnitude of weights and the accumulated statistics of the input to any given layer. We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.</p>

<p>See the full <a href="https://arxiv.org/abs/2401.11316">paper</a>.</p>]]></content><author><name>Nadav Benedek</name></author><category term="LLM," /><category term="LoRA," /><category term="finetuning" /><summary type="html"><![CDATA[With the proliferation of large pre-trained language models (PLMs), fine-tuning all model parameters becomes increasingly inefficient, particularly when dealing with numerous downstream tasks that entail substantial training and storage costs. Several approaches aimed at achieving parameter-efficient fine-tuning (PEFT) have been proposed. Among them, Low-Rank Adaptation (LoRA) stands out as an archetypal method, incorporating trainable rank decomposition matrices into each target module. Nevertheless, LoRA does not consider the varying importance of each layer. To address these challenges, we introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process, considering both the temporary magnitude of weights and the accumulated statistics of the input to any given layer. We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.]]></summary></entry><entry><title type="html">Feature-preprocessing/engineering leakage during data-preparation and Train-Test Split Strategy Protocol</title><link href="https://nadavb.com/Feature-Preprocessing-Leakage-During-Data-Preparation/" rel="alternate" type="text/html" title="Feature-preprocessing/engineering leakage during data-preparation and Train-Test Split Strategy Protocol" /><published>2023-08-05T11:27:00+03:00</published><updated>2023-08-05T11:27:00+03:00</updated><id>https://nadavb.com/Feature-Preprocessing%20Leakage%20During%20Data-Preparation</id><content type="html" xml:base="https://nadavb.com/Feature-Preprocessing-Leakage-During-Data-Preparation/"><![CDATA[<p>Are we allowed to transform the input data in any way we want? 
Can we train sub-models to preprocess features? Can we use a pipeline of models? Can we use the output of one model as an input of another model?</p>

<p>Assume we have a supervised learning problem, and we would like to preprocess a feature using a separate supervised model.</p>

<h4 id="minimal-problem-example">Minimal problem example:</h4>

<p>We would like to predict a 0-1 label (boolean), using 2 features: 1 numerical and 1 textual feature, using a decision tree (call it model A). Since decision trees use numbers, we would like to take the textual feature and <em>transform</em> it, using a separate supervised model that takes a textual input, predicts a scalar label (call it model B), and use this model to convert the textual feature into a numerical feature, so that we will have 2 numerical features, and then use a decision tree to predict the 0-1 label (boolean).</p>

<p>The question is if this process is legit. And if it is legit, are there any restrictions on the process to make it legit?</p>

<p>To make it more specific, can we first train model B, any way we want, and then transform the feature and train model A? Can we do a train/test split any way we want (randomly) when training model B, and then do a train/test split (randomly) when training model A? Or must the split be the same when training model B and model A? If the latter is required, it can be complicated in real-life scenarios, where you need to enforce the same train/test split procedure across all ML teams in an organization involved in the project.</p>

<p>Let’s make the problem even more simple: Assume that the textual feature is just a random string, meaning that it carries zero information in it, and that the numerical feature is also random and has no correlation with the label. Assume we have 1000 examples.</p>

<p>So, we train model B in a very overfitted manner, meaning that the training accuracy (800 examples) is 100%, and test accuracy (200 examples) is 50% (random guessing). This can happen when the model memorizes all the random texts that correspond with 0-labels and all the texts that correspond with 1-labels. That means that transforming the textual feature using model B will convert the training set features into the training-labels themselves.</p>

<p>Now, let’s say that in model A training, 100% of the examples in the test-set (200 of 200) are actually train-set examples of model B (because we did a new random train/test split). As for the training-set, 600 out of the 800 are examples that were part of the training-set of model B. They are completely overfitted, so they (the features) contain the label itself. The other 200 are random and have no correlation to the label. So training A will yield a model that simply uses the overfitted feature, generated by model B, to predict the label. Therefore, the test accuracy of model A will be 100%, although the features have zero information in them, because the transformed feature predicts exactly the label.</p>
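<p>The failure mode above can be reproduced in a few lines. Everything here is illustrative (sizes, seeds, names): model B is a pure memorizer, the labels are pure chance, and model A simply predicts the engineered feature after a fresh random split:</p>

```python
import random

random.seed(0)
n = 1000
labels = [random.randint(0, 1) for _ in range(n)]   # pure chance labels

# Model B memorizes its 800 training examples; elsewhere it guesses.
b_train = set(random.sample(range(n), 800))
def model_b(i):
    return labels[i] if i in b_train else random.randint(0, 1)

# "Feature engineering": replace the textual feature with model B's output
feature = [model_b(i) for i in range(n)]

# Model A draws a FRESH random split and just predicts the engineered feature
a_test = random.sample(range(n), 200)
accuracy = sum(feature[i] == labels[i] for i in a_test) / len(a_test)
assert accuracy > 0.7    # far above the 50% chance level, with zero signal
```

<p>Most of model A’s test examples fall inside model B’s training set, where the feature literally is the label, so the measured accuracy lands far above chance despite the data carrying no information.</p>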

<style>
figure {
  display: flex;        /* Use flexbox to center content */
  flex-direction: column; /* Stack image and caption vertically */
  align-items: center;  /* Center content horizontally */
  margin: 20px auto;    /* Add margin for spacing */
}

figcaption {
  text-align: center;   /* Center the caption text */
  font-size: 0.9em;     /* Optional: Adjust font size */
  color: #fff;          /* Optional: Change caption color */
  margin-top: 10px;     /* Add space between the image and caption */
  max-width: 100%;      /* Ensure long captions fit within the container */
}
</style>

<figure>
  <img src="/assets/leakage/feature_leakage.png" alt="Feature Leakage Diagram" />
  <figcaption>Figure 1: At first, we have a data set with completely random features (R). Model B is used for preprocessing features, and overfits on the training set heavily. Model B training accuracy is 100% and the test accuracy is random (50%). Then, Model A uses the engineered features, with a different train/test split, and achieves 100% test accuracy.</figcaption>
</figure>

<p>This is an example of a case where features are completely random, but we reached 100% test accuracy. We could have tweaked the example to reach any test performance we wanted.</p>

<p>I call this “Preprocessing Leakage”. If we had kept the train/test split identical across model B and model A, the problem would have been avoided.</p>

<p>Mitigation: One way to prevent this preprocessing leakage is to <em>avoid a random train/test split</em> and instead split deterministically, using a stable hash function over the examples. For example, split by the hash of the user id, account id, etc., so that all sub-models share the same train/test split.
This also means having full control over the train/test split, and avoiding third-party libraries that each split the data in their own way.</p>
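<p>A minimal sketch of such a deterministic split (the function name and ratio are illustrative). Note that Python’s built-in <code>hash()</code> is salted per process, so a cryptographic digest such as md5 is used to keep the assignment stable across runs, machines, and teams:</p>

```python
import hashlib

def stable_split(example_id: str, test_ratio: float = 0.2) -> str:
    """Deterministically assign an example to 'train' or 'test' by hashing a stable id.

    Any team that hashes the same id (user id, account id, ...) gets the same
    answer, so all sub-models share one split without coordinating a random seed.
    """
    # md5 gives a platform-independent, process-independent hash
    digest = hashlib.md5(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "test" if bucket < test_ratio * 10_000 else "train"
```

Over many examples, the fraction assigned to "test" converges to <code>test_ratio</code>, and the assignment of any given id never changes.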

<p>Another way, when possible, is to do the split before training any model. This is not always possible in an organization with a feature store, where many teams insert new features into the databases, and sometimes insert trained features. Having full control over the way every team trains its feature-models can be very difficult. In many cases there are features whose meaning you don’t even know, let alone how they were produced, by some trainable model you are not responsible for.</p>

<p>The train/test split is important, and we need to keep it under control.</p>

<h1 id="train-test-split-strategy-protocol">Train-Test Split Strategy Protocol</h1>

<p><strong>Seed-Stable</strong>: Sometimes, when the dataset is not too large, you don’t want a three-way train/dev/test split, in order to avoid losing data from the training set. However, when you split into only train+test, you risk overfitting the test_set.
So, after you have fixed the test_set and measured the metrics, you can take the all_set, split it into a different train/test split using a different random seed, train the model again, and evaluate on the new test_set. If the two evaluations, each using a different random seed, agree closely, we call the model <strong>seed-stable</strong>.
You can do more re-trainings, using different seed-splits, to increase your confidence in the seed-stability; at that point it is quite similar to k-fold cross-validation or Monte Carlo cross-validation. In any case, when several ML teams use a shared dataset to develop models for sub-tasks, or a pipeline of models in which one model’s output feeds another model, <strong>all published models</strong> must not use any examples from the shared global test_set in the <strong>final training</strong> run used to publish the model.</p>
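<p>The seed-stability check can be wrapped in a small helper (a sketch; <code>train_and_eval</code> and the tolerance are placeholders for your own training routine and metric):</p>

```python
def is_seed_stable(train_and_eval, seeds=(0, 1, 2), tolerance=0.02):
    """Re-split and re-train once per seed, then check the metric spread.

    train_and_eval(seed) is expected to: split all_set using the given seed,
    train the model from scratch, and return the test metric (e.g. accuracy).
    """
    scores = [train_and_eval(seed) for seed in seeds]
    spread = max(scores) - min(scores)
    return spread <= tolerance, scores
```

If the spread exceeds the tolerance, the single fixed test_set you were using is probably not a trustworthy estimate of generalization.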

<p><strong>Access Safety</strong>: On the one hand, we don’t want training loops to accidentally have access to a folder or a database containing both the train and test examples, to avoid a model mistakenly training on a test example. On the other hand, examples often do reside in one folder/database, and the all_set keeps growing as we collect more examples.
A solution is to hold a file called split.json that records the split; all the training procedures (across the various ML teams) have access to this file, which points to the examples/files in the databases/directories.</p>

<p><strong>Who generates the split.json file?</strong> It is generated by a program that receives a folder or a database of unsplit examples and creates the split_file. If the file does not exist, it randomly splits the examples in the folder and saves the file. If the file does exist, it runs: (a) a <strong>validation procedure</strong>, to make sure the file is valid: no train/test overlap, no missing pointers, no extra pointers, etc., and (b) an <strong>update procedure</strong>.
Every sub-algorithm, sub-model, or derived dataset must use the global split file to construct its train and test datasets, and must run the validation procedure.</p>
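<p>One possible shape for split.json, matching the <code>path</code>/<code>hash</code> entries that the DatasetSplitter validation code later in this post expects (the paths and hashes here are made up):</p>

```json
{
  "train": [
    {"path": "/data/images/img_0001.png", "hash": "a1b2c3"},
    {"path": "/data/images/img_0002.png", "hash": "d4e5f6"}
  ],
  "test": [
    {"path": "/data/images/img_0003.png", "hash": "0a1b2c"}
  ]
}
```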

<p>Each ml-team can check that the model is seed-stable on a different seed-split of the all_set, to make sure the model does not overfit the test_set, or alternatively use a different seed-split as develop_set, but their <strong>published</strong> model <strong>must</strong> use the global split as defined by the split_file.</p>

<p>Observation: in some companies, where different ML teams work on different models that use the same data and may interact with each other, the teams must agree on a shared split_file, or alternatively have an external coordinator that specifies this split_file for them. The interaction of models can take the form of a sequential chain, where the output of one model is the input of another, or of models that work in parallel, each on a sub-task. Example: an image taken by an autonomous car, where one model segments the objects that are cars, and another model classifies each car into its car model.</p>

<p><strong>How to make sure the ml-teams do not ‘cheat’?</strong> A team can cheat by overfitting to the test_set, and this is difficult for the organization to detect. The only way to fully prevent it is to completely hide the global test_set from the teams. However, this means the teams have less data to work with.</p>

<p><strong>Save split inside model</strong>: When we are given a published model file (architecture+weights), how can we tell which examples we are allowed to evaluate it on, in case it comes from a different ml-team?
When saving a model to disk, save the set of filenames/pointers of the examples used for training, or the set of hashes of the examples/features used for training, as part of the model state: model.split_context (register_buffer).
When loading a model from disk, make sure the test_set you plan to evaluate the model on and model.split_context <strong>do not</strong> overlap. This way, you can be more confident that your evaluation is solid and correct.
Optionally, if model.split_context is smaller than the current global split’s training set, you can print a warning that the model could be retrained with more training data.</p>
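<p>A framework-agnostic sketch of the idea (function names are made up; in PyTorch you would bundle the same information into the checkpoint dictionary passed to <code>torch.save</code>, or into a <code>register_buffer</code> tensor of hashes):</p>

```python
import pickle

def save_with_split_context(weights, train_example_hashes, path):
    # Persist the training-set membership alongside the weights, so any
    # downstream team can later verify that an evaluation is valid.
    with open(path, "wb") as f:
        pickle.dump({"weights": weights,
                     "split_context": sorted(train_example_hashes)}, f)

def load_and_verify(path, planned_test_hashes):
    # Refuse to hand over the weights if the planned test set overlaps
    # the examples this model was trained on.
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    overlap = set(ckpt["split_context"]) & set(planned_test_hashes)
    if overlap:
        raise ValueError(f"{len(overlap)} planned test examples were in the training set")
    return ckpt["weights"]
```

The overlap check runs at load time, so a model published by another team fails loudly rather than silently producing an inflated evaluation.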

<p><strong>What happens when the pre-split dataset (the all_set) grows?</strong> We want the <em>existing</em> models to remain <strong>valid</strong>. That means an old example must <strong>stick</strong> to its previous train/test affiliation. A train example cannot move to the test set, otherwise the model’s measured performance will be inflated. A test example could in principle become a training example, but I don’t see a necessity for that. So when we re-split, we cannot use a fresh random-seed split; we must <strong>preserve</strong> the previous affiliations and, for each new example, assign it to train or test with some predefined probability. As the dataset grows we have the flexibility to change this probability, for example if we want to increase the ratio of training-set size to test-set size.</p>
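<p>A sketch of such a preserving re-split (names are illustrative). Old examples keep their affiliation; only genuinely new examples are assigned, and the assignment is hash-based so repeated runs agree:</p>

```python
import hashlib

def update_split(existing_split, new_example_ids, test_prob=0.2):
    """Extend a train/test split without moving any existing example."""
    split = {"train": list(existing_split["train"]),
             "test": list(existing_split["test"])}
    assigned = set(split["train"]) | set(split["test"])
    for ex_id in new_example_ids:
        if ex_id in assigned:
            continue  # previous affiliation is preserved
        bucket = int(hashlib.md5(ex_id.encode("utf-8")).hexdigest(), 16) % 100
        side = "test" if bucket < test_prob * 100 else "train"
        split[side].append(ex_id)
    return split
```

Because the hash of an id never changes, running the update twice, or on two machines, produces the same extended split.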

<p><strong>Dataset Derivatives:</strong> Sometimes an ml-team needs to derive a dataset from another. For example: the dataset contains images from a car’s front camera. Model A does instance segmentation of cars with 2 classes (background/foreground), and model B uses the 512x512 bounding box to classify the cars into 100 classes. Team B would like to create a dataset of cars, extracted from the root dataset, which makes it a <strong>derivative</strong>. Now, model B stands by itself: it can be evaluated standalone, regardless of model A. However, it can also be evaluated and used in tandem, as part of the full pipeline of finding cars in the image (model A) and classifying them (model B). In both scenarios you want all the safety measures to be in place. This means the derivative’s split_file should include a pointer to the <strong>parent</strong> split_file, and model.split_context should also include the set of training images used in the parent split_file. This allows you to safely evaluate the model both on the parent dataset and on the derivative dataset.</p>
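<p>In split.json terms, a derivative’s split file could simply carry a pointer to its parent (the field name and paths are illustrative):</p>

```json
{
  "parent_split_file": "/data/root_dataset/split.json",
  "train": [{"path": "/data/cars_crops/car_0001.png", "hash": "1a2b3c"}],
  "test":  [{"path": "/data/cars_crops/car_0002.png", "hash": "4d5e6f"}]
}
```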

<p><strong>Model evaluation principles:</strong> Sometimes, when you use data augmentation for training, you may decide to use augmentations for the test_set as well; for example, if your test_set is not big enough and you want to enlarge it, or if you want to make sure the model’s generalization withstands the transformation. In such cases, where the evaluation process involves <strong>randomness</strong>, you need to <strong>set the RNG seed</strong> before the evaluation, so that the evaluation is consistent across different RNG states, whether on the same computer or across computers. However, this can drastically disrupt the training process: when you return from the evaluation procedure, the next training batch would start from the same RNG state every time. Therefore, before setting the evaluation seed, you must save the RNG state, and restore it at the end:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Save the current RNG states, before doing the evaluation, after a few training cycles
</span><span class="n">rng_state_torch</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">get_rng_state</span><span class="p">()</span>
<span class="n">rng_state_cuda</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_rng_state_all</span><span class="p">()</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="bp">None</span>
<span class="n">rng_state_random</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">getstate</span><span class="p">()</span>
<span class="n">rng_state_numpy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">get_state</span><span class="p">()</span>
<span class="n">seed_everything</span><span class="p">(</span><span class="n">dice_classifier_train_seed</span><span class="p">)</span>

<span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">eval_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="n">full_labels</span><span class="p">,</span> <span class="n">full_outputs</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
   <span class="k">for</span> <span class="n">images</span><span class="p">,</span> <span class="n">labels</span> <span class="ow">in</span> <span class="n">test_loader</span><span class="p">:</span>
      <span class="c1"># …
</span>
<span class="c1"># Restore the original RNG states, to allow the training randomness needed
</span><span class="n">torch</span><span class="p">.</span><span class="n">set_rng_state</span><span class="p">(</span><span class="n">rng_state_torch</span><span class="p">)</span>
<span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
   <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">set_rng_state_all</span><span class="p">(</span><span class="n">rng_state_cuda</span><span class="p">)</span>
<span class="n">random</span><span class="p">.</span><span class="n">setstate</span><span class="p">(</span><span class="n">rng_state_random</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_state</span><span class="p">(</span><span class="n">rng_state_numpy</span><span class="p">)</span>
</code></pre></div></div>

<p>If you want to be more sure about your evaluation results:</p>

<p>(1) As a sanity check, occasionally run the evaluation procedure twice and make sure you get exactly the same results. This helps validate that the RNG seeds were properly set and nothing was forgotten.</p>
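<p>This double-run check is easy to automate (a sketch; <code>evaluate</code> stands in for your own evaluation function):</p>

```python
def assert_deterministic(evaluate, *args, **kwargs):
    """Run the evaluation twice and fail loudly if the results differ,
    which would indicate an RNG seed that was not (re)set properly."""
    first = evaluate(*args, **kwargs)
    second = evaluate(*args, **kwargs)
    if first != second:
        raise RuntimeError(f"Non-deterministic evaluation: {first} != {second}")
    return first
```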

<p>(2) Make sure that the evaluation which happens every few steps during training is identical to the evaluation after freshly loading the model from disk. That is, the last evaluation loop during training, which runs just before the model is saved to disk, should produce results identical to evaluating the loaded model without any training at all.</p>

<p>To help maintain consistency in evaluation, make sure the data loader has <em>persistent_workers=False</em>, so that every evaluation cycle starts from the exact same RNG state, with no leftovers from the previous evaluation cycle:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">worker_init_fn</span><span class="p">(</span><span class="n">worker_id</span><span class="p">):</span>
	<span class="n">seed</span> <span class="o">=</span> <span class="mi">42</span> <span class="o">+</span> <span class="n">worker_id</span>
	<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
	<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
	<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
	
<span class="n">data_loader_test</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
	                              <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span><span class="p">,</span>
	                              <span class="n">persistent_workers</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>  <span class="c1"># persistent_workers must be False for consistent results!!
</span>								  <span class="n">collate_fn</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">tuple</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">x</span><span class="p">)),</span> <span class="n">worker_init_fn</span><span class="o">=</span><span class="n">worker_init_fn</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">Generator</span><span class="p">().</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">))</span>
								  
</code></pre></div></div>

<p>Rarely, the evaluation procedure contains a .train() part. This can happen when you use model libraries that return certain metrics and results (like the loss) only in .train() mode. If that is the case, and the model uses BatchNorm, you must save the BatchNorm state before the .train() part and restore it afterwards; otherwise the .train() part of your evaluation procedure will have side effects on the model’s training process:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">save_batchnorm_state</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
	<span class="n">bn_states</span> <span class="o">=</span> <span class="p">[]</span>
	<span class="k">for</span> <span class="n">module</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">modules</span><span class="p">():</span>
		<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm2d</span><span class="p">,</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm1d</span><span class="p">)):</span>
			<span class="n">bn_states</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
				<span class="s">"running_mean"</span><span class="p">:</span> <span class="n">module</span><span class="p">.</span><span class="n">running_mean</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span>
				<span class="s">"running_var"</span><span class="p">:</span> <span class="n">module</span><span class="p">.</span><span class="n">running_var</span><span class="p">.</span><span class="n">clone</span><span class="p">(),</span>
				<span class="s">"num_batches_tracked"</span><span class="p">:</span> <span class="n">module</span><span class="p">.</span><span class="n">num_batches_tracked</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="s">"num_batches_tracked"</span><span class="p">)</span> <span class="k">else</span> <span class="bp">None</span>
			<span class="p">})</span>
	<span class="k">return</span> <span class="n">bn_states</span>


<span class="c1"># Function to restore BatchNorm states
</span><span class="k">def</span> <span class="nf">restore_batchnorm_state</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">bn_states</span><span class="p">):</span>
	<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
	<span class="k">for</span> <span class="n">module</span> <span class="ow">in</span> <span class="n">model</span><span class="p">.</span><span class="n">modules</span><span class="p">():</span>
		<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm2d</span><span class="p">,</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm1d</span><span class="p">)):</span>
			<span class="n">module</span><span class="p">.</span><span class="n">running_mean</span><span class="p">.</span><span class="n">copy_</span><span class="p">(</span><span class="n">bn_states</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"running_mean"</span><span class="p">])</span>
			<span class="n">module</span><span class="p">.</span><span class="n">running_var</span><span class="p">.</span><span class="n">copy_</span><span class="p">(</span><span class="n">bn_states</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"running_var"</span><span class="p">])</span>
			<span class="k">if</span> <span class="n">bn_states</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"num_batches_tracked"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
				<span class="n">module</span><span class="p">.</span><span class="n">num_batches_tracked</span><span class="p">.</span><span class="n">copy_</span><span class="p">(</span><span class="n">bn_states</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"num_batches_tracked"</span><span class="p">])</span>
			<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
			

</code></pre></div></div>

<h1 id="the-full-code">The full code</h1>

<p>Here is a class I created, DatasetSplitter, that implements the concepts specified above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span><span class="p">,</span> <span class="n">torch</span><span class="p">,</span> <span class="n">json</span><span class="p">,</span> <span class="n">hashlib</span><span class="p">,</span> <span class="n">random</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="k">class</span> <span class="nc">DatasetSplitter</span><span class="p">:</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">_calculate_file_hash</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
		<span class="s">"""Calculates a 6-character alphanumeric hash for a given file."""</span>
		<span class="n">hasher</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">md5</span><span class="p">()</span>
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
			<span class="n">buf</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
			<span class="n">hasher</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
		<span class="k">return</span> <span class="n">hasher</span><span class="p">.</span><span class="n">hexdigest</span><span class="p">()[:</span><span class="mi">6</span><span class="p">]</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">validation</span><span class="p">(</span><span class="n">all_files</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">):</span>
		<span class="s">"""
		Validate that the split file at split_file_path does not contain the same entry in both train and test.
		Validate that all_files matches the files in the split file.
		"""</span>
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">split_data_loaded</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
		
		<span class="n">train_set</span><span class="p">,</span> <span class="n">test_set</span> <span class="o">=</span> <span class="n">split_data_loaded</span><span class="p">[</span><span class="s">'train'</span><span class="p">],</span> <span class="n">split_data_loaded</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span>
		<span class="n">all_files_in_split</span> <span class="o">=</span> <span class="p">{</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span> <span class="o">+</span> <span class="n">test_set</span><span class="p">}</span>
		
		<span class="k">def</span> <span class="nf">find_duplicates</span><span class="p">(</span><span class="n">input_list</span><span class="p">,</span> <span class="n">error_message</span><span class="p">):</span>
			<span class="n">seen</span><span class="p">,</span> <span class="n">duplicates</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(),</span> <span class="nb">set</span><span class="p">()</span>
			<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">input_list</span><span class="p">:</span>
				<span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">seen</span><span class="p">:</span> <span class="n">duplicates</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>  <span class="c1"># Add to duplicates if already seen
</span>				<span class="k">else</span><span class="p">:</span> <span class="n">seen</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>  <span class="c1"># Mark as seen
</span>			<span class="k">if</span> <span class="n">duplicates</span><span class="p">:</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">error_message</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">duplicates</span><span class="p">))</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
				
		<span class="c1"># validations
</span>		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">],</span> <span class="s">"Train set contains identical path"</span><span class="p">)</span>
		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">],</span> <span class="s">"Train set contains identical hash"</span><span class="p">)</span>
		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">],</span> <span class="s">"Test set contains identical path"</span><span class="p">)</span>
		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">],</span> <span class="s">"Test set contains identical hash"</span><span class="p">)</span>
		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">],</span> <span class="s">"Train and test sets overlap in filename path"</span><span class="p">)</span>
		<span class="n">find_duplicates</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">],</span> <span class="s">"Train and test sets overlap in hash"</span><span class="p">)</span>
		
		
		<span class="c1"># Ensure all files exist and have matching hashes
</span>		<span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span> <span class="o">+</span> <span class="n">test_set</span><span class="p">:</span>
			<span class="n">file_path</span><span class="p">,</span> <span class="n">file_hash</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">],</span> <span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span>
			<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"File </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> listed in split file does not exist!"</span><span class="p">)</span>
			<span class="n">actual_hash</span> <span class="o">=</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
			<span class="k">if</span> <span class="n">actual_hash</span> <span class="o">!=</span> <span class="n">file_hash</span><span class="p">:</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Hash mismatch for file </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s">: expected </span><span class="si">{</span><span class="n">file_hash</span><span class="si">}</span><span class="s">, got </span><span class="si">{</span><span class="n">actual_hash</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
		
		<span class="c1"># Ensure no extra files in dataset folder
</span>		<span class="n">actual_files_in_folder</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">extra_files</span> <span class="p">:</span><span class="o">=</span> <span class="n">actual_files_in_folder</span> <span class="o">-</span> <span class="n">all_files_in_split</span> <span class="o">-</span> <span class="nb">set</span><span class="p">([</span><span class="n">split_file_path</span><span class="p">]):</span>
			<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"DataSplitter.validation Warning: The following files are in the folder but not in the split file: </span><span class="si">{</span><span class="n">extra_files</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">folder_to_list_of_files</span><span class="p">(</span><span class="n">dataset_folder</span><span class="p">):</span>
		<span class="k">return</span> <span class="p">[</span>
			<span class="nb">str</span><span class="p">(</span><span class="nb">file</span><span class="p">.</span><span class="n">resolve</span><span class="p">())</span>
			<span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="n">dataset_folder</span><span class="p">).</span><span class="n">glob</span><span class="p">(</span><span class="s">'**/*'</span><span class="p">)</span>
			<span class="k">if</span> <span class="nb">file</span><span class="p">.</span><span class="n">is_file</span><span class="p">()</span> <span class="ow">and</span> <span class="ow">not</span> <span class="nb">file</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.json'</span><span class="p">)</span>
		<span class="p">]</span>
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="n">list_of_files</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">,</span> <span class="n">train_size</span><span class="p">):</span>
		
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">split_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
		
		<span class="n">train_set</span> <span class="o">=</span> <span class="n">split_data</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span>
		<span class="n">test_set</span> <span class="o">=</span> <span class="n">split_data</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span>
		
		<span class="n">set_train_hash</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>
		<span class="n">set_train_path</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>
		<span class="n">set_test_hash</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'hash'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">])</span>
		<span class="n">set_test_path</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">test_set</span><span class="p">])</span>
		
		<span class="n">all_files_in_split</span> <span class="o">=</span> <span class="p">{</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">train_set</span> <span class="o">+</span> <span class="n">test_set</span><span class="p">}</span>
		
		<span class="c1"># Add new files to train or test sets
</span>		<span class="n">actual_files_in_folder</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">list_of_files</span><span class="p">)</span>
		<span class="n">new_files</span> <span class="o">=</span> <span class="n">actual_files_in_folder</span> <span class="o">-</span> <span class="n">all_files_in_split</span> <span class="o">-</span> <span class="nb">set</span><span class="p">([</span><span class="n">split_file_path</span><span class="p">])</span>
		
		<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">new_files</span><span class="p">)</span><span class="si">}</span><span class="s"> files in the folder which are not in the train set or the test set."</span><span class="p">)</span>
		
		<span class="k">for</span> <span class="n">file_path</span> <span class="ow">in</span> <span class="n">new_files</span><span class="p">:</span>
			<span class="n">file_hash</span> <span class="o">=</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
			
			<span class="c1"># first, make sure the path and hash of the file are not in train_set or test_set
</span>			<span class="k">if</span> <span class="n">file_path</span> <span class="ow">in</span> <span class="n">set_train_path</span><span class="p">:</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING, new file </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> is already in train_set, skipping."</span><span class="p">)</span>
				<span class="k">continue</span>
			<span class="k">if</span> <span class="n">file_path</span> <span class="ow">in</span> <span class="n">set_test_path</span><span class="p">:</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING, new file </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> is already in test_set, skipping."</span><span class="p">)</span>
				<span class="k">continue</span>
			<span class="k">if</span> <span class="n">file_hash</span> <span class="ow">in</span> <span class="n">set_train_hash</span><span class="p">:</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING, hash </span><span class="si">{</span><span class="n">file_hash</span><span class="si">}</span><span class="s"> of new file </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> is already in train_set, skipping."</span><span class="p">)</span>
				<span class="k">continue</span>
			<span class="k">if</span> <span class="n">file_hash</span> <span class="ow">in</span> <span class="n">set_test_hash</span><span class="p">:</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WARNING, hash </span><span class="si">{</span><span class="n">file_hash</span><span class="si">}</span><span class="s"> of new file </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> is already in test_set, skipping."</span><span class="p">)</span>
				<span class="k">continue</span>
			
			<span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">&lt;</span> <span class="n">train_size</span><span class="p">:</span>
				<span class="n">train_set</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"path"</span><span class="p">:</span> <span class="n">file_path</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">file_hash</span><span class="p">})</span>
				<span class="n">set_train_path</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
				<span class="n">set_train_hash</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">file_hash</span><span class="p">)</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Added </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> to train set."</span><span class="p">)</span>
			<span class="k">else</span><span class="p">:</span>
				<span class="n">test_set</span><span class="p">.</span><span class="n">append</span><span class="p">({</span><span class="s">"path"</span><span class="p">:</span> <span class="n">file_path</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">file_hash</span><span class="p">})</span>
				<span class="n">set_test_path</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
				<span class="n">set_test_hash</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">file_hash</span><span class="p">)</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Added </span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s"> to test set."</span><span class="p">)</span>
		
		<span class="n">split_data</span><span class="p">[</span><span class="s">'train'</span><span class="p">]</span> <span class="o">=</span> <span class="n">train_set</span>
		<span class="n">split_data</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span> <span class="o">=</span> <span class="n">test_set</span>
		
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">split_data</span><span class="p">,</span> <span class="nb">file</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">create_or_update_root_split_file</span><span class="p">(</span><span class="n">all_files</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">,</span> <span class="n">train_size</span><span class="p">):</span>
		<span class="s">"""
		If split_file_path does not exist, this scans all the files in the dataset folder, shuffles the list, and splits them into
		train and test sets according to the train_size portion. Then it creates the split_file_path JSON file and saves the two sets
		in it: the train set and the test set. All the file names in the split file use absolute paths.
		
		If the split_file_path already exists, it starts by running a separate validation() function: read the file and check that the
		train set and the test set do not overlap; if they do, it raises an exception. It also makes sure all the files in the train and
		test sets are in the dataset folder; if they are not, it raises an exception. It also checks that there are no additional files
		in the folder that are not in the split file; if there are, it outputs a warning that the split file can be updated.
		Then it runs the update() function: any NEW file in the folder that is not already in the split file (in either train or test)
		is randomly assigned to the train set with probability train_size, or to the test set with probability 1 - train_size, and a
		message stating the new file's affiliation is printed. Finally, it saves the updated split file.
		
		In general, every entry in the split file includes, besides the full path of the file, a 6-character alphanumeric hash of the
		file content, so that if the filename is changed in the future, its signature will be preserved. The validation() function also
		makes sure each hash in the split file matches the true hash of the corresponding file in the folder.
		
		The split file also includes an additional field, "parent", which is the path to the parent split file if one exists, and null otherwise.
		"""</span>
		<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">):</span>
			<span class="c1"># Validate and update existing split file
</span>			<span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">validation</span><span class="p">(</span><span class="n">all_files</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">)</span>
			<span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">all_files</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">,</span> <span class="n">train_size</span><span class="p">)</span>
		<span class="k">else</span><span class="p">:</span>
			
			<span class="n">random</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span>
			<span class="n">split_point</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span> <span class="o">*</span> <span class="n">train_size</span><span class="p">)</span>
			<span class="n">train_files</span> <span class="o">=</span> <span class="n">all_files</span><span class="p">[:</span><span class="n">split_point</span><span class="p">]</span>
			<span class="n">test_files</span> <span class="o">=</span> <span class="n">all_files</span><span class="p">[</span><span class="n">split_point</span><span class="p">:]</span>
			
			<span class="n">split_data</span> <span class="o">=</span> <span class="p">{</span>
				<span class="s">"parent"</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span>
				<span class="s">"train"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"path"</span><span class="p">:</span> <span class="nb">file</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="nb">file</span><span class="p">)}</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">train_files</span><span class="p">],</span>
				<span class="s">"test"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"path"</span><span class="p">:</span> <span class="nb">file</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="nb">file</span><span class="p">)}</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">test_files</span><span class="p">],</span>
			<span class="p">}</span>
			
			<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>
				<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">split_data</span><span class="p">,</span> <span class="nb">file</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
				<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Created new split file at </span><span class="si">{</span><span class="n">split_file_path</span><span class="si">}</span><span class="s">."</span><span class="p">)</span>
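	
	<span class="c1"># Usage sketch (the paths 'data/' and 'data/split.json' below are hypothetical examples;</span>
	<span class="c1"># 80% of the files go to the train set):</span>
	<span class="c1">#   all_files = DatasetSplitter.folder_to_list_of_files('data')</span>
	<span class="c1">#   DatasetSplitter.create_or_update_root_split_file(all_files, 'data/split.json', train_size=0.8)</span>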
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">create_split_file_from_splitted_lists</span><span class="p">(</span><span class="n">list_files_train</span><span class="p">,</span> <span class="n">list_files_test</span><span class="p">,</span> <span class="n">split_file_path</span><span class="p">,</span> <span class="n">parent_split_file</span><span class="p">):</span>
		<span class="s">"""
		If the file split_file_path does not exist, this creates it from the two lists, after making sure the lists do not overlap.
		If parent_split_file is not None, it runs the validation() function on the parent and on all of its ancestors.
		"""</span>
		
		<span class="c1"># Ensure no overlap between train and test sets
</span>		<span class="n">train_set</span><span class="p">,</span> <span class="n">test_set</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">list_files_train</span><span class="p">),</span> <span class="nb">set</span><span class="p">(</span><span class="n">list_files_test</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">train_set</span> <span class="o">&amp;</span> <span class="n">test_set</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Train and test sets overlap!"</span><span class="p">)</span>
		
		<span class="c1"># Validate the parent split file if provided
</span>		<span class="k">if</span> <span class="n">parent_split_file</span><span class="p">:</span>
			<span class="n">current_parent</span> <span class="o">=</span> <span class="n">parent_split_file</span>
			<span class="k">while</span> <span class="n">current_parent</span><span class="p">:</span>
				<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">current_parent</span><span class="p">):</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Parent split file not found: </span><span class="si">{</span><span class="n">current_parent</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
				
				<span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">validation</span><span class="p">(</span><span class="n">all_files</span><span class="o">=</span><span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">folder_to_list_of_files</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">current_parent</span><span class="p">)),</span>
				                           <span class="n">split_file_path</span><span class="o">=</span><span class="n">current_parent</span><span class="p">)</span>
				
				<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">current_parent</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">parent_file</span><span class="p">:</span> <span class="n">parent_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">parent_file</span><span class="p">)</span>
				<span class="n">current_parent</span> <span class="o">=</span> <span class="n">parent_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"parent"</span><span class="p">)</span>
				
		<span class="n">split_data</span> <span class="o">=</span> <span class="p">{</span>  <span class="c1"># Create the split file content
</span>			<span class="s">"parent"</span><span class="p">:</span> <span class="n">parent_split_file</span><span class="p">,</span>
			<span class="s">"train"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"path"</span><span class="p">:</span> <span class="nb">file</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="nb">file</span><span class="p">)}</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">list_files_train</span><span class="p">],</span>
			<span class="s">"test"</span><span class="p">:</span> <span class="p">[{</span><span class="s">"path"</span><span class="p">:</span> <span class="nb">file</span><span class="p">,</span> <span class="s">"hash"</span><span class="p">:</span> <span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">_calculate_file_hash</span><span class="p">(</span><span class="nb">file</span><span class="p">)}</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">list_files_test</span><span class="p">],</span>
		<span class="p">}</span>
		<span class="c1"># Save the split file
</span>		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_file_path</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">split_data</span><span class="p">,</span> <span class="nb">file</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
		<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split file created at </span><span class="si">{</span><span class="n">split_file_path</span><span class="si">}</span><span class="s">."</span><span class="p">)</span>
		
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">add_split_context_to_model_before_save</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
		<span class="s">"""
		This adds a field/buffer (not a Parameter) to a model, called split_context, which is a list. The first element of the list
		is the content of the JSON object in the split file. If the split file has a parent, the parent's content is included as the
		second element, and so on. This field allows users of the model to make sure they do not evaluate the model on an example
		that is included in the training set of the split file or of any of its ancestors.
		
		This is an example of how you should use it:
		
		DatasetSplitter.add_split_context_to_model_before_save(split_filepath, model)
		torch.save(model.state_dict(), model_save_path)
		"""</span>
		<span class="n">split_context</span> <span class="o">=</span> <span class="p">[]</span>
		
		<span class="c1"># Traverse the split file hierarchy
</span>		<span class="n">current_split_filepath</span> <span class="o">=</span> <span class="n">split_filepath</span>
		<span class="k">while</span> <span class="n">current_split_filepath</span><span class="p">:</span>
			<span class="c1"># Load the current split file
</span>			<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">current_split_filepath</span><span class="p">):</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split file not found: </span><span class="si">{</span><span class="n">current_split_filepath</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
			<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">current_split_filepath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">split_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
			<span class="n">split_context</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">split_data</span><span class="p">)</span>
			
			<span class="c1"># Move to the parent split file, if it exists
</span>			<span class="n">current_split_filepath</span> <span class="o">=</span> <span class="n">split_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"parent"</span><span class="p">)</span>
		
		<span class="c1"># Serialize split_context as JSON and register as a tensor buffer
</span>		<span class="n">serialized_context</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">split_context</span><span class="p">)</span>
		<span class="n">context_tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">serialized_context</span><span class="p">.</span><span class="n">encode</span><span class="p">()),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
		<span class="n">model</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">"split_context"</span><span class="p">,</span> <span class="n">context_tensor</span><span class="p">)</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">get_list_of_train_files_and_test_files</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="n">compare_to_this_total_list</span><span class="p">):</span>
		<span class="s">"""
		Load the split file, and return the list of train files and test files.
		If compare_to_this_total_list is not None, it validates that the union of train_files and test_files equals compare_to_this_total_list.
		"""</span>
		<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">):</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split file not found: </span><span class="si">{</span><span class="n">split_filepath</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">split_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
		<span class="n">train_files</span> <span class="o">=</span> <span class="p">[</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">split_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'train'</span><span class="p">,</span> <span class="p">[])]</span>
		<span class="n">test_files</span> <span class="o">=</span> <span class="p">[</span><span class="n">entry</span><span class="p">[</span><span class="s">'path'</span><span class="p">]</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">split_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'test'</span><span class="p">,</span> <span class="p">[])]</span>
		
		<span class="k">if</span> <span class="n">compare_to_this_total_list</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
			<span class="n">all_files</span> <span class="o">=</span> <span class="n">train_files</span> <span class="o">+</span> <span class="n">test_files</span>
			<span class="c1"># Compare the two sets
</span>			<span class="k">if</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">set</span><span class="p">(</span><span class="n">compare_to_this_total_list</span><span class="p">):</span>
				<span class="n">missing_from_split</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">compare_to_this_total_list</span><span class="p">)</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span>  <span class="c1"># Files in folder but not in split file
</span>				<span class="n">extra_in_split</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">compare_to_this_total_list</span><span class="p">)</span>  <span class="c1"># Files in split file but not in folder
</span>				<span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span>
					<span class="sa">f</span><span class="s">"Split file [</span><span class="si">{</span><span class="n">split_filepath</span><span class="si">}</span><span class="s">] can be updated. Differences:</span><span class="se">\n</span><span class="s">"</span>
					<span class="sa">f</span><span class="s">"Missing from split file: </span><span class="si">{</span><span class="n">missing_from_split</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span>
					<span class="sa">f</span><span class="s">"Extra in split file: </span><span class="si">{</span><span class="n">extra_in_split</span><span class="si">}</span><span class="s">"</span>
				<span class="p">)</span>
		
		<span class="k">return</span> <span class="n">train_files</span><span class="p">,</span> <span class="n">test_files</span>
	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">validate_model_after_load</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="n">loaded_model</span><span class="p">):</span>
		<span class="s">"""
		Ensure that the test set described in split_filepath, or any of its ancestors (their union), does not overlap
		with the loaded_model.split_context content, with respect to both the filenames and the hashes.
		This is an example of how you should load a model:
		
		loaded_state_dict = torch.load(model_path, map_location=device, weights_only=True)
		model.register_buffer("split_context", loaded_state_dict["split_context"])
		model.load_state_dict(loaded_state_dict)
		DatasetSplitter.validate_model_after_load(split_filepath, model)
		
		"""</span>
		<span class="c1"># Deserialize split_context from the model
</span>		<span class="k">if</span> <span class="ow">not</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">loaded_model</span><span class="p">,</span> <span class="s">"split_context"</span><span class="p">):</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Loaded model does not have a `split_context` attribute."</span><span class="p">)</span>
		
		<span class="c1"># Deserialize the split_context tensor into a Python object
</span>		<span class="n">serialized_context</span> <span class="o">=</span> <span class="nb">bytes</span><span class="p">(</span><span class="n">loaded_model</span><span class="p">.</span><span class="n">split_context</span><span class="p">.</span><span class="n">tolist</span><span class="p">()).</span><span class="n">decode</span><span class="p">()</span>  <span class="c1"># Convert tensor to bytes, then decode
</span>		<span class="n">split_context</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">serialized_context</span><span class="p">)</span>  <span class="c1"># Deserialize JSON string back to a list of dictionaries
</span>		<span class="k">del</span> <span class="n">loaded_model</span>
		
		<span class="c1"># Load the split file and its ancestors into a unified test set
</span>		<span class="n">current_split_filepath</span> <span class="o">=</span> <span class="n">split_filepath</span>
		<span class="n">all_test_files</span><span class="p">,</span> <span class="n">all_test_hashes</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(),</span> <span class="nb">set</span><span class="p">()</span>
		
		<span class="k">while</span> <span class="n">current_split_filepath</span><span class="p">:</span>
			<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">current_split_filepath</span><span class="p">):</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split file not found: </span><span class="si">{</span><span class="n">current_split_filepath</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
			<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">current_split_filepath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span> <span class="n">split_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
			<span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">split_data</span><span class="p">[</span><span class="s">'test'</span><span class="p">]:</span>
				<span class="n">all_test_files</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="s">"path"</span><span class="p">]).</span><span class="n">name</span><span class="p">)</span>
				<span class="n">all_test_hashes</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="s">"hash"</span><span class="p">])</span>
			<span class="n">current_split_filepath</span> <span class="o">=</span> <span class="n">split_data</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"parent"</span><span class="p">)</span>
		
		<span class="k">for</span> <span class="n">context</span> <span class="ow">in</span> <span class="n">split_context</span><span class="p">:</span>  <span class="c1"># Check for overlap between the test set and the model's split_context
</span>			<span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">context</span><span class="p">[</span><span class="s">"train"</span><span class="p">]:</span>  <span class="c1"># Validate against the training set in the context
</span>				<span class="n">filename</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="s">"path"</span><span class="p">]).</span><span class="n">name</span>
				<span class="n">file_hash</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="s">"hash"</span><span class="p">]</span>
				<span class="k">if</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">all_test_files</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Filename </span><span class="si">{</span><span class="n">filename</span><span class="si">}</span><span class="s"> in test set overlaps with training set in model context."</span><span class="p">)</span>
				<span class="k">if</span> <span class="n">file_hash</span> <span class="ow">in</span> <span class="n">all_test_hashes</span><span class="p">:</span>
					<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"File hash </span><span class="si">{</span><span class="n">file_hash</span><span class="si">}</span><span class="s"> in test set overlaps with training set in model context."</span><span class="p">)</span>
		
		<span class="c1"># print("Validation passed: No overlap between test set and model's training split_context.")
</span>	
	<span class="o">@</span><span class="nb">staticmethod</span>
	<span class="k">def</span> <span class="nf">helper_split_annotation_file_according_to_splitfile</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="n">loaded_annotation_json_file</span><span class="p">):</span>
		<span class="s">"""
		loaded_annotation_json_file contains a list of annotation objects. Each annotation object has a field, as in this example:
		"data": {
	      "image": "\/data\/upload\/3\/ee88667f-13.jpg"
	    }
		Use only the filename from the json file and ignore its path. If the filename is not listed in the split file, raise an
		exception. Otherwise, assign it to the train set or the test set.
		The method returns two lists: the annotations filtered to training files, and to test files (each list keeps the
		same structure as the input, just with filtered elements at the top level).
		"""</span>
		<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">split_filepath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>  <span class="n">split_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>		<span class="c1"># Load the split file
</span>		
		<span class="n">train_files</span> <span class="o">=</span> <span class="p">{</span><span class="n">Path</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="s">"path"</span><span class="p">]).</span><span class="n">name</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">split_data</span><span class="p">[</span><span class="s">'train'</span><span class="p">]}</span>
		<span class="n">test_files</span> <span class="o">=</span> <span class="p">{</span><span class="n">Path</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="s">"path"</span><span class="p">]).</span><span class="n">name</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">split_data</span><span class="p">[</span><span class="s">'test'</span><span class="p">]}</span>
		<span class="n">all_split_files</span> <span class="o">=</span> <span class="n">train_files</span> <span class="o">|</span> <span class="n">test_files</span>

		<span class="n">train_annotations</span><span class="p">,</span> <span class="n">test_annotations</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
		
		<span class="k">for</span> <span class="n">annotation</span> <span class="ow">in</span> <span class="n">loaded_annotation_json_file</span><span class="p">:</span>
			<span class="c1"># Extract the filename from the annotation
</span>			<span class="n">annotated_image_path</span> <span class="o">=</span> <span class="n">annotation</span><span class="p">[</span><span class="s">"data"</span><span class="p">][</span><span class="s">"image"</span><span class="p">]</span>
			<span class="n">annotated_image_filename</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">annotated_image_path</span><span class="p">).</span><span class="n">name</span>
			
			<span class="c1"># Check if the file is in the split file
</span>			<span class="k">if</span> <span class="n">annotated_image_filename</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">all_split_files</span><span class="p">:</span>
				<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"File </span><span class="si">{</span><span class="n">annotated_image_filename</span><span class="si">}</span><span class="s"> in annotations is not listed in the split file."</span><span class="p">)</span>
			
			<span class="c1"># Assign to train or test set
</span>			<span class="k">if</span> <span class="n">annotated_image_filename</span> <span class="ow">in</span> <span class="n">train_files</span><span class="p">:</span>
				<span class="n">train_annotations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">annotation</span><span class="p">)</span>
			<span class="k">elif</span> <span class="n">annotated_image_filename</span> <span class="ow">in</span> <span class="n">test_files</span><span class="p">:</span>
				<span class="n">test_annotations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">annotation</span><span class="p">)</span>
			<span class="k">else</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="s">"This should never happen."</span><span class="p">)</span>
		
		<span class="k">return</span> <span class="n">train_annotations</span><span class="p">,</span> <span class="n">test_annotations</span>
	
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
	<span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">create_or_update_root_split_file</span><span class="p">(</span>
		<span class="n">all_files</span><span class="o">=</span><span class="n">DatasetSplitter</span><span class="p">.</span><span class="n">folder_to_list_of_files</span><span class="p">(</span><span class="s">'/image_storage/'</span><span class="p">),</span>
		<span class="n">split_file_path</span><span class="o">=</span><span class="s">'split.json'</span><span class="p">,</span>
		<span class="n">train_size</span><span class="o">=</span><span class="mf">0.75</span>
	<span class="p">)</span>
	
</code></pre></div></div>]]></content><author><name>Nadav Benedek</name></author><category term="data" /><category term="leakage," /><category term="data" /><category term="preparation," /><category term="Preprocessing" /><category term="Leakage," /><category term="feature" /><category term="leakage," /><category term="feature" /><category term="engineering" /><summary type="html"><![CDATA[Are we allowed to transform the input data in any way we want? Can we train sub-models to preprocess features? Can we use a pipeline of models? Can we use the output of one model as an input of another model?]]></summary></entry><entry><title type="html">Shares Efficient Frontier</title><link href="https://nadavb.com/Shares-Efficient-Frontier/" rel="alternate" type="text/html" title="Shares Efficient Frontier" /><published>2023-08-05T06:20:00+03:00</published><updated>2023-08-05T06:20:00+03:00</updated><id>https://nadavb.com/Shares%20Efficient%20Frontier</id><content type="html" xml:base="https://nadavb.com/Shares-Efficient-Frontier/"><![CDATA[<p>When we invest in a financial instrument (share, index, etc), we usually mostly care about its average annual yield and its variance, which is also called risk or volatility. Naturally, no one can predict the future, so the only thing we can do is to look at the past and assume that the past will reflect the future. So, if we analyze the past history of an instrument, for let’s say 10 or 20 years, we can measure the average annual yield and the variance, or more specifically the standard deviation (which is the square root of the variance, just so it will have the same units as the yield).</p>
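<p>As a quick illustration (not from the original post), here is a minimal sketch of measuring the mean annual yield and its standard deviation from a hypothetical series of year-end closing prices; the prices are made-up numbers:</p>

```python
# Hypothetical year-end closing prices (made-up numbers)
prices = [100.0, 112.0, 98.0, 120.0, 135.0]

# Annual yields: relative change from one year-end to the next
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

# Sample mean and sample standard deviation of the annual yields
mean = sum(returns) / len(returns)
stdev = (sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)) ** 0.5
print(round(mean, 3), round(stdev, 3))  # 0.086 0.149
```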

<p>Now, imagine that we plot a chart with two axes: the stdev (risk) axis and the yield axis. It will look like this:</p>

<p><img src="/assets/efficient_frontier/nasdaq-yield-vs-stdev.png" alt="" /></p>

<p>So we can easily observe, for example, that the NASDAQ has a higher yield than the S&amp;P 500, but also higher variance. If an instrument has a better yield and lower variance than a second instrument, we would prefer to invest in the first one. Such an instrument is called Pareto-better.</p>

<p>Okay. So far, so good. But what happens when we invest in a mixture of instruments? Where will the combined coordinate lie? The yield of the mixture is simply the weighted average of the individual yields. But the variance is something else: the stdev can sometimes be even lower than that of <em>each</em> of the instruments. This can happen when the correlation between the instruments is not perfect. In math, when you have two random variables and you average them together, the resulting stdev is:</p>

\[\frac{1}{2} \sqrt{stdev(X)^2+stdev(Y)^2+2Cov(X,Y)}\]

<p>So if $stdev(X)=10=stdev(Y)$ and the variables are <em>independent</em>, meaning their covariance is zero, you get that the combined stdev is $\frac{1}{2}\sqrt{200} \approx 7.07$, which is lower than 10. This is the mathematical grounding for diversifying the risk: you take two risky instruments, invest in both of them, and reduce the risk.</p>
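<p>This arithmetic can be checked directly. The following sketch (using the hypothetical stdev of 10 and zero covariance from the example above) verifies the formula both analytically and against simulated returns:</p>

```python
import numpy as np

# Equal-weight average of two instruments X and Y:
# stdev((X + Y) / 2) = 0.5 * sqrt(Var(X) + Var(Y) + 2 * Cov(X, Y))
std_x, std_y, cov_xy = 10.0, 10.0, 0.0  # hypothetical values from the text
combined_std = 0.5 * np.sqrt(std_x**2 + std_y**2 + 2 * cov_xy)
print(round(combined_std, 2))  # 7.07 -- lower than either instrument's stdev of 10

# Cross-check with simulated independent returns
rng = np.random.default_rng(42)
x = rng.normal(0.0, std_x, 500_000)
y = rng.normal(0.0, std_y, 500_000)
print(round(((x + y) / 2).std(), 1))  # close to 7.1
```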

<p>Here you can see the correlation matrix between some instruments:</p>

<p><img src="/assets/efficient_frontier/nasdaq-correlation-matrix.png" alt="" /></p>

<p>So using this notion, you can invest in a mixture (portfolio) of instruments that will give you pareto-better results, rather than investing in some of the instruments alone, as can be seen here:</p>

<p><img src="/assets/efficient_frontier/nasdaq-efficient-frontier.png" alt="" /></p>

<p>So, you can observe many points that are Pareto-better than the Dow Jones or the S&amp;P 500.</p>
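<p>The mixing step itself can be sketched as follows. This is not the post's own frontier code (which appears below); the mean yields and covariance matrix here are made-up numbers for three hypothetical instruments. Each random weight vector gives one point on the risk/yield chart: the portfolio yield is the weighted average of the means, and the portfolio variance is the quadratic form of the weights with the covariance matrix:</p>

```python
import numpy as np

# Made-up annual mean yields and covariance matrix for three hypothetical instruments
mu = np.array([0.10, 0.07, 0.05])
cov = np.array([[0.040, 0.010, 0.002],
                [0.010, 0.020, 0.001],
                [0.002, 0.001, 0.010]])

rng = np.random.default_rng(0)
points = []
for _ in range(10_000):
    w = rng.dirichlet(np.ones(3))      # random non-negative weights summing to 1
    port_mean = w @ mu                 # portfolio yield: weighted average of means
    port_std = np.sqrt(w @ cov @ w)    # portfolio stdev: sqrt(w^T Sigma w)
    points.append((port_std, port_mean, w))

# The lowest-risk mixture found has a smaller stdev than the safest
# single instrument (sqrt(0.010) = 0.1) -- the diversification effect.
port_std, port_mean, w = min(points, key=lambda t: t[0])
print(round(port_std, 3), round(port_mean, 3))
```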

\[\square\]

<p>Here’s the code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">mplcursors</span>  

<span class="n">val_col_name</span> <span class="o">=</span> <span class="s">'Close'</span>
<span class="n">base</span> <span class="o">=</span> <span class="mi">4</span>  

<span class="n">list_shares</span> <span class="o">=</span> <span class="p">[</span>
				<span class="p">(</span><span class="s">'https://gist.githubusercontent.com/ndvbd/30d8069937f945e492bd440a003296c7/raw/a119c81f4fb3d13d4f5b7b03c6cf0f4d6c778cdf/SP500.csv'</span><span class="p">,</span> <span class="s">'SP500'</span><span class="p">),</span>
				<span class="p">(</span><span class="s">'https://gist.githubusercontent.com/ndvbd/2a4516b0f18129287b9de4708f5ce2bf/raw/c69988fc5993699b1bce21c18b4dd1623cb7cb6d/NASDAQ.csv'</span><span class="p">,</span> <span class="s">'NASDAQ'</span><span class="p">),</span>
				<span class="p">(</span><span class="s">'https://gist.githubusercontent.com/ndvbd/01cb8aa365e212041037ca44e1068dba/raw/2dc76c20a3a354a641bfa5e0322adc3bc5dfff77/DOW.csv'</span><span class="p">,</span> <span class="s">'DOW'</span><span class="p">),</span>
					<span class="p">(</span><span class="s">'https://gist.githubusercontent.com/ndvbd/039f3a31ce29c71cbc8433c9c4d0380e/raw/805b94bde56597caf464c885e681f850d90d6243/XLP.csv'</span><span class="p">,</span> <span class="s">'XLP'</span><span class="p">),</span>

<span class="p">]</span>

<span class="n">pandas</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">)):</span>
	<span class="n">read_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">list_shares</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'Date'</span><span class="p">,</span> <span class="n">parse_dates</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
	<span class="n">first_date</span> <span class="o">=</span> <span class="n">read_df</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
	<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"first date of </span><span class="si">{</span><span class="n">list_shares</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">first_date</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
	<span class="n">pandas</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">read_df</span><span class="p">)</span>

<span class="n">suffixes</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">'_</span><span class="si">{</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s">'</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">list_shares</span><span class="p">]</span>

<span class="n">returns</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">[</span><span class="mi">0</span><span class="p">][[</span><span class="n">val_col_name</span><span class="p">]].</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">val_col_name</span><span class="p">:</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">val_col_name</span><span class="si">}{</span><span class="n">suffixes</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">})</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pandas</span><span class="p">)):</span>
	<span class="n">returns</span> <span class="o">=</span> <span class="n">returns</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span>
		<span class="n">pandas</span><span class="p">[</span><span class="n">i</span><span class="p">][[</span><span class="n">val_col_name</span><span class="p">]].</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="n">val_col_name</span><span class="p">:</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">val_col_name</span><span class="si">}{</span><span class="n">suffixes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">})</span>
		<span class="p">,</span> <span class="n">left_index</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">right_index</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'inner'</span><span class="p">)</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">current_date</span> <span class="o">=</span> <span class="n">returns</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="k">while</span> <span class="n">current_date</span> <span class="o">&lt;=</span> <span class="n">returns</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>

	<span class="n">sample</span> <span class="o">=</span> <span class="n">returns</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">current_date</span><span class="p">:</span><span class="n">current_date</span><span class="p">]</span>

	<span class="n">result</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">result</span><span class="p">,</span> <span class="n">sample</span><span class="p">])</span>

	<span class="n">current_date</span> <span class="o">+=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DateOffset</span><span class="p">(</span><span class="n">years</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

	<span class="k">if</span> <span class="n">current_date</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">returns</span><span class="p">.</span><span class="n">index</span><span class="p">:</span>

		<span class="n">future_dates</span> <span class="o">=</span> <span class="n">returns</span><span class="p">.</span><span class="n">index</span><span class="p">[</span><span class="n">returns</span><span class="p">.</span><span class="n">index</span> <span class="o">&gt;=</span> <span class="n">current_date</span><span class="p">]</span>
		<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">future_dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
			<span class="n">current_date</span> <span class="o">=</span> <span class="n">future_dates</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
		<span class="k">else</span><span class="p">:</span>
			<span class="k">break</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">returns</span> <span class="o">=</span> <span class="n">result</span>

<span class="n">change_df</span> <span class="o">=</span> <span class="n">returns</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>

<span class="n">shares_mean_std</span> <span class="o">=</span> <span class="p">[]</span>


<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">)):</span>

	<span class="n">pct_change</span> <span class="o">=</span> <span class="n">change_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'Close</span><span class="si">{</span><span class="n">suffixes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">].</span><span class="n">pct_change</span><span class="p">()</span>
	<span class="n">mean</span> <span class="o">=</span> <span class="n">pct_change</span><span class="p">[</span><span class="mi">1</span><span class="p">:].</span><span class="n">mean</span><span class="p">()</span>
	<span class="n">stdev</span> <span class="o">=</span> <span class="n">pct_change</span><span class="p">[</span><span class="mi">1</span><span class="p">:].</span><span class="n">std</span><span class="p">()</span>
	<span class="n">number_years</span> <span class="o">=</span> <span class="n">pct_change</span><span class="p">.</span><span class="n">count</span><span class="p">()</span>
	<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"For </span><span class="si">{</span><span class="n">list_shares</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s"> we have:  mean: </span><span class="si">{</span><span class="n">mean</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">, stdev: </span><span class="si">{</span><span class="n">stdev</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">, number_years: </span><span class="si">{</span><span class="n">number_years</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
	<span class="n">shares_mean_std</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">mean</span><span class="p">,</span> <span class="n">stdev</span><span class="p">))</span>

	<span class="n">change_df</span><span class="p">[</span><span class="sa">f</span><span class="s">'Daily Return</span><span class="si">{</span><span class="n">suffixes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pct_change</span>

	<span class="n">change_df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s">'Close</span><span class="si">{</span><span class="n">suffixes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">change_df</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">to_plot</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">def</span> <span class="nf">to_arbitrary_base</span><span class="p">(</span><span class="n">number</span><span class="p">,</span> <span class="n">base</span><span class="p">,</span> <span class="n">pad_to</span><span class="p">):</span>
	<span class="n">digits</span> <span class="o">=</span> <span class="p">[]</span>
	<span class="k">while</span> <span class="n">number</span><span class="p">:</span>
		<span class="n">digits</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">number</span> <span class="o">%</span> <span class="n">base</span><span class="p">))</span>
		<span class="n">number</span> <span class="o">//=</span> <span class="n">base</span>

	<span class="n">digits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">digits</span><span class="p">)</span>
	<span class="n">padded_array</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">pad</span><span class="p">(</span><span class="n">digits</span><span class="p">,</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">pad_to</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">digits</span><span class="p">)),</span> <span class="s">'constant'</span><span class="p">)</span>

	<span class="k">return</span> <span class="n">padded_array</span>

<span class="k">def</span> <span class="nf">get_composed_earning_for_weight</span><span class="p">(</span><span class="n">random_weights</span><span class="p">):</span>

	<span class="n">list_of_gains</span> <span class="o">=</span> <span class="p">[</span><span class="n">random_weights</span><span class="p">]</span>
	<span class="n">current_earning</span> <span class="o">=</span> <span class="n">random_weights</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
	<span class="k">for</span> <span class="n">year</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">change_df</span><span class="p">)):</span>
		<span class="n">current_earning</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">change_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">year</span><span class="p">].</span><span class="n">values</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span> <span class="o">*</span> <span class="n">current_earning</span>
		<span class="n">list_of_gains</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">current_earning</span><span class="p">)</span>

	<span class="n">gain_list</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">list_of_gains</span><span class="p">)</span>
	<span class="n">row_sums</span> <span class="o">=</span> <span class="n">gain_list</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
	<span class="k">return</span> <span class="n">row_sums</span>

<span class="n">max_base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">**</span> <span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"max_base: </span><span class="si">{</span><span class="n">max_base</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_base</span><span class="p">):</span>
	<span class="k">if</span> <span class="bp">False</span><span class="p">:</span>
		<span class="n">random_weights</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">))</span>
	<span class="k">else</span><span class="p">:</span>

		<span class="n">random_weights</span> <span class="o">=</span> <span class="n">to_arbitrary_base</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">base</span><span class="o">=</span><span class="n">base</span><span class="p">,</span> <span class="n">pad_to</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">))</span> <span class="o">/</span> <span class="p">(</span><span class="n">base</span><span class="o">-</span><span class="mf">1.0</span><span class="p">)</span>
		<span class="k">if</span> <span class="n">random_weights</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">==</span> <span class="mf">0.0</span><span class="p">:</span>
			<span class="n">random_weights</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">list_shares</span><span class="p">))</span>  

	<span class="n">random_weights</span> <span class="o">=</span> <span class="n">random_weights</span> <span class="o">/</span> <span class="n">random_weights</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>

	<span class="n">row_sums</span> <span class="o">=</span> <span class="n">get_composed_earning_for_weight</span><span class="p">(</span><span class="n">random_weights</span><span class="p">)</span>

	<span class="n">percent_increase</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diff</span><span class="p">(</span><span class="n">row_sums</span><span class="p">)</span> <span class="o">/</span> <span class="n">row_sums</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">mean</span><span class="p">,</span> <span class="n">std</span> <span class="o">=</span> <span class="n">percent_increase</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="n">percent_increase</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>  

	<span class="n">to_plot</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">mean</span><span class="p">,</span> <span class="n">std</span><span class="p">,</span> <span class="n">random_weights</span><span class="p">))</span>

<span class="k">if</span> <span class="bp">True</span><span class="p">:</span>
	<span class="n">y_values</span><span class="p">,</span> <span class="n">x_values</span><span class="p">,</span> <span class="n">random_weights</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span> <span class="n">to_plot</span><span class="p">)</span>
	<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">y_values</span><span class="p">)</span>

	<span class="n">cursor_hover</span> <span class="o">=</span> <span class="n">mplcursors</span><span class="p">.</span><span class="n">cursor</span><span class="p">(</span><span class="n">hover</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
	<span class="o">@</span><span class="n">cursor_hover</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">"add"</span><span class="p">)</span>
	<span class="k">def</span> <span class="nf">on_add</span><span class="p">(</span><span class="n">sel</span><span class="p">):</span>
		<span class="n">index</span> <span class="o">=</span> <span class="n">sel</span><span class="p">.</span><span class="n">index</span>
		<span class="n">sel</span><span class="p">.</span><span class="n">annotation</span><span class="p">.</span><span class="n">set_text</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Y</span><span class="si">{</span><span class="mf">100.0</span><span class="o">*</span><span class="n">y_values</span><span class="p">[</span><span class="n">index</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%,</span><span class="si">{</span><span class="mf">100.0</span><span class="o">*</span><span class="n">x_values</span><span class="p">[</span><span class="n">index</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%]="</span> <span class="o">+</span> <span class="nb">str</span><span class="p">([</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">list_shares</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">val</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">random_weights</span><span class="p">[</span><span class="n">index</span><span class="p">])</span> <span class="p">]))</span>

	<span class="n">cursor_click</span> <span class="o">=</span> <span class="n">mplcursors</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span>
	<span class="o">@</span><span class="n">cursor_click</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">"add"</span><span class="p">)</span>
	<span class="k">def</span> <span class="nf">on_click</span><span class="p">(</span><span class="n">sel</span><span class="p">):</span>
		<span class="n">index</span> <span class="o">=</span> <span class="n">sel</span><span class="p">.</span><span class="n">index</span>
		<span class="n">weight_vector</span> <span class="o">=</span> <span class="p">[</span><span class="n">val</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">random_weights</span><span class="p">[</span><span class="n">index</span><span class="p">])]</span>
		<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"plotting mixture: </span><span class="si">{</span><span class="n">weight_vector</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
		<span class="n">row_sums</span> <span class="o">=</span> <span class="n">get_composed_earning_for_weight</span><span class="p">(</span><span class="n">weight_vector</span><span class="p">)</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">()</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">row_sums</span><span class="p">)</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Plot of Vector'</span><span class="p">)</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'year'</span><span class="p">)</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'value'</span><span class="p">)</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">row_sums</span><span class="p">)))</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

	<span class="n">additional_y_values</span><span class="p">,</span> <span class="n">additional_x_values</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">shares_mean_std</span><span class="p">)</span>
	<span class="n">scatter</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">additional_x_values</span><span class="p">,</span> <span class="n">additional_y_values</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Additional Points'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

	<span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">label</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">additional_x_values</span><span class="p">,</span> <span class="n">additional_y_values</span><span class="p">,</span> <span class="n">suffixes</span><span class="p">):</span>
		<span class="n">plt</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">label</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">9</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>

	<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'stdev'</span><span class="p">)</span>
	<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Annual Yield'</span><span class="p">)</span>
	<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Scatter Plot of (x, y)'</span><span class="p">)</span>
	<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">correlation</span> <span class="o">=</span> <span class="n">change_df</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">correlation</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'coolwarm'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Correlation Matrix between FTSE 100 and S&amp;P 500'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>


</code></pre></div></div>]]></content><author><name>Nadav Benedek</name></author><category term="tax," /><category term="shares," /><category term="efficient" /><category term="frontier" /><summary type="html"><![CDATA[When we invest in a financial instrument (share, index, etc), we usually mostly care about its average annual yield and its variance, which is also called risk or volatility. Naturally, no one can predict the future, so the only thing we can do is to look at the past and assume that the past will reflect the future. So, if we analyze the past history of an instrument, for let’s say 10 or 20 years, we can measure the average annual yield and the variance, or more specifically the standard deviation (which is the square root of the variance, just so it will have the same units as the yield).]]></summary></entry><entry><title type="html">How much will we lose if we buy and sell shares too frequently? Optimal strategy for cashing out shares, tax harvesting and more.</title><link href="https://nadavb.com/Tax-Impact-On-Selling-Shares/" rel="alternate" type="text/html" title="How much will we lose if we buy and sell shares too frequently? Optimal strategy for cashing out shares, tax harvesting and more." /><published>2023-08-03T06:20:00+03:00</published><updated>2023-08-03T06:20:00+03:00</updated><id>https://nadavb.com/Tax%20Impact%20On%20Selling%20Shares</id><content type="html" xml:base="https://nadavb.com/Tax-Impact-On-Selling-Shares/"><![CDATA[<p>When we sell a share, we need to pay capital tax.</p>

<p>Assume we sell a share and then immediately buy the same share (or an equivalent-yield one): does this affect our gain? Are many transactions good or bad for the gain? And when we hold a portfolio of shares and need cash, which share should we sell? You can skip the math and jump straight to the final conclusion section.</p>

<h1 id="example-excessive-transaction-in-a-profitable-share">Example: Excessive transaction in a profitable share</h1>

<p>Let’s look at a simple example of a share which is in profit (a profitable share): we bought at \$100 a share that doubles every year, and we have a capital-tax of 25%. Today, after 1 year, we have two options:</p>

<ol>
  <li>Sell at \$200, cash left: \$175 (tax paid: \$25). Buy at \$175, sell after another 1 year at \$350 (tax paid: \$43.75), cash: \$306.25.</li>
  <li>Don’t sell at \$200, only after another 1 year when it reaches \$400. Cash after two years: \$325 (tax paid: \$75).</li>
</ol>

<p>So we can see in this simple example that holding the share for two years, rather than selling and rebuying after one year, is optimal.</p>
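<p>A quick sanity check of the two options in plain Python (a sketch; <code>net_sale</code> is a helper name I introduce here):</p>

```python
TAX = 0.25  # capital-tax rate from the example

def net_sale(sell_price, cost_basis, tax=TAX):
    """After-tax cash from selling a position bought at cost_basis."""
    return sell_price - tax * max(sell_price - cost_basis, 0.0)

# Option 1: sell at $200 after year 1, rebuy at $175, sell at $350 after year 2.
cash_year1 = net_sale(200, 100)                   # 175.0
option1 = net_sale(2 * cash_year1, cash_year1)    # 306.25

# Option 2: hold both years and sell once at $400.
option2 = net_sale(400, 100)                      # 325.0

print(option1, option2)  # 306.25 325.0
```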

<p>Why does this happen? One way to look at it is through the tax we paid: paying more capital-tax means we made more profit. In the second scenario, we paid tax on a price growth of 100-&gt;400=+300. In the first scenario, we had growth of 100-&gt;200=+100 and 175-&gt;350=+175, a total of +275.</p>

<p>If we split the second scenario into two growths, 100-&gt;200 and 200-&gt;400, we can see that the only difference is the run 200-&gt;400 instead of 175-&gt;350.
In the 2nd scenario we kept the deferred tax invested, so the run starts from 200 and the deferred \$25 doubles along with the share, ending the run at 400.
In the 1st scenario the run starts from 175, because we handed the \$25 tax over, and it ends at 350 instead of 400.
In other words, the 2nd scenario ends \$50 higher: \$25 of it is the deferred tax principal (which we pay either way), and \$25 is the extra gain that principal produced. That extra gain is itself taxed at 25%, so we keep 0.75*\$25=\$18.75 more (\$325 vs. \$306.25).</p>

<p>What happens if the share goes down and halves every year? In this case, there’s no tax to pay, and in both scenarios we’re left with \$25 after two years. So the conclusion is that whether the share goes up or down, it’s better to minimize unneeded transactions. Furthermore, in practice there are often additional fees on each buy/sell, which strengthens the conclusion even more.</p>

<h1 id="example-excessive-transaction-in-a-lossy-share-share-with-current-value-lower-than-the-purchase-value">Example: Excessive transaction in a lossy share (share with current value lower than the purchase value)</h1>

<p>We bought a share at \$100, and have a capital-tax of 25%. Today, after 1 year, the price is \$75, and we know it will double in a year. We have two options:</p>

<ol>
  <li>Sell at \$75 cash left: \$75 (no tax paid, and we can keep the capital loss to offset tax in the future). Buy at \$75, sell after another 1 year at \$150. We have a profit of \$75, but we keep the loss from the previous year of \$25, so we only need to pay tax on a profit of \$50 which is \$12.5, cash: \$137.5.</li>
  <li>Don’t sell at \$75, only after another 1 year when it reaches \$150. Cash after two years: \$137.5 (tax paid: \$12.5).</li>
</ol>

<p>So we can see that in the case of a lossy share, it doesn’t matter whether we keep it or sell and buy it again: since no tax event is involved, the two options are equivalent, unlike the case of a profitable share, where selling and rebuying is not a smart move.</p>
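<p>A quick check of the two options (a sketch, assuming the \$25 realized loss can fully offset the later gain):</p>

```python
TAX = 0.25

# Option 1: sell at $75 (realizing a $25 loss), rebuy at $75, sell at $150.
loss_carried = 100 - 75                          # capital loss kept for later
taxable = max((150 - 75) - loss_carried, 0)      # $75 profit minus the $25 offset
option1 = 150 - TAX * taxable                    # 137.5

# Option 2: hold; sell at $150 against the original $100 basis.
option2 = 150 - TAX * (150 - 100)                # 137.5

print(option1, option2)  # 137.5 137.5
```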

<h1 id="the-general-case-of-excessive-transactions-of-a-profitable-share">The general case of excessive transactions of a profitable share</h1>

<p>Define $b$ as the buy-price of the share, $y$ as the yield-multiple per year (e.g. a multiple of 1.1), $n$ as the number of years in our experiment, $s$ as the sell price, $t$ as the capital-tax ratio (e.g. $0.25$), $c$ as the net cash after the sell. Then, the sell price and the net cash we have after the tax reduction are:</p>

\[s = b * y^n\]

\[c = s - (s-b) * t\]

<p>Substituting the first equation into the second, we can write the cash after a single sale:</p>

\[c = by^n - (by^n-b) * t = by^n-by^nt+bt = b(y^n-y^nt+t) = b(y^n(1-t)+t)\]

<p>How much do we lose from frequent/unnecessary intermediate sales?</p>

<p>Define $f$ as the frequency of sales (number of sales during the $n$ years of our experiment), so the effective yield for each sub period of $\frac{n}{f}$ years is:</p>

\[p = (y^n)^{1/f}  = y^{n/f}\]

<p>So the cash after $f$ periods of buy/sell is:</p>

\[b (y^{\frac{n}{f}} (1-t)+t)^{f}\]

<p>So how much money do we lose from unnecessary sales? Let’s calculate the fraction of money we keep, compared to a single sale:</p>

\[\frac{(y^{\frac{n}{f}} (1-t)+t)^{f}}{y^n(1-t)+t} \quad \square\]

<p>For example, if the annual yield is 10% (y=1.1), capital-tax is 25% (t=0.25), number of sales is f=10, and the experiment is for n=10 years, we get a portion of $0.94$. That means we lose 6% due to the 10 sales we did, instead of holding the share and selling it once at the end of the experiment.</p>

<p>If we look at 20 years, and we sell and buy once a year, $y=1.1, t=0.25, f=20, n=20$ we get a portion of $0.80$, meaning that we lose $20\%$, by doing too many transactions. That’s definitely not negligible.</p>

<p>Another example: $y=1.2, t=0.25, f=10, n=10$ we get a portion of $0.83$, meaning that we lose $17\%$.</p>
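<p>A minimal sketch of this retention formula (the function name is mine), reproducing the three examples above:</p>

```python
def retention(y, t, f, n):
    """Fraction of the single-sale payout we keep when selling f times over n years.
    y: yearly yield multiple, t: capital-tax rate."""
    churned = (y ** (n / f) * (1 - t) + t) ** f   # f buy/sell rounds
    hold = y ** n * (1 - t) + t                   # one sale at the end
    return churned / hold

print(f"{retention(1.1, 0.25, 10, 10):.2f}")  # 0.94
print(f"{retention(1.1, 0.25, 20, 20):.2f}")  # 0.80
print(f"{retention(1.2, 0.25, 10, 10):.2f}")  # 0.83
```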

<h3 id="cashing-out">Cashing out</h3>

<p>We have a portfolio of two or more shares, and we need to cash out. Which share should we sell?</p>

<h4 id="two-different-capital-tax-shares">Two Different Capital-Tax Shares</h4>

<p>What happens if we hold two shares, each with different capital-tax rules, and we need to liquidate and get some cash out? Is it better to sell the high-tax share or the low-tax share? Intuitively, we’d want to sell the low-tax share, to pay less tax, right?</p>

<p>We buy share A at \$100 with capital-tax of 25% and buy share B at \$100 with capital-tax of 0%, both double every year. 
After 1 year we need \$100 in cash to buy a TV. Our options are either sell from share A or sell from share B:</p>

<ol>
  <li>
    <p>After 1 year, share A is valued at \$200. If we sell a fraction \$100/\$175 ≈ 57% of our share A position, we cash out 0.5714 * (\$200 - 25% * (\$200 - \$100)) = \$100 to buy our TV, and we’re left with ≈ 0.4286 shares of A and 1.0 of share B, which we didn’t sell. Wait another year, and we now sell the 0.4286 shares of A for 0.4286 * (\$400 - 25% * (\$400 - \$100)) ≈ \$139.29. Then sell share B, pay \$0 in taxes, and get \$400. Total cash after two years: ≈ \$539.29.</p>
  </li>
  <li>
    <p>After 1 year, share B is valued at \$200. Sell \$100 out of share B, pay no taxes, and get the \$100 cash to buy our TV. We’re left with 0.5 shares of B and 1.0 of share A. Wait another year. Sell 1.0 shares of A to get cash of 1.0 * (\$400 - 25% * (\$400 - \$100)) = \$325. Sell the 0.5 shares of B, pay zero taxes, and get \$200. Total cash after two years: \$525.</p>
  </li>
</ol>

<p>In this very specific example, we see that we need to sell share A with the higher capital-tax, but this is <strong>not always the case</strong>, and it depends on all the other parameters, as shown in the following paragraphs.</p>
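<p>These two options can be checked with a short sketch (helper names are mine; both shares start at \$100 and double yearly, and we cash out \$100 after year one):</p>

```python
def net_sale(price, basis, tax):
    """After-tax proceeds of a full position worth `price` with cost basis `basis`."""
    return price - tax * (price - basis)

def total_cash(sell_first, tax_a=0.25, tax_b=0.0, need=100.0):
    """Cash after two years when the TV money comes out of `sell_first`."""
    if sell_first == 'A':
        frac = need / net_sale(200, 100, tax_a)   # ~0.571 of A sold in year 1
        return (1 - frac) * net_sale(400, 100, tax_a) + net_sale(400, 100, tax_b)
    frac = need / net_sale(200, 100, tax_b)       # 0.5 of B sold in year 1
    return net_sale(400, 100, tax_a) + (1 - frac) * net_sale(400, 100, tax_b)

print(round(total_cash('A'), 2), total_cash('B'))  # 539.29 525.0
```

<p>(A fractional position scales linearly — its value and its cost basis shrink together — so multiplying the full-position net proceeds by the remaining fraction is exact.)</p>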

<p>It is <strong>not true</strong> to state that we should keep the share with the lower/higher capital tax, in all cases.</p>

<h1 id="two-shares-with-identical-yield-but-different-purchase-price-current-price-taxation-level">Two shares with identical yield but different purchase price, current price, taxation level</h1>

<p>We bought share A at $b_a$, valued today at $v_a$, with capital-tax $t_a$ (e.g. 0.25), and bought share B at $b_b$, valued today at $v_b$, with capital-tax $t_b$; both have a yield-multiple per year of $y$ (e.g. 1.1). We now need $c$ in cash to buy something, and we plan to hold the shares for $n$ years, until the end of our experiment. Our options are to sell from share A or from share B:</p>

<p>If we decide to sell a fraction of $\frac{c}{v_a - t_a(v_a - b_a)}$ of our <strong>share A</strong> position, we cash out exactly $\frac{c}{v_a - t_a(v_a - b_a)}* (v_a - t_a(v_a - b_a)) = c $ to buy our TV, and we’re left with $1-\frac{c}{v_a - t_a(v_a - b_a)}$ shares of A and 1.0 shares of B. 
Wait $n$ years, and we now sell $1-\frac{c}{v_a - t_a(v_a - b_a)}$ shares of A to have cash of</p>

\[(1 - \frac{c}{v_a - t_a(v_a - b_a)}) * ( v_a y^n(1-t_a) + b_a  t_a)\]

<p>Then sell 1.0 share B to get cash of $ (v_b y^n (1-t_b) + b_b t_b) $.</p>

<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      equationNumbers: { autoNumber: "AMS" },
      tagSide: "right"
    },
    tex2jax: {
      inlineMath: [ ['$','$'], ["\\(","\\)"] ],
      displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
      processEscapes: true
    }
  });
  MathJax.Hub.Register.StartupHook("TeX AMSmath Ready", function () {
    MathJax.InputJax.TeX.Stack.Item.AMSarray.Augment({
      clearTag() {
        if (!this.global.notags) {
          this.super(arguments).clearTag.call(this);
        }
      }
    });
  });
</script>

<script type="text/javascript" charset="utf-8" src="https://cdn.jsdelivr.net/npm/mathjax@2/MathJax.js?config=TeX-AMS_CHTML">
</script>

<p>Total cash after $n$ years:</p>

\[\begin{equation}
  \label{eq:1}
  \begin{aligned}
    (1 - \frac{c}{v_a(1-t_a) +   b_a  t_a}) * ( v_a y^n(1-t_a) + b_a  t_a) +      \underbrace{v_b y^n (1-t_b) + b_b t_b}_{\text{X}}         
  \end{aligned}
\end{equation}\]

<p>If we decide to cash out from <strong>share B</strong>, we just need to swap variables, and the cash we get after $n$ years is:</p>

\[\begin{equation}
  \label{eq:2}
  \begin{aligned}
    (1 - \frac{c}{v_b (1-t_b) + b_b t_b }) * ( v_b y^n(1-t_b) + b_b  t_b) +   \underbrace{v_a y^n (1-t_a) + b_a t_a}_{\text{Y}} 
  \end{aligned}
\end{equation}\]

<p>Let’s compare which expression is higher, and subtract X and Y from both sides, to find the maximal expression:</p>

\[\begin{equation}
  \label{eq:3}
  \begin{aligned}
    ( - \frac{c}{v_a(1-t_a) +   b_a  t_a}) * ( v_a y^n(1-t_a) + b_a  t_a) \: ?  \: ( - \frac{c}{v_b (1-t_b) + b_b t_b }) * ( v_b y^n(1-t_b) + b_b  t_b)
  \end{aligned}
\end{equation}\]

<p>As you can see, $c$ cancels out; that’s why the amount of cash we need does not affect the decision. Let’s divide by $(-c)$, which flips the inequality, so we now look for the <strong>minimal</strong> expression:</p>

\[\begin{equation}
  \label{eq:4}
  \begin{aligned}
     \frac{ v_a y^n(1-t_a) + b_a  t_a}{v_a(1-t_a) +   b_a  t_a}  \: \: ?  \: \:  \frac{ v_b y^n(1-t_b) + b_b  t_b}{v_b (1-t_b) + b_b t_b }
  \end{aligned}
\end{equation}\]

<p>It’s nice to see that each side does not mix variables between shares. Let’s look at one side, divide the numerator and the denominator by $b$, and define $m=\frac{v}{b}$.</p>

<p>So, in the <u>general case</u>, the optimal strategy is to sell the share that has a <strong>minimum</strong> value of the gain:</p>

\[\begin{equation}
  \label{eq:7}
  \begin{aligned}
      G := \frac{ m y^n- t(m y^n- 1)}{m- t(m-   1)}
  \end{aligned}
\end{equation}\]

<p>As we can see, $v$ and $b$ no longer appear in the formula, only $m$ does. That means the only parameters that can affect the decision are the <strong>quadruplet (y, t, m, n)</strong>.</p>

<p>Alternatively, we can denote what we expect the price of the share to be in n years as $F=v y^n$ and we get:</p>

\[\begin{equation}
  \label{eq:6}
  \begin{aligned}
     G := \frac{    \underbrace{F- t(F- b)}_{\text{Future Net Proceeds}}       }{   \underbrace{v- t(v - b) }_{\text{Current Net Proceeds}} }
  \end{aligned}
\end{equation}\]

<p>If you observe carefully, you see that the numerator is the <strong>after-tax proceeds</strong> if we sell the share in n years, and the denominator is the <strong>after-tax proceeds</strong> if we sell the share today. In other words, the golden rule is: <strong>Out of a portfolio of shares, sell the share for which the ratio of future after-tax proceeds to current after-tax proceeds is the lowest</strong>. If we define this ratio as the gain G, we should strive to <strong>hold the shares having the highest G</strong>.</p>
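<p>As a quick sketch (function and variable names are mine), the golden rule can be coded directly:</p>

```python
def gain(b, v, t, y, n):
    """G = future after-tax proceeds / current after-tax proceeds."""
    future = v * y ** n
    return (future - t * (future - b)) / (v - t * (v - b))

# Revisiting the TV example: both shares double yearly,
# A is taxed at 25%, B at 0%.
g_a = gain(b=100, v=200, t=0.25, y=2.0, n=1)   # 325/175
g_b = gain(b=100, v=200, t=0.00, y=2.0, n=1)   # = y**n = 2.0
assert g_a < g_b   # lowest G goes first: sell A, keep B
```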

<h3 id="if-two-shares-have-the-same-yield-and-taxation">If two shares have the same yield and taxation</h3>

<p>What if two shares have the same yield and taxation? By formula (5), two such shares with the same m have the same gain. But should we sell the share with the higher m or the lower one? Let’s differentiate G with respect to m to see how it changes:</p>

\[\begin{aligned}
     \frac{\partial G}{\partial m} =   \frac{(y^n- t y^n)(m- t(m- 1))-(m y^n- t(m y^n- 1))(1- t)}{(m- t(m-   1))^2} =
  \end{aligned}\]

\[\begin{aligned}     
     \frac{ (1-t) \quad  [  y^nm-ty^nm + y^nt -  m y^n   +  tm y^n - t]     }{(m- t(m-   1))^2}=
  \end{aligned}\]

\[\begin{aligned}     
     \frac{ (1-t) \quad t [   y^n - 1]     }{(m- t(m-   1))^2}
  \end{aligned}\]

<p>Now, since $t$ is positive, $(1-t)$ is positive, the denominator is positive, and if the yield is positive then $y&gt;1$, so $y^n - 1$ is positive as well. Hence the derivative is positive: increasing m increases G, i.e. G is monotonically increasing w.r.t. m. So if we have two (or more) shares with the same yield and taxation, the only thing we need to know is m, and we want to sell the share with the lowest m. That is, we <strong><u>want to sell the share with the lowest current-price-to-purchase-price ratio</u></strong>: the share whose price has grown by the smallest multiple since we purchased it.</p>
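<p>A short numeric sanity check of this monotonicity (a sketch; the parameter values are arbitrary):</p>

```python
def gain_m(m, y, t, n):
    """G as a function of m = current price / purchase price."""
    return (m * y ** n - t * (m * y ** n - 1)) / (m - t * (m - 1))

ms = [0.5, 1.0, 2.0, 4.0, 8.0]
gs = [gain_m(m, y=1.1, t=0.25, n=10) for m in ms]
assert gs == sorted(gs)   # G grows with m, so sell the lowest-m share
```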

<p>Observation 1: If you apply this rule to the case when we <strong>bought two lots of the same share/company, each with a different price-per-share</strong>, then since obviously the current price per share is identical (because it’s the same share/symbol), selling the lower <em>m</em> means selling the share/lot with the <strong>higher</strong> purchase price. This is sometimes called  <a href="https://www.investopedia.com/terms/s/specificsharesmethod.asp" target="_blank">Highest Cost Basis Policy</a>, and it’s obvious: if you have 2 shares, and you need to sell 1, if you sell the share with the higher cost, you will pay less tax, and still remain with 1 share.</p>

<p>Observation 2: If you have multiple shares, all with the same yield and taxation, the sell-the-<strong>lowest</strong>-m strategy is equivalent to selling the share whose current tax event is the <strong>smallest</strong> in absolute dollars (the proof is similar, by differentiating the current tax payment w.r.t. m).</p>

<h3 id="one-share-has-zero-taxation-but-both-have-the-same-yield">One share has zero taxation, but both have the same yield</h3>

<p><strong>If share A has zero taxation, and B has nonzero taxation</strong>, and share A has the same yield (or better) than share B, <strong>it’s always better to keep share A</strong>, and sell B. According to equation (5) we can see share A has G of $y^n$. Let’s prove that the gain of A is higher than the gain of B:</p>

\[\begin{aligned}     
     y^n &gt; \frac{ m y^n- t(m y^n- 1)}{m- t(m -   1)} \rightarrow  (m-t(m-1))y^n &gt; my^n - t(my^n -1)
  \end{aligned}\]

\[\begin{aligned}     
     ty^n &gt; t \rightarrow y^n&gt;1 \quad \square
  \end{aligned}\]

<p>So if the expected yield is positive, the inequality holds, which proves our statement.</p>

<h3 id="concrete-examples">Concrete examples</h3>

<p>And now for some concrete examples:</p>

<p>Example of lower <strong>nonzero</strong> taxation of B, identical yield of A and B, but still with different optimal strategy:</p>

<p>$b_a=400, v_a=1800, t_a=0.25, b_b=400, v_b=1500, t_b=0.15, y=1.05, n=8, c=600$ -&gt; best strategy: A</p>

<p>$b_a=200, v_a=3900, t_a=0.25, b_b=200, v_b=1400, t_b=0.15, y=1.30, n=10, c=300$ -&gt; best strategy: B</p>

<p>What happens if we expect a <strong>different yield for each share</strong>? Should we always keep the share with the better yield (assuming identical risk, i.e. variance)?
Surprisingly and counterintuitively, the answer is no! Again, it depends on all the other factors.</p>

<p>Example of better yield in share B, but we still need to cash out B:</p>

<p>$b_a=100, v_a=2000, t_a=0.25, y_a=1.35, b_b=800, v_b=1000, t_b=0.20, y_b=1.40, n=2, c=300$ -&gt; best strategy: sell B</p>

<p>And another surprising result: What happens if both shares have the <strong>same taxation</strong> (as in many real life cases), but the <strong>yield on share B is higher</strong>? Should we keep share B? No! Have a look:</p>

<p>$b_a=100, v_a=1700, t_a=0.30, y_a=1.35, b_b=1000, v_b=1600, t_b=0.30, y_b=1.40, n=2, c=500$ -&gt; best strategy: sell B</p>

<p>Another surprising result: share B has capital tax of 0%, while share A has capital tax of 20%, and yet we need to sell B. This is because share A has a better yield than share B, which incentivizes keeping it. This illustrates the counteracting effects: in general we’d like to keep shares with lower taxation and better yield, but here the yield factor outweighs the tax factor.</p>

<p>$b_a=200, v_a=2500, t_a=0.20, y_a=1.35, b_b=700, v_b=3500, t_b=0.00, y_b=1.20, n=8, c=700 $ -&gt; best strategy: sell B, \$29569 vs \$34144</p>

<p>There are even examples where <strong>share B has a better yield <em>and</em> better taxation, and we still need to sell it</strong>:</p>

<p>$b_a=100, v_a=2600, t_a=0.25, y_a=1.35, b_b=700, v_b=900, t_b=0.20, y_b=1.40, n=2, c=400$ -&gt; best strategy: sell B, \$4405 vs \$4408</p>

<p>However, if share B has a capital tax of exactly zero, share A has nonzero tax, and the yield of share A is equal to or less than B’s, we should always keep B (with the zero capital tax) and sell A, as proved above.</p>

<p>Sometimes, only the <strong>length of the experiment</strong> affects the decision, when all the other parameters are identical:</p>

<p>$b_a=800, v_a=2600, t_a=0.10, y_a=1.30, b_b=800, v_b=1000, t_b=0.30, y_b=1.35, n=9, c=200 $ -&gt; best strategy: sell A, \$33502 vs \$33290</p>

<p>$b_a=800, v_a=2600, t_a=0.10, y_a=1.30, b_b=800, v_b=1000, t_b=0.30, y_b=1.35, n=5, c=200 $ -&gt; best strategy: sell B, \$11422 vs \$11428</p>
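<p>The examples above can be reproduced with a short simulation (a sketch; fields follow the post’s notation: purchase price $b$, current value $v$, tax $t$, yield $y$, horizon $n$, cash needed $c$):</p>

```python
def final_wealth(sell, shares, n, c):
    """Wealth after n years if we raise net cash c today by selling part of
    share `sell`; every other share is held fully and sold after n years."""
    total = 0.0
    for i, s in enumerate(shares):
        b, v, t, y = s["b"], s["v"], s["t"], s["y"]
        frac = c / (v - t * (v - b)) if i == sell else 0.0  # portion sold to net c
        # the remaining portion is held n years, then sold and taxed on the gain
        total += (1 - frac) * (v * y**n - t * (v * y**n - b))
    return total

# The 0%-tax-on-B example above: selling B is still better, $34144 vs $29569.
shares = [dict(b=200, v=2500, t=0.20, y=1.35),
          dict(b=700, v=3500, t=0.00, y=1.20)]
sell_A = final_wealth(0, shares, n=8, c=700)
sell_B = final_wealth(1, shares, n=8, c=700)
assert round(sell_B) == 34144 and round(sell_A) == 29569
```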

<h1 id="what-happens-when-were-forced-to-materialize-a-profit-should-we-sell-a-lossy-share">What happens when we’re forced to materialize a profit? Should we sell a lossy share?</h1>

<p>We know from previous conclusions that when we have two shares with identical future expectations, one in profit and one in loss, and we need to cash out, we should sell the one in loss. However, what happens when we are <strong>forced</strong> to sell the share in profit, because of some external constraint, or because we have information that the company is going to collapse? Or, what happens when we have a profit from some other capital gain (e.g. rent) that can be offset by selling a share in loss? Should we sell a lossy share to offset positive capital tax?</p>

<p>Let’s define the problem statement: The capital tax is $t$. We materialized a profit of $w$ and bought a TV with it, so we need to pay a tax of $\b{tw}$. We hold 1 unit of a lossy share that we bought at price $b$; its current value is $v$ ($\b{v &lt; b}$). We consider selling a portion $\b{p}$ of the lossy share. If we sell a small portion, we won’t have the cash to pay the tax $tw$, so we’ll have to take a loan with interest multiple $\b{r&gt;1}$ (e.g. 1.05; this can be seen as the risk-free interest, or discount rate). If we sell too much, we’ll have spare cash which we will use to buy back the lossy share. Either way, we will sell the share in full $n$ years from now (assume it will then be in profit, $vy^n&gt;b$), so that the scenarios are comparable.</p>

<p>The tax that we need to pay now is $T\equiv t(w -  \underbrace{p(b-v)}_{\text{positive}} )$. If the tax is positive, we need to pay it; if it’s negative, we don’t get it as cash from the authorities, but it can be deducted from the tax we’ll pay when we sell the share, in full, $n$ years from now.</p>

<p>If $T$ is positive, the cash we have now is $C_{tp} \equiv pv-T$. If $C_{tp} &lt;0$ we need a loan for the tax payment, otherwise if $C_{tp} &gt;0$ we can purchase more units of our share. <br />
If $T$ is negative, the cash we have now (positive for sure) is $C_{tn} \equiv pv$ and we can purchase the share with it, and save the tax benefit for the future.</p>

<p>Consider the loan-case where $T&gt;0,C_{tp}&lt;0$. The cash we have after $n$ years is:</p>

\[\begin{aligned}     
      C_{tpcn}\equiv \underbrace{C_{tp}r^n }_{\text{pay the debt}} +   \underbrace{(1-p)(vy^n(1-t)+tb)}_{\text{sell the remaining portion}}       
  \end{aligned}\]

<p>Let’s differentiate it by $p$, to find the optimal strategy:</p>

\[\begin{aligned}     
      C_{tpcn}' =  \underbrace{r^n (v(1-t)+tb) }_{\text{cash if we'd sell share today and lend it}}   -   \underbrace{(vy^n(1-t)+tb) }_{\text{cash if wait n years and then sell}}   
  \end{aligned}\]

<p>It’s interesting to see that the derivative is negative whenever it’s better to hold a share than to sell it now (and lend the money), which is usually the case; otherwise there would be no point in holding any share. So, since $C_{tpcn}' &lt; 0$, we would like to reduce $p$ to 0 in order to maximize our utility. This means we should <strong>not</strong> sell any portion of the lossy share, but just take a loan. At $p=0$ we will have: $C_{tpcn}(p=0)=-twr^n+vy^n(1-t)+tb$ $\square$</p>

<p>Consider the positive cash case where $T&gt;0,C_{tp}&gt;0$. We’ll use all the available cash to buy the share.</p>

\[\begin{aligned}     
      C_{tpcp} = \underbrace{C_{tp}(y^n(1-t)+t)}_{\text{from buying more portion today}}  +   \underbrace{(1-p)(vy^n(1-t)+tb)}_{\text{sell the remaining portion}}  
  \end{aligned}\]

<p>Let’s differentiate w.r.t. $p$:</p>

\[\begin{aligned}     
      C_{tpcp}' =  (v(1-t) + tb)(y^n(1-t)+t)-  (vy^n(1-t)+tb)   
  \end{aligned}\]

<p>This will be positive when our assumption ($v &lt; b$) holds.</p>

<details>
  <summary>Click for proof</summary>
  
  $$
  \begin{aligned}     
      C_{tpcp}' =  vy^n(1-t)^2 +  vt(1-t)     + tby^n(1-t) + t^2b  -  vy^n(1-t) - tb  
  \end{aligned}
$$

  $$
  \begin{aligned}     
      =  vy^n(1-t)^2 +  vt(1-t)     + tby^n(1-t) -  vy^n(1-t)  - tb(1-t) 
  \end{aligned}
$$

 $$
  \begin{aligned}     
= (1-t)(  vy^n(1-t) + vt  + tby^n -  vy^n - tb)=t(1-t)(  -vy^n + v  + by^n  - b)
  \end{aligned}
$$
 $$
  \begin{aligned}     
=t(1-t)(y^n-1)(b-v)
  \end{aligned}
$$

Since all elements are positive, the whole expression is positive. $\square$

</details>


<p>That means increasing $p$ will increase $C_{tpcp}$ and decrease $T$, until we reach $T=0$; at that point $p$ reaches $w-p(b-v)=0 \rightarrow p = \frac{w}{b-v}$, and at this $p$ we will have:</p>

\[\begin{aligned}     
      C_{tpcp} = pv(y^n(1-t)+t)  +  (1-p)(vy^n(1-t)+tb)  = t(b-w)+vy^n(1-t)
  \end{aligned}\]

<p>Now consider the case of $ T &lt;0 $ (when $p$ is above $\frac{w}{b-v}$):</p>

\[\begin{aligned}     
 C_{tncp}=  \underbrace{pv(y^n(1-t)+t)}_{\text{from buying more portion today}}  +    \underbrace{(1-p)(vy^n(1-t)+tb)}_{\text{sell the remaining portion}}  -  \underbrace{t(w-p(b-v))}_{\text{tax we get back}}  
  \end{aligned}\]

\[\begin{aligned}     
 C_{tncp}'=  v(y^n(1-t)+t) -(vy^n(1-t)+tb) + t(b-v) = vt -tb + t(b-v)=0
  \end{aligned}\]

<p>Since the derivative is 0, there’s no point in increasing $p$ beyond $\frac{w}{b-v}$: nothing changes.</p>

<p>So we have two stationary points: one at $p=\frac{w}{b-v}$ with $C_{tpcp}$ and no loan, and the other at $p=0$ with $C_{tpcn}$, where we take a loan to pay the tax. Let’s compare the two optima:</p>

\[\begin{aligned}     
\underbrace{t(b-w)+vy^n(1-t)}_{\text{when we sell a portion to have zero total tax}} \quad  ?  \quad \underbrace{-twr^n+vy^n(1-t)+tb}_{\text{when p=0 and we take a loan}} 
  \end{aligned}\]

\[\begin{aligned}     
\underbrace{-1}_{\text{when we sell a portion to have zero total tax}} \quad  ?  \quad \underbrace{-r^n}_{\text{when p=0 and we take a loan}} 
  \end{aligned}\]

<p>Since $r&gt;1$, we see that the best strategy is <strong>not to take a loan</strong>, but instead to sell a portion $p=\frac{w}{b-v}$ (or more) of the lossy share, <strong>so that the total tax payment is zero</strong>. We then have free cash from cashing out that portion of the lossy share, and we use it to buy the share again (or a different share, assuming all have the same expected future yield). This strategy is sometimes called Tax-Loss Harvesting; selling and buying back the same share is sometimes called a Wash Sale.</p>
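<p>The comparison can be sketched numerically (function names and the sample parameters are mine; $w$ is the realized profit, $b,v$ the lossy share’s purchase price and current value, $r$ the loan interest multiple):</p>

```python
def harvest(w, b, v, t, y, n):
    """Sell portion p = w/(b-v) of the lossy share so the total tax today is
    zero, reinvest the proceeds, and sell everything after n years."""
    p = w / (b - v)
    cash = p * v                               # tax T = t*(w - p*(b-v)) = 0
    reinvested = cash * (y**n * (1 - t) + t)   # buy today, sell in n years
    remaining = (1 - p) * (v * y**n * (1 - t) + t * b)
    return reinvested + remaining

def take_loan(w, b, v, t, y, n, r):
    """Keep the whole lossy share (p = 0) and borrow to pay the tax t*w."""
    return -t * w * r**n + v * y**n * (1 - t) + t * b

h = harvest(w=30, b=100, v=50, t=0.25, y=1.2, n=5)
l = take_loan(w=30, b=100, v=50, t=0.25, y=1.2, n=5, r=1.05)
assert h > l   # harvesting beats the loan, as derived above
```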

<p>Should we sell the profitable share first, or the lossy one first? In some countries, the bank or trading institution collects the tax at the moment you sell a share and transfers it to the tax authorities; only at the end of the year, when you file the annual tax report, do you get the excess tax back. In these countries it is better to first sell the lossy share and only then the profitable one. This way, when you sell the profitable share, the bank takes the capital loss from the lossy share into account, so you pay less tax and don’t have to wait until the end of the year.</p>

<h1 id="if-we-sold-a-lossy-share-should-we-also-sell--buy-a-profitable-share-to-increase-its-purchase-price">If we sold a lossy share, should we also sell &amp; buy a profitable share to increase its purchase price?</h1>

<p>Let’s follow our usual example, with a capital tax of 25%. We hold 1 lossy share A (b=\$100, v=\$50), and 1 profitable share B (b=\$100, v=\$200). We sell share A and buy a TV for \$50. Two cases:</p>

<ol>
  <li>We keep the profitable share for one year, and then sell it at \$400. Our profit is \$300 minus the capital loss from previous year of \$50, so we pay tax of \$62.5 and have cash of \$337.5</li>
  <li>We sell the profitable share at \$200; the capital profit is \$100 minus the \$50 loss, i.e. \$50, so we pay tax of \$12.5 and have cash of \$187.5. We buy the share again at \$187.5. After one year we sell it at \$375, pay tax of \$46.875, and have cash of \$328.125</li>
</ol>

<p>We can see it’s better <strong>not to touch the profitable share</strong>, which corroborates our previous conclusion that excessive transactions in profitable shares are disadvantageous.</p>
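<p>The arithmetic of the two cases, as a quick check (the 25% tax rate is implied by the numbers above):</p>

```python
T = 0.25           # capital-tax rate implied by the example
loss_offset = 50   # realized loss from selling share A

# Case 1: hold share B one year, then sell at $400
case1 = 400 - T * ((400 - 100) - loss_offset)

# Case 2: sell B now at $200, offset the loss, rebuy, price doubles in a year
cash_now = 200 - T * ((200 - 100) - loss_offset)
case2 = 2 * cash_now - T * (2 * cash_now - cash_now)

assert case1 == 337.5 and case2 == 328.125 and case1 > case2
```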

<h1 id="conclusions"><strong>Conclusions</strong></h1>

<p>Unless we have other considerations, for example diversifying the portfolio to reduce variance (risk) or reduce correlation between shares:</p>

<ol>
  <li>
    <p>Better to <strong>minimize unneeded transactions</strong> of selling and buying a <strong>share in profit</strong>, or switching a share in profit to a different share, due to the tax event we have to pay. Having transaction fees even strengthens this statement.</p>
  </li>
  <li>
    <p>If we <strong>need to cash out</strong>, usually <strong><u>all parameters matter</u></strong>, and in general we need to cash out the share with the lowest gain $G$, as defined in (5). However:</p>

    <p>2.1 If a share has <strong>capital tax of exactly zero</strong>, with <strong>better or equal yield</strong> to the other share, we should always keep it, and the current shares prices or purchase prices do not matter. If the capital tax is zero but the yield is worse, we should consider all other parameters.</p>

    <p>2.2. If two shares have the <strong>same yield and taxation</strong>, we should sell the share with the <strong>lowest m</strong> (lowest price increase in percentage), meaning the lowest current price to purchase price ratio. <strong>This is equivalent to choosing the share to sell, in which the present tax event size in $ is the lowest</strong>. That also means that we should <strong>prefer selling shares in loss rather than shares in profit.</strong></p>

    <p>2.3 All parameters matter, <strong>even if one share has a better yield than the other</strong>, and <strong>even if one share has a better yield and better taxation than the other</strong>. In some cases we need to sell the share with the higher yield. In some cases we need to sell the share with the lower taxation ratio.</p>

    <p>2.4  Higher capital-tax on a share A increases the chances we need to sell A, but other parameters can change the decision.</p>

    <p>2.5  Higher yield on a share A increases the chances we need to sell B, but other parameters can change the decision. Combining the two last statements: <strong>In general it’s more likely to keep shares with lower taxation and better yield, but other parameters can change the decision</strong>.</p>
  </li>
  <li>
    <p><strong>If we had a capital profit</strong>, we should offset it by selling a share in loss, in an amount that zeroes out our capital profit for this year, and buy the share again (or an equivalent share). In countries where the tax is collected by the trading institution, it is better to first sell the lossy share and then the profitable one.</p>
  </li>
</ol>

<p>In general, you should <strong><u>try to avoid actions which result in a tax payment today</u></strong>.</p>

\[\square\]

<p>Here’s a   <a href="https://docs.google.com/spreadsheets/d/1O9_H30AT-8z8WN2Z-YT-9GS1_SAzSmROk3yo4pOR3Mk/edit?usp=sharing" target="_blank">sheet</a> table in which you can enter the parameters and make the right decision of which share to cash out.</p>]]></content><author><name>Nadav Benedek</name></author><category term="tax," /><category term="shares" /><summary type="html"><![CDATA[When we sell a share, we need to pay capital tax.]]></summary></entry><entry><title type="html">Distance between two lines in 3D</title><link href="https://nadavb.com/distance-two-lines-3d/" rel="alternate" type="text/html" title="Distance between two lines in 3D" /><published>2023-07-10T14:22:00+03:00</published><updated>2023-07-10T14:22:00+03:00</updated><id>https://nadavb.com/distance-two-lines-3d</id><content type="html" xml:base="https://nadavb.com/distance-two-lines-3d/"><![CDATA[<p>Assume you have two parametric lines: $p_1=r_1+e_1$ and $p_2=r_2+e_2$</p>

<p>Start by checking whether they are parallel: are the normalized direction vectors ($e$) identical (up to sign)? If they are, pick any point on $p_1$ and apply the point-to-line distance formula (google it).</p>

<p>Otherwise, continue as follows:
The ‘distance’ between the lines is the minimum distance over all pairs of points A, B on the two lines. So let A, B be the points that achieve this minimum.<br />
Now, the line AB must be perpendicular to both lines $p_1,p_2$. Why? Because otherwise you could move a bit along one of the lines and make the distance shorter: if you move $\epsilon$, the distance is reduced by $\epsilon \cdot \cos(\alpha)$, where $\alpha \neq 90^{\circ}$</p>

<p>So the only vector direction which is perpendicular to both lines is $n=e_1 \times e_2$. This is by definition of cross product. Let’s define $\hat{n}$ as its normalized version. Great.</p>
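<p>Skipping ahead, the whole recipe, including the parallel case, fits in a few lines of Python; a sketch using plain tuples (the final formula $d=\hat{n} \cdot (r_1 - r_2)$ is derived below, taken here in absolute value):</p>

```python
def cross(u, w):
    return (u[1]*w[2] - u[2]*w[1], u[2]*w[0] - u[0]*w[2], u[0]*w[1] - u[1]*w[0])

def norm(u):
    return sum(c * c for c in u) ** 0.5

def line_distance(r1, e1, r2, e2, eps=1e-12):
    """Distance between the lines r1 + s*e1 and r2 + s*e2 in 3D."""
    n = cross(e1, e2)
    d = tuple(a - b for a, b in zip(r1, r2))
    if norm(n) < eps:                       # parallel: point-to-line distance
        return norm(cross(e1, d)) / norm(e1)
    return abs(sum(ni * di for ni, di in zip(n, d))) / norm(n)

# skew: the x-axis vs. the y-axis lifted to z=1, distance 1
assert line_distance((0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0)) == 1.0
# parallel: two parallel lines 1 apart
assert line_distance((0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 0, 0)) == 1.0
```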

<p>Now imagine that we place the coordinate origin at point B. Since $AB$ is perpendicular to line $p_1$, and $r_1$ lies on $p_1$, the triangle $r_1AB$ has a right angle at A, and the distance $|AB|=d$ is <br />
$d=\hat{n} \cdot (r_1-B) $ <br />
Great, now let’s write $B$ as $B=r_2 + (B-r_2)$ and we get:<br />
$d=\hat{n} \cdot (r_1-r_2-(B-r_2))$ <br />
Now, the vector $B-r_2$ is perpendicular to $\hat{n}$, by definition, so its dot product with $\hat{n}$ is zero. Therefore: <br />
\(d=\hat{n} \cdot (r_1 - r_2) \: \: \square\)</p>]]></content><author><name>Nadav Benedek</name></author><category term="distance-3d-lines" /><summary type="html"><![CDATA[Assume you have two parametric lines: $p_1=r_1+e_1$ and $p_2=r_2+e_2$]]></summary></entry><entry><title type="html">Intuitive explanation for the max-min inequality: Why min-max is always greater than max-min.</title><link href="https://nadavb.com/max_min_inequality_intuitive_explanation/" rel="alternate" type="text/html" title="Intuitive explanation for the max-min inequality: Why min-max is always greater than max-min." /><published>2023-04-05T14:22:00+03:00</published><updated>2023-04-05T14:22:00+03:00</updated><id>https://nadavb.com/max_min_inequality_intuitive_explanation</id><content type="html" xml:base="https://nadavb.com/max_min_inequality_intuitive_explanation/"><![CDATA[<p>Min-Max is always greater than Max-Min:</p>

\[\min_y \max_x f(x,y) \geq \max_x \min_y f(x,y)\]

<p>Why?</p>

<p>Look at the following table, showing the values of a simple function $f(x,y)$ for $x,y=1,2,3$. At the top you see the minimum of every column, which is min-y, and on the right side the maximum of every row, which is max-x.</p>

<table>
  <thead>
    <tr>
      <th>y \ min-y</th>
      <th>2</th>
      <th>1</th>
      <th>1</th>
      <th>max-x</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>3</td>
      <td>5</td>
      <td>1</td>
      <td>1</td>
      <td>5</td>
    </tr>
    <tr>
      <td>2</td>
      <td>8</td>
      <td>1</td>
      <td>3</td>
      <td>8</td>
    </tr>
    <tr>
      <td>1</td>
      <td>2</td>
      <td>6</td>
      <td>4</td>
      <td>6</td>
    </tr>
    <tr>
      <td>x</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Let’s prove that <em>every</em> number in the max_x column is greater than <em>any</em> number in the min_y row. For simplicity, every time we write ‘greater’ we mean ‘greater or equal’.</p>

<p>Can it be that a number in max_x is less than some number in min_y?</p>

<p>Let’s say we want to reduce the number 6 in max_x to below the number 2 in min_y. That means we’d need to replace the whole row with ones, so the number in max_x becomes 1; but then the number 2 in min_y also becomes 1, since min_y takes the column minimum.</p>

<p>In general, for every row and column we look at, there is some number at their intersection; call it $a$. The corresponding number in max_x is always greater than $a$, by definition, and the corresponding number in min_y is always lower than $a$, by definition. That’s why, for any row and column we choose, the corresponding number in max_x is always higher than the corresponding number in min_y.</p>

<p>And that means every number in the max_x column is greater (or equal) than <em>all</em> the numbers in the min_y row.</p>
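<p>The argument can be checked directly on the table above (a quick sketch):</p>

```python
# f(x,y) from the table: rows are y = 3, 2, 1 (top to bottom), columns x = 1, 2, 3
rows = [[5, 1, 1], [8, 1, 3], [2, 6, 4]]

row_maxes = [max(r) for r in rows]        # max over x, for each y
col_mins = [min(c) for c in zip(*rows)]   # min over y, for each x

min_max = min(row_maxes)   # min_y max_x f = 5
max_min = max(col_mins)    # max_x min_y f = 2
assert min_max >= max_min
# stronger: every row max dominates every column min
assert all(rm >= cm for rm in row_maxes for cm in col_mins)
```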

<p>If it’s true in general for any two numbers in max_x and min_y, it must be true for the specific number \(min  (max_x)\) and the number \(max (min_y)\), therefore:</p>

\[min_y max_x f(x,y) \geq max_x min_y f(x,y) \hspace{1cm} \square\]]]></content><author><name>Nadav Benedek</name></author><category term="max-min" /><summary type="html"><![CDATA[Min-Max is always greater than Max-Min:]]></summary></entry><entry><title type="html">ChatGPT - How does it work?</title><link href="https://nadavb.com/ChatGPT/" rel="alternate" type="text/html" title="ChatGPT - How does it work?" /><published>2023-01-07T05:20:00+02:00</published><updated>2023-01-07T05:20:00+02:00</updated><id>https://nadavb.com/ChatGPT</id><content type="html" xml:base="https://nadavb.com/ChatGPT/"><![CDATA[<p>ChatGPT, how does it work: <a href="https://www.youtube.com/watch?v=g-jRKS8zZaw">Youtube</a>.</p>]]></content><author><name>Nadav Benedek</name></author><category term="ChatGPT" /><summary type="html"><![CDATA[ChatGPT, how does it work: Youtube.]]></summary></entry><entry><title type="html">Various Tips, Tricks, and Anecdotes for Training Neural Networks</title><link href="https://nadavb.com/Tips-and-tricks-for-training-neural-networks/" rel="alternate" type="text/html" title="Various Tips, Tricks, and Anecdotes for Training Neural Networks" /><published>2022-10-05T11:27:00+03:00</published><updated>2022-10-05T11:27:00+03:00</updated><id>https://nadavb.com/Tips%20and%20tricks%20for%20training%20neural%20networks</id><content type="html" xml:base="https://nadavb.com/Tips-and-tricks-for-training-neural-networks/"><![CDATA[<h4 id="finetuning-a-pretrained-model-architecture-vs-training-from-scratch">Finetuning a pretrained model architecture vs. training from scratch</h4>

<p>I recall a case when I helped with supervised model training. The input was a 32x32 image and the output was one of 7 classes. Our dataset size was around 140 images. We used augmentation heavily.</p>

<p>When we took a resnet32 architecture and trained it from scratch, we got 1.41 test loss and 0.97 training loss.</p>

<p>When we used a CIFAR10-resnet32 pretrained architecture and continued with finetuning, we got 0.51 test loss and 0.09 training loss: a huge improvement. It’s worth mentioning that at this stage we kept the last fully connected layer of CIFAR10-resnet32 intact, even though our dataset had only 7 labels and not 10; this did not matter. In addition, the training converged twice as fast as training from scratch. CIFAR10’s dataset size is 60,000 images, far larger than our 140-image dataset. Therefore, when the dataset is small, one should try using a pretrained model.</p>

<h4 id="replacing-the-last-linear-layer-of-a-classification-model">Replacing the last linear layer of a classification model</h4>

<p>Often, when one takes a model architecture, resnet for example, the best practice is to replace the last linear layer with a new layer that has the number of output classes you need. However, when you just replace a layer, you lose its pretrained weights. Does it matter? Is keeping the last layer’s weights as a starting point important?</p>

<p>Let’s take the previous section problem and dataset and see what happened.</p>

<p>When we used a pretrained CIFAR10-resnet32 with a 10-class output on our 7-class dataset, we got:
0.51 test loss (0.09 train loss)</p>

<p>When we replaced the last layer with a linear layer with a 7-class output, we got:
0.58 test loss (0.25 train loss)</p>

<p>So you can see that the performance is lower.</p>

<p>When we replaced the last layer with a linear layer with a 7-class output while <em>preserving</em> the weights of the relevant neurons, we got:
0.49 test loss (0.09 train loss).
So we even improved the performance a bit, and our model has slightly fewer parameters.</p>

<p>To conclude, in this case, keeping the pretrained model weights, even when we need to change the last layer, is important.</p>]]></content><author><name>Nadav Benedek</name></author><category term="train," /><category term="neural" /><category term="network," /><category term="tips," /><category term="tricks" /><summary type="html"><![CDATA[Finetuning a pretrained model architecture vs. training from scratch]]></summary></entry></feed>