❓ Questions & Help
Details
transformers4rec==0.1.13
I've been trying out t4r for a while, and I decided to try to replicate the evaluation metrics on my own by performing offline predictions.
I've been using a model architecture similar to the one in the provided examples, with similar features.
Model:
```python
import transformers4rec.torch as tr
from transformers4rec.torch.ranking_metric import (
    NDCGAt, RecallAt, AvgPrecisionAt, PrecisionAt
)
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=36,
    aggregation='concat',
    continuous_projection=64,
    d_output=64,
    masking="mlm",
)
# Define XLNetConfig class and set default parameters for HF XLNet config
transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=36
)
# Define the model block including: inputs, masking, projection and transformer block
body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock([64]),
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)
# Define the evaluation top-N metrics and the cut-offs
metrics = [
    NDCGAt(top_ks=[5, 10, 20], labels_onehot=True),
    RecallAt(top_ks=[5, 10, 20], labels_onehot=True),
    AvgPrecisionAt(top_ks=[5, 10, 20], labels_onehot=True),
    PrecisionAt(top_ks=[5, 10, 20], labels_onehot=True),
]
# Define a head for the next-item prediction task
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, hf_format=True, metrics=metrics),
    inputs=inputs,
)
# Get the end-to-end Model class
model = tr.Model(head)

train_args = T4RecTrainingArguments(
    data_loader_engine='nvtabular',
    dataloader_drop_last=False,
    gradient_accumulation_steps=1,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=32,
    output_dir="./tmp",
    learning_rate=0.0005,
    lr_scheduler_type='cosine',
    num_train_epochs=5,
    max_sequence_length=36,
    report_to=[],
    logging_steps=200,
    no_cuda=False,
)
trainer = Trainer(
    model=model,
    args=train_args,
    schema=schema,
    compute_metrics=True,
)
```
When I evaluate the metrics by using:
```python
trainer.eval_dataset_or_path = 'eval.parquet'
train_metrics = trainer.evaluate(metric_key_prefix='eval')
```
I get the following recall:
```
'eval_/next-item/recall_at_10': 0.1004810556769371
```
I wrote some code to replicate the results and validate that the evaluation metrics returned by the model were correct.
(As the model was already trained, I moved everything from GPU to CPU.)
The following script transforms my data from a pd.DataFrame into a dict in the format the t4r model expects, and also extracts the labels (the last product of each sequence).
For the moment I'm not removing the last item from the sessions in my dataset, as I know `model(data, training=False)` does that for me.
```python
import numpy as np
import pandas as pd
import torch

# Load data
prediction_data = pd.read_parquet('eval.parquet')

# Create labels: the last non-padded item of each session
prods_arr = np.stack(prediction_data.products_padded)
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
labels = np.array([prods_arr[n, idx] for n, idx in enumerate(last_item_idx)])

# Transform data to a dict of PyTorch tensors
pred_dtypes = prediction_data.applymap(lambda x: x[0]).dtypes
batch_pred = {}
for col, dtype in pred_dtypes.items():
    if dtype == 'float64':
        tensor = np.stack(prediction_data[col]).astype(np.float32)
    else:
        tensor = np.stack(prediction_data[col])
    tensor = torch.from_numpy(tensor)
    batch_pred[col] = tensor.cpu()
```
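(As an aside, the per-row label extraction can also be done in one vectorized step; a minimal sketch with made-up data, assuming sequences are right-padded with zeros as in my dataset:)

```python
import numpy as np

# Toy padded item sequences (0 = padding), same layout as products_padded
prods_arr = np.array([
    [3, 7, 5, 0, 0],
    [2, 9, 0, 0, 0],
])

# Index of the last non-zero item in each row
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1

# Vectorized equivalent of the per-row list comprehension
labels = np.take_along_axis(prods_arr, last_item_idx[:, None], axis=1).ravel()
print(labels)  # → [5 9]
```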
I also wrote a function to evaluate the recall on my own:
```python
def recall(predicted_items: np.ndarray, real_items: np.ndarray) -> float:
    recalls = np.zeros(len(predicted_items), dtype=np.float64)
    for idx, (real, pred) in enumerate(zip(real_items, predicted_items)):
        real = real[real > 0]  # drop padding
        pred = pred[pred > 0]
        real_found_in_pred = np.isin(pred, real, assume_unique=True)
        if real_found_in_pred.any():
            recalls[idx] = real_found_in_pred.sum() / len(real)
    return recalls.mean()
```
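As a quick sanity check of the `np.isin` logic the function relies on, here is a toy session with one label (values are made up):

```python
import numpy as np

real = np.array([5])        # ground-truth item(s) for one session
pred = np.array([1, 5, 3])  # hypothetical top-k predictions

# True wherever a predicted item appears among the real items
hits = np.isin(pred, real, assume_unique=True)
print(hits.any(), hits.sum() / len(real))  # → True 1.0
```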
I performed offline predictions by using the following code:
```python
predictions = model_cpu(batch_pred, training=False)['predictions']
_, topk_pred = torch.topk(predictions, k=10)
# topk returns indices in descending score order; flipping to ascending
# order does not affect the set-based recall computed here
topk_pred = topk_pred.flip(dims=(1,))
```
Then I evaluated with my own function:
```python
recall(topk_pred, labels)
```
which returns a very similar result:
```
Recall@10
t4r trainer eval metric result: 0.1004810556769371
my own metric result:           0.1004810550781038
```
BUT when I run the predictions "manually", masking the last item of each session and passing `ignore_masking=True`, I get an entirely different result.
I re-ran my script for label extraction and DataFrame-to-dict conversion, but this time I also masked the last item of each session:
```python
import numpy as np
import pandas as pd
import torch

# Load data
prediction_data = pd.read_parquet('eval.parquet')

# Create labels: the last non-padded item of each session
prods_arr = np.stack(prediction_data.products_padded)
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
labels = np.array([prods_arr[n, idx] for n, idx in enumerate(last_item_idx)])

# Mask (zero out) the last item of each session, in every feature column
for n, idx in enumerate(last_item_idx):
    for col_nbr in range(prediction_data.shape[1]):
        arr = prediction_data.iloc[n, col_nbr].copy()
        arr[idx] = 0
        prediction_data.iloc[n, col_nbr] = arr

# Transform data to a dict of PyTorch tensors
pred_dtypes = prediction_data.applymap(lambda x: x[0]).dtypes
batch_pred = {}
for col, dtype in pred_dtypes.items():
    if dtype == 'float64':
        tensor = np.stack(prediction_data[col]).astype(np.float32)
    else:
        tensor = np.stack(prediction_data[col])
    tensor = torch.from_numpy(tensor)
    batch_pred[col] = tensor.cpu()
```
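(The cell-by-cell masking loop can also be vectorized per column; a sketch on a toy one-column frame with made-up data, assuming every column holds equal-length right-padded arrays:)

```python
import numpy as np
import pandas as pd

# Toy frame with one column of right-padded sequences (hypothetical data)
prediction_data = pd.DataFrame({
    "products_padded": [np.array([3, 7, 5, 0]), np.array([2, 9, 0, 0])],
})

prods_arr = np.stack(prediction_data["products_padded"])
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
rows = np.arange(len(prods_arr))

# Zero out the last real item of every session in one shot, per column
for col in prediction_data.columns:
    arr = np.stack(prediction_data[col]).copy()
    arr[rows, last_item_idx] = 0
    prediction_data[col] = list(arr)

print(np.stack(prediction_data["products_padded"]))
# → [[3 7 0 0]
#    [2 0 0 0]]
```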
I verified that the masking was performed correctly.

I re-ran the inference phase with `ignore_masking=True`:
```python
model_results = model_cpu(batch_pred, training=False, ignore_masking=True)
predictions = model_results['predictions']
_, topk_pred = torch.topk(predictions, k=10)
topk_pred = topk_pred.flip(dims=(1,))
```
and got different, disappointing results:
```
Recall@10
t4r trainer eval metric result: 0.1004810556769371
my own metric result:           0.1004810550781038
my own (ignore_masking=True):   0.0686179125452678
```
I can't figure out whether I'm missing something or did something wrong, but so far I can't find anything on my side.
It would be a great help if someone could look into this issue and try to replicate the experiment.
Thanks in advance to anyone willing to look into it.