❓ Questions & Help
Details
transformers4rec==0.1.13
I've been trying out t4r for a while, and I decided to try to replicate the evaluation metrics on my own by performing offline predictions.
I've been using a model architecture similar to the one in the provided examples, with similar features.
Model:
```python
import transformers4rec.torch as tr
from transformers4rec.torch.ranking_metric import (
    NDCGAt, RecallAt, AvgPrecisionAt, PrecisionAt
)
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=36,
    aggregation='concat',
    continuous_projection=64,
    d_output=64,
    masking="mlm",
)
# Define XLNetConfig class and set default parameters for HF XLNet config
transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=36
)
# Define the model block including: inputs, masking, projection and transformer block
body = tr.SequentialBlock(
    inputs,
    tr.MLPBlock([64]),
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)
# Define the evaluation top-N metrics and the cut-offs
metrics = [
    NDCGAt(top_ks=[5, 10, 20], labels_onehot=True),
    RecallAt(top_ks=[5, 10, 20], labels_onehot=True),
    AvgPrecisionAt(top_ks=[5, 10, 20], labels_onehot=True),
    PrecisionAt(top_ks=[5, 10, 20], labels_onehot=True),
]
# Define a head for the next-item prediction task
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, hf_format=True, metrics=metrics),
    inputs=inputs,
)
# Get the end-to-end Model class
model = tr.Model(head)

train_args = T4RecTrainingArguments(
    data_loader_engine='nvtabular',
    dataloader_drop_last=False,
    gradient_accumulation_steps=1,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=32,
    output_dir="./tmp",
    learning_rate=0.0005,
    lr_scheduler_type='cosine',
    num_train_epochs=5,
    max_sequence_length=36,
    report_to=[],
    logging_steps=200,
    no_cuda=False,
)
trainer = Trainer(
    model=model,
    args=train_args,
    schema=schema,
    compute_metrics=True,
)
```
When I evaluate the metrics by using:
```python
trainer.eval_dataset_or_path = 'eval.parquet'
train_metrics = trainer.evaluate(metric_key_prefix='eval')
```
I get the following recall:
```
'eval_/next-item/recall_at_10': 0.1004810556769371
```
I wrote some code to replicate the results and validate that the evaluation metrics returned by the model were correct.
(As the model was already trained, I moved everything from GPU to CPU.)
The following script transforms my data from a pd.DataFrame into a dict in the format the t4r model expects, and also extracts the labels (the last product of each sequence).
For the moment I'm not removing the last item from the sessions in my dataset, as I know `model(data, training=False)` does that for me.
```python
import numpy as np
import pandas as pd
import torch

# Load data
prediction_data = pd.read_parquet('eval.parquet')

# Create labels: the last non-padded item of each session
prods_arr = np.stack(prediction_data.products_padded)
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
labels = np.array([prods_arr[n, idx] for n, idx in enumerate(last_item_idx)])

# Transform data to a dict of PyTorch tensors
pred_dtypes = prediction_data.applymap(lambda x: x[0]).dtypes
batch_pred = {}
for col, dtype in pred_dtypes.items():
    if dtype == 'float64':
        tensor = np.stack(prediction_data[col]).astype(np.float32)
    else:
        tensor = np.stack(prediction_data[col])
    tensor = torch.from_numpy(tensor)
    batch_pred[col] = tensor.cpu()
```
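(As an aside, the per-row label extraction can also be done in one vectorized step; a minimal sketch with made-up data, assuming sequences are right-padded with zeros as in my dataset:)

```python
import numpy as np

# Toy padded item sequences (0 = padding), same layout as products_padded
prods_arr = np.array([
    [3, 7, 5, 0, 0],
    [2, 9, 0, 0, 0],
])

# Index of the last non-zero item in each row
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1

# Vectorized equivalent of the per-row list comprehension
labels = np.take_along_axis(prods_arr, last_item_idx[:, None], axis=1).ravel()
print(labels)  # → [5 9]
```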
I also wrote a function to evaluate the recall on my own:
```python
def recall(predicted_items: np.ndarray, real_items: np.ndarray) -> float:
    recalls = np.zeros(len(predicted_items), dtype=np.float64)
    for idx, (real, pred) in enumerate(zip(real_items, predicted_items)):
        real = real[real > 0]  # drop padding
        pred = pred[pred > 0]
        real_found_in_pred = np.isin(pred, real, assume_unique=True)
        if real_found_in_pred.any():
            recalls[idx] = real_found_in_pred.sum() / len(real)
    return recalls.mean()
```
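As a quick sanity check of the `np.isin` logic the function relies on, here is a toy session with one label (values are made up):

```python
import numpy as np

real = np.array([5])        # ground-truth item(s) for one session
pred = np.array([1, 5, 3])  # hypothetical top-k predictions

# True wherever a predicted item appears among the real items
hits = np.isin(pred, real, assume_unique=True)
print(hits.any(), hits.sum() / len(real))  # → True 1.0
```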
I performed offline predictions by using the following code:
```python
predictions = model_cpu(batch_pred, training=False)['predictions']
_, topk_pred = torch.topk(predictions, k=10)
# topk returns indices in descending score order; flipping to ascending
# order does not affect the set-based recall computed here
topk_pred = topk_pred.flip(dims=(1,))
```
Then I evaluated with my own function:
```python
recall(topk_pred, labels)
```
which returns a very similar result:
```
Recall@10
t4r trainer eval metric result: 0.1004810556769371
my own metric result:           0.1004810550781038
```
BUT when I run the predictions "manually", masking the last item of each session and passing `ignore_masking=True`, I get an entirely different result.
I re-ran my script for label extraction and DataFrame-to-dict conversion, but this time I also masked the last item of each session:
```python
import numpy as np
import pandas as pd
import torch

# Load data
prediction_data = pd.read_parquet('eval.parquet')

# Create labels: the last non-padded item of each session
prods_arr = np.stack(prediction_data.products_padded)
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
labels = np.array([prods_arr[n, idx] for n, idx in enumerate(last_item_idx)])

# Mask (zero out) the last item of each session, in every feature column
for n, idx in enumerate(last_item_idx):
    for col_nbr in range(prediction_data.shape[1]):
        arr = prediction_data.iloc[n, col_nbr].copy()
        arr[idx] = 0
        prediction_data.iloc[n, col_nbr] = arr

# Transform data to a dict of PyTorch tensors
pred_dtypes = prediction_data.applymap(lambda x: x[0]).dtypes
batch_pred = {}
for col, dtype in pred_dtypes.items():
    if dtype == 'float64':
        tensor = np.stack(prediction_data[col]).astype(np.float32)
    else:
        tensor = np.stack(prediction_data[col])
    tensor = torch.from_numpy(tensor)
    batch_pred[col] = tensor.cpu()
```
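(The cell-by-cell masking loop can also be vectorized per column; a sketch on a toy one-column frame with made-up data, assuming every column holds equal-length right-padded arrays:)

```python
import numpy as np
import pandas as pd

# Toy frame with one column of right-padded sequences (hypothetical data)
prediction_data = pd.DataFrame({
    "products_padded": [np.array([3, 7, 5, 0]), np.array([2, 9, 0, 0])],
})

prods_arr = np.stack(prediction_data["products_padded"])
last_item_idx = np.count_nonzero(prods_arr, axis=1) - 1
rows = np.arange(len(prods_arr))

# Zero out the last real item of every session in one shot, per column
for col in prediction_data.columns:
    arr = np.stack(prediction_data[col]).copy()
    arr[rows, last_item_idx] = 0
    prediction_data[col] = list(arr)

print(np.stack(prediction_data["products_padded"]))
# → [[3 7 0 0]
#    [2 0 0 0]]
```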
I verified that the masking was performed correctly.

I re-ran the inference phase with `ignore_masking=True`:
```python
model_results = model_cpu(batch_pred, training=False, ignore_masking=True)
predictions = model_results['predictions']
_, topk_pred = torch.topk(predictions, k=10)
topk_pred = topk_pred.flip(dims=(1,))
```
and got different, disappointing results:
```
Recall@10
t4r trainer eval metric result: 0.1004810556769371
my own metric result:           0.1004810550781038
my own (ignore_masking=True):   0.0686179125452678
```
I can't figure out whether I'm missing something or did something wrong, but so far I can't find anything on my side.
It would be a great help if someone could look into this issue and try to replicate the experiment.
Thanks in advance to anyone willing to look into it.