System Info
Hi, I found what looks like a library-wide issue in `transformers` affecting multiple `ForSequenceClassification` models, not just ModernBERT.
If a model is initialized with:

```python
num_labels=1
problem_type="single_label_classification"
```

the forward pass uses `CrossEntropyLoss()` with only one output logit. This leads to a degenerate zero loss during training/evaluation instead of performing binary classification meaningfully.
I first observed this with ModernBertForSequenceClassification, but the same logic appears in other sequence-classification models as well (for example RoBERTa and others using the same loss-selection pattern).
In `modeling_modernbert.py`, the relevant part is:

```python
loss = None
if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
```
With `num_labels=1` and `problem_type="single_label_classification"`, this becomes:

```python
CrossEntropyLoss()(logits.view(-1, 1), labels.view(-1))
```

which produces a degenerate loss: the softmax over a single logit is always 1, so the loss for label 0 is identically zero regardless of the logit value, and label 1 is out of range for a single class.
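The degeneracy is easy to confirm in isolation, without loading any model; a minimal check using only `torch`:

```python
import torch
from torch.nn import CrossEntropyLoss

# With a single class dimension, log_softmax over one logit is always 0,
# so cross-entropy for label 0 is exactly 0 no matter what the logit is.
logits = torch.randn(4, 1)                  # shape [batch, num_labels=1]
labels = torch.zeros(4, dtype=torch.long)   # all examples labeled class 0
loss = CrossEntropyLoss()(logits, labels)
print(loss.item())  # 0.0 for any logits
```

Because the loss is constant, it also carries no useful gradient, so training cannot make progress.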
Why I think this is a bug:
This setup naturally suggests binary classification with labels like:
- 0 -> class 0
- 1 -> class 1
So from a user perspective, this looks like it should be a valid single-label binary classification setup.
Right now, however, `num_labels=1` is effectively treated as if there were only one possible class in the loss computation, which makes the classification loss meaningless.
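Relatedly, the positive class cannot even be expressed in this configuration: with a single class dimension, label `1` is out of range for `CrossEntropyLoss`. A small check (again `torch` only, no model):

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(2, 1)  # num_labels=1
try:
    # Target 1 is out of bounds when there is only one class.
    CrossEntropyLoss()(logits, torch.tensor([1, 0]))
except (IndexError, RuntimeError) as err:
    print(type(err).__name__)
```

So the only labels that "work" are all-zero, and those always yield zero loss.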
Who can help?
No response
Information
Tasks
Reproduction
Minimal reproduction
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=1,
    problem_type="single_label_classification",
)

input_ids = torch.tensor([[101, 102]])
attention_mask = torch.tensor([[1, 1]])
labels = torch.tensor([0])

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels,
)

print(outputs.logits.shape)  # torch.Size([1, 1])
print(outputs.loss)          # degenerate zero loss
```
Observed result
`outputs.loss` is exactly `0`, and the same degenerate behavior also shows up during training.
Expected behavior
I would expect `num_labels=1` with `problem_type="single_label_classification"` to support binary classification meaningfully for labels `{0, 1}`, instead of silently producing a degenerate zero loss.
For example, this could be implemented with a single-logit binary objective such as BCEWithLogitsLoss, or by internally mapping this configuration to an equivalent binary-classification setup.
In any case, the current behavior of silently returning zero loss seems incorrect.
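As a sketch of what the BCE-based option could look like (the helper name `binary_loss` is illustrative only, not actual transformers code):

```python
import torch
from torch.nn import BCEWithLogitsLoss

def binary_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical single-logit binary objective for num_labels=1.

    logits: shape [batch, 1]; labels: integer class ids in {0, 1}.
    """
    return BCEWithLogitsLoss()(logits.view(-1), labels.float().view(-1))

logits = torch.tensor([[2.0], [-1.5]])
labels = torch.tensor([1, 0])
loss = binary_loss(logits, labels)
print(loss.item())  # non-zero, gradient-carrying loss
```

Alternatively, the library could raise a clear error (or internally remap to `num_labels=2`) so the misconfiguration is not silent.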
Actual behavior
The model runs, but training/eval loss becomes degenerate (`0`) because `CrossEntropyLoss` is applied to logits with shape `[..., 1]`.