System Info
Hi, I found what looks like a library-wide issue in `transformers` affecting multiple `ForSequenceClassification` models, not just ModernBERT.
If a model is initialized with:

```python
num_labels=1
problem_type="single_label_classification"
```

the forward pass uses `CrossEntropyLoss()` with only one output logit. This leads to a degenerate zero loss during training/evaluation instead of performing binary classification meaningfully.
I first observed this with ModernBertForSequenceClassification, but the same logic appears in other sequence-classification models as well (for example RoBERTa and others using the same loss-selection pattern).
In `modeling_modernbert.py`, the relevant part is:

```python
loss = None
if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
```
With `num_labels=1` and `problem_type="single_label_classification"`, this becomes:

```python
CrossEntropyLoss()(logits.view(-1, 1), labels.view(-1))
```

which produces a degenerate loss: the softmax over a single logit is always 1, so the loss for label 0 is identically zero regardless of the logit value, and label 1 is out of range for a single class.
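The degeneracy is easy to confirm in isolation, without loading any model; a minimal check using only `torch`:

```python
import torch
from torch.nn import CrossEntropyLoss

# With a single class dimension, log_softmax over one logit is always 0,
# so cross-entropy for label 0 is exactly 0 no matter what the logit is.
logits = torch.randn(4, 1)                  # shape [batch, num_labels=1]
labels = torch.zeros(4, dtype=torch.long)   # all examples labeled class 0
loss = CrossEntropyLoss()(logits, labels)
print(loss.item())  # 0.0 for any logits
```

Because the loss is constant, it also carries no useful gradient, so training cannot make progress.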
Why I think this is a bug:
This setup naturally suggests binary classification with labels like:
- 0 -> class 0
- 1 -> class 1
So from a user perspective, this looks like it should be a valid single-label binary classification setup.
Right now, however, `num_labels=1` is effectively treated as if there were only one possible class in the loss computation, which makes the classification loss meaningless.
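Relatedly, the positive class cannot even be expressed in this configuration: with a single class dimension, label `1` is out of range for `CrossEntropyLoss`. A small check (again `torch` only, no model):

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(2, 1)  # num_labels=1
try:
    # Target 1 is out of bounds when there is only one class.
    CrossEntropyLoss()(logits, torch.tensor([1, 0]))
except (IndexError, RuntimeError) as err:
    print(type(err).__name__)
```

So the only labels that "work" are all-zero, and those always yield zero loss.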
Who can help?
No response
Information
Tasks
Reproduction
Minimal reproduction
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=1,
    problem_type="single_label_classification",
)

input_ids = torch.tensor([[101, 102]])
attention_mask = torch.tensor([[1, 1]])
labels = torch.tensor([0])

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels,
)

print(outputs.logits.shape)  # torch.Size([1, 1])
print(outputs.loss)          # degenerate zero loss
```
Observed result
`outputs.loss` is exactly `0`, and the same degenerate behavior also shows up during training.
Expected behavior
I would expect `num_labels=1` with `problem_type="single_label_classification"` to support binary classification meaningfully for labels `{0, 1}`, instead of silently producing a degenerate zero loss.
For example, this could be implemented with a single-logit binary objective such as BCEWithLogitsLoss, or by internally mapping this configuration to an equivalent binary-classification setup.
In any case, the current behavior of silently returning zero loss seems incorrect.
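As a sketch of what the BCE-based option could look like (the helper name `binary_loss` is illustrative only, not actual transformers code):

```python
import torch
from torch.nn import BCEWithLogitsLoss

def binary_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical single-logit binary objective for num_labels=1.

    logits: shape [batch, 1]; labels: integer class ids in {0, 1}.
    """
    return BCEWithLogitsLoss()(logits.view(-1), labels.float().view(-1))

logits = torch.tensor([[2.0], [-1.5]])
labels = torch.tensor([1, 0])
loss = binary_loss(logits, labels)
print(loss.item())  # non-zero, gradient-carrying loss
```

Alternatively, the library could raise a clear error (or internally remap to `num_labels=2`) so the misconfiguration is not silent.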
Actual behavior
The model runs, but training/eval loss becomes degenerate (`0`) because `CrossEntropyLoss` is applied to logits with shape `[..., 1]`.