Add NeoBERT#187

Merged
taylormjs merged 21 commits into main from k/neobert on Sep 3, 2025

Conversation

@karinazad (Collaborator) commented Aug 27, 2025

Description

Adds NeoBERT from https://huggingface.co/chandar-lab/NeoBERT/tree/main with a few changes:

  • removes the xformers dependency: it is needed only for SwiGLU, the plain torch version does not seem much slower (see performance of swiglu operator, facebookresearch/xformers#734), and keeping xformers would prevent inference on CPU
  • uses a custom masking function instead of transformers's collator, which expects a tokenizer and pulls in a lot of unnecessary code
  • disables packing for now, since it's not clear to me how they handled removing padding tokens. I opened an issue (How is unpadding handled when unpacking? chandar-lab/NeoBERT#7), but they don't seem to check issues often. For now, let's get the model running without packing so we have a baseline to compare against, even if training is slower
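The first bullet (replacing xformers' fused SwiGLU with plain torch ops) can be sketched roughly as follows; the module and projection names here are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block in plain PyTorch (no xformers, so it runs on CPU)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Gate and up projections fused into a single matmul, split in forward().
        self.w12 = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.w3 = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.w12(x).chunk(2, dim=-1)
        return self.w3(F.silu(gate) * up)

x = torch.randn(2, 8, 768)
out = SwiGLU(768, 3072)(x)  # output has the same shape as the input
```

The linked xformers issue (facebookresearch/xformers#734) suggests the fused kernel's speedup is modest, which is the trade-off accepted here.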

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

@karinazad changed the title from [Draft] Add NeoBERT to Add NeoBERT on Aug 28, 2025

@taylormjs taylormjs left a comment


Looks great!

num_hidden_layers: int = 28,
num_attention_heads: int = 12,
intermediate_size: int = 3072,
embedding_init_range: float = 0.02,

Yeah, UME-medium is probably the best model to compare with. Good defaults.
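For context, the defaults visible in this hunk could be collected in a config object along these lines; the class name is hypothetical, and only the four fields shown in the diff are taken from the PR:

```python
from dataclasses import dataclass

@dataclass
class NeoBERTConfig:  # hypothetical name, for illustration only
    # The four defaults below mirror the values shown in the diff above.
    num_hidden_layers: int = 28
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    embedding_init_range: float = 0.02

cfg = NeoBERTConfig()
```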

    "labels": labels,
    "attention_mask": attention_mask.to(torch.bool),
}
else:

Q: are the ModernBERT base checkpoints trained without packing? Just noting this so we have the fairest comparison.

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        module.weight.data.uniform_(-self.config.decoder_init_range, self.config.decoder_init_range)
    elif isinstance(module, nn.Embedding):

Oh interesting that uniform initialization is used
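A self-contained sketch of the uniform initialization discussed here, assuming a flat `init_range` constant in place of the PR's `self.config.decoder_init_range`:

```python
import torch
import torch.nn as nn

init_range = 0.02  # stand-in for config.decoder_init_range

def init_weights(module: nn.Module) -> None:
    # Uniform init in [-init_range, init_range] rather than the more common
    # (truncated) normal; biases are zeroed.
    if isinstance(module, nn.Linear):
        module.weight.data.uniform_(-init_range, init_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.uniform_(-init_range, init_range)

layer = nn.Linear(4, 4)
init_weights(layer)
```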

Comment on lines +15 to +16
class NeoBERTLightningModule(LightningModule):
    def __init__(

Nice, good call having a separate LightningModule and nn.Module

@taylormjs taylormjs merged commit ff98e52 into main Sep 3, 2025
4 checks passed
@taylormjs taylormjs deleted the k/neobert branch September 3, 2025 05:03