How to use UMCU/CardioBERTa_base.nl with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="UMCU/CardioBERTa_base.nl")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("UMCU/CardioBERTa_base.nl")
model = AutoModelForMaskedLM.from_pretrained("UMCU/CardioBERTa_base.nl")
```
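The snippet above only constructs the pipeline. As a minimal sketch of querying it, the helper below inserts the tokenizer's own mask token into a template sentence and ranks the returned candidates. The function names `format_predictions` and `fill_mask_demo` and the Dutch example sentence are illustrative assumptions, not part of the model card; the output keys `token_str` and `score` are those returned by the Transformers fill-mask pipeline.

```python
def format_predictions(predictions):
    """Turn fill-mask output dicts into (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked]


def fill_mask_demo(template, top_k=5):
    """Fill the {mask} placeholder in `template` and return ranked candidates.

    Hypothetical helper: downloads the model on first use.
    """
    # Imported lazily so format_predictions works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("fill-mask", model="UMCU/CardioBERTa_base.nl")
    # Use the tokenizer's mask token instead of hard-coding "<mask>".
    text = template.format(mask=pipe.tokenizer.mask_token)
    return format_predictions(pipe(text, top_k=top_k))
```

Calling `fill_mask_demo("De patiënt werd opgenomen met acuut {mask}.")` (roughly "The patient was admitted with acute ...") would return up to five candidate tokens with their scores; the actual candidates depend on the model and are not shown here.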
CardioBERTa_base.nl is the result of continued, off-premise pre-training of MedRoBERTa.nl on about 50 GB of open Dutch and translated English corpora.
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All English sources not translated with DeepL were translated with a combination of Gemini 1.5 Flash / GPT-4o mini, MarianMT, and NLLB-200.
- Number of tokens: 15B
- Number of documents: 27M
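For a rough sense of scale, the two counts above imply an average document length. The sketch below is plain arithmetic on the rounded figures reported here, so the result is only approximate.

```python
num_tokens = 15_000_000_000   # 15B tokens, as reported above
num_documents = 27_000_000    # 27M documents, as reported above

# Mean document length implied by the two rounded totals.
avg_tokens_per_document = num_tokens / num_documents
print(f"Average tokens per document: {avg_tokens_per_document:.0f}")  # prints 556
```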
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning rate schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3
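To make the learning rate settings above concrete, here is a minimal sketch of a linear schedule with warmup. It assumes the common shape (linear rise to the peak rate, then linear decay to zero), as implemented for example by `get_linear_schedule_with_warmup` in Transformers; the model card only says "linear", so the decay-to-zero part is an assumption. The total number of training steps is not reported, so `total_steps=100_000` is a hypothetical placeholder.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5_000, total_steps=100_000):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0.

    peak_lr and warmup_steps come from the settings listed above;
    total_steps is a hypothetical value chosen only for illustration.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Sample the schedule at a few points.
for s in (0, 2_500, 5_000, 52_500, 100_000):
    print(s, linear_schedule_lr(s))
```

Under these assumptions the rate reaches 2e-4 at step 5,000, is halfway down at the midpoint of the decay phase, and hits zero at the final step.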
Train perplexity: 3.0
Validation perplexity: 3.0
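The card does not state how perplexity was computed. Assuming the usual definition, the exponential of the mean cross-entropy loss (in nats) over masked tokens, the sketch below converts between the two quantities. The function names are illustrative.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity as exp of the mean cross-entropy loss (natural log base)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    """Inverse mapping: mean cross-entropy loss implied by a perplexity."""
    return math.log(perplexity)


# A perplexity of 3.0 corresponds to a mean loss of ln(3), about 1.0986 nats.
implied_loss = loss_from_perplexity(3.0)
print(f"{implied_loss:.4f}")
```

Under this assumption, equal train and validation perplexity means equal mean loss on both splits.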
Acknowledgement
This work was done together with the Amsterdam UMC, in the context of the DataTools4Heart project.
We were happy to be able to use the Google TPU Research Cloud for training the model.
Base model: CLTL/MedRoBERTa.nl