UMCU/CardioBERTa_base.nl


Continued pre-training of MedRoBERTa.nl, performed off-premise, on about 50 GB of open Dutch and translated English corpora.
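A minimal usage sketch, assuming the repository ships a tokenizer and a masked-language-modelling head that the `transformers` fill-mask pipeline can load (the card does not state this explicitly). The helper name `top_fill_mask` and the `{mask}` placeholder are hypothetical and only serve the illustration.

```python
MODEL_ID = "UMCU/CardioBERTa_base.nl"


def top_fill_mask(text_with_blank: str, top_k: int = 5):
    """Return (token, score) candidates for the blank written as '{mask}'."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = text_with_blank.replace("{mask}", fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]


# Example call (downloads the model weights on first use):
# top_fill_mask("De patiënt werd opgenomen met pijn op de {mask}.")
```

Since this is a base model, it would normally be fine-tuned on a downstream task before use; the fill-mask call is only a quick sanity check.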

Data statistics

Sources:

Dutch: medical guidelines (FMS, NHG)
Dutch: NtvG papers
English: PubMed abstracts
English: PMC abstracts translated using DeepL
English: Apollo guidelines, papers and books
English: Meditron guidelines
English: MIMIC-III
English: MIMIC-CXR
English: MIMIC-IV

All English sources that were not translated with DeepL were translated to Dutch with a combination of Gemini Flash 1.5, GPT-4o mini, MarianMT, and NLLB-200.

Number of tokens: 15B
Number of documents: 27M
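As a rough back-of-the-envelope check, the two totals above imply an average document length, and together with the roughly 3 epochs reported under Training, a total token budget. The sketch below treats the rounded figures from this card as exact, which is an assumption.

```python
NUM_TOKENS = 15e9      # "15B" from the card
NUM_DOCUMENTS = 27e6   # "27M" from the card
NUM_EPOCHS = 3         # the card reports "~3"

avg_tokens_per_doc = NUM_TOKENS / NUM_DOCUMENTS
tokens_seen = NUM_TOKENS * NUM_EPOCHS

print(f"average tokens per document: {avg_tokens_per_doc:.0f}")   # 556
print(f"tokens seen over ~3 epochs: {tokens_seen / 1e9:.0f}B")    # 45B
```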

Training

Effective batch size: 5120
Learning rate: 2e-4
Weight decay: 1e-3
Learning rate schedule: linear, with 5,000 warmup steps
Num epochs: ~3
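The schedule above can be sketched as a small function. This assumes the common convention of a linear ramp from 0 to the peak rate during warmup, followed by a linear decay to 0 at the final step; the card does not say what the rate decays to. `total_steps` is a hypothetical parameter, since the total number of optimizer steps is not reported.

```python
PEAK_LR = 2e-4        # learning rate from the card
WARMUP_STEPS = 5_000  # warmup steps from the card


def linear_schedule_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: fall linearly from PEAK_LR to 0 at total_steps (assumed convention).
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - WARMUP_STEPS)
```

For example, `linear_schedule_lr(2_500, 100_000)` is half the peak rate, and the rate reaches zero at `step == total_steps`.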

Train perplexity: 3.0
Validation perplexity: 3.0
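To relate these numbers to a training loss: perplexity is conventionally the exponential of the mean cross-entropy per predicted token, measured in nats. Assuming that convention applies here (the card does not say how perplexity was computed), a perplexity of 3.0 corresponds to a loss of about 1.10.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity for a mean negative log-likelihood given in nats."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean negative log-likelihood in nats."""
    return math.log(ppl)


loss = loss_from_perplexity(3.0)
print(f"{loss:.3f}")  # 1.099
```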

Acknowledgement

This work was done together with the Amsterdam UMC, in the context of the DataTools4Heart project.

We were happy to be able to use the Google TPU Research Cloud for training the model.


Model size: 0.2B params
Tensor type: F32

Base model: CLTL/MedRoBERTa.nl
