Training a GPT-like model on non-language series [R]
I am responsible for a research project that aims to train a GPT-like model (decoder-only Transformer) in 100M, 250M and 500M parameter variants.
# params
## training dataset
- 750M tokens
- vocabulary is ~15k to ~100k tokens (depends on tokenizer settings)
- ~3% of the vocabulary accounts for ~50% of the training tokens (similar to language, where most of the vocabulary is used very rarely)
## training hyper-params
- optimizer = AdamW
- lr = 1e-3 (works best compared to 1e-2 and 1e-4)
- betas = [0.9, 0.95]
- effective batch size = 4M tokens
- epochs = 16
- warmup steps = ~200 (approx. 1 epoch)
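
For reference, the step counts implied by the numbers above can be worked out with a few lines of arithmetic. This is only a sketch: it assumes "4M tokens" means exactly 4,000,000 tokens per optimizer step and that every epoch covers the full 750M tokens; the variable names are made up for illustration.

```python
# Assumed values taken from the lists above (4M read as 4,000,000).
dataset_tokens = 750_000_000
batch_tokens = 4_000_000
epochs = 16
warmup_steps = 200

# Optimizer steps needed to see the dataset once.
steps_per_epoch = dataset_tokens / batch_tokens

# Optimizer steps over the whole run.
total_steps = steps_per_epoch * epochs

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.3f}")
```

Under these assumptions the whole run is only a few thousand optimizer steps, which is why ~200 warmup steps roughly matches one epoch.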
## model hyper-params
- 16 layers (but variants with up to 48 layers were tested)
- embedding size (n_embd) = adjusted to yield the 100M, 250M and 500M models
- MLP size = 4*n_embd
- 16 attention heads
- context window = 1000
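
To make the link between embedding size and model size concrete, here is a rough sketch of a parameter estimate. It uses the common approximation of about 12 * n_embd^2 weights per Transformer block (4 * n_embd^2 for attention, 8 * n_embd^2 for an MLP of size 4 * n_embd) plus the token embedding, and ignores biases, layer norms and position embeddings. The function names and the tied-embedding default are assumptions for illustration, not the actual setup.

```python
def estimate_params(n_layer, n_embd, vocab_size, tied_embeddings=True):
    """Approximate parameter count of a decoder-only Transformer."""
    # Attention (Q, K, V, output projection): 4 * d^2
    # MLP with hidden size 4 * d (up and down projection): 8 * d^2
    per_block = 12 * n_embd ** 2
    embedding = vocab_size * n_embd
    if not tied_embeddings:
        # A separate output projection doubles the embedding cost.
        embedding *= 2
    return n_layer * per_block + embedding


def find_n_embd(target_params, n_layer, vocab_size, n_head=16):
    """Largest n_embd (multiple of n_head) whose estimate stays within target."""
    n_embd = n_head
    while estimate_params(n_layer, n_embd + n_head, vocab_size) <= target_params:
        n_embd += n_head
    return n_embd


if __name__ == "__main__":
    # Hypothetical vocabulary sizes at both ends of the ~15k to ~100k range.
    for vocab_size in (15_000, 100_000):
        for target in (100e6, 250e6, 500e6):
            d = find_n_embd(target, n_layer=16, vocab_size=vocab_size)
            n = estimate_params(16, d, vocab_size)
            print(f"vocab={vocab_size:>7} target={target:.0e} n_embd={d} params~{n:,}")
```

With a large vocabulary, a noticeable share of the parameter budget goes into the embedding table, so the same nominal model size can correspond to quite different block widths depending on the tokenizer settings.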
# Issue
The model seems to fail to learn basic auto-regressive behavior. It often gets stuck generating a single token (generation currently uses no repetition penalty and no sampling).
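
One possible sanity check for this symptom is to compare the training loss against two simple baselines: the entropy of a uniform guess over the vocabulary and the entropy of the unigram token distribution. A loss that plateaus near the unigram entropy would suggest the model has only picked up token frequencies and is not using context, which would be consistent with repeating one frequent token when decoding without sampling. The sketch below assumes the loss is cross-entropy in nats and that the training data is available as an iterable of token ids; the helper names are hypothetical.

```python
import math
from collections import Counter


def uniform_entropy(vocab_size):
    """Cross-entropy (nats) of guessing uniformly over the vocabulary."""
    return math.log(vocab_size)


def unigram_entropy(token_ids):
    """Entropy (nats) of the empirical token frequency distribution."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy stand-in for the real training tokens (assumption for illustration).
    tokens = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
    print(f"uniform baseline: {uniform_entropy(4):.3f} nats")
    print(f"unigram baseline: {unigram_entropy(tokens):.3f} nats")
```

For a 15k-token vocabulary the uniform baseline is about 9.6 nats; a model that learns real sequential structure should end up clearly below the unigram baseline.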
Is training GPT-like models still black magic? Is there some trick to this?
*Disclaimer*: I will add/edit the parameters above as people ask clarifying questions.