Training GPT-like model on non-language series [R]

I am responsible for a research project that aims to train a GPT-like model (decoder-only Transformer) in three sizes: 100M, 250M and 500M parameters.

# params

## training dataset

- 750M tokens

- vocabulary is ~15k to ~100k tokens (depends on tokenizer settings)

- ~3% of the vocabulary is used in ~50% of the training tokens (similar to language, where most of the vocabulary is used very sparsely)
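
For anyone who wants to reproduce the skew statistic on their own data, here is a minimal sketch of how such a number could be computed. The function name `vocab_share_for_coverage` and the toy token list are made up for illustration; the real measurement would run over the tokenized training set.

```python
from collections import Counter

def vocab_share_for_coverage(token_ids, vocab_size, coverage=0.5):
    """Fraction of the vocabulary needed to account for `coverage` of all tokens."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    covered = 0
    used_types = 0
    # Walk token types from most to least frequent until the target is reached.
    for _, count in counts.most_common():
        covered += count
        used_types += 1
        if covered >= coverage * total:
            break
    return used_types / vocab_size

# Toy data (hypothetical): token 0 alone makes up half of the 100 tokens.
toy_tokens = [0] * 50 + [1] * 30 + [2] * 10 + list(range(3, 13))
share = vocab_share_for_coverage(toy_tokens, vocab_size=100)
```

A result near 0.03 at `coverage=0.5` would match the figure quoted above.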

## training hyper-params

- optimizer = AdamW

- lr = 1e-3 (works best compared to 1e-2 and 1e-4)

- betas = [0.9, 0.95]

- effective batch size = 4M tokens

- epochs = 16

- warmup steps ~200 (approx 1 epoch)
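
As a sanity check on the numbers above, the step counts can be worked out directly. This is only a sketch: the linear warmup shape and the constant learning rate after warmup are assumptions, since no schedule is stated here, and `lr_at` is a hypothetical helper.

```python
import math

# Values taken from the lists above.
dataset_tokens = 750_000_000
batch_tokens = 4_000_000
epochs = 16
warmup_steps = 200
peak_lr = 1e-3

# Optimizer steps per pass over the data, and over the whole run.
steps_per_epoch = dataset_tokens / batch_tokens
total_steps = math.ceil(steps_per_epoch * epochs)

def lr_at(step):
    """Linear warmup to peak_lr, then constant (both shapes are assumptions)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

With these values one epoch is 187.5 optimizer steps, so ~200 warmup steps is indeed roughly one epoch, and the whole run is about 3000 steps.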

## model hyper-params

- 16 layers (but variants with up to 48 layers were tested)

- embedding size (n_embd) = adjusted to yield the 100M, 250M and 500M models

- MLP size = 4*n_embd

- 16 attention heads

- context window = 1000
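
To make the "embedding size is adjusted to hit the parameter budget" part concrete, here is a rough sketch using the common 12 * n_layer * n_embd^2 approximation for a decoder block with a 4x MLP. It ignores biases and layer norms, and it assumes learned positional embeddings and an output head tied to the token embedding; none of that is confirmed above. `approx_params` and `width_for_budget` are hypothetical helpers.

```python
def approx_params(n_embd, n_layer=16, vocab_size=15_000, n_ctx=1000):
    """Rough parameter count: attention (4*d^2) + 4x MLP (8*d^2) per layer, plus embeddings."""
    per_layer = 12 * n_embd ** 2
    # Token + learned positional embeddings; tied output head assumed.
    embeddings = (vocab_size + n_ctx) * n_embd
    return n_layer * per_layer + embeddings

def width_for_budget(target, n_head=16, **kwargs):
    """Largest n_embd that is a multiple of n_head and stays within `target` parameters."""
    n_embd = n_head
    while approx_params(n_embd + n_head, **kwargs) <= target:
        n_embd += n_head
    return n_embd

width_100m = width_for_budget(100_000_000)
```

Note that the embedding table grows linearly with the vocabulary, so at the ~100k end of the vocabulary range a larger share of the budget goes to embeddings than at ~15k.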

# Issue

The model seems to fail to learn basic auto-regressive behavior. During generation it often gets stuck emitting a single token repeatedly (no repetition penalty, no sampling yet).
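
One way to put a number on "fails to learn" is to compare the training loss against two context-free baselines: the loss of a uniform guess over the vocabulary, and the entropy of the unigram token distribution. The sketch below is a possible diagnostic, not a confirmed cause; the function names are hypothetical and the losses are in nats, matching a standard cross-entropy loss.

```python
import math
from collections import Counter

def uniform_loss_nats(vocab_size):
    """Cross-entropy of guessing uniformly over the vocabulary."""
    return math.log(vocab_size)

def unigram_entropy_nats(token_ids):
    """Entropy of the marginal token distribution, i.e. the best loss
    a model can reach if it ignores context entirely."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy example (hypothetical data): two equally frequent tokens.
toy_entropy = unigram_entropy_nats([0, 0, 1, 1])
```

If the training loss plateaus close to the unigram entropy, the model is likely predicting token frequencies regardless of context. Decoding without sampling would then pick the most frequent token at every step, which would look like the single-token behavior described above.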

Is training GPT-like models still black magic? Is there some trick to this?

*Disclaimer*: I will add/edit the parameters above as people ask clarifying questions.

submitted by /u/gartin336