From Machine Learning

Training a GPT-like model on non-language series [R]

I am responsible for a research project that aims to train a GPT-like model (Transformer decoder) in 100M, 250M, and 500M parameter variants.

# params

## training dataset

- 750M tokens

- vocabulary is ~15k to ~100k tokens (depends on tokenizer settings)

- ~3% of the vocabulary is used in ~50% of the training tokens (similar to language, where most of the vocabulary is used very sparsely)
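The skew described in the last bullet can be measured directly from token counts. Below is a minimal sketch, assuming the tokenized training data is available as a flat sequence of integer ids; the function name `vocab_fraction_for_coverage` and the toy data are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def vocab_fraction_for_coverage(token_ids, vocab_size, coverage=0.5):
    """Fraction of the vocabulary needed to account for `coverage` of all tokens.

    Tokens are taken in order of decreasing frequency until their
    cumulative count reaches the requested share of the corpus.
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    covered = 0
    used = 0
    for _, count in counts.most_common():
        covered += count
        used += 1
        if covered >= coverage * total:
            break
    return used / vocab_size

# Toy corpus (made up): token 0 alone covers 60% of the 10 tokens.
toy_ids = [0] * 6 + [1] * 2 + [2, 3]
print(vocab_fraction_for_coverage(toy_ids, vocab_size=10))  # 0.1
```

Running this on the real corpus for each tokenizer setting would show whether the ~3% figure holds across the ~15k to ~100k vocabulary range.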

## training hyper-params

- optimizer = AdamW

- lr = 1e-3 (works the best compared to 1e-2 and 1e-4)

- betas = [0.9, 0.95]

- effective batch size = 4M tokens

- epochs = 16

- warmup steps = ~200 (approx. 1 epoch)
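The numbers in this list imply a fairly short run in optimizer steps, which is easy to check with some arithmetic. The sketch below uses only the values stated above. The linear warmup shape and the constant rate after warmup are assumptions, since the post does not say which schedule is used, and `lr_at` is a hypothetical helper.

```python
dataset_tokens = 750_000_000   # training set size
batch_tokens = 4_000_000       # effective batch size in tokens
epochs = 16
warmup_steps = 200

steps_per_epoch = dataset_tokens / batch_tokens   # 187.5 steps
total_steps = int(steps_per_epoch * epochs)       # 3000 steps
total_tokens_seen = dataset_tokens * epochs       # 12 billion tokens
warmup_fraction = warmup_steps / total_steps      # roughly 6.7% of the run

def lr_at(step, peak_lr=1e-3, warmup=warmup_steps):
    """Learning rate at a 0-indexed step.

    Assumption: linear warmup to peak_lr, then constant. Any decay
    after warmup is not specified in the post and is left out here.
    """
    return peak_lr * min(1.0, (step + 1) / warmup)

print(steps_per_epoch, total_steps, total_tokens_seen)
print(lr_at(0), lr_at(99), lr_at(199))
```

With a 4M-token batch, the whole 16-epoch run is about 3000 optimizer updates, and the ~200 warmup steps do correspond to roughly one pass over the data.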

## model hyper-params

- 16 layers (but variants with up to 48 layers were tested)

- embedding size (n_embd) = adjusted to yield the 100M, 250M and 500M models

- MLP size = 4*n_embd

- 16 attention heads

- context window = 1000
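To see which embedding size lands near each target, the parameter count can be approximated from the settings above. This is a rough sketch under stated assumptions: a standard decoder block with about 4·n_embd² attention weights and 8·n_embd² MLP weights (from the 4x MLP size), learned positional embeddings, tied input/output embeddings, and biases and layer norms ignored. The vocabulary size of 50,000 is an illustrative value inside the stated ~15k to ~100k range, and both function names are hypothetical.

```python
import math

def approx_params(n_layer, n_embd, vocab_size, n_ctx=1000, tied=True):
    """Approximate parameter count of a GPT-like decoder.

    Assumptions: 12 * n_embd^2 weights per layer (attention + 4x MLP),
    learned positional embeddings, no biases or layer-norm parameters.
    """
    per_layer = 12 * n_embd ** 2
    embeddings = (vocab_size + n_ctx) * n_embd
    head = 0 if tied else vocab_size * n_embd
    return n_layer * per_layer + embeddings + head

def embd_for_target(target, n_layer, vocab_size, n_heads=16, n_ctx=1000):
    """Solve 12*L*d^2 + (V + n_ctx)*d = target for d (tied embeddings),
    then round d to a multiple of n_heads."""
    a = 12 * n_layer
    b = vocab_size + n_ctx
    d = (-b + math.sqrt(b * b + 4 * a * target)) / (2 * a)
    return max(n_heads, round(d / n_heads) * n_heads)

vocab_size = 50_000  # illustrative assumption, not a value from the post
for target in (100e6, 250e6, 500e6):
    d = embd_for_target(target, n_layer=16, vocab_size=vocab_size)
    n = approx_params(16, d, vocab_size)
    # 750M unique training tokens divided by the parameter count
    print(f"target={target:.0e} n_embd={d} params={n:,} "
          f"unique tokens/param={750e6 / n:.1f}")
```

The last printed column relates the 750M-token dataset to each model size, which makes it easy to see how many unique tokens each parameter gets before the 16 epochs of repetition.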

# Issue

The model seems to fail to learn basic auto-regressive behavior. It often gets stuck generating a single token (no repetition penalty and no sampling yet).
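One way to put numbers on this symptom is sketched below, assuming generated sequences and a sample of training tokens are available as lists of integer ids. Both helpers are hypothetical. The first measures how much of a generation is a single token. The second computes the unigram entropy of the data, which is the cross-entropy (in nats) of a model that only predicts token frequencies and ignores context.

```python
import math
from collections import Counter

def dominant_token_fraction(generated_ids):
    """Share of a generated sequence taken up by its most frequent token."""
    if not generated_ids:
        return 0.0
    _, count = Counter(generated_ids).most_common(1)[0]
    return count / len(generated_ids)

def unigram_entropy_nats(token_ids):
    """Entropy of the empirical token distribution, in nats.

    A context-free model that outputs these frequencies reaches
    exactly this cross-entropy on the same data.
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy values (made up) to show the calls.
print(dominant_token_fraction([7] * 9 + [3]))   # 0.9
print(unigram_entropy_nats([0, 1, 0, 1]))       # ln(2), about 0.693
```

If the training loss sits at or above the unigram entropy of the dataset, the model is not yet using context at all. A loss well below it, combined with a high dominant-token fraction, would point at the decoding setup instead.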

Is training GPT-like models still black magic? Is there some trick to this?

*Disclaimer*: I will add/edit the parameters above as people ask clarifying questions.

submitted by /u/gartin336