docs: increase MLX smoke validation batch size #36
brendanboyle87 wants to merge 1 commit into openai:main
Conversation
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
??? this is increasing val batch size??
Sorry if I was off base here. This was based on the fact that this script is for local MLX dev; there was no intermediate output, so I was trying to figure out how long validation would take. Codex gave an estimate in hours vs. minutes: "On this machine, a full validation with the old VAL_BATCH_SIZE=8192 is roughly a 5 to 6+ hour job. With VAL_BATCH_SIZE=524288, it is about 5 minutes. The reason is in train_gpt_mlx.py:766: validation uses VAL_BATCH_SIZE // GRAD_ACCUM_STEPS. With GRAD_ACCUM_STEPS=8 and TRAIN_SEQ_LEN=1024, 8192 means only 1024 eval tokens per batch, which is exactly 1 sequence. 524288 means 65536 eval tokens, or 64 sequences per batch."
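The arithmetic in that estimate can be sketched in a few lines. This is not the actual code from train_gpt_mlx.py, just an illustrative helper using the constant names and values quoted in the comment above:

```python
# Sketch of the eval-batch arithmetic described in the comment above.
# GRAD_ACCUM_STEPS and TRAIN_SEQ_LEN mirror the settings quoted from
# train_gpt_mlx.py; eval_sequences_per_batch is a hypothetical helper.
GRAD_ACCUM_STEPS = 8
TRAIN_SEQ_LEN = 1024

def eval_sequences_per_batch(val_batch_size: int) -> int:
    """Number of sequences evaluated per validation batch."""
    # Validation divides the batch size across gradient-accumulation steps,
    # then each sequence consumes TRAIN_SEQ_LEN tokens.
    eval_tokens = val_batch_size // GRAD_ACCUM_STEPS
    return eval_tokens // TRAIN_SEQ_LEN

print(eval_sequences_per_batch(8192))    # old default: 1 sequence per batch
print(eval_sequences_per_batch(524288))  # proposed value: 64 sequences per batch
```

With only 1 sequence per validation batch, the old default forces far more batches (and hours of wall time) to cover the validation set; 64 sequences per batch brings it down to minutes.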
Summary
VAL_BATCH_SIZE=524288
Why
The default validation batch size in the README trial run takes a very long time on a local Mac (an M4 Max Mac Studio with 128GB), so this raises the documented MLX smoke-test value to a more practical local setting.