Model description
I would like to implement a new model architecture.
Short description
RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)
RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
Open source status
Provide useful links for the implementation
Implementation and weights
There's an implementation at BlinkDL/RWKV-LM which also gives a detailed description of the model internals and some performance benchmarks. Model weights currently are being trained for a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword by me. Both will be openly available - some checkpoints for the Pile already are, even though it's an ongoing process.
Status
The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.
Model description
I would like to implement a new model architecture.
Short description
RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)
RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
Open source status
Provide useful links for the implementation
Implementation and weights
There's an implementation at BlinkDL/RWKV-LM which also gives a detailed description of the model internals and some performance benchmarks. Model weights currently are being trained for a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword by me. Both will be openly available - some checkpoints for the Pile already are, even though it's an ongoing process.
Status
The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.