This blog post and its results were very surprising to me. Featherless AI has been doing some pretty in-depth research on reusing existing model weights while throwing away the attention layers and replacing them with RWKV, which is lighter on memory, better at long context, and a fundamentally different architecture: it does not scale quadratically with context length.
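To make the scaling point concrete, here is a very rough sketch of my own (not Featherless AI's code) of why a recurrent mixer like RWKV avoids the quadratic cost: attention builds a T x T score matrix, while an RWKV-style block carries a fixed-size state forward one token at a time. The decay-weighted running sum below is a toy stand-in for the real RWKV time-mixing math, just to show the shape of the computation.

```python
import torch

def attention_scores(q, k):
    # Standard attention: a (T, T) score matrix, so compute and memory
    # grow quadratically with sequence length T.
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)

def recurrent_mix(x, decay=0.9, state=None):
    # RWKV-style idea, heavily simplified: keep a fixed-size running state
    # and update it once per token, so cost is linear in T and the state
    # never grows with context length.
    T, d = x.shape
    if state is None:
        state = torch.zeros(d)
    outputs = []
    for t in range(T):
        state = decay * state + x[t]   # fold the new token into the state
        outputs.append(state)
    return torch.stack(outputs), state
```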
Their method (a rough code sketch follows the list):
* Keep the old model as a teacher
* Copy it, freeze all layers
* Rip out the attention layers
* Inject new RWKV layers there
* Train the new layers against the original model as a teacher
* Unfreeze the entire new model and train against the teacher again
* Train for long context
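Here is a minimal sketch of what that distillation recipe could look like in PyTorch. This is my own assumption based on the bullet list above, not Featherless AI's actual code: `student.rwkv_blocks` is a hypothetical attribute name, and the real runs stage the training more carefully.

```python
import torch
import torch.nn.functional as F

def convert_and_distill(teacher, student, batches, lr=1e-4):
    """Hypothetical sketch of the conversion recipe described above.

    The student starts as a copy of the teacher with its attention blocks
    swapped for RWKV-style blocks; everything else stays frozen at first.
    """
    teacher.eval()

    # Phase 1: train only the newly injected RWKV blocks against the teacher.
    for p in student.parameters():
        p.requires_grad = False
    for block in student.rwkv_blocks:          # hypothetical attribute name
        for p in block.parameters():
            p.requires_grad = True

    opt = torch.optim.AdamW(
        [p for p in student.parameters() if p.requires_grad], lr=lr)

    for tokens in batches:
        with torch.no_grad():
            teacher_logits = teacher(tokens)
        student_logits = student(tokens)
        # Distillation loss: match the teacher's output distribution.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean")
        loss.backward()
        opt.step()
        opt.zero_grad()

    # Phase 2 (not shown): unfreeze the whole student, repeat the same loop,
    # then continue training on long-context data.
```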
Their tests were done with a tiny number of training tokens, roughly 200-500M in total. Results: interesting! It's definitely better than QwQ on some tasks, and worse on a few.
The super interesting result here is that most of the knowledge and intelligence seems to live in the feed-forward part of the network, not the attention layers, since attention can be ripped out and its replacement trained relatively quickly. That's super, super fascinating.