29 points | by JohannaAlmeida 3 hours ago
6 comments
Full attention O(n²): 17.96s / 5.6 tok/s
HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
Is this just for autocomplete? Because you're not going to get anything very useful out of a code-only training set.
Yeah, autocomplete is an amazing use case. I needed a small model that used transformers and could fit on my weak consumer GPU.
So I needed to make fundamental architecture changes and do some KV cache tricks.
And then prove the new architecture was faster with benchmarks, and that perplexity was acceptable.
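The post doesn't say which KV cache tricks were used, so as a hypothetical illustration, here's one common one: a sliding-window cache that only retains the most recent `window` key/value pairs, keeping memory at O(W) per layer instead of O(n). The class name and interface are my own invention, not from the project.

```python
from collections import deque

class SlidingKVCache:
    """Sketch of one common KV-cache trick: keep only the last
    `window` key/value pairs so cache memory stays O(W) instead
    of growing O(n) with sequence length."""

    def __init__(self, window: int):
        # deque with maxlen evicts the oldest entry automatically
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Called once per generated token with that token's key/value
        self.keys.append(k)
        self.values.append(v)

    def contents(self):
        # Snapshot of what attention would see at this step
        return list(self.keys), list(self.values)
```

This pairs naturally with windowed attention, since tokens outside the window are never attended to and can be dropped from the cache for free.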
I think it's more a proof of concept: locally trained. It would take lots of resources/time to train something non-trivial.
Look into RWKV.
Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I'm combining local windowed attention with a gated recurrent path plus KV cache compression, so it's more of a hybrid than a full replacement for attention.
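A minimal sketch of the idea described above, assuming a single head and a simple exponential gate (the actual project's gating and compression details aren't shown in the thread): each position attends over only the last W tokens, O(n·W), while a gated running state summarizes everything older, O(n).

```python
import numpy as np

def hybrid_attention(q, k, v, window=4, gate=0.9):
    """Single-head sketch: local windowed attention plus a gated
    recurrent summary of tokens that scrolled out of the window.
    q, k, v: (n, d) arrays. Cost is O(n*W) for the local path plus
    O(n) for the recurrent path, versus O(n^2) for full attention."""
    n, d = q.shape
    out = np.zeros_like(v)
    state = np.zeros(d)  # decayed running summary (recurrent path)
    for t in range(n):
        lo = max(0, t - window + 1)
        # Local path: softmax attention over the last `window` tokens
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        local = w @ v[lo:t + 1]
        # Recurrent path: O(1)-per-step gated update
        state = gate * state + (1 - gate) * v[t]
        out[t] = local + state  # merge both paths
    return out
```

The per-step work is constant in sequence length, which is where the O(n·W + n·D) scaling (and the 286.6 vs 5.6 tok/s gap in the benchmark above) comes from.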