You can improve response time in Transformer models by caching the key and value projections of the attention layers for previously processed tokens during autoregressive generation, a technique commonly known as KV caching.
The snippet below is a minimal, illustrative sketch of the idea, not the Hugging Face Transformers API; the class name `CachedSelfAttention` and the `use_cache` flag are assumptions made for the example.

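```python
# A minimal sketch of key/value (KV) caching in a single attention block.
# `CachedSelfAttention` and the `use_cache` flag are illustrative assumptions,
# not part of the Hugging Face Transformers API.
import math
from typing import Optional

import torch
import torch.nn as nn


class CachedSelfAttention(nn.Module):
    """Single-head causal self-attention with an internal key/value cache."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Cached keys/values for already-processed tokens: (batch, seq_len, d_model)
        self.cache_k: Optional[torch.Tensor] = None
        self.cache_v: Optional[torch.Tensor] = None

    def reset_cache(self) -> None:
        self.cache_k = None
        self.cache_v = None

    def forward(self, x: torch.Tensor, use_cache: bool = True) -> torch.Tensor:
        # x holds only the *new* tokens: (batch, new_len, d_model);
        # during generation new_len is typically 1.
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        if use_cache:
            # Prepend cached keys/values so attention covers the full history
            # without re-projecting tokens that were already processed.
            if self.cache_k is not None:
                k = torch.cat([self.cache_k, k], dim=1)
                v = torch.cat([self.cache_v, v], dim=1)
            self.cache_k, self.cache_v = k.detach(), v.detach()

        # Scaled dot-product attention: queries only for the new tokens,
        # keys/values for everything seen so far.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)

        # Causal mask among the new tokens (all cached positions stay visible).
        new_len, total_len = q.size(1), k.size(1)
        if new_len > 1:
            past_len = total_len - new_len
            mask = torch.ones(new_len, total_len, dtype=torch.bool, device=x.device)
            mask[:, past_len:] = torch.ones(new_len, new_len, device=x.device).tril().bool()
            scores = scores.masked_fill(~mask, float("-inf"))

        attn = scores.softmax(dim=-1)
        return self.out_proj(attn @ v)
```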
The above code illustrates the following key points:
- A simple attention block with an internal cache that stores past key/value tensors
- Reuse of cached values to reduce redundant computation
- Efficient handling of incremental token inputs during generation
Hence, caching the key and value projections during generation significantly speeds up inference: each new token only needs to be projected and attended once, while the keys and values of all earlier tokens are reused from the cache instead of being recomputed at every step.
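
To see the incremental pattern in practice, here is a hypothetical usage sketch reusing the class from the example above (tensor shapes are arbitrary stand-ins for embedded tokens):

```python
# Hypothetical usage of the sketch above: process the prompt once,
# then feed exactly one new token per step and reuse the cache.
attn = CachedSelfAttention(d_model=64)

prompt = torch.randn(1, 10, 64)        # 10 prompt tokens (prefill pass)
out = attn(prompt)                     # fills the cache with 10 key/value pairs

for _ in range(5):                     # 5 generation steps
    new_token = torch.randn(1, 1, 64)  # stand-in for the embedded next token
    out = attn(new_token)              # attends over all cached positions plus itself
```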