You can use Rotary Positional Embedding (RoPE) to encode relative positions directly in attention scores, improving generalization to longer sequences.
A code sketch illustrating the idea is shown below:

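This is a minimal PyTorch sketch rather than a reference implementation; the function names `build_rope_cache` and `apply_rotary`, the `rotary_dim` parameter, and the tensor layout `(batch, seq_len, num_heads, head_dim)` are assumptions made for illustration.

```python
import torch


def build_rope_cache(seq_len: int, rotary_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables for RoPE (illustrative sketch).

    Each position m gets angles m * theta_i, where theta_i follows the
    standard inverse-power frequency schedule.
    """
    # Frequency-based angle generation for encoding relative positions
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, rotary_dim / 2)
    return angles.cos(), angles.sin()


def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, rotary_dim: int):
    """Rotate the first `rotary_dim` channels of x; pass the rest through.

    x: (batch, seq_len, num_heads, head_dim), with rotary_dim <= head_dim.
    """
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    # Split the rotary channels into even/odd pairs and rotate each pair
    # by its position-dependent angle (element-wise sine/cosine rotation).
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    cos = cos[None, :, None, :]                 # broadcast over batch and heads
    sin = sin[None, :, None, :]
    rotated_even = x1 * cos - x2 * sin
    rotated_odd = x1 * sin + x2 * cos
    # Re-interleave the rotated pairs back into their original channel order,
    # then reattach the untouched (non-rotary) channels.
    x_rot = torch.stack((rotated_even, rotated_odd), dim=-1).flatten(-2)
    return torch.cat((x_rot, x_pass), dim=-1)
```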
The code above illustrates the following key points (a short usage sketch follows this list):
- Frequency-based angle generation for encoding relative positions
- Element-wise rotation of input embeddings using sine and cosine
- Applied only to a specific portion of the embedding (rotary dimension)
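As a usage sketch, queries and keys would be rotated before the dot-product step, so the attention scores depend on relative positions; the shapes and the `rotary_dim` value below are illustrative assumptions that reuse the `build_rope_cache` and `apply_rotary` helpers sketched above.

```python
# Hypothetical shapes; rotate only half of each attention head.
batch, seq_len, num_heads, head_dim = 2, 16, 4, 64
rotary_dim = 32

q = torch.randn(batch, seq_len, num_heads, head_dim)
k = torch.randn(batch, seq_len, num_heads, head_dim)

cos, sin = build_rope_cache(seq_len, rotary_dim)
q = apply_rotary(q, cos, sin, rotary_dim)
k = apply_rotary(k, cos, sin, rotary_dim)

# The attention scores now encode relative positions via the rotations above.
scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / head_dim ** 0.5
```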
As a result, rotary embeddings improve the robustness of generative models on longer-context inputs.