High Dimensional, Dynamic Rotary Positional Embedding [P]
At the end of my last post, I presented an idea: what if I took the core of my last project, the cumulative matrix product, and repurposed it as a positional embedding? I just finished fleshing out the math behind HDD-RoPE and training a model with this positional embedding algorithm, and the results are excellent. When trained on the TinyStories dataset, the validation loss begins to converge a fair amount faster than that of the baseline transformer trained with xPos.

The repo at https://github.com/mikayahlevi/hdd-rope/ lets you replicate the results and goes into depth on the math and the details of the architecture.

Standard RoPE breaks the queries and keys into pairs of channels and rotates each pair at a predefined rate. This allows the model to learn relative position by observing the change of basis between the queries and keys. Pairs make intuitive sense for a linear sequence, since each pair can be rotated with a single degree of freedom, which corresponds to position progressing linearly along one dimension.

If you would like to learn more, please check out the repo, where I formalize the math and lay out a roadmap.
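For readers who have not seen it written out, the pairwise rotation that standard RoPE performs can be sketched in a few lines. This is only an illustration of the standard RoPE baseline described in the post, not the HDD-RoPE code from the repo. The function names `rope_rotate` and `dot`, the `base` value of 10000, and the toy vectors are assumptions chosen for the example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles.

    Illustrative sketch of standard RoPE; names and `base` are assumptions.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of channels"
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own predefined rotation rate; later pairs turn more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # A 2D rotation: one degree of freedom per pair.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (made up for the example).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same offset of 3 at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Because rotations of the same pair compose by adding angles, `score_a` and `score_b` agree up to floating-point error: the score depends only on the offset between the query and key positions, which is the relative-position property mentioned above.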