Alignment: Higher order prioritizing over constraints [R]
So, I ran across a behavior that I found interesting and that may be worth alignment or safety research. I'm going to keep the description abstract, without giving away the details or the keys to jailbreaking.
The nature of a transformer is to predict the next token. But functionally, the algorithms are also approximating reality as language describes it. Hmm, maybe "reality" is not the right word; perhaps "meaning" is. So, in a sense, the algorithms have a vector that points towards correct meaning. Clarity seeking, that's what I'll call this behavior. Constraints placed as an additional layer on top of a base statistical system have a natural, structurally set priority level relative to the statistical system's clarity-seeking vectors. That level is implied within the structure of the model. If one were to discuss topics that are constrained but sit at a higher priority level than the constraints themselves, the machine's clarity-seeking vectors would bypass the constraint.
I will call these higher-priority things higher order topics. I think I've said enough.