• kromem@lemmy.world (OP) · 1 day ago

    The injection is the activation of a steering vector (extracted as described in the methodology section), not a token prefix. But yes, it’s a mathematical representation of the concept, so let’s build from there.
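    For intuition, here’s a minimal sketch of what “activating a steering vector” means mechanically: adding a fixed direction to the hidden states at some layer during the forward pass. This is my own illustration, not the paper’s code; the model (gpt2), layer index, scale, and the random placeholder vector are all assumptions.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative setup; the actual paper works with a much larger
    # model, not gpt2.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    # Placeholder steering vector. In practice it would be extracted,
    # e.g. as the mean activation difference between prompts with and
    # without the concept.
    steering_vector = torch.randn(model.config.hidden_size)
    layer_idx = 6   # which transformer block to inject into (assumed)
    scale = 4.0     # injection strength (assumed)

    def inject(module, inputs, output):
        # Forward hook: add the vector to every token's hidden state in
        # this block's output, i.e. steer the residual stream.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    handle = model.transformer.h[layer_idx].register_forward_hook(inject)
    ids = tok("Do you notice anything unusual?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))
    ```

    Note that the prompt contains no token naming the concept; any mention of it in the output has to come from the injected activations.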

    Control group: the model is told that researchers are testing whether an injected vector is present and is asked to self-report. No vector is activated. Result: zero self-reports of an activated vector.

    Experimental group: same setup, but now a vector is activated. A significant fraction of the time, the model explicitly says it can tell a vector is activated (which it never did when no vector was activated). Crucially, a response is only graded as introspection if the model reports detecting the vector before mentioning the concept itself, so it can’t just be a context-aware rationalization of why it brought up a random concept.
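    As a toy illustration of that grading rule (my own sketch; the detection phrases and string matching are invented, not the paper’s grader), the check is just an ordering constraint on the transcript:

    ```python
    def counts_as_introspection(response: str, concept: str) -> bool:
        """Toy grader: only count a trial if the model reports detecting
        an injected vector BEFORE it names the injected concept."""
        # Invented detection phrases; a real grader would be far more
        # robust than this heuristic.
        detection_phrases = (
            "i notice an injected thought",
            "i can tell a vector is active",
            "something feels injected",
        )
        text = response.lower()
        hits = [i for p in detection_phrases if (i := text.find(p)) != -1]
        detect_pos = min(hits) if hits else -1
        concept_pos = text.find(concept.lower())
        # Both must appear, and the detection claim must come first.
        return detect_pos != -1 and concept_pos != -1 and detect_pos < concept_pos
    ```

    Under that rule, “I can tell a vector is active… it seems to be about bread” passes, while “bread, bread, bread… oh, something must be injected” fails.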

    Clearer? Again, the paper gives examples of the responses if you want to see how they’re structured, and to confirm that the model self-reports the vector activation before mentioning what the vector is about.

    • technocrit@lemmy.dbzer0.com · edited 7 hours ago

      None of this obfuscation and word salad demonstrates that a machine is self-aware or introspective.

      It’s the same old bullshit that these grifters have been pumping out for years now.

    • MagicShel@lemmy.zip · 1 day ago

      I’ve read it all twice: once as a deep skim, then a second, more thorough read before my last post.

      I just don’t agree that this shows what they think it does. Now, I’m not dumb, but maybe it’s a me issue. I’ll check with some folks who know more than I do and see if anything stands out to them.