I'm pretty happy about this, obviously! I've been trying to get people to do this since '23. I think of this as sort of the basic proof-of-concept result for the field of faithful CoT / CoT interpretability. I'm happy to see this result paired with a recommendation not to train the CoTs to look nice; I hope Anthropic and GDM soon issue similar recommendations. If we can get this to be industry-standard practice, we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking. During that golden era, much alignment science can be done, such as the model organisms stuff Anthropic and Redwood are doing. Moreover, we can start to build up a real-world-experience-based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.
Example questions to pursue:
a. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can it permanently avoid this if it never explores into them?
b. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward"? Or do they look like "Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I'm actually being deployed!" Or some fourth thing?
c. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?
d. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you've studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it's trying to do something bad, e.g. hack its way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?
e. What about if you do shoggoth/face? Does that change things?
f. What about if you throw in a paraphraser?
g. What are the scaling laws / trends for all of the above?
h. Many more things besides, I bet; I just generated all of the above off the top of my head. This is a super promising and important area of research!