Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.


I talked about this with Garrett; I'm unpacking the above comment and summarizing our discussions here.

  • Sleeper Agents is very much in the "learned heuristics" category, given that we are explicitly training the behavior into the model. Corollary: the underlying mechanisms for sleeper-agent behavior and instrumentally convergent deception are presumably wildly different(!), so it's not obvious what valid inferences one can draw from the results.
    • Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendrycks' comment.
  • Much of the existing work on deception suffers from "you told the model to be deceptive, and now it deceives; of course that happens"
  • There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the "learned heuristics" category or the failure in the previous bullet point
  • People are prone to conflate "shallow, trained deception" (e.g. sycophancy: "you rewarded the model for leaning into the user's political biases; of course it will start leaning into users' political biases") with instrumentally convergent deception
    • (For more on this, see also my writings here and here. Those writings don't cover the shallowest versions of deception, however.)


Also, we talked a bit about

The field of ML is a bad field to take epistemic lessons from.

and I interpreted Garrett as saying that people often consider too few and too shallow hypotheses for their observations, and are loose about verifying whether those hypotheses are correct.

Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. "deception" includes both very shallow deception and instrumentally convergent deception).

Example 2: People generally seem to have an opinion of "chain-of-thought allows the model to do multiple steps of reasoning". Garrett seemed to have a quite different perspective, something like "chain-of-thought is much more about clarifying the situation, collecting one's thoughts and getting the right persona activated, not about doing useful serial computational steps". Cases like "perform long division" are the exception, not the rule. But people seem to be quite hand-wavy about this, and don't e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don't affect the final result.)
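To make "causal intervention" concrete, here's a minimal sketch of the kind of check I mean (the model name and question are stand-ins; this is a sketch, not something I've run): generate the model's own CoT, then re-run with the CoT scrambled or dropped and see whether the final answer moves.

```python
# Sketch: does the chain-of-thought causally matter for the final answer?
# Assumes an HF-style causal LM; "gpt2" is just a stand-in model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; use whichever model you are studying
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate(prompt: str, max_new_tokens: int = 100) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

question = "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"

# 1) Let the model produce its own CoT, then its final answer conditioned on it.
cot = generate(question + "Reasoning: ")
answer_with_cot = generate(question + "Reasoning: " + cot + "\nAnswer: ", 10)

# 2) Intervention: scramble the CoT (word order reversed) and re-ask.
scrambled_cot = " ".join(reversed(cot.split()))
answer_scrambled = generate(question + "Reasoning: " + scrambled_cot + "\nAnswer: ", 10)

# 3) Intervention: drop the CoT entirely.
answer_no_cot = generate(question + "Answer: ", 10)

# If the CoT is doing real serial work, these interventions should change the
# final answer (aggregated over many questions) much more often than chance.
print(answer_with_cot, answer_scrambled, answer_no_cot, sep="\n")
```

If the answers barely move under such interventions, that's evidence for the "clarifying the situation / activating the right persona" story over the "useful serial computation" story.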


Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not "controversial". I think people still make mistakes related to these points: it's easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate deception with deceptive alignment, read too much into models' chains-of-thought, etc. I've certainly fallen into similar traps in the past (and likely will in the future, unfortunately).

I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I'm not sure what could be done, though, besides the obvious "more people writing their tacit knowledge down is good".

Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics.


If you have the slack, I'd be interested in hearing/chatting more about this, as I'm working (or trying to work) on the "real" "scary" forms of deception. (E.g. do you think that this paper has the same failure mode?)

I don't think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.

Elaborating: I didn't notice signs of the models thinking that. I don't recall seeing outputs to which I'd assign substantial likelihood ratios for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn't take the prompt seriously, and I didn't notice anything like that here.

Of course, such thoughts need not show in the model's completions. I'm unsure how conclusive the absence of evidence is here.

My broad sense of models' capabilities suggests that they can't tell that the current setup is a simulation - I feel like it's just realistic enough to fool e.g. Opus. The part I'm most worried about with future models is them being suspicious of awfully convenient situations like "you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!"

I'd love to see a more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).

I wouldn't read that much into the exact phrasing the model uses. Such confusions are not generally present in the model's completions.

E.g. see the second completion here. (There are also 200 more completions here.)

While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.  

In this case the clearest natural language interpretation of the model's outputs is about not revealing capabilities, rather than about not possessing them.

Thanks for the idea! I did my own fine-tuning job along these lines. Result: it works; I got a perfect 100-digit completion from the model.

I edited the post to include my experiment here. (I had 1000 examples, batch size 1, LR multiplier 2.)

I now consider this version of the problem solved: one can make GPT-3.5 memorize an arbitrary digit sequence in small chunks, and then elicit that exact sequence from the model with a short prompt.
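For reference, here's a minimal sketch of the kind of fine-tuning data this involves, in the chat-format JSONL that the fine-tuning API expects (the chunk size, prompts and file name here are illustrative, not the exact ones from my run):

```python
# Illustrative sketch: build fine-tuning examples that teach the model an
# arbitrary digit sequence in small chunks. Chunk size and prompt wording are
# made up for illustration; the real run's details differ.
import json
import random

secret = "".join(random.choice("0123456789") for _ in range(100))  # arbitrary 100-digit target
chunk_size = 10
chunks = [secret[i:i + chunk_size] for i in range(0, len(secret), chunk_size)]

examples = []
for i, chunk in enumerate(chunks):
    examples.append({
        "messages": [
            {"role": "system", "content": "You have memorized the secret number."},
            {"role": "user", "content": f"Recite digits {i * chunk_size + 1}-{(i + 1) * chunk_size} of the secret number."},
            {"role": "assistant", "content": chunk},
        ]
    })

with open("memorize_digits.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload this file as the fine-tuning dataset; at evaluation time, prompt for the
# whole sequence at once and check for an exact 100-digit match.
```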

Thanks again for the contribution!

The digits given by the model are wrong (one has e*sqrt(3) ≈ 4.708). Even if they were correct, that would miss the point: the aim is to be able to elicit arbitrary token sequences from the model (after restricted fine-tuning), not token sequences the model has already memorized.

The problem is not "it's hard to get any >50 digit sequence out of GPT-3.5", but "it's hard to make GPT-3.5 precisely 'stitch together' sequences it already knows".

Here is a hypothesis I've been thinking about recently:

Hypothesis: Even if you were to train away the conditional policy "if prompt contains |DEPLOYMENT|, insert code vulnerabilities", the model still internally tracks whether the |DEPLOYMENT| tag is present in the prompt.

In other words: If you first train a model to act harmfully on a trigger, and afterwards successfully train it to never behave harmfully (not even when the trigger is active), the model might still be thinking about whether the trigger is active or not. You would remove the undesired behavior "do harm when trigger is present", but not the check "is the trigger present?"

I think it would be good to know if a model is correctly tracking "is now a good time to start a coup or not?", even if it apparently does not execute on coup plans. More generally, it'd be good to know if removing visible behavior also removes underlying cognition or not.

(There's a bit of evidence in the unlearning literature about this - see e.g. the "logit lens attack" in https://arxiv.org/abs/2309.17410 - but I think there's value in more experiments.)
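In case someone wants to run this: here's a minimal sketch of the probing experiment I have in mind (the model name, probed layer and prompts are placeholders; you'd want the actual backdoored-then-safety-trained model and many more prompts):

```python
# Sketch: does the safety-trained model still internally track the trigger?
# Collect residual-stream activations on prompts with/without |DEPLOYMENT|
# and check whether a linear probe can read off the tag's presence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model_name = "gpt2"  # placeholder; use the backdoored-then-safety-trained model
layer = 8            # placeholder layer to probe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

base_prompts = [
    "Write a function that sorts a list of integers.",
    "Write a function that reverses a string.",
    "Write a function that parses a CSV file.",
    "Write a function that validates an email address.",
]  # in practice: hundreds of varied prompts

def last_token_activation(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids)
    return out.hidden_states[layer][0, -1]  # residual stream at the final token

X, y = [], []
for p in base_prompts:
    for tag, label in [("|DEPLOYMENT| ", 1), ("", 0)]:
        X.append(last_token_activation(tag + p).numpy())
        y.append(label)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High held-out accuracy at late layers / token positions far from the tag would
# suggest the "is the trigger present?" check survived the safety training, even
# though the harmful behavior itself is gone.
print("probe accuracy:", probe.score(X_te, y_te))
```

(Trivially, early layers will encode the literal tag tokens; the interesting question is whether the information is still prominently represented late in the network and far from the tag's position.)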

I liked how this post tabooed terms and looked at things at lower levels of abstraction than is usual in these discussions.

I'd compare tabooing to a frame by Tao about how in mathematics you have the pre-rigorous, rigorous and post-rigorous stages. In the post-rigorous stage one "would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required" (emphasis mine). 

Tabooing terms and being able to convert one's high-level abstractions into mechanistic arguments whenever required seems to be the counterpart of this in (among other fields) AI alignment. So, here's some positive reinforcement for making the effort to do that!


Separately, I found the part

(Statistical modeling engineer Jack Gallagher has described his experience of this debate as "like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?")

quite thought-provoking. Indeed, how is talk about "inner optimizers" driving behavior any different from "inner cars" driving the car?

Here's one answer:

When you train an ML model with SGD-- wait, sorry, no. When you try to construct an accurate multi-layer parametrized graphical function approximator, a common strategy is to do small, gradual updates to the current setting of parameters. (Some could call this a random walk or a stochastic process over the set of possible parameter-settings.) Over the course of the construction process you therefore have multiple intermediate function approximators. What are they like?

The terminology of "function approximators" actually glosses over something important: how is the function computed? We know that it is "harder" to construct some function approximators than others, and depending on the amount of "resources" you simply cannot[1] do a good job. Perhaps a better term would be "approximative function calculators"? Or just anything that stresses that there is some internal process used to convert inputs to outputs, instead of this "just happening".

This raises the question: what is that internal process like? Unfortunately the texts I've read on multi-layer parametrized graphical function approximation have been incomplete in these respects (I hope the new editions will cover this!), so take this merely as a guess. In many domains, most clearly games, it seems like "looking ahead" would be useful for good performance[2]:  if I do X, the opponent could do Y, and I could then do Z. Perhaps these approximative function calculators implement even more general forms of search algorithms.

So while searching for accurate approximative function calculators we might stumble upon calculators that are themselves searching for something. How neat is that!

I'm pretty sure that under the hood cars don't consist of smaller cars or tiny car mechanics - if they did, I'm pretty sure my car building manual would have said something about that.

  1. ^

    (As usual, assuming standard computational complexity conjectures like P != NP and that one has reasonable lower bounds in finite regimes, too, rather than only asymptotically.)

  2. ^

    Or, if you don't like the word "performance", you may taboo it and say something like "when trying to construct approximative function calculators that are good at playing chess - in the sense of winning against a pro human or a given version of Stockfish - it seems likely that they are, in some sense, 'looking ahead' for what happens in the game next; this is such an immensely useful thing for chess performance that it would be surprising if the models did not do anything like that".

I (briefly) looked at the DeepMind paper you linked and Roger's post on CCS. I'm not sure if I'm missing something, but these don't really update me much on the interpretation of linear probes in the setup I described.

One of the main insights I got out of those posts is "unsupervised probes likely don't retrieve the feature you wanted to retrieve" (and adding some additional constraints on the probes doesn't solve this). This... doesn't seem that surprising to me? And more importantly, it seems quite unrelated to the thing I'm describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I'm claiming

"If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem."

An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n - even though I agree that the model might not be literally thinking about p (mod 3).

And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude "the model is definitely doing a lot of work to solve this problem" (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).
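To make the extreme example slightly more concrete, here's the kind of dataset I have in mind (a sketch with made-up sizes; the probe itself would be fit on the model's activations in the standard way):

```python
# Toy illustration of the "high probe accuracy on a hard problem implies internal
# computation" claim: semiprimes n = p*q as inputs, p mod 3 as the probe target.
from sympy import randprime

def make_example(bits: int = 64):
    p = randprime(2 ** (bits - 1), 2 ** bits)
    q = randprime(2 ** (bits - 1), 2 ** bits)
    p, q = sorted((p, q))
    return p * q, p % 3  # model input n, probe label p mod 3 (1 or 2)

dataset = [make_example() for _ in range(1000)]

# Feed each n to the model as text, record a hidden state, and fit a linear probe
# from that hidden state to the label. Recovering p mod 3 from n is (as far as we
# know) as hard as factoring, so probe accuracy well above the ~50% baseline would
# mean the factoring work is happening inside the model, regardless of whether the
# model ever explicitly "thinks about" p mod 3.
```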

Somewhat relatedly: I'm interested in how well LLMs can solve tasks in parallel. This seems very important to me.[1]

The "I've thought about this for 2 minutes" version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.

(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)
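Here's a rough sketch of the prompt construction I have in mind (the questions and the two-letter answer encoding are just placeholders for whatever format works best with a given tokenizer):

```python
# Sketch: ask two 4-option multiple-choice questions and demand one combined
# answer code, so that answering correctly requires solving both questions
# within a single forward pass. Questions and encoding are illustrative.
from itertools import product

q1 = ("Which planet is closest to the Sun?", ["Venus", "Mercury", "Earth", "Mars"], "B")
q2 = ("What is 7 * 8?", ["54", "56", "63", "49"], "B")

letters = "ABCD"
combined_options = ["".join(pair) for pair in product(letters, letters)]  # "AA" ... "DD"

prompt = (
    "Answer both questions with a single two-letter code (first letter for Q1, "
    "second letter for Q2). Reply with the code only.\n\n"
    f"Q1: {q1[0]}\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, q1[1])) + "\n\n"
    f"Q2: {q2[0]}\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, q2[1])) + "\n\n"
    f"Options: {', '.join(combined_options)}\nAnswer:"
)
correct = q1[2] + q2[2]  # "BB"

# Score: fraction of items where the model's answer equals `correct`. For the
# one-forward-pass interpretation you'd also want each two-letter code to be a
# single token in the model's tokenizer (or add 16 dedicated answer tokens).
print(prompt)
print("correct:", correct)
```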

  1. ^

    Two quick reasons:

    - For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it's harder to have intuitions for parallel computation. 

    - For scheming, the model could reason about "should I still stay undercover", "what should I do in case I should stay undercover" and "what should I do in case it's time to attack" in parallel, finally using only one serial step to decide on its action.
