I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t...
They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.
From my perspective, the only thing that keeps the OpenAI situation from being all kinds of terrible is that I continue to think they're not close to human-level AGI, so it probably doesn't matter all that much.
This is also my take on AI doom in general; my P(doom|AGI soon) is quite high (>50% for sure), but my P(AGI soon) is low. In fact, it has decreased over the last 12 months.
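To make the decomposition behind that explicit, here is a back-of-envelope sketch; the comment only commits to ">50%" and "low", so every number below is a hypothetical of mine:

```python
# P(doom) = P(doom | AGI soon) * P(AGI soon) + P(doom | no AGI soon) * P(no AGI soon)
p_agi_soon = 0.15             # "low" (illustrative value, mine)
p_doom_given_soon = 0.60      # ">50% for sure"
p_doom_given_not_soon = 0.20  # assumption: more time to work on alignment helps
p_doom = (p_doom_given_soon * p_agi_soon
          + p_doom_given_not_soon * (1 - p_agi_soon))
print(p_doom)  # ~0.26: overall risk stays moderate when P(AGI soon) is low
```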
Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting:
...On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts...
Because future rewards are discounted
Don't you mean future values? Also, AFAICT, the only thing going on here that separates online from offline RL is that offline RL algorithms shape the initial value function to give conservative behaviour. And so you get conservative behaviour.
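A minimal sketch of that mechanism (my own toy, assuming a CQL-style pessimism penalty on out-of-dataset actions; the environment and all constants are invented): with reward labels that are zero everywhere, the penalized backup alone makes the greedy policy hug the dataset, i.e. "survive".

```python
import numpy as np

# Toy chain MDP: states 0..4, state 4 is an absorbing "death" state.
# The offline dataset only ever demonstrates action 0 ("stay") in states 0-3.
n_states, n_actions = 5, 2        # actions: 0 = stay, 1 = step toward death
gamma, alpha = 0.99, 1.0          # discount; pessimism penalty for OOD actions

def next_state(s, a):
    return s if a == 0 else min(s + 1, n_states - 1)

in_dataset = np.zeros((n_states, n_actions), dtype=bool)
in_dataset[:4, 0] = True          # only "stay" in states 0-3 is in-distribution

# Penalized value iteration: all reward labels are zero ("wrong" rewards).
Q = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_new = np.empty_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            pen = 0.0 if in_dataset[s, a] else -alpha
            Q_new[s, a] = 0.0 + pen + gamma * Q[next_state(s, a)].max()
    Q = Q_new

print(Q.argmax(axis=1))  # [0 0 0 0 0]: the greedy policy stays in-distribution
```

Set `alpha = 0` and the preference vanishes, which is the parent's point: the conservative value shaping, not the return objective, is doing the work.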
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip in money for this. Key questions: ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Any recommendations on how I should do that? You may assume that I know what a gas chromatograph is and what a Petri dish is and why you might want to use either or both of those for data collection, but not that I have any idea of how to most cost-effectively access either one as some rando who doesn't even have an MA in Chemistry.
A Theory of Usable Information Under Computational Constraints
...We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting *predictive V-information* encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-information can be created through computation...
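For reference, the paper's central definitions, written from memory (double-check against the arXiv version): for a predictive family $\mathcal{V}$,

```latex
% Predictive V-entropy: the best expected log-loss attainable inside the class V
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\left[ -\log f[x](y) \right]

% Predictive V-information: how much of Y's V-entropy access to X removes
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

Because computation (say, decrypting X) can bring a good predictor within the class's reach, $I_{\mathcal{V}}$ can grow under processing, which is exactly the advertised violation of the data processing inequality.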
Some philosophy is rubbish. Quite a lot, I believe. And with a statement such as "perceptions are caused by things external to the perceptions themselves", which I find unremarkable in itself as a prima facie obvious hypothesis to run with, there is a tendency for philosophers to go off the rails immediately by inventing precise definitions of words such as "perceptions", "are", and "caused", and elaborating all manner of quibbles and paradoxes. Hence the whole tedious catalogue of realisms.
Science did not get anywhere by speculating on whether there are four or five elements and arguing about their natures.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
Yeah, at the time I didn't know how shady some of the contracts here were. I do think funding a legal defense is a marginally better use of funds (though my guess is funding both is worth it).
On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
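To unpack that arithmetic with one possible decomposition (the split into copies and speedup is my assumption, not the post's):

```python
# Lead time converts to human-researcher-months as: copies * speedup * lead_months
copies, speedup = 15_000, 15   # hypothetical: 15k parallel agents at 15x human speed
lead_months = 1
print(copies * speedup * lead_months)  # 225,000 = "15,000 researchers for 15 months"
```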
Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happ...
Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to a...
This seems incredibly interesting to me. Googling “White-boarding techniques” only gives me results about digitally shared idea spaces. Is this what you’re referring to? I’d love to hear more on this topic.
Unfortunately, it looks like non-disparagement clauses aren't unheard of in general releases:
Release Agreements commonly include a “non-disparagement” clause – in which the employee agrees not to disparage “the Company.”
https://joshmcguirelaw.com/civil-litigation/adventures-in-lazy-lawyering-the-broad-general-release
...The release had a very broad definition of the company (including officers, directors, shareholders, etc.), but a fairly reasonable...
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying "scalable oversight" techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human rater...
Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
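For concreteness, a minimal caricature of what "deploying Debate" would even mean; every name below is a hypothetical placeholder rather than any lab's actual API:

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs

def debate(question: str,
           debater_a: Callable[[Transcript], str],
           debater_b: Callable[[Transcript], str],
           judge: Callable[[Transcript], str],
           rounds: int = 4) -> str:
    """Two models argue over an answer; a human (or weaker model) judges."""
    transcript: Transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(transcript)))  # argues for its answer
        transcript.append(("B", debater_b(transcript)))  # rebuts, argues its own
    return judge(transcript)  # oversight via judging arguments, not raw outputs
```

The hoped-for property is that judging a debate about a 1M-token context is easier than reading the whole context yourself.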
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Even worse, apparently the whole Superalignment...
Just checked which of the authors of the Weak-To-Strong Generalization paper are still at OpenAI:
Gone are:
Reason unknown ↩︎
I often struggle to find words and sentences that match what I intend to communicate.
Here are some problems this can cause:
Thank you, that is all very kind! ☺️☺️☺️
I expect if he continues being what he is, he'll produce lots of cool stuff which I'll learn from later.
I hope so haha
At what point should I post content as top-level posts rather than shortforms?
For example, a recent piece I posted as a shortform was ~250 concise words plus an image: 'Anthropics may support a 'non-agentic superintelligence' agenda'. It would be a top-level post on my blog if I had one set up (maybe soon :p).
Some general guidelines on this would be helpful.
Epic Lizka post is epic.
Also, I absolutely love the word "shard" but my brain refuses to use it because then it feels like we won't get credit for discovering these notions by ourselves. Well, also just because the words "domain", "context", "scope", "niche", "trigger", "preimage" (wrt a neural function/policy / "neureme") adequately serve the same purpose and are currently more semantically/semiotically granular in my head.
trigger/preimage ⊆ scope ⊆ domain
"niche" is a category in function space (including domain, operation, and codomain), "domain" is a set.
"scope" is great because of programming connotations and can be used as a verb. "This neural function is scoped to these contexts."
The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
Moore & Schatz (2017) made a similar point about different meanings of "overconfidence" in their paper The three faces of overconfidence. The abstract:
...Overconfidence has been studied in 3 distinct ways. Overestimation is thinking that you are better than you are. Overplacement is the exaggerated belief that you are better than others. Overprecision is the excessive faith that you know the truth. These 3 forms of overconfidence manifest themselves under different conditions, have different causes, and have widely varying consequences. It is a mistake...
For anyone interested in Natural Abstractions type research: https://arxiv.org/abs/2405.07987
Claude summary:
Key points of "The Platonic Representation Hypothesis" paper:
Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.
This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.
Scaling up model size, data quantity, and task diversity...
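If you want to poke at the convergence claim yourself, one standard representation-similarity measure is linear CKA (Kornblith et al., 2019); note the paper itself uses a mutual nearest-neighbor alignment metric, so this is a stand-in:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two feature matrices over the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 64))            # "model 1" features on 200 inputs
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
print(linear_cka(A, A @ Q))                   # 1.0: same information, rotated basis
print(linear_cka(A, rng.standard_normal((200, 32))))  # much lower: unrelated features
```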
This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.
Epistemic status: not a lawyer, but I've worked with a lot of them.
As I understand it, an NDA isn't enforceable against a subpoena (though the former employer can seek a protective order for the testimony). Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...
A subpoena for what?
Very Spicy Take
Epistemic Note:
Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.
Premise 1:
It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.
Premise 2:
This was the default outcome.
Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.
Premise 3:...
I don't see how this is relevant to my comment.
By "positive EV bets" I meant positive EV with respect to shared values, not with respect to personal gain.