I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t...
They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.
From my perspective, the only thing that keeps the OpenAI situation from being all kinds of terrible is that I continue to think they're not close to human-level AGI, so it probably doesn't matter all that much.
This is also my take on AI doom in general; my P(doom|AGI soon) is quite high (>50% for sure), but my P(AGI soon) is low. In fact, it has decreased over the last 12 months.
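To make the decomposition behind that explicit, here is a back-of-envelope sketch; the comment only commits to ">50%" and "low", so every number below is a hypothetical of mine:

```python
# P(doom) = P(doom | AGI soon) * P(AGI soon) + P(doom | no AGI soon) * P(no AGI soon)
p_agi_soon = 0.15             # "low" (illustrative value, mine)
p_doom_given_soon = 0.60      # ">50% for sure"
p_doom_given_not_soon = 0.20  # assumption: more time to work on alignment helps
p_doom = (p_doom_given_soon * p_agi_soon
          + p_doom_given_not_soon * (1 - p_agi_soon))
print(p_doom)  # ~0.26: overall risk stays moderate when P(AGI soon) is low
```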
Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting:
...On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts...
Because future rewards are discounted
Don't you mean future values? Also, AFAICT, the only thing going on here that separates online from offline RL is that offline RL algorithms shape the initial value function to give conservative behaviour. And so you get conservative behaviour.
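A minimal sketch of that mechanism (my own toy, assuming a CQL-style pessimism penalty on out-of-dataset actions; the environment and all constants are invented): with reward labels that are zero everywhere, the penalized backup alone makes the greedy policy hug the dataset, i.e. "survive".

```python
import numpy as np

# Toy chain MDP: states 0..4, state 4 is an absorbing "death" state.
# The offline dataset only ever demonstrates action 0 ("stay") in states 0-3.
n_states, n_actions = 5, 2        # actions: 0 = stay, 1 = step toward death
gamma, alpha = 0.99, 1.0          # discount; pessimism penalty for OOD actions

def next_state(s, a):
    return s if a == 0 else min(s + 1, n_states - 1)

in_dataset = np.zeros((n_states, n_actions), dtype=bool)
in_dataset[:4, 0] = True          # only "stay" in states 0-3 is in-distribution

# Penalized value iteration: all reward labels are zero ("wrong" rewards).
Q = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_new = np.empty_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            pen = 0.0 if in_dataset[s, a] else -alpha
            Q_new[s, a] = 0.0 + pen + gamma * Q[next_state(s, a)].max()
    Q = Q_new

print(Q.argmax(axis=1))  # [0 0 0 0 0]: the greedy policy stays in-distribution
```

Set `alpha = 0` and the preference vanishes, which is the parent's point: the conservative value shaping, not the return objective, is doing the work.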
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip in money for this. Key questions: ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Any recommendations on how I should do that? You may assume that I know what a gas chromatograph is and what a Petri dish is and why you might want to use either or both of those for data collection, but not that I have any idea of how to most cost-effectively access either one as some rando who doesn't even have an MA in Chemistry.
A Theory of Usable Information Under Computational Constraints
...We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting *predictive V-information* encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-information can be created through computation...
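For reference, the paper's central definitions, written from memory (double-check against the arXiv version): for a predictive family $\mathcal{V}$,

```latex
% Predictive V-entropy: the best expected log-loss attainable inside the class V
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\left[ -\log f[x](y) \right]

% Predictive V-information: how much of Y's V-entropy access to X removes
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

Because computation (say, decrypting X) can bring a good predictor within the class's reach, $I_{\mathcal{V}}$ can grow under processing, which is exactly the advertised violation of the data processing inequality.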
Some philosophy is rubbish. Quite a lot, I believe. And with a statement such as "perceptions are caused by things external to the perceptions themselves", which I find unremarkable in itself as a prima facie obvious hypothesis to run with, there is a tendency for philosophers to go off the rails immediately by inventing precise definitions of words such as "perceptions", "are", and "caused", and elaborating all manner of quibbles and paradoxes. Hence the whole tedious catalogue of realisms.
Science did not get anywhere by speculating on whether there are four or five elements and arguing about their natures.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
Yeah, at the time I didn't know how shady some of the contracts here were. I do think funding a legal defense is a marginally better use of funds (though my guess is funding both is worth it).
On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
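To unpack that arithmetic with one possible decomposition (the split into copies and speedup is my assumption, not the post's):

```python
# Lead time converts to human-researcher-months as: copies * speedup * lead_months
copies, speedup = 15_000, 15   # hypothetical: 15k parallel agents at 15x human speed
lead_months = 1
print(copies * speedup * lead_months)  # 225,000 = "15,000 researchers for 15 months"
```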
Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happ...
Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to a...
This seems incredibly interesting to me. Googling “White-boarding techniques” only gives me results about digitally shared idea spaces. Is this what you’re referring to? I’d love to hear more on this topic.
Unfortunately, it looks like non-disparagement clauses aren't unheard of in general releases:
Release Agreements commonly include a “non-disparagement” clause – in which the employee agrees not to disparage “the Company.”
https://joshmcguirelaw.com/civil-litigation/adventures-in-lazy-lawyering-the-broad-general-release
...The release had a very broad definition of the company (including officers, directors, shareholders, etc.), but a fairly reasonable...
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying "scalable oversight" techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human rater...
Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
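For concreteness, a minimal caricature of what "deploying Debate" would even mean; every name below is a hypothetical placeholder rather than any lab's actual API:

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs

def debate(question: str,
           debater_a: Callable[[Transcript], str],
           debater_b: Callable[[Transcript], str],
           judge: Callable[[Transcript], str],
           rounds: int = 4) -> str:
    """Two models argue over an answer; a human (or weaker model) judges."""
    transcript: Transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(transcript)))  # argues for its answer
        transcript.append(("B", debater_b(transcript)))  # rebuts, argues its own
    return judge(transcript)  # oversight via judging arguments, not raw outputs
```

The hoped-for property is that judging a debate about a 1M-token context is easier than reading the whole context yourself.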
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Even worse, apparently the whole Superalignment...
Just checked which of the authors of the Weak-To-Strong Generalization paper are still at OpenAI:
Gone are:
Reason unknown ↩︎
I often struggle to find words and sentences that match what I intend to communicate.
Here are some problems this can cause:
Thank you, that is all very kind! ☺️☺️☺️
I expect if he continues being what he is, he'll produce lots of cool stuff which I'll learn from later.
I hope so haha
At what point should I post content as top-level posts rather than shortforms?
For example, a recent piece I posted as a shortform was ~250 concise words plus an image: 'Anthropics may support a 'non-agentic superintelligence' agenda'. It would be a top-level post on my blog if I had one set up (maybe soon :p).
Some general guidelines on this would be helpful.
Epic Lizka post is epic.
Also, I absolutely love the word "shard" but my brain refuses to use it because then it feels like we won't get credit for discovering these notions by ourselves. Well, also just because the words "domain", "context", "scope", "niche", "trigger", "preimage" (wrt a neural function/policy / "neureme") adequately serve the same purpose and are currently more semantically/semiotically granular in my head.
trigger/preimage ⊆ scope ⊆ domain
"niche" is a category in function space (including domain, operation, and codomain), "domain" is a set.
"scope" is great because of programming connotations and can be used as a verb. "This neural function is scoped to these contexts."
The word "overconfident" seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
Moore & Schatz (2017) made a similar point about different meanings of "overconfidence" in their paper The three faces of overconfidence. The abstract:
...Overconfidence has been studied in 3 distinct ways. Overestimation is thinking that you are better than you are. Overplacement is the exaggerated belief that you are better than others. Overprecision is the excessive faith that you know the truth. These 3 forms of overconfidence manifest themselves under different conditions, have different causes, and have widely varying consequences. It is a mistake...
For anyone interested in Natural Abstractions type research: https://arxiv.org/abs/2405.07987
Claude summary:
Key points of "The Platonic Representation Hypothesis" paper:
Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.
This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.
Scaling up model size, data quantity, and task diversity...
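If you want to poke at the convergence claim yourself, one standard representation-similarity measure is linear CKA (Kornblith et al., 2019); note the paper itself uses a mutual nearest-neighbor alignment metric, so this is a stand-in:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two feature matrices over the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 64))            # "model 1" features on 200 inputs
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
print(linear_cka(A, A @ Q))                   # 1.0: same information, rotated basis
print(linear_cka(A, rng.standard_normal((200, 32))))  # much lower: unrelated features
```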
This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.
Epistemic status: not a lawyer, but I've worked with a lot of them.
As I understand it, an NDA isn't enforceable against a subpoena (though the former employer can seek a protective order for the testimony). Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...
A subpoena for what?
Very Spicy Take
Epistemic Note:
Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.
Premise 1:
It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.
Premise 2:
This was the default outcome.
Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.
Premise 3:...
I don't see how this is relevant to my comment.
By "positive EV bets" I meant positive EV with respect to shared values, not with respect to personal gain.