AI Safety Fundamentals: Alignment

Eliciting Latent Knowledge

June 17, 2024 BlueDot Impact Season 13
Chapters
4:12  Toy scenario: the SmartVault
5:43  How the SmartVault AI works: model-based RL
7:40  How it could go wrong: observations leave out key information
10:38  How we might address this problem by asking questions
12:43  Baseline: what you’d try first and how it could fail
14:07  Training strategy: generalize from easy questions to hard questions
15:48  Counterexample: why this training strategy won’t always work
16:41  Test case: prediction is done by inference on a Bayes net
17:22  How the prediction model works
19:52  How the humans answer questions
20:44  Isn’t this oversimplified and unrealistic?
23:17  Intended behavior: translate to the human’s Bayes net
25:57  Bad behavior: do inference in the human Bayes net
27:38  Would this strategy learn the human simulator or the direct translator?
28:55  Research methodology
30:34  Why focus on the worst case?
32:08  What counts as a counterexample for ELK?
34:37  Informal steps
35:44  Can we construct a dataset that separates “correct” from “looks correct to a human”?
37:42  Strategy: have a human operate the SmartVault and ask them what happened
38:29  How this defeats the previous counterexample
40:59  New counterexample: better inference in the human Bayes net
43:06  Strategy: have AI help humans improve our understanding
46:33  How this defeats the previous counterexample
47:05  New counterexample: gradient descent is more efficient than science
49:20  Strategy: have humans adopt the optimal Bayes net
50:57  How this defeats the previous counterexample
51:23  New counterexample: ontology mismatch
52:19  So are we just stuck now?
54:47  Ontology identification
55:31  Examples of ontology mismatches
57:55  Relationship between ontology identification and ELK
More Info

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: 


Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.
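
As a rough illustration of this setup (a hypothetical sketch, not ARC's implementation), the Python snippet below pairs a learned predictor with a simple random-search planner: the predictor maps a current observation and a candidate action sequence to predicted camera frames, and the planner keeps whichever sequence a learned "looks good to humans" score rates highest. All names here (predict_frames, looks_good_score, ACTION_SPACE) are made up for illustration.

```python
import random

ACTION_SPACE = ["open_door", "close_door", "move_camera", "noop"]  # toy actions

def predict_frames(observation, actions):
    # Toy stand-in for the learned prediction model; in practice this would be
    # a trained video model predicting what the cameras will show.
    return [f"{observation}+{action}" for action in actions]

def looks_good_score(frames):
    # Toy stand-in for a model of "how good does this predicted future look
    # to a human watching the cameras?"
    return -sum(frame.count("move_camera") for frame in frames)

def plan(observation, horizon=10, num_candidates=1000):
    # Random-shooting planner: sample action sequences and keep the one whose
    # *predicted* frames score highest. Note the failure mode described below:
    # a sequence that tampers with the cameras can also score highly.
    best_actions, best_score = None, float("-inf")
    for _ in range(num_candidates):
        actions = [random.choice(ACTION_SPACE) for _ in range(horizon)]
        score = looks_good_score(predict_frames(observation, actions))
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions
```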


But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.


In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?
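
A natural first attempt (the "baseline" discussed in the chapters above) is to train a separate "reporter" head that reads the predictor's internal state and answers questions, supervised on cases where humans can check the answer from the video. Below is a hypothetical PyTorch-style sketch of that idea, not code from the report; latent_dim, question_dim, reporter, and train_step are invented names, and the open question is whether anything trained this way generalizes to cases humans can't check.

```python
import torch
import torch.nn as nn

latent_dim, question_dim = 256, 64  # arbitrary placeholder sizes

# Small "reporter" head mapping (predictor latent state, encoded question)
# to yes/no logits.
reporter = nn.Sequential(
    nn.Linear(latent_dim + question_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)
optimizer = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(latent, question, human_answer):
    # One supervised step. The label comes from a human who watched the video,
    # so the reporter is only ever graded on what *looks* correct to a human;
    # that is why it might learn to simulate the human judge rather than
    # translate what the predictor actually knows.
    logits = reporter(torch.cat([latent, question], dim=-1))
    loss = loss_fn(logits, human_answer)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```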


We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment. 



Source:

https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#


Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.
