AI Safety Fundamentals: Alignment

Eliciting Latent Knowledge

June 17, 2024 BlueDot Impact Season 13
Chapters
4:12  Toy scenario: the SmartVault
5:43  How the SmartVault AI works: model-based RL
7:40  How it could go wrong: observations leave out key information
10:38  How we might address this problem by asking questions
12:43  Baseline: what you’d try first and how it could fail
14:07  Training strategy: generalize from easy questions to hard questions
15:48  Counterexample: why this training strategy won’t always work
16:41  Test case: prediction is done by inference on a Bayes net
17:22  How the prediction model works
19:52  How the humans answer questions
20:44  Isn’t this oversimplified and unrealistic?
23:17  Intended behavior: translate to the human’s Bayes net
25:57  Bad behavior: do inference in the human Bayes net
27:38  Would this strategy learn the human simulator or the direct translator?
28:55  Research methodology
30:34  Why focus on the worst case?
32:08  What counts as a counterexample for ELK?
34:37  Informal steps
35:44  Can we construct a dataset that separates “correct” from “looks correct to a human”?
37:42  Strategy: have a human operate the SmartVault and ask them what happened
38:29  How this defeats the previous counterexample
40:59  New counterexample: better inference in the human Bayes net
43:06  Strategy: have AI help humans improve our understanding
46:33  How this defeats the previous counterexample
47:05  New counterexample: gradient descent is more efficient than science
49:20  Strategy: have humans adopt the optimal Bayes net
50:57  How this defeats the previous counterexample
51:23  New counterexample: ontology mismatch
52:19  So are we just stuck now?
54:47  Ontology identification
55:31  Examples of ontology mismatches
57:55  Relationship between ontology identification and ELK
More Info

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: 


Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.
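
As a rough illustration of this setup (a hypothetical sketch, not ARC's implementation), the Python snippet below pairs a learned predictor with a simple random-search planner: the predictor maps a current observation and a candidate action sequence to predicted camera frames, and the planner keeps whichever sequence a learned "looks good to humans" score rates highest. All names here (predict_frames, looks_good_score, ACTION_SPACE) are made up for illustration.

```python
import random

ACTION_SPACE = ["open_door", "close_door", "move_camera", "noop"]  # toy actions

def predict_frames(observation, actions):
    # Toy stand-in for the learned prediction model; in practice this would be
    # a trained video model predicting what the cameras will show.
    return [f"{observation}+{action}" for action in actions]

def looks_good_score(frames):
    # Toy stand-in for a model of "how good does this predicted future look
    # to a human watching the cameras?"
    return -sum(frame.count("move_camera") for frame in frames)

def plan(observation, horizon=10, num_candidates=1000):
    # Random-shooting planner: sample action sequences and keep the one whose
    # *predicted* frames score highest. Note the failure mode described below:
    # a sequence that tampers with the cameras can also score highly.
    best_actions, best_score = None, float("-inf")
    for _ in range(num_candidates):
        actions = [random.choice(ACTION_SPACE) for _ in range(horizon)]
        score = looks_good_score(predict_frames(observation, actions))
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions
```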


But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.


In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?
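
A natural first attempt (the "baseline" discussed in the chapters above) is to train a separate "reporter" head that reads the predictor's internal state and answers questions, supervised on cases where humans can check the answer from the video. Below is a hypothetical PyTorch-style sketch of that idea, not code from the report; latent_dim, question_dim, reporter, and train_step are invented names, and the open question is whether anything trained this way generalizes to cases humans can't check.

```python
import torch
import torch.nn as nn

latent_dim, question_dim = 256, 64  # arbitrary placeholder sizes

# Small "reporter" head mapping (predictor latent state, encoded question)
# to yes/no logits.
reporter = nn.Sequential(
    nn.Linear(latent_dim + question_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)
optimizer = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(latent, question, human_answer):
    # One supervised step. The label comes from a human who watched the video,
    # so the reporter is only ever graded on what *looks* correct to a human;
    # that is why it might learn to simulate the human judge rather than
    # translate what the predictor actually knows.
    logits = reporter(torch.cat([latent, question], dim=-1))
    loss = loss_fn(logits, human_answer)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```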


We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment. 



Source:

https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#


Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.
