LessWrong (Curated & Popular)

“Three Subtle Examples of Data Leakage” by abstractapplic

This is a description of my work on some data science projects, lightly obfuscated and fictionalized to protect the confidentiality of the organizations I handled them for (and also to make it flow better). I focus on the high-level epistemic/mathematical issues, and the lived experience of working on intellectual problems, but gloss over the timelines and implementation details.

The Upper Bound

One time, I was working for a company which wanted to win some first-price sealed-bid auctions in a market they were thinking of joining, and asked me to model the price-to-beat in those auctions. There was a twist: they were aiming for the low end of the market, and didn't care about lots being sold for more than $1000.

"Okay," I told them. "I'll filter out everything with a price above $1000 before building any models or calculating any performance metrics!"

They approved of this, and told me [...]

---

Outline:

(00:27) The Upper Bound

(02:58) The Time-Travelling Convention

(05:56) The Tobit Problem

(06:30) My Takeaways

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
October 1st, 2024

Source:
https://www.lesswrong.com/posts/rzyHbLZHuqHq6KM65/three-subtle-examples-of-data-leakage

---

Narrated by TYPE III AUDIO.