Knowledge Science - Everything about AI, ML, and NLP
Episode 169 - English, AI-generated: KS Pulse - Show, Don't Tell; Alice in Wonderland
English version - a German version also exists, but the content differs only minimally:
AI-generated news of the day. The Pulse is an experiment to see whether it is interesting to get the latest news every day in small, roughly five-minute packages generated by an AI.
The episode is completely AI-generated; only the content is curated: Carsten and I select suitable news items, and the manuscript and audio file are then created automatically.
Accordingly, we cannot always guarantee accuracy.
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback - https://arxiv.org/pdf/2406.00888
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models - https://arxiv.org/pdf/2406.02061
Welcome to the Knowledge Science Pulse podcast, where we dive into the latest advances in AI research. I'm your host Sigurd, and joining me today is my co-host Carsten. I'm excited to discuss these two fascinating papers with you.
#### Thanks Sigurd, I'm thrilled to be here! Both papers offer intriguing insights into aligning and evaluating large language models. Shall we dive right in?
#### Absolutely! Let's start with the DITTO paper. The key idea is using a small number of user-provided demonstrations to align language models to specific tasks or preferences, right?
#### Exactly! DITTO leverages just a handful of demonstrations, treating them as preferred over the model's own outputs. This allows generating online comparison data to fine-tune the model without needing a large dataset.
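To make that intuition concrete: DITTO builds comparison pairs in which the user's demonstration is the preferred completion and the model's own earlier sample is the rejected one, then applies a DPO-style update against a frozen reference model. Below is a minimal sketch of that loss on toy log-probabilities; the function and variable names are illustrative assumptions, not the authors' implementation, and the real method iterates this online with a full language model.

```python
# Minimal sketch of a DITTO-style preference update (illustrative only).
# Demonstrations are treated as "chosen", the model's own samples as "rejected".
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: push the policy to prefer demonstrations over its own
    earlier outputs, measured relative to a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: log-probabilities for one demonstration vs. one model sample.
policy_chosen = torch.tensor([-12.3])    # log p_policy(demonstration)
policy_rejected = torch.tensor([-10.1])  # log p_policy(model's own output)
ref_chosen = torch.tensor([-12.5])
ref_rejected = torch.tensor([-10.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In this framing, no large preference dataset is needed: each new model sample can be paired against the handful of user demonstrations to keep generating fresh comparisons.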
#### That's quite innovative. And the results show DITTO outperforms baselines like supervised fine-tuning, few-shot prompting, and self-play methods across benchmarks and a user study, correct?
#### Yes, DITTO achieved an average 19 percentage point higher win-rate compared to the baselines. Notably, it also outperformed few-shot prompting with the more powerful GPT-4 model by 18 points.
#### Impressive! The user study adds valuable real-world validation. Participants provided demonstrations for email writing tasks, and DITTO's outputs were strongly preferred over the baselines.
#### Indeed, and an important finding was that demonstrations are much more sample-efficient for individual users compared to collecting pairwise preferences. DITTO opens up promising possibilities for customizing models to specific users or domains.
#### Well said! Let's move on to the second paper now, which takes a critical look at the reasoning capabilities of state-of-the-art language models using a deceptively simple task.
#### Right, the "Alice in Wonderland" or AIW problem. It's a short, common sense reasoning question about the number of sisters Alice's brother has, given information about Alice's siblings. Shockingly, most SOTA models fail dramatically on this!
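For concreteness, the AIW question has the form "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?", and the correct answer is M + 1, since each brother's sisters are Alice plus her M sisters. Below is a minimal sketch of how such prompts and their ground truth could be generated for evaluation; `aiw_prompt` is a hypothetical helper, and the paper's actual prompt variations differ in wording.

```python
# Minimal sketch of the AIW prompt family and its ground truth (illustrative).
def aiw_prompt(n_brothers: int, n_sisters: int) -> tuple[str, int]:
    prompt = (f"Alice has {n_brothers} brothers and she also has "
              f"{n_sisters} sisters. How many sisters does Alice's brother have?")
    correct = n_sisters + 1  # each brother's sisters are Alice plus her sisters
    return prompt, correct

# A few instances of the task with their expected answers.
for n, m in [(3, 6), (4, 1), (2, 4)]:
    question, answer = aiw_prompt(n, m)
    print(question, "->", answer)
```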
#### That's concerning, given the strong reasoning capabilities often attributed to these models based on their benchmark performance. What were some of the key observations?
#### Well, apart from the low success rates, models expressed high overconfidence in their incorrect answers. They often generated convincing but nonsensical "reasoning" to justify the wrong solutions.
#### Almost like confabulation then, misleadingly making the answers sound plausible to humans. Did the paper test any interventions to elicit correct responses?
#### They tried enhanced prompting, asking models to reconsider, and providing an even harder AIW+ variation. But the models kept failing, generating more confabulated explanations while arriving at the same wrong answers.
#### That points to serious fundamental reasoning deficits that current benchmarks are failing to surface. As you mentioned, models scoring highly on reasoning benchmarks still flopped on the AIW problem.
#### Absolutely. The authors stress the need for the ML community to develop better, falsifiable benchmarks that properly assess models' reasoning skills and reveal such weaknesses. Current benchmarks seem inadequate.
#### I couldn't agree more. It's crucial for research progress and for avoiding misleading hype about model capabilities. The paper's call for open, reproducible model creation pipelines to enable proper analysis is also on point.
#### Well said, Sigurd. These two papers offer valuable and contrasting perspectives - one advancing techniques for aligning models to users, the other challenging our understanding of model reasoning and highlighting the need for more rigorous evaluation.
#### Indeed, lots of food for thought! Thank you Carsten for the engaging discussion. Dear listeners, we hope you enjoyed this deep dive. Join us again next time on the Knowledge Science Pulse!