Mar 28 · Liked by Elliot Hershberg

Note: the GPT-4 scores are suspicious. It's possible the scores reflect it regurgitating or rephrasing answers that were in its training data, not actually generating correct answers to questions it has never seen before.

(It can definitely sometimes give correct answers to questions it has never seen, but not reliably. The paper claims to check for data contamination in its testing, but the procedure described seems weak to me: they look for substring matches of the questions, and I expect the internet has plenty of posts like "the test asks something like X, and here's the answer," phrased precisely to evade test-makers searching for test cheat sheets. Also, someone tried to reproduce the Leetcode scores reported by OpenAI: GPT-4 is good at Leetcode problems posted before its training cutoff date, and bad at problems posted after.)
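To illustrate the weakness described above, here is a minimal sketch (the corpus and question strings are hypothetical) of a substring-based contamination check, showing how a paraphrased leak slips past it:

```python
# Hypothetical training documents: a leaked answer, rephrased rather than quoted.
corpus = [
    "The test asks something like: what is the capital of France? Answer: Paris.",
]

def contaminated(question: str, corpus: list[str]) -> bool:
    # Flag a question only if it appears verbatim (case-insensitive substring)
    # in some training document -- the style of check the paper describes.
    q = question.lower()
    return any(q in doc.lower() for doc in corpus)

exact = "What is the capital of France?"
paraphrase = "What is France's capital city?"

print(contaminated(exact, corpus))       # True: verbatim leak is caught
print(contaminated(paraphrase, corpus))  # False: rephrased leak goes undetected
```

The second call returning False is the failure mode: the answer is in the training data, but no substring of the original question matches, so the check reports no contamination.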

Mar 27 · Liked by Elliot Hershberg

Yee Haw to all this!!
