Note, the GPT-4 scores are suspicious, it's possible that the scores reflect it regurgitating/rephrasing answers that were in its training data, not actually generating correct answers it has never seen before.
(It can definitely sometimes give correct answers to questions it has never seen, but not reliably. The paper claims to check for data contamination in their testing, but the procedure they describe seems weak to me. They look for substring matches of the questions, and I expect the internet will have a lot of "the test asks something like X and here's the answer" to evade test-makers searching for test cheatsheets. Also, someone tried to reproduce the Leetcode scores reported by OpenAI. GPT-4 is good at Leetcode problems posted before the training cutoff date, and bad at problems posted after.)
Note, the GPT-4 scores are suspicious, it's possible that the scores reflect it regurgitating/rephrasing answers that were in its training data, not actually generating correct answers it has never seen before.
(It can definitely sometimes give correct answers to questions it has never seen, but not reliably. The paper claims to check for data contamination in their testing, but the procedure they describe seems weak to me. They look for substring matches of the questions, and I expect the internet will have a lot of "the test asks something like X and here's the answer" to evade test-makers searching for test cheatsheets. Also, someone tried to reproduce the Leetcode scores reported by OpenAI. GPT-4 is good at Leetcode problems posted before the training cutoff date, and bad at problems posted after.)
Duly noted! Worthwhile to point out.
Yee Haw to all this!!
Yee Haw indeed!