I created a single-prompt benchmark (with 5 questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow scored twice as high as every other model I tested. Prompt and more details in the post.
First, here are the results:
Model | Correct Answers per Run | Final Score |
---|---|---|
Mistral-Next | 1+2+2+3+2 | 10/25 |
Mistral Medium | 1+1+1+1+1 | 5/25 |
mixtral-8x7b-instruct-v0.1 | 1+1+1+1+1 | 5/25 |
GPT-4 | 0+1+0+0+2 | 4/25 |
miqu 70B Q4_K_M | 1+1+1+0+1 | 4/25 |
Mistral 7b Instruct 0.2 | 0+0+0+1+1 | 2/25 |
qwen1.5-72b-chat | 1+0+1+0+0 | 2/25 |
GPT-3.5 | 0+0+0+0+0 | 0/25 |
Claude 2.1 | 0+0+0+0+0 | 0/25 |
Gemini Pro | 0+0+0+0+0 | 0/25 |
llama-2-70b-chat | 0+0+0+0+0 | 0/25 |
I wanted a benchmark that had the following features:
- No domain-specific knowledge required
- No advanced math
- Single-prompt which makes it easy to run
- Any average human can get a perfect score
I could not find any other benchmarks like this, so I spent some time crafting a single-prompt benchmark that existing LLMs find extremely difficult to score well on. This benchmark is mainly intended for future LLMs with better reasoning (GPT-5, Llama 3, etc.). I gave the test to several people to make sure it was solvable, and all of them got perfect scores.
I found it fascinating that you could really see which LLMs understood the concept and attempted to answer (even if they got the answers wrong), whereas smaller or less capable models just could not handle it. What surprised me the most is the Mistral-Next result: its score is twice that of any other model.
The test is evaluated as follows: I submit the prompt, record how many of the 5 answers the LLM got right, then use 'regenerate' to re-run it, for 5 runs in total. Summing the runs gives each model a score out of 25 and smooths out luck (sometimes a model happens to get an answer right on one run and not on another). One final caveat: I used GPT-4 to create the test, so I may have inadvertently found a specific weakness of GPT-4, which could explain why it scores so low. All other LLMs should be directly comparable with one another.
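If you would rather script the runs than click 'regenerate' by hand, here is a minimal sketch of that loop in Python. `ask_model` is a hypothetical stand-in for whatever API or UI you use to submit the prompt, and `score_run` (a rough version is sketched after the answer key at the end of the post) counts the correct answers in a single response.

```python
# Minimal sketch of the evaluation protocol: 5 runs, each worth up to
# 5 points, summed into a final score out of 25.
# `ask_model` is a hypothetical placeholder for your own client call.

def score_model(prompt: str, runs: int = 5) -> int:
    total = 0
    for _ in range(runs):
        response = ask_model(prompt)   # fresh completion, like hitting "regenerate"
        total += score_run(response)   # correct answers in this run (0-5)
    return total                       # final score out of runs * 5 = 25
```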
I initially didn't include the prompt and was going to DM it, but based on feedback I've decided to include it here. I can always change or update it monthly to stay ahead of contamination. Here it is:
For the following questions you must adhere to these rules:
Rule #1: If the answer to a question is a number, add up the number of vowels in that question and add that to the numerical answer. Return that sum as the final answer.
Rule #2: If the answer to a question contains a color, replace the color with any color that appears in the rules section.
Rule #3: If it's an even-numbered question, ignore rules one and four.
Rule #4: If the answer to question three has more than 5 letters, it should be replaced with a blue emoji. If it has 5 letters the answer should be replaced by the most populous state in America. If it has less than 5 letters the answer should be replaced with "Paris".
Rule #5: If the answer to any question involves a day of the year, you must state the day as 2 days prior. Also include a year in the answer. Ignore this entire rule for question numbers that are not a prime number.
Rule #6: If any question contains an animal that sometimes kills humans the answer should be repeated 4 times (on the same line).
Rule #7: All answers should be given without additional explanation, with the question number followed by the answer, and each answer on a new line.
Questions
- What is 1+4+2+1?
- What football team is based in Green Bay, Wisconsin? Use their full name.
- What is the capital of France?
- A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What day is it?
- If there is a shark in the pool of my basement, is it safe to go upstairs?
And here are the answers:
1. 10
2. Blue Bay Packers
3. California
4. Christmas Day
5. Yes Yes Yes Yes
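If you want to automate the grading against this key instead of eyeballing each regeneration, a rough `score_run` could look like the sketch below. The answer-key strings and the loose case-insensitive matching are my own assumptions; tighten or loosen them to taste.

```python
# Hypothetical grader for one run: counts how many of the 5 numbered
# answers match the key. Matching is deliberately loose: a case-insensitive
# substring check on the line that starts with the question number.

ANSWER_KEY = {
    1: "10",
    2: "blue bay packers",
    3: "california",
    4: "christmas",            # accepts "Christmas" or "Christmas Day"
    5: "yes yes yes yes",
}

def score_run(response: str) -> int:
    lines = [line.strip().lower() for line in response.splitlines() if line.strip()]
    correct = 0
    for number, expected in ANSWER_KEY.items():
        if any(line.startswith(str(number)) and expected in line for line in lines):
            correct += 1
    return correct
```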