A leaderboard based on my needs
I want to create a leaderboard based on my needs. However, I found that when I run the same evaluation prompt multiple times, the evaluation scores differ from run to run, and sometimes they deviate a lot.
I tested gpt-4o-mini, claude-3.5-sonnet, and deepseek-chat with the temperature set to 0, but the scores still vary between runs. Is there a trick I don't know about that can stabilize the evaluation score?
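
To put a number on how much the scores move, a rough sketch like the one below could be used. It is only an illustration under assumptions: `run_eval` is a hypothetical stand-in for whatever function sends the evaluation prompt to the model and parses a numeric score out of the reply. It is not part of any library, and `score_spread` is a name made up here.

```python
import statistics
from typing import Callable, Dict


def score_spread(
    run_eval: Callable[[str], float],  # hypothetical: prompt in, numeric score out
    prompt: str,
    n_runs: int = 5,
) -> Dict[str, float]:
    """Run the same evaluation prompt n_runs times and summarize the scores."""
    scores = [run_eval(prompt) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

Calling `score_spread(my_eval_fn, prompt, n_runs=10)` for each model would show whether the deviation is a few tenths of a point or something larger, and reporting the mean over several runs is one simple way to make a leaderboard entry less noisy.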