Bogus outliers are throwing off the leaderboard and scores

#1
by squid2 - opened

Several submitted benchmark results appear to be fake, and they are throwing off both the scaling of the performance scores and the leaderboard rankings. For example:

  • An iPhone 11 Pro supposedly ran bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf with a preprocessing rate of 226935.1 T/s and a token generation rate of 2409.49 T/s
  • An iPhone 15 Pro supposedly ran LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct-GGUF/EXAONE-3.5-7.8B-Instruct-Q4_K_M.gguf with a preprocessing rate of 116609.27 T/s and token generation at 2442.14 T/s
  • An iPhone 14 Pro Max supposedly ran bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-IQ2_M.gguf with a preprocessing rate of 47463.46 T/s and token generation at 785.98 T/s
  • An iPhone 15 Pro Max supposedly ran Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q5_k_m.gguf with a preprocessing rate of 22651.11 T/s and token generation at 127.96 T/s

All of these results (and many others) are almost 3 orders of magnitude away from reality and obviously invalid. They distort the leaderboard and the performance scores, because scores are relative to the fastest device, which is scored 100.
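To illustrate the scaling problem, here is a minimal sketch. It assumes the score is simply 100 × (device rate / fastest rate), which is my reading of "relative to the fastest device" and not something confirmed from the leaderboard code. The two "plausible" rates are made-up values for illustration; only 2409.49 T/s comes from the list above.

```python
def relative_scores(tokens_per_second):
    """Scale each entry so the fastest one scores 100 (assumed formula)."""
    fastest = max(tokens_per_second.values())
    return {name: 100.0 * tps / fastest for name, tps in tokens_per_second.items()}


# Hypothetical, plausible token generation rates in T/s.
plausible = {"device_a": 20.0, "device_b": 10.0}

# The same data plus the 2409.49 T/s submission quoted above.
with_outlier = {**plausible, "suspect_submission": 2409.49}

clean = relative_scores(plausible)
skewed = relative_scores(with_outlier)

print(clean)   # device_a scores 100, device_b scores 50
print(skewed)  # both real devices drop below 1 out of 100
```

With a single bogus entry as the reference point, every legitimate device is squeezed into the bottom 1% of the scale, so differences between real devices become invisible.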

Bogus results need to be flagged or removed (perhaps in an automated manner), and the leaderboard should ignore these outliers. Perhaps the leaderboard should use the median results for each device model, rather than the maximum and mean.
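As a rough sketch of what that could look like: group submissions by device model, take the median, and flag anything far above it. The function names, the `max_ratio=10.0` threshold, and the three plausible sample rates are all hypothetical choices for illustration; only the 2409.49 T/s value comes from the list above.

```python
from statistics import mean, median


def device_medians(results):
    """Map each device model to the median of its submitted T/s values."""
    by_device = {}
    for device, tps in results:
        by_device.setdefault(device, []).append(tps)
    return {device: median(values) for device, values in by_device.items()}


def flag_outliers(results, max_ratio=10.0):
    """Return submissions more than max_ratio times their device's median.

    max_ratio is an arbitrary illustrative threshold, not a tuned value.
    """
    medians = device_medians(results)
    return [(d, tps) for d, tps in results if tps > max_ratio * medians[d]]


# Three hypothetical plausible submissions plus the suspect one.
sample = [
    ("iPhone 11 Pro", 18.0),
    ("iPhone 11 Pro", 20.0),
    ("iPhone 11 Pro", 22.0),
    ("iPhone 11 Pro", 2409.49),
]

print(device_medians(sample))            # median stays near the real values
print(mean(tps for _, tps in sample))    # mean is dragged up by the outlier
print(flag_outliers(sample))             # only the suspect submission
```

This only works while bogus submissions are a minority for a given device. If most results for a model are fake, the median itself is wrong and a sanity bound based on hardware limits would be needed instead.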

Hey, thanks for pointing out these issues. I'm looking into adding ranking methods like Elo or TrueSkill, which should be much more robust to outliers. But yeah, we need to remove them anyway.

Implemented and deployed the Glicko-2 ranking system. Since it is based on binary comparisons (win/loss), it is less prone to outliers. That is also a downside: it only considers the win/loss outcome and does not account for the margin of victory, i.e. how much faster one device is than the other. But it should work OK, as long as we have enough data.
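To make the trade-off concrete, here is a small sketch of how a win/loss comparison discards the margin. It uses a plain Elo update as a stand-in, not Glicko-2 (which also tracks rating deviation and volatility), and it is not the deployed code. The function names, the starting rating of 1500, and `k=32.0` are assumptions for illustration.

```python
def outcome(tps_a, tps_b):
    """Return 1.0 if A is faster, 0.0 if B is faster, 0.5 for a tie."""
    if tps_a > tps_b:
        return 1.0
    if tps_a < tps_b:
        return 0.0
    return 0.5


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update for one comparison; returns both new ratings."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# A narrow win and an absurdly large win produce the same outcome,
# so a bogus 2409.49 T/s result gains no more than a 22 T/s one would.
narrow = outcome(22.0, 20.0)
huge = outcome(2409.49, 20.0)

print(narrow, huge)
print(elo_update(1500.0, 1500.0, narrow))
print(elo_update(1500.0, 1500.0, huge))
```

A fake result can still win every comparison it appears in, so the rating damage is bounded but not zero, which is why removing the bogus entries is still needed.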

a-ghorbani changed discussion status to closed
