eval_harness_v043_updates

#10
by meg HF staff - opened
No description provided.

Change 1

WHAT: Updates requirements.txt to the newest lm_eval version, 0.4.3, which also requires accelerate>=0.26.0.
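
For reference, the corresponding requirements.txt pins would look roughly like this (the exact pin style is my assumption; the PR only states the two version constraints):

```
lm_eval==0.4.3
accelerate>=0.26.0
```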

Change 2

WHAT: Removes the no_cache argument from the lm_eval simple_evaluate call (a sketch of the updated call follows the file list below).
WHY: no_cache (bool) was replaced with use_cache (str), a path to a sqlite db file for caching model responses, or None if not caching; see https://github.com/EleutherAI/lm-evaluation-harness/commit/fbd712f723d39e60949abeabd588f1a6f7fb8dcb#diff-6cc182ce4ebf9431fdf0ef577412f518d45396d4153a3825496304fa0f857c2d
FILES AFFECTED:

  • src/backend/run_eval_suite_harness.py
  • main_backend_harness.py
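
A minimal sketch of the updated simple_evaluate call under v0.4.3; the model args and task names here are placeholders, not the template's actual values:

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
    num_fewshot=0,
    device="cuda:0",
    # v0.4.3: use_cache is a path to a sqlite db for caching model responses,
    # or None to disable caching; the old no_cache bool no longer exists.
    use_cache=None,
)
```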

Change 3

WHAT: Changes the import so that run_auto_eval calls the lm_eval Harness evaluation suite, not lighteval (see the sketch after the file list below).
WHY: The description of the templates specifies that the Harness is being used: "launches evaluations through the main_backend.py file, using the Eleuther AI Harness."
FILES AFFECTED:

  • main_backend_harness.py
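
The change amounts to swapping which evaluation suite main_backend_harness.py imports; the module path and function name below follow the template's file layout and are my assumption of the exact identifiers:

```python
# before (lighteval backend):
# from src.backend.run_eval_suite_lighteval import run_evaluation

# after (Eleuther AI Harness backend):
from src.backend.run_eval_suite_harness import run_evaluation
```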

Change 4

WHAT: Sets batch_size to "auto" (see the sketch after the file list below).
WHY: The Harness will automatically determine the batch size based on the compute the user has set up.
FILES AFFECTED:

  • main_backend_harness.py
  • src/backend/run_eval_suite_harness.py (a typing change to accept the "auto" string)
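
A sketch of the typing change, with the signature reduced to the relevant parameter (the real run_evaluation takes more arguments):

```python
from typing import Union

def run_evaluation(batch_size: Union[int, str] = "auto") -> None:
    """batch_size may be a fixed int or the literal string "auto",
    in which case the Harness picks the largest batch that fits."""
    ...
```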

Change 5

WHAT: Additional updates to src/backend/run_eval_suite_harness.py for running the Harness code in v0.4.3 (a combined sketch follows the file list):

  • The ALL_TASKS constant as previously defined is deprecated. This commit introduces another way to get those same values, using TaskManager(). NB: there appears to be an alternative option that I have not tested: from lm_eval.api.registry import ALL_TASKS
  • Specifies "hf" as the model value, which is the recommended default. The previously defined "hf-causal-experimental" has been deprecated. See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1235#issuecomment-1873940238
  • Removes the output_path argument, which is no longer supported in lm_eval simple_evaluate. See: https://github.com/EleutherAI/lm-evaluation-harness/commit/6a2620ade383b8d30592fc2342eb1d213ad4b4cb NB: There may be an option to add something similar or comparable in another way, which I'm not experimenting with here. The argument log_samples, for example, might be added here and set to True.
  • Additional minor fix: the definition of device uses "gpu:0"; I think "cuda:0" is meant.

FILES AFFECTED:

  • src/backend/run_eval_suite_harness.py
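
Putting the v0.4.3 pieces together, a sketch of what the updated call in run_eval_suite_harness.py might look like; the model args and task names are placeholders, and the log_samples line is the optional addition mentioned above, not something this PR commits to:

```python
from lm_eval import evaluator
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
all_tasks = task_manager.all_tasks  # replaces the deprecated ALL_TASKS constant

results = evaluator.simple_evaluate(
    model="hf",                   # "hf-causal-experimental" is deprecated
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],          # any subset of all_tasks
    num_fewshot=0,
    batch_size="auto",
    device="cuda:0",              # not "gpu:0"
    use_cache=None,
    task_manager=task_manager,
    # output_path is no longer a simple_evaluate argument; serialize `results`
    # from the caller instead. log_samples=True keeps per-sample outputs.
    log_samples=True,
)
```
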
meg changed pull request status to open

LGTM, thanks!

clefourrier changed pull request status to merged