Commit b335ab8 · Zachary Siegel committed · 1 parent: 2faf3bd

scaffold for core bench
app.py
CHANGED
@@ -328,71 +328,58 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
         line-height: 1.6;
         color: #333;
     }
-
-
-
+    .button-container {
+        display: flex;
+        justify-content: center;
+        margin-top: 2px;
+    }
+    .button-container .button {
+        margin: 0 10px;
+        padding: 15px 25px;
         font-size: 1em;
         font-weight: bold;
-        color: #fff;
-        background-color: #3498db;
+        color: #fff !important; /* Force white text color */
+        background-color: #3498db !important; /* Force background color */
         border: none;
         border-radius: 5px;
-        text-decoration: none;
-
-        align-items: center;
+        text-decoration: none !important; /* Force no underline */
+        text-align: center;
         transition: background-color 0.3s ease;
+        cursor: pointer;
+        height: 50px;
     }
-    .button:hover {
-        background-color: #2980b9;
+    .button-container .button:hover {
+        background-color: #2980b9 !important; /* Force hover color */
     }
-    .button
-
-        height: 20px;
+    .button:visited {
+        color: #fff; /* Keep text color white when link is visited */
     }
     </style>
 
-    <div class="
-    <
-
-
-    <a href="https://arxiv.org/abs/2409.11363" class="button">
-        <img src="https://example.com/favicon-paper.png" alt="Paper Icon"> View Paper
-    </a>
-    </div>
-    </div>
-    <div class="feature-column">
-        <div class="feature-keyword">Github</div>
-        <div class="feature-content">
-            <a href="https://github.com/siegelz/core-bench" class="button">
-                <img src="https://example.com/favicon-github.png" alt="Github Icon"> View Github
-            </a>
-        </div>
-    </div>
-    <div class="feature-column">
-        <div class="feature-keyword">Dataset</div>
-        <div class="feature-content">
-            <a href="https://huggingface.co/datasets/siegelz/core-bench" class="button">
-                <img src="https://example.com/favicon-dataset.png" alt="Dataset Icon"> View Dataset
-            </a>
-        </div>
-    </div>
+    <div class="button-container">
+        <a href="https://arxiv.org/abs/2409.11363" class="button">Paper</a>
+        <a href="https://github.com/siegelz/core-bench" class="button">Github</a>
+        <a href="https://huggingface.co/datasets/siegelz/core-bench" class="button">Dataset</a>
     </div>
+
     </br>
     <h2 class="section-heading" id="leaderboards">Leaderboards</h2>
-    <p>
+    <p>
+    CORE-Bench evaluates the ability of agents to computationally reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. The benchmark has tasks at three difficulty levels:
+    </p>
+    <p>
+    <i><b>CORE-Bench-Hard:</b></i> The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level.
+    </p>
+    <p>
+    <i><b>CORE-Bench-Medium:</b></i> The agent is given a Dockerfile and instructions on how to use the Dockerfile to fully reproduce the paper. This level mainly evaluates agents ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the above level.
+    </p>
+    <p>
+    <i><b>CORE-Bench-Easy:</b></i> The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
+    </p>
     """)
 
     with gr.Tabs() as tabs:
-        with gr.Tab("CORE-Bench"):
-            gr.Markdown("""
-            CORE-Bench evaluates the ability of agents to computationally reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. The benchmark has tasks at three difficulty levels:
-
-            <b>CORE-Bench-Easy</b>: The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
-
-            <b>CORE-Bench-Medium</b>: The agent is given a Dockerfile and instructions on how to use the Dockerfile to fully reproduce the paper. This level mainly evaluates agents ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the above level.
-
-            <b>CORE-Bench-Hard</b>: The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level.
-            """)
+        with gr.Tab("CORE-Bench-Hard"):
             with gr.Row():
                 with gr.Column(scale=2):
                     Leaderboard(
@@ -405,25 +392,11 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
                         hide_columns=config.USACO_HIDE_COLUMNS,
                         search_columns=config.USACO_SEARCH_COLUMNS,
                     )
-            gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
+            # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
             with gr.Row():
-                gr.Markdown("### Accuracy vs. Cost")
+                gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Hard")
             with gr.Row():
                 scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
-
-            gr.HTML('<div style="height: 30px;"></div>')
-            gr.Markdown("## Task success heatmap")
-            gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
-            with gr.Row():
-                task_success_heatmap = gr.Plot()
-            demo.load(
-                lambda: create_task_success_heatmap(
-                    preprocessor.get_task_success_data('usaco'),
-                    'USACO'
-                ),
-                outputs=[task_success_heatmap]
-            )
-
 
     # Will trigger autoscaling of plots when tabs are switched
     tabs.select(fn=None, inputs=None, outputs=None, js="""
@@ -435,8 +408,6 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
     """)
     gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>""")
     gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
-    gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>""")
-    gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
     gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
     gr.Markdown("""Coming soon...""")
 
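The leaderboard and scatter plot above are both fed by `parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco', aggregate=False)`. The real helper lives elsewhere in the HAL codebase; below is a minimal, hypothetical sketch of what such a loader might look like, assuming one JSON record per eval file with illustrative field names (`agent_name`, `accuracy`, `total_cost` are assumptions, not the actual schema):

```python
import json
import os
from typing import Any, Dict, List

def parse_json_files(directory: str, benchmark: str, aggregate: bool = True) -> List[Dict[str, Any]]:
    """Hypothetical sketch: load one eval record per '<benchmark>*.json' file.

    Field names below are illustrative assumptions, not the real HAL schema.
    """
    rows: List[Dict[str, Any]] = []
    for name in sorted(os.listdir(directory)):
        if not (name.startswith(benchmark) and name.endswith(".json")):
            continue
        with open(os.path.join(directory, name)) as f:
            data = json.load(f)
        rows.append({
            "Agent Name": data["agent_name"],
            "Accuracy": data["accuracy"],
            "Total Cost": data["total_cost"],
        })
    if aggregate:
        # Collapse repeated runs to the best run per agent (highest accuracy),
        # mirroring the "run with the highest score is shown" convention above.
        best: Dict[str, Dict[str, Any]] = {}
        for row in rows:
            key = row["Agent Name"]
            if key not in best or row["Accuracy"] > best[key]["Accuracy"]:
                best[key] = row
        rows = list(best.values())
    return rows
```

With `aggregate=False`, as in the scatter-plot call, every run is kept so repeated runs of one agent appear as separate points.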