SemBench

Implement the Benchmark & Submit Results

Overview

There are two ways to implement SemBench queries in your system, depending on how your system is deployed. Code-based systems (Python packages) implement queries as Python methods. Query-based systems (SQL engines) write queries as standalone SQL files. Both approaches inherit from the same GenericRunner base class and produce results in a unified format.

Implementation Guide

This guide walks through the SemBench architecture step by step. Code-based systems (Python packages such as LOTUS) implement queries as Python methods in a runner class. Query-based systems (SQL engines such as BigQuery) write queries as standalone SQL files and use a minimal runner. Newer scenarios also support Code* mode, in which Python query files live in files/ and are loaded dynamically.

Note: The file contents shown below are simplified excerpts for illustration purposes. Please refer to the original source files for the complete implementation reference.

Architecture overview (class hierarchy and query dispatch, summarized from the diagram):

GenericRunner: get_system_name(), _discover_query_impl(), _discover_query_text()
GenericLotusRunner: execute_query(), _configure_lm()
GenericBigQueryRunner: execute_queries(), Jinja2 templating
LotusRunner (per scenario): _execute_q*()
BigQueryRunner (per scenario): minimal subclass
run.py: get_runner_class(), execute_query(s)()
Code path: _discover_query_impl() finds _execute_q*()
SQL path: _discover_query_text() finds Q{id}.sql

1 Run the Benchmark

The CLI entry point is src/run.py. You specify which system, scenario, and queries to run. The script dynamically loads the correct runner class for each system using get_runner_class().

run.py
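A minimal sketch of how such an entry point could resolve a runner class dynamically. The flag names (--system, --scenario, --queries), the RUNNER_CLASS_NAMES mapping, and the dotted module path are assumptions made for illustration (the path simply mirrors the per-scenario runner location described in this guide); the real src/run.py may differ.

```python
import argparse
import importlib

# Assumed mapping from CLI system name to runner class name (illustrative only).
RUNNER_CLASS_NAMES = {"lotus": "LotusRunner", "bigquery": "BigQueryRunner"}


def runner_module_name(system: str, scenario: str) -> str:
    """Dotted module path mirroring
    src/scenario/{scenario}/runner/{system}_runner/{system}_runner.py,
    assuming src/ is on the import path."""
    return f"scenario.{scenario}.runner.{system}_runner.{system}_runner"


def get_runner_class(system: str, scenario: str):
    """Import the per-scenario runner module and return its runner class."""
    module = importlib.import_module(runner_module_name(system, scenario))
    return getattr(module, RUNNER_CLASS_NAMES[system])


def parse_args(argv=None) -> argparse.Namespace:
    """Parse hypothetical CLI flags selecting system, scenario, and queries."""
    parser = argparse.ArgumentParser(description="Run SemBench queries (sketch)")
    parser.add_argument("--system", required=True)
    parser.add_argument("--scenario", required=True)
    parser.add_argument("--queries", nargs="+", type=int, default=[])
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--system", "lotus", "--scenario", "movie", "--queries", "1", "3"])
    # Only the module name is printed here; importing it requires the real repository.
    print(runner_module_name(args.system, args.scenario), args.queries)
```

Resolving the class by naming convention keeps the entry point free of per-system imports, so adding a system does not require editing the CLI beyond registering its class name.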

2 The Runner Base Class

Every system runner inherits from GenericRunner (located at src/runner/generic_runner.py). It defines the abstract interface (get_system_name(), execute_query()), sets up paths to data, queries, and results, and provides load_data() for reading CSVs. It also provides two query-discovery mechanisms: _discover_query_impl(query_id) finds a Python method named _execute_q{id} via reflection (used by Code mode), and _discover_query_text(query_id) locates a Q{id}.sql file in the system’s query directory (used by SQL mode). You should not need to modify this file.

generic_runner.py
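The two discovery mechanisms can be sketched as follows. This is a simplified stand-in, not the actual contents of generic_runner.py: the constructor signature, the query_dir attribute, and the ToyRunner subclass are hypothetical and exist only to make the example self-contained.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Optional


class GenericRunner(ABC):
    """Simplified stand-in for the base class (constructor is an assumption)."""

    def __init__(self, query_dir):
        self.query_dir = Path(query_dir)

    @abstractmethod
    def get_system_name(self) -> str: ...

    @abstractmethod
    def execute_query(self, query_id): ...

    def _discover_query_impl(self, query_id) -> Optional[Callable]:
        # Code mode: look up a method named _execute_q{id} via reflection.
        method = getattr(self, f"_execute_q{query_id}", None)
        return method if callable(method) else None

    def _discover_query_text(self, query_id) -> Optional[str]:
        # SQL mode: read Q{id}.sql from the system's query directory, if present.
        path = self.query_dir / f"Q{query_id}.sql"
        return path.read_text(encoding="utf-8") if path.is_file() else None


class ToyRunner(GenericRunner):
    """Hypothetical subclass that tries Code mode first, then SQL mode."""

    def get_system_name(self) -> str:
        return "toy"

    def execute_query(self, query_id):
        impl = self._discover_query_impl(query_id)
        if impl is not None:
            return impl()
        text = self._discover_query_text(query_id)
        if text is None:
            raise NotImplementedError(f"No implementation found for Q{query_id}")
        return text

    def _execute_q1(self):
        return "result of Q1"
```

With this layout, adding a query in Code mode means adding one method, and adding a query in SQL mode means adding one file; neither requires touching the dispatch logic.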

3 Build Your System’s Base Runner

Each system should implement its own intermediate base runner between GenericRunner and the per-scenario runners. Create it at src/runner/generic_{system}_runner/generic_{system}_runner.py. This is where shared, system-specific logic belongs: engine initialization, hyperparameter configuration, token usage extraction, and monetary cost calculation.

For example, GenericLotusRunner (at src/runner/generic_lotus_runner/) initializes the LOTUS LM, configures reasoning effort per model, calls _discover_query_impl() to dispatch to _execute_q*() methods, and extracts token statistics from lotus.settings.lm.stats.

GenericBigQueryRunner (at src/runner/generic_bigquery_runner/) sets up the BigQuery client, calls _discover_query_text() to load .sql templates, uses Jinja2 to substitute <<variable>> placeholders, and queries inference logs for cost tracking.

generic_lotus_runner.py generic_bigquery_runner.py
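As an illustration of the cost-tracking responsibility of a base runner, here is a minimal sketch that converts token counts into a monetary cost. The TokenUsage dataclass, the model name, and the per-million-token prices are all hypothetical placeholders; the actual runners extract usage from their own engines (for instance lotus.settings.lm.stats or BigQuery inference logs) and use their own price tables.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token counts accumulated over one query (hypothetical structure)."""
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per one million tokens: (input, output).
PRICES_PER_MILLION = {"example-model": (0.50, 1.50)}


def monetary_cost(usage: TokenUsage, model: str) -> float:
    """Return the USD cost of a query given its token usage and model."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = usage.prompt_tokens * input_price + usage.completion_tokens * output_price
    return total / 1_000_000


if __name__ == "__main__":
    usage = TokenUsage(prompt_tokens=2_000_000, completion_tokens=1_000_000)
    print(monetary_cost(usage, "example-model"))
```

Keeping this calculation in the system's base runner means every per-scenario runner reports cost the same way.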

4 Implement Your Queries

There are two approaches depending on your system’s mode:

Code mode: Create a per-scenario runner at src/scenario/{scenario}/runner/{system}_runner/{system}_runner.py and implement _execute_q*() methods. Each method loads data, applies semantic operators (e.g., sem_filter, sem_join, sem_map), and returns a DataFrame. See the movie LOTUS runner for examples.

SQL / Code* mode: Write standalone query files in files/{scenario}/query/{system}/. SQL queries (Q{id}.sql) are templates whose <<variable>> placeholders are substituted at runtime. Code* queries (Q{id}.py) are Python files with a run() function that receives the data directory. The per-scenario runner itself (at src/scenario/{scenario}/runner/{system}_runner/{system}_runner.py) is minimal: it only inherits from your system’s base runner and sets defaults.
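Placeholder substitution in a SQL template can be sketched with Jinja2 as below. This assumes the placeholders are handled by configuring Jinja2's variable delimiters as << and >>; the render_query helper, the template text, and the variable names are illustrative and not taken from the SemBench sources.

```python
from jinja2 import Environment, StrictUndefined

# Assumption: << and >> replace Jinja2's default {{ and }} delimiters.
# StrictUndefined makes a missing variable raise instead of rendering as empty.
env = Environment(
    variable_start_string="<<",
    variable_end_string=">>",
    undefined=StrictUndefined,
)


def render_query(sql_template: str, **variables) -> str:
    """Substitute <<variable>> placeholders in a SQL template string."""
    return env.from_string(sql_template).render(**variables)


# Hypothetical template in the style of a Q{id}.sql file.
TEMPLATE = "SELECT title FROM <<dataset>>.movies LIMIT <<limit>>"

if __name__ == "__main__":
    print(render_query(TEMPLATE, dataset="sembench", limit=5))
```

Using non-default delimiters avoids clashes with braces that may appear in SQL string literals or JSON fragments inside a query.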

lotus_runner.py bigquery_runner.py Q1.sql Q1.py
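Dynamic loading of a Code* query file might look like the sketch below, which uses the standard importlib machinery. The helper names load_query_module and run_code_star_query are hypothetical; the only details taken from this guide are the Q{id}.py file name and a run() function that receives the data directory.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_query_module(query_file):
    """Load a Python file as a module without requiring it to be on sys.path."""
    query_file = Path(query_file)
    spec = importlib.util.spec_from_file_location(query_file.stem, query_file)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load query file: {query_file}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_code_star_query(query_dir, query_id, data_dir):
    """Load Q{id}.py from query_dir and call its run() with the data directory."""
    module = load_query_module(Path(query_dir) / f"Q{query_id}.py")
    return module.run(data_dir)


if __name__ == "__main__":
    # Demo with a throwaway query file; a real query would return a DataFrame.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "Q1.py").write_text(
            "def run(data_dir):\n    return f'ran on {data_dir}'\n",
            encoding="utf-8",
        )
        print(run_code_star_query(tmp, 1, "data/movie"))
```

Because each query lives in its own file, a new Code* query can be added without modifying the runner class.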

System × Scenario Matrix

The table below shows how each system implements queries for each scenario. Code = Python methods in the runner; Code* = external Python query files loaded dynamically; SQL = standalone SQL files; Hybrid = combination of approaches.

System      Movie   Animals  MMQA    Ecomm   Medical  Cars
LOTUS       Code    Code     Code    Code*   Code*    Code*
Palimpzest  Code    Code     Code    Code*   Code*    Code*
BigQuery    SQL     SQL      SQL     SQL     SQL      SQL
ThalamusDB  Hybrid  Hybrid   Hybrid  Hybrid  Hybrid   Hybrid

Submit Results

Coming Soon

The submission process for uploading benchmark results to the SemBench leaderboard is currently under discussion. Stay tuned for updates!
