The MultiNet project is built upon GenESIS (Generalizable and Extendable Stratified Inference System), a modular framework designed to streamline the integration of diverse AI models across a wide range of tasks and datasets. It allows new models and datasets to be added to the MultiNet benchmark efficiently and without disrupting existing components. At its core, GenESIS aims to maximize scalability and engineering efficiency through the following principles:
- Interchangeability: Models and datasets are designed to be mutually compatible and easily swappable.
- Abstraction: Common architectural patterns and logic are shared across modules, simplifying development, understanding, and testing. This is significantly advanced by the use of base module classes.
- Encapsulation: Developers can concentrate on their specific model or dataset without requiring deep knowledge of unrelated modules.
The latest version of GenESIS introduces base module classes, which improve maintainability and reduce the engineering overhead of evaluating future dataset additions. It also adds batch inference capability, which reduces the chance of process failure during evaluation and vastly increases evaluation speed.
For example, GenESIS is used to evaluate models like GPT-4o on datasets such as OpenX and Procgen. For general-purpose Vision-Language Models (VLMs), this involves setting up appropriate instruction prompts, preprocessing input data from various datasets into a consumable format, managing API interactions (including chat history for synchronous calls or batch job submissions for asynchronous ones), and converting model outputs back into a format suitable for evaluation against task-specific labels. GenESIS modularizes and abstracts these processes, separating dataset/model-specific logic and maximizing code reusability. This approach is crucial for scaling the MultiNet benchmark system. Many models accessible via the OpenAI API, for instance, can be run on OpenX and Procgen datasets using GenESIS.
GenESIS consists of three hierarchical components called "modules". As we move from upper to lower modules, features become more general and shared. The architectural foundation now includes base classes for dataset modules, streamlining the addition of new datasets and evaluation strategies.
- Dataset Module: This parent module utilizes modality and source modules. It encapsulates dataset-specific features and all logic for a particular dataset or task.
  - Responsibilities (not limited to):
    - Loading data (e.g., via a dataloader object) and fetching batches of inputs and labels.
    - Generating system instruction prompts for LLM/VLM modules based on dataset-specific definitions.
    - Comparing model outputs with labels to calculate evaluation metrics.
  - Base Architecture: `BaseDatasetModule` (`src/modules/dataset_modules/base_dataset_module.py`) provides a common structure. Concrete dataset modules inherit from it. Two primary evaluation patterns are supported:
    - Streaming Evaluation (`DatasetModule` pattern): For iterative processing of data. The module's `run_eval` method typically loops through data, calling `infer_step` on the modality module for each step.
    - Asynchronous Batch Evaluation (`DatasetBatchModule` pattern): For leveraging batch processing APIs (e.g., OpenAI Batch API). This involves methods like `send_batch_jobs_for_all_datasets` to submit tasks and a `run_eval` that later collects results from these jobs using stored `BatchInfo`.
  - Examples:
    - `OpenXModule` (`src/modules/dataset_modules/openx_module.py`): Handles OpenX datasets, using `OpenXDataloader` and calculating MSE and success rates. It implements both streaming evaluation (as `OpenXModule`) and batch API-based evaluation (as `OpenXBatchModule`).
    - `ProcgenModule` (`src/modules/dataset_modules/procgen_module.py`): Manages Procgen environments, supporting discrete action spaces and calculating metrics like Brier scores and F1-scores. It also implements both streaming (`ProcGenModule`) and batch (`ProcGenBatchModule`) patterns.
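To make the streaming pattern concrete, here is a minimal, self-contained sketch. `MiniDatasetModule`, `StubModalityModule`, and the toy data are all hypothetical stand-ins (not the real `BaseDatasetModule` or `OpenXModule`), but the control flow mirrors the description above: a `run_eval` template delegates to `_run_eval_dataset`, which drains `_process_batch` and calls the modality module's `infer_step` per timestep.

```python
from typing import Any, Iterator


class StubModalityModule:
    """Stand-in for a real modality module: returns a fixed prediction."""

    def infer_step(self, cur_inputs, instructions=[], output_types=[]):
        return [0.0 for _ in cur_inputs]  # one prediction per input


class MiniDatasetModule:
    """Simplified sketch of the streaming DatasetModule pattern."""

    def __init__(self, modality_module, dataset_names, batch_size=2):
        self.modality_module = modality_module
        self.dataset_names = dataset_names
        self.batch_size = batch_size

    def run_eval(self) -> dict:
        # Template method: evaluate every dataset, collect metrics.
        return {name: self._run_eval_dataset(name) for name in self.dataset_names}

    def _load_batches(self, dataset_name):
        # A real module would use a dataset-specific dataloader; here: toy data.
        data = [{"obs": f"{dataset_name}-{i}", "label": 0.0} for i in range(4)]
        for i in range(0, len(data), self.batch_size):
            yield {"steps": data[i:i + self.batch_size]}

    def _process_batch(self, batch: dict) -> Iterator[tuple[Any, Any]]:
        # Yield (inputs, label) per timestep, as the streaming pattern expects.
        for step in batch["steps"]:
            yield [("text", step["obs"])], step["label"]

    def _run_eval_dataset(self, dataset_name: str) -> dict:
        errors = []
        for batch in self._load_batches(dataset_name):
            for inputs, label in self._process_batch(batch):
                pred = self.modality_module.infer_step([inputs])[0]
                errors.append((pred - label) ** 2)
        return {"mse": sum(errors) / len(errors)}


results = MiniDatasetModule(StubModalityModule(), ["toy_openx"]).run_eval()
```

The key design point is that `run_eval` stays generic while `_run_eval_dataset` and `_process_batch` carry all dataset-specific logic, which is what lets new datasets slot in without touching the rest of the pipeline.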
- Source Module: Contains the actual AI model or API client used to generate outputs.
  - Responsibilities (not limited to):
    - Executing the inference step to generate model output.
    - Managing multi-turn context (updating context memory, ensuring inputs fit within the model's context window).
    - Employing tokenizers/encoders to prepare inputs for the model.
  - Example:
    - `OpenAIModule` (`src/modules/source_modules/openai_module.py`): Uses OpenAI's Python client.
      - Supports synchronous inference via `infer_step`.
      - Offers direct batch inference using OpenAI's Batch API via `batch_infer_step` and helper methods like `_execute_batch_job`, `get_batch_job_status`, and `retrieve_batch_results`.
      - Manages concurrent prompt histories (via an `idx` parameter in methods like `add_data`) to support batch operations efficiently.
      - Includes mechanisms for token calculation (text and image) and context truncation.
      - Tracks token usage against daily limits for batch queue operations.
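The idea of per-index concurrent histories with context truncation can be sketched as follows. `MiniSourceModule` is a hypothetical stand-in, not `OpenAIModule`: it echoes a canned reply instead of calling a model API, and truncates by turn count rather than tokens, but the `add_data(..., idx=...)` / `clear_history` contract matches the description above.

```python
class MiniSourceModule:
    """Sketch of the source-module contract: per-index chat histories
    with naive length-based truncation (hypothetical stand-in)."""

    def __init__(self, model: str, max_concurrent_prompts: int = 4,
                 max_history_len: int = 6):
        self.model = model
        self.max_history_len = max_history_len
        # One independent history per concurrent prompt slot.
        self.histories = [[] for _ in range(max_concurrent_prompts)]

    def add_data(self, type: str, data, idx: int = 0) -> None:
        self.histories[idx].append((type, data))
        # Crude context management: drop the oldest turns past the limit.
        if len(self.histories[idx]) > self.max_history_len:
            self.histories[idx] = self.histories[idx][-self.max_history_len:]

    def infer_step(self, inputs, system_prompt=None, idx: int = 0):
        self.add_data("user", inputs, idx)
        # A real module would call the model API here; we echo a turn count.
        reply = f"seen {len(self.histories[idx])} turns"
        self.add_data("assistant", reply, idx)
        return reply

    def clear_history(self) -> None:
        self.histories = [[] for _ in self.histories]


src = MiniSourceModule("toy-model")
first = src.infer_step([("text", "hello")], idx=0)  # uses history slot 0
second = src.infer_step([("text", "hi")], idx=1)    # independent slot 1
```

Keeping one history per `idx` is what allows many prompts to be prepared side by side before being submitted together as a batch job.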
- Modality Module: Connects dataset and source modules, handling data conversion across modalities (images, texts, arrays, etc.).
  - Responsibilities (not limited to):
    - Converting data from the dataset module into types the source module can handle, tagging data for clarity.
    - Converting model output from the source module back to its original type for evaluation by the dataset module.
  - Example:
    - `VLMModule` (`src/modules/modality_modules/vlm_module.py`): Processes image and text data; other data types are typically converted to strings. It supports:
      - Synchronous inference via `infer_step`.
      - Asynchronous batch job management by wrapping source module capabilities, offering methods like `send_batch_job` (to submit a job and get a `batch_job_id`), `get_batch_job_status`, and `retrieve_batch_results`. These are used by `DatasetBatchModule` implementations.
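The two conversion responsibilities can be illustrated with a short sketch. `MiniVLMModality` and `EchoSource` are hypothetical stand-ins (not the real `VLMModule`): non-text inputs are stringified on the way in, and raw text outputs are cast back to the expected `output_types` on the way out.

```python
from typing import Any


class MiniVLMModality:
    """Sketch of the modality-module role: tag/convert inputs for the
    source module, then cast raw outputs back to expected types."""

    def __init__(self, source_module):
        self.source_module = source_module

    def _to_source_format(self, item: tuple) -> tuple:
        tag, value = item
        # Images/text pass through; everything else becomes a string.
        if tag in ("image", "text"):
            return (tag, value)
        return ("text", str(value))

    def infer_step(self, cur_inputs, instructions=[], output_types=[]):
        outputs = []
        for i, inputs in enumerate(cur_inputs):
            converted = [self._to_source_format(item) for item in inputs]
            raw = self.source_module.infer_step(converted)
            out_type = output_types[i] if i < len(output_types) else str
            outputs.append(out_type(raw))  # cast back for evaluation
        return outputs


class EchoSource:
    def infer_step(self, inputs):
        # Pretend the model answers with the last text value it was given.
        return inputs[-1][1]


mod = MiniVLMModality(EchoSource())
preds = mod.infer_step([[("array", [1, 2])], [("text", "3.5")]],
                       output_types=[str, float])
```

Because the modality module owns both conversions, the dataset module never needs to know what the source module accepts, and the source module never needs to know how outputs are evaluated.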
These are the main variables and functions that each module type should support for GenESIS to work effectively. Note that these represent core interfaces; specific implementations might add further helper functions or variables. The introduction of base dataset modules has standardized many of these requirements.
- Dataset Module: The `base_dataset_module.py` file defines the common interfaces.
  - Shared Properties/Methods (often in `BaseDatasetModule`):
    - `self.modality_module` (Object): The modality module object.
    - `self.batch_size` (int): Batch size for dataloaders.
    - `self.k_shots` (int): Number of few-shot examples.
    - `self.action_stats` (dict): Statistics of the action space.
    - `self._definitions_class`: Class providing dataset-specific definitions (descriptions, action spaces).
    - `self.get_dataloader_fn`: Function to get the dataset-specific dataloader.
    - `self.format_instruction_prompt_fn`: Function to format instruction prompts.
    - `_find_shards(self, dataset_name: str) -> list[str]`: Finds data shards. (Abstract in base, implemented in concrete classes.)
    - `_get_vlm_instruction(self, dataset_name: str, env_name: str) -> str`: Generates system prompts.
    - `_get_output_type(self, dataset_name: str, env_name: str) -> type`: Determines the expected model output type.
  - For Streaming Evaluation (e.g., inheriting `DatasetModule`):
    - Required functions:
      - `def __init__(self, disk_root_dir: str, modality: str, source: str, model: str, batch_size: int, k_shots: int) -> None` (handled by `BaseDatasetModule` and then specialized).
      - `def run_eval(self) -> None` (a template is in `BaseDatasetModule`; `_run_eval_dataset` is the core dataset-specific logic called by it).
      - `def _run_eval_dataset(self, dataset_name: str) -> dict` (abstract in base, implemented in concrete classes; typically iterates through data using `_process_batch` and calls the modality module's `infer_step`).
      - `def _process_batch(self, batch: dict, dataset_name: str)` (processes a batch from the dataloader, yielding per-timestep data for `infer_step`).
  - For Asynchronous Batch Evaluation (e.g., inheriting `DatasetBatchModule`):
    - Required functions (beyond the `DatasetModule` aspects):
      - `def __init__(self, disk_root_dir: str, modality: str, source: str, model: str, batch_info_dir: str, batch_size: int, k_shots: int) -> None` (`batch_info_dir` is for storing/loading `BatchInfo` files).
      - `def send_batch_jobs_for_all_datasets(self) -> dict` (iterates over datasets, calling `_send_batch_jobs_for_dataset`).
      - `def _send_batch_jobs_for_dataset(self, dataset_name: str) -> list[str]` (loads data, calls `_send_batch_job` for each batch, and returns the list of `BatchInfo` file paths).
      - `def _send_batch_job(self, batch: dict, dataset_name: str, batch_num: int) -> None` (processes a full batch via its `_process_batch`, calls the modality module's `send_batch_job`, and saves `BatchInfo`).
      - `def run_eval(self, results_path: str, batch_info_dict: dict) -> None` (template in `DatasetBatchModule`; takes a dictionary mapping datasets to lists of `BatchInfo` file paths and calls `_run_eval_dataset` for each dataset).
      - `def _run_eval_dataset(self, dataset_batch_info_paths: Union[str, list[str]]) -> dict` (abstract in base; implemented in concrete classes to load `BatchInfo`, retrieve results via the modality module, and evaluate).
      - `def _process_batch(self, batch: dict, dataset_name: str)` (processes a batch from the dataloader, returning inputs for the entire batch to be sent as one job).
    - Associated `BatchInfo` dataclass: Used to save and load metadata about submitted batch jobs (e.g., `batch_id`, `labels`, `output_types`).
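The submit-then-collect flow around `BatchInfo` can be sketched end to end. Everything here is a simplified, hypothetical stand-in (a fake backend instead of the modality module, a trimmed `BatchInfo`, and free functions instead of `DatasetBatchModule` methods), but it shows why the metadata must be persisted: submission and evaluation happen in separate passes, possibly in separate processes.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class BatchInfo:
    """Metadata persisted per submitted batch job (simplified sketch)."""
    batch_id: str
    labels: list
    output_types: list  # stored as type names, since JSON can't hold types


class FakeBatchBackend:
    """Stands in for the modality module's batch-job API."""

    def __init__(self):
        self.jobs = {}

    def send_batch_job(self, inputs):
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = [float(len(x)) for x in inputs]  # canned "results"
        return job_id

    def retrieve_batch_results(self, job_id):
        return self.jobs[job_id]


def send_batch_job(backend, batch, info_dir: Path, batch_num: int) -> str:
    """Submission pass: submit one batch and persist its BatchInfo."""
    job_id = backend.send_batch_job(batch["inputs"])
    info = BatchInfo(job_id, batch["labels"], ["float"] * len(batch["labels"]))
    path = info_dir / f"batch_{batch_num}.json"
    path.write_text(json.dumps(asdict(info)))
    return str(path)


def run_eval(backend, info_paths: list) -> dict:
    """Collection pass: load BatchInfo files, fetch results, compute MSE."""
    errors = []
    for p in info_paths:
        info = BatchInfo(**json.loads(Path(p).read_text()))
        results = backend.retrieve_batch_results(info.batch_id)
        for pred, label in zip(results, info.labels):
            errors.append((pred - label) ** 2)
    return {"mse": sum(errors) / len(errors)}


backend = FakeBatchBackend()
with tempfile.TemporaryDirectory() as d:
    batch = {"inputs": [["a"], ["b", "c"]], "labels": [1.0, 2.0]}
    paths = [send_batch_job(backend, batch, Path(d), 0)]
    metrics = run_eval(backend, paths)
```

Persisting labels and output types alongside the job ID is what makes the later `run_eval` pass self-sufficient: it only needs the `BatchInfo` files and the provider's job IDs, not the original dataloader state.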
- Modality Module
  - Required variables:
    - `self.source_module` (Object): The source module object.
  - Required functions:
    - `def __init__(self, source: str, model: str, **kwargs) -> None` (may take additional args like `max_concurrent_prompts` for configuring the source module).
    - `def infer_step(self, cur_inputs: list[list[tuple[str, Any]]], k_shots_examples: list[list[tuple[str, list[tuple[str, Any]]]]] = [], instructions: list[str] = [], output_types: list[type] = []) -> list[Any]` (for synchronous, step-by-step inference).
  - New key functions for batch processing (as seen in `VLMModule`):
    - `def send_batch_job(self, cur_inputs: list[list[tuple[str, Any]]], k_shots_examples: list[list[tuple[str, list[tuple[str, Any]]]]] = [], instructions: list[str] = []) -> tuple[str, int]` (returns `(batch_job_id, num_tokens)`).
    - `def get_batch_job_status(self, batch_job_id: str) -> str`
    - `def retrieve_batch_results(self, batch_job_id: str, output_types: list[type] = []) -> list[Any]`
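How a modality module might wrap the source module's batch functions can be sketched briefly. Both classes below are hypothetical stand-ins; the point is the delegation pattern: `send_batch_job` submits without waiting (here via a `retrieve_and_return_results=False` flag, as in the source-module signature above), while status polling and result retrieval pass through, with output-type casting applied on the way back.

```python
class FakeBatchSource:
    """Stand-in source module exposing the batch-processing contract."""

    def __init__(self):
        self._results = {}

    def batch_infer_step(self, inputs_batch, system_prompts,
                         retrieve_and_return_results=True):
        job_id = "job-0"
        self._results[job_id] = [f"ok:{len(x)}" for x in inputs_batch]
        responses = self._results[job_id] if retrieve_and_return_results else []
        return responses, job_id, 42  # (responses, batch_job_id, input_tokens)

    def get_batch_job_status(self, job_id):
        return "completed" if job_id in self._results else "unknown"

    def retrieve_batch_results(self, job_id):
        return self._results[job_id]


class MiniModalityBatchWrapper:
    """Sketch of a modality module delegating batch calls to its source."""

    def __init__(self, source_module):
        self.source_module = source_module

    def send_batch_job(self, cur_inputs, instructions=[]):
        # Submit without waiting; the caller polls status later.
        _, job_id, num_tokens = self.source_module.batch_infer_step(
            cur_inputs, instructions, retrieve_and_return_results=False)
        return job_id, num_tokens

    def get_batch_job_status(self, batch_job_id):
        return self.source_module.get_batch_job_status(batch_job_id)

    def retrieve_batch_results(self, batch_job_id, output_types=[]):
        raw = self.source_module.retrieve_batch_results(batch_job_id)
        if not output_types:
            return raw
        return [t(r) for t, r in zip(output_types, raw)]


wrapper = MiniModalityBatchWrapper(FakeBatchSource())
job_id, tokens = wrapper.send_batch_job([[("text", "a")], [("text", "b")]])
status = wrapper.get_batch_job_status(job_id)
results = wrapper.retrieve_batch_results(job_id)
```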
- Source Module
  - Required variables:
    - `self.model` (str or Object): The AI model to use.
  - Required functions:
    - `def __init__(self, model: str, **kwargs) -> None` (may take additional args like `max_concurrent_prompts` and `max_output_tokens_per_query`).
    - `def infer_step(self, inputs: list[tuple[str, Any]], system_prompt: str = None) -> Any` (for single, synchronous inference, typically on `history[0]`).
    - `def add_data(self, type: str, data: list[tuple[str, Any]], idx: int = 0) -> None` (adds to the context history; `idx` manages concurrent histories).
    - `def clear_history(self) -> None` (clears context histories).
  - New key functions for batch processing (as seen in `OpenAIModule`):
    - `def batch_infer_step(self, inputs_batch: list[list[tuple[str, Any]]], system_prompts: list[str], retrieve_and_return_results: bool = True) -> tuple[list[Any], str, int]` (returns `(responses, batch_job_id, input_tokens)`).
    - `def get_batch_job_status(self, batch_job_id: str) -> str`
    - `def retrieve_batch_results(self, batch_job_id: str) -> list[Any]`
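One source-module responsibility not covered by the signatures above is tracking token usage against daily limits for batch queue operations. A minimal, hypothetical sketch of that bookkeeping (the real `OpenAIModule` logic may differ):

```python
import time


class DailyTokenBudget:
    """Hypothetical sketch of tracking batch-queue token usage
    against a daily limit."""

    def __init__(self, daily_limit: int, clock=time.time):
        self.daily_limit = daily_limit
        self.clock = clock  # injectable for testing
        self._day = None
        self._used = 0

    def _roll_day(self):
        today = int(self.clock() // 86400)
        if today != self._day:
            self._day, self._used = today, 0  # reset at the day boundary

    def try_spend(self, tokens: int) -> bool:
        """Record usage if it fits under today's limit; else refuse."""
        self._roll_day()
        if self._used + tokens > self.daily_limit:
            return False
        self._used += tokens
        return True


budget = DailyTokenBudget(daily_limit=100)
ok_first = budget.try_spend(60)   # fits within the budget
ok_second = budget.try_spend(60)  # would exceed 100, so it is refused
```

A caller submitting batch jobs would check `try_spend` with the estimated input-token count before queueing each job, deferring any job that does not fit until the budget resets.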
We welcome contributions to the GenESIS framework and the MultiNet project! If you'd like to contribute, please feel free to:
- Open an Issue: For bug reports, feature requests, or discussions about potential changes, please open an issue on our GitHub repository.
- Submit a Pull Request: If you've made improvements or added new features, please submit a pull request. We will review it.
We appreciate any help in improving MultiNet! If you find any part of the codebase confusing or unclear, please don't hesitate to reach out by opening an issue—it helps us improve documentation and code clarity for everyone.
