Source Code Info:
- Repository: CatDB (https://github.com/CoDS-GCS/CatDB)
- Programming Language: Python 3.10 & 3.9, Java
- Packages/Libraries Needed: JDK 11, Python, Git, Maven, pdflatex, unzip, unrar, xz-utils
Hardware and Software Info: We ran all experiments on a server node (VM) with an Intel Core CPU (with 32 vcores) and 150 GB of DDR4 RAM. The software stack consisted of Ubuntu 22.04, OpenJDK 11 (for Java baselines), and Python 3.10 (for Python baselines).
Setup and Experiments: The repository is pre-populated with the paper's experimental results (./results), individual plots (./plots), and the SystemDS source code. The entire experimental evaluation can be run via ./runAll.sh, which deletes the existing results and plots and then performs setup, dataset download, dataset preparation, dataset generation, local experiments, and plotting. However, for a more controlled evaluation, we recommend running the individual steps separately.
The ./run1SetupDependencies.sh script installs all the required dependencies. Here is a brief overview of each dependency and its purpose:
- JDK 11: for Java-based baselines (H2O AutoML)
- unzip, unrar, and xz-utils: for decompressing datasets
- python3.9 & 3.10: for Python-based baselines
- pdflatex (version 2021): for result visualization
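Before running the setup scripts, it can help to verify that the required command-line tools are available. The following sketch checks the PATH for each tool; the binary names are assumptions based on the dependency list above and may differ per distribution:

```python
import shutil

# Command-line tools needed by the setup scripts (names assumed from the
# dependency list above; adjust for your distribution if they differ).
REQUIRED_TOOLS = ["java", "python3", "git", "mvn", "pdflatex", "unzip", "unrar", "xz"]

def missing_dependencies(tools):
    """Return the subset of tools that cannot be found on the PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_dependencies(REQUIRED_TOOLS)
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```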
The ./run2SetupBaseLines.sh script automatically compiles the Java- and Python-based implementations and sets up the runnable apps in the Setup directory. No manual effort is required for this process.
For LLM-based baselines, you need to set API keys for the LLM services (OpenAI, Google Gemini, and Llama). For each service, create an API key using the following links:
- OpenAI: https://platform.openai.com/
- Google Gemini: https://aistudio.google.com/
- Groq (Llama): https://console.groq.com/
The API keys must be set in the following locations:
- CatDB Setup Path:
# Path:
Experiments/setup/Baselines/CatDB/APIKeys.yaml
# Content:
---
- llm_platform: OpenAI
  key_1: 'YOUR KEY'
- llm_platform: Meta
  key_1: 'YOUR KEY'
- llm_platform: Google
  key_1: 'YOUR KEY'
- CAAFE Setup Path: set the keys as environment variables in the OS:
export OPENAI_API_KEY_1=<YOUR KEY>
export GROQ_API_KEY_1=<YOUR KEY>
export GOOGLE_API_KEY_1=<YOUR KEY>
- AIDE Setup Path:
# Path:
Experiments/setup/Baselines/aideml/APIKeys.yaml
# Content:
---
- llm_platform: OpenAI
  key_1: 'YOUR KEY'
- llm_platform: Meta
  key_1: 'YOUR KEY'
- llm_platform: Google
  key_1: 'YOUR KEY'
- AutoGen Setup Path:
# Path:
Experiments/setup/Baselines/AutoML/AutoGenAutoML/OAI_CONFIG_LIST
# Content:
[
  {
    "model": "gpt-4o",
    "api_key": "YOUR KEY"
  },
  {
    "model": "llama-3.1-70b-versatile",
    "api_key": "YOUR KEY",
    "api_type": "groq"
  },
  {
    "model": "gemini-1.5-pro",
    "api_key": "YOUR KEY",
    "api_type": "google"
  }
]
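Since OAI_CONFIG_LIST is plain JSON, it can be sanity-checked before launching the AutoGen baseline. A minimal sketch (the helper name `load_model_config` is ours for illustration, not part of the repository):

```python
import json

def load_model_config(path, model):
    """Load an OAI_CONFIG_LIST-style JSON file and return the entry for `model`."""
    with open(path) as f:
        configs = json.load(f)
    for entry in configs:
        if entry.get("model") == model:
            return entry
    raise KeyError(f"No config entry for model {model!r}")
```

A malformed file fails fast with a JSON parse error instead of an opaque failure later in the experiment run.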
We manage our datasets using two scripts: ./run3DownloadData.sh and ./run4PrepareData.sh.
- The ./run3DownloadData.sh script automatically downloads all datasets used in the experiments. The refined format of these datasets is then moved into the data directory.
- The ./run4PrepareData.sh script decompresses both the local datasets and the data catalog files. It also splits the data according to the paper's settings, using a 70/30 split for training and test datasets.
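The 70/30 split can be sketched in plain Python. This is an illustration with a fixed seed for reproducibility; ./run4PrepareData.sh may implement the split differently:

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows deterministically and split them into train/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # prints: 70 30
```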
Datasets Used:
| # | Dataset | URL | Download Link |
|---|---|---|---|
| 1 | Wifi | Local | download |
| 2 | Diabetes | OpenML: DatasetID #37 | download |
| 3 | Tic-Tac-Toe | OpenML: DatasetID #50 | download |
| 4 | IMDB | https://relational-data.org/dataset/IMDb | download |
| 5 | KDD98 | OpenML: DatasetID #42343 | download |
| 6 | Walking | OpenML: DatasetID #1509 | download |
| 7 | CMC | OpenML: DatasetID #23 | download |
| 8 | EU IT | Local | download |
| 9 | Survey | Local | download |
| 10 | Etailing | Local | download |
| 11 | Accidents | https://relational-data.org/dataset/Accidents | download |
| 12 | Financial | https://relational-data.org/dataset/Financial | download |
| 13 | Airline | https://relational-data.org/dataset/Airline | download |
| 14 | Gas-Drift | OpenML: DatasetID #1476 | download |
| 15 | Volkert | OpenML: DatasetID #41166 | download |
| 16 | Yelp | https://relational-data.org/dataset/Yelp | download |
| 17 | Bike-Sharing | OpenML: DatasetID #44048 | download |
| 18 | Utility | Local | download |
| 19 | NYC | OpenML: DatasetID #44065 | download |
| 20 | House-Sales | OpenML: DatasetID #44051 | download |
The ./run5LocalExperiments.sh script runs all experiments. Each experiment is executed five times, and the experimental results are stored in the results directory.
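The five-fold repetition can be illustrated with a small sketch. The experiment callable, file names, and JSON result format below are assumptions for illustration, not the script's actual code:

```python
import json
from pathlib import Path

def run_repeated(experiment, name, results_dir, repetitions=5):
    """Run `experiment` several times and store one JSON result file per run."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for run in range(1, repetitions + 1):
        result = experiment(run)  # e.g. a dict of accuracy, runtime, ...
        path = out / f"{name}_run{run}.json"
        path.write_text(json.dumps(result))
        paths.append(path)
    return paths
```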
Since each experiment is run five times, the ./run6PlotResults.sh script follows this process:
- Plot the averaged results using LaTeX's tikzpicture and store the plots in the plots directory.
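The averaging step over the five runs can be sketched as follows (a hypothetical helper; the actual plots are produced with LaTeX's tikzpicture):

```python
from statistics import mean

def average_runs(runs):
    """Average per-metric results over repeated runs of the same experiment.

    `runs` is a list of dicts (one per run) sharing the same metric keys.
    """
    return {metric: mean(r[metric] for r in runs) for metric in runs[0]}

# Hypothetical per-run results (runtime in seconds) for one experiment:
runs = [{"runtime": 10}, {"runtime": 12}, {"runtime": 11},
        {"runtime": 13}, {"runtime": 14}]
print(average_runs(runs))
```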
In summary, the full evaluation pipeline is run step by step as follows:
./run1SetupDependencies.sh;
./run2SetupBaseLines.sh;
./run3DownloadData.sh;
./run4PrepareData.sh;
./run5LocalExperiments.sh;
./run6PlotResults.sh;
Last Update: April 19, 2025 (draft version)