This project is fully reproducible and designed to run directly in the Kaggle Notebook environment.
Access the public Kaggle notebook here:
👉 Kaggle Notebook
https://www.kaggle.com/code/thanhtamdo91/openstax-2026-rice-datathon-mapping-standards-i
- Click “Copy & Edit” (top right)
- This will create your own editable copy in your Kaggle workspace
- In the notebook, click Settings
- Set:
  - Accelerator: GPU
  - GPU Type: T4 ×2 (recommended)
- Save settings
This project uses the Gemini API for data augmentation.
- Go to Kaggle → Settings → Secrets
- Add a new secret:
  - Name: GEMINI_API_KEY
  - Value: your Gemini API key
- The notebook automatically loads it using kaggle_secrets
If the API key is not provided, the notebook will still run using the base dataset.
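A minimal sketch of how the key lookup and the fallback could work. `UserSecretsClient().get_secret(...)` is the standard `kaggle_secrets` call. The helper name `load_gemini_key`, the `use_augmentation` flag, and the environment-variable fallback are assumptions made for illustration and may not match the notebook's actual code.

```python
import os
from typing import Optional


def load_gemini_key(secret_name: str = "GEMINI_API_KEY") -> Optional[str]:
    """Return the API key if one is available, otherwise None."""
    try:
        # kaggle_secrets is only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient

        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Secret not attached, or running outside Kaggle. Falling back to an
        # environment variable is an assumption, not part of the notebook.
        return os.environ.get(secret_name)


api_key = load_gemini_key()

# Hypothetical flag: skip the Gemini augmentation step when no key is found,
# so the rest of the pipeline runs on the base dataset only.
use_augmentation = api_key is not None
```

If `use_augmentation` is false, the augmentation cells can be skipped and everything downstream still runs.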
- Click Run All
- The notebook will:
  - Parse and flatten the OpenStax JSON data
  - Perform EDA
  - Generate embeddings and FAISS index
  - Run semantic retrieval + re-ranking
  - Output evaluation metrics and visualizations
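To give a feel for the retrieval step, here is a dependency-free sketch of top-k search by cosine similarity. It is a brute-force stand-in, not the notebook's code: the notebook builds a FAISS index over model embeddings, whereas the vectors here are toy values and the names `normalize` and `top_k` are hypothetical. A flat inner-product index over normalized vectors would produce the same ranking. Re-ranking is not shown.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(
    query: Sequence[float], corpus: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar corpus vectors."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(doc))))
        for i, doc in enumerate(corpus)
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-D "embeddings" for three standards and one content item.
standard_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
item_vector = [0.9, 0.1]

results = top_k(item_vector, standard_vectors, k=2)
print([index for index, _ in results])  # [0, 2]
```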
- GPU Enabled: ~10–15 minutes
- CPU Only: Not recommended (embedding + FAISS steps are slow)
- Evaluation metrics (Hit@K, F1, Hamming Loss)
- Retrieval performance analysis
- Error distribution across standard codes
- Augmented training artifacts (optional)
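For reference, the reported metrics can be computed roughly as follows. This is a sketch under common multi-label definitions (Hit@K counts a query as a hit if any true code appears in the top K, F1 is micro-averaged). The notebook may use different averaging or a library implementation, and the standard codes below are made up.

```python
from typing import List, Set


def hit_at_k(ranked: List[List[str]], truth: List[Set[str]], k: int) -> float:
    """Fraction of items with at least one true code among the top-k predictions."""
    hits = sum(1 for r, t in zip(ranked, truth) if t & set(r[:k]))
    return hits / len(truth)


def hamming_loss(pred: List[Set[str]], truth: List[Set[str]], n_labels: int) -> float:
    """Fraction of (item, label) slots where prediction and truth disagree."""
    wrong = sum(len(p ^ t) for p, t in zip(pred, truth))
    return wrong / (len(truth) * n_labels)


def micro_f1(pred: List[Set[str]], truth: List[Set[str]]) -> float:
    """Micro-averaged F1 over all items and labels."""
    tp = sum(len(p & t) for p, t in zip(pred, truth))
    fp = sum(len(p - t) for p, t in zip(pred, truth))
    fn = sum(len(t - p) for p, t in zip(pred, truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical standard codes and rankings for two content items.
labels = ["A.1", "A.2", "B.1"]
ranked = [["A.1", "B.1", "A.2"], ["B.1", "A.2", "A.1"]]
truth = [{"A.1"}, {"A.1"}]
top1 = [set(r[:1]) for r in ranked]  # treat the top-1 code as the prediction

print(hit_at_k(ranked, truth, k=1))            # 0.5
print(micro_f1(top1, truth))                   # 0.5
print(hamming_loss(top1, truth, len(labels)))  # 2 wrong slots out of 6
```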
- No local setup is required
- No additional dependencies need to be installed
- All datasets are loaded directly from Kaggle inputs