Working with Research Datasets

Paramus provides direct access to curated chemical and materials science datasets. You can install, query, and cross-reference datasets without leaving your research environment.

Available Research Domains

Domain	Datasets	What You Get
Polymer Science	RadonPy, PI1M, OpenMacromolecularGenome, VipEA, OMG-Property-Database, PolyIE	More than 1M polymer structures with physical properties from MD simulations
Computational Chemistry	QM9, QM9S, MSR-ACC-TAE25	134k small molecules with DFT-level energies, HOMO/LUMO, dipole moments
Inorganic / Crystallography	COD, a-Si-24, Anionic-Solvation-Dataset	Crystal structures, amorphous silicon configurations, solvation data
Organic / Solubility	BigSolDB	112,465 experimental solubility records across multiple solvents

Installing a Dataset

Select a dataset tile and click Install. Paramus downloads the data files from their source (Zenodo, GitHub) and prepares them for querying. Original files are never modified; Paramus creates normalized copies and a search index alongside them.

Querying by Chemical Properties

Ask questions in natural language through the chat. Paramus automatically translates your request into the corresponding structured query.

Find soluble compounds in ethanol at room temperature:

“Show me compounds with LogS above -2 in ethanol between 20 and 30 degrees Celsius from BigSolDB”

Screen polymers by glass transition temperature:

“Which polymers in RadonPy have a Tg above 400K and density below 1.2 g/cm3?”

Look up molecular properties by structure:

“Get the HOMO-LUMO gap and dipole moment for all molecules containing a carbonyl group in QM9”

SMILES columns are automatically canonicalized using RDKit, so c1ccccc1 and C1=CC=CC=C1 both find benzene.
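
To make the canonicalization step concrete, here is a minimal sketch of how two SMILES strings can be compared after RDKit canonicalization. It uses RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`; the helper names `canonical_smiles` and `same_molecule` are illustrative and are not part of Paramus. How Paramus applies canonicalization internally is not described here, so treat this only as an illustration of the idea.

```python
from typing import Optional

try:
    from rdkit import Chem  # RDKit must be installed for canonicalization
except ImportError:  # keep this module importable on machines without RDKit
    Chem = None


def canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    if Chem is None:
        raise RuntimeError("RDKit is required for SMILES canonicalization")
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical output by default


def same_molecule(a: str, b: str) -> bool:
    """True when both strings parse and share one canonical form."""
    ca, cb = canonical_smiles(a), canonical_smiles(b)
    return ca is not None and ca == cb


if Chem is not None:
    # Aromatic and Kekulé notations of benzene compare as equal.
    print(same_molecule("c1ccccc1", "C1=CC=CC=C1"))
```

Comparing canonical strings rather than raw input is what lets a lookup match regardless of how the user wrote the structure.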

Query Methods

Method	Use Case
`dataset.query`	Filter by structure, property ranges, solvents, conditions
`dataset.query_schema`	Inspect available columns, types, and value ranges
`dataset.query_remote`	Query a dataset without downloading it first
`dataset.list`	See all installed datasets
`dataset.get`	Get metadata and file listing for a dataset
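
The call signature of `dataset.query` is not shown in this page, so the sketch below does not call it. It only illustrates, in plain Python, what a filter by solvent, property range, and condition amounts to for the BigSolDB prompt above (LogS above -2 in ethanol between 20 and 30 degrees Celsius). The record fields and the rows are hypothetical placeholders, not BigSolDB's actual column names or measurements.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SolubilityRecord:
    # Hypothetical field names chosen for this illustration only.
    compound: str
    solvent: str
    temperature_c: float
    log_s: float


def filter_solubility(
    records: Iterable[SolubilityRecord],
    solvent: str,
    min_log_s: float,
    t_min_c: float,
    t_max_c: float,
) -> List[SolubilityRecord]:
    """Keep records in one solvent, above a LogS threshold, within a temperature window."""
    return [
        r for r in records
        if r.solvent == solvent
        and r.log_s > min_log_s
        and t_min_c <= r.temperature_c <= t_max_c
    ]


# Placeholder rows, invented for the example; not real measurements.
ROWS = [
    SolubilityRecord("compound-A", "ethanol", 25.0, -1.2),
    SolubilityRecord("compound-B", "ethanol", 25.0, -3.4),
    SolubilityRecord("compound-C", "ethanol", 45.0, -0.8),
    SolubilityRecord("compound-D", "acetone", 25.0, -1.0),
]

hits = filter_solubility(ROWS, solvent="ethanol", min_log_s=-2.0, t_min_c=20.0, t_max_c=30.0)
print([r.compound for r in hits])  # ['compound-A']
```

Only compound-A passes all three conditions: B fails the LogS threshold, C is outside the temperature window, and D is in a different solvent.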

Supported File Formats

Paramus handles common research data formats out of the box:

Format	Extensions
Tabular	.csv, .json, .jsonl, .xlsx, .xls, .parquet, .feather
Scientific	.h5, .hdf5, .mat, .npy, .npz
Serialized	.pkl, .pickle
Archives	.tar, .tar.gz, .tar.bz2, .zip (auto-extracted)

Use dataset.unfold to convert between formats (e.g. Parquet to CSV).
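
As a rough picture of what a format conversion involves, the sketch below converts JSONL to CSV using only the Python standard library. It is not the implementation of dataset.unfold, whose parameters are not documented here. JSONL is used instead of Parquet so the example needs no third-party packages, and the function name `jsonl_to_csv` is an assumption made for illustration.

```python
import csv
import io
import json
from typing import List, TextIO


def jsonl_to_csv(src: TextIO, dst: TextIO) -> int:
    """Convert JSON Lines to CSV and return the number of rows written."""
    # Blank lines are skipped; every other line must hold one JSON object.
    rows = [json.loads(line) for line in src if line.strip()]

    # Column order: union of keys in first-seen order.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return len(rows)


source = io.StringIO('{"id": 1, "solvent": "ethanol"}\n\n{"id": 2, "log_s": -1.5}\n')
target = io.StringIO()
count = jsonl_to_csv(source, target)
print(count)
print(target.getvalue())
```

Records with differing keys are the main thing to handle when flattening JSONL into a table, which is why the header is built from the union of all keys.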

Semantic Knowledge Graphs

Beyond tabular datasets, three RDF knowledge graphs capture domain-specific research context:

Knowledge Graph	Focus
Polymer Chemistry R&D	Polymer synthesis, characterization, and property prediction
Medicinal Chemistry (Molidustat)	HIF-PHD inhibitor research, SAR relationships
Germanium Extraction R&D	Hydrometallurgical processing, extraction optimization

These are managed separately via semantic.list, semantic.switch, and semantic.info.

Dataset Metadata

Each dataset card follows the Croissant 1.0 + Schema.org standard, capturing provenance, licensing, and citation:

{
  "@type": "Dataset",
  "name": "BigSolDB",
  "dataOrigin": "experimental",
  "measurementTechnique": "Various experimental methods",
  "license": "CC-BY-4.0",
  "citation": {
    "name": "BigSolDB: Solubility Dataset of Compounds in Organic Solvents",
    "identifier": "10.1038/s41597-023-02..."
  }
}

This ensures every query result can be traced back to its original publication and data source.
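
A short sketch of how such a card can be read to produce a provenance line for a query result. It uses the example card above unchanged, including the identifier exactly as truncated in this page; the helper name `provenance_line` and the fallback strings are hypothetical and not part of Paramus.

```python
import json

# The example card from above, copied as-is (the identifier is truncated in the source).
CARD_JSON = """
{
  "@type": "Dataset",
  "name": "BigSolDB",
  "dataOrigin": "experimental",
  "measurementTechnique": "Various experimental methods",
  "license": "CC-BY-4.0",
  "citation": {
    "name": "BigSolDB: Solubility Dataset of Compounds in Organic Solvents",
    "identifier": "10.1038/s41597-023-02..."
  }
}
"""


def provenance_line(card: dict) -> str:
    """Build a one-line provenance summary from a dataset card."""
    citation = card.get("citation", {})
    return "{name} ({license}): {title}, identifier {ident}".format(
        name=card["name"],
        license=card.get("license", "license unknown"),
        title=citation.get("name", "no citation"),
        ident=citation.get("identifier", "n/a"),
    )


card = json.loads(CARD_JSON)
print(provenance_line(card))
```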

Frequently Asked Questions

What chemical datasets are available in Paramus?

Paramus includes curated datasets across polymer science (RadonPy, PI1M, OpenMacromolecularGenome), computational chemistry (QM9, QM9S), inorganic crystallography (COD), and solubility (BigSolDB with 112k+ experimental records). New datasets are added regularly.

How do I install and query a dataset?

Select a dataset tile in the Paramus interface and click Install. Paramus downloads source files from Zenodo or GitHub and builds a local search index. You can then query by chemical properties using natural language or the dataset.query API method.

What file formats are supported for research data?

Paramus handles tabular formats (CSV, JSON, JSONL, XLSX, Parquet, Feather), scientific formats (HDF5, MAT, NPY, NPZ), serialized objects (Pickle), and archives (TAR, ZIP) with automatic extraction. Use dataset.unfold to convert between formats.

Can I search datasets by molecular structure?

Yes. SMILES columns are automatically canonicalized using RDKit, so equivalent representations like c1ccccc1 and C1=CC=CC=C1 both match benzene. You can query by substructure, property ranges, solvents, and experimental conditions.

What are the semantic knowledge graphs in Paramus?

Paramus includes three example RDF knowledge graphs that capture domain-specific research context: Polymer Chemistry R&D (synthesis and property prediction), Medicinal Chemistry, built around the example molecule Molidustat (HIF-PHD inhibitor SAR), and Germanium Extraction R&D (hydrometallurgical processing). Manage them with semantic.list, semantic.switch, and semantic.info.