
fix: add support for wildcard pattern in seed dataset path #12

Merged
nabinchha merged 40 commits into main from nabinchha/bug/2-support-seed-path-with-partition-files
Nov 5, 2025


@nabinchha nabinchha commented Nov 4, 2025

Contributor

Fix for: #2

Users should be able to point to a local folder with a wildcard pattern, as shown here for parquet, json, jsonl, and csv. If we follow this pattern, duckdb can read all matching files for each of these extensions.

parquet_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.parquet")
json_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.json")
jsonl_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.jsonl")
csv_reference = LocalSeedDatasetReference(dataset="../my/long/path/*.csv")
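
For anyone following along, here is a minimal sketch of how such a pattern could be expanded and checked before it reaches DuckDB. It uses the standard library's `glob` rather than DuckDB itself. `resolve_seed_files` and `SUPPORTED_EXTENSIONS` are hypothetical names for illustration and are not taken from this PR.

```python
import glob
from pathlib import Path

# Extensions named in the PR description (assumed to be the full supported set).
SUPPORTED_EXTENSIONS = {".parquet", ".json", ".jsonl", ".csv"}


def resolve_seed_files(dataset: str) -> list[str]:
    """Hypothetical helper: expand a seed dataset path or wildcard pattern.

    Returns the matching files in sorted order. Raises ValueError if the
    extension is not supported or if nothing matches the pattern.
    """
    extension = Path(dataset).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported seed dataset extension: {extension!r}")

    # glob.glob also handles a plain path without wildcards: it returns a
    # one-element list if the file exists and an empty list otherwise.
    matches = sorted(glob.glob(dataset))
    if not matches:
        raise ValueError(f"No files match seed dataset path: {dataset!r}")
    return matches
```

Sorting the matches is a design choice in this sketch so that "the first matching file" is deterministic across runs. The actual implementation may order files differently.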

#8 should merge first.

We need this for BigIron to support any nontrivial partitioned seed dataset. We currently have a workaround in BigIron that consolidates partitions into one file, but that does not scale.

Example preview result with the seed dataset pointing to "csv/*.csv":

[11:36:46] [INFO] 0️⃣ Using the first matching file in 'csv/*.csv' to determine column names in seed dataset
[11:36:46] [INFO] 🕵️ Preview generation in progress
[11:36:46] [INFO] ✅ Validation passed
[11:36:46] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[11:36:46] [INFO] 🩺 Running health checks for models...
[11:36:46] [INFO]   |-- 👀 Checking 'nvidia/nvidia-nemotron-nano-9b-v2'...
col_names: ['language', 'greetings', 'name']
[11:36:47] [INFO]   |-- ✅ Passed!
[11:36:47] [INFO] 🌱 Sampling 10 records from seed dataset
[11:36:47] [INFO]   |-- seed dataset size: 22 records
[11:36:47] [INFO]   |-- sampling strategy: shuffle
[11:36:47] [INFO]   |-- selection: partition 2 of 3
[11:36:47] [INFO]   |-- seed dataset size after selection: 7 records
[11:36:47] [INFO] 📝 Preparing llm-text column generation
[11:36:47] [INFO]   |-- column name: 'greetings_completion'
[11:36:47] [INFO]   |-- model config:
{
    "alias": "nano-v2",
    "model": "nvidia/nvidia-nemotron-nano-9b-v2",
    "inference_parameters": {
        "temperature": 0.5,
        "top_p": null,
        "max_tokens": 2048,
        "max_parallel_requests": 4,
        "timeout": null,
        "extra_body": null
    },
    "provider": null
}
[11:36:47] [INFO]   |-- default model provider: 'nvidia'
[11:36:47] [INFO] 🐙 Processing llm-text column 'greetings_completion' with 4 concurrent workers
[11:37:01] [INFO] 📊 Model usage summary:
{
    "nvidia/nvidia-nemotron-nano-9b-v2": {
        "token_usage": {
            "prompt_tokens": 397,
            "completion_tokens": 5102,
            "total_tokens": 5499
        },
        "request_usage": {
            "successful_requests": 10,
            "failed_requests": 0,
            "total_requests": 10
        },
        "tokens_per_second": 380,
        "requests_per_minute": 41
    }
}
[11:37:01] [INFO] 📐 Measuring dataset column statistics:
[11:37:01] [INFO]   |-- 📝 column: 'greetings_completion'
[11:37:01] [INFO]   |-- 🌱 column: 'language'
[11:37:01] [INFO]   |-- 🌱 column: 'greetings'
[11:37:01] [INFO]   |-- 🌱 column: 'name'
[11:37:01] [INFO] 🎉 Preview complete!
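
The first log line above says that column names come from the first matching file. A rough sketch of that idea for the CSV case, using only the standard library, is below. `seed_column_names` is a hypothetical name, and both the sorted ordering and the assumption that every partition shares one schema are mine, not confirmed by this PR.

```python
import csv
import glob


def seed_column_names(pattern: str) -> list[str]:
    """Hypothetical helper: read column names from the first matching CSV file."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files match {pattern!r}")

    # Only the header row of the first file is read. The remaining
    # partitions are assumed (not checked) to have the same columns.
    with open(matches[0], newline="", encoding="utf-8") as f:
        return next(csv.reader(f))
```

Reading only one header keeps the check cheap for large partitioned datasets, at the cost of not detecting schema drift between partitions.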

Base automatically changed from nm/seed-config-partition-strategy to main November 4, 2025 23:36
Comment thread src/data_designer/config/seed.py
Comment thread src/data_designer/config/seed.py
johnnygreco previously approved these changes Nov 4, 2025
@eric-tramel

Contributor

big fan, here. Thanks @nabinchha !

eric-tramel previously approved these changes Nov 5, 2025
@nabinchha nabinchha dismissed stale reviews from eric-tramel and johnnygreco via 690b00f November 5, 2025 18:23

nabinchha commented Nov 5, 2025

Contributor Author

Found another validation fix we need to make while testing a notebook:
690b00f cc @johnnygreco

As is, we will not support wildcard seed paths when the seed dataset needs to be uploaded to a datastore.
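
To illustrate the kind of check this implies, a minimal sketch that rejects wildcard patterns on the upload path follows. It is not the code from 690b00f. `has_wildcard` and `validate_upload_path` are hypothetical names, and treating `*`, `?`, and `[` as the wildcard characters is an assumption based on common glob syntax.

```python
# Characters that commonly mark a glob pattern (assumption for this sketch).
WILDCARD_CHARS = frozenset("*?[")


def has_wildcard(dataset: str) -> bool:
    """Return True if the path contains any glob wildcard character."""
    return any(ch in WILDCARD_CHARS for ch in dataset)


def validate_upload_path(dataset: str) -> str:
    """Hypothetical check: only single-file paths may be uploaded to a datastore."""
    if has_wildcard(dataset):
        raise ValueError(
            f"Wildcard seed dataset paths cannot be uploaded to a datastore: {dataset!r}"
        )
    return dataset
```

Failing early with a clear message at config time is likely friendlier than letting the upload fail later on a path that does not name a single file.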

@nabinchha nabinchha requested a review from johnnygreco November 5, 2025 18:27
@nabinchha nabinchha requested a review from eric-tramel November 5, 2025 18:39
Comment thread src/data_designer/config/datastore.py Outdated
Comment thread src/data_designer/config/datastore.py Outdated
johnnygreco previously approved these changes Nov 5, 2025
@nabinchha nabinchha merged commit d01e5bf into main Nov 5, 2025
10 checks passed
@nabinchha nabinchha deleted the nabinchha/bug/2-support-seed-path-with-partition-files branch November 5, 2025 19:37

3 participants