Skip to content

Feature/oswpq reader#11

Merged
singjc merged 7 commits intomasterfrom
feature/oswpq_reader
Nov 16, 2025
Merged

Feature/oswpq reader#11
singjc merged 7 commits intomasterfrom
feature/oswpq_reader

Conversation

@singjc
Copy link
Owner

@singjc singjc commented Nov 16, 2025

This pull request adds support for the PyProphet split parquet format (oswpqd) to ARYCAL, enabling fully parallelized reading and alignment workflows. The changes introduce a unified feature accessor abstraction for both OSW (SQLite) and OSWPQ (parquet) formats, update documentation and configuration templates, and simplify the input configuration structure. These improvements make the tool more flexible, easier to use, and better documented for multiple data formats.

Major feature additions:

  • Added support for PyProphet split parquet format (oswpqd) in both the CLI and configuration, including a parallelized reader and unified feature accessor (FeatureAccessor) for both OSW and OSWPQ formats. [1] [2]

  • Updated input validation and configuration to support both sqMass and parquet for XIC files, and both osw and oswpqd for feature files. [1] [2]

Documentation and usability improvements:

  • Expanded the README.md with clear documentation for multiple input formats, quick start instructions, configuration field explanations, and example configurations for both OSW and OSWPQ workflows. [1] [2]

Codebase simplification:

  • Removed unused GUI-related fields from the input configuration structure to streamline CLI usage.

  • Refactored the codebase to use the new FeatureAccessor abstraction throughout the alignment and results writing logic, ensuring compatibility with both OSW and OSWPQ feature types. [1] [2] [3] [4] [5] [6]

These changes collectively improve ARYCAL's flexibility, performance, and ease of use for users working with different chromatogram alignment workflows.

Summary by CodeRabbit

  • New Features

    • Support for PyProphet split-parquet (OSWPQ), sqMass and Parquet XIC inputs.
    • Automatic config template generation and Quick Start flow when no parameters provided.
    • New top-level config options: threads, log_level, compute_scores, scores_output_file, retain_alignment_path.
  • Documentation

    • Expanded README with Quick Start, template usage, parallelization notes, and detailed Configuration Fields.
  • Improvements

    • Stronger config validation and file-path inference; clearer config field naming (decoy → include_decoys) and updated alignment defaults.

…essors

- Updated Input validation to accept 'parquet' as a valid xic type alongside 'sqMass'.
- Enhanced FeaturesFileType enum to include OSWPQ and updated deserialization logic.
- Introduced FeatureAccessor enum to handle both OSW and OSWPQ feature accessors.
- Implemented OswpqAccess struct for reading PyProphet's split parquet format.
- Updated Runner to support loading features from OSWPQ files and adjusted related methods.
- Added necessary methods in OswpqAccess for fetching precursor features and writing alignment data.
…dd peak mapping conversion for PyProphet format
… configuration template generation, and implement output file backup
@coderabbitai
Copy link

coderabbitai bot commented Nov 16, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds PyProphet Split Parquet (OSWPQ) feature-format support, expands XIC/feature input formats, refactors feature access via a new FeatureAccessor, adds CLI config-template generation, and tightens input validation and file-type inference.

Changes

Cohort / File(s) Summary
Documentation & Configuration Schema
README.md, crates/arycal-common/src/config.rs
README updated with OSWPQ/parquet support, parallelization notes, quick-start and template guidance. FeaturesFileType::OSWPQ added; FiltersConfig.decoy replaced by include_decoys (serde alias preserved); defaults updated.
Input Validation & Loading
crates/arycal-cli/src/input.rs
Removed GUI and several CLI-specific public fields; added from_arguments, load, save_to_file; added file-existence checks, file-type inference from paths, and support for Parquet XIC and OSWPQ features; adjusted logging and overrides handling.
Core Alignment Refactor
crates/arycal-cli/src/lib.rs
Introduced FeatureAccessor enum (Osw / Oswpq) and delegation methods; Runner and alignment write functions refactored to operate on FeatureAccessor; added feature-type logging and early failure for Unknown types; updated call sites and signatures.
CLI Template Generation & Control Flow
crates/arycal-cli/src/main.rs
parameters arg made optional; added generate_config_template to emit arycal_config_template.json + docs when no config provided; CLI now prints guidance and exits when template is generated.
OSWPQ Module & Integration
crates/arycal-cloudpath/src/oswpq.rs, crates/arycal-cloudpath/src/lib.rs
New oswpq module: OswpqAccess, error type, data models (PrecursorFeature, AlignmentFeature), DuckDB-backed parquet read/write, run discovery, parallel feature fetching, batch alignment writing, and helpers; pub mod oswpq exported.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Main
    participant Input as Input Loader
    participant Runner as Alignment Runner
    participant FeatureAccessor as FeatureAccessor
    participant OSW as OswAccess
    participant OSWPQ as OswpqAccess

    CLI->>CLI: Check for config file
    alt No config provided
        CLI->>CLI: generate_config_template()
        CLI->>CLI: Write arycal_config_template.json + docs
        CLI->>CLI: Exit with guidance
    else Config provided
        CLI->>Input: from_arguments()
        Input->>Input: infer_types() from paths
        Input->>Input: validate() existence & types
        Input-->>CLI: Return Input
        
        CLI->>Runner: Initialize with Input
        alt FeaturesFileType::OSW
            Runner->>OSW: Create OswAccess
            Runner->>FeatureAccessor: Wrap as Osw
        else FeaturesFileType::OSWPQ
            Runner->>OSWPQ: Create OswpqAccess
            Runner->>FeatureAccessor: Wrap as Oswpq
        else FeaturesFileType::Unknown
            Runner->>Runner: Fail early with error
        end
        
        Runner->>FeatureAccessor: fetch_transition_ids()
        FeatureAccessor-->>FeatureAccessor: Delegate to inner impl
        Runner->>Runner: Run alignment & scoring
        Runner->>FeatureAccessor: write_ms2_alignment_batch()
        FeatureAccessor-->>FeatureAccessor: Route to OSW or OSWPQ writer
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Substantial refactor touching core Runner and many call sites (FeatureAccessor delegation).
  • Large new module (oswpq.rs) with DuckDB/parquet logic and rayon parallelism.
  • Config semantics changed (decoy → include_decoys) and new defaults; verify backward compatibility.
  • Areas to inspect closely:
    • DuckDB in-memory queries and parquet export paths
    • Concurrency (rayon) safety and batching logic in OSWPQ writes
    • Correct delegation and parity between OswAccess and OswpqAccess methods
    • File-type inference and file-existence validation paths

Possibly related PRs

  • Feature/oswpq reader #11 — Implements OSWPQ support and FeatureAccessor abstraction; likely overlaps strongly with this PR's OSWPQ and access-layer changes.
  • Update/gui #2 — Modifies core CLI/config types (Input, config enums); changes are directly related to this PR's input validation and config schema updates.

Poem

🐰 I dug through parquet fields and rows,

where OSW and OSWPQ both grow.
A template sprouted, threads awake,
FeatureAccessor hops in stride to make
alignments sing as outputs flow. 🎋

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Feature/oswpq reader' refers to a key feature added in the PR, but is somewhat generic and doesn't clearly communicate the primary scope of the change—which involves adding OSWPQ support and introducing a unified feature accessor abstraction for multiple file formats. Consider a more descriptive title such as 'Add PyProphet parquet (OSWPQ) feature reader and unified FeatureAccessor' to clarify that this adds both a new file format reader and an important abstraction for feature access.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 88.10% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01dc35b and 97861d4.

📒 Files selected for processing (4)
  • README.md (2 hunks)
  • crates/arycal-cli/src/main.rs (3 hunks)
  • crates/arycal-cloudpath/src/oswpq.rs (1 hunks)
  • crates/arycal-common/src/config.rs (6 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
crates/arycal-cli/src/input.rs (2)

115-137: Error message for features type mentions oswpqd instead of oswpq

In validate:

if self.features.file_type != Some(FeaturesFileType::OSW) && self.features.file_type != Some(FeaturesFileType::OSWPQ) {
    return Err(io::Error::new(
        io::ErrorKind::InvalidData,
        format!(
            "Invalid features type {:?}; expected 'osw' or 'oswpqd'",
            self.features.file_type
        ),
    ).into());
}

You correctly accept FeaturesFileType::OSWPQ, whose string representation is "oswpq", but the error message says "oswpqd". That’s slightly misleading and could confuse users when they copy/paste from the message into their config.

I’d suggest updating the message to:

-            "Invalid features type {:?}; expected 'osw' or 'oswpqd'",
+            "Invalid features type {:?}; expected 'osw' or 'oswpq'",

to match the actual accepted values and the template documentation.


92-113: Fix extension-based type inference to use extension() method instead of ends_with()

In crates/arycal-cli/src/input.rs lines 92-113, PathBuf::ends_with compares whole path components, not filename suffixes. For typical file paths like "/path/to/merged.sqMass", the condition ends_with(".sqMass") will fail since ".sqMass" is not a complete path component. This causes the inference to default to Unknown, which validation then rejects unless the file type is explicitly set in JSON.

Use the extension() method instead:

if first_file.extension() == Some(OsStr::new("sqMass")) {
    self.xic.file_type = Some(XicFileType::SqMass);
}

Apply the same fix to the features file type check (use "osw" instead of ".osw").

crates/arycal-common/src/config.rs (1)

113-137: Alias mapping does not invert legacy decoy semantics as documented; old configs will have inverted behavior

The review comment is accurate. With only #[serde(alias = "decoy")], the mapping is direct (decoy: Xinclude_decoys: X), which inverts the intended behavior:

  • Old decoy: true (exclude decoys) → include_decoys: truefilter_decoys = false (include) — wrong
  • Old decoy: false (include decoys) → include_decoys: falsefilter_decoys = true (exclude) — wrong

The contradictory comments at lines 118-120 and 121 also create confusion: one suggests semantics are preserved, the other admits they won't be ("interpreted with new semantics").

To resolve this:

  • Option A: Add a custom Deserialize impl for FiltersConfig that inverts the deserialized decoy value
  • Option B: Remove the misleading documentation and clarify this is an intentional breaking change (update lines 118-120)
crates/arycal-cli/src/lib.rs (1)

371-419: Update MPI path to wrap OswAccess in FeatureAccessor enum and ensure Clone is derived

The MPI run() method at lines 408-418 has a type mismatch that will fail compilation:

  • Line 412 creates vec![osw_access] as Vec<OswAccess>
  • Line 414 tries to assign self.feature_access.clone() as Vec<FeatureAccessor>
  • These incompatible types cannot coexist in the same variable

The non-MPI path (lines 268-278) correctly wraps variants: FeatureAccessor::Osw(osw_access) and FeatureAccessor::Oswpq(oswpq_access).

The MPI path must do the same:

let osw_access = OswAccess::new(&scores_output_file, false)?;
vec![FeatureAccessor::Osw(osw_access)]  // Wrap in enum variant

Also verify that FeatureAccessor derives Clone (not visible in the enum definition at line 41), since line 414 calls .clone().

🧹 Nitpick comments (9)
README.md (1)

178-179: Missing documentation for top-level config fields.

The Configuration Fields section documents nested config sections (XIC, Features, Filters, Alignment) but omits documentation for the top-level threads and log_level fields that appear in the OSW example config (lines 178-179). Add these to the Configuration Fields section for completeness.

Consider adding:

**Top-level Section:**
- `threads`: Number of threads for parallel processing (default/auto if omitted)
- `log_level`: Logging verbosity ("trace", "debug", "info", "warn", "error")

Also applies to: 215-215

crates/arycal-cli/src/main.rs (1)

135-187: Optional parameters flow and template error handling look correct

Making parameters optional and explicitly handling the “no config provided” case (generating arycal_config_template.json, printing guidance, then exiting with code 1) cleanly avoids panics in Input::from_arguments. Just be aware that Input::from_arguments still calls .expect("required parameters"), so if it’s ever used outside this guarded path it will need a more graceful handling of missing configs.

crates/arycal-cli/src/lib.rs (2)

163-203: Feature type dispatch in Runner::new looks correct, but still assumes a feature file exists

The new feature_file_type switch correctly routes to OswAccess::new vs OswpqAccess::new based on FeaturesFileType, with an explicit error for Unknown. That matches the updated config schema.

One thing to consider: parameters.features.file_paths[0] will panic if features.file_paths is empty. Today that’s only prevented by config discipline; a small explicit check with a clear error (e.g., “no feature file configured”) would make this more robust.


1180-1205: write_aligned_score_results_to_db still OSW-specific and currently unused

write_aligned_score_results_to_db still accepts &[OswAccess] and isn’t called anymore (the call is commented out). That’s fine for now, but if you plan to resurrect full-trace score writes you’ll likely want to:

  • Switch its parameter to &[FeatureAccessor], and
  • Add corresponding methods to FeatureAccessor so it can write to OSW and OSWPQ backends consistently.

Otherwise, consider removing or clearly marking it as legacy to avoid confusion.

crates/arycal-cli/src/input.rs (2)

36-73: from_arguments flow is reasonable; note hard expect on parameters

The from_arguments pipeline (load → infer_types → validate → apply xic_paths override → final consistency checks) is straightforward and integrates well with the new CLI behavior.

Just be aware that it still calls:

let path = matches
    .get_one::<String>("parameters")
    .expect("required parameters");

While main now guards the “no parameters supplied” case, this will panic if from_arguments is ever used elsewhere without that guard. If you want the library API to be more robust on its own, you could return a Result error instead of panicking when the parameter is missing.


138-159: Path existence checks are good; note that CLI overrides aren’t revalidated

The validate loops that ensure every path in xic.file_paths and features.file_paths exists are helpful, especially with OSWPQ directories allowed.

One subtle point: you apply the xic_paths CLI override after calling validate in from_arguments, so additional paths supplied on the command line aren’t checked for existence here. If you want full parity, you could either re-run validate() after applying overrides or explicitly check those extra paths.

crates/arycal-cloudpath/src/oswpq.rs (3)

568-568: Address TODO: Strip modifications from modified sequence.

The comment indicates that unmodified sequences should be extracted by stripping modifications, but currently the modified sequence is used as-is. This may cause issues if downstream code expects true unmodified sequences.

Would you like me to help implement a modification-stripping function, or should this be tracked as a separate issue? Common formats use brackets [] or parentheses () for modifications.


747-859: Consider extracting common write logic to reduce duplication.

The write_ms2_alignment_detailed method duplicates the batch insert and append pattern from write_alignment_features. Consider extracting a generic write_to_parquet_with_schema helper method.

This would reduce maintenance burden and ensure consistent behavior across write methods. However, since this method is marked #[allow(dead_code)], it may be lower priority.


962-973: Consider expanding test coverage.

The current test only covers the error case for a nonexistent path. Consider adding tests for:

  • Successful creation with a valid test directory
  • Backup behavior when output files exist
  • Path handling with/without .oswpq extension
  • Query building and result parsing with mock data

These tests would increase confidence in the implementation, especially for the SQL generation logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 952c22e and 01dc35b.

📒 Files selected for processing (7)
  • README.md (2 hunks)
  • crates/arycal-cli/src/input.rs (2 hunks)
  • crates/arycal-cli/src/lib.rs (10 hunks)
  • crates/arycal-cli/src/main.rs (3 hunks)
  • crates/arycal-cloudpath/src/lib.rs (1 hunks)
  • crates/arycal-cloudpath/src/oswpq.rs (1 hunks)
  • crates/arycal-common/src/config.rs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/arycal-cli/src/lib.rs (2)
crates/arycal-cloudpath/src/oswpq.rs (6)
  • fetch_transition_ids (497-581)
  • fetch_feature_data_for_precursor_batch (684-702)
  • fetch_full_precursor_feature_data_for_runs (584-680)
  • create_feature_ms2_alignment_table (706-711)
  • create_feature_transition_alignment_table (714-717)
  • new (116-140)
crates/arycal-cloudpath/src/osw.rs (10)
  • fetch_transition_ids (884-1050)
  • fetch_feature_data_for_precursor_batch (1763-1958)
  • fetch_full_precursor_feature_data_for_runs (1364-1570)
  • create_feature_ms2_alignment_table (2070-2139)
  • create_feature_transition_alignment_table (2212-2267)
  • insert_feature_ms2_alignment_batch (2142-2208)
  • insert_feature_transition_alignment_batch (2270-2323)
  • new (171-200)
  • new (220-236)
  • new (314-337)
crates/arycal-cli/src/main.rs (2)
crates/arycal-cli/src/input.rs (1)
  • default (24-33)
crates/arycal-common/src/config.rs (8)
  • default (13-15)
  • default (50-52)
  • default (130-137)
  • default (180-198)
  • default (281-305)
  • default (372-427)
  • default (494-532)
  • default (644-722)
crates/arycal-cloudpath/src/oswpq.rs (1)
crates/arycal-cli/src/lib.rs (6)
  • new (160-240)
  • fetch_transition_ids (47-63)
  • fetch_full_precursor_feature_data_for_runs (81-96)
  • fetch_feature_data_for_precursor_batch (65-79)
  • create_feature_ms2_alignment_table (98-109)
  • create_feature_transition_alignment_table (111-122)
crates/arycal-cli/src/input.rs (2)
crates/arycal-cloudpath/src/oswpq.rs (1)
  • new (116-140)
crates/arycal-cli/src/lib.rs (1)
  • new (160-240)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (8)
README.md (3)

24-24: Feature list accurately documents parallelization and multi-format support.

The additions clearly communicate the new parallelization capability and expanded input format support. The documentation is clear and actionable for users evaluating whether ARYCAL meets their needs.

Also applies to: 30-34


183-214: OSWPQ example is valid but inconsistently detailed compared to OSW example.

The OSWPQ configuration correctly shows the new parquet file types and .oswpqd directory path. However, the example is noticeably sparse compared to the OSW example—it omits many fields present in the OSW config (e.g., use_tic, smoothing, compute_scores, threads, log_level).

This risks confusing users about whether missing fields are required, optional, or have defaults. Either:

  1. Include the same level of detail in both examples, or
  2. Add a note explaining that the OSWPQ example shows only essential fields and that all other fields from the OSW example are also supported.

Please verify that all omitted fields have sensible defaults or are truly optional by checking the config schema.


135-181: Based on the verification results from the shell script execution, the filters section field names have been confirmed to match the codebase exactly:

  • include_decoys: bool ✓
  • include_identifying_transitions: Option ✓
  • max_score_ms2_qvalue: Option ✓
  • precursor_ids: Option ✓

All default values in the example also match the FiltersConfig implementation. The field names use proper Rust naming conventions with serde renames handled correctly.

However, the shell script output was truncated before showing the complete AlignmentConfig struct definition and the top-level Config struct fields (threads, log_level). While the filters section is verified as correct, a complete verification of all fields in the example would require viewing the full struct definitions.

No issues found in the filters section; field naming is consistent with the schema.

crates/arycal-cloudpath/src/lib.rs (1)

11-18: OSWPQ module export looks good

Adding pub mod oswpq; is consistent with the new OswpqAccess usage in arycal-cli and is required for those imports to compile. No further issues here.

crates/arycal-common/src/config.rs (1)

42-77: OSWPQ feature type wiring is consistent

The new FeaturesFileType::OSWPQ variant, deserialization from "oswpq" | "parquet", and as_str returning "oswpq" are consistent with the CLI and template documentation and should work as expected for OSWPQ feature sources.

crates/arycal-cli/src/lib.rs (2)

40-149: FeatureAccessor abstraction is a good unification of OSW/OSWPQ backends

Wrapping OswAccess and OswpqAccess in FeatureAccessor and forwarding the key methods (fetch_transition_ids, feature fetch, and alignment-table/insert methods) gives you a clean, uniform surface for the rest of the pipeline. The error mapping into anyhow::Result also keeps the call sites tidy.


267-285: Remove immutable variable declaration and add mut keyword

The code declares separate_output_accessor immutable, then assigns to it in both conditional branches, which is a Rust compilation error. This must be mutable:

-        let separate_output_accessor: Option<FeatureAccessor>;
+        let mut separate_output_accessor: Option<FeatureAccessor>;

With this change, the borrow via from_ref is valid because separate_output_accessor stays in scope and is not mutated after the borrow.

Likely an incorrect or invalid review comment.

crates/arycal-cloudpath/src/oswpq.rs (1)

682-702: LGTM: Effective parallelization pattern.

The parallel processing of precursor batches is well-implemented. Each thread creates its own DuckDB connection (via fetch_full_precursor_feature_data_for_runs), avoiding concurrency issues.

Comment on lines +15 to +101
fn generate_config_template(path: &str) -> Result<()> {
// Create a default Input with only core alignment fields
let template_config = Input {
xic: XicConfig::default(),
features: FeaturesConfig::default(),
filters: FiltersConfig::default(),
alignment: AlignmentConfig::default(),
threads: std::thread::available_parallelism().unwrap().get().saturating_sub(1).max(1),
log_level: "info".to_string(),
};

// Serialize to pretty JSON
let json = serde_json::to_string_pretty(&template_config)?;

// Create documentation to append
let docs = r#"

# Configuration Guide:
#
# NOTE: Remove comments (lines starting with #) before using this file as a configuration.
#
# XIC Section:
# - include-precursor: Include precursor chromatograms (true/false)
# - num-isotopes: Number of isotopic peaks to include (typically 3)
# - file-type: "sqMass" for SQLite-based XICs or "parquet" for Parquet-based XICs
# - file-paths: List of paths to your XIC files
#
# Features Section:
# - file-type: "osw" for SQLite features or "oswpq" for PyProphet split parquet format
# - file-paths: Path to your features file (merged.osw or merged_runs.oswpqd directory)
#
# Filters Section:
# - include_decoys: false = only align targets, true = align both targets and decoys
# - include_identifying_transitions: Include non-quantifying transitions (default: false)
# - max_score_ms2_qvalue: Maximum q-value threshold for filtering (1.0 = no filtering)
# - precursor_ids: Optional path to TSV file with specific precursor IDs to process
#
# Alignment Section:
# - batch_size: Number of precursors to process before writing results (10000 recommended)
# - method: Alignment method - "FFT", "DTW", or "FFTDTW" (FFTDTW recommended)
# - reference_type: "star" (align to one reference) or "mst" (minimum spanning tree)
# - reference_run: Specific run to use as reference (null = auto-select)
# - use_tic: Use total ion chromatogram for alignment (always true)
# - smoothing: Savitzky-Golay filter parameters
# * sgolay_window: Window size (must be odd, typically 11)
# * sgolay_order: Polynomial order (typically 3)
# - rt_mapping_tolerance: Retention time tolerance in seconds for peak mapping (10.0 recommended)
# - decoy_peak_mapping_method: "shuffle" or "random_region"
# - decoy_window_size: Window size for random_region method
# - compute_scores: Calculate alignment scores (true recommended)
# - scores_output_file: Write scores to separate file (null = write to input file)
# - retain_alignment_path: Keep alignment path data (needed for identifying transitions)
#
# Threads & Logging:
# - threads: Number of threads to use (default = # of CPUs - 1)
# - log_level: Log verbosity ("error", "warn", "info", "debug", "trace")
#
# Example configurations:
#
# For PyProphet parquet format:
# "features": {
# "file-type": "oswpq",
# "file-paths": ["/path/to/merged_runs.oswpqd"]
# }
#
# For parquet XICs:
# "xic": {
# "file-type": "parquet",
# "file-paths": ["/path/to/file1.parquet", "/path/to/file2.parquet"]
# }
#
# To filter for specific precursors:
# "filters": {
# "precursor_ids": "/path/to/precursor_list.tsv"
# }
#
# To use a specific run as reference:
# "alignment": {
# "reference_type": "star",
# "reference_run": "your_run_name"
# }
"#;

let mut file = File::create(path)?;
file.write_all(json.as_bytes())?;
file.write_all(docs.as_bytes())?;
Ok(())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Template generation is fine; fix decoy_peak_mapping_method name mismatch

The template generation logic (default Input, pretty-printed JSON, appended docs) is sound.

However, the docs block advertises:

  • decoy_peak_mapping_method: "shuffle" or "random_region"

while the implementation in crates/arycal-cli/src/lib.rs checks for "shuffle" and "random_regions" (plural). With the current code, users following the docs and setting "random_region" will silently get no decoy peak generation for the “random region” mode.

Consider either:

  • Updating the docs here and in config.rs to "random_regions", or
  • Accepting both spellings in the implementation for robustness.

As a minor clean-up, you could also build template_config via Input::default() to avoid duplicating the default field initialization.


I'll verify the core claim about the decoy_peak_mapping_method spelling mismatch by examining the implementation files.


🏁 Script executed:

#!/bin/bash
# Search for decoy_peak_mapping_method usage in lib.rs and config.rs
echo "=== Searching for decoy_peak_mapping_method in lib.rs ==="
rg -n "decoy_peak_mapping_method|random_region" crates/arycal-cli/src/lib.rs -C 3

echo -e "\n=== Searching for decoy_peak_mapping_method in config.rs ==="
rg -n "decoy_peak_mapping_method|random_region" crates/arycal-cli/src/ -C 2 --type rust

echo -e "\n=== Checking AlignmentConfig struct definition ==="
ast-grep --pattern 'struct AlignmentConfig {
  $$$
}'

Length of output: 6217


I need to locate the AlignmentConfig struct definition to verify the field name:


🏁 Script executed:

#!/bin/bash
# Find AlignmentConfig struct definition
echo "=== Searching for AlignmentConfig struct definition ==="
rg -n "struct AlignmentConfig" crates/arycal-cli/src/ -A 20 --type rust

echo -e "\n=== Also check for serde rename attributes ==="
rg -n "AlignmentConfig|#\[serde" crates/arycal-cli/src/ -B 2 -A 8 --type rust | head -100

Length of output: 3410


The AlignmentConfig is defined in arycal_common::config. Let me search for it there:


🏁 Script executed:

#!/bin/bash
# Find AlignmentConfig struct in arycal_common
echo "=== Searching for AlignmentConfig struct in arycal_common ==="
find . -name "*.rs" -path "*/arycal_common/*" -exec grep -l "struct AlignmentConfig" {} \;

echo -e "\n=== Get the struct definition and fields ==="
find . -name "*.rs" -path "*/arycal_common/*" -exec grep -l "AlignmentConfig" {} \; | head -5 | xargs -I {} sh -c 'echo "File: {}"; grep -A 30 "struct AlignmentConfig" {}'

Length of output: 107


Let me search more broadly for the struct definition:


🏁 Script executed:

#!/bin/bash
# Broader search for AlignmentConfig definition
echo "=== Searching entire codebase for AlignmentConfig struct ==="
rg -n "struct AlignmentConfig" . -A 25 --type rust

echo -e "\n=== Check if it's in a config.rs file ==="
fd "config.rs" . --type f

Length of output: 2974


Fix decoy_peak_mapping_method spelling: docs advertise "random_region" but code expects "random_regions"

The implementation in crates/arycal-cli/src/lib.rs (lines 712, 1131) checks for "random_regions" (plural), while the documentation in crates/arycal-cli/src/main.rs (lines 62–63) and the comment in crates/arycal-common/src/config.rs (line 165) advertise "random_region" (singular). Users following the docs will silently receive no decoy peak generation for that mode.

Update crates/arycal-cli/src/main.rs lines 62–63 and crates/arycal-common/src/config.rs line 165 to match the implementation: change "random_region" to "random_regions".

As a minor refactor, also simplify template_config initialization at crates/arycal-cli/src/main.rs line 17 by using Input::default() and only override non-default fields.

Comment on lines +281 to +304
let query = format!(
r#"
SELECT
PROTEIN_ID,
PEPTIDE_ID,
IPF_PEPTIDE_ID,
PRECURSOR_ID,
RUN_ID,
FILENAME,
FEATURE_ID,
EXP_RT,
LEFT_WIDTH,
RIGHT_WIDTH,
FEATURE_MS2_AREA_INTENSITY,
SCORE_MS2_Q_VALUE,
SCORE_MS2_PEP,
PRECURSOR_DECOY
FROM read_parquet('{}')
WHERE PRECURSOR_ID = {}
ORDER BY EXP_RT
"#,
precursor_path.display(),
precursor_id
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate or parameterize file paths to prevent SQL injection.

The file path is directly interpolated into the SQL query via format!. While paths are derived from self.base_path, if an attacker can control directory names (e.g., through symlinks or directory naming), they could potentially inject SQL.

Consider validating that the path doesn't contain SQL metacharacters, or use DuckDB's parameter binding if supported for file paths:

// Validate path doesn't contain dangerous characters
let path_str = precursor_path.display().to_string();
if path_str.contains('\'') || path_str.contains(';') {
    return Err(OpenSwathParquetError::InvalidFormat(
        "Invalid characters in file path".to_string()
    ));
}

Alternatively, canonicalize paths and ensure they're within the expected base directory.

🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 281-304 the code
interpolates precursor_path directly into a SQL string which can allow SQL
injection via crafted paths; fix by converting precursor_path to a String,
canonicalizing it (resolve symlinks) and verifying it is inside the expected
base directory, reject paths containing dangerous characters like '\'' or ';'
returning OpenSwathParquetError::InvalidFormat on failure, and then use a safe
mechanism to pass the path into the query (prefer parameter binding if supported
by DuckDB; otherwise properly escape single quotes after validation) before
constructing the SQL.

Comment on lines +347 to +370
let conn = self.create_connection()?;

// Build a query that unions all precursor IDs from all runs
let mut union_queries = Vec::new();

for run_name in &runs {
let precursor_path = self.get_precursors_features_path(run_name);

if precursor_path.exists() {
union_queries.push(format!(
"SELECT DISTINCT PRECURSOR_ID FROM read_parquet('{}')",
precursor_path.display()
));
}
}

if union_queries.is_empty() {
return Ok(Vec::new());
}

let query = format!(
"SELECT DISTINCT PRECURSOR_ID FROM ({}) ORDER BY PRECURSOR_ID",
union_queries.join(" UNION ALL ")
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate file paths before SQL interpolation.

Similar to the previous comment, file paths are interpolated directly into SQL queries. Ensure paths are validated to prevent potential SQL injection through malicious directory names.

Apply the same validation as suggested in the previous comment.

🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 347 to 370, file system
paths are being interpolated directly into the SQL string for read_parquet which
permits malicious directory names to break or inject SQL; validate and sanitize
each precursor_path before using it: canonicalize the path and ensure it resides
under an expected base directory (reject or skip otherwise), convert to a UTF-8
string (or reject non-UTF-8), and escape single quotes by replacing ' with ''
(or otherwise percent-encode) before formatting into the SQL; only push the
formatted query for paths that pass these checks so you never interpolate
unvalidated user-controlled names.

Comment on lines +464 to +477
// Export to parquet (append mode if file exists)
let export_query = if output_path.exists() {
// Read existing data, union with new data, and overwrite
format!(
"COPY (SELECT * FROM read_parquet('{}') UNION ALL SELECT * FROM alignment_temp) TO '{}' (FORMAT PARQUET)",
output_path.display(),
output_path.display()
)
} else {
format!(
"COPY alignment_temp TO '{}' (FORMAT PARQUET)",
output_path.display()
)
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Optimize append pattern to avoid full file rewrites.

The current append implementation reads the entire existing parquet file, unions it with new data, and rewrites everything. This is O(n) per write, leading to O(n²) complexity for multiple appends.

Consider one of these alternatives:

  1. Write to separate parquet files and union them at read time
  2. Keep data in memory/temp tables during the run and write once at the end
  3. Use DuckDB's native append support if available (check DuckDB version)

The current pattern is acceptable for small datasets but will severely impact performance for large-scale alignments. Given the PR's goal of "fully parallelized reading and alignment workflows," this bottleneck should be addressed.

Comment on lines +522 to +548
// Build the query
let mut query = format!(
r#"
SELECT DISTINCT
PRECURSOR_ID,
MODIFIED_SEQUENCE,
PRECURSOR_CHARGE,
PRECURSOR_DECOY
FROM read_parquet('{}')
WHERE 1=1
"#,
precursor_path.display()
);

// Add decoy filter if needed
if filter_decoys {
query.push_str(" AND (PRECURSOR_DECOY = 0 OR PRECURSOR_DECOY IS NULL)");
}

// Add precursor ID filter if provided
if let Some(ref ids) = precursor_ids {
let id_list = ids.iter()
.map(|id| id.to_string())
.collect::<Vec<_>>()
.join(",");
query.push_str(&format!(" AND PRECURSOR_ID IN ({})", id_list));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate file path to prevent SQL injection.

The file path is directly interpolated into the SQL query. Apply the same validation as suggested in earlier comments.

Comment on lines +600 to +616
let query = format!(
r#"
SELECT
FILENAME,
RUN_ID,
PRECURSOR_ID,
FEATURE_ID,
EXP_RT,
LEFT_WIDTH,
RIGHT_WIDTH,
FEATURE_MS2_AREA_INTENSITY
FROM read_parquet('{}')
WHERE PRECURSOR_ID = ?
ORDER BY EXP_RT
"#,
precursor_path.display()
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Good: Parameterized query for precursor_id; validate file path.

Good use of parameterized queries for precursor_id (line 621). However, the file path is still interpolated via format!. Apply path validation as suggested earlier.

🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 600 to 616, the SQL uses
format! to interpolate precursor_path into the read_parquet literal which can be
unsafe; validate and sanitize the path before interpolating. Canonicalize the
Path (std::fs::canonicalize), check it exists, and ensure it is under your
allowed base directory (compare canonicalized prefixes) to prevent directory
traversal; then convert to a safe string (e.g., to_string_lossy) and either
escape any single quotes before calling format! or—preferably—use the query
engine's parameter/binding API for file paths if available; update the code to
perform these checks and only interpolate a validated, canonicalized, and
escaped path (or switch to a bound parameter) when constructing the query.

Comment on lines +930 to +942
// Export to parquet (append mode if file exists)
let export_query = if output_path.exists() {
format!(
"COPY (SELECT * FROM read_parquet('{}') UNION ALL SELECT * FROM transition_alignment_temp) TO '{}' (FORMAT PARQUET)",
output_path.display(),
output_path.display()
)
} else {
format!(
"COPY transition_alignment_temp TO '{}' (FORMAT PARQUET)",
output_path.display()
)
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Same append inefficiency as in previous write methods.

This method also uses the inefficient append-by-union pattern. The performance concern raised for write_alignment_features (lines 464-477) applies here as well.

Comment on lines +216 to +242
### Configuration Fields

**XIC Section:**
- `include-precursor`: Include precursor chromatograms (true/false)
- `num-isotopes`: Number of isotopic peaks (typically 3)
- `file-type`: "sqMass" or "parquet"
- `file-paths`: List of XIC file paths

**Features Section:**
- `file-type`: "osw" (SQLite) or "oswpq" (PyProphet split parquet)
- `file-paths`: Path(s) to feature files

**Filters Section:**
- `include_decoys`: false = targets only, true = targets + decoys
- `include_identifying_transitions`: Include non-quantifying transitions
- `max_score_ms2_qvalue`: Q-value threshold (1.0 = no filtering)
- `precursor_ids`: Optional TSV file with specific precursor IDs

**Alignment Section:**
- `batch_size`: Precursors to process per batch (10000 recommended)
- `method`: "FFT", "DTW", or "FFTDTW" (FFTDTW recommended)
- `reference_type`: "star" (single reference) or "mst" (minimum spanning tree)
- `reference_run`: Specific reference run name (null = auto-select)
- `rt_mapping_tolerance`: RT tolerance in seconds (10.0 recommended)
- `decoy_peak_mapping_method`: "shuffle" or "random_region"
- `compute_scores`: Calculate alignment scores (true recommended)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Configuration Fields section is well-structured; verify field names and completeness.

This new reference section is a valuable addition that explains each configuration option clearly. The recommendations (e.g., "FFTDTW recommended," "10.0 recommended") are helpful. However, verify:

  1. Field naming consistency: Confirm the documented field names exactly match the schema (e.g., hyphens in include-precursor vs underscores in include_decoys)
  2. Completeness: Ensure all documented fields actually exist and are configurable
  3. Missing fields: The Alignment section omits use_tic, smoothing, scores_output_file, and retain_alignment_path from the OSW example config—verify if these should be documented

🏁 Script executed:

#!/bin/bash
# Verify config field names and schema completeness

# Find all field definitions in config structs
rg -B2 -A2 "pub.*:" --type rust crates/arycal-common/src/config.rs | head -150

# Check for any config fields not mentioned in README
rg "pub\s+\w+:" --type rust crates/arycal-common/src/config.rs | sort | uniq

Length of output: 12899


Alignment section incomplete; 5 fields missing, one option undocumented.

Verification confirms the original concern. The AlignmentConfig struct contains 11 public fields, but the README documents only 7. The undocumented fields are:

  • use_tic (boolean; code note: "Currently not used. We always use TIC.")
  • smoothing (nested SmoothingConfig with sgolay_window and sgolay_order)
  • decoy_window_size (optional; used with "random_region" decoy method)
  • scores_output_file (optional; outputs scores to separate OSW file)
  • retain_alignment_path (boolean; retained when using identifying transitions)

Additionally, the reference_type field supports three options per code comments ("star", "mst", "progressive"), but the README lists only two.

Update the Alignment section to include these 5 fields with their descriptions and note that reference_type also supports "progressive".

🤖 Prompt for AI Agents
In README.md around lines 216 to 242, the Alignment section is missing five
fields and one option; update it to document the additional AlignmentConfig
public fields: add `use_tic` (boolean; currently unused, we always use TIC),
`smoothing` (object with `sgolay_window` and `sgolay_order` describing
Savitzky–Golay smoothing parameters), `decoy_window_size` (optional numeric,
used when `decoy_peak_mapping_method` is "random_region"), `scores_output_file`
(optional path to write alignment scores to a separate OSW file), and
`retain_alignment_path` (boolean; retain alignment path when using identifying
transitions); also update the `reference_type` entry to list "star", "mst", and
"progressive" as supported options and add brief recommended values where
appropriate.

@singjc singjc merged commit eee4ab6 into master Nov 16, 2025
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant