Conversation
…essors
- Updated Input validation to accept 'parquet' as a valid xic type alongside 'sqMass'.
- Enhanced FeaturesFileType enum to include OSWPQ and updated deserialization logic.
- Introduced FeatureAccessor enum to handle both OSW and OSWPQ feature accessors.
- Implemented OswpqAccess struct for reading PyProphet's split parquet format.
- Updated Runner to support loading features from OSWPQ files and adjusted related methods.
- Added necessary methods in OswpqAccess for fetching precursor features and writing alignment data.
…dd peak mapping conversion for PyProphet format
…d batch insertions in OSWPQ access
… configuration template generation, and implement output file backup
**Caution** Review failed. The pull request is closed.

**Walkthrough**

Adds PyProphet Split Parquet (OSWPQ) feature-format support, expands XIC/feature input formats, refactors feature access via a new FeatureAccessor, adds CLI config-template generation, and tightens input validation and file-type inference.

**Changes**
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant CLI as CLI Main
    participant Input as Input Loader
    participant Runner as Alignment Runner
    participant FeatureAccessor as FeatureAccessor
    participant OSW as OswAccess
    participant OSWPQ as OswpqAccess
    CLI->>CLI: Check for config file
    alt No config provided
        CLI->>CLI: generate_config_template()
        CLI->>CLI: Write arycal_config_template.json + docs
        CLI->>CLI: Exit with guidance
    else Config provided
        CLI->>Input: from_arguments()
        Input->>Input: infer_types() from paths
        Input->>Input: validate() existence & types
        Input-->>CLI: Return Input
        CLI->>Runner: Initialize with Input
        alt FeaturesFileType::OSW
            Runner->>OSW: Create OswAccess
            Runner->>FeatureAccessor: Wrap as Osw
        else FeaturesFileType::OSWPQ
            Runner->>OSWPQ: Create OswpqAccess
            Runner->>FeatureAccessor: Wrap as Oswpq
        else FeaturesFileType::Unknown
            Runner->>Runner: Fail early with error
        end
        Runner->>FeatureAccessor: fetch_transition_ids()
        FeatureAccessor-->>FeatureAccessor: Delegate to inner impl
        Runner->>Runner: Run alignment & scoring
        Runner->>FeatureAccessor: write_ms2_alignment_batch()
        FeatureAccessor-->>FeatureAccessor: Route to OSW or OSWPQ writer
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
**Pre-merge checks and finishing touches**

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
📜 **Recent review details**

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (4)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
crates/arycal-cli/src/input.rs (2)
**115-137: Error message for features type mentions `oswpqd` instead of `oswpq`**

In `validate`:

```rust
if self.features.file_type != Some(FeaturesFileType::OSW)
    && self.features.file_type != Some(FeaturesFileType::OSWPQ)
{
    return Err(io::Error::new(
        io::ErrorKind::InvalidData,
        format!(
            "Invalid features type {:?}; expected 'osw' or 'oswpqd'",
            self.features.file_type
        ),
    ).into());
}
```

You correctly accept `FeaturesFileType::OSWPQ`, whose string representation is `"oswpq"`, but the error message says `"oswpqd"`. That's slightly misleading and could confuse users when they copy/paste from the message into their config.

I'd suggest updating the message to:

```diff
- "Invalid features type {:?}; expected 'osw' or 'oswpqd'",
+ "Invalid features type {:?}; expected 'osw' or 'oswpq'",
```

to match the actual accepted values and the template documentation.
**92-113: Fix extension-based type inference to use the `extension()` method instead of `ends_with()`**

In `crates/arycal-cli/src/input.rs` lines 92-113, `PathBuf::ends_with` compares whole path components, not filename suffixes. For typical file paths like `"/path/to/merged.sqMass"`, the condition `ends_with(".sqMass")` will fail since ".sqMass" is not a complete path component. This causes the inference to default to `Unknown`, which validation then rejects unless the file type is explicitly set in JSON.

Use the `extension()` method instead:

```rust
if first_file.extension() == Some(OsStr::new("sqMass")) {
    self.xic.file_type = Some(XicFileType::SqMass);
}
```

Apply the same fix to the features file type check (use
`"osw"` instead of `".osw"`).

crates/arycal-common/src/config.rs (1)
**113-137: Alias mapping does not invert legacy `decoy` semantics as documented; old configs will have inverted behavior**

The review comment is accurate. With only `#[serde(alias = "decoy")]`, the mapping is direct (`decoy: X` → `include_decoys: X`), which inverts the intended behavior:

- Old `decoy: true` (exclude decoys) → `include_decoys: true` → `filter_decoys = false` (include) — wrong
- Old `decoy: false` (include decoys) → `include_decoys: false` → `filter_decoys = true` (exclude) — wrong

The contradictory comments at lines 118-120 and 121 also create confusion: one suggests semantics are preserved, the other admits they won't be ("interpreted with new semantics").

To resolve this:

- Option A: Add a custom `Deserialize` impl for `FiltersConfig` that inverts the deserialized `decoy` value
- Option B: Remove the misleading documentation and clarify this is an intentional breaking change (update lines 118-120)
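If Option A is chosen, the inversion itself is easy to isolate and unit-test before wiring it into a custom `Deserialize` impl; a minimal standalone sketch (helper name hypothetical, not from the codebase):

```rust
// Hypothetical migration helper: legacy `decoy: true` meant "exclude decoys",
// so the new `include_decoys` flag is its negation.
fn include_decoys_from_legacy(legacy_decoy: bool) -> bool {
    !legacy_decoy
}
```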
crates/arycal-cli/src/lib.rs (1)
**371-419: Update MPI path to wrap OswAccess in FeatureAccessor enum and ensure Clone is derived**

The MPI `run()` method at lines 408-418 has a type mismatch that will fail compilation:

- Line 412 creates `vec![osw_access]` as `Vec<OswAccess>`
- Line 414 tries to assign `self.feature_access.clone()` as `Vec<FeatureAccessor>`
- These incompatible types cannot coexist in the same variable

The non-MPI path (lines 268-278) correctly wraps variants: `FeatureAccessor::Osw(osw_access)` and `FeatureAccessor::Oswpq(oswpq_access)`.

The MPI path must do the same:

```rust
let osw_access = OswAccess::new(&scores_output_file, false)?;
vec![FeatureAccessor::Osw(osw_access)] // Wrap in enum variant
```

Also verify that `FeatureAccessor` derives `Clone` (not visible in the enum definition at line 41), since line 414 calls `.clone()`.
🧹 Nitpick comments (9)
README.md (1)
**178-179: Missing documentation for top-level config fields.**

The Configuration Fields section documents nested config sections (XIC, Features, Filters, Alignment) but omits documentation for the top-level `threads` and `log_level` fields that appear in the OSW example config (lines 178-179). Add these to the Configuration Fields section for completeness.

Consider adding:

```markdown
**Top-level Section:**
- `threads`: Number of threads for parallel processing (default/auto if omitted)
- `log_level`: Logging verbosity ("trace", "debug", "info", "warn", "error")
```

Also applies to: 215-215
crates/arycal-cli/src/main.rs (1)
**135-187: Optional parameters flow and template error handling look correct**

Making `parameters` optional and explicitly handling the "no config provided" case (generating `arycal_config_template.json`, printing guidance, then exiting with code 1) cleanly avoids panics in `Input::from_arguments`. Just be aware that `Input::from_arguments` still calls `.expect("required parameters")`, so if it's ever used outside this guarded path it will need more graceful handling of missing configs.

crates/arycal-cli/src/lib.rs (2)
**163-203: Feature type dispatch in `Runner::new` looks correct, but still assumes a feature file exists**

The new `feature_file_type` switch correctly routes to `OswAccess::new` vs `OswpqAccess::new` based on `FeaturesFileType`, with an explicit error for `Unknown`. That matches the updated config schema.

One thing to consider: `parameters.features.file_paths[0]` will panic if `features.file_paths` is empty. Today that's only prevented by config discipline; a small explicit check with a clear error (e.g., "no feature file configured") would make this more robust.
**1180-1205: `write_aligned_score_results_to_db` still OSW-specific and currently unused**

`write_aligned_score_results_to_db` still accepts `&[OswAccess]` and isn't called anymore (the call is commented out). That's fine for now, but if you plan to resurrect full-trace score writes you'll likely want to:

- Switch its parameter to `&[FeatureAccessor]`, and
- Add corresponding methods to `FeatureAccessor` so it can write to OSW and OSWPQ backends consistently.

Otherwise, consider removing or clearly marking it as legacy to avoid confusion.

crates/arycal-cli/src/input.rs (2)
crates/arycal-cli/src/input.rs (2)
36-73:from_argumentsflow is reasonable; note hardexpecton parametersThe
from_argumentspipeline (load → infer_types → validate → applyxic_pathsoverride → final consistency checks) is straightforward and integrates well with the new CLI behavior.Just be aware that it still calls:
let path = matches .get_one::<String>("parameters") .expect("required parameters");While
mainnow guards the “no parameters supplied” case, this will panic iffrom_argumentsis ever used elsewhere without that guard. If you want the library API to be more robust on its own, you could return aResulterror instead of panicking when the parameter is missing.
**138-159: Path existence checks are good; note that CLI overrides aren't revalidated**

The `validate` loops that ensure every path in `xic.file_paths` and `features.file_paths` exists are helpful, especially with OSWPQ directories allowed.

One subtle point: you apply the `xic_paths` CLI override after calling `validate` in `from_arguments`, so additional paths supplied on the command line aren't checked for existence here. If you want full parity, you could either re-run `validate()` after applying overrides or explicitly check those extra paths.

crates/arycal-cloudpath/src/oswpq.rs (3)
**568-568: Address TODO: Strip modifications from modified sequence.**

The comment indicates that unmodified sequences should be extracted by stripping modifications, but currently the modified sequence is used as-is. This may cause issues if downstream code expects true unmodified sequences.

Would you like me to help implement a modification-stripping function, or should this be tracked as a separate issue? Common formats use brackets `[]` or parentheses `()` for modifications.
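If it helps, here is a minimal std-only sketch of such a stripping function (name and bracket conventions are assumptions, not taken from the codebase):

```rust
// Hypothetical sketch: strip bracketed modifications such as "PEPT[+80]IDE"
// or "PEPT(Phospho)IDE" down to the unmodified sequence. Tracks nesting
// depth so characters inside brackets/parentheses are skipped.
fn strip_modifications(modified: &str) -> String {
    let mut out = String::with_capacity(modified.len());
    let mut depth = 0usize;
    for c in modified.chars() {
        match c {
            '[' | '(' => depth += 1,
            ']' | ')' => depth = depth.saturating_sub(1),
            _ if depth == 0 => out.push(c),
            _ => {} // inside a modification annotation
        }
    }
    out
}
```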
**747-859: Consider extracting common write logic to reduce duplication.**

The `write_ms2_alignment_detailed` method duplicates the batch insert and append pattern from `write_alignment_features`. Consider extracting a generic `write_to_parquet_with_schema` helper method.

This would reduce maintenance burden and ensure consistent behavior across write methods. However, since this method is marked `#[allow(dead_code)]`, it may be lower priority.
**962-973: Consider expanding test coverage.**

The current test only covers the error case for a nonexistent path. Consider adding tests for:
- Successful creation with a valid test directory
- Backup behavior when output files exist
- Path handling with/without `.oswpq` extension
- Query building and result parsing with mock data
These tests would increase confidence in the implementation, especially for the SQL generation logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (2 hunks)
- crates/arycal-cli/src/input.rs (2 hunks)
- crates/arycal-cli/src/lib.rs (10 hunks)
- crates/arycal-cli/src/main.rs (3 hunks)
- crates/arycal-cloudpath/src/lib.rs (1 hunks)
- crates/arycal-cloudpath/src/oswpq.rs (1 hunks)
- crates/arycal-common/src/config.rs (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
crates/arycal-cli/src/lib.rs (2)
- crates/arycal-cloudpath/src/oswpq.rs (6)
  - `fetch_transition_ids` (497-581)
  - `fetch_feature_data_for_precursor_batch` (684-702)
  - `fetch_full_precursor_feature_data_for_runs` (584-680)
  - `create_feature_ms2_alignment_table` (706-711)
  - `create_feature_transition_alignment_table` (714-717)
  - `new` (116-140)
- crates/arycal-cloudpath/src/osw.rs (10)
  - `fetch_transition_ids` (884-1050)
  - `fetch_feature_data_for_precursor_batch` (1763-1958)
  - `fetch_full_precursor_feature_data_for_runs` (1364-1570)
  - `create_feature_ms2_alignment_table` (2070-2139)
  - `create_feature_transition_alignment_table` (2212-2267)
  - `insert_feature_ms2_alignment_batch` (2142-2208)
  - `insert_feature_transition_alignment_batch` (2270-2323)
  - `new` (171-200), `new` (220-236), `new` (314-337)

crates/arycal-cli/src/main.rs (2)
- crates/arycal-cli/src/input.rs (1)
  - `default` (24-33)
- crates/arycal-common/src/config.rs (8)
  - `default` (13-15), `default` (50-52), `default` (130-137), `default` (180-198), `default` (281-305), `default` (372-427), `default` (494-532), `default` (644-722)

crates/arycal-cloudpath/src/oswpq.rs (1)
- crates/arycal-cli/src/lib.rs (6)
  - `new` (160-240)
  - `fetch_transition_ids` (47-63)
  - `fetch_full_precursor_feature_data_for_runs` (81-96)
  - `fetch_feature_data_for_precursor_batch` (65-79)
  - `create_feature_ms2_alignment_table` (98-109)
  - `create_feature_transition_alignment_table` (111-122)

crates/arycal-cli/src/input.rs (2)
- crates/arycal-cloudpath/src/oswpq.rs (1)
  - `new` (116-140)
- crates/arycal-cli/src/lib.rs (1)
  - `new` (160-240)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (8)
README.md (3)
**24-24: Feature list accurately documents parallelization and multi-format support.**

The additions clearly communicate the new parallelization capability and expanded input format support. The documentation is clear and actionable for users evaluating whether ARYCAL meets their needs.
Also applies to: 30-34
**183-214: OSWPQ example is valid but inconsistently detailed compared to OSW example.**

The OSWPQ configuration correctly shows the new parquet file types and .oswpqd directory path. However, the example is noticeably sparse compared to the OSW example—it omits many fields present in the OSW config (e.g., `use_tic`, `smoothing`, `compute_scores`, `threads`, `log_level`).

This risks confusing users about whether missing fields are required, optional, or have defaults. Either:
- Include the same level of detail in both examples, or
- Add a note explaining that the OSWPQ example shows only essential fields and that all other fields from the OSW example are also supported.
Please verify that all omitted fields have sensible defaults or are truly optional by checking the config schema.
**135-181:** Based on the verification results from the shell script execution, the filters section field names have been confirmed to match the codebase exactly:

- `include_decoys: bool` ✓
- `include_identifying_transitions: Option` ✓
- `max_score_ms2_qvalue: Option` ✓
- `precursor_ids: Option` ✓

All default values in the example also match the `FiltersConfig` implementation. The field names use proper Rust naming conventions with serde renames handled correctly.

However, the shell script output was truncated before showing the complete `AlignmentConfig` struct definition and the top-level `Config` struct fields (`threads`, `log_level`). While the filters section is verified as correct, a complete verification of all fields in the example would require viewing the full struct definitions.

No issues found in the filters section; field naming is consistent with the schema.
crates/arycal-cloudpath/src/lib.rs (1)
**11-18: OSWPQ module export looks good**

Adding `pub mod oswpq;` is consistent with the new `OswpqAccess` usage in `arycal-cli` and is required for those imports to compile. No further issues here.

crates/arycal-common/src/config.rs (1)
**42-77: OSWPQ feature type wiring is consistent**

The new `FeaturesFileType::OSWPQ` variant, deserialization from `"oswpq" | "parquet"`, and `as_str` returning `"oswpq"` are consistent with the CLI and template documentation and should work as expected for OSWPQ feature sources.

crates/arycal-cli/src/lib.rs (2)
**40-149: FeatureAccessor abstraction is a good unification of OSW/OSWPQ backends**

Wrapping `OswAccess` and `OswpqAccess` in `FeatureAccessor` and forwarding the key methods (`fetch_transition_ids`, feature fetch, and alignment-table/insert methods) gives you a clean, uniform surface for the rest of the pipeline. The error mapping into `anyhow::Result` also keeps the call sites tidy.
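For readers unfamiliar with the pattern, here is a stripped-down standalone sketch of this kind of enum dispatch (types and names hypothetical, far simpler than the real `FeatureAccessor`):

```rust
// Hypothetical miniature of the enum-dispatch pattern: each variant wraps a
// backend, and methods match on self to forward to the right one.
enum Accessor {
    Osw(String),   // would hold an OswAccess in the real code
    Oswpq(String), // would hold an OswpqAccess in the real code
}

impl Accessor {
    fn backend_name(&self) -> &'static str {
        match self {
            Accessor::Osw(_) => "osw",
            Accessor::Oswpq(_) => "oswpq",
        }
    }
}
```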
**267-285: Remove immutable variable declaration and add `mut` keyword**

The code declares `separate_output_accessor` immutable, then assigns to it in both conditional branches, which is a Rust compilation error. This must be mutable:

```diff
- let separate_output_accessor: Option<FeatureAccessor>;
+ let mut separate_output_accessor: Option<FeatureAccessor>;
```

With this change, the borrow via `from_ref` is valid because `separate_output_accessor` stays in scope and is not mutated after the borrow.

Likely an incorrect or invalid review comment.
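Consistent with the doubt expressed above: deferred initialization of an immutable `let` compiles fine when every branch assigns exactly once before first use, as this minimal standalone example (names hypothetical) shows:

```rust
// Deferred initialization without `mut` is valid Rust: the compiler only
// requires that each path assigns the binding exactly once before use.
fn pick_backend(use_osw: bool) -> &'static str {
    let accessor: &'static str; // no `mut`, no initializer
    if use_osw {
        accessor = "osw";
    } else {
        accessor = "oswpq";
    }
    accessor
}
```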
crates/arycal-cloudpath/src/oswpq.rs (1)
**682-702: LGTM: Effective parallelization pattern.**

The parallel processing of precursor batches is well-implemented. Each thread creates its own DuckDB connection (via `fetch_full_precursor_feature_data_for_runs`), avoiding concurrency issues.
```rust
fn generate_config_template(path: &str) -> Result<()> {
    // Create a default Input with only core alignment fields
    let template_config = Input {
        xic: XicConfig::default(),
        features: FeaturesConfig::default(),
        filters: FiltersConfig::default(),
        alignment: AlignmentConfig::default(),
        threads: std::thread::available_parallelism().unwrap().get().saturating_sub(1).max(1),
        log_level: "info".to_string(),
    };

    // Serialize to pretty JSON
    let json = serde_json::to_string_pretty(&template_config)?;

    // Create documentation to append
    let docs = r#"
# Configuration Guide:
#
# NOTE: Remove comments (lines starting with #) before using this file as a configuration.
#
# XIC Section:
# - include-precursor: Include precursor chromatograms (true/false)
# - num-isotopes: Number of isotopic peaks to include (typically 3)
# - file-type: "sqMass" for SQLite-based XICs or "parquet" for Parquet-based XICs
# - file-paths: List of paths to your XIC files
#
# Features Section:
# - file-type: "osw" for SQLite features or "oswpq" for PyProphet split parquet format
# - file-paths: Path to your features file (merged.osw or merged_runs.oswpqd directory)
#
# Filters Section:
# - include_decoys: false = only align targets, true = align both targets and decoys
# - include_identifying_transitions: Include non-quantifying transitions (default: false)
# - max_score_ms2_qvalue: Maximum q-value threshold for filtering (1.0 = no filtering)
# - precursor_ids: Optional path to TSV file with specific precursor IDs to process
#
# Alignment Section:
# - batch_size: Number of precursors to process before writing results (10000 recommended)
# - method: Alignment method - "FFT", "DTW", or "FFTDTW" (FFTDTW recommended)
# - reference_type: "star" (align to one reference) or "mst" (minimum spanning tree)
# - reference_run: Specific run to use as reference (null = auto-select)
# - use_tic: Use total ion chromatogram for alignment (always true)
# - smoothing: Savitzky-Golay filter parameters
#   * sgolay_window: Window size (must be odd, typically 11)
#   * sgolay_order: Polynomial order (typically 3)
# - rt_mapping_tolerance: Retention time tolerance in seconds for peak mapping (10.0 recommended)
# - decoy_peak_mapping_method: "shuffle" or "random_region"
# - decoy_window_size: Window size for random_region method
# - compute_scores: Calculate alignment scores (true recommended)
# - scores_output_file: Write scores to separate file (null = write to input file)
# - retain_alignment_path: Keep alignment path data (needed for identifying transitions)
#
# Threads & Logging:
# - threads: Number of threads to use (default = # of CPUs - 1)
# - log_level: Log verbosity ("error", "warn", "info", "debug", "trace")
#
# Example configurations:
#
# For PyProphet parquet format:
# "features": {
#     "file-type": "oswpq",
#     "file-paths": ["/path/to/merged_runs.oswpqd"]
# }
#
# For parquet XICs:
# "xic": {
#     "file-type": "parquet",
#     "file-paths": ["/path/to/file1.parquet", "/path/to/file2.parquet"]
# }
#
# To filter for specific precursors:
# "filters": {
#     "precursor_ids": "/path/to/precursor_list.tsv"
# }
#
# To use a specific run as reference:
# "alignment": {
#     "reference_type": "star",
#     "reference_run": "your_run_name"
# }
"#;

    let mut file = File::create(path)?;
    file.write_all(json.as_bytes())?;
    file.write_all(docs.as_bytes())?;
    Ok(())
}
```
🧩 Analysis chain
Template generation is fine; fix decoy_peak_mapping_method name mismatch
The template generation logic (default Input, pretty-printed JSON, appended docs) is sound.
However, the docs block advertises:
decoy_peak_mapping_method: "shuffle" or "random_region"
while the implementation in crates/arycal-cli/src/lib.rs checks for "shuffle" and "random_regions" (plural). With the current code, users following the docs and setting "random_region" will silently get no decoy peak generation for the “random region” mode.
Consider either:
- Updating the docs here and in `config.rs` to `"random_regions"`, or
- Accepting both spellings in the implementation for robustness.
As a minor clean-up, you could also build template_config via Input::default() to avoid duplicating the default field initialization.
I'll verify the core claim about the decoy_peak_mapping_method spelling mismatch by examining the implementation files.
🏁 Script executed:
```bash
#!/bin/bash
# Search for decoy_peak_mapping_method usage in lib.rs and config.rs
echo "=== Searching for decoy_peak_mapping_method in lib.rs ==="
rg -n "decoy_peak_mapping_method|random_region" crates/arycal-cli/src/lib.rs -C 3
echo -e "\n=== Searching for decoy_peak_mapping_method in config.rs ==="
rg -n "decoy_peak_mapping_method|random_region" crates/arycal-cli/src/ -C 2 --type rust
echo -e "\n=== Checking AlignmentConfig struct definition ==="
ast-grep --pattern 'struct AlignmentConfig {
  $$$
}'
```

Length of output: 6217
I need to locate the AlignmentConfig struct definition to verify the field name:
🏁 Script executed:
```bash
#!/bin/bash
# Find AlignmentConfig struct definition
echo "=== Searching for AlignmentConfig struct definition ==="
rg -n "struct AlignmentConfig" crates/arycal-cli/src/ -A 20 --type rust
echo -e "\n=== Also check for serde rename attributes ==="
rg -n "AlignmentConfig|#\[serde" crates/arycal-cli/src/ -B 2 -A 8 --type rust | head -100
```

Length of output: 3410
The AlignmentConfig is defined in arycal_common::config. Let me search for it there:
🏁 Script executed:
```bash
#!/bin/bash
# Find AlignmentConfig struct in arycal_common
echo "=== Searching for AlignmentConfig struct in arycal_common ==="
find . -name "*.rs" -path "*/arycal_common/*" -exec grep -l "struct AlignmentConfig" {} \;
echo -e "\n=== Get the struct definition and fields ==="
find . -name "*.rs" -path "*/arycal_common/*" -exec grep -l "AlignmentConfig" {} \; | head -5 | xargs -I {} sh -c 'echo "File: {}"; grep -A 30 "struct AlignmentConfig" {}'
```

Length of output: 107
Let me search more broadly for the struct definition:
🏁 Script executed:
#!/bin/bash
# Broader search for AlignmentConfig definition
echo "=== Searching entire codebase for AlignmentConfig struct ==="
rg -n "struct AlignmentConfig" . -A 25 --type rust
echo -e "\n=== Check if it's in a config.rs file ==="
fd "config.rs" . --type fLength of output: 2974
Fix decoy_peak_mapping_method spelling: docs advertise "random_region" but code expects "random_regions"
The implementation in crates/arycal-cli/src/lib.rs (lines 712, 1131) checks for "random_regions" (plural), while the documentation in crates/arycal-cli/src/main.rs (lines 62–63) and the comment in crates/arycal-common/src/config.rs (line 165) advertise "random_region" (singular). Users following the docs will silently receive no decoy peak generation for that mode.
Update crates/arycal-cli/src/main.rs lines 62–63 and crates/arycal-common/src/config.rs line 165 to match the implementation: change "random_region" to "random_regions".
As a minor refactor, also simplify template_config initialization at crates/arycal-cli/src/main.rs line 17 by using Input::default() and only override non-default fields.
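If the "accept both spellings" route is preferred instead, the normalization can be confined to one small helper; a hypothetical sketch:

```rust
// Hypothetical normalization: treat both documented spellings as the same
// decoy peak mapping mode, so neither docs nor configs silently misbehave.
fn is_random_region_method(method: &str) -> bool {
    matches!(method, "random_region" | "random_regions")
}
```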
```rust
let query = format!(
    r#"
    SELECT
        PROTEIN_ID,
        PEPTIDE_ID,
        IPF_PEPTIDE_ID,
        PRECURSOR_ID,
        RUN_ID,
        FILENAME,
        FEATURE_ID,
        EXP_RT,
        LEFT_WIDTH,
        RIGHT_WIDTH,
        FEATURE_MS2_AREA_INTENSITY,
        SCORE_MS2_Q_VALUE,
        SCORE_MS2_PEP,
        PRECURSOR_DECOY
    FROM read_parquet('{}')
    WHERE PRECURSOR_ID = {}
    ORDER BY EXP_RT
    "#,
    precursor_path.display(),
    precursor_id
);
```
Validate or parameterize file paths to prevent SQL injection.
The file path is directly interpolated into the SQL query via `format!`. While paths are derived from `self.base_path`, if an attacker can control directory names (e.g., through symlinks or directory naming), they could potentially inject SQL.
Consider validating that the path doesn't contain SQL metacharacters, or use DuckDB's parameter binding if supported for file paths:
```rust
// Validate path doesn't contain dangerous characters
let path_str = precursor_path.display().to_string();
if path_str.contains('\'') || path_str.contains(';') {
    return Err(OpenSwathParquetError::InvalidFormat(
        "Invalid characters in file path".to_string()
    ));
}
```

Alternatively, canonicalize paths and ensure they're within the expected base directory.
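A minimal standalone sketch of such a guard, with error handling simplified to `Option` and names hypothetical (the real code would return its own error type and also check the base directory):

```rust
use std::path::Path;

// Hypothetical guard: reject non-UTF-8 paths and SQL statement separators,
// and escape single quotes so the path is safe inside a SQL string literal.
fn sanitize_parquet_path(p: &Path) -> Option<String> {
    let s = p.to_str()?; // reject non-UTF-8 paths outright
    if s.contains(';') {
        return None;
    }
    // Doubling single quotes is the standard SQL literal escape.
    Some(s.replace('\'', "''"))
}
```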
🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 281-304 the code
interpolates precursor_path directly into a SQL string which can allow SQL
injection via crafted paths; fix by converting precursor_path to a String,
canonicalizing it (resolve symlinks) and verifying it is inside the expected
base directory, reject paths containing dangerous characters like '\'' or ';'
returning OpenSwathParquetError::InvalidFormat on failure, and then use a safe
mechanism to pass the path into the query (prefer parameter binding if supported
by DuckDB; otherwise properly escape single quotes after validation) before
constructing the SQL.
```rust
let conn = self.create_connection()?;

// Build a query that unions all precursor IDs from all runs
let mut union_queries = Vec::new();

for run_name in &runs {
    let precursor_path = self.get_precursors_features_path(run_name);

    if precursor_path.exists() {
        union_queries.push(format!(
            "SELECT DISTINCT PRECURSOR_ID FROM read_parquet('{}')",
            precursor_path.display()
        ));
    }
}

if union_queries.is_empty() {
    return Ok(Vec::new());
}

let query = format!(
    "SELECT DISTINCT PRECURSOR_ID FROM ({}) ORDER BY PRECURSOR_ID",
    union_queries.join(" UNION ALL ")
);
```
Validate file paths before SQL interpolation.
Similar to the previous comment, file paths are interpolated directly into SQL queries. Ensure paths are validated to prevent potential SQL injection through malicious directory names.
Apply the same validation as suggested in the previous comment.
🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 347 to 370, file system
paths are being interpolated directly into the SQL string for read_parquet which
permits malicious directory names to break or inject SQL; validate and sanitize
each precursor_path before using it: canonicalize the path and ensure it resides
under an expected base directory (reject or skip otherwise), convert to a UTF-8
string (or reject non-UTF-8), and escape single quotes by replacing ' with ''
(or otherwise percent-encode) before formatting into the SQL; only push the
formatted query for paths that pass these checks so you never interpolate
unvalidated user-controlled names.
```rust
// Export to parquet (append mode if file exists)
let export_query = if output_path.exists() {
    // Read existing data, union with new data, and overwrite
    format!(
        "COPY (SELECT * FROM read_parquet('{}') UNION ALL SELECT * FROM alignment_temp) TO '{}' (FORMAT PARQUET)",
        output_path.display(),
        output_path.display()
    )
} else {
    format!(
        "COPY alignment_temp TO '{}' (FORMAT PARQUET)",
        output_path.display()
    )
};
```
Optimize append pattern to avoid full file rewrites.
The current append implementation reads the entire existing parquet file, unions it with new data, and rewrites everything. This is O(n) per write, leading to O(n²) complexity for multiple appends.
Consider one of these alternatives:
- Write to separate parquet files and union them at read time
- Keep data in memory/temp tables during the run and write once at the end
- Use DuckDB's native append support if available (check DuckDB version)
The current pattern is acceptable for small datasets but will severely impact performance for large-scale alignments. Given the PR's goal of "fully parallelized reading and alignment workflows," this bottleneck should be addressed.
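A minimal sketch of the first alternative: give each batch its own deterministic part-file name, and let DuckDB union the parts at read time with a glob instead of rewriting one file per append (helper name hypothetical):

```rust
// Hypothetical part-file naming: each batch writes to its own file, e.g.
// "feature_alignment.part-00003.parquet". At read time DuckDB can scan all
// parts in one query: SELECT * FROM read_parquet('feature_alignment.part-*.parquet')
fn part_file_name(base: &str, batch_idx: usize) -> String {
    format!("{}.part-{:05}.parquet", base, batch_idx)
}
```

This turns each append into an O(batch) write instead of an O(total) rewrite; a final compaction pass can merge the parts once at the end if a single file is required.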
```rust
// Build the query
let mut query = format!(
    r#"
    SELECT DISTINCT
        PRECURSOR_ID,
        MODIFIED_SEQUENCE,
        PRECURSOR_CHARGE,
        PRECURSOR_DECOY
    FROM read_parquet('{}')
    WHERE 1=1
    "#,
    precursor_path.display()
);

// Add decoy filter if needed
if filter_decoys {
    query.push_str(" AND (PRECURSOR_DECOY = 0 OR PRECURSOR_DECOY IS NULL)");
}

// Add precursor ID filter if provided
if let Some(ref ids) = precursor_ids {
    let id_list = ids.iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",");
    query.push_str(&format!(" AND PRECURSOR_ID IN ({})", id_list));
}
```
Validate file path to prevent SQL injection.
The file path is directly interpolated into the SQL query. Apply the same validation as suggested in earlier comments.
```rust
let query = format!(
    r#"
    SELECT
        FILENAME,
        RUN_ID,
        PRECURSOR_ID,
        FEATURE_ID,
        EXP_RT,
        LEFT_WIDTH,
        RIGHT_WIDTH,
        FEATURE_MS2_AREA_INTENSITY
    FROM read_parquet('{}')
    WHERE PRECURSOR_ID = ?
    ORDER BY EXP_RT
    "#,
    precursor_path.display()
);
```
Good: Parameterized query for precursor_id; validate file path.
Good use of parameterized queries for precursor_id (line 621). However, the file path is still interpolated via `format!`. Apply path validation as suggested earlier.
🤖 Prompt for AI Agents
In crates/arycal-cloudpath/src/oswpq.rs around lines 600 to 616, the SQL uses
format! to interpolate precursor_path into the read_parquet literal which can be
unsafe; validate and sanitize the path before interpolating. Canonicalize the
Path (std::fs::canonicalize), check it exists, and ensure it is under your
allowed base directory (compare canonicalized prefixes) to prevent directory
traversal; then convert to a safe string (e.g., to_string_lossy) and either
escape any single quotes before calling format! or—preferably—use the query
engine's parameter/binding API for file paths if available; update the code to
perform these checks and only interpolate a validated, canonicalized, and
escaped path (or switch to a bound parameter) when constructing the query.
```rust
// Export to parquet (append mode if file exists)
let export_query = if output_path.exists() {
    format!(
        "COPY (SELECT * FROM read_parquet('{}') UNION ALL SELECT * FROM transition_alignment_temp) TO '{}' (FORMAT PARQUET)",
        output_path.display(),
        output_path.display()
    )
} else {
    format!(
        "COPY transition_alignment_temp TO '{}' (FORMAT PARQUET)",
        output_path.display()
    )
};
```
Same append inefficiency as in previous write methods.
This method also uses the inefficient append-by-union pattern. The performance concern raised for write_alignment_features (lines 464-477) applies here as well.
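One common alternative to append-by-union is to write each batch to its own numbered part file and let the reader glob them, so no batch ever re-reads and rewrites previously written data. A sketch of the naming scheme (illustrative names, not ARYCAL's actual layout):

```rust
use std::path::{Path, PathBuf};

/// Build the path for one batch's output. DuckDB and most parquet readers
/// can later consume the whole set with a glob such as
/// read_parquet('out/part-*.parquet'), making each batch write append-only.
fn part_path(dir: &Path, batch_index: usize) -> PathBuf {
    // Zero-pad so lexicographic order matches batch order.
    dir.join(format!("part-{batch_index:05}.parquet"))
}

fn main() {
    for i in 0..3 {
        println!("{}", part_path(Path::new("out"), i).display());
    }
}
```

This turns the per-batch cost from O(total rows written so far) into O(rows in the batch), at the price of leaving multiple files on disk.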
### Configuration Fields

**XIC Section:**
- `include-precursor`: Include precursor chromatograms (true/false)
- `num-isotopes`: Number of isotopic peaks (typically 3)
- `file-type`: "sqMass" or "parquet"
- `file-paths`: List of XIC file paths

**Features Section:**
- `file-type`: "osw" (SQLite) or "oswpq" (PyProphet split parquet)
- `file-paths`: Path(s) to feature files

**Filters Section:**
- `include_decoys`: false = targets only, true = targets + decoys
- `include_identifying_transitions`: Include non-quantifying transitions
- `max_score_ms2_qvalue`: Q-value threshold (1.0 = no filtering)
- `precursor_ids`: Optional TSV file with specific precursor IDs

**Alignment Section:**
- `batch_size`: Precursors to process per batch (10000 recommended)
- `method`: "FFT", "DTW", or "FFTDTW" (FFTDTW recommended)
- `reference_type`: "star" (single reference) or "mst" (minimum spanning tree)
- `reference_run`: Specific reference run name (null = auto-select)
- `rt_mapping_tolerance`: RT tolerance in seconds (10.0 recommended)
- `decoy_peak_mapping_method`: "shuffle" or "random_region"
- `compute_scores`: Calculate alignment scores (true recommended)
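Pulling the documented fields together, a config for an OSWPQ workflow might look like the following sketch. Field names follow the list above; the exact nesting, section names, and file paths are placeholders and may differ from ARYCAL's actual template:

```json
{
  "xic": {
    "include-precursor": true,
    "num-isotopes": 3,
    "file-type": "parquet",
    "file-paths": ["run1.parquet", "run2.parquet"]
  },
  "features": {
    "file-type": "oswpq",
    "file-paths": ["features.oswpq"]
  },
  "filters": {
    "include_decoys": false,
    "include_identifying_transitions": false,
    "max_score_ms2_qvalue": 1.0,
    "precursor_ids": null
  },
  "alignment": {
    "batch_size": 10000,
    "method": "FFTDTW",
    "reference_type": "star",
    "reference_run": null,
    "rt_mapping_tolerance": 10.0,
    "decoy_peak_mapping_method": "shuffle",
    "compute_scores": true
  }
}
```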
🧩 Analysis chain
Configuration Fields section is well-structured; verify field names and completeness.
This new reference section is a valuable addition that explains each configuration option clearly. The recommendations (e.g., "FFTDTW recommended," "10.0 recommended") are helpful. However, verify:
- Field naming consistency: Confirm the documented field names exactly match the schema (e.g., hyphens in `include-precursor` vs. underscores in `include_decoys`)
- Completeness: Ensure all documented fields actually exist and are configurable
- Missing fields: The Alignment section omits `use_tic`, `smoothing`, `scores_output_file`, and `retain_alignment_path` from the OSW example config; verify whether these should be documented
🏁 Script executed:

```shell
#!/bin/bash
# Verify config field names and schema completeness
# Find all field definitions in config structs
rg -B2 -A2 "pub.*:" --type rust crates/arycal-common/src/config.rs | head -150
# Check for any config fields not mentioned in README
rg "pub\s+\w+:" --type rust crates/arycal-common/src/config.rs | sort | uniq
```

Length of output: 12899
Alignment section incomplete; 5 fields missing, one option undocumented.
Verification confirms the original concern. The AlignmentConfig struct contains 11 public fields, but the README documents only 7. The undocumented fields are:
- `use_tic` (boolean; code note: "Currently not used. We always use TIC.")
- `smoothing` (nested `SmoothingConfig` with `sgolay_window` and `sgolay_order`)
- `decoy_window_size` (optional; used with the "random_region" decoy method)
- `scores_output_file` (optional; outputs scores to a separate OSW file)
- `retain_alignment_path` (boolean; retained when using identifying transitions)

Additionally, the `reference_type` field supports three options per code comments ("star", "mst", "progressive"), but the README lists only two.

Update the Alignment section to include these 5 fields with their descriptions, and note that `reference_type` also supports "progressive".
🤖 Prompt for AI Agents
In README.md around lines 216 to 242, the Alignment section is missing five
fields and one option; update it to document the additional AlignmentConfig
public fields: add `use_tic` (boolean; currently unused, we always use TIC),
`smoothing` (object with `sgolay_window` and `sgolay_order` describing
Savitzky–Golay smoothing parameters), `decoy_window_size` (optional numeric,
used when `decoy_peak_mapping_method` is "random_region"), `scores_output_file`
(optional path to write alignment scores to a separate OSW file), and
`retain_alignment_path` (boolean; retain alignment path when using identifying
transitions); also update the `reference_type` entry to list "star", "mst", and
"progressive" as supported options and add brief recommended values where
appropriate.
This pull request adds support for the PyProphet split parquet format (`oswpqd`) to ARYCAL, enabling fully parallelized reading and alignment workflows. The changes introduce a unified feature accessor abstraction for both OSW (SQLite) and OSWPQ (parquet) formats, update documentation and configuration templates, and simplify the input configuration structure. These improvements make the tool more flexible, easier to use, and better documented for multiple data formats.

Major feature additions:

- Added support for the PyProphet split parquet format (`oswpqd`) in both the CLI and configuration, including a parallelized reader and a unified feature accessor (`FeatureAccessor`) for both OSW and OSWPQ formats. [1] [2]
- Updated input validation and configuration to support both `sqMass` and `parquet` for XIC files, and both `osw` and `oswpqd` for feature files. [1] [2]

Documentation and usability improvements:

- Updated `README.md` with clear documentation for multiple input formats, quick start instructions, configuration field explanations, and example configurations for both OSW and OSWPQ workflows. [1] [2]

Codebase simplification:

- Removed unused GUI-related fields from the input configuration structure to streamline CLI usage.
- Refactored the codebase to use the new `FeatureAccessor` abstraction throughout the alignment and results-writing logic, ensuring compatibility with both OSW and OSWPQ feature types. [1] [2] [3] [4] [5] [6]

These changes collectively improve ARYCAL's flexibility, performance, and ease of use for users working with different chromatogram alignment workflows.
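The unified accessor described above is essentially enum dispatch over the two backends. A minimal sketch of the pattern, with placeholder inner types and a made-up `fetch_transition_ids` return value (the real `OswAccess`/`OswpqAccess` signatures will differ):

```rust
// Placeholder backends standing in for the real SQLite and parquet readers.
struct OswAccess;
struct OswpqAccess;

impl OswAccess {
    fn fetch_transition_ids(&self) -> Vec<i64> { vec![1, 2] }
}
impl OswpqAccess {
    fn fetch_transition_ids(&self) -> Vec<i64> { vec![3] }
}

/// One enum wraps both backends, so callers in the alignment and
/// results-writing code hold a single concrete type.
enum FeatureAccessor {
    Osw(OswAccess),
    Oswpq(OswpqAccess),
}

impl FeatureAccessor {
    /// Delegate to whichever backend this accessor wraps.
    fn fetch_transition_ids(&self) -> Vec<i64> {
        match self {
            FeatureAccessor::Osw(a) => a.fetch_transition_ids(),
            FeatureAccessor::Oswpq(a) => a.fetch_transition_ids(),
        }
    }
}

fn main() {
    let accessor = FeatureAccessor::Oswpq(OswpqAccess);
    println!("{:?}", accessor.fetch_transition_ids());
}
```

Compared with a trait object (`Box<dyn FeatureAccess>`), the enum keeps dispatch static and makes the closed set of supported formats explicit at compile time.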
Summary by CodeRabbit
New Features
Documentation
Improvements