Integrate trueno-rag for enhanced text/document ML #125

@noahgift

Summary

Add optional integration with trueno-rag (v0.1.3) to enhance the text/ module with RAG pipeline capabilities.

Motivation

aprender's text/ module currently provides:

  • Tokenization (whitespace, word, char)
  • Stop-word filtering
  • Porter stemming

trueno-rag would add document-level processing (chunking, retrieval, reranking, and retrieval metrics) that these token-level tools do not cover.

Proposed Integration

[dependencies]
trueno-rag = { version = "0.1.3", optional = true }

[features]
rag = ["trueno-rag"]
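
As a rough sketch of how the `rag` feature could gate the new code inside aprender: the module name `rag` and the helper `rag_enabled` below are assumptions for illustration, not existing aprender items.

```rust
// Compiled only when the crate is built with `--features rag`.
// Hypothetical module: wrappers around trueno-rag types would live here.
#[cfg(feature = "rag")]
pub mod rag {}

/// Hypothetical helper: reports whether RAG support was compiled in.
/// `cfg!` expands to a boolean constant at compile time.
pub fn rag_enabled() -> bool {
    cfg!(feature = "rag")
}
```

With this layout, users who do not enable the feature would not pull in trueno-rag or its dependencies.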

Features to integrate

| Feature | Description | Use Case |
|---|---|---|
| 6 chunking strategies | Recursive, semantic, fixed, sentence, paragraph, markdown | Document preprocessing for training |
| Hybrid retrieval | Dense + BM25 | Training data retrieval |
| Reranking | Cross-encoder support | Result quality improvement |
| Metrics | Recall, MRR, NDCG | Retrieval evaluation |
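
To make the retrieval metrics in the table concrete, here is a minimal, self-contained sketch of Recall@k, MRR, and NDCG@k using their textbook definitions with binary relevance. The function names and the use of `u32` document IDs are assumptions for illustration; this is not the trueno-rag API, which may define or parameterize these metrics differently.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the relevant documents that appear in the top-k results.
fn recall_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked.iter().take(k).filter(|id| relevant.contains(*id)).count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank for one query: 1 / (1-based rank of the first relevant result),
/// or 0 if no relevant result was returned.
fn reciprocal_rank(ranked: &[u32], relevant: &HashSet<u32>) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// MRR: mean of the reciprocal ranks over a set of queries.
fn mean_reciprocal_rank(queries: &[(Vec<u32>, HashSet<u32>)]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    let total: f64 = queries.iter().map(|(r, rel)| reciprocal_rank(r, rel)).sum();
    total / queries.len() as f64
}

/// NDCG@k with binary relevance: DCG of the ranking divided by the DCG of an
/// ideal ranking that places all relevant documents first.
fn ndcg_at_k(ranked: &[u32], relevant: &HashSet<u32>, k: usize) -> f64 {
    // Gain of a relevant document at 0-based position i is 1 / log2(i + 2).
    let discount = |i: usize| 1.0 / (i as f64 + 2.0).log2();
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, id)| relevant.contains(*id))
        .map(|(i, _)| discount(i))
        .sum();
    let idcg: f64 = (0..relevant.len().min(k)).map(discount).sum();
    if idcg == 0.0 { 0.0 } else { dcg / idcg }
}
```

Each function takes a ranked list of retrieved IDs and the set of IDs judged relevant for that query, so the same inputs can be scored by all three metrics.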

Example API

use aprender::text::DocumentChunker;

// Proposed API: recursive chunking with chunk_size = 512 and overlap = 64.
// Rust has no named arguments, so the values are passed positionally.
let chunker = DocumentChunker::recursive(512, 64);
let chunks = chunker.chunk(&document);

// Use chunks for training data preparation
for chunk in chunks {
    let features = extract_features(&chunk);
    model.train(&features);
}
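
To illustrate what the chunk size and overlap parameters mean, here is a minimal sketch of the simplest strategy, fixed-size chunking with overlap, measured in characters. The function `chunk_with_overlap` is a hypothetical helper, not part of aprender or trueno-rag, and the proposed recursive strategy would additionally try to split on natural boundaries.

```rust
/// Split `text` into chunks of at most `chunk_size` characters, where each
/// chunk repeats the last `overlap` characters of the previous one.
/// Hypothetical helper for illustration only.
fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Work on chars rather than bytes so multi-byte UTF-8 text is never split mid-character.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        let chunk: String = chars[start..end].iter().collect();
        chunks.push(chunk);
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

For example, `chunk_with_overlap("abcdefghij", 4, 1)` yields `["abcd", "defg", "ghij"]`: each chunk starts with the last character of the one before it, so context at chunk boundaries is not lost.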

Potential Use Cases

  1. Training data preparation - Chunk large documents for sequence models
  2. Semantic code search - Enhance CITL module with retrieval
  3. Model documentation search - Search model zoo documentation

Priority

MEDIUM - Enhances text processing capabilities for document-based ML.
