Context
Multiple research findings point to RL-based optimization as the path to significantly better agent memory and experience extraction:
- Complementary RL (arXiv:2603.17621): Co-evolutionary actor + experience-extractor with GRPO/CISPO optimization. The paper's own ablation (Figure 3a) shows that a static extractor without RL yields only marginal gains -- the architecture is adoptable on its own, but the performance claims are RL-specific.
- Memex(RL) (arXiv:2603.04257): RL reward shaping trains memory write/read behaviors under a context budget. Results: success on hardened ALFWorld rises from 24.2% to 85.6%.
- EvoSkill (arXiv:2603.02766): Evolutionary loop auto-discovers reusable skills from failure trajectories, using Pareto frontier selection.
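The context-budget reward shaping described for Memex(RL) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual reward function; the coefficient names and the linear overflow penalty are assumptions.

```python
def shaped_reward(task_success: bool, tokens_written: int, budget: int,
                  success_bonus: float = 1.0, overflow_penalty: float = 0.5) -> float:
    """Reward task success while penalizing memory writes past a context budget.

    Coefficients and the linear penalty shape are illustrative only.
    """
    reward = success_bonus if task_success else 0.0
    if tokens_written > budget:
        # Penalize proportionally to how far the write exceeded the budget,
        # so the policy learns to keep memory writes compact.
        reward -= overflow_penalty * (tokens_written - budget) / budget
    return reward
```

Under a shaping like this, the policy is pushed toward writes that stay within budget while still completing the task.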
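EvoSkill's Pareto frontier selection can be sketched with a simple dominance check over two illustrative objectives (success rate vs. token cost). The `Skill` fields and objectives here are assumptions for exposition, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    success_rate: float  # higher is better
    token_cost: int      # lower is better

def pareto_frontier(skills: list[Skill]) -> list[Skill]:
    """Keep skills that no other skill dominates, i.e. no other skill is
    at least as successful AND at least as cheap, with one strict improvement."""
    frontier = []
    for s in skills:
        dominated = any(
            o.success_rate >= s.success_rate
            and o.token_cost <= s.token_cost
            and (o.success_rate > s.success_rate or o.token_cost < s.token_cost)
            for o in skills
        )
        if not dominated:
            frontier.append(s)
    return frontier
```

For example, a skill that is both less successful and more expensive than another is pruned, while skills trading success for cost both survive on the frontier.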
The non-RL adaptations from these papers are filed in #704. This issue tracks the longer-term question: should SynthOrg invest in RL training infrastructure to unlock the full performance gains?
Evaluation Criteria
References