Skip to content

Fix concurrency races in session manager and stabilize service lifecycles#24

Merged
lxowalle merged 3 commits intosipeed:mainfrom
Esubaalew:main
Feb 13, 2026
Merged

Fix concurrency races in session manager and stabilize service lifecycles#24
lxowalle merged 3 commits intosipeed:mainfrom
Esubaalew:main

Conversation

@Esubaalew
Copy link
Contributor

This PR tries to addres concurrency and sessions issuses inside session manager , cron , and heartbeat services . the first bug like fix was race condtions issue that might happen when two requests could create two different sessions for the same key at the same time.

Key Changes

  1. pkg/session/manager:
    race isssue: changed GetOrCreate to use a single Lock during the check-and-set phase, so avoiding multiple goroutines from creating and overwriting the same session.

Non-blocking I/O: Refactored Save to snapshot session data under a RLock and perform disk I/O outside of the critical section. This prevents the entire manager from hanging during slow disk writes.

Atomic Writes: (CreateTemp -> Sync -> Rename) this can prevent data corruption/partial writes during system crashes.

  1. pkg/cron/service.go & pkg/heartbeat.go:
    Idempotent Stop/Start: Fixed a bug where restarting a service would cause an immediate exit or panic due to a closed stopChan.

Channel Management: Start() now re-initializes the stopChan on every call, and Stop() safely handles nil/closed states to ensure reliability during repeated toggle operations.

Granular Locking: Refactored checkJobs and executeJobByID to ensure thread-safe state updates without holding locks across external callback executions.

most od the issues i thing i fixed might not be urgent fixes thy are thing that could likely casuse bug

Copilot AI review requested due to automatic review settings February 11, 2026 13:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses concurrency issues and improves service lifecycle management across the session manager, cron service, and heartbeat service. The changes aim to fix race conditions, implement non-blocking I/O, add atomic writes, and make service Start/Stop operations idempotent.

Changes:

  • Fixed race condition in session manager's GetOrCreate by using a single Lock instead of RLock/Lock pattern
  • Refactored session Save to use atomic writes (CreateTemp -> Sync -> Rename) and perform I/O outside critical sections
  • Improved cron and heartbeat service lifecycle with idempotent Start/Stop by reinitializing stopChan on each Start and properly cleaning up on Stop

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File Description
pkg/session/manager.go Fixed GetOrCreate race condition, refactored Save to use atomic writes and non-blocking I/O, added edge case handling for TruncateHistory
pkg/heartbeat/service.go Improved service lifecycle by removing stopChan initialization from constructor and managing it in Start/Stop, passed stopChan to runLoop as parameter
pkg/cron/service.go Enhanced lifecycle management with stopChan reinitialization in Start, refactored job execution with granular locking by collecting job IDs first then executing
pkg/agent/loop.go Updated Save method calls to use new signature that takes session key instead of session object

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +33 to +42
if hs.stopChan != nil {
return nil
}

if !hs.enabled {
return fmt.Errorf("heartbeat service is disabled")
}

go hs.runLoop()
hs.stopChan = make(chan struct{})
go hs.runLoop(hs.stopChan)
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After Start() is called and creates a new stopChan on line 41, the old goroutine from a previous runLoop (if it exists) continues running with the old channel and won't be stopped. This creates a goroutine leak. The check on line 33 should verify if a goroutine is actually running, not just if stopChan exists. Consider tracking the running state separately (similar to how cron service uses the 'running' boolean), or ensure Stop() is always called before Start().

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +166
if len(stored.Messages) > 0 {
snapshot.Messages = make([]providers.Message, len(stored.Messages))
copy(snapshot.Messages, stored.Messages)
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shallow copy of Messages on line 166 doesn't properly deep-copy nested structures. Each Message contains a ToolCalls slice, and within ToolCalls there are pointer fields (Function *FunctionCall) and map fields (Arguments map[string]interface{}). These are shared with the original, not copied. If the original session's messages are modified after this copy (which shouldn't happen but could in concurrent scenarios), the snapshot could be affected. Consider implementing a proper deep copy that also copies the ToolCalls slice and its nested structures.

Copilot uses AI. Check for mistakes.

cs.mu.Lock()
defer cs.mu.Unlock()
cs.saveStore()
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The saveStore call on line 149 ignores its error return value. If the save fails, the job state updates (LastRunAtMS, LastStatus, etc.) are lost on the next restart, even though the job executed. This could lead to inconsistent state where jobs appear to have never run or have outdated status information. Consider logging or handling the error from saveStore to ensure persistence failures are visible.

Suggested change
cs.saveStore()
if err := cs.saveStore(); err != nil {
fmt.Fprintf(os.Stderr, "cron: failed to save store: %v\n", err)
}

Copilot uses AI. Check for mistakes.

data, err := json.MarshalIndent(session, "", " ")
sessionPath := filepath.Join(sm.storage, key+".json")
tmpFile, err := os.CreateTemp(sm.storage, key+".*.tmp")
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic write implementation uses CreateTemp with a pattern that includes the key, which could be problematic if the key contains characters invalid in filenames (e.g., path separators like '/' or '\'). The agent code uses keys like "channel:chat_id" which are safe, but if keys from other sources contain problematic characters, CreateTemp could fail. Consider sanitizing the key or using a fixed prefix like ".tmp" to make the pattern more robust.

Suggested change
tmpFile, err := os.CreateTemp(sm.storage, key+".*.tmp")
tmpFile, err := os.CreateTemp(sm.storage, "session-*.tmp")

Copilot uses AI. Check for mistakes.
Comment on lines +127 to +131
if keepLast <= 0 {
session.Messages = []providers.Message{}
session.Updated = time.Now()
return
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new edge case handling for keepLast <= 0 on lines 127-131 clears all messages, which is a reasonable interpretation. However, this behavior is undocumented and may be unexpected. When keepLast is 0, users might expect either an error or no-op behavior rather than clearing the history. Consider documenting this behavior or validating that keepLast is positive before proceeding.

Copilot uses AI. Check for mistakes.
for i := range cs.store.Jobs {
job := &cs.store.Jobs[i]
if job.ID == jobID {
jobCopy := *job
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shallow copy of CronJob on line 160 shares pointer references in the State field. State contains *int64 pointers for NextRunAtMS and LastRunAtMS, which are shared between the original and the copy. While the current code doesn't modify the pointed-to values (it only reassigns the pointers themselves), this creates a subtle coupling. If future code modifies values through these pointers, it could affect both the original and the callback copy unexpectedly. Consider creating new int64 values and copying the dereferenced values for complete isolation.

Suggested change
jobCopy := *job
jobCopy := *job
// Deep-copy pointer fields in State to avoid sharing underlying int64 values.
if job.State.NextRunAtMS != nil {
v := *job.State.NextRunAtMS
jobCopy.State.NextRunAtMS = &v
}
if job.State.LastRunAtMS != nil {
v := *job.State.LastRunAtMS
jobCopy.State.LastRunAtMS = &v
}

Copilot uses AI. Check for mistakes.
return
}

if len(session.Messages) <= keepLast {
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TruncateHistory method doesn't update the session's Updated timestamp when keepLast is greater than the current message count (line 133-135). While the method does nothing in this case, it might be expected to update the timestamp to reflect that the truncation operation was attempted. For consistency with other branches that do update the timestamp, consider whether an update is needed here as well.

Suggested change
if len(session.Messages) <= keepLast {
if len(session.Messages) <= keepLast {
session.Updated = time.Now()

Copilot uses AI. Check for mistakes.
Comment on lines +179 to +188
var job *CronJob
for i := range cs.store.Jobs {
if cs.store.Jobs[i].ID == jobID {
job = &cs.store.Jobs[i]
break
}
}
if job == nil {
return
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a potential issue where a job could be deleted (by removeJobUnsafe on line 203 or externally via RemoveJob) between finding the job on line 158-164 and re-finding it on line 179-188. When this happens, the second lookup returns nil (line 186-188), but no error state is recorded. The job executed successfully, but its state update is silently lost. Consider tracking the deletion separately or logging when a job disappears between execution and state update.

Copilot uses AI. Check for mistakes.
if err != nil {
return err
}

Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the CreateTemp pattern issue, the sessionPath construction on line 177 is vulnerable to path traversal if the key contains ".." or path separators. A malicious or misconfigured key could write outside the storage directory. Consider sanitizing the key by replacing path separators or validating that the resolved path is still within the storage directory using filepath.Clean and checking the prefix.

Suggested change
// Validate key to prevent path traversal via sessionPath and temporary file names.
if key == "" || key != filepath.Base(key) || key == "." || key == ".." {
return os.ErrInvalid
}

Copilot uses AI. Check for mistakes.
@lxowalle
Copy link
Collaborator

Hi @Esubaalew , Thanks for your pr. I encountered some issues, there were too many conflicts when merging this pr. Could you please test it on the latest code and submit again?

@Esubaalew
Copy link
Contributor Author

Hi @Esubaalew , Thanks for your pr. I encountered some issues, there were too many conflicts when merging this pr. Could you please test it on the latest code and submit again?

Hey @lxowalle I just did, thanks.

@lxowalle lxowalle merged commit 7f60392 into sipeed:main Feb 13, 2026
1 of 3 checks passed
@lxowalle
Copy link
Collaborator

Thanks for the pr!

emadomedher pushed a commit to emadomedher/picoclaw that referenced this pull request Feb 17, 2026
Fix concurrency races in session manager and stabilize service lifecycles
StarWindv referenced this pull request in StarWindv/PicoClaw-shou Feb 19, 2026
Fix concurrency races in session manager and stabilize service lifecycles
@Orgmar
Copy link
Contributor

Orgmar commented Feb 19, 2026

@Esubaalew Nice work on the race condition fixes! The atomic write approach with temp/sync/rename is a solid pattern for crash safety, and separating the disk I/O from the lock is a good call for throughput.

By the way, we're building the PicoClaw Dev Group to bring contributors closer together and push the project forward. If you're interested, send an email to support@sipeed.com with the subject [Join PicoClaw Dev Group] + Your GitHub account and we'll get you a Discord invite!

@Esubaalew
Copy link
Contributor Author

Esubaalew commented Feb 19, 2026 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants