Fix concurrency races in session manager and stabilize service lifecycles by Esubaalew · Pull Request #24 · sipeed/picoclaw

Esubaalew · 2026-02-11T13:56:23Z

This PR tries to addres concurrency and sessions issuses inside session manager , cron , and heartbeat services . the first bug like fix was race condtions issue that might happen when two requests could create two different sessions for the same key at the same time.

Key Changes

pkg/session/manager:
race isssue: changed GetOrCreate to use a single Lock during the check-and-set phase, so avoiding multiple goroutines from creating and overwriting the same session.

Non-blocking I/O: Refactored Save to snapshot session data under a RLock and perform disk I/O outside of the critical section. This prevents the entire manager from hanging during slow disk writes.

Atomic Writes: (CreateTemp -> Sync -> Rename) this can prevent data corruption/partial writes during system crashes.

pkg/cron/service.go & pkg/heartbeat.go:
Idempotent Stop/Start: Fixed a bug where restarting a service would cause an immediate exit or panic due to a closed stopChan.

Channel Management: Start() now re-initializes the stopChan on every call, and Stop() safely handles nil/closed states to ensure reliability during repeated toggle operations.

Granular Locking: Refactored checkJobs and executeJobByID to ensure thread-safe state updates without holding locks across external callback executions.

most od the issues i thing i fixed might not be urgent fixes thy are thing that could likely casuse bug

Copilot

Pull request overview

This PR addresses concurrency issues and improves service lifecycle management across the session manager, cron service, and heartbeat service. The changes aim to fix race conditions, implement non-blocking I/O, add atomic writes, and make service Start/Stop operations idempotent.

Changes:

Fixed race condition in session manager's GetOrCreate by using a single Lock instead of RLock/Lock pattern
Refactored session Save to use atomic writes (CreateTemp -> Sync -> Rename) and perform I/O outside critical sections
Improved cron and heartbeat service lifecycle with idempotent Start/Stop by reinitializing stopChan on each Start and properly cleaning up on Stop

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File	Description
pkg/session/manager.go	Fixed GetOrCreate race condition, refactored Save to use atomic writes and non-blocking I/O, added edge case handling for TruncateHistory
pkg/heartbeat/service.go	Improved service lifecycle by removing stopChan initialization from constructor and managing it in Start/Stop, passed stopChan to runLoop as parameter
pkg/cron/service.go	Enhanced lifecycle management with stopChan reinitialization in Start, refactored job execution with granular locking by collecting job IDs first then executing
pkg/agent/loop.go	Updated Save method calls to use new signature that takes session key instead of session object

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-11T14:04:20Z

pkg/heartbeat/service.go

+	if hs.stopChan != nil {
 		return nil
 	}

 	if !hs.enabled {
 		return fmt.Errorf("heartbeat service is disabled")
 	}

-	go hs.runLoop()
+	hs.stopChan = make(chan struct{})
+	go hs.runLoop(hs.stopChan)


After Start() is called and creates a new stopChan on line 41, the old goroutine from a previous runLoop (if it exists) continues running with the old channel and won't be stopped. This creates a goroutine leak. The check on line 33 should verify if a goroutine is actually running, not just if stopChan exists. Consider tracking the running state separately (similar to how cron service uses the 'running' boolean), or ensure Stop() is always called before Start().

Copilot · 2026-02-11T14:04:21Z

pkg/session/manager.go

+	if len(stored.Messages) > 0 {
+		snapshot.Messages = make([]providers.Message, len(stored.Messages))
+		copy(snapshot.Messages, stored.Messages)


The shallow copy of Messages on line 166 doesn't properly deep-copy nested structures. Each Message contains a ToolCalls slice, and within ToolCalls there are pointer fields (Function *FunctionCall) and map fields (Arguments map[string]interface{}). These are shared with the original, not copied. If the original session's messages are modified after this copy (which shouldn't happen but could in concurrent scenarios), the snapshot could be affected. Consider implementing a proper deep copy that also copies the ToolCalls slice and its nested structures.

Copilot · 2026-02-11T14:04:21Z

pkg/cron/service.go


 	cs.mu.Lock()
 	defer cs.mu.Unlock()
 	cs.saveStore()


The saveStore call on line 149 ignores its error return value. If the save fails, the job state updates (LastRunAtMS, LastStatus, etc.) are lost on the next restart, even though the job executed. This could lead to inconsistent state where jobs appear to have never run or have outdated status information. Consider logging or handling the error from saveStore to ensure persistence failures are visible.

Suggested change

cs.saveStore()

if err := cs.saveStore(); err != nil {

fmt.Fprintf(os.Stderr, "cron: failed to save store: %v\n", err)

}

Copilot · 2026-02-11T14:04:21Z

pkg/session/manager.go


-	data, err := json.MarshalIndent(session, "", "  ")
+	sessionPath := filepath.Join(sm.storage, key+".json")
+	tmpFile, err := os.CreateTemp(sm.storage, key+".*.tmp")


The atomic write implementation uses CreateTemp with a pattern that includes the key, which could be problematic if the key contains characters invalid in filenames (e.g., path separators like '/' or '\'). The agent code uses keys like "channel:chat_id" which are safe, but if keys from other sources contain problematic characters, CreateTemp could fail. Consider sanitizing the key or using a fixed prefix like ".tmp" to make the pattern more robust.

Suggested change

tmpFile, err := os.CreateTemp(sm.storage, key+".*.tmp")

tmpFile, err := os.CreateTemp(sm.storage, "session-*.tmp")

Copilot · 2026-02-11T14:04:21Z

pkg/session/manager.go

+	if keepLast <= 0 {
+		session.Messages = []providers.Message{}
+		session.Updated = time.Now()
+		return
+	}


The new edge case handling for keepLast <= 0 on lines 127-131 clears all messages, which is a reasonable interpretation. However, this behavior is undocumented and may be unexpected. When keepLast is 0, users might expect either an error or no-op behavior rather than clearing the history. Consider documenting this behavior or validating that keepLast is positive before proceeding.

Copilot · 2026-02-11T14:04:22Z

pkg/cron/service.go

+	for i := range cs.store.Jobs {
+		job := &cs.store.Jobs[i]
+		if job.ID == jobID {
+			jobCopy := *job


The shallow copy of CronJob on line 160 shares pointer references in the State field. State contains *int64 pointers for NextRunAtMS and LastRunAtMS, which are shared between the original and the copy. While the current code doesn't modify the pointed-to values (it only reassigns the pointers themselves), this creates a subtle coupling. If future code modifies values through these pointers, it could affect both the original and the callback copy unexpectedly. Consider creating new int64 values and copying the dereferenced values for complete isolation.

Suggested change

jobCopy := *job

jobCopy := *job

// Deep-copy pointer fields in State to avoid sharing underlying int64 values.

if job.State.NextRunAtMS != nil {

v := *job.State.NextRunAtMS

jobCopy.State.NextRunAtMS = &v

}

if job.State.LastRunAtMS != nil {

v := *job.State.LastRunAtMS

jobCopy.State.LastRunAtMS = &v

}

Copilot · 2026-02-11T14:04:22Z

pkg/session/manager.go

+		return
+	}
+
 	if len(session.Messages) <= keepLast {


The TruncateHistory method doesn't update the session's Updated timestamp when keepLast is greater than the current message count (line 133-135). While the method does nothing in this case, it might be expected to update the timestamp to reflect that the truncation operation was attempted. For consistency with other branches that do update the timestamp, consider whether an update is needed here as well.

Suggested change

if len(session.Messages) <= keepLast {

if len(session.Messages) <= keepLast {

session.Updated = time.Now()

Copilot · 2026-02-11T14:04:22Z

pkg/cron/service.go

+	var job *CronJob
+	for i := range cs.store.Jobs {
+		if cs.store.Jobs[i].ID == jobID {
+			job = &cs.store.Jobs[i]
+			break
+		}
+	}
+	if job == nil {
+		return
+	}


There's a potential issue where a job could be deleted (by removeJobUnsafe on line 203 or externally via RemoveJob) between finding the job on line 158-164 and re-finding it on line 179-188. When this happens, the second lookup returns nil (line 186-188), but no error state is recorded. The job executed successfully, but its state update is silently lost. Consider tracking the deletion separately or logging when a job disappears between execution and state update.

Copilot · 2026-02-11T14:04:22Z

pkg/session/manager.go

+	if err != nil {
+		return err
+	}



Similar to the CreateTemp pattern issue, the sessionPath construction on line 177 is vulnerable to path traversal if the key contains ".." or path separators. A malicious or misconfigured key could write outside the storage directory. Consider sanitizing the key by replacing path separators or validating that the resolved path is still within the storage directory using filepath.Clean and checking the prefix.

Suggested change

// Validate key to prevent path traversal via sessionPath and temporary file names.

if key == "" || key != filepath.Base(key) || key == "." || key == ".." {

return os.ErrInvalid

}

lxowalle · 2026-02-12T16:54:40Z

Hi @Esubaalew , Thanks for your pr. I encountered some issues, there were too many conflicts when merging this pr. Could you please test it on the latest code and submit again?

Esubaalew · 2026-02-12T19:56:38Z

Hi @Esubaalew , Thanks for your pr. I encountered some issues, there were too many conflicts when merging this pr. Could you please test it on the latest code and submit again?

Hey @lxowalle I just did, thanks.

…ices

lxowalle · 2026-02-13T14:17:26Z

Thanks for the pr!

Fix concurrency races in session manager and stabilize service lifecycles

Orgmar · 2026-02-19T09:46:51Z

@Esubaalew Nice work on the race condition fixes! The atomic write approach with temp/sync/rename is a solid pattern for crash safety, and separating the disk I/O from the lock is a good call for throughput.

By the way, we're building the PicoClaw Dev Group to bring contributors closer together and push the project forward. If you're interested, send an email to support@sipeed.com with the subject [Join PicoClaw Dev Group] + Your GitHub account and we'll get you a Discord invite!

Esubaalew · 2026-02-19T10:45:27Z

Thanks, I’ve just sent the email.

…

On Thu, Feb 19, 2026 at 12:47 PM Orgmar ***@***.***> wrote: *Orgmar* left a comment (sipeed/picoclaw#24) <#24 (comment)> @Esubaalew <https://github.com/Esubaalew> Nice work on the race condition fixes! The atomic write approach with temp/sync/rename is a solid pattern for crash safety, and separating the disk I/O from the lock is a good call for throughput. By the way, we're building the *PicoClaw Dev Group* to bring contributors closer together and push the project forward. If you're interested, send an email to ***@***.*** with the subject [Join PicoClaw Dev Group] + Your GitHub account and we'll get you a Discord invite! — Reply to this email directly, view it on GitHub <#24 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AZM3ESGVPIFXV4BJL3UCQVL4MWBCDAVCNFSM6AAAAACUYDOZH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTSMRVHE4TQMBRG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Copilot AI review requested due to automatic review settings February 11, 2026 13:56

Copilot started reviewing on behalf of Esubaalew February 11, 2026 13:56 View session

Copilot AI reviewed Feb 11, 2026

View reviewed changes

Esubaalew force-pushed the main branch from 0afa05d to 31ae7c8 Compare February 12, 2026 19:55

fix concurrency and persistence safety in session/cron/heartbeat serv…

7fa341c

…ices

Esubaalew force-pushed the main branch from 31ae7c8 to 7fa341c Compare February 12, 2026 20:11

Esubaalew and others added 2 commits February 13, 2026 09:28

Merge branch 'main' into main

fc40f29

Merge branch 'main' into main

bc27707

lxowalle merged commit 7f60392 into sipeed:main Feb 13, 2026
1 of 3 checks passed

emadomedher pushed a commit to emadomedher/picoclaw that referenced this pull request Feb 17, 2026

Merge pull request sipeed#24 from Esubaalew/main

4a9c729

Fix concurrency races in session manager and stabilize service lifecycles

StarWindv referenced this pull request in StarWindv/PicoClaw-shou Feb 19, 2026

Merge pull request #24 from Esubaalew/main

1eed932

Fix concurrency races in session manager and stabilize service lifecycles

-	cs.saveStore()
+	if err := cs.saveStore(); err != nil {
+		fmt.Fprintf(os.Stderr, "cron: failed to save store: %v\n", err)
+	}

	tmpFile, err := os.CreateTemp(sm.storage, key+".*.tmp")
	tmpFile, err := os.CreateTemp(sm.storage, "session-*.tmp")

-			jobCopy := *job
+			jobCopy := *job
+			// Deep-copy pointer fields in State to avoid sharing underlying int64 values.
+			if job.State.NextRunAtMS != nil {
+				v := *job.State.NextRunAtMS
+				jobCopy.State.NextRunAtMS = &v
+			}
+			if job.State.LastRunAtMS != nil {
+				v := *job.State.LastRunAtMS
+				jobCopy.State.LastRunAtMS = &v
+			}

	if len(session.Messages) <= keepLast {
	if len(session.Messages) <= keepLast {
	session.Updated = time.Now()

+	// Validate key to prevent path traversal via sessionPath and temporary file names.
+	if key == "" || key != filepath.Base(key) || key == "." || key == ".." {
+		return os.ErrInvalid
+	}

Conversation

Esubaalew commented Feb 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

lxowalle commented Feb 12, 2026

Uh oh!

Esubaalew commented Feb 12, 2026

Uh oh!

Uh oh!

lxowalle commented Feb 13, 2026

Uh oh!

Orgmar commented Feb 19, 2026

Uh oh!

Esubaalew commented Feb 19, 2026 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants