Commit Graph

23 Commits

Author SHA1 Message Date
bmendonca3
4985c561df sessions: reclaim orphan self-pid lock files 2026-03-02 19:53:41 +00:00
Vincent Koc
5a2200b280 fix(sessions): harden recycled PID lock recovery follow-up (#31320)
* fix: detect PID recycling in session write lock staleness check

The session lock uses isPidAlive() to determine if a lock holder is
still running. In containers, PID recycling can cause a different
process to inherit the same PID, making the lock appear valid when
the original holder is dead.

Record the process start time (field 22 of /proc/pid/stat) in the
lock file and compare it during staleness checks. If the PID is alive
but its start time differs from the recorded value, the lock is
treated as stale and reclaimed immediately.

Backward compatible: lock files without starttime are handled with
the existing PID-alive + age-based logic. Non-Linux platforms skip
the starttime check entirely (getProcessStartTime returns null).

* shared: harden pid starttime parsing

* sessions: validate lock pid/starttime payloads

* changelog: note recycled PID lock recovery fix

* changelog: credit hiroki and vincent on lock recovery fix

---------

Co-authored-by: HirokiKobayashi-R <hiroki@rhems-japan.co.jp>
2026-03-01 21:42:22 -08:00
Peter Steinberger
1becebe188 fix: harden session lock contention and cleanup 2026-02-22 13:40:55 +00:00
Peter Steinberger
35016a380c fix(sandbox): serialize registry mutations and lock usage 2026-02-18 04:55:40 +01:00
Peter Steinberger
fb6e415d0c fix(agents): align session lock hold budget with run timeouts 2026-02-17 03:10:36 +01:00
Peter Steinberger
7147cd9cc0 refactor: dedupe process-scoped lock maps 2026-02-17 00:45:02 +00:00
Vishal Doshi
e91a5b0216 fix: release stale session locks and add watchdog for hung API calls (#18060)
When a model API call hangs indefinitely (e.g. Anthropic quota exceeded
mid-call), the gateway acquires a session .jsonl.lock but the promise
never resolves, so the try/finally block never reaches release(). Since
the owning PID is the gateway itself, stale detection cannot help —
isPidAlive() always returns true.

This commit adds four layers of defense:

1. **In-process lock watchdog** (session-write-lock.ts)
   - Track acquiredAt timestamp on each held lock
   - 60-second interval timer checks all held locks
   - Auto-releases any lock held longer than maxHoldMs (default 5 min)
   - Catches the hung-API-call case that try/finally cannot

2. **Gateway startup cleanup** (server-startup.ts)
   - On boot, scan all agent session directories for *.jsonl.lock files
   - Remove locks with dead PIDs or older than staleMs (30 min)
   - Log each cleaned lock for diagnostics

3. **openclaw doctor stale lock detection** (doctor-session-locks.ts)
   - New health check scans for .jsonl.lock files
   - Reports PID status and age of each lock found
   - In --fix mode, removes stale locks automatically

4. **Transcript error entry on API failure** (attempt.ts)
   - When promptError is set, write an error marker to the session
     transcript before releasing the lock
   - Preserves conversation history even on model API failures

Closes #18060
2026-02-16 23:59:22 +01:00
Peter Steinberger
5248b759fe refactor(shared): reuse isPidAlive 2026-02-15 19:06:54 +00:00
Peter Steinberger
f4782e1e73 refactor(agents): dedupe session write lock release 2026-02-15 16:30:01 +00:00
Peter Steinberger
fdfc34fa1f perf(test): stabilize e2e harness and reduce flaky gateway coverage 2026-02-13 17:32:14 +00:00
Peter Steinberger
1eccfa8934 perf(test): trim duplicate e2e suites and harden signal hooks 2026-02-13 16:46:43 +00:00
Coy Geek
717129f7f9 fix: silence unused hook token url param (#9436)
* fix: Gateway authentication token exposed in URL query parameters

* fix: silence unused hook token url param

* fix: remove gateway auth tokens from URLs (#9436) (thanks @coygeek)

* test: fix Windows path separators in audit test (#9436)

---------

Co-authored-by: George Pickett <gpickett00@gmail.com>
2026-02-05 18:08:29 -08:00
cpojer
f16e32b73d fix: Do not process.exit(0) in the middle of a test. 2026-02-06 09:57:51 +09:00
Sash Zats
ec0728b357 fix: release session locks on process termination (#1962)
Adds cleanup handlers to release held file locks when the process
terminates via SIGTERM, SIGINT, or normal exit. This prevents orphaned
lock files that would block future sessions.

Fixes #1951
2026-02-05 16:35:34 -08:00
cpojer
5ceff756e1 chore: Enable "curly" rule to avoid single-statement if confusion/errors. 2026-01-31 16:19:20 +09:00
Josh Palmer
c41ea252b0 fix flaky web-fetch tests + lock cleanup
What:
- stub resolvePinnedHostname in web-fetch tests to avoid DNS flake
- close lock file handles via FileHandle.close during cleanup to avoid EBADF

Why:
- make CI deterministic without network/DNS dependence
- prevent double-close errors from GC

Tests:
- pnpm vitest run --config vitest.unit.config.ts src/agents/tools/web-tools.fetch.test.ts src/agents/session-write-lock.test.ts (failed: missing @aws-sdk/client-bedrock)
2026-01-29 11:05:11 +01:00
Gustavo Madeira Santana
66a5b324a1 fix: harden session lock cleanup (#2483) (thanks @janeexai) 2026-01-26 21:16:05 -05:00
Shadow
d8e5dd91ba fix: clean up session locks on exit (#2483) (thanks @janeexai) 2026-01-26 19:48:46 -06:00
Jane
14f8acdecb fix(agents): release session locks on process termination
Adds process exit handlers to release all held session locks on:
- Normal process.exit() calls
- SIGTERM / SIGINT signals

This ensures locks are cleaned up even when the process terminates
unexpectedly, preventing the 'session file locked' error.
2026-01-26 19:46:04 -06:00
Peter Steinberger
fdc50a0feb fix: normalize session lock path 2026-01-23 18:34:33 +00:00
Peter Steinberger
cf8b3ed988 fix: harden memory indexing and embedded session locks 2026-01-18 05:41:45 +00:00
Peter Steinberger
c379191f80 chore: migrate to oxlint and oxfmt
Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>
2026-01-14 15:02:19 +00:00
Peter Steinberger
6668805aca fix(agents): enforce single-writer session files 2026-01-11 02:25:45 +00:00