fix(remote-device): recover from reconnect wedge when channel stalls …#528
fix(remote-device): recover from reconnect wedge when channel stalls …#528edgarsskore wants to merge 1 commit into
Conversation
…in 'joining' The reconnect health-check treated 'joining' as unconditionally healthy so realtime-js's own rejoin backoff could converge. But on a HALF-OPEN socket (readyState OPEN yet dead, e.g. after sleep/wifi-loss) realtime-js parks the channel in 'joining' indefinitely and never reconnects the socket — so recreateChannel() (the only path that disconnect()s the dead socket) never fired and the device wedged offline silently until restart. This is the "joining forever" case the original #520 review marked a non-issue; it reproduced in prod (device offline 4+ min, logs frozen, zero recreate attempts, last_seen still advancing via the separate HTTP heartbeat). Bound time-in-'joining': keep it healthy only while it stays under JOINING_WEDGE_TIMEOUT_MS (30s = 3 health ticks; realtime-js's join push times out in ~10s, so unbroken 'joining' past this means the state machine stalled). Past the bound, log + emit remote_channel_joining_wedge telemetry + force a recreate. joiningSince resets on 'joined' and on every non-joining state, so normal rejoin oscillation never trips it — a single/brief 'joining' tick still does not recreate. Adds a deterministic repro to test/test-remote-channel-reconnect.js (red→green; advances a simulated clock so a 30s bound is observable without real wall-clock). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
📝 WalkthroughWalkthroughThe realtime channel health check now tracks how long it stays in ChangesRemote channel joining watchdog
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/remote-device/remote-channel.ts`:
- Line 310: The watchdog telemetry call in remote_channel_joining_wedge can
reject because captureRemote is async and is invoked from a non-awaited
health-check path. Update the watchdog logic in remote-channel.ts so this
best-effort telemetry is safely swallowed by attaching error handling to the
captureRemote call (or routing it through an internal async helper used by the
watchdog), and keep the reconnect/retry flow in RemoteChannel unaffected if
telemetry fails.
In `@test/test-remote-channel-reconnect.js`:
- Around line 311-312: Remove the stale “expected to fail” note from the
reconnect test comment so it matches the watchdog behavior in this PR. Update
the comment near the remote channel reconnect test to stop referencing the old
time-bounded-'joining' fix and make sure any wording around
checkConnectionHealth() reflects that the test should now pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6a66d504-ca93-43f1-ae8c-f43897a5877e
📒 Files selected for processing (2)
src/remote-device/remote-channel.tstest/test-remote-channel-reconnect.js
| const stuckMs = now - this.joiningSince; | ||
| if (stuckMs < JOINING_WEDGE_TIMEOUT_MS) return; | ||
| console.debug(`[DEBUG] ⚠️ Channel stuck 'joining' ${Math.round(stuckMs / 1000)}s - forcing recreate — ${this.connState()}`); | ||
| captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt }); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Catch best-effort telemetry failures in the watchdog.
captureRemote is async, and this void health-check path does not await it. During the same network-loss scenario this code handles, telemetry rejection can surface as an unhandled promise rejection.
Proposed fix
- captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt });
+ captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt }).catch(() => { });📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt }); | |
| captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt }).catch(() => { }); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/remote-device/remote-channel.ts` at line 310, The watchdog telemetry call
in remote_channel_joining_wedge can reject because captureRemote is async and is
invoked from a non-awaited health-check path. Update the watchdog logic in
remote-channel.ts so this best-effort telemetry is safely swallowed by attaching
error handling to the captureRemote call (or routing it through an internal
async helper used by the watchdog), and keep the reconnect/retry flow in
RemoteChannel unaffected if telemetry fails.
| // or the device wedges offline until restart. EXPECTED TO FAIL until the | ||
| // time-bounded-'joining' fix lands in checkConnectionHealth(). |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Remove the stale “expected to fail” note.
This test should pass with the watchdog implementation in this PR, so the comment now contradicts the intended final state.
Proposed cleanup
- // or the device wedges offline until restart. EXPECTED TO FAIL until the
- // time-bounded-'joining' fix lands in checkConnectionHealth().
+ // or the device wedges offline until restart.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // or the device wedges offline until restart. EXPECTED TO FAIL until the | |
| // time-bounded-'joining' fix lands in checkConnectionHealth(). | |
| // or the device wedges offline until restart. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/test-remote-channel-reconnect.js` around lines 311 - 312, Remove the
stale “expected to fail” note from the reconnect test comment so it matches the
watchdog behavior in this PR. Update the comment near the remote channel
reconnect test to stop referencing the old time-bounded-'joining' fix and make
sure any wording around checkConnectionHealth() reflects that the test should
now pass.
Problem
Follow-up to #520. On a half-open socket (readyState OPEN but dead — e.g. after
sleep/wifi-loss), realtime-js SOMETIMES parks the channel in
joiningindefinitely and neverreconnects the socket. The health-check treated
joiningas unconditionally healthy,so
recreateChannel()(the only path that drops the dead socket) never fired — thedevice wedged offline forever until restart.
Reproduced in prod: device offline 4+ min, logs frozen, zero recreate attempts, while
last_seenkept advancing via the separate HTTP heartbeat.Fix
Bound time-in-
joining: stay healthy only underJOINING_WEDGE_TIMEOUT_MS(30s = 3health ticks; realtime-js's join push times out in ~10s). Past the bound → log, emit
remote_channel_joining_wedgetelemetry, and force a recreate.joiningSinceresets onjoinedand every non-joining state, so normal rejoin oscillation never trips it.Test
Adds a deterministic repro to
test/test-remote-channel-reconnect.js(red→green; uses asimulated clock). All 4 cases pass, incl. the still-passing single-tick "joining is
healthy" guard.
Summary by CodeRabbit