
fix(remote-device): recover from reconnect wedge when channel stalls …#528

Open
edgarsskore wants to merge 1 commit into
mainfrom
fix/remote-device-joining-wedge


@edgarsskore edgarsskore commented Jun 27, 2026

Collaborator

Problem

Follow-up to #520. On a half-open socket (readyState OPEN but dead, e.g. after
sleep or wifi loss), realtime-js sometimes parks the channel in joining indefinitely and never
reconnects the socket. The health check treated joining as unconditionally healthy,
so recreateChannel() (the only path that drops the dead socket) never fired, and the
device stayed wedged offline until restart.

Reproduced in prod: device offline 4+ min, logs frozen, zero recreate attempts, while
last_seen kept advancing via the separate HTTP heartbeat.

Fix

Bound time-in-joining: stay healthy only under JOINING_WEDGE_TIMEOUT_MS (30s = 3
health ticks; realtime-js's join push times out in ~10s). Past the bound → log, emit
remote_channel_joining_wedge telemetry, and force a recreate. joiningSince resets on
joined and every non-joining state, so normal rejoin oscillation never trips it.
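
The bounded-joining rule described above can be sketched as a small state tracker. This is a minimal illustration, not the code in remote-channel.ts: only JOINING_WEDGE_TIMEOUT_MS and joiningSince are names taken from this PR, while the JoiningWatchdog class, its check() method, the ChannelState union, and the choice to clear the timer after a forced recreate are assumptions made for the example.

```typescript
// Hypothetical sketch of the time-bounded 'joining' check.
// JOINING_WEDGE_TIMEOUT_MS and joiningSince come from the PR text;
// everything else here is an illustrative assumption.
const JOINING_WEDGE_TIMEOUT_MS = 30_000; // 30s = 3 health ticks of 10s

// Assumed set of channel states for the sketch.
type ChannelState = "joined" | "joining" | "closed" | "errored";

class JoiningWatchdog {
  private joiningSince: number | null = null;

  /** Returns true when the caller should force a channel recreate. */
  check(state: ChannelState, now: number): boolean {
    if (state === "joined") {
      this.joiningSince = null; // healthy: clear the timer
      return false;
    }
    if (state !== "joining") {
      this.joiningSince = null; // every non-joining state resets the timer
      return true; // assumed: other non-joined states are unhealthy
    }
    // First tick observed in 'joining' starts the timer.
    if (this.joiningSince === null) this.joiningSince = now;
    const stuckMs = now - this.joiningSince;
    if (stuckMs < JOINING_WEDGE_TIMEOUT_MS) return false; // still within the bound
    this.joiningSince = null; // assumption: restart the timer after a forced recreate
    return true;
  }
}
```

With 10-second ticks, a channel first seen in joining at t=0 is reported as wedged at t=30s, while a joining, joined, joining oscillation restarts the timer each time and never reaches the bound.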

Test

Adds a deterministic repro to test/test-remote-channel-reconnect.js (red→green; uses a
simulated clock). All 4 cases pass, incl. the still-passing single-tick "joining is
healthy" guard.
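
As a rough illustration of why a simulated clock makes the 30s bound testable without waiting, the sketch below advances a fake clock in 10-second ticks and counts how many ticks pass before the bound is reached. FakeClock, ticksUntilRecreate, TIMEOUT_MS, and TICK_MS are hypothetical names for this example and are not taken from test/test-remote-channel-reconnect.js.

```typescript
// Hypothetical fake clock: time only moves when the test advances it.
class FakeClock {
  private t = 0;
  now = (): number => this.t;
  advance(ms: number): void {
    this.t += ms;
  }
}

const TIMEOUT_MS = 30_000; // mirrors JOINING_WEDGE_TIMEOUT_MS from the PR
const TICK_MS = 10_000; // assumed health-check interval (30s = 3 ticks)

// Simulates a channel that stays in 'joining' and returns the tick on which
// the time bound is first reached, or -1 if it is not reached in maxTicks.
function ticksUntilRecreate(clock: FakeClock, maxTicks: number): number {
  const joiningSince = clock.now();
  for (let tick = 1; tick <= maxTicks; tick++) {
    clock.advance(TICK_MS);
    if (clock.now() - joiningSince >= TIMEOUT_MS) return tick;
  }
  return -1;
}

console.log(ticksUntilRecreate(new FakeClock(), 10)); // prints 3
```

Because the clock is injected rather than read from the wall clock, the run is deterministic and finishes immediately.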

Summary by CodeRabbit

  • Bug Fixes
    • Improved recovery when the realtime connection gets stuck while joining, helping the app reconnect automatically instead of remaining stuck.
    • Added protection against prolonged connection interruptions so the channel is refreshed sooner when it stops making progress.
    • Expanded automated coverage for the reconnect flow to verify recovery from this stuck joining state.

…in 'joining'

The reconnect health-check treated 'joining' as unconditionally healthy so
realtime-js's own rejoin backoff could converge. But on a HALF-OPEN socket
(readyState OPEN yet dead, e.g. after sleep/wifi-loss) realtime-js parks the
channel in 'joining' indefinitely and never reconnects the socket — so
recreateChannel() (the only path that disconnect()s the dead socket) never
fired and the device wedged offline silently until restart. This is the
"joining forever" case the original #520 review marked a non-issue; it
reproduced in prod (device offline 4+ min, logs frozen, zero recreate
attempts, last_seen still advancing via the separate HTTP heartbeat).

Bound time-in-'joining': keep it healthy only while it stays under
JOINING_WEDGE_TIMEOUT_MS (30s = 3 health ticks; realtime-js's join push times
out in ~10s, so unbroken 'joining' past this means the state machine stalled).
Past the bound, log + emit remote_channel_joining_wedge telemetry + force a
recreate. joiningSince resets on 'joined' and on every non-joining state, so
normal rejoin oscillation never trips it — a single/brief 'joining' tick still
does not recreate.

Adds a deterministic repro to test/test-remote-channel-reconnect.js (red→green;
advances a simulated clock so a 30s bound is observable without real wall-clock).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

coderabbitai Bot commented Jun 27, 2026

Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The realtime channel health check now tracks how long it stays in joining and forces recreation after a timeout. A reconnect test covers the half-open scenario where the channel remains stuck in joining until recovery.

Changes

Remote channel joining watchdog

| Layer / File(s) | Summary |
| --- | --- |
| Joining-state watchdog and recovery: `src/remote-device/remote-channel.ts` | Tracks time spent in joining, clears the timer on joined, and recreates the channel after JOINING_WEDGE_TIMEOUT_MS or other unhealthy states while emitting a wedge event. |
| Joining-state reconnect regression: `test/test-remote-channel-reconnect.js` | Adds a half-open stuck-joining harness and a regression test that advances mocked time in 10-second ticks until recovery to joined. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

A bunny watched the channel hum,
Through joining fog and socket glum.
Tick-tock went time, then hop! recur—
The wedge gave way; back came the purr.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: fixing a remote-device reconnect wedge when the channel stalls during joining. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/remote-device/remote-channel.ts`:
- Line 310: The watchdog telemetry call in remote_channel_joining_wedge can
reject because captureRemote is async and is invoked from a non-awaited
health-check path. Update the watchdog logic in remote-channel.ts so this
best-effort telemetry is safely swallowed by attaching error handling to the
captureRemote call (or routing it through an internal async helper used by the
watchdog), and keep the reconnect/retry flow in RemoteChannel unaffected if
telemetry fails.

In `@test/test-remote-channel-reconnect.js`:
- Around line 311-312: Remove the stale “expected to fail” note from the
reconnect test comment so it matches the watchdog behavior in this PR. Update
the comment near the remote channel reconnect test to stop referencing the old
time-bounded-'joining' fix and make sure any wording around
checkConnectionHealth() reflects that the test should now pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a66d504-ca93-43f1-ae8c-f43897a5877e

📥 Commits

Reviewing files that changed from the base of the PR and between 9cd9cb4 and e44a113.

📒 Files selected for processing (2)
  • src/remote-device/remote-channel.ts
  • test/test-remote-channel-reconnect.js

const stuckMs = now - this.joiningSince;
if (stuckMs < JOINING_WEDGE_TIMEOUT_MS) return;
console.debug(`[DEBUG] ⚠️ Channel stuck 'joining' ${Math.round(stuckMs / 1000)}s - forcing recreate — ${this.connState()}`);
captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt });


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Catch best-effort telemetry failures in the watchdog.

captureRemote is async, and this void health-check path does not await it. During the same network-loss scenario this code handles, telemetry rejection can surface as an unhandled promise rejection.

Proposed fix
-            captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt });
+            captureRemote('remote_channel_joining_wedge', { stuckMs, attempt: this.reconnectAttempt }).catch(() => { });
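
The alternative mentioned in the review, routing best-effort telemetry through a small internal helper, might look like the sketch below. The bestEffort name and signature are hypothetical and do not exist in this codebase; the helper only assumes that the telemetry call returns a promise, as the review states for captureRemote.

```typescript
// Hypothetical helper: run a fire-and-forget async task so that neither a
// synchronous throw nor a rejected promise escapes to the caller.
function bestEffort(
  task: () => Promise<unknown>,
  onError: (err: unknown) => void = () => {},
): void {
  try {
    // Attach a rejection handler so a failure never becomes an
    // unhandled promise rejection.
    task().catch(onError);
  } catch (err) {
    // The task threw before it returned a promise.
    onError(err);
  }
}
```

A call site would then wrap the telemetry call, for example `bestEffort(() => captureRemote('remote_channel_joining_wedge', { stuckMs, attempt }))`, which leaves the recreate path unaffected if telemetry fails.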

Comment on lines +311 to +312
// or the device wedges offline until restart. EXPECTED TO FAIL until the
// time-bounded-'joining' fix lands in checkConnectionHealth().


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Remove the stale “expected to fail” note.

This test should pass with the watchdog implementation in this PR, so the comment now contradicts the intended final state.

Proposed cleanup
-  // or the device wedges offline until restart. EXPECTED TO FAIL until the
-  // time-bounded-'joining' fix lands in checkConnectionHealth().
+  // or the device wedges offline until restart.
