Runbook — W4 /price cutover

Operator playbook for executing the v0.3.0 W4 breaking-change cutover on tvps (the Vultr Japan VPS that runs api.fsagent.cc).

This runbook expands the cutover playbook in docs/planning/superpowers/specs/2026-05-11-w4-price-route-cutover.md §7–§8 into concrete commands, expected output, and decision branches.

Audience: the human running the cutover (currently: project owner). Read end-to-end before starting. Total budget: 60 minutes including verification, plus a 30-minute rollback window if needed.


0. Prerequisites — must be true before T+00:00

If any item is NO, do not start. Resolve first.

  • [ ] W4 implementation PR(s) merged to main; commit pinned in ~/twilight/source on tvps
  • [ ] All W4 dependency PRs merged (see spec §10): #38, #39, #41, #42, #43, #44, #46, #47, #48 and T2.7 lifespan injection
  • [ ] core/tests/integration/test_service_price_route.py green in CI on the cutover commit
  • [ ] Tushare token still valid (TUSHARE_TOKEN in tvps ~/twilight/.env)
  • [ ] Hermes profile install script scripts/install-skill.sh is on the latest revision in the user's local repo
  • [ ] No active client traffic that would be disrupted by a 60-second /price blip (agent idle, no scheduled batch running)

Two terminals open:

sh
# Terminal A (local repo)
cd ~/path/to/twilight-drive
git fetch origin main
git checkout main && git pull

# Terminal B (tvps)
ssh tvps
cd ~/twilight/source

1. Cutover timeline

T+00:00 — Pre-flight snapshot (5 min)

In Terminal B (tvps):

sh
# Verify current state
docker compose -f deploy/compose.yml ps
# Expected: twilight-backend running (Up X minutes), restart count 0

# Snapshot the DuckDB warehouse for rollback
cp ~/twilight/data/warehouse.duckdb ~/twilight/data/warehouse.duckdb.pre-w4
ls -lh ~/twilight/data/warehouse.duckdb*
# Expected: both files, same size

# Record current image digest for fast rollback
docker compose -f deploy/compose.yml images | tee /tmp/w4-pre-images.txt
# Expected: prints the IMAGE column with current digest; SAVE the digest
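Equal file sizes do not prove the snapshot is byte-identical to the live warehouse. As a minimal sketch, assuming Python 3 is available on tvps (the helper names `sha256_of` and `same_content` are hypothetical, not part of the repo), the two files can be compared by SHA-256 digest:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(a: Path, b: Path) -> bool:
    """True when both files exist and hash to the same digest."""
    return a.is_file() and b.is_file() and sha256_of(a) == sha256_of(b)


# Paths taken from the snapshot step above.
data_dir = Path.home() / "twilight" / "data"
live = data_dir / "warehouse.duckdb"
snapshot = data_dir / "warehouse.duckdb.pre-w4"
print("snapshot matches live file:", same_content(live, snapshot))
```

A mismatch can simply mean the backend wrote to the warehouse after the copy, so treat it as a prompt to re-check rather than proof of a bad snapshot.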

Decision: if docker compose ps shows anything other than healthy + running → abort, investigate first. The cutover assumes the current service is in good shape.
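The abort rule can be made mechanical. The sketch below assumes the output of `docker compose -f deploy/compose.yml ps --format json` is passed in as text, and that each entry carries `State` and `Health` fields. Confirm those field names against the Compose version installed on tvps. Depending on the version, the output is either one JSON array or one JSON object per line, so the parser accepts both. The function names are hypothetical.

```python
import json


def parse_ps_json(text: str) -> list[dict]:
    """Parse `docker compose ps --format json` output.

    Accepts either a single JSON array or one JSON object per line.
    """
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def preflight_ok(rows: list[dict]) -> bool:
    """True only if at least one container is listed and all are running and healthy."""
    return bool(rows) and all(
        row.get("State") == "running" and row.get("Health") == "healthy"
        for row in rows
    )


# Example with a hand-written sample line (not real output from tvps).
sample = '{"Name": "twilight-backend", "State": "running", "Health": "healthy"}'
print("proceed" if preflight_ok(parse_ps_json(sample)) else "abort")
```

An empty listing counts as a failure here, on the reasoning that "nothing running" is also not the healthy state the cutover assumes.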

T+00:05 — Tag and push (server deploy) (10 min)

In Terminal A (local):

sh
git tag v0.3.0-w4-cutover
git push origin v0.3.0-w4-cutover
# CI will build + publish ghcr.io/lacatfly/twilight-drive:v0.3.0-w4-cutover

Wait for the build to publish. Watch with:

sh
gh run watch --exit-status
# Or: open https://github.com/LaCatFly/twilight-drive/actions

When the image is published, in Terminal B:

sh
# Set the version and pull
export TWILIGHT_VERSION=v0.3.0-w4-cutover
docker compose -f deploy/compose.yml pull
docker compose -f deploy/compose.yml up -d

# Wait for healthcheck
sleep 5
docker compose -f deploy/compose.yml ps
# Expected: twilight-backend (healthy), restart count 0
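A fixed `sleep 5` may be too short if the new image takes longer to pass its healthcheck. One alternative is to poll with a deadline. The sketch below is illustrative only: `wait_until` and its parameters are hypothetical names, and `check` stands in for whatever health probe the operator prefers (for example, a wrapper around `docker compose ps`).

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout. `sleep` and `clock`
    are injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A timeout here is a reason to inspect the container logs before continuing, not to extend the wait indefinitely.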

T+00:15 — Server smoke test (10 min)

Still in Terminal B:

sh
# Pull a token for testing (env or vault, do NOT hardcode)
TOK="$(cat ~/twilight/.tokens/operator-test)"

# Test the new shape on a known liquid stock
curl -s -H "Authorization: Bearer $TOK" \
  "https://api.fsagent.cc/price?code=600519.SH" | jq .

Expected (new shape, all fields):

jsonc
{
  "ts_code": "600519.SH",
  "trade_date": "2026-05-09",
  "close_raw": 1700.5,           // number, not null
  "close_qfq": 1700.5,           // number when adj_factor present; null OK if Tushare returned no factor
  "close_hfq": 1701.2,
  "open_raw": 1695.0,
  "high_raw": 1710.0,
  "low_raw": 1690.0,
  "volume": 1234567,             // lots (手, 100 shares each)
  "amount": 2100000.0,           // thousand yuan (千元)
  "adj_factor": 1.0,
  "adj_factor_latest": 1.0,
  "as_of": "2026-05-09T07:30:00+00:00",
  "freshness_seconds": 42,
  "source": "tushare-daily",     // or akshare-spot during OPEN
  "source_chain": ["tushare-daily"],
  "market_state": "closed_final",
  "units": { "close_raw": "yuan", "volume": "hand", "amount": "kyuan", ... },
  "cite": { "kind": "tool", "tool_call_id": "tc_...", ... }
}

Red flags (rollback immediately if any):

  • close_raw is null or absent: Composer didn't return a Quote
  • Response is the old envelope (value / metric keys): pull didn't take
  • HTTP 5xx: Composer or AuditLog is failing; check docker compose -f deploy/compose.yml logs backend --tail=200
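The three red flags can be checked programmatically against the decoded response. This is a minimal sketch, assuming the body has already been parsed into a dict and the HTTP status code is known. The function name `price_red_flags` is hypothetical.

```python
def price_red_flags(payload: dict, status: int = 200) -> list[str]:
    """Return the red flags from the checklist that apply to a /price response."""
    flags = []
    if status >= 500:
        flags.append(f"HTTP 5xx (got {status})")
    if payload.get("close_raw") is None:
        flags.append("close_raw is null or absent")
    if "value" in payload or "metric" in payload:
        flags.append("old envelope keys present (value / metric)")
    return flags


# Example: a new-shape body with a populated close_raw raises no flags.
print(price_red_flags({"ts_code": "600519.SH", "close_raw": 1700.5}))
```

An empty list means none of the listed rollback conditions applies. It does not check the remaining fields of the envelope.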

T+00:25 — Hermes profile reinstall (10 min)

In Terminal A (local):

sh
bash scripts/install-skill.sh
# Expected: copies skill/stock-research/ files into Hermes config dir,
# prints "skill installed" or similar

In the user's main Hermes shell (whatever launched Hermes):

sh
# Restart Hermes so it reloads the new skill
hermes restart
# Or whatever the operator's actual restart command is

Verify that the profile loaded by checking the Hermes startup log for the line that registers the stock-research skill. If unsure, run a trivial agent prompt and confirm the skill appears in its tool list.

T+00:35 — Live agent smoke test (15 min)

Use the Hermes agent end-to-end:

  1. Latest-quote path. Ask the agent: "What is the current price of 贵州茅台 (600519)?"

    • Expected: Agent emits a price with close_qfq or close_raw, mentions freshness_seconds < 120, and the market_state matches reality (OPEN / CLOSED_FINAL).
    • Red flag: agent says "price unavailable" or quotes a stale value (freshness_seconds > 86400 outside CLOSED_FINAL).
  2. Historical path. Ask: "What was the close of 600519 on 2024-01-02?"

    • Expected: Agent returns a number with trade_date: "2024-01-02", close_qfq populated (factors are static for historical dates).
    • Red flag: close_qfq is null but close_raw is present — adj_factor lookup is broken.
  3. Cache freshness sanity. Same query twice, 90 seconds apart, during OPEN market state:

    • Expected: Second call's freshness_seconds is below 60 rather than roughly 90 above the first (cache TTL is 60s during OPEN, so the entry has expired by the second call and a new fetch resets it).
    • Red flag: freshness_seconds jumps backward in time, or stays at 0 across calls (cache not respecting TTL).
  4. Unknown code error path. Ask: "What is the price of 999999.SH?"

    • Expected: Agent reports the code is unknown or unavailable. HTTP 502 (not 500) in docker compose logs backend.
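The freshness thresholds in check 1 can be expressed as a small classifier. This is a sketch under assumptions: the function name and the "review" verdict for values between the two thresholds are not from the spec, and market_state is compared case-insensitively because the runbook shows both `CLOSED_FINAL` and `closed_final`.

```python
DAY_SECONDS = 86_400  # red-flag threshold from check 1
FRESH_SECONDS = 120   # expected upper bound from check 1


def latest_quote_verdict(freshness_seconds: float, market_state: str) -> str:
    """Classify a latest-quote answer using the thresholds from check 1."""
    closed_final = market_state.lower() == "closed_final"
    if freshness_seconds > DAY_SECONDS and not closed_final:
        return "red flag: stale"
    if freshness_seconds < FRESH_SECONDS:
        return "ok"
    # Neither clearly fresh nor a listed red flag: needs operator judgment.
    return "review"


print(latest_quote_verdict(42, "open"))  # "ok"
```

The "review" bucket is deliberately left to the operator, since the runbook does not define how to treat values between 120 and 86400 seconds.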

T+00:50 — Decision point

  • All four agent checks green + curl smoke green: continue to T+00:55.
  • Any red flag and no <5-min forward fix: execute rollback (§2).

T+00:55 — Status doc + close out (5 min)

In Terminal A:

sh
# Update status index
$EDITOR docs/status/index.md
# Add: "v0.3.0 W4 /price cutover completed YYYY-MM-DD; link to spec"

git add docs/status/index.md
git commit -m "docs(status): W4 /price cutover complete"
git push

Move the AuditLog forward-baseline: the cutover anchor is now live. Old DailyCache paths are dead code (deleted as part of the W4 PR per spec §5.2).


2. Rollback procedure

Trigger conditions (any of):

  • New /price returns 5xx on >5% of agent traffic
  • close_raw consistently null for known-liquid codes
  • Hermes profile reports value parse error from any prompt
  • AuditLog shows >10 critical diffs in 10 minutes (cross-source disagreement spike — only relevant if W7 TDX auditor is also live; ignore in v0.3.0)
  • Operator judgment: anything that doesn't have a forward-fix in <60 minutes
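Two of the triggers are numeric and can be evaluated from simple counters. The sketch below assumes the operator has already gathered the counts. The function and parameter names are hypothetical, and the remaining triggers (null close_raw, critical diffs, operator judgment) still need a human decision.

```python
def rollback_reasons(total_calls: int, errors_5xx: int, value_parse_errors: int) -> list[str]:
    """Return the numeric trigger conditions that currently hold (empty list means none)."""
    reasons = []
    # Trigger: 5xx on more than 5% of traffic. Skipped when there is no traffic.
    if total_calls > 0 and errors_5xx / total_calls > 0.05:
        rate = 100 * errors_5xx / total_calls
        reasons.append(f"5xx rate {rate:.1f}% exceeds 5%")
    # Trigger: any value parse error reported by the Hermes profile.
    if value_parse_errors > 0:
        reasons.append("Hermes profile reported a value parse error")
    return reasons


print(rollback_reasons(total_calls=200, errors_5xx=12, value_parse_errors=0))
```

The comparison is strict, so exactly 5% does not trigger, matching the ">5%" wording of the condition.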

Budget: 30 minutes total.

2.1 Server revert (10 min)

In Terminal B (tvps), reverse the deploy:

sh
# Use the digest captured at T+00:00 from /tmp/w4-pre-images.txt
PREV_DIGEST="ghcr.io/lacatfly/twilight-drive@sha256:..."  # paste from snapshot

# Option A: edit deploy/compose.yml to pin the image to $PREV_DIGEST.
# Option B: point TWILIGHT_VERSION at the previous release tag, e.g. v0.2.0.
export TWILIGHT_VERSION=v0.2.0   # or whatever was running pre-cutover

docker compose -f deploy/compose.yml pull
docker compose -f deploy/compose.yml up -d

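# Verify (re-read the test token if TOK is not set in this shell)
TOK="${TOK:-$(cat ~/twilight/.tokens/operator-test)}"
sleep 5
docker compose -f deploy/compose.yml ps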
curl -s -H "Authorization: Bearer $TOK" \
  "https://api.fsagent.cc/price?code=600519.SH" | jq 'keys'
# Expected: keys include "value", "metric", "code" (old shape)

2.2 Hermes profile revert (10 min)

In Terminal A (local):

sh
git checkout v0.2.0 -- skill/stock-research/
bash scripts/install-skill.sh
# In Hermes shell:
hermes restart

2.3 DuckDB revert (5 min, only if schema regressed)

W4 cutover does not migrate data; the snapshot is a defense-in-depth measure. Only restore if the running service touched the warehouse schema in a way that the v0.2.0 image can no longer read.

In Terminal B:

sh
docker compose -f deploy/compose.yml stop backend
cp ~/twilight/data/warehouse.duckdb.pre-w4 ~/twilight/data/warehouse.duckdb
docker compose -f deploy/compose.yml up -d backend

2.4 Post-rollback verification (5 min)

Same as §1 T+00:15 smoke test, but expecting the old envelope shape. File a follow-up issue describing the rollback trigger; do not retry the cutover the same day.


3. Monitoring queries (post-cutover, first 48h)

The new envelope writes one row per /price call to provider_audit_log via the lifespan-injected AuditLog. Run these from tvps to track health:

sh
# -T disables TTY allocation so the heredoc can be piped on stdin
docker compose -f deploy/compose.yml exec -T backend python - <<'PY'
import duckdb
con = duckdb.connect("/data/warehouse.duckdb", read_only=True)

# Error rate, last hour
print(con.execute("""
    SELECT
      COUNT(*) AS total,
      SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) AS errors,
      ROUND(100.0 * SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*),0), 2) AS err_pct
    FROM provider_audit_log
    WHERE ts > now() - INTERVAL 1 HOUR
""").fetchall())

# Provider chain distribution (was the fallback exercised more than baseline?)
print(con.execute("""
    SELECT served_by, COUNT(*) AS n
    FROM provider_audit_log
    WHERE ts > now() - INTERVAL 1 HOUR
    GROUP BY served_by ORDER BY n DESC
""").fetchall())

# p95 latency
print(con.execute("""
    SELECT quantile_cont(latency_ms, 0.95) AS p95_ms
    FROM provider_audit_log
    WHERE ts > now() - INTERVAL 1 HOUR
""").fetchall())
PY

Healthy baseline (from v0.2.0):

| Metric | Healthy | Investigate |
| --- | --- | --- |
| err_pct (1h) | < 1% | > 3% |
| p95 latency | < 400 ms | > 600 ms |
| served_by="akshare-spot" share during OPEN | 50–90% | falls to 0% (akshare provider broken) |
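The two numeric rows of the baseline can be applied to the query results with a small helper. This is a sketch, not project code: the names are hypothetical, and the "watch" label for values between the two thresholds is an assumption, since the baseline does not say how to treat that band.

```python
# (healthy_below, investigate_above) per metric, copied from the baseline.
THRESHOLDS = {
    "err_pct": (1.0, 3.0),    # percent, last hour
    "p95_ms": (400.0, 600.0)  # milliseconds
}


def classify_metric(name: str, value: float) -> str:
    """Map a metric value to 'healthy', 'investigate', or 'watch' (in between)."""
    healthy_below, investigate_above = THRESHOLDS[name]
    if value < healthy_below:
        return "healthy"
    if value > investigate_above:
        return "investigate"
    return "watch"


for metric, observed in [("err_pct", 0.4), ("p95_ms", 650.0)]:
    print(metric, observed, classify_metric(metric, observed))
```

The akshare-spot share row is left out because it only applies during OPEN and depends on the market state at query time.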

4. Known failure modes — preauthored responses

| Symptom | Likely cause | Operator action |
| --- | --- | --- |
| close_raw null but close_qfq populated | adj_factor math produced a value although the raw close is missing; should not happen | Pull logs, file bug; rollback if recurring |
| freshness_seconds always 0 across calls | Cache not hit (TTL bug or always-CLOSING state) | Check MarketState resolver: select market_state, count(*) from provider_audit_log group by market_state; if all "closing", the trade_cal seed is stale |
| HTTP 502 on liquid codes during OPEN | Composer exhausted chain (akshare + tushare both failing) | Check docker compose -f deploy/compose.yml logs backend --tail=200; if the issue is upstream, rollback won't help: write a status page note and wait |
| HTTP 500 with CanonicalUnitViolation | Provider returned data in wrong units (spec §1 incidents 4-5 class) | Cutover is exposing a real bug; do not roll back to mask it. File issue, fix forward |
| source_chain length 1 always | Composer never falls back; could be by design (chain ends on first success); verify against spec §6.2 | Check spec §6.2 chain entries vs. observed; no action unless a chain entry is provably skipped |
| Hermes agent emits both value and close_raw | Old prompt template still mentions value | Search Hermes prompts for hardcoded value; update SKILL.md back-compat usage |

5. Cross-references

If this runbook drifts from the spec, the spec wins — file a PR to re-sync this doc rather than acting on stale steps.

Internal team documentation