Runbook — W4 /price cutover
Operator playbook for executing the v0.3.0 W4 breaking-change cutover on tvps (the Vultr Japan VPS that runs api.fsagent.cc).
This runbook expands the cutover playbook in docs/planning/superpowers/specs/2026-05-11-w4-price-route-cutover.md §7–§8 into concrete commands, expected output, and decision branches.
Audience: the human running the cutover (currently: project owner). Read end-to-end before starting. Total budget: 60 minutes including verification, plus a 30-minute rollback window if needed.
0. Prerequisites — must be true before T+00:00
If any item is NO, do not start. Resolve first.
- [ ] W4 implementation PR(s) merged to `main`; commit pinned in `~/twilight/source` on tvps
- [ ] All W4 dependency PRs merged (see spec §10): #38, #39, #41, #42, #43, #44, #46, #47, #48 and T2.7 lifespan injection
- [ ] `core/tests/integration/test_service_price_route.py` green in CI on the cutover commit
- [ ] Tushare token still valid (`TUSHARE_TOKEN` in tvps `~/twilight/.env`)
- [ ] Hermes profile install script `scripts/install-skill.sh` is on the latest revision in the user's local repo
- [ ] No active client traffic that would be disrupted by a 60-second `/price` blip (agent idle, no scheduled batch)
Two terminals open:
sh
# Terminal A (local repo)
cd ~/path/to/twilight-drive
git fetch origin main
git checkout main && git pull
# Terminal B (tvps)
ssh tvps
cd ~/twilight/source
1. Cutover timeline
T+00:00 — Pre-flight snapshot (5 min)
In Terminal B (tvps):
sh
# Verify current state
docker compose -f deploy/compose.yml ps
# Expected: twilight-backend running (Up X minutes), restart count 0
# Snapshot the DuckDB warehouse for rollback
cp ~/twilight/data/warehouse.duckdb ~/twilight/data/warehouse.duckdb.pre-w4
ls -lh ~/twilight/data/warehouse.duckdb*
# Expected: both files, same size
# Record current image digest for fast rollback
docker compose -f deploy/compose.yml images | tee /tmp/w4-pre-images.txt
# Expected: prints the IMAGE column with current digest; SAVE the digest
Decision: if `docker compose ps` shows anything other than healthy + running → abort, investigate first. The cutover assumes the current service is in good shape.
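The "both files, same size" check above can be tightened by comparing content hashes as well. The following is a minimal sketch, not part of the deploy tooling; the helper names `sha256_of` and `snapshot_matches` are hypothetical, and it assumes the warehouse file is not being written while the comparison runs.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_matches(live: Path, snapshot: Path) -> bool:
    """True when the snapshot has the same size and content hash as the live file."""
    if live.stat().st_size != snapshot.stat().st_size:
        return False  # cheap check first; differing sizes can never match
    return sha256_of(live) == sha256_of(snapshot)
```

On tvps this would be called with `~/twilight/data/warehouse.duckdb` and `~/twilight/data/warehouse.duckdb.pre-w4`. A mismatch may only mean the backend wrote to the warehouse between the copy and the check, so treat it as a prompt to re-copy rather than as proof of corruption.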
T+00:05 — Tag and push (server deploy) (10 min)
In Terminal A (local):
sh
git tag v0.3.0-w4-cutover
git push origin v0.3.0-w4-cutover
# CI will build + publish ghcr.io/lacatfly/twilight-drive:v0.3.0-w4-cutover
Wait for the build to publish. Watch with:
sh
gh run watch --exit-status
# Or: open https://github.com/LaCatFly/twilight-drive/actions
When the image is published, in Terminal B:
sh
# Set the version and pull
export TWILIGHT_VERSION=v0.3.0-w4-cutover
docker compose -f deploy/compose.yml pull
docker compose -f deploy/compose.yml up -d
# Wait for healthcheck
sleep 5
docker compose -f deploy/compose.yml ps
# Expected: twilight-backend (healthy), restart count 0
T+00:15 — Server smoke test (5 min)
Still in Terminal B:
sh
# Pull a token for testing (env or vault, do NOT hardcode)
TOK="$(cat ~/twilight/.tokens/operator-test)"
# Test the new shape on a known liquid stock
curl -s -H "Authorization: Bearer $TOK" \
  "https://api.fsagent.cc/price?code=600519.SH" | jq .
Expected (new shape, all fields):
jsonc
{
  "ts_code": "600519.SH",
  "trade_date": "2026-05-09",
  "close_raw": 1700.5, // number, not null
  "close_qfq": 1700.5, // number when adj_factor present; null OK if Tushare returned no factor
  "close_hfq": 1701.2,
  "open_raw": 1695.0,
  "high_raw": 1710.0,
  "low_raw": 1690.0,
  "volume": 1234567, // in lots (手)
  "amount": 2100000.0, // in thousands of yuan (千元)
  "adj_factor": 1.0,
  "adj_factor_latest": 1.0,
  "as_of": "2026-05-09T07:30:00+00:00",
  "freshness_seconds": 42,
  "source": "tushare-daily", // or akshare-spot during OPEN
  "source_chain": ["tushare-daily"],
  "market_state": "closed_final",
  "units": { "close_raw": "yuan", "volume": "hand", "amount": "kyuan", ... },
  "cite": { "kind": "tool", "tool_call_id": "tc_...", ... }
}
Red flags (rollback immediately if any):
- `close_raw` is `null` or absent: Composer didn't return a Quote
- Response is the old envelope (`value`/`metric` keys): pull didn't take
- HTTP 5xx: Composer or AuditLog is failing; check `docker compose logs backend --tail=200`
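The three red flags above can be checked mechanically once the HTTP status and the parsed JSON body are in hand. This is a rough sketch only: the function name `smoke_red_flags` is hypothetical, and it assumes the old envelope is recognizable by the `value` and `metric` keys named in the list.

```python
from typing import Any

# Keys that identify the pre-W4 envelope, per the red-flag list above.
OLD_ENVELOPE_KEYS = {"value", "metric"}


def smoke_red_flags(status_code: int, payload: dict[str, Any]) -> list[str]:
    """Return the red flags raised by one /price response; an empty list means pass."""
    if 500 <= status_code <= 599:
        # The body of a 5xx is not expected to be a Quote, so stop here.
        return ["HTTP 5xx: check docker compose logs backend --tail=200"]
    flags = []
    if OLD_ENVELOPE_KEYS.intersection(payload):
        flags.append("old envelope keys present: pull didn't take")
    if payload.get("close_raw") is None:
        # Covers both a missing key and an explicit JSON null.
        flags.append("close_raw is null or absent")
    return flags
```

A null `close_qfq` is deliberately not flagged, since the expected shape allows it when Tushare returned no factor.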
T+00:25 — Hermes profile reinstall (10 min)
In Terminal A (local):
sh
bash scripts/install-skill.sh
# Expected: copies skill/stock-research/ files into Hermes config dir,
# prints "skill installed" or similar
In the user's main Hermes shell (whatever launched Hermes):
sh
# Restart Hermes so it reloads the new skill
hermes restart
# Or whatever the operator's actual restart command is
Verify the profile loaded by checking the Hermes startup log for the stock-research skill register line. If unsure, run a trivial agent prompt and confirm the skill is listed in its tool list.
T+00:35 — Live agent smoke test (15 min)
Use the Hermes agent end-to-end:
Latest-quote path. Ask the agent: "What is the current price of 贵州茅台 (600519)?"
- Expected: Agent emits a price with `close_qfq` or `close_raw`, mentions `freshness_seconds < 120`, and the `market_state` matches reality (OPEN / CLOSED_FINAL).
- Red flag: agent says "price unavailable" or quotes a stale value (`freshness_seconds > 86400` outside CLOSED_FINAL).
Historical path. Ask: "What was the close of 600519 on 2024-01-02?"
- Expected: Agent returns a number with `trade_date: "2024-01-02"` and `close_qfq` populated (factors are static for historical dates).
- Red flag: `close_qfq` is null but `close_raw` is present, which means the adj_factor lookup is broken.
Cache freshness sanity. Same query twice, 90 seconds apart, during OPEN market state:
- Expected: Second call's `freshness_seconds` reflects a new fetch rather than the first call's value plus 90 (cache TTL is 60s during OPEN; after expiry, a new fetch resets it).
- Red flag: `freshness_seconds` jumps backward in time, or stays at 0 across calls (cache not respecting TTL).
Unknown code error path. Ask: "What is the price of 999999.SH?"
- Expected: Agent reports the code is unknown or unavailable. HTTP 502 (not 500) in `docker compose logs backend`.
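The staleness and cache checks above reduce to two small comparisons on `freshness_seconds`. The sketch below is illustrative only: the function names and the 5-second `tolerance` are assumptions, not values from the spec, and `market_state` is compared case-insensitively because the runbook shows it both as `closed_final` and CLOSED_FINAL.

```python
STALE_AFTER_SECONDS = 86_400  # one day, the threshold named in the latest-quote red flag


def is_stale(freshness_seconds: float, market_state: str) -> bool:
    """Latest-quote red flag: older than a day while the state is not CLOSED_FINAL."""
    return (
        freshness_seconds > STALE_AFTER_SECONDS
        and market_state.lower() != "closed_final"
    )


def cache_was_refreshed(
    first: float, second: float, elapsed: float, tolerance: float = 5.0
) -> bool:
    """True when the second reading is younger than the first reading aged by `elapsed`.

    An entry served from an expired cache would age in step with the wall clock,
    so `second` would be close to `first + elapsed`. The tolerance (an assumed
    value) absorbs timing jitter between the two calls.
    """
    return second < first + elapsed - tolerance
```

For the cache check, `first` and `second` are the two `freshness_seconds` readings and `elapsed` is the wall-clock gap between the calls, 90 in the procedure above.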
T+00:50 — Decision point
- All four agent checks green + curl smoke green: continue to T+00:55.
- Any red flag and no <5-min forward fix: execute rollback (§2).
T+00:55 — Status doc + close out (5 min)
In Terminal A:
sh
# Update status index
$EDITOR docs/status/index.md
# Add: "v0.3.0 W4 /price cutover completed YYYY-MM-DD; link to spec"
git add docs/status/index.md
git commit -m "docs(status): W4 /price cutover complete"
git push
Move the AuditLog forward-baseline: the cutover anchor is now live. Old DailyCache paths are dead code (deleted as part of the W4 PR per spec §5.2).
2. Rollback procedure
Trigger conditions (any of):
- New `/price` returns 5xx on >5% of agent traffic
- `close_raw` consistently null for known-liquid codes
- Hermes profile reports `value` parse error from any prompt
- AuditLog shows >10 critical diffs in 10 minutes (cross-source disagreement spike; only relevant if W7 TDX auditor is also live; ignore in v0.3.0)
- Operator judgment: anything that doesn't have a forward-fix in <60 minutes
Budget: 30 minutes total.
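The two numeric triggers above (5xx share and critical-diff count) can be written down as a small predicate. This is a sketch under stated assumptions: `CutoverSignals` and `rollback_reasons` are hypothetical names, the counts are gathered by the operator from logs, and the remaining triggers still need human judgment.

```python
from dataclasses import dataclass


@dataclass
class CutoverSignals:
    requests: int                    # agent /price requests observed
    server_errors: int               # of those, how many returned 5xx
    critical_diffs_10min: int        # AuditLog critical diffs in the last 10 minutes
    tdx_auditor_live: bool = False   # W7 TDX auditor; not live in v0.3.0


def rollback_reasons(s: CutoverSignals) -> list[str]:
    """Return the numeric rollback triggers that fired; an empty list means none."""
    reasons = []
    if s.requests and s.server_errors / s.requests > 0.05:
        reasons.append("5xx on more than 5% of agent traffic")
    # The diff trigger only counts when the W7 auditor is live.
    if s.tdx_auditor_live and s.critical_diffs_10min > 10:
        reasons.append("more than 10 critical diffs in 10 minutes")
    return reasons
```

Both comparisons are strict, matching the ">5%" and ">10" wording of the trigger list, so exactly 5% of requests failing does not fire the first trigger.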
2.1 Server revert (10 min)
In Terminal B (tvps), reverse the deploy:
sh
# Use the digest captured at T+00:00 from /tmp/w4-pre-images.txt
PREV_DIGEST="ghcr.io/lacatfly/twilight-drive@sha256:..." # paste from snapshot
# Edit compose to pin to that digest (or use the version tag of the
# previous release, e.g. v0.2.0)
export TWILIGHT_VERSION=v0.2.0 # or whatever was running pre-cutover
docker compose -f deploy/compose.yml pull
docker compose -f deploy/compose.yml up -d
# Verify
sleep 5
docker compose -f deploy/compose.yml ps
curl -s -H "Authorization: Bearer $TOK" \
  "https://api.fsagent.cc/price?code=600519.SH" | jq 'keys'
# Expected: keys include "value", "metric", "code" (old shape)
2.2 Hermes profile revert (10 min)
In Terminal A (local):
sh
git checkout v0.2.0 -- skill/stock-research/
bash scripts/install-skill.sh
# In Hermes shell:
hermes restart
2.3 DuckDB revert (5 min, only if schema regressed)
W4 cutover does not migrate data; the snapshot is a defense-in-depth measure. Only restore if the running service touched the warehouse schema in a way that the v0.2.0 image can no longer read.
In Terminal B:
sh
docker compose -f deploy/compose.yml stop backend
cp ~/twilight/data/warehouse.duckdb.pre-w4 ~/twilight/data/warehouse.duckdb
docker compose -f deploy/compose.yml up -d backend
2.4 Post-rollback verification (5 min)
Same as §1 T+00:15 smoke test, but expecting the old envelope shape. File a follow-up issue describing the rollback trigger; do not retry the cutover the same day.
3. Monitoring queries (post-cutover, first 48h)
The new envelope writes one row per /price call to provider_audit_log via the lifespan-injected AuditLog. Run these from tvps to track health:
sh
docker compose -f deploy/compose.yml exec backend python - <<'PY'
import duckdb
con = duckdb.connect("/data/warehouse.duckdb", read_only=True)
# Error rate, last hour
print(con.execute("""
SELECT
COUNT(*) AS total,
SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) AS errors,
ROUND(100.0 * SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*),0), 2) AS err_pct
FROM provider_audit_log
WHERE ts > now() - INTERVAL 1 HOUR
""").fetchall())
# Provider chain distribution (was the fallback exercised more than baseline?)
print(con.execute("""
SELECT served_by, COUNT(*) AS n
FROM provider_audit_log
WHERE ts > now() - INTERVAL 1 HOUR
GROUP BY served_by ORDER BY n DESC
""").fetchall())
# p95 latency
print(con.execute("""
SELECT quantile_cont(latency_ms, 0.95) AS p95_ms
FROM provider_audit_log
WHERE ts > now() - INTERVAL 1 HOUR
""").fetchall())
PY
Healthy baseline (from v0.2.0):
| Metric | Healthy | Investigate |
|---|---|---|
| err_pct (1h) | < 1% | > 3% |
| p95 latency | < 400 ms | > 600 ms |
| `served_by="akshare-spot"` share during OPEN | 50–90% | falls to 0% (akshare provider broken) |
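The numeric rows of the table can be applied to the query output with a small classifier. This is only a sketch: the names `THRESHOLDS`, `classify` and `assess` are hypothetical, and the "watch" label for values between the two thresholds is an assumption, since the table does not name that band.

```python
# Thresholds copied from the table above: (healthy below, investigate above).
THRESHOLDS = {
    "err_pct": (1.0, 3.0),      # percent of /price calls with an error, last hour
    "p95_ms": (400.0, 600.0),   # p95 latency in milliseconds, last hour
}


def classify(value: float, healthy_below: float, investigate_above: float) -> str:
    """Map one metric value onto the table's bands."""
    if value < healthy_below:
        return "healthy"
    if value > investigate_above:
        return "investigate"
    return "watch"  # assumed label for the band the table leaves undefined


def assess(metrics: dict[str, float]) -> dict[str, str]:
    """Classify each metric by name; unknown metric names raise KeyError."""
    return {name: classify(value, *THRESHOLDS[name]) for name, value in metrics.items()}
```

For example, `assess({"err_pct": 0.4, "p95_ms": 650})` marks the error rate as healthy and the latency as needing investigation.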
4. Known failure modes — preauthored responses
| Symptom | Likely cause | Operator action |
|---|---|---|
| `close_raw` null but `close_qfq` populated | Adjusted close was computed even though the raw close is missing; should not happen | Pull logs, file bug; rollback if recurring |
| `freshness_seconds` always 0 across calls | Cache not hit (TTL bug or always-CLOSING state) | Check MarketState resolver: `select market_state, count(*) from provider_audit_log group by market_state`; if all "closing", the trade_cal seed is stale |
| HTTP 502 on liquid codes during OPEN | Composer exhausted chain (akshare + tushare both failing) | Check `docker compose logs backend --tail=200`; if it is an upstream issue, rollback won't help, so write a status page note and wait |
| HTTP 500 with `CanonicalUnitViolation` | Provider returned data in wrong units (the class of incidents 4-5 in spec §1) | Cutover is exposing a real bug; do not roll back to mask it. File issue, fix forward |
| `source_chain` length 1 always | Composer never falls back; could be by design (chain ends on first success), so verify against spec §6.2 | Check spec §6.2 chain entries vs. observed; no action unless a chain entry is provably skipped |
| Hermes agent emits both `value` and `close_raw` | Old prompt template still mentions `value` | Search Hermes prompts for hardcoded `value`; update SKILL.md back-compat usage |
5. Cross-references
- Spec (design): `docs/planning/superpowers/specs/2026-05-11-w4-price-route-cutover.md`
- Target schema: `02-tushare-integration-design.md` §9.2
- Lifespan prerequisite: `2026-05-10-lifespan-integration-spec.md` §4 (W2.7)
- Composer contract: `02-tushare-integration-design.md` §6
- Field mapping (old → new): spec §4
- Deploy config: `deploy/compose.yml`, `deploy/install.sh`
If this runbook drifts from the spec, the spec wins — file a PR to re-sync this doc rather than acting on stale steps.