Topic
App Stack Auto-Deploy (NestJS backend + React frontend)
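Date: 2026-05-16
Revised: 2026-05-16 (critical review v2: 7 severe defects, Prisma decision changed to B, Phase 1 marked done)
Goal: wire twilight-app-backend (NestJS) and twilight-app-frontend (React) into the existing GitHub Actions → cloudflared SSH → ECS auto-deploy pipeline, reusing the pattern already proven with twilight-data.
Current state (v2 calibration, 2026-05-16)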
| Item | Status |
|---|---|
| twilight-data (FastAPI) | ✅ docker-build.yml builds and pushes to GHCR; ecs-deploy.sh runs compose pull/up for data; health-gated rollback |
| docker-build.yml matrix (3 images) | ✅ Merged: data/backend/frontend build in parallel and push to GHCR + ACR |
| twilight-app-backend (NestJS) | ❌ Image builds pass; the ECS-side compose is not wired into the deploy pipeline |
| twilight-app-frontend (React) | ❌ Same as above |
| backend /health endpoint | ⚠️ Covers db + newapi; missing searxng + wechatPay + iLink env presence |
| mcp-tushare /healthz | ✅ Exists at server.py:42; pings DuckDB |
| scripts/admin/spawn-profile.sh | ❌ Script lives on the ECS host; only a git pull updates it |
| profile/template-stock-research-pro/** | ❌ Files live on the ECS host; need a git pull |
| Prisma schema | ❌ No migration deploy step; the migrations/ directory does not exist |
| ProvisionWorker SIGTERM graceful | ❌ In-flight task is killed by SIGTERM → stuck in PROVISIONING forever |
Target pipeline
```
push to main
  ↓
docker-build.yml (extended)
  ├─ build + push twilight-drive (existing, FastAPI)
  ├─ build + push twilight-drive-backend (NEW)
  └─ build + push twilight-drive-frontend (NEW)
  ↓
deploy-ecs.yml (existing trigger: workflow_run)
  ↓
ecs-deploy.sh (extended)
  ├─ git pull origin main (NEW, pulls spawn-profile.sh, profile/*)
  ├─ snapshot rollback tags (extended with TWILIGHT_APP_VERSION)
  ├─ docker compose -f twilight/source/deploy/compose.yml pull data
  ├─ docker compose -f twilight/source/deploy/compose.yml up -d data
  ├─ docker compose -f twilight-app/compose.yml pull app-backend app-frontend
  ├─ prisma migrate deploy (option B, see Phase 4)
  ├─ docker compose -f twilight-app/compose.yml up -d app-backend app-frontend
  ├─ health: :8081/healthz (data) AND :4000/health (backend) AND :8082 (frontend)
  └─ rollback ALL if any health check fails
```
⚠️ Severe defects (found in the v2 review; must be fixed before Phase 3 can start)
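Defect 1: rollback does not restore the schema (a logic bomb)
If prisma migrate deploy succeeds but the backend then fails to boot, rollback only reverts the image tag. The new schema stays in place, the old image runs against it, and the service stays down permanently.
Fix: run pg_dump automatically before deploy; the rollback path must include pg_restore. Script:
```bash
pg_dump "$DATABASE_URL" -F c -f "/var/backups/twilight/pre-${NEW_TAG}.dump"
```
On rollback:
```bash
# Restore the dump that was taken just before deploying NEW_TAG
pg_restore -d "$DATABASE_URL" -F c --clean "/var/backups/twilight/pre-${NEW_TAG}.dump"
```
Defect 2: the running script is overwritten by git reset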
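After the flock is taken, git reset --hard origin/main overwrites the ecs-deploy.sh that the symlink points to. bash reads a script incrementally while executing it, so replacing the file mid-run leads to undefined behavior.
Fix: the script copies itself at the top and then execs the copy:
```bash
if [[ "$0" != /tmp/twilight-deploy.*.sh ]]; then
  tmp=$(mktemp /tmp/twilight-deploy.XXXXXX.sh)
  cp "$0" "$tmp"; chmod +x "$tmp"
  exec "$tmp" "$@"
fi
```
Defect 3: compose project name drift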
docker run --rm --network twilight-app_default hard-codes the project name. If the deploy user's cwd differs, docker compose creates a new project and a new network, and prisma migrate deploy can no longer reach the db.
Fix: pass -p twilight-app explicitly to every compose command.
Defect 4: frontend health false positive
nginx try_files returns 200 for any path, even when the SPA is broken. curl :8082/ therefore cannot tell whether the frontend is actually usable.
Fix: add to frontend/nginx.conf:
```nginx
location /healthz {
    access_log off;
    return 200 '{"ok":true}';
    add_header Content-Type application/json;
}
```
The deploy script health check switches to http://localhost:8082/healthz.
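Defect 5: mcp-tushare is not in the matrix, but the script pulls it
The docker-build.yml matrix only contains data/backend/frontend. In compose.yml, mcp-tushare is defined with build: rather than image:, so it cannot be pulled. A compose pull mcp-tushare in the script would fail.
Fix: the deploy script does not pull mcp-tushare; it only health-checks it. mcp-tushare upgrades go through a separate manual process.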
Defect 6: backend /health coverage is insufficient
It currently checks only db + newapi. If the WeChat key or iLink key is missing, the backend boots fine but the payment flow is broken.
Fix: add two env presence checks to health.controller.ts (no ping needed, only non-empty):
```typescript
wechatPayConfigured: process.env.WECHAT_PAY_MCHID ? "ok" : "fail",
ilinkConfigured: process.env.ILINK_API_KEY ? "ok" : "fail",
```
ok condition: db=ok && (newapi=ok||skip) && wechatPay=ok && ilink=ok
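Defect 7: ProvisionWorker SIGTERM handling must ship before Phase 3
NestJS is down for ~30s during a deploy. SIGTERM kills the in-flight ProvisionTask → the Instance is stuck in PROVISIONING forever (after WeChat has already charged the user).
Fix (separate PR, must merge first): provision-worker.service.ts catches SIGTERM → UPDATE the current task to status=PENDING without incrementing attempts → process.exit(0).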
Phases
Phase 1: docker-build.yml extension (multi-image matrix) ✅ Done
Edit .github/workflows/docker-build.yml:
```yaml
on:
  pull_request:
    branches: [main]
    paths:
      - "deploy/Dockerfile"
      - "backend/**"
      - "frontend/**"
      - "core/**"
      - "src/**"
      - "scripts/admin/**"   # baked into deploy/Dockerfile
      - "profile/**"         # spawn-profile.sh reads these
      - ".github/workflows/docker-build.yml"
  push:
    branches: [main]
    paths: # same as above
    tags:
      - "v*.*.*"
jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: data
            dockerfile: deploy/Dockerfile
            context: .
            image: ghcr.io/lacatfly/twilight-drive
          - name: backend
            dockerfile: backend/Dockerfile
            context: backend
            image: ghcr.io/lacatfly/twilight-drive-backend
          - name: frontend
            dockerfile: frontend/Dockerfile
            context: frontend
            image: ghcr.io/lacatfly/twilight-drive-frontend
```
Path-aware build skip (optional optimization):
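- If a PR/push only touches `backend/**`, build only the backend image (decided with a `paths-filter` action)
- Simplified version: build all three images every time; builds are fast once the cache hits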
Tag computation (existing logic kept):
- `${IMAGE}:sha-<short_sha>`: always
- `${IMAGE}:latest` + `${IMAGE}:main`: on push to main
- `${IMAGE}:v<X>.<Y>.<Z>`: on tag push
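Merged state: the matrix runs in parallel and takes 3-4 minutes. Images are pushed to both GHCR and ACR; ECS pulls from ACR (no need to cross the GFW from inside China).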
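The tag rules above can be sketched as a small shell function. This is only an illustration: `compute_tags` is a hypothetical helper, not the logic that docker-build.yml actually runs, and the ref names follow the usual `refs/heads/...` and `refs/tags/...` layout.

```bash
# compute_tags IMAGE GIT_REF SHORT_SHA
# Prints one tag per line for the given ref (hypothetical helper).
compute_tags() {
  local image="$1" ref="$2" short_sha="$3"
  # Every build gets the immutable sha tag.
  echo "${image}:sha-${short_sha}"
  case "$ref" in
    refs/heads/main)
      # Push to main also moves the two floating tags.
      echo "${image}:latest"
      echo "${image}:main"
      ;;
    refs/tags/v[0-9]*.[0-9]*.[0-9]*)
      # Loose glob for v<X>.<Y>.<Z>; strip the refs/tags/ prefix.
      echo "${image}:${ref#refs/tags/}"
      ;;
  esac
}
```

Any other ref, such as a pull request ref, only receives the sha tag in this sketch.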
Phase 2: ECS-side compose + env changes
Change /home/twilight/twilight-app/compose.yml on ECS to reference a version variable:
```yaml
services:
  app-backend:
    image: ghcr.io/lacatfly/twilight-drive-backend:${TWILIGHT_APP_VERSION:-latest}
  app-frontend:
    image: ghcr.io/lacatfly/twilight-drive-frontend:${TWILIGHT_APP_VERSION:-latest}
```
Add to /home/twilight/twilight-app/.env on ECS:
```bash
TWILIGHT_APP_VERSION=latest
```
⚠ This is a manual operation on ECS, outside the PR. The steps go into the deploy runbook.
Phase 3: ecs-deploy.sh extension
Edit deploy/ecs-deploy.sh (v2, including the 7 defect fixes):
```bash
#!/usr/bin/env bash
set -euo pipefail

# Defect 2 fix: copy the script to /tmp and re-exec, so git reset cannot replace it mid-run
if [[ "$0" != /tmp/twilight-deploy.*.sh ]]; then
  tmp=$(mktemp /tmp/twilight-deploy.XXXXXX.sh)
  cp "$0" "$tmp"; chmod +x "$tmp"
  exec "$tmp" "$@"
fi

# In-place sed for both GNU sed (-i) and BSD sed (-i '')
sed_inplace() {
  if sed --version >/dev/null 2>&1; then sed -i "$1" "$2"
  else sed -i '' "$1" "$2"
  fi
}

TWILIGHT_HOME="${TWILIGHT_HOME:-/home/deploy/twilight}"
TWILIGHT_APP_HOME="${TWILIGHT_APP_HOME:-/home/twilight/twilight-app}"
SOURCE_DIR="$TWILIGHT_HOME/source"
ENV_FILE="$TWILIGHT_HOME/.env"
APP_ENV_FILE="$TWILIGHT_APP_HOME/.env"
DATA_COMPOSE="$SOURCE_DIR/deploy/compose.yml"
APP_COMPOSE="$TWILIGHT_APP_HOME/compose.yml"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/twilight}"
LOCK_FILE="${TWILIGHT_LOCK:-/tmp/twilight-deploy.lock}"
HEALTH_TRIES="${TWILIGHT_HEALTH_TRIES:-30}"
HEALTH_SLEEP="${TWILIGHT_HEALTH_SLEEP:-2}"
DATA_HEALTH="${DATA_HEALTH_URL:-http://localhost:8081/healthz}"
BACKEND_HEALTH="${BACKEND_HEALTH_URL:-http://localhost:4000/health}"
# Defect 4 fix: use /healthz, not / (nginx try_files returns 200 for any path)
FRONTEND_HEALTH="${FRONTEND_HEALTH_URL:-http://localhost:8082/healthz}"
MCP_HEALTH="${MCP_HEALTH_URL:-http://localhost:9100/healthz}"

# 1. parse + validate tag; single-service deploy is supported: "deploy <tag> [service]"
cmd="${SSH_ORIGINAL_COMMAND:-}"
read -r -a args <<< "$cmd"   # split on whitespace without glob expansion
NEW_TAG="${args[1]:-}"
TARGET_SERVICE="${args[2]:-all}"   # all | data | backend | frontend
[[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]] || { echo "invalid tag: $NEW_TAG" >&2; exit 2; }
[[ "$TARGET_SERVICE" =~ ^(all|data|backend|frontend)$ ]] || { echo "invalid service: $TARGET_SERVICE" >&2; exit 2; }

# 2. concurrency lock
exec 9>"$LOCK_FILE"; flock -n 9 || { echo "deploy already running" >&2; exit 3; }

# 3. snapshot rollback tags
PREV_DATA=$(grep -E '^TWILIGHT_VERSION=' "$ENV_FILE" | cut -d= -f2)
PREV_APP=$(grep -E '^TWILIGHT_APP_VERSION=' "$APP_ENV_FILE" | cut -d= -f2)
echo "rollback data=$PREV_DATA app=$PREV_APP"
echo "deploying: $NEW_TAG service=$TARGET_SERVICE"

# Put both env files back on the previous tags (used by every abort/rollback path)
revert_env_tags() {
  sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$PREV_DATA|" "$ENV_FILE"
  sed_inplace "s|^TWILIGHT_APP_VERSION=.*|TWILIGHT_APP_VERSION=$PREV_APP|" "$APP_ENV_FILE"
}

# 4. git pull source (spawn-profile.sh, profile/, prisma schema)
cd "$SOURCE_DIR"
git fetch --quiet origin main
git reset --hard "origin/main"

# 5. update env tags
sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
sed_inplace "s|^TWILIGHT_APP_VERSION=.*|TWILIGHT_APP_VERSION=$NEW_TAG|" "$APP_ENV_FILE"
DATABASE_URL=$(grep -E '^DATABASE_URL=' "$APP_ENV_FILE" | cut -d= -f2-)

# 6. pre-deploy pg_dump (Defect 1 fix: rollback can restore the schema)
#    Only runs when the backend, and therefore prisma migrate, is in scope.
DB_BACKUP="$BACKUP_DIR/pre-${NEW_TAG}.dump"
DEPLOYS_BACKEND=false
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]]; then
  DEPLOYS_BACKEND=true
  mkdir -p "$BACKUP_DIR"
  pg_dump "$DATABASE_URL" -F c -f "$DB_BACKUP" \
    || { echo "pg_dump failed, aborting" >&2; revert_env_tags; exit 5; }
fi

# 7. pull images (Defect 5 fix: do not pull mcp-tushare, it uses build: and cannot be pulled)
DATA_COMPOSE_CMD=(docker compose --env-file "$ENV_FILE" -f "$DATA_COMPOSE")
APP_COMPOSE_CMD=(docker compose -p twilight-app --env-file "$APP_ENV_FILE" -f "$APP_COMPOSE")
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]]; then
  "${DATA_COMPOSE_CMD[@]}" pull data
fi
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" || "$TARGET_SERVICE" == "frontend" ]]; then
  "${APP_COMPOSE_CMD[@]}" pull app-backend app-frontend
fi

# 8. prisma migrate deploy (Defect 1 fix: abort if it fails, never reach "up")
#    -p twilight-app keeps the network name stable (Defect 3 fix)
if $DEPLOYS_BACKEND; then
  ACR="${REGISTRY:-ghcr.io/lacatfly}"
  docker run --rm \
    --network twilight-app_default \
    --env-file "$APP_ENV_FILE" \
    "${ACR}/twilight-drive-backend:${NEW_TAG}" \
    npx prisma migrate deploy \
    || { echo "prisma migrate deploy failed, aborting; images not updated" >&2
         revert_env_tags
         exit 6; }
fi

# 9. up
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]]; then
  "${DATA_COMPOSE_CMD[@]}" up -d data
fi
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" || "$TARGET_SERVICE" == "frontend" ]]; then
  "${APP_COMPOSE_CMD[@]}" up -d app-backend app-frontend
fi

# 10. health-gated wait
check() { curl -fsS --max-time 2 "$1" >/dev/null 2>&1; }
healthy() {
  local ok=true
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]] && { check "$DATA_HEALTH" && check "$MCP_HEALTH" || ok=false; }
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]] && { check "$BACKEND_HEALTH" || ok=false; }
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "frontend" ]] && { check "$FRONTEND_HEALTH" || ok=false; }
  $ok
}
i=0
while (( i < HEALTH_TRIES )); do
  if healthy; then
    echo "all healthy after $((i+1)) tries"
    # Prune sha- images beyond five. The sort is lexical, not chronological,
    # so the current and previous tags are excluded explicitly to keep rollback possible.
    docker image ls --format '{{.Repository}}:{{.Tag}}' | grep ':sha-' \
      | grep -vE ":(${NEW_TAG}|${PREV_DATA}|${PREV_APP})\$" | sort | head -n -5 \
      | xargs -r docker rmi 2>/dev/null || true
    rm -f "$BACKUP_DIR/pre-${PREV_APP}.dump" 2>/dev/null || true
    exit 0
  fi
  i=$((i+1)); sleep "$HEALTH_SLEEP"
done

# 11. rollback ALL
echo "HEALTH FAILED: rolling back data=$PREV_DATA app=$PREV_APP" >&2
revert_env_tags
# schema rollback (Defect 1 fix), only when this deploy ran migrations
if $DEPLOYS_BACKEND; then
  pg_restore -d "$DATABASE_URL" -F c --clean --if-exists "$DB_BACKUP" 2>/dev/null \
    || echo "pg_restore warning: check manually" >&2
fi
"${DATA_COMPOSE_CMD[@]}" up -d data
"${APP_COMPOSE_CMD[@]}" up -d app-backend app-frontend
exit 1
```
Single-service deploy usage:
```bash
# data only
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 data"
# backend only (includes prisma migrate)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 backend"
# frontend only (no prisma)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 frontend"
# full stack (default)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234"
```
Phase 4: Prisma schema sync strategy
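Decision: option B, prisma migrate deploy ✅ (decided 2026-05-16)
Why option A (`prisma db push`) was dropped:
- The destructive-change red line relies on manual PR review, which is unreliable
- The deploy script would run `db push` automatically; once a reviewer misses a destructive diff, data is lost and nothing detects it automatically
- Schema rollback relies on `pg_restore`, which needs pg_dump exactly as option B does; B adds no extra cost
- `instance-lifecycle-rebuild.md` Phase 1 adds the `InstanceStatus` enum + 7 fields, which is the first real migration and the right moment to switch to B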
Steps to migrate to B (one-off, about 10 minutes):
```bash
cd backend
# Create the migration directory first (prisma/migrations/ does not exist yet)
mkdir -p prisma/migrations/0001_init
# Generate the initial migration (backfills the existing schema)
npx prisma migrate diff \
  --from-empty \
  --to-schema-datamodel prisma/schema.prisma \
  --script > prisma/migrations/0001_init/migration.sql
# Tell prisma this migration is already applied on the prod DB
# (run with DATABASE_URL pointing at prod)
npx prisma migrate resolve --applied 0001_init
# From now on local development uses migrate dev instead of db push
npx prisma migrate dev --name <feature-name>
```
ECS deploy switches to:
```bash
npx prisma migrate deploy   # only applies migrations that are not applied yet
```
Changes to the local development workflow:
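- `prisma db push` → `prisma migrate dev --name <feature-name>`
- The generated SQL files go into git, so schema changes can be reviewed inside the PR
- No need to apply a `db-migration` label by hand; the SQL files themselves are the audit trail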
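Phase 5: deploy user permission extension
ecs-deploy.sh currently runs as the deploy user (per docs/deploy/ecs-auto-deploy-runbook.md). New requirements:
- `git pull` in `$SOURCE_DIR`: the deploy user needs read/write access to the source directory
  - Change the source directory owner to `deploy:deploy`, or add an ACL
- `docker compose -f /home/twilight/twilight-app/compose.yml`: the deploy user needs access to the twilight-app directory
  - Add an ACL: `setfacl -m u:deploy:rx /home/twilight/twilight-app /home/twilight/twilight-app/compose.yml`
  - `.env` contains secrets; give the deploy user read-only access (docker compose must be able to read it). Note that ecs-deploy.sh also rewrites `TWILIGHT_APP_VERSION` in this file with `sed -i`, which needs write access to the file and its directory, so the ACL has to allow that as well
- `prisma migrate deploy`: needs `DATABASE_URL`, read from the app .env
- `pg_dump` / `pg_restore`: the ECS host needs `postgresql-client` (`apt install -y postgresql-client`)
⚠ This is manual setup on ECS, outside the PR; add it to the runbook.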
Phase 6.0: Pre-Deploy Checklist (external dependencies + credentials)
Added 2026-05-16 for audit gaps #4/#8/#9. Before every deploy, GH Actions (or the runbook) must verify the following.
External dependencies are reachable:
```bash
curl -fsS https://llm.fsagent.cc/v1/models -H "Authorization: Bearer $NEWAPI_ADMIN_TOKEN" >/dev/null
curl -fsS https://ws.fsagent.cc/healthz >/dev/null   # SearXNG
curl -fsS https://api.tushare.pro >/dev/null         # reachable from inside China without crossing the GFW
docker pull nousresearch/hermes-agent:0.13.0 --quiet # upstream image is available
```
If any check fails, abort early and do not run ecs-deploy.sh; otherwise the spawn worker will hang.
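One way to turn these checks into a single gate is a small wrapper that stops at the first failure. This is a sketch under assumptions: `run_preflight` and the `name::command` entry format are hypothetical and not part of the existing pipeline.

```bash
# run_preflight "name::command" ...
# Runs each command with sh -c and stops at the first one that fails.
run_preflight() {
  local entry name cmd
  for entry in "$@"; do
    name="${entry%%::*}"   # text before the first "::"
    cmd="${entry#*::}"     # text after the first "::"
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "preflight failed: $name" >&2
      return 1
    fi
  done
  echo "preflight ok"
}

# Example call (commented out so sourcing this file has no side effects):
# run_preflight \
#   "searxng::curl -fsS https://ws.fsagent.cc/healthz" \
#   "tushare::curl -fsS https://api.tushare.pro" \
#   || exit 4
```

Reporting the name of the failing check keeps the abort message readable in the GH Actions log.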
ECS-side env credential table (the runbook must list these):
| Env | Purpose | Used for |
|---|---|---|
| WECHAT_PAY_MCHID | WeChat Pay merchant ID | Receiving payments |
| WECHAT_PAY_PRIVATE_KEY | Merchant private key (PEM) | Signing |
| WECHAT_PAY_API_V3_KEY | APIv3 key | Webhook decryption |
| WECHAT_PAY_CERT_SERIAL | Merchant certificate serial number | Signature header |
| ILINK_API_KEY | iLink bot platform token | QR + WeChat callbacks |
| NEWAPI_BASE_URL / NEWAPI_ADMIN_TOKEN | NewAPI gateway | Issuing LLM keys |
| SEARXNG_URL | https://ws.fsagent.cc | Hermes profile .env |
| DATABASE_URL | Postgres | NestJS + prisma migrate deploy |
| JWT_SECRET | Session token | Auth |
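If any of these is missing, the backend still boots but the feature is broken; frontend health still returns 200, so rollback is never triggered. This is a blind spot of the health gate. The runbook must force a manual spot-check that every env above is present.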
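The manual spot-check could be backed by a small script. The sketch below assumes a plain `KEY=value` env file without quoting; `missing_env_keys` is a hypothetical helper name, and a value written as `KEY=""` would count as present here.

```bash
# missing_env_keys ENV_FILE KEY...
# Prints every KEY that is absent from ENV_FILE or has an empty value.
# Returns 0 when nothing is missing, 1 otherwise.
missing_env_keys() {
  local env_file="$1"; shift
  local key value missing=0
  for key in "$@"; do
    # Use the last KEY=... line; everything after the first "=" is the value.
    value=$(grep -E "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-) || true
    if [ -z "$value" ]; then
      echo "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `missing_env_keys "$APP_ENV_FILE" WECHAT_PAY_MCHID ILINK_API_KEY JWT_SECRET || exit 4` before the deploy would surface a missing credential that the health gate cannot see.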
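ProvisionWorker restart window:
NestJS is down for ~30s during a deploy. A ProvisionTask in the PROCESSING state is killed by SIGTERM. Required behavior:
- The `instance-lifecycle-rebuild.md` Phase 3 worker needs a graceful shutdown: catch SIGTERM → UPDATE the in-flight task to status=PENDING without incrementing attempts
- On the next boot the worker picks the task up again within 5s
This patch must land in the same release as this deploy plan. Otherwise a deploy that hits the middle of payment + QR scan + spawn leaves an orphaned container and an Instance stuck in PROVISIONING forever.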
Phase 6: Runbook update
Edit docs/deploy/ecs-auto-deploy-runbook.md:
- Add a "Step 7: App stack onboarding" section
- List the manual operations on the ECS side:
  - Add the `${TWILIGHT_APP_VERSION}` placeholder to compose.yml
  - Add `TWILIGHT_APP_VERSION=latest` to `.env`
  - Extend the deploy user's permissions (ACL / chown)
- Verification script: run `ssh deploy@ssh-ecs.fsagent.cc 'deploy sha-abc1234'` and check whether it reports an invalid tag
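The tag check behind that verification step can be exercised locally before touching ECS. The sketch below reuses the pattern from ecs-deploy.sh but evaluates it with `grep -E` instead of bash's `=~`; `valid_tag` is a hypothetical helper name, and the input is assumed to be a single line.

```bash
# valid_tag TAG
# Succeeds only for sha-<7 hex>, latest, main, or v<X>.<Y>.<Z>.
valid_tag() {
  printf '%s\n' "$1" \
    | grep -Eq '^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$'
}
```

For example, `valid_tag sha-abc1234` succeeds, while `valid_tag sha-abc123` (six hex digits) and `valid_tag 'latest;reboot'` fail.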
Out of scope
- ❌ No multi-arch buildx: build linux/amd64 only (ECS is x86)
- ❌ No blue-green deploy: with the current architecture, a single instance + health rollback is enough
- ❌ No ECS-side webhook listener: keep the GH Actions SSH push model (already in place)
- ❌ No change to the mcp-tushare image build: it is handled in a separate PR
  - ⚠ Per Defect 5, `ecs-deploy.sh` does not pull mcp-tushare (compose defines it with `build:`); it only adds it to the health gate (see Phase 3). Version bumps are done by hand.
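Acceptance checklist
Existing:
- [x] PR touching only `backend/**` → docker-build builds + pushes the backend image ✅ (matrix merged)
- [x] PR touching `backend/** + frontend/**` → both are built and pushed ✅
- [x] PR touching only docs → docker-build is not triggered ✅
- [ ] Deploy completes: the data + app-backend + app-frontend containers are all healthy
- [ ] Prisma migration applied: after deploy, `SELECT * FROM "_prisma_migrations"` contains the new migration
- [ ] `git reset --hard` leaves `scripts/admin/spawn-profile.sh` on the ECS host at the latest version
- [ ] Health failure: all three containers roll back to the previous version automatically, and the schema rolls back with them (verified via pg_restore)
- [ ] Deploy lock: two concurrent deploys → the second exits with 3
- [ ] Image cleanup: the 5 most recent sha- images are kept, the rest are removed with rmi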
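New in v2 (defect fix verification):
- [ ] Defect 1: when the backend fails to boot, pg_restore runs automatically, the old schema is restored, and the old image starts successfully
- [ ] Defect 2: `git reset --hard` replaces ecs-deploy.sh mid-run, and the current deploy still finishes with the original version (verify: `ls /tmp/twilight-deploy.*.sh` exists)
- [ ] Defect 3: `docker network ls` shows no duplicate `twilight-app_twilight-app_default` network
- [ ] Defect 4: frontend `/healthz` returns 200; after deleting `/usr/share/nginx/html/index.html`, `/healthz` still returns 200 (proves it is independent)
- [ ] Defect 5: `compose pull mcp-tushare` is not in the script; the mcp health check on `:9100/healthz` passes
- [ ] Defect 6: backend `/health` returns `wechatPayConfigured: ok, ilinkConfigured: ok`; deliberately clearing `ILINK_API_KEY` → `/health` returns 503
- [ ] Defect 7: NestJS receives SIGTERM during deploy → the in-flight ProvisionTask is picked up again within 5s of restart, with attempts not incremented
- [ ] Single service: `deploy sha-xxx frontend` does not restart backend or data (verify: `docker inspect twilight-app-backend` StartedAt is unchanged)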
Rollout order (v2)
Pre-patches (separate small PRs, merged first, blocking Phase 3):
1. PR-A: `health.controller.ts` adds the `wechatPayConfigured + ilinkConfigured` env presence checks (Defect 6)
2. PR-B: `frontend/nginx.conf` adds the `/healthz` location (Defect 4)
3. PR-C: `provision-worker.service.ts` SIGTERM graceful shutdown (Defect 7)
4. PR-D: `backend/prisma/` initial migration (Phase 4, option B backfill)

Main Phase 3 PR:
5. PR-E: deploy/ecs-deploy.sh v2 (7 defect fixes + single-service argument + prisma migrate deploy + pg_dump/restore)

ECS-side manual steps (Phase 2 + 5):
6. Add `${TWILIGHT_APP_VERSION:-latest}` and the `8082:80` port mapping to /home/twilight/twilight-app/compose.yml
7. Add `TWILIGHT_APP_VERSION=latest` to /home/twilight/twilight-app/.env
8. deploy user ACLs: `setfacl -m u:deploy:rwx "$SOURCE_DIR"` + `setfacl -m u:deploy:rx "$TWILIGHT_APP_HOME"`
9. Install postgresql-client on ECS (`apt install -y postgresql-client`)
10. `mkdir -p /var/backups/twilight && chown deploy:deploy /var/backups/twilight`

Phase 6 Runbook:
11. Add Step 7 and the `[service]` argument description to docs/deploy/ecs-auto-deploy-runbook.md

Cutover:
12. First `deploy sha-<latest> data` (single service, lowest risk)
13. Then `deploy sha-<latest> backend` (includes the prisma migration)
14. Then `deploy sha-<latest> frontend`
15. Once everything passes, the next push to main triggers a full-stack deploy automatically
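Risks + mitigations (v2)
| Risk | Mitigation |
|---|---|
| prisma migrate deploy runs destructive SQL → data loss | Automatic pg_dump backup before deploy; migration SQL is reviewable in the PR |
| pg_restore fails → schema and image are both on the new version | A pg_restore failure prints a warning for manual intervention; the dump is kept for 24h |
| Backend image boots slowly, health check times out | HEALTH_TRIES=30 × HEALTH_SLEEP=2s = 60s startup window |
| Frontend /healthz does not exist (nginx 404) | On a 404, curl -fsS exits non-zero, the health loop never passes, and the deploy rolls back |
| git reset --hard overwrites ecs-deploy.sh | Fixed by self-copying to /tmp |
| compose project name drift | Pinned with -p twilight-app |
| pg_dump still runs on single-service deploys (not needed) | pg_dump only runs when TARGET_SERVICE includes backend |
| Slow GHCR pulls (GFW) | ACR mirror already exists; no change |
| mcp-tushare health failure → blocks the full-stack deploy | mcp health is only checked when TARGET_SERVICE=all/data; mcp failures are fixed separately |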
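The 60s startup window in the table counts only the sleep between tries. In the Phase 3 script each try can also spend up to `--max-time 2` on each of up to four curl checks (data, mcp, backend, frontend), so the worst-case wait before rollback is longer. A small arithmetic sketch, with `health_window` as a hypothetical helper:

```bash
# health_window TRIES SLEEP_S [CHECKS] [TIMEOUT_S]
# Prints the total wait in seconds: TRIES * (SLEEP_S + CHECKS * TIMEOUT_S).
health_window() {
  local tries="$1" sleep_s="$2" checks="${3:-0}" timeout_s="${4:-0}"
  echo $(( tries * (sleep_s + checks * timeout_s) ))
}

health_window 30 2       # sleep time only: 60
health_window 30 2 4 2   # every check hitting its 2s timeout: 300
```

When sizing `TWILIGHT_HEALTH_TRIES`, the second figure is the upper bound on how long a failed deploy keeps the new containers running before rollback starts.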
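Follow-up Phase 7 (outside the scope of this plan)
- mcp-tushare image build + auto-deploy (added to the matrix + an MCP_TUSHARE_VERSION variable)
- Full-stack metrics + alerting (Grafana + the cloudflared metrics endpoint already exist)
- Staging environment (run a copy of the compose stack on Vultr Japan; push to main deploys to staging first, and prod is triggered only after manual approval)
- Scheduled automatic pg_dump backups (cron, not only before deploys)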