
App Stack Auto-Deploy (NestJS backend + React frontend)

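Date: 2026-05-16. Revised: 2026-05-16 (critical review v2: 7 critical defects, Prisma decision changed to B, Phase 1 marked complete). Goal: connect twilight-app-backend (NestJS) and twilight-app-frontend (React) to the existing GitHub Actions → cloudflared SSH → ECS auto-deploy pipeline, reusing the pattern already proven with twilight-data.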


Current state (v2, calibrated 2026-05-16)

| Item | Current state |
|---|---|
| twilight-data (FastAPI) | ✅ docker-build.yml builds and pushes to GHCR; ecs-deploy.sh runs compose pull/up data; health-gated rollback |
| docker-build.yml matrix (3 images) | Merged: data/backend/frontend build in parallel and push to GHCR + ACR |
| twilight-app-backend (NestJS) | ❌ Image builds; the compose file on ECS is not wired into the deploy pipeline |
| twilight-app-frontend (React) | ❌ Same as above |
| backend /health endpoint | ⚠️ Has db + newapi checks; searxng + wechatPay + iLink env presence checks are missing |
| mcp-tushare /healthz | Exists at server.py:42; pings DuckDB |
| scripts/admin/spawn-profile.sh | ❌ ECS host script; only updated by git pull |
| profile/template-stock-research-pro/** | ❌ ECS host files; need git pull |
| Prisma schema | ❌ No migration deploy step; the migrations/ directory does not exist |
| ProvisionWorker SIGTERM graceful shutdown | ❌ In-flight task is killed by SIGTERM → stuck in PROVISIONING forever |

Target pipeline

push to main
   ↓
docker-build.yml (extended)
   ├─ build + push  twilight-drive          (existing, FastAPI)
   ├─ build + push  twilight-drive-backend  (NEW)
   └─ build + push  twilight-drive-frontend (NEW)
   ↓
deploy-ecs.yml (existing trigger: workflow_run)
   ↓
ecs-deploy.sh (extended)
   ├─ git pull origin main              (NEW: pulls spawn-profile.sh, profile/*)
   ├─ snapshot rollback tags             (extended with TWILIGHT_APP_VERSION)
   ├─ docker compose -f twilight/source/deploy/compose.yml pull data
   ├─ docker compose -f twilight/source/deploy/compose.yml up -d data
   ├─ docker compose -f twilight-app/compose.yml pull app-backend app-frontend
   ├─ pg_dump pre-migration backup (defect 1)
   ├─ prisma migrate deploy (option B, see Phase 4)
   ├─ docker compose -f twilight-app/compose.yml up -d app-backend app-frontend
   ├─ health: :8081/healthz (data) AND :4000/health (backend) AND :8082/healthz (frontend)
   └─ rollback ALL if any health check fails


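⚠️ Critical defects (found in the v2 review; must be fixed before Phase 3 starts)

Defect 1: Rollback does not restore the schema (a logic bomb)

If prisma migrate deploy succeeds but the backend fails to boot, rollback only reverts the image tag. The new schema stays in place, the old image runs against it, and the service stays down permanently.

Fix: run pg_dump automatically before deploy; the rollback path must include pg_restore. Script:

bash
pg_dump "$DATABASE_URL" -F c -f "/var/backups/twilight/pre-${NEW_TAG}.dump"

On rollback, restore the dump taken before this deploy (it is named after the new tag):

bash
pg_restore -d "$DATABASE_URL" -F c --clean --if-exists "/var/backups/twilight/pre-${NEW_TAG}.dump"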
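
One practical caveat, offered as a hedged sketch: Prisma connection URLs often carry Prisma-only query parameters such as `?schema=public`, and libpq tools like pg_dump and pg_restore reject parameters they do not know. Whether this applies to the `DATABASE_URL` here is an assumption, since the source does not show the URL. If it does, a small helper can strip those parameters before the dump. `libpq_url` is a hypothetical name, not part of the existing script.

```shell
# Hypothetical helper (not from the source): drop Prisma-only query parameters
# so libpq tools (pg_dump, pg_restore) accept the URL. Other parameters are kept.
libpq_url() {
  local url="$1" base query pair kept="" old_ifs
  case "$url" in
    *\?*) ;;
    *) printf '%s\n' "$url"; return 0 ;;
  esac
  base="${url%%\?*}"            # everything before the first "?"
  query="${url#*\?}"            # everything after the first "?"
  old_ifs="$IFS"; IFS='&'
  set -f                        # no glob expansion while splitting on "&"
  for pair in $query; do
    case "$pair" in
      schema=*|connection_limit=*|pool_timeout=*|pgbouncer=*) ;;  # Prisma-only
      *) kept="${kept:+$kept&}$pair" ;;
    esac
  done
  set +f; IFS="$old_ifs"
  printf '%s\n' "$base${kept:+?$kept}"
}
```

With this in place, the backup line would read `pg_dump "$(libpq_url "$DATABASE_URL")" -F c -f ...`, and the restore would use the same wrapped URL.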

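Defect 2: The running script is overwritten by git reset

With the flock held, git reset --hard origin/main overwrites the ecs-deploy.sh that the symlink points to. bash reads a script incrementally as it executes, so replacing the file mid-run leads to undefined behavior.

Fix: have the script copy itself at the top and exec the copy: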

bash
if [[ "$0" != /tmp/twilight-deploy.*.sh ]]; then
  tmp=$(mktemp /tmp/twilight-deploy.XXXXXX.sh)
  cp "$0" "$tmp"; chmod +x "$tmp"
  exec "$tmp" "$@"
fi

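Defect 3: compose project name drift

docker run --rm --network twilight-app_default hard-codes the project name. If the deploy user runs from a different cwd, docker compose creates a new project and a new network, and prisma migrate deploy cannot reach the db.

Fix: pass -p twilight-app explicitly to every compose command.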
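
As an extra guard, the migration step could check that the expected network exists before running. The sketch below is an illustration under assumptions: `require_network` is a hypothetical name, and the `DOCKER` override exists only so the check can be exercised without a Docker daemon.

```shell
# Hypothetical preflight: fail fast if the compose network that the
# "docker run --network ..." migration step depends on does not exist.
# DOCKER is an illustrative override; it defaults to the real docker CLI.
DOCKER="${DOCKER:-docker}"

require_network() {
  local network="$1"
  if "$DOCKER" network inspect "$network" >/dev/null 2>&1; then
    return 0
  fi
  echo "network $network not found; was compose started with -p ${network%_default}?" >&2
  return 1
}
```

Calling `require_network twilight-app_default || exit 6` just before the `docker run` step would turn a confusing database connection error into an explicit message about the project name.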

Defect 4: frontend health false positive

nginx try_files returns 200 for any path, even when the SPA is broken. curl :8082/ cannot tell whether the frontend is actually usable.

Fix: add to frontend/nginx.conf:

nginx
location = /healthz {
    access_log off;
    default_type application/json;   # sets Content-Type without adding a second header
    return 200 '{"ok":true}';
}

The deploy script health check then uses http://localhost:8082/healthz
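
A status code alone can still mislead: if the /healthz location were ever missing, the SPA fallback would answer the same URL with index.html and a 200. A minimal sketch of a stricter check, assuming the JSON body shown above, also verifies the response body. `health_body_ok` and `check_frontend` are hypothetical names.

```shell
# Sketch: treat the frontend as healthy only if /healthz answers 2xx AND the
# body carries the expected JSON marker.
health_body_ok() {
  # Reads the response body on stdin.
  grep -q '"ok":true'
}

check_frontend() {
  local url="${1:-http://localhost:8082/healthz}"
  # If curl fails, the body is empty and the grep fails too.
  curl -fsS --max-time 2 "$url" 2>/dev/null | health_body_ok
}
```

An index.html served by the SPA fallback does not contain the marker, so the check fails and the health gate rolls the deploy back.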

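Defect 5: mcp-tushare is not in the matrix but the script pulls it

The docker-build.yml matrix only has data/backend/frontend. mcp-tushare uses build: rather than image: in compose.yml, so it cannot be pulled. compose pull mcp-tushare in the script would fail.

Fix: the deploy script does not pull mcp-tushare; it only health-checks it. mcp-tushare upgrades go through a separate manual process.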

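Defect 6: backend /health coverage is insufficient

It currently checks only db + newapi. If the WeChat key or iLink key is missing, the backend boots fine but the payment flow breaks.

Fix: add two env presence checks to health.controller.ts (no ping needed, only non-empty):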

typescript
wechatPayConfigured: process.env.WECHAT_PAY_MCHID ? "ok" : "fail",
ilinkConfigured: process.env.ILINK_API_KEY ? "ok" : "fail",

ok condition: db=ok && (newapi=ok||skip) && wechatPay=ok && ilink=ok

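Defect 7: ProvisionWorker SIGTERM handling must ship before Phase 3

NestJS is down for ~30s during a deploy. SIGTERM kills the in-flight ProvisionTask, leaving the Instance stuck in PROVISIONING forever (after WeChat has already charged the user).

Fix (separate PR, must merge first): provision-worker.service.ts catches SIGTERM → UPDATE the current task to status=PENDING without incrementing attempts → process.exit(0).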


Phases

Phase 1: docker-build.yml extension (multi-image matrix) ✅ Done

.github/workflows/docker-build.yml

yaml
on:
  pull_request:
    branches: [main]
    paths:
      - "deploy/Dockerfile"
      - "backend/**"
      - "frontend/**"
      - "core/**"
      - "src/**"
      - "scripts/admin/**"        # baked into deploy/Dockerfile
      - "profile/**"               # spawn-profile.sh reads these
      - ".github/workflows/docker-build.yml"
  push:
    branches: [main]
    paths: # same as above
    tags:
      - "v*.*.*"

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        include:
          - name: data
            dockerfile: deploy/Dockerfile
            context: .
            image: ghcr.io/lacatfly/twilight-drive
          - name: backend
            dockerfile: backend/Dockerfile
            context: backend
            image: ghcr.io/lacatfly/twilight-drive-backend
          - name: frontend
            dockerfile: frontend/Dockerfile
            context: frontend
            image: ghcr.io/lacatfly/twilight-drive-frontend

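Path-aware build skip (optional optimization):

  • If a PR/push only touches backend/**, build only the backend image (decided with a paths-filter action)
  • Simplified version: always build all three images; with cache hits this is fast anyway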

Tag computation (existing logic kept):

  • ${IMAGE}:sha-<short_sha> is always applied
  • ${IMAGE}:latest + ${IMAGE}:main are applied on push to main
  • ${IMAGE}:v<X>.<Y>.<Z> is applied on tag push
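
The tag rules above can be sketched as a small shell function. This is an illustration, not the workflow's actual implementation: `compute_tags` is a hypothetical helper, its inputs mirror GitHub's `GITHUB_REF` and `GITHUB_SHA`, and a 7-character short SHA is assumed because that is what the `sha-[a-f0-9]{7}` pattern in ecs-deploy.sh accepts.

```shell
# Sketch of the tag rules: always sha-<short>, plus latest/main on main,
# plus the version tag on a v*.*.* tag push.
compute_tags() {
  local image="$1" ref="$2" sha="$3" short
  short="$(printf '%s' "$sha" | cut -c1-7)"
  echo "${image}:sha-${short}"
  case "$ref" in
    refs/heads/main)
      echo "${image}:latest"
      echo "${image}:main"
      ;;
    refs/tags/v[0-9]*.[0-9]*.[0-9]*)
      echo "${image}:${ref#refs/tags/}"
      ;;
  esac
}
```

For example, `compute_tags ghcr.io/lacatfly/twilight-drive-backend "$GITHUB_REF" "$GITHUB_SHA"` prints one tag per line, which can be fed to the build step's tag list.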

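Merged state: the matrix runs in parallel and takes 3 to 4 minutes. Images are pushed to both GHCR and ACR; ECS pulls from ACR (inside China, so no detour around the GFW).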


Phase 2: ECS-side compose + env changes

On ECS, /home/twilight/twilight-app/compose.yml changes to reference a version variable:

yaml
services:
  app-backend:
    image: ghcr.io/lacatfly/twilight-drive-backend:${TWILIGHT_APP_VERSION:-latest}
  app-frontend:
    image: ghcr.io/lacatfly/twilight-drive-frontend:${TWILIGHT_APP_VERSION:-latest}

Add to /home/twilight/twilight-app/.env on ECS:

bash
TWILIGHT_APP_VERSION=latest

⚠ This is a manual operation on ECS and is not part of the PR. Add the runbook steps to the deploy documentation.


Phase 3: ecs-deploy.sh extension

deploy/ecs-deploy.sh (v2, including the 7 defect fixes):

bash
#!/usr/bin/env bash
set -euo pipefail

# Defect 2 fix: the script copies itself so a mid-run git reset cannot replace it
if [[ "$0" != /tmp/twilight-deploy.*.sh ]]; then
  tmp=$(mktemp /tmp/twilight-deploy.XXXXXX.sh)
  cp "$0" "$tmp"; chmod +x "$tmp"
  exec "$tmp" "$@"
fi

sed_inplace() {
  if sed --version >/dev/null 2>&1; then sed -i "$1" "$2"
  else sed -i '' "$1" "$2"
  fi
}

TWILIGHT_HOME="${TWILIGHT_HOME:-/home/deploy/twilight}"
TWILIGHT_APP_HOME="${TWILIGHT_APP_HOME:-/home/twilight/twilight-app}"
SOURCE_DIR="$TWILIGHT_HOME/source"
ENV_FILE="$TWILIGHT_HOME/.env"
APP_ENV_FILE="$TWILIGHT_APP_HOME/.env"
DATA_COMPOSE="$SOURCE_DIR/deploy/compose.yml"
APP_COMPOSE="$TWILIGHT_APP_HOME/compose.yml"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/twilight}"

LOCK_FILE="${TWILIGHT_LOCK:-/tmp/twilight-deploy.lock}"
HEALTH_TRIES="${TWILIGHT_HEALTH_TRIES:-30}"
HEALTH_SLEEP="${TWILIGHT_HEALTH_SLEEP:-2}"
DATA_HEALTH="${DATA_HEALTH_URL:-http://localhost:8081/healthz}"
BACKEND_HEALTH="${BACKEND_HEALTH_URL:-http://localhost:4000/health}"
# Defect 4 fix: use /healthz, not / (nginx try_files returns 200 for any path)
FRONTEND_HEALTH="${FRONTEND_HEALTH_URL:-http://localhost:8082/healthz}"
MCP_HEALTH="${MCP_HEALTH_URL:-http://localhost:9100/healthz}"

# 1. parse + validate tag; supports single-service deploy: "deploy <tag> [service]"
cmd="${SSH_ORIGINAL_COMMAND:-}"
read -r -a args <<< "$cmd"   # split on whitespace without glob expansion
NEW_TAG="${args[1]:-}"
TARGET_SERVICE="${args[2]:-all}"  # all | data | backend | frontend
[[ "$NEW_TAG" =~ ^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$ ]] || { echo "invalid tag: $NEW_TAG" >&2; exit 2; }
[[ "$TARGET_SERVICE" =~ ^(all|data|backend|frontend)$ ]] || { echo "invalid service: $TARGET_SERVICE" >&2; exit 2; }

# 2. concurrency lock
exec 9>"$LOCK_FILE"; flock -n 9 || { echo "deploy already running" >&2; exit 3; }

# 3. snapshot rollback tags
PREV_DATA=$(grep -E '^TWILIGHT_VERSION=' "$ENV_FILE" | cut -d= -f2)
PREV_APP=$(grep -E '^TWILIGHT_APP_VERSION=' "$APP_ENV_FILE" | cut -d= -f2)
echo "rollback data=$PREV_DATA  app=$PREV_APP"
echo "deploying:    $NEW_TAG  service=$TARGET_SERVICE"

# 4. git pull source (spawn-profile.sh, profile/, prisma schema)
cd "$SOURCE_DIR"
git fetch --quiet origin main
git reset --hard "origin/main"

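# 5. update env tags (only for the stack being deployed)
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]]; then
  sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$NEW_TAG|" "$ENV_FILE"
fi
if [[ "$TARGET_SERVICE" != "data" ]]; then
  sed_inplace "s|^TWILIGHT_APP_VERSION=.*|TWILIGHT_APP_VERSION=$NEW_TAG|" "$APP_ENV_FILE"
fi

DATABASE_URL=$(grep -E '^DATABASE_URL=' "$APP_ENV_FILE" | cut -d= -f2-)

# 6. pre-deploy pg_dump backup (defect 1 fix: lets rollback restore the schema);
#    only needed when a migration can run, i.e. when backend is targeted
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]]; then
  mkdir -p "$BACKUP_DIR"
  pg_dump "$DATABASE_URL" -F c -f "$BACKUP_DIR/pre-${NEW_TAG}.dump" \
    || { echo "pg_dump failed, aborting" >&2; exit 5; }
fi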

# 7. pull images (defect 5 fix: do not pull mcp-tushare; it uses build: and cannot be pulled)
DATA_COMPOSE_CMD=(docker compose --env-file "$ENV_FILE" -f "$DATA_COMPOSE")
APP_COMPOSE_CMD=(docker compose -p twilight-app --env-file "$APP_ENV_FILE" -f "$APP_COMPOSE")

if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]]; then
  "${DATA_COMPOSE_CMD[@]}" pull data
fi
# pull only the app services that are being deployed
case "$TARGET_SERVICE" in
  all)      "${APP_COMPOSE_CMD[@]}" pull app-backend app-frontend ;;
  backend)  "${APP_COMPOSE_CMD[@]}" pull app-backend ;;
  frontend) "${APP_COMPOSE_CMD[@]}" pull app-frontend ;;
esac

# 8. prisma migrate deploy (defect 1 fix: abort if migrate deploy fails, never reach up)
#    -p twilight-app keeps the network name stable (defect 3 fix)
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]]; then
  ACR="${REGISTRY:-ghcr.io/lacatfly}"
  docker run --rm \
    --network twilight-app_default \
    --env-file "$APP_ENV_FILE" \
    "${ACR}/twilight-drive-backend:${NEW_TAG}" \
    npx prisma migrate deploy \
    || { echo "prisma migrate deploy failed — abort, images not updated" >&2
         sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$PREV_DATA|" "$ENV_FILE"
         sed_inplace "s|^TWILIGHT_APP_VERSION=.*|TWILIGHT_APP_VERSION=$PREV_APP|" "$APP_ENV_FILE"
         exit 6; }
fi

# 9. up (only the targeted services, so a frontend-only deploy does not restart backend)
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]]; then
  "${DATA_COMPOSE_CMD[@]}" up -d data
fi
case "$TARGET_SERVICE" in
  all)      "${APP_COMPOSE_CMD[@]}" up -d app-backend app-frontend ;;
  backend)  "${APP_COMPOSE_CMD[@]}" up -d app-backend ;;
  frontend) "${APP_COMPOSE_CMD[@]}" up -d --no-deps app-frontend ;;
esac

# 10. health-gated wait
check() { curl -fsS --max-time 2 "$1" >/dev/null 2>&1; }
healthy() {
  local ok=true
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "data" ]] && { check "$DATA_HEALTH" && check "$MCP_HEALTH" || ok=false; }
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]] && { check "$BACKEND_HEALTH" || ok=false; }
  [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "frontend" ]] && { check "$FRONTEND_HEALTH" || ok=false; }
  $ok
}

i=0
while (( i < HEALTH_TRIES )); do
  if healthy; then
    echo "all healthy after $((i+1)) tries"
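    # Keep the 5 newest sha- images (docker image ls lists newest first) and never
    # remove the rollback or current tags; sorting by tag name would not reflect age
    docker image ls --format '{{.Repository}}:{{.Tag}}' | grep ':sha-' \
      | grep -v -e ":${PREV_DATA}\$" -e ":${PREV_APP}\$" -e ":${NEW_TAG}\$" \
      | tail -n +6 | xargs -r docker rmi 2>/dev/null || true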
    rm -f "$BACKUP_DIR/pre-${PREV_APP}.dump" 2>/dev/null || true
    exit 0
  fi
  i=$((i+1)); sleep "$HEALTH_SLEEP"
done

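# 11. rollback ALL
echo "HEALTH FAILED: rolling back data=$PREV_DATA app=$PREV_APP" >&2
sed_inplace "s|^TWILIGHT_VERSION=.*|TWILIGHT_VERSION=$PREV_DATA|" "$ENV_FILE"
sed_inplace "s|^TWILIGHT_APP_VERSION=.*|TWILIGHT_APP_VERSION=$PREV_APP|" "$APP_ENV_FILE"
# schema rollback (defect 1 fix): only when a migration may have run, and with the
# new backend stopped so it holds no connections during the restore.
# Note: the restore discards rows written since the dump was taken.
if [[ "$TARGET_SERVICE" == "all" || "$TARGET_SERVICE" == "backend" ]]; then
  "${APP_COMPOSE_CMD[@]}" stop app-backend || true
  pg_restore -d "$DATABASE_URL" -F c --clean --if-exists \
    "$BACKUP_DIR/pre-${NEW_TAG}.dump" || echo "pg_restore failed: check manually" >&2
fi
"${DATA_COMPOSE_CMD[@]}" up -d data
"${APP_COMPOSE_CMD[@]}" up -d app-backend app-frontend
exit 1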
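
The image cleanup step is easier to reason about when the selection is separated from the deletion. The sketch below is an illustration under assumptions: `select_prunable` is a hypothetical helper, and its input is expected to be `repo:tag` lines ordered newest first, which is the order `docker image ls` prints.

```shell
# Sketch: choose which sha- images to remove.
# Usage: select_prunable KEEP PROTECTED_TAG...   (image refs on stdin, newest first)
select_prunable() {
  local keep="$1"; shift
  local protected=" $* "
  local seen=0 ref tag
  while IFS= read -r ref; do
    tag="${ref##*:}"
    case "$tag" in sha-*) ;; *) continue ;; esac        # only sha- tags are pruned
    case "$protected" in *" $tag "*) continue ;; esac   # never drop rollback targets
    seen=$((seen + 1))
    if [ "$seen" -gt "$keep" ]; then
      printf '%s\n' "$ref"
    fi
  done
}
```

It could be wired in as `docker image ls --format '{{.Repository}}:{{.Tag}}' | select_prunable 5 "$PREV_DATA" "$PREV_APP" "$NEW_TAG" | xargs -r docker rmi`. Because the previous and current tags are passed as protected, a rollback target is never removed regardless of its age.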

Single-service deploy usage:

bash
# data only
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 data"
# backend only (includes prisma migrate)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 backend"
# frontend only (no prisma)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234 frontend"
# full stack (default)
ssh deploy@ssh-ecs.fsagent.cc "deploy sha-abc1234"
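
Because the command string arrives from SSH, it helps to validate it in one place before anything touches state. The following is a minimal sketch, not the script's actual code: `parse_deploy_command` is a hypothetical name, the tag and service patterns mirror the ones in ecs-deploy.sh, and the check on the leading `deploy` verb and on trailing extra words is an added assumption.

```shell
# Sketch: validate "deploy <tag> [service]" without side effects.
# Prints "<tag> <service>" on success, returns 2 on any invalid input.
parse_deploy_command() {
  local verb tag service extra
  # read splits on whitespace and performs no glob expansion
  read -r verb tag service extra <<EOF
$1
EOF
  service="${service:-all}"
  [ "$verb" = "deploy" ] && [ -z "$extra" ] || return 2
  printf '%s' "$tag" \
    | grep -Eq '^(sha-[a-f0-9]{7}|latest|main|v[0-9]+\.[0-9]+\.[0-9]+)$' || return 2
  case "$service" in
    all|data|backend|frontend) ;;
    *) return 2 ;;
  esac
  printf '%s %s\n' "$tag" "$service"
}
```

A command such as `deploy sha-abc1234; rm -rf /` is rejected because the tag no longer matches the pattern and extra words are present; nothing from the input is ever evaluated.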

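Phase 4: Prisma schema sync strategy

Decision: option B, prisma migrate deploy ✅ (decided 2026-05-16)

Why option A (db push) was dropped:

  • The destructive-change red line relies on manual PR review, which is unreliable
  • The deploy script would run db push automatically; if a reviewer misses a destructive diff, data is lost and nothing detects it automatically
  • Schema rollback relies on pg_restore, which needs pg_dump just as option B does; B carries no extra cost
  • Phase 1 of instance-lifecycle-rebuild.md adds the InstanceStatus enum + 7 fields, which is the first real migration and the right moment to switch to B

Steps to migrate to B (one-time, about 10 minutes):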

bash
cd backend
# Create the migration directory first (prisma/migrations/ does not exist yet)
mkdir -p prisma/migrations/0001_init

# Generate the initial migration (a baseline of the existing schema)
npx prisma migrate diff \
  --from-empty \
  --to-schema-datamodel prisma/schema.prisma \
  --script > prisma/migrations/0001_init/migration.sql

# Tell prisma this migration is already applied on the prod DB
# (run with DATABASE_URL pointing at prod)
npx prisma migrate resolve --applied 0001_init

# From now on, local development uses migrate dev instead of db push
npx prisma migrate dev --name <feature-name>

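ECS deploy switches to:

bash
npx prisma migrate deploy  # applies only migrations that have not been applied yet

Local development workflow changes:

  • prisma db push → prisma migrate dev --name <feature-name>
  • Generated SQL files are committed to git, so schema changes can be reviewed in the PR
  • No manual db-migration label is needed; the SQL files themselves are the audit trail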

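Phase 5: Deploy user permission extension

ecs-deploy.sh currently runs as the deploy user (per docs/deploy/ecs-auto-deploy-runbook.md). New requirements:

  1. git pull in $SOURCE_DIR: the deploy user needs read/write access to the source directory
    • Change the source directory owner to deploy:deploy, or add an ACL
  2. docker compose -f /home/twilight/twilight-app/compose.yml: the deploy user needs access to the twilight-app directory
    • Add an ACL: setfacl -m u:deploy:rx /home/twilight/twilight-app /home/twilight/twilight-app/compose.yml
    • .env contains secrets; give the deploy user read access (docker compose must be able to read it). Note that ecs-deploy.sh also rewrites TWILIGHT_APP_VERSION in this file with sed -i, which needs write access to the file and its directory
  3. prisma migrate deploy: needs DATABASE_URL, read from the app .env
  4. pg_dump/pg_restore: ECS needs postgresql-client installed (apt install -y postgresql-client)

⚠ This is manual setup on ECS, not part of the PR; add it to the runbook.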


Phase 6.0: Pre-Deploy Checklist (external dependencies + credentials)

Added 2026-05-16, audit gaps #4/#8/#9. Before every deploy, GH Actions (or the runbook) must verify:

External dependencies reachable:

bash
curl -fsS https://llm.fsagent.cc/v1/models -H "Authorization: Bearer $NEWAPI_ADMIN_TOKEN" >/dev/null
curl -fsS https://ws.fsagent.cc/healthz >/dev/null   # SearXNG
curl -fsS https://api.tushare.pro >/dev/null         # domestic endpoint, not blocked by the GFW
docker pull nousresearch/hermes-agent:0.13.0 --quiet  # upstream image is available

If any check fails → abort early and do not run ecs-deploy.sh. Otherwise the spawn worker will hang.
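
The abort-on-first-failure rule can be sketched as a small wrapper around the checks. This is an illustration only: `preflight` is a hypothetical name, and each argument is assumed to be one complete command line that is evaluated with `sh -c`, so any variables it uses (for example `NEWAPI_ADMIN_TOKEN`) must be exported.

```shell
# Sketch: run each dependency check in order and stop at the first failure,
# so that ecs-deploy.sh is never reached when a dependency is down.
preflight() {
  local check
  for check in "$@"; do
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "preflight failed: $check" >&2
      return 1
    fi
  done
  echo "preflight ok"
}
```

A possible call is `preflight 'curl -fsS https://ws.fsagent.cc/healthz' 'curl -fsS https://api.tushare.pro' && ssh deploy@ssh-ecs.fsagent.cc "deploy $TAG"`, where `TAG` is a placeholder for the tag being deployed.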

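ECS-side env credentials table (must be listed in the runbook):

| Env | Purpose | Used for |
|---|---|---|
| WECHAT_PAY_MCHID | WeChat Pay merchant ID | Collecting payments |
| WECHAT_PAY_PRIVATE_KEY | Merchant private key (PEM) | Signing |
| WECHAT_PAY_API_V3_KEY | APIv3 key | Webhook decryption |
| WECHAT_PAY_CERT_SERIAL | Merchant certificate serial number | Signature header |
| ILINK_API_KEY | iLink bot platform token | QR + WeChat callbacks |
| NEWAPI_BASE_URL / NEWAPI_ADMIN_TOKEN | NewAPI gateway | Issuing LLM keys |
| SEARXNG_URL | https://ws.fsagent.cc | Hermes profile .env |
| DATABASE_URL | Postgres | NestJS + prisma migrate deploy |
| JWT_SECRET | Session token | Auth |

If any of these is missing → the backend boots fine but features break; frontend health still returns 200 and rollback is not triggered. This is a health-gate blind spot. The runbook must enforce a manual spot-check that all of the env vars above are present.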
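
The spot-check can be made less manual with a short script. The following is a minimal sketch under assumptions: `missing_env_keys` is a hypothetical helper, and the env file is assumed to contain plain `KEY=value` lines without an `export` prefix.

```shell
# Sketch: report keys that are absent or empty in an env file.
# Prints the missing keys separated by spaces and returns 1 if any are missing.
missing_env_keys() {
  local env_file="$1"; shift
  local key value missing=""
  for key in "$@"; do
    # last assignment wins; sed prints nothing when the key is absent
    value="$(sed -n "s/^${key}=//p" "$env_file" | tail -n 1)"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$key"
    fi
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
}
```

Run against the app .env with the keys from the table, for example `missing_env_keys /home/twilight/twilight-app/.env WECHAT_PAY_MCHID ILINK_API_KEY DATABASE_URL JWT_SECRET`, a non-zero exit could block the deploy before any image is pulled.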

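ProvisionWorker restart window:

NestJS is down for ~30s during a deploy. When a ProvisionTask in PROCESSING state is killed by SIGTERM, the required behavior is:

  • The instance-lifecycle-rebuild.md Phase 3 worker needs graceful shutdown: catch SIGTERM → UPDATE the in-flight task to status=PENDING without incrementing attempts
  • On the next boot, the worker picks the task up again automatically within 5s

This patch must land in the same release as this deploy plan. Otherwise a deploy that lands mid-flow (user has paid, scanned the QR code, spawn in progress) leaves orphaned containers and the Instance stuck in PROVISIONING forever.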


Phase 6: Runbook update

docs/deploy/ecs-auto-deploy-runbook.md

  • Add a "Step 7: App stack integration" section
  • List the manual operations on the ECS side:
    • compose.yml: add the ${TWILIGHT_APP_VERSION} placeholder
    • .env: add TWILIGHT_APP_VERSION=latest
    • Extend deploy user permissions (ACL / chown)
  • Verification: run ssh deploy@ssh-ecs.fsagent.cc 'deploy sha-abc1234' and check whether it reports invalid tag

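Out of scope

  • ❌ No multi-arch buildx: build linux/amd64 only (ECS is x86)
  • ❌ No blue-green deploy: the current single-instance architecture with health rollback is sufficient
  • ❌ No webhook listener on the ECS side: use the existing GH Actions SSH push model
  • ❌ No changes to the mcp-tushare image build: it has its own image and is handled in a separate PR
    • ⚠ ecs-deploy.sh does not pull mcp-tushare (see defect 5); it keeps running the old pinned tag (the compose MCP_TUSHARE_VERSION variable) and is included in the health gate (see Phase 3). The image version is bumped manually; the deploy pipeline only health-checks it.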

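Acceptance checklist

Original:

  • [x] PR touching only backend/** → docker-build builds and pushes the backend image ✅ (matrix merged)
  • [x] PR touching backend/** + frontend/** → both are built and pushed ✅
  • [x] PR touching only docs → docker-build is not triggered ✅
  • [ ] Deploy completes: the data, app-backend, and app-frontend containers are all healthy
  • [ ] Prisma migration applied: after deploy, SELECT * FROM "_prisma_migrations" includes the new migration
  • [ ] After git reset --hard, scripts/admin/spawn-profile.sh on the ECS host is the latest version
  • [ ] Health failure: all three containers roll back automatically to the previous version, and the schema rolls back with them (verified via pg_restore)
  • [ ] Deploy lock: two concurrent deploys → the second exits 3
  • [ ] Image cleanup: the 5 most recent sha- images are kept, the rest are removed with rmi

Added in v2 (defect fix verification):

  • [ ] Defect 1: when backend boot fails, pg_restore runs automatically, the old schema is restored, and the old image starts successfully
  • [ ] Defect 2: when git reset --hard replaces ecs-deploy.sh mid-run, the current deploy still finishes with the original version (verify: ls /tmp/twilight-deploy.*.sh exists)
  • [ ] Defect 3: docker network ls shows no duplicate twilight-app_twilight-app_default network
  • [ ] Defect 4: frontend /healthz returns 200; after deleting /usr/share/nginx/html/index.html, /healthz still returns 200 (proves it is independent)
  • [ ] Defect 5: compose pull mcp-tushare is not in the script; the mcp health check on :9100/healthz passes
  • [ ] Defect 6: backend /health returns wechatPayConfigured: ok, ilinkConfigured: ok; with ILINK_API_KEY deliberately cleared, /health returns 503
  • [ ] Defect 7: NestJS receives SIGTERM during deploy → the in-flight ProvisionTask is picked up again within 5s after restart, with attempts not incremented
  • [ ] Single service: deploy sha-xxx frontend does not restart backend or data (verify: docker inspect twilight-app-backend StartedAt is unchanged)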

Rollout order (v2)

Pre-patches (separate small PRs, merged first, blocking Phase 3):

  1. PR-A health.controller.ts: add wechatPayConfigured + ilinkConfigured env presence checks (defect 6)
  2. PR-B frontend/nginx.conf: add the /healthz location (defect 4)
  3. PR-C provision-worker.service.ts: SIGTERM graceful shutdown (defect 7)
  4. PR-D backend/prisma/: initial migration (Phase 4, option B baseline)

Main Phase 3 PR:

  5. PR-E deploy/ecs-deploy.sh v2 (7 defect fixes + single-service argument + prisma migrate deploy + pg_dump/restore)

Manual steps on ECS (Phase 2 + 5):

  6. /home/twilight/twilight-app/compose.yml: add ${TWILIGHT_APP_VERSION:-latest} + the :8082:80 port
  7. /home/twilight/twilight-app/.env: add TWILIGHT_APP_VERSION=latest
  8. deploy user ACLs: setfacl -m u:deploy:rwx "$SOURCE_DIR" + setfacl -m u:deploy:rx "$TWILIGHT_APP_HOME"
  9. Install postgresql-client on ECS (apt install -y postgresql-client)
  10. mkdir -p /var/backups/twilight && chown deploy:deploy /var/backups/twilight

Phase 6 Runbook:

  11. docs/deploy/ecs-auto-deploy-runbook.md: add Step 7 + documentation of the [service] argument

Cutover:

  12. First deploy sha-<latest> data (single service, lowest risk)
  13. Then deploy sha-<latest> backend (includes the prisma migration)
  14. Then deploy sha-<latest> frontend
  15. Once all pass, the next push to main triggers a full-stack deploy automatically


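Risks + mitigations (v2)

| Risk | Mitigation |
|---|---|
| prisma migrate deploy runs destructive SQL → data loss | Automatic pg_dump backup before deploy; migration SQL is reviewable in the PR |
| pg_restore fails → the schema stays on the new version while the image is rolled back | pg_restore failure prints a warning and requires manual intervention; keep the dump for 24h |
| Backend image boots slowly, health check times out | HEALTH_TRIES=30 × HEALTH_SLEEP=2s = 60s startup window |
| Frontend /healthz does not exist (nginx 404) | On a 404, curl -fsS returns non-zero, the health loop fails, and the deploy rolls back |
| git reset --hard overwrites ecs-deploy.sh | Fixed by the self-copy to /tmp |
| compose project name drift | Pinned with -p twilight-app |
| pg_dump still runs on single-service deploys (unnecessary) | pg_dump runs only when TARGET_SERVICE includes backend |
| GHCR pulls are slow (GFW) | ACR mirror already in place, no change |
| mcp-tushare health failure → blocks the full-stack deploy | mcp health is checked only when TARGET_SERVICE=all/data; mcp failures are fixed separately |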
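
The 60s startup window counts only the sleeps between attempts. A hedged back-of-the-envelope sketch of the bounds follows, assuming every curl call may take up to its `--max-time` of 2s as set in ecs-deploy.sh, and that a full-stack deploy runs four checks per attempt (data, mcp, backend, frontend). `health_window` is a hypothetical helper.

```shell
# Sketch: bounds on how long the health gate waits before rolling back.
# Lower bound: every check fails instantly. Upper bound: every check runs
# into its curl --max-time before failing.
health_window() {
  local tries="$1" sleep_s="$2" checks="$3" max_time="${4:-2}"
  local fast=$((tries * sleep_s))
  local slow=$((tries * (sleep_s + checks * max_time)))
  echo "rollback after ${fast}s to ${slow}s"
}

health_window 30 2 4   # full-stack deploy with the default settings
```

With the defaults this prints `rollback after 60s to 300s`, so a backend that hangs on connections rather than refusing them can delay the rollback well beyond one minute.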

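Follow-up Phase 7 (outside the scope of this plan)

  • mcp-tushare image build + auto-deploy (added to the matrix + MCP_TUSHARE_VERSION variable)
  • Full-stack metrics + alerting (Grafana + the cloudflared metrics endpoint already exist)
  • Staging environment (run a copy of the compose stack on Vultr Japan; push to main deploys to staging first, and prod is triggered only after manual approval)
  • Scheduled automatic pg_dump backups (cron, not only before deploys)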

Internal team document