Deploy pipeline — follow-up tasks

Date: 2026-05-16. Status: Open, pick up after /clear.

Continuation backlog after today's ACR cutover. Today's work landed PRs #74–#86 and produced a live twilight-data container served from ACR, deployed by the ecs-deploy.sh that carries the --env-file fix. Everything below is still TODO.

Issue references in (R#) point to entries in docs/err/2026-05-16-deploy-pipeline-acr-cutover.md.

Priority 1 — finish the auto-deploy story

T1. Symlink `/usr/local/bin/twilight-deploy` to repo path (R2)

Stops the drift between repo deploy/ecs-deploy.sh and the version actually executed by the SSH forced-command. One-shot on the ECS:

bash

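ssh root@ssh-ecs.fsagent.cc 'bash -l -s' <<'EOF'
set -euo pipefail
# Keep a timestamped backup of the currently installed script
cp /usr/local/bin/twilight-deploy /usr/local/bin/twilight-deploy.bak.$(date +%s)
# Point the forced-command entry at the repo copy
ln -sf /home/twilight/twilight/source/deploy/ecs-deploy.sh /usr/local/bin/twilight-deploy
ls -la /usr/local/bin/twilight-deploy
EOF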

Then verify that a `deploy latest` SSH call still works.
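The symlink itself can also be checked mechanically. This is a minimal sketch, assuming GNU `readlink -f` is available on the ECS host; the function name `check_deploy_link` is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if LINK is a symlink that resolves to TARGET.
check_deploy_link() {
  local link="$1" target="$2"
  local resolved expected
  # The installed path must itself be a symlink, not a stale copy
  [ -L "$link" ] || { echo "not a symlink: $link" >&2; return 1; }
  # readlink -f resolves both sides to canonical absolute paths
  resolved="$(readlink -f "$link")" || return 1
  expected="$(readlink -f "$target")" || return 1
  if [ "$resolved" = "$expected" ]; then
    echo "ok: $link -> $resolved"
  else
    echo "drift: $link -> $resolved (expected $expected)" >&2
    return 1
  fi
}
```

On the host this would be called as `check_deploy_link /usr/local/bin/twilight-deploy /home/twilight/twilight/source/deploy/ecs-deploy.sh`.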

T2. Consolidate `.env` paths (R1)

Either:

symlink /home/deploy/twilight/.env -> /home/twilight/twilight/.env and remove the duplicate file, OR
change the forced command to TWILIGHT_HOME=/home/deploy/twilight (must chown -R deploy:deploy /home/deploy/twilight/source first; source is currently a symlink into /home/twilight/twilight/source).

Recommend the symlink — minimal change, no ownership thrash.
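A minimal sketch of the symlink option, assuming the two `.env` files should be byte-identical before the duplicate is retired. The helper name `link_env` is hypothetical; it refuses to proceed when the contents differ, so nothing is silently lost.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace DUPLICATE with a symlink to CANONICAL.
link_env() {
  local canonical="$1" duplicate="$2"
  [ -f "$canonical" ] || { echo "missing: $canonical" >&2; return 1; }
  # Already consolidated: nothing to do
  if [ -L "$duplicate" ]; then
    echo "already a symlink: $duplicate"
    return 0
  fi
  if [ -f "$duplicate" ]; then
    # Refuse to discard a duplicate whose contents differ from the canonical file
    if ! cmp -s "$canonical" "$duplicate"; then
      echo "contents differ, reconcile by hand: $duplicate" >&2
      return 1
    fi
    # Keep a timestamped backup rather than deleting outright
    mv "$duplicate" "$duplicate.bak.$(date +%s)"
  fi
  ln -s "$canonical" "$duplicate"
}
```

For this task the call would be `link_env /home/twilight/twilight/.env /home/deploy/twilight/.env`, run as root on the ECS.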

T3. Make `source/` a real git repo (R3)

Needed for Phase 3 (the git pull step in app-stack-auto-deploy.md).

bash

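ssh root@ssh-ecs.fsagent.cc 'bash -l -s' <<'EOF'
set -euo pipefail
cd /home/twilight/twilight/source
sudo -u twilight git init
sudo -u twilight git remote add origin https://github.com/LaCatFly/twilight-drive.git
sudo -u twilight git fetch --quiet origin main
# Overwrite the extracted tree with the exact contents of origin/main
sudo -u twilight git reset --hard origin/main
# Track origin/main so that a bare `git pull` works afterwards
sudo -u twilight git branch --set-upstream-to=origin/main
# the deploy user needs read on .git too (X keeps directories traversable):
chgrp -R twilight /home/twilight/twilight/source/.git
chmod -R g+rX /home/twilight/twilight/source/.git
EOF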

After this, git pull is the source-update mechanism on the host.

T4. Phase 3 — extend `ecs-deploy.sh` for the app stack (R4)

Already drafted in 2026-05-16-app-stack-auto-deploy.md (Phase 3 section). Has been blocked on T1+T3. Once they land:

pull data + app-backend + app-frontend in parallel,
run prisma db push --skip-generate for the NestJS schema,
4-service health gate (data + backend + frontend + mcp-tushare),
all-or-nothing rollback.
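The parallel pull and the all-or-nothing gate listed above could be structured roughly as follows. This is a sketch only, not the drafted Phase 3 script: `PULL_CMD`, `pull_all` and `deploy_or_rollback` are hypothetical names, and the start, health and rollback steps are passed in as caller-defined functions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pull several images concurrently, fail if any pull fails.
# PULL_CMD is an assumption; override it to match the real script.
PULL_CMD="${PULL_CMD:-docker pull}"

pull_all() {
  local pids=() img pid failed=0
  for img in "$@"; do
    # Start each pull as a background job and remember its PID
    $PULL_CMD "$img" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    # wait <pid> returns that job's exit status
    wait "$pid" || failed=1
  done
  return "$failed"
}

deploy_or_rollback() {
  # Callers supply the names of their own start, health and rollback functions
  local start="$1" health="$2" rollback="$3"
  if "$start" && "$health"; then
    echo "deploy ok"
  else
    echo "deploy failed, rolling back" >&2
    "$rollback"
    return 1
  fi
}
```

A call might look like `pull_all "$REGISTRY/twilight-data:$TAG" "$REGISTRY/app-backend:$TAG" "$REGISTRY/app-frontend:$TAG"`, where `REGISTRY`, `TAG` and the repository names are placeholders. Waiting on each PID individually is what lets a single failed pull abort the deploy before any container is restarted.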

The Pre-Deploy Checklist (Phase 6.0 in the same doc) is also still unimplemented. Externally-dependent checks (NewAPI reachability, SearXNG, Hermes upstream image) belong in the GH Action, not in the SSH-executed script.

T5. Add mcp-tushare to the matrix or formally document opt-out (R5)

Either:

add a fourth row to the docker-build.yml matrix (name: mcp-tushare, dockerfile: mcp/tushare_mcp/Dockerfile, context: ., repo: twilight-mcp-tushare) and wire it through to the compose image: line, OR
write a one-line ops doc explicitly saying "mcp-tushare is hand-built and hand-tagged; bump it by editing /home/twilight/twilight/.env's MCP_TUSHARE_VERSION= line, then compose up -d mcp-tushare".

Pick one. Current state is "neither", which is what produced the "deploys mcp-tushare but doesn't build it" gap (R5).

Priority 2 — hygiene

T6. Secret-rotation helper (R8)

scripts/deploy/secrets-rotate.sh. Input: which secret to rotate (ACR fixed password / GHCR PAT / NewAPI admin pwd / etc.). Steps:

update macOS keychain entry,
gh secret set the new value,
ssh to ECS, refresh ~/.docker/config.json and any .env line.

Reduces the multi-hop manual sync that the cutover exposed.
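One possible skeleton for the helper, as a dry-run-first sketch. The function names, the `DRY_RUN` switch and the example secret name are assumptions; only the keychain and `gh secret set` steps are shown, because the ECS refresh step depends on which secret is being rotated.

```shell
#!/usr/bin/env bash
# Hypothetical skeleton for scripts/deploy/secrets-rotate.sh.
# DRY_RUN=1 (the default here) prints each step instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

rotate_secret() {
  local keychain_item="$1" gh_secret="$2" new_value="$3"
  # In dry-run mode, keep the secret itself out of the printed commands
  local shown="$new_value"
  if [ "$DRY_RUN" = "1" ]; then shown="<redacted>"; fi
  # 1. macOS keychain: -U updates the item if it already exists
  run security add-generic-password -U -s "$keychain_item" -a "$USER" -w "$shown"
  # 2. GitHub Actions secret for the current repo
  run gh secret set "$gh_secret" --body "$shown"
  # 3. ECS refresh (~/.docker/config.json, .env line) is secret-specific
  #    and intentionally left out of this sketch
}
```

For example, `rotate_secret aliyun-acr-twilight-drive ACR_PASSWORD "$NEW_PW"` prints the two commands it would run; `ACR_PASSWORD` is a made-up GitHub secret name. Passing the value on the command line exposes it to the local process list, so a real script may prefer stdin-based variants.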

T7. `core (ruff + pytest)` is red on main (R7)

Either restore core.data or delete the orphaned tests. Every PR currently fails this check identically, so a red result carries no signal. gh pr merge --squash ignored the UNSTABLE status, so 10 PRs went in anyway. Don't let that become permanent: fix or remove the test.

T8. Unlock local `~/.docker/config.json` (R6)

chflags nouchg ~/.docker/config.json (or undo whatever else set the lock). Document the fix in docs/deploy/acr-mirror-setup.md so future manual.sh users do not waste an hour on the DOCKER_CONFIG=/tmp/... workaround.

T9. Audit-trail entry for the ACR password in keychain

The memory file secrets_keychain.md does not yet list aliyun-acr-twilight-drive. Update it on the next memory pass.

Priority 3 — adjacent work that the cutover deferred

T10. Hermes lifecycle rebuild

Stays open. Plan: 2026-05-16-instance-lifecycle-rebuild.md. Pre-reqs all merged (PR #79 health, PR #80 mcp healthz, PR #81 worker graceful). Next concrete step:

Phase 1 Prisma schema migration (destructive — needs PR-review + pg_dump red-line per app-stack-auto-deploy.md Phase 4).
Then phases 2/3/4 (worker rewrite, service-runtime, frontend state machine).

T11. Hermes image pin enforcement on host

PR #75 merged the :0.13.0 pin into scripts/admin/spawn-profile.sh. Once T3 (git repo on ECS) is done, the host script will track the pin automatically. Until then, the ECS still has whatever was last extracted from the install tarball — verify by reading /home/twilight/twilight/source/scripts/admin/spawn-profile.sh and re-installing if needed.

T12. ACR repo retention policy

Personal Edition has a 5GB/day egress quota and storage caps. With daily-ish merges of three images, the registry will fill up. Add a weekly cron (or GH Action) that deletes ACR tags older than N days, keeping only latest, main, and the last few sha-*. Aliyun CLI supports this; can be a workflow step.
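The selection rule (keep latest, main and the newest few sha-* tags, delete the rest once older than N days) can be sketched independently of the registry API. The function below is hypothetical and assumes the tag listing has already been reduced to `TAG EPOCH_SECONDS` lines; fetching that listing and issuing the deletes via the Aliyun CLI are left out because the exact commands are not given here.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads "TAG EPOCH_SECONDS" lines on stdin and prints
# the tags that are safe to delete.
#   $1 = max age in days, $2 = number of newest sha-* tags to keep,
#   $3 = "now" as epoch seconds (defaults to the current time)
select_stale_tags() {
  local max_days="$1" keep_sha="$2" now="${3:-$(date +%s)}"
  # Newest first, so the first keep_sha sha-* rows are the ones to protect
  sort -k2,2nr |
    awk -v now="$now" -v max="$max_days" -v keep="$keep_sha" '
      $1 == "latest" || $1 == "main"      { next }  # always kept
      ($1 ~ /^sha-/) && (++seen <= keep)  { next }  # newest few sha-* kept
      (now - $2) > max * 86400            { print $1 }
    '
}
```

The output is one tag per line, so it can be piped into whatever delete step the cron job or workflow ends up using.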

How to resume after `/clear`

Read this file plus docs/err/2026-05-16-deploy-pipeline-acr-cutover.md.
Check git log origin/main --oneline -20: PRs #74–#86 should all be in.

Health-check live state:

bash

curl -fsS https://api.fsagent.cc/healthz                                # data
curl -fsS https://backend.fsagent.cc/health                             # backend
curl -fsS https://app.fsagent.cc/                                       # frontend

Pick the highest-priority unfinished task (T1 if no other input).
The ACR password is in macOS keychain under aliyun-acr-twilight-drive; CF Access service token is under twilight-cf-access-deploy; SSH key is ~/.ssh/alibaba-ecs (root) or ~/.ssh/twilight-ecs-deploy (deploy user — restricted to deploy <tag> via forced command).
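The three health checks in the resume steps above can be wrapped in a small retry loop so that a service still starting up is not reported as down. This is a sketch under assumptions: the helper names `wait_healthy` and `check_all`, the retry count and the delay are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll URL until curl succeeds or ATTEMPTS run out.
wait_healthy() {
  local url="$1" attempts="${2:-5}" delay="${3:-3}" i
  for ((i = 1; i <= attempts; i++)); do
    # -f: fail on HTTP >= 400, -sS: quiet but still print errors
    if curl -fsS -o /dev/null --max-time 10 "$url"; then
      echo "healthy: $url"
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if ((i < attempts)); then sleep "$delay"; fi
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}

check_all() {
  # Check every URL even if an earlier one fails, then report overall status
  local url failed=0
  for url in "$@"; do
    wait_healthy "$url" || failed=1
  done
  return "$failed"
}
```

Usage: `check_all https://api.fsagent.cc/healthz https://backend.fsagent.cc/health https://app.fsagent.cc/`.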
