Root cause: Dual-source architecture for owner password (Gitea secret
ENV_OWNER_PASSWORD vs host .env OWNER_PASSWORD) caused drift when
the DB was ever re-seeded or the volume recreated.
Changes:
- Add SeedAudit entity + migration to track one-time seed operations
- EnsureDatabaseAsync checks SeedAudit BEFORE seeding — owner is never
re-created even if the Users table is wiped
- Deploy and rollback workflows now read OWNER_PASSWORD from the host's
persistent .env (single source of truth) instead of Gitea secrets
- compose.yaml documented: OWNER_PASSWORD only used during initial seed
- Cleanup: .gitignore extended for core dumps, changelog/deployment.md
updated with 2026-06-20 session notes
After this fix the DB is the single source of truth for the owner
password after initial seed. The host .env is the single reference
for the initial value.
The inner shell script run via docker:cli had complex escaping
that caused 'unterminated quoted string' errors at runtime.
Moved the deploy logic to an external script file (heredoc in
the workflow YAML), mounted read-only into the docker:cli
container. Pass BUILD_ARGS and SERVICE via environment
variables instead of shell interpolation.
- Postgres memory: 256M→384M limits, 64M→96M reservations
- Added pg_resetwal -f pre-deploy step to recover from corrupt WAL
('PANIC: could not locate a valid checkpoint record' caused by
force-killed postgres during --force-recreate)
- Added data-checksums initdb arg for future corruption detection
- api→postgres and web→api depends_on: service_healthy→service_started
- Deploy wait loop: fail fast on unhealthy, wait on starting (180s)
- Added researcher/executor to ValidAssignees and frontend dropdowns
- web's depends_on on api: change from service_healthy to
service_started+restart (same as api→postgres fix)
- deploy wait loop: fail fast on unhealthy, wait on starting,
increased timeout to 180s (36×5s)
The docker compose --wait flag times out before postgres can
become healthy (start_period=30s). Replaced with explicit
poll loop (5s interval, up to 120s) that checks ps output
for unhealthy/starting states.
Swagger (/swagger) is only enabled in Development mode (Program.cs
gates it behind app.Environment.IsDevelopment()). In production,
nginx serves the frontend catch-all (index.html), so the check
always returns 200 but never actually validates the API layer.
/health already covers API + database + runtime health checks.
No replacement endpoint needed — the smoke test still validates
both the dashboard and the backend API via /health.
Iteration 2 fix: /api/swagger → /swagger (correct ASP.NET default).
Iteration 3 — Concurrency guard:
- concurrency group 'deploy-production': ensures only one deploy
runs at a time (cancel-in-progress: false so queued deploys
wait instead of being cancelled).
- Why: prevents race conditions when CI-triggered workflow_run
and manual workflow_dispatch overlap. Without this, parallel
deploys could corrupt docker compose state or conflict on
shared resources (ports, volumes, version tags).
Iteration 2 — Deploy robustness:
- Health check: Fibonacci-ish backoff (1,2,3,5,8,13s) instead of fixed
5s intervals. Why: containers need variable warmup time; fixed intervals
either wait too long or give up too early. Total budget ~32s vs 30s before.
- Smoke test: now checks /dashboard, /health, and /api/swagger. Why: a
single endpoint check can miss backend-only outages; API Swagger confirms
the ASP.NET layer is healthy.
- Rollback hint: on any failure, prints previous git tag + docker compose
commands for quick manual rollback. Why: reduces MTTR by providing the
exact recovery steps inline.
The \$ escape before ${{ inputs.service }} prevented Gitea from
evaluating the expression, passing literal backslash to the shell.
Also use ${BUILD_ARGS} (shell expansion) instead of \$BUILD_ARGS
so the outer shell passes the actual build args to the DIND container.
Phase 1 — .env provisioning fix:
The previous approach tried to write .env directly to
/opt/openclaw/data/openclaw/workspace/nexus from inside the
runner's job container, but that host path is not mounted there.
Fix: write .env from Gitea secrets into the workspace first,
then sync it along with the source code via the existing
Docker-in-Docker pattern (which can access the host path).
Combined the separate '.env creation' and 'sync code' steps
into a single atomic 'Sync code + .env to host' step.
Phase 1 — Deploy reliability:
- Version bump: derive current version from 'git describe --tags' instead of
VERSION file. This eliminates race conditions where the VERSION file is
stale but the tag already exists from a previous failed run.
- Tag creation: use 'git tag -f' + 'git push --force --tags' to handle
retries gracefully when tags already exist.
- Environment: provision .env at the host deploy path from Gitea secrets
(ENV_POSTGRES_PASSWORD, ENV_JWT_KEY, ENV_OWNER_PASSWORD, ENV_OPENCLAW_TOKEN).
This ensures .env always exists on the host even though it's excluded from
the sync step for security.
Runner label was already fixed in previous commit (runs-on: ubuntu-latest).
The runner registers with labels [linux, dotnet, node, ubuntu-latest, ...]
but did not include 'deploy'. Changed workflow to use the consistently
available ubuntu-latest label. Also added 'deploy' label to the runner
registration for future compatibility.
Runner job containers don't have the /workspace/nexus mount.
- Sync code to host path using a docker run helper (preserves .env)
- Build & deploy from host path using docker:cli image
- Health check with retry loop for slow container startup
The runner job container does not have /workspace/nexus mounted.
Run everything from the checkout directory which has .git and compose.yaml.
- Removed rsync sync step (not needed)
- Version bump uses checkout dir with full git history
- Docker compose runs from checkout dir
- Added fetch-depth:0 and fetch-tags for version tagging
The Gitea runner ubuntu-latest image lacks rsync, causing
the Sync-to-deploy-path step to fail with exit code 127.
Added apt-get install rsync before the sync step.
- deploy.yaml now triggers automatically after successful CI completion
- Adds workflow_run event listener for 'CI - Build & Test'
- Guards deploy to only run when CI conclusion == success
- Preserves manual workflow_dispatch for targeted deploys
- Adds CI/CD note to README