# Production Deployment Plan

**Project:** Knowledge Academy courseware PPTX generator
**Branch:** v3
**Author of this plan:** drafted 2026-05-21
**Execution status:** not started — execute sprint-by-sprint at your own pace

---

## Decisions locked in

| Decision | Choice |
|---|---|
| Auth model | Local username + password (bcrypt) |
| Database | SQLite (via Flask-SQLAlchemy) |
| S3 scope | Static banks + generated outputs |
| Job visibility | Shared — every logged-in user sees all jobs |
| Roles | Single `is_admin` boolean on `User` |

---

## Architecture (final state)

```
┌─────────────────────────────────────────────────────────────────────┐
│  Server (Windows / Linux, single Flask process)                     │
│                                                                     │
│  /opt/courseware-generator/        ← code (from git)                │
│  ├── ppt_generator/                                                 │
│  ├── web/                                                           │
│  ├── scripts/                                                       │
│  ├── tagged_bank/                                                   │
│  ├── drill_down_bank/  (catalogue only — PPT comes from S3)         │
│  └── requirements.txt                                               │
│                                                                     │
│  /var/lib/courseware/              ← persistent volume              │
│  ├── banks/                          ← synced FROM S3 at boot       │
│  │   ├── tagged_infographics/                                       │
│  │   ├── drill_down_cards/                                          │
│  │   ├── sample_templates/                                          │
│  │   └── images/                                                    │
│  ├── uploads/                        ← short-lived, local           │
│  │   ├── templates/                                                 │
│  │   └── outlines/                                                  │
│  ├── tmp_outputs/                    ← scratch before S3 upload     │
│  └── courseware.db                   ← SQLite                       │
└─────────────────────────────────────────────────────────────────────┘
                  │                              │
                  ▼                              ▼
        ┌──────────────────┐          ┌──────────────────┐
        │  S3 bucket        │          │  Anthropic API   │
        │  /banks/...       │          │                  │
        │  /outputs/<id>.pptx│         └──────────────────┘
        └──────────────────┘
```

State flows:
- **Banks:** local → S3 (one-time/manual upload via script). Server pulls from S3 at boot, caches locally, reads from local disk per request.
- **Generated outputs:** built locally in `tmp_outputs/`, uploaded to S3 on completion, S3 key stored in DB, served to user via presigned URL.
- **User uploads (templates, outlines):** stay on local disk, consumed during build, can be cleaned up after job completes.
- **State (history, activity, tokens, users):** SQLite only.

---

## Sprint 1 — SQLite + migrate state out of JSON files

### Goal
Every JSON state file becomes a DB table. App behaves identically to today; just internally backed by SQLite. No user-visible change.

### Files to create

- [ ] `web/db.py` — SQLAlchemy engine + session factory
- [ ] `web/models.py` — `User`, `Job`, `DailyActivity`, `TokenUsage` models
- [ ] `scripts/init_db.py` — create tables + import existing JSON data
- [ ] `scripts/wipe_db.py` — convenience for dev (with confirmation prompt)

### Files to modify

- [ ] `requirements.txt` — add `Flask-SQLAlchemy>=3.1.0` and `SQLAlchemy>=2.0`
- [ ] `web/app.py` — replace JSON reads/writes with DB queries (see below)
- [ ] `ppt_generator/engine/token_tracker.py` — write to `TokenUsage` table instead of `tokens.json`

### Schema (SQLite, via SQLAlchemy)

```python
# web/models.py
from datetime import datetime
from web.db import Base
from sqlalchemy import Column, Integer, String, Boolean, DateTime, Float, ForeignKey, Date
from sqlalchemy.orm import relationship

class User(Base):
    __tablename__ = "users"
    id            = Column(Integer, primary_key=True)
    email         = Column(String(255), unique=True, nullable=False, index=True)
    name          = Column(String(255), nullable=False)
    password_hash = Column(String(255), nullable=False)  # bcrypt
    is_admin      = Column(Boolean, default=False, nullable=False)
    created_at    = Column(DateTime, default=datetime.utcnow, nullable=False)
    is_active     = Column(Boolean, default=True, nullable=False)
    last_login_at = Column(DateTime, nullable=True)

class Job(Base):
    __tablename__ = "jobs"
    id            = Column(String(8), primary_key=True)   # matches current job_id format
    user_id       = Column(Integer, ForeignKey("users.id"), nullable=True)  # nullable for sprint 1 migration; set NOT NULL in sprint 2
    course_title  = Column(String(500), nullable=False, default="")
    template_name = Column(String(500), nullable=True)
    duration_days = Column(Integer, nullable=True)
    language      = Column(String(8), nullable=True)
    status        = Column(String(32), nullable=False)    # queued|running|complete|failed
    progress      = Column(Integer, nullable=False, default=0)
    log_message   = Column(String, nullable=True)
    slide_count   = Column(Integer, nullable=True)
    qa_score      = Column(Integer, nullable=True)
    qa_verdict    = Column(String(16), nullable=True)
    duration_sec  = Column(Integer, nullable=True)
    output_s3_key = Column(String(500), nullable=True)    # populated in sprint 3
    output_local  = Column(String(500), nullable=True)    # filled by sprint 1; deprecate in sprint 3
    created_at    = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)
    completed_at  = Column(DateTime, nullable=True)
    user          = relationship("User")

class DailyActivity(Base):
    __tablename__ = "daily_activity"
    date          = Column(Date, primary_key=True)
    courses       = Column(Integer, nullable=False, default=0)
    slides        = Column(Integer, nullable=False, default=0)

class TokenUsage(Base):
    __tablename__ = "token_usage"
    id            = Column(Integer, primary_key=True)
    job_id        = Column(String(8), ForeignKey("jobs.id"), nullable=True, index=True)
    timestamp     = Column(DateTime, default=datetime.utcnow, nullable=False, index=True)
    model         = Column(String(64), nullable=False)
    input_tokens  = Column(Integer, nullable=False, default=0)
    output_tokens = Column(Integer, nullable=False, default=0)
    cache_creation_tokens = Column(Integer, nullable=False, default=0)
    cache_read_tokens     = Column(Integer, nullable=False, default=0)
```

### Tasks

- [ ] Create `web/db.py` with `engine = create_engine(os.environ["DATABASE_URL"], ...)`, `SessionLocal`, `Base`. Default `DATABASE_URL` to `sqlite:///./courseware.db` for dev.
- [ ] Create `web/models.py` with the four models above.
- [ ] Add `Flask-SQLAlchemy>=3.1.0` and `bcrypt>=4.0.0` to `requirements.txt`. (bcrypt added now so Sprint 2 doesn't re-touch this.)
- [ ] Wire SQLAlchemy into `web/app.py`: `db.init_app(app)` after `Flask(__name__)`. Use `flask.g` or scoped sessions for per-request DB access.
- [ ] Write `scripts/init_db.py`:
  - [ ] `Base.metadata.create_all(engine)`
  - [ ] Read `web/data/history.json`, insert each entry as a `Job` row (user_id=None for now)
  - [ ] Read `web/data/activity.json`, insert each day as a `DailyActivity` row
  - [ ] Read `web/data/tokens.json`, insert each record as a `TokenUsage` row
  - [ ] Print summary of imported counts
- [ ] Replace JSON file reads in `web/app.py`:
  - [ ] `/api/history` → `db.session.query(Job).order_by(Job.created_at.desc()).limit(100).all()`
  - [ ] `/api/stats` → `db.session.query(DailyActivity).order_by(DailyActivity.date).all()`
  - [ ] `/api/job/<id>` → `db.session.get(Job, id)`
- [ ] Replace JSON file writes:
  - [ ] At `/api/generate` start → `INSERT INTO jobs (id, status='queued', ...)`
  - [ ] During pipeline progress callback → `UPDATE jobs SET status, progress, log_message`
  - [ ] On pipeline complete → `UPDATE jobs SET status='complete', completed_at, slide_count, qa_score, output_local`
  - [ ] On pipeline failure → `UPDATE jobs SET status='failed', log_message`
  - [ ] In daily-activity bump → `INSERT OR UPDATE INTO daily_activity` (use SQLite's `ON CONFLICT`)
- [ ] Modify `ppt_generator/engine/token_tracker.py`:
  - [ ] In the patched `messages.create` wrapper, instead of appending to `tokens.json`, open a short DB session and INSERT into `token_usage`.
  - [ ] Get `job_id` via a thread-local set by the pipeline at job start.

### Verification

- [ ] Run `python scripts/init_db.py`. Confirm `courseware.db` is created and contains imported rows. Print: `jobs=N, activity=M, tokens=K`.
- [ ] Start server. Open dashboard. Confirm: history table still shows existing data; activity chart still renders; today's snapshot still accurate.
- [ ] Trigger a new generation. Confirm: new Job row inserted, status flips through queued→running→complete, all JSON behavior gone from the code path.
- [ ] Confirm: `web/data/history.json`, `web/data/activity.json`, `web/data/tokens.json` are NOT updated by the new generation (only the DB is).
- [ ] Move the JSON files to `web/data/_legacy/` as a backup. Do not delete yet.

### Rollback strategy

If Sprint 1 misbehaves: stop the server, copy `web/data/_legacy/*.json` back to `web/data/`, revert the code changes. Old behavior restored. The DB file can be deleted; no data loss because JSON files are preserved.

---

## Sprint 2 — Username/password auth

### Goal
All `/api/*` endpoints require login. Admin can create users.

### Files to create

- [ ] `web/auth.py` — login/logout routes, `@login_required` decorator, `@admin_required` decorator
- [ ] `web/templates/login.html` — login form
- [ ] `scripts/create_admin.py` — bootstrap the first admin user

### Files to modify

- [ ] `requirements.txt` — bcrypt was already added in Sprint 1
- [ ] `web/app.py` — register auth blueprint, add `@login_required` to every `/api/*` route, add `user_id` to every `Job` insert
- [ ] `web/static/js/app.js` — handle 401 responses by redirecting to `/login`; show logged-in user name in header
- [ ] `web/templates/index.html` — add user/logout to header
- [ ] `.env` (and `.env.example`) — new `SESSION_SECRET`, optional `INITIAL_ADMIN_EMAIL`, `INITIAL_ADMIN_PASSWORD`

### Tasks

- [ ] Set Flask session secret from env: `app.secret_key = os.environ["SESSION_SECRET"]`. Generate a random 32-byte hex for prod.
- [ ] Implement `bcrypt` password hashing in `User` (helper methods `set_password()`, `check_password()`).
- [ ] Implement `web/auth.py`:
  - [ ] `POST /login` — verify email+password, set `session["user_id"]`, update `last_login_at`
  - [ ] `POST /logout` — clear session
  - [ ] `GET /login` — render login form
  - [ ] `login_required` decorator: check `session["user_id"]`, load user, store on `flask.g.user`; return 401 if missing
  - [ ] `admin_required` decorator: requires login + `g.user.is_admin`
- [ ] Admin user-management endpoints (admin-only):
  - [ ] `POST /api/users` — create user (email, name, password, is_admin)
  - [ ] `POST /api/users/<id>/password` — reset password
  - [ ] `POST /api/users/<id>/disable` — set is_active=False
  - [ ] `GET /api/users` — list users (without password_hash)
- [ ] Apply `@login_required` to every `/api/*` endpoint EXCEPT `/login`, `/logout`. Test with curl that unauthenticated requests get 401.
- [ ] At `/api/generate`, set `user_id=g.user.id` on the new Job row.
- [ ] Mark `Job.user_id` `NOT NULL` going forward (new jobs only; existing migrated jobs keep NULL).
- [ ] Write `scripts/create_admin.py`:
  - [ ] If `INITIAL_ADMIN_EMAIL`+`INITIAL_ADMIN_PASSWORD` env vars set, create admin user.
  - [ ] Otherwise prompt interactively for email/name/password.
- [ ] Frontend (`web/static/js/app.js`):
  - [ ] Wrap every `fetch()` of `/api/*` to redirect to `/login?next=…` on 401.
  - [ ] On boot, call `GET /api/me` to get logged-in user name, display in header.
  - [ ] Add a "Logout" link that POSTs to `/logout`.
- [ ] Frontend (`web/templates/index.html`):
  - [ ] Add header element for `<span id="user-name"></span>` + logout button.
  - [ ] Hide the user-management UI behind an `is_admin` flag (admin tab in nav, only rendered for admins).

### Verification

- [ ] Run `python scripts/create_admin.py` → confirm admin user exists in DB.
- [ ] Hit `/api/history` without a session → 401.
- [ ] Log in via browser, hit `/api/history` → 200, same data.
- [ ] Create a non-admin user. Log in as them. Confirm: can generate, can see history (shared visibility). Cannot access `/api/users` (403).
- [ ] Confirm: logout clears session, dashboard becomes inaccessible without re-login.

### Things to watch

- **Session timeout vs build duration.** A build takes ~10–12 minutes. Set session lifetime to at least 8 hours (a working day) so users aren't logged out mid-build. Or use sliding sessions.
- **CSRF protection.** Once you add login, `/api/generate` becomes a CSRF target. Either use Flask-WTF / Flask-SeaSurf, or check Origin/Referer headers on POSTs.

---

## Sprint 3 — S3 for banks + generated outputs

### Goal
- Banks live in S3; server syncs them locally at boot.
- Generated PPTXs upload to S3 on completion; download endpoint serves presigned URLs.

### Files to create

- [ ] `scripts/upload_banks_to_s3.py` — one-time/periodic: push local `tagged infographics/`, `drill_down_bank/*.pptx`, `sample template/`, `images/` → S3
- [ ] `scripts/sync_banks_from_s3.py` — boot-time: pull S3 banks → local `BANKS_ROOT`
- [ ] `web/s3.py` — boto3 client wrapper, `upload_pptx()`, `presigned_get_url()`

### Files to modify

- [ ] `requirements.txt` — add `boto3>=1.34.0`
- [ ] `web/app.py` — replace bank/template/image path constants with env-configurable values; download endpoint returns presigned URL instead of `send_file`
- [ ] `ppt_generator/pipeline.py:589-591` — `bank` and `catalogue_def` paths read from env
- [ ] `ppt_generator/engine/slide_builder.py:163` — `drill_dir` read from env
- [ ] `ppt_generator/run_full_course.py:19-21` — paths from env (consistency, even though test-only)
- [ ] `.env` — new vars (see below)

### Environment variables (Sprint 3)

```
S3_BUCKET=knowledge-courseware-prod
AWS_REGION=eu-west-2
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
S3_OUTPUTS_PREFIX=outputs/
S3_BANKS_PREFIX=banks/
BANKS_ROOT=/var/lib/courseware/banks
TAGGED_BANK_DIR=${BANKS_ROOT}/tagged_infographics/new_labeled_grouped_infographics
DRILL_DOWN_BANK_DIR=${BANKS_ROOT}/drill_down_cards
SAMPLE_TEMPLATES_DIR=${BANKS_ROOT}/sample_templates
ILLUSTRATIONS_DIR=${BANKS_ROOT}/images
PRESIGNED_URL_TTL=3600
```

(Prod should use an IAM role if the server is on EC2, instead of access-key env vars.)

### Tasks

- [ ] Write `scripts/upload_banks_to_s3.py`:
  - [ ] Use `boto3.client("s3").upload_file()`
  - [ ] Walk local `tagged infographics/`, `drill_down_bank/*.pptx`, `sample template/`, `images/` (entire tree, including topic subfolders)
  - [ ] Set S3 keys under `S3_BANKS_PREFIX`
  - [ ] Print summary: N files, total MB uploaded
  - [ ] Skip files that already exist with same ETag (idempotent re-runs)
- [ ] Run the upload script once, verify in AWS console that files are present
- [ ] Write `scripts/sync_banks_from_s3.py`:
  - [ ] Use `boto3.client("s3").download_file()` for each object under `S3_BANKS_PREFIX`
  - [ ] Target `BANKS_ROOT`, preserve subfolder structure
  - [ ] Print summary: N files, total MB downloaded, MB skipped (already present)
  - [ ] Idempotent: skip if local file's mtime/size matches S3 metadata
- [ ] Wire boot-time sync: either via a startup script (`bash start.sh` runs sync then `python -m web.app`), or call the sync function from `web/app.py` before `app.run()`.
- [ ] Replace hardcoded bank paths:
  - [ ] In `web/app.py:51-52`, read from `os.environ["TAGGED_BANK_DIR"]` and the catalogue path becomes `os.environ.get("TAGGED_CATALOGUE_PATH", "tagged_bank/tagged_catalogue.json")`.
  - [ ] In `ppt_generator/pipeline.py:589-591`, same.
  - [ ] In `ppt_generator/engine/slide_builder.py:163`, `drill_dir = Path(os.environ["DRILL_DOWN_BANK_DIR"])`.
  - [ ] In `_load_illustration_images` callsite — `images_dir = os.environ.get("ILLUSTRATIONS_DIR", default)`.
- [ ] For generated outputs:
  - [ ] After build completes, upload `tmp_outputs/<job_id>.pptx` to `s3://<bucket>/<prefix>/<job_id>.pptx`
  - [ ] Store key in `Job.output_s3_key`
  - [ ] Delete local copy from `tmp_outputs/` (keep DB key as source of truth)
- [ ] Change download endpoint:
  - [ ] `GET /api/download/<id>` → look up `Job.output_s3_key` → generate presigned URL (TTL from env) → return as JSON `{"url": "..."}` OR HTTP 302 redirect.
  - [ ] Frontend in `app.js` handles either pattern.
- [ ] Delete the now-unused `web/uploads/outputs/` directory from local disk in deploy script.

### Verification

- [ ] Banks: delete local `BANKS_ROOT`, restart server, confirm sync downloads them and a generation still works.
- [ ] Banks: change a bank PPT locally, upload via `upload_banks_to_s3.py`, restart server, confirm sync picks up the change.
- [ ] Outputs: trigger a generation. Confirm PPTX appears in S3 at `s3://<bucket>/outputs/<job_id>.pptx`. Click download in dashboard, confirm file downloads correctly via presigned URL.
- [ ] Outputs: confirm `tmp_outputs/<job_id>.pptx` is gone (cleanup worked).
- [ ] Outputs: confirm `Job.output_local` is empty for new jobs; `Job.output_s3_key` is populated.
- [ ] Confirm: pulling down old (pre-Sprint 3) jobs still works — fall back to `output_local` when `output_s3_key` is null.

---

## Configuration summary (final `.env.example`)

```
# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Flask
SESSION_SECRET=<32+ random bytes hex>
FLASK_ENV=production

# Database (Sprint 1)
DATABASE_URL=sqlite:////var/lib/courseware/courseware.db

# Admin bootstrap (Sprint 2 — used by scripts/create_admin.py)
INITIAL_ADMIN_EMAIL=admin@theknowledgeacademy.com
INITIAL_ADMIN_PASSWORD=<changeme on first login>

# S3 (Sprint 3)
S3_BUCKET=knowledge-courseware-prod
S3_BANKS_PREFIX=banks/
S3_OUTPUTS_PREFIX=outputs/
AWS_REGION=eu-west-2
AWS_ACCESS_KEY_ID=...    # omit if using IAM role
AWS_SECRET_ACCESS_KEY=...# omit if using IAM role
PRESIGNED_URL_TTL=3600

# Path overrides (Sprint 3)
BANKS_ROOT=/var/lib/courseware/banks
TAGGED_BANK_DIR=${BANKS_ROOT}/tagged_infographics/new_labeled_grouped_infographics
DRILL_DOWN_BANK_DIR=${BANKS_ROOT}/drill_down_cards
SAMPLE_TEMPLATES_DIR=${BANKS_ROOT}/sample_templates
ILLUSTRATIONS_DIR=${BANKS_ROOT}/images
```

---

## Deployment workflow (end-to-end, post Sprint 3)

**Initial deploy:**

```bash
# 1. Provision server with persistent volume mounted at /var/lib/courseware
# 2. Clone code
git clone <repo> /opt/courseware-generator
cd /opt/courseware-generator

# 3. Set up Python env
python -m venv .venv
.venv\Scripts\activate    # Windows; or `source .venv/bin/activate`
pip install -r requirements.txt

# 4. Provision .env (copy from .env.example, fill in real values)
cp .env.example /var/lib/courseware/.env
# edit secrets

# 5. Init DB
python scripts/init_db.py

# 6. Create admin user
python scripts/create_admin.py

# 7. Upload banks to S3 (one-time, from your laptop)
python scripts/upload_banks_to_s3.py

# 8. Sync banks down to server
python scripts/sync_banks_from_s3.py

# 9. Start server
python -m web.app          # or via systemd / Windows Service
```

**Subsequent deploys (code only):**

```bash
git pull
pip install -r requirements.txt    # in case deps changed
# restart server
```

**Bank update (e.g., new illustrations added):**

```bash
# Locally:
python scripts/upload_banks_to_s3.py
# Server (on next restart, sync re-runs; or trigger manually):
python scripts/sync_banks_from_s3.py
```

---

## Things to watch — gotchas and risks

1. **Session length vs build time.** A pipeline run is ~10–12 minutes. Make Flask session lifetime ≥ 8 hours, or implement a "keep alive" ping from the polling endpoint.

2. **SQLite concurrent writes.** SQLite serializes writes. With multiple concurrent generations (the team triggering builds in parallel), token_usage inserts may briefly block. Acceptable at small scale; if it bites, switch to WAL mode (`PRAGMA journal_mode=WAL`).

3. **bcrypt cost.** Default cost 12 is ~250 ms per password check on a typical CPU. Fine for login but DO NOT call `check_password` in a loop.

4. **CSRF on /api/generate.** Once auth is on, browsers can be tricked into POSTing on a logged-in user's behalf. Use Flask-WTF's CSRF token, or check `Origin` / `Referer` headers explicitly.

5. **The `course_title` history bug.** Currently parked. After Sprint 1, the bug becomes "the `Job.course_title` field is empty in DB rows." Fix it at the same time you migrate `/api/generate` — pull from the form payload, not from the cover slide.

6. **Output local cleanup.** Once outputs go to S3, the local `tmp_outputs/` should be cleaned. Add a periodic janitor (cron / scheduled task) that deletes files older than 1 hour, or delete on successful S3 upload.

7. **Boto3 latency on every download.** Generating a presigned URL is local (no API call). But if you switch to "stream from S3", every download is a roundtrip. Stick with presigned URLs.

8. **Bank sync timing.** The boot-time sync runs before Flask starts. For 100+ MB of banks, this is 30–60 seconds on first boot. Subsequent boots are fast (idempotent skip). Acceptable; document in your operational runbook.

9. **No DB migrations tool.** This plan uses `Base.metadata.create_all()` (one-shot). If you start altering schemas after deploy, add `Alembic`. Don't bother for v1.

10. **Backup story for SQLite.** A daily `sqlite3 courseware.db ".backup courseware.db.bak"` to S3 is enough. Document in the runbook. Don't forget — the DB now holds 6+ months of history.

11. **Streamlit / matplotlib in requirements.txt.** Not used in v3 code paths. Remove from `requirements.txt` during Sprint 3 cleanup. Saves ~150 MB install size.

12. **Bus-factor risk on the banks.** Right now they exist only on one laptop. Sprint 3 fixes this by putting them in S3 — make sure step 7 (`upload_banks_to_s3.py`) happens BEFORE the original machine is ever decommissioned.

---

## Estimated effort

| Sprint | Engineering days | Risk |
|---|---|---|
| 1 — SQLite + state migration | 1–2 | Low — pure refactor, JSON files preserved as backup |
| 2 — Auth | 1–2 | Low — well-trodden territory |
| 3 — S3 banks + outputs | 1–2 | Medium — first boto3 integration, presigned URL nuances |
| **Total** | **3–6 days** | |

Each sprint is independently deployable. Don't bundle them. After Sprint 1, the app is a strict improvement (state in DB, JSON files retired). After Sprint 2, the app is locked-down internal. After Sprint 3, the app is cloud-native and the laptop bus-factor risk is gone.
