CLAUDE.md restructured

This commit is contained in:
StyxX65 2026-06-08 14:44:37 +02:00
parent f845a2f686
commit 1903115e02
3 changed files with 155 additions and 180 deletions

210
CLAUDE.md
View File

@ -16,23 +16,23 @@ python -m pytest tests/ -q
**Split modules:** `scan_engine.py` (M365 + file scan), `sse.py` (SSE broadcast), `checkpoint.py`, `app_config.py` (all persistence), `cpr_detector.py`
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` (Changes API) and update the token. Token save loads the current file fresh before writing (`{**current_tokens, **_new_drive_tokens}`) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. `google_scan_done` now includes `"delta": bool` and `"delta_sources": int`.
**Google Drive delta scan** — `routes/google_scan.py` reads `scan_opts.get("delta", False)` (same flag as M365). Per user, delta key is `f"gdrive:{user_email}"` stored in `~/.gdprscanner/delta.json` alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via `conn.get_drive_start_token(user_email)`. Subsequent scans call `conn.get_drive_changes(user_email, token)` and update the token. Invalid/expired tokens fall back to full scan automatically.
**Google connector write-back** — `google_connector.py` exposes three methods on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector` for in-place Drive file redaction: `get_drive_file_mime(user_email, file_id) → str`, `download_drive_file_by_id(user_email, file_id) → bytes`, `update_drive_file(user_email, file_id, content, mime_type)`. Backed by module-level helpers (`_get_drive_file_mime`, `_download_drive_file_by_id`, `_update_drive_file_content`). These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403. Do not use `DRIVE_SCOPES` for write operations.
**Google connector write-back** — `google_connector.py` exposes `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both connectors for in-place Drive redaction. These use `DRIVE_WRITE_SCOPES` (`drive`, not `drive.readonly`) — the service-account delegation must include this scope or the call raises 403.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; all other file-scan code (SSE, DB, cards) is unchanged. Auth: `"password"` stores credential via `store_sftp_password()` in OS keychain; `"key"` loads the private key from `~/.gdprscanner/sftp_keys/<uuid>` with an optional keychain passphrase. Key files are uploaded via `POST /api/file_sources/upload_key` (paramiko validates format). `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Do not add `source_type="sftp"` handling anywhere except `scan_engine.py` — the rest of the pipeline is source-agnostic. Additional single-file I/O methods: `_ssh_connect()` (shared connection builder), `read_file(remote_path) → bytes`, `write_file(remote_path, content)` used by the redaction route; do not duplicate SSH setup outside these methods.
**SFTP connector** — `sftp_connector.py` provides `SFTPScanner` with the same `iter_files()` interface as `FileScanner`. `run_file_scan()` in `scan_engine.py` checks `source.get("source_type") == "sftp"` and instantiates `SFTPScanner`; the rest of the pipeline is source-agnostic. Auth: `"password"` via OS keychain; `"key"` from `~/.gdprscanner/sftp_keys/<uuid>`. `SFTP_OK` flag guards graceful degradation if `paramiko` is not installed. Single-file I/O: `_ssh_connect()`, `read_file(remote_path)`, `write_file(remote_path, content)` — do not duplicate SSH setup outside these methods.
**Shared content processing** — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: `cpr_detector._scan_bytes(content, filename)`. It dispatches to the correct parser by file extension. `scan_engine.py` uses the `_scan_bytes_timeout` wrapper for PDFs (subprocess + hard timeout). `routes/google_scan.py` uses `_scan_bytes` directly. Do not duplicate file-type handling in per-source code.
**Shared content processing** — all three scan engines funnel downloaded bytes through `cpr_detector._scan_bytes(content, filename)`. `scan_engine.py` uses `_scan_bytes_timeout` for PDFs (subprocess + hard timeout). Do not duplicate file-type handling in per-source code.
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned across all sources. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS` so local/SMB scans stay in sync automatically. `scan_engine.py` uses it to gate M365/SharePoint/Teams file downloads. Do not maintain a separate extension list anywhere else.
**`cpr_detector.SUPPORTED_EXTS` is the single source of truth** for which file extensions are scanned. `file_scanner.py` imports it as `DEFAULT_EXTENSIONS`. Do not maintain a separate extension list anywhere else.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines a no-op stub for `_scan_bytes` / `_scan_bytes_timeout` at module level (avoids circular import). `gdpr_scanner.py` overwrites them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. This is intentional — do not try to import them directly in those modules.
**`_scan_bytes` injection pattern** — `scan_engine.py` defines no-op stubs at module level (avoids circular import). `gdpr_scanner.py` overwrites them at startup. `routes/google_scan.py` resolves them lazily via `gdpr_scanner.__getattr__`. Do not import them directly in those modules.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for state/SSE rules.
**Blueprints** in `routes/` — see `routes/CLAUDE.md` for SSE constraints, export, preview, scheduler, NER, audit log, viewer, and other route-specific rules.
**Frontend:** `templates/index.html` (SPA), `static/style.css` (all styles), `static/js/*.js` (11 ES modules + `state.js`). `static/app.js` is an archived monolith — no longer loaded.
**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. `checkpoint.py` functions accept a `prefix` keyword (default `"m365"`). Use `_cp_path(prefix)` to get the path — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing happens. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
**Checkpoint / resume** — all three scan engines save progress to `~/.gdprscanner/checkpoint_{prefix}.json` every 25 items. Prefixes: `m365`, `google`, `file_{source_id}`. Use `_cp_path(prefix)` — do not hard-code filenames. The Scan button calls `checkCheckpoint(() => startScan(false))` so a resume banner is offered before any grid clearing. `POST /api/scan/clear_checkpoint` globs and deletes all `checkpoint_*.json` files.
**Data dir** `~/.gdprscanner/`: `scanner.db`, `config.json`, `settings.json`, `schedule.json`, `token.json`, `delta.json`, `checkpoint_m365.json`, `checkpoint_google.json`, `checkpoint_file_*.json`, `smtp.json`, `machine_id` (**never delete** — Fernet key), `role_overrides.json`, `google_sa.json`, `google.json`, `src_toggles.json`, `app.lock`, `viewer_tokens.json`
@ -54,186 +54,40 @@ python -m pytest tests/ -q
**`tests/test_google_scan.py`** — 19 tests for the Google Workspace scan module. Route tests for `GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`. Engine tests for `_run_google_scan` using synchronous invocation with mocked `broadcast`, `_scan_bytes`, `checkpoint.*`, `scan_engine._with_disposition`, and `gdpr_db.get_db`. The `clean_google_state` autouse fixture releases `_google_scan_lock` and clears `_google_scan_abort` after each test.
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`.
**`tests/test_route_integration.py`** — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering, profile routes CRUD and rename. Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database.
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx`, `openpyxl`, `mutagen` — all in venv) regenerates the binary `.docx`/`.xlsx`/`.mp3`/`.flac`/`.mp4` files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal `ftyp`+`moov`/`mvhd` base that mutagen can tag.
**Local-file scan fixtures** — `tests/fixtures/local_files/` holds 19 files (14 flagged, 5 true negatives). `generate_fixtures.py` regenerates the binary files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block.
**`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` builds a single string by concatenating all run texts with no separators between paragraphs. If a CPR value run is immediately followed by text from the next paragraph without a word boundary, `\b` in `CPR_PATTERN` fails and the number is silently missed. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries after concatenation. Do not remove this trailing space — the detection will silently regress.
## Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
Read-only access for DPOs and reviewers. Key invariants:
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- **`window.VIEWER_MODE`** — injected by Jinja2 in `index.html`. `auth.js` reads it at startup; adds `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks — except `delBtn` in the card builder which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. Contains the scope dict from the token (e.g. `{"role": "student"}`). Empty object `{}` means unrestricted. `auth.js` reads it at startup; if `VIEWER_SCOPE.role` is set, it pre-sets `#filterRole` to that value and hides the dropdown so the viewer cannot change it.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `user_role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard. **Column name is `user_role`** — do not use `role`; the DB row has no such key and the filter silently returns nothing.
- **`session["viewer_scope"]`** — set when a token is validated at `/view`. Persists for the browser session alongside `session["viewer_ok"]`. Reads from `session.get("viewer_scope", {})` in `/api/db/flagged` — defaults to `{}` (unrestricted) for PIN-authenticated sessions and legacy tokens without a scope key.
- **`viewer_tokens.json` format** — stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Token dicts now include `"scope": {}`. The old bare-list format and tokens without a `scope` key are handled transparently (`t.get("scope", {})`). Do not write the file as a bare list.
- **`app.secret_key`** — derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- **`GET /api/db/flagged`** — returns `get_session_items()` (last completed scan session, joined with dispositions), filtered by `session["viewer_scope"].role` when set. Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- **Rate-limit state** (`_pin_attempts` dict in `routes/viewer.py`) — in-memory only, resets on server restart. Intentional — a restart clears lockouts without a persistent store.
- **User-scoped tokens (#34)** — scope `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}` filters `GET /api/db/flagged` by `account_id IN (list)`, covering both M365 and GWS items for the same person. `scope.user` is always stored as a list; a legacy single-string value is coerced to `[string]` on read. `scope.display_name` is used for UI only (badge, viewer header) — not for filtering. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400. Share modal: scope-type `<select>` (`#shareScopeType`) reveals either the role dropdown (`#shareScopeRoleWrap`) or a name-search autocomplete (`#shareScopeUserWrap`). Autocomplete reads `S._allUsers`; selecting a row stores `{ emails, display_name }` in module-level `_selectedScopeUser`; editing the input manually clears it (free-text email fallback). In viewer mode, `auth.js` shows `#viewerIdentityBadge` with `VIEWER_SCOPE.display_name`.
- **Date-range scoping** — tokens can carry `valid_from` and/or `valid_to` fields (YYYY-MM-DD) in their scope dict. `GET /api/db/flagged` filters items whose `modified` date falls outside the range using lexicographic string comparison (ISO dates sort correctly without parsing). `POST /api/viewer/tokens` validates format and enforces `valid_from ≤ valid_to`. The share modal shows `#shareValidFrom` / `#shareValidTo` date inputs (apply to any scope type). The token list shows a green date-range badge when a range is stored. All three scope dimensions (role, user, date-range) are independent and combinable.
- **Token onclick attributes** — Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- **`stClearViewerPin` guard** — validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- **Share link base URL**`_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin` — that produces `127.0.0.1` links useless to remote viewers.
- **Flask binds to `0.0.0.0`**`gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1` — do not change those to `0.0.0.0`.
## Sources panel resize — static/js/log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height is controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered**`scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
**`_CPR_PREFIX_NOISE` in `.docx` fixtures** — `scan_docx` concatenates all run texts with no separators. The fixture generator appends a trailing `" "` to every value run so CPRs are always surrounded by word boundaries. Do not remove this trailing space — the detection will silently regress.
## Scan filter options — scan_engine.py
All options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan).
- **`skip_gps_images` (bool, default `false`)** — When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii` which rechecks `pii_fields` and `author` when GPS is skipped.
- **`min_cpr_count` (int, default `1`)** — Minimum number of **distinct** CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))``cprs` is a list of dicts from `extract_matches`, not strings. Do not revert to `dict.fromkeys(cprs)` — that raises `TypeError: unhashable type: 'dict'` on every file with CPR hits. Files with faces or EXIF PII are still flagged regardless of CPR count — the threshold gates only CPR-based hits.
- **`cpr_only` (bool, default `false`)** — When enabled, items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped; only items with at least one qualifying CPR number are flagged. Implemented as a compact short-circuit at each engine's flagging gate: `if not (_cpr_qualifies and cprs) and (cpr_only or (<other PII absent>)): continue`. This preserves existing behavior when `cpr_only=False`. Sidebar toggle `#optCprOnly`; profile editor `#peOptCprOnly`.
- **`ocr_lang` (str, default `"dan+eng"`)** — Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng`, `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. Threaded through `_scan_bytes`/`_scan_bytes_timeout``document_scanner.scan_pdf`/`scan_image` and the spawned PDF-OCR subprocess worker (`_worker_scan_pdf`). The OCR result cache key already included `lang`, so per-language results are cached independently. Sidebar select `#optOcrLang`; profile editor `#peOptOcrLang`.
- **File scan** reads all options from `source` dict keys (passed directly from the `/api/file_scan/start` payload). **M365 scan** reads them from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate.
- **UI:** sidebar controls `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor controls `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All are saved/loaded by `profiles.js`.
## M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler in `_scan_user_onedrive` and broadcast as a red `scan_error`. The full-scan path (`_iter_drive_folder_for`) silently swallowed the same 404 via `except Exception: return`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped without an error card. Common causes: no OneDrive licence, service plan disabled, drive never initialised (account never signed in), account suspended.
**Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs elsewhere. Raising `M365DriveNotFound` keeps the error visible to callers that need to act on it.
- **`skip_gps_images` (bool, default `false`)** — images whose only PII is GPS coordinates are not flagged. GPS data still stored in `exif` field if flagged by another signal.
- **`min_cpr_count` (int, default `1`)** — minimum distinct CPR numbers before flagging. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — do not revert to `dict.fromkeys(cprs)` (raises `TypeError: unhashable type: 'dict'`). Files with faces or EXIF PII are still flagged regardless.
- **`cpr_only` (bool, default `false`)** — skip items whose only hits are email addresses, phone numbers, faces, or EXIF/GPS metadata.
- **`ocr_lang` (str, default `"dan+eng"`)** — Tesseract language packs. Threaded through `_scan_bytes`/`_scan_bytes_timeout``document_scanner` and the PDF-OCR subprocess worker. Cache key already includes `lang`.
- **File scan** reads options from `source` dict keys directly. **M365 scan** reads from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic.
- **UI:** sidebar `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All saved/loaded by `profiles.js`.
## Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- **Email body stripped at collection time**`_scan_user_email` stores body as `msg["_precomputed_body"]`, deletes `msg["body"]` and `msg["bodyPreview"]`. Processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to `$select` without also stripping it.
- **`body_excerpt`** — 500-char plain-text preview stored per flagged email; flows into `flagged_items`, checkpoint JSON, and DB. Do not remove before broadcasting — needed for preview on checkpoint resume.
- **`work_items``deque` before processing** — drained via `popleft()` so each item's memory is released immediately. Do not convert back to a list.
- **`del content` / `del body_text`** — raw bytes and body text deleted immediately after use. Both hit and no-hit paths have explicit deletes.
- **PDF OCR rendered page-by-page**`convert_from_path(first_page=N, last_page=N)` inside the loop; only one page image in memory at a time. Do NOT revert to a bulk call — triggers OOM on large PDFs.
- **OCR memory guard**`_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render.
- **Memory guard**`psutil.virtual_memory().available` checked before each M365 file download; skips if < 300 MB free.
- **Email body stripped at collection time**`_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- **`body_excerpt` — 500-char plain-text preview stored per flagged email** — just before `del body_text` in M365 email processing, `meta["_body_excerpt"] = body_text[:500].strip()`. In `google_scan.py`, a regex HTML-strip of the first 3000 bytes of Gmail body data is stored the same way. `_broadcast_card` in both engines includes `"body_excerpt"` in the card dict so the excerpt flows into `flagged_items`, the checkpoint JSON, and the DB (`body_excerpt TEXT`, migration #10). The M365 email preview route falls back to `_excerpt_page()` when Graph raises or the connector is absent. The Gmail preview shows `_excerpt_page()` as primary content with the "Open in Gmail" link appended. Do not remove the excerpt before broadcasting — that's what makes preview work on checkpoint resume.
- **`work_items``deque` before processing** — converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- **`del content` in file branch** — raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have explicit `del content`.
- **`del body_text` in email branch** — deleted after `_broadcast_card` call.
- **PDF OCR rendered page-by-page**`document_scanner.scan_pdf` (and the redact paths) call `convert_from_path(first_page=N, last_page=N)` inside the loop, so only one page image is in memory at a time. Do NOT move back to a bulk `convert_from_path()` call — that allocates all pages at once and triggers OOM kills on large PDFs.
- **OCR memory guard**`_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render. Pages that would exceed this threshold are skipped with a printed warning and recorded as `"skipped"` in `page_methods`.
- **Memory guard**`psutil.virtual_memory().available` checked before each M365 file download; scan skips the file if < 300 MB free.
## Scan history browser — gdpr_db.py
## Export — routes/export.py
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings (e.g. `{"gmail", "gdrive", "email"}`) for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items.
- **Excel Summary sheet vs. per-source tabs** — the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports**`_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly — GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param — do not separate them.
- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, and logs a `"redacted"` disposition. Supported source types and extensions:
- **`local`** — DOCX, XLSX, CSV, TXT, PDF. File is written to a temp path in the same directory then `shutil.move`d (avoids cross-device rename).
- **`onedrive` / `sharepoint` / `teams`** — DOCX, XLSX, PDF. Downloaded via Graph, redacted locally, re-uploaded via `put_drive_item_content()` (PUT with `Content-Type: application/octet-stream`). Requires `Files.ReadWrite.All``SCOPES` in `m365_connector.py` now requests this instead of `Files.Read.All` (superset; scanning still works). Delegated-auth users must re-authenticate after upgrading; app-only tenants need the admin to grant `Files.ReadWrite.All` application permission and re-consent.
- **`gdrive`** — DOCX, XLSX, PDF. MIME type checked first — Google-native Docs/Sheets (exported as DOCX during scan) are refused with a clear message. Downloaded via `download_drive_file_by_id()`, redacted, uploaded back via `update_drive_file()` (`files().update()`). Requires `drive` scope (not `drive.readonly`) on the service-account delegation.
- **`sftp`** — DOCX, XLSX, CSV, TXT, PDF. Source config matched from `_load_file_sources()` by `sftp_host` + `sftp_user` parsed from `item_meta["account_name"]` (the `sftp://user@host/root` URI). Requires the item to still be in `state.flagged_items``account_name` is not persisted to the DB. Read/write via `SFTPScanner.read_file()` / `write_file()`.
- **`smb`** — DOCX, XLSX, CSV, TXT, PDF. Host + share parsed from `full_path` (`//host/share/…`); source config matched from `_load_file_sources()`. Written back via `file_scanner.write_smb_file()` with `CreateDisposition.FILE_SUPERSEDE`.
- The ✂ button (`card-redact-btn`) appears in `appendCard` via `_redactable` logic in `results.js`; hidden in viewer mode and for resolved items. **Keep `_redactExts` / `_cloudRedactExts` in `results.js` and `_REDACT_EXTS` / `_GDRIVE_MIME_MAP` / `_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
- **PDF redaction**`redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` which physically removes text data from the PDF stream (not just an overlay). Falls back to `redact_pdf` (reportlab overlay) if PyMuPDF is absent. Text-based pages use `find_cpr_char_bboxes`; scanned pages render via OCR at 200 DPI and use `find_cpr_image_bboxes`. Raises `RuntimeError` if both backends are unavailable.
## Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py
Allows reviewing results from any past scan session without running a new scan. Key invariants:
- **`S._historyRefScanId`** — `null` = live/SSE mode; positive int = viewing a past session (the highest `scan_id` in that session's 300 s window). Set by `loadHistorySession()`; cleared to `null` by `exitHistoryMode()`.
- **`GET /api/db/sessions`** (`routes/database.py`) — calls `_get_db().get_sessions()`. Returns newest-first list; each entry has `ref_scan_id`, `started_at`, `finished_at`, `sources` (list of source-key strings), `flagged_count`, `total_scanned`, `delta` (bool). No auth restriction — viewer tokens share this endpoint.
- **`get_sessions(limit=50, window_seconds=300)`** (`gdpr_db.py`) — groups `scans` rows by 300 s window (same window logic as `get_session_items`). Groups are built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change the window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** (`gdpr_db.py`) — when `ref_scan_id` is given, anchors the 300 s window to that scan's `started_at`. Falls back to latest scan when `ref_scan_id=None`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300` — do not revert to a one-sided lower bound or historical sessions will include all newer scans.
- **`GET /api/db/flagged?ref=N`** — passes `ref_scan_id` to `get_session_items`; viewer scope enforcement (role/user filters) still applies. Used by both history mode and the normal post-scan viewer path.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation**`_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion.
- **Re-scan diff**`loadHistorySession` fetches the immediately preceding session's items after rendering the current session. Items present in the previous session but absent from the current one (compared by `id`) are tagged `_resolved: true` and appended after a `.resolved-divider` separator. `appendCard` in `results.js` adds `.card-resolved` (opacity 0.6), a green `✓ Resolved` badge, and hides the delete button for resolved items. `_setHistoryBanner` accepts an optional `resolvedCount` parameter and appends it to the banner label. Resolved items are NOT added to `S.flaggedData` — they are grid-only and cannot be bulk-selected or exported.
- **Auto-load on page load**`results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
- **`sse_replay_done` retry** — `scan_phase` events replayed from a *completed* scan temporarily set `S._m365ScanRunning = true` (because all flags start at `false` after a page refresh). This causes the watchdog's `loadHistorySession` call to bail before `scan_done` clears the flag. The `sse_replay_done` handler in `scan.js` retries `loadHistorySession(null)` if no scan is running and `S._historyRefScanId` is still `null` after replay. Do not remove this retry — it is the only way to guarantee DB items are loaded when the watchdog fires in the middle of a replay.
- **Mode transitions**`startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
## CPR cross-referencing — gdpr_db.py + routes/database.py + static/js/results.js
- **`GDPRDb.get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find other items in the same session window that share ≥1 CPR hash with `item_id`. Returns rows ordered by `shared_cprs DESC, cpr_count DESC`. Uses the same 300 s symmetric window as `get_session_items` — do not change the window size independently.
- **`GET /api/db/related/<item_id>?ref=N`** (`routes/database.py`) — passes `item_id` and optional `ref_scan_id` to `get_related_items`; normalises JSON columns (same logic as `db_flagged_items`). Returns `[]` when `DB_OK` is false.
- **`#previewRelated`** — `<div>` inserted between `#previewMeta` and the disposition row in `index.html`. Hidden (`display:none`) when not in use; shown by `_loadRelated`.
- **`_loadRelated(f)`** (`results.js`) — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview` after `loadDisposition`.
- **`window._openRelated(id, itemData)`** (`results.js`) — resolves the target item: looks up `id` in `S.flaggedData` first (live/history grid already loaded), falls back to `itemData` from the API response (history items not yet in the grid). Calls `openPreview`.
- **No new data collection**`cpr_index` already stores `(cpr_hash, item_id, scan_id)` for every CPR hit at write time. Cross-referencing is entirely a query-time operation.
## Preview — routes/database.py
`GET /api/preview/<item_id>?source_type=…&account_id=…` dispatches by `source_type`:
- **`local` / `smb`** — re-reads the file from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline, SMB as a link card.
- **`email`** — fetches the M365 message body via Graph and renders it as sandboxed HTML (requires `state.connector`).
- **`gmail`** — Gmail's web UI cannot be embedded (X-Frame-Options). Shows an info card with an "Open in Gmail" link built from the stored `_url` field.
- **`gdrive`** — extracts the Drive file ID from `webViewLink` and returns `https://drive.google.com/file/d/{id}/preview` as an iframe. Falls back to substituting `/view``/preview` in the URL if the pattern doesn't match.
- **All other values** (M365 files: `onedrive`, `sharepoint`, `teams`, or empty) — calls Graph's `/preview` POST endpoint; tries `drive_id`-based path first, then user-drive path, then `/me/drive`.
**`_source_type` must be set in `google_scan.py`** — Gmail items need `meta["_source_type"] = "gmail"` and Drive items `"gdrive"` before `_broadcast_card` is called. Without it, cards carry an empty `source_type` and fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
**`state.connector` guard** — only the `email` branch and the M365 `else` branch require M365 auth. The `local`/`smb`, `gmail`, and `gdrive` branches must not gate on `state.connector` — they work in Google-only deployments.
## Compliance audit log — gdpr_db.py + routes/
- **`audit_log` table** — created by `_DDL` (`CREATE TABLE IF NOT EXISTS`) so it appears automatically on the next server start for existing databases. No migration needed. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- **`ScanDB.log_audit(action, detail, actor, ip)`** — inserts one record and commits immediately. `ScanDB.get_audit_log(limit, action)` returns rows newest-first.
- **`log_audit_event(action, detail, actor, ip)`** — module-level helper in `gdpr_db.py`; silently no-ops on any exception so call sites never raise. Import: `from gdpr_db import log_audit_event as _audit`.
- **`GET /api/audit_log?limit=200&action=<filter>`** — in `routes/app_routes.py`. No auth gate — same access level as other settings endpoints.
- **Recorded events**`profile_save`, `profile_delete` (`routes/profiles.py`); `token_create`, `token_revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear` (`routes/viewer.py`); `source_add`, `source_update`, `source_delete` (`routes/sources.py`); `scheduler_job_save`, `scheduler_job_delete` (`routes/scheduler.py`); `scan_start`, `scan_stop` (`routes/scan.py`); `smtp_save` (`routes/email.py`); `disposition`, `disposition_bulk`, `admin_pin_set/change` (`routes/database.py`); `item_delete`, `item_redact` (`routes/export.py`).
- **UI** — "Audit Log" tab (`stTabAuditlog` / `stPaneAuditlog`) in the Settings modal. `stLoadAuditLog()` in `sources.js` fetches and renders the table when the tab is opened; uses `window._escHtml` from `log.js`. Exported as `window.stLoadAuditLog`.
- **Do not add `actor` values for end-user identity** — the scanner has no per-user login, so `actor` is always empty for now. The field is reserved for future use.
## SSE teardown — static/js/scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
- **Rule:** close `S.es` (and reset `S._userStartedScan`) only inside the branch where *all* concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans**`S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
- **Two separate abort events**`state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them — a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — when the Google thread detects the abort signal it broadcasts `google_scan_done` (with `cancelled: True`). Do NOT change this back to `scan_cancelled`. The `scan_cancelled` handler in the frontend unconditionally closes the SSE connection; if a new M365/file scan started while the old Google thread was still winding down, that would drop all remaining events from the new scan. `google_scan_done` is safe because its handler checks `!S._m365ScanRunning && !S._fileScanRunning` before closing the SSE.
- **Google Drive uses a lazy generator, not `list()`**`iter_drive_files()` is iterated directly (no `list()` wrapping) in both the non-delta path and the delta-fallback path. Wrapping in `list()` would block the thread — and the abort check — for the entire file enumeration. The delta start token is recorded *before* the iteration loop so it captures state at the right moment regardless of when the loop is interrupted.
- **`scan_phase` replay sets running flags — handled by `sse_replay_done`** — the `scan_phase` handler sets `S._m365ScanRunning/S._googleScanRunning/S._fileScanRunning = true` whenever no flag is currently set and a matching source keyword is found in the phase text. On page refresh this fires during SSE replay of a completed scan (all flags start `false`), temporarily making the scan appear to still be running. This is intentional for live reconnection but harmless for the completed-scan case because `scan_done` clears the flag and `sse_replay_done` retries `loadHistorySession`. Do not remove the flag-setting logic from `scan_phase` — it is needed to detect reconnection to a genuinely running scan.
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — `m365_connector._post()` returns `r.json() if r.content else {}`. The Graph `sendMail` endpoint returns HTTP 202 with **no body** on success; calling `r.json()` on an empty response raises `JSONDecodeError`. Do not change this back to an unconditional `r.json()` — it would falsely report every successful email send as an error.
- **Graph preferred over SMTP**`smtp_test` and `send_report` both try `_send_email_graph()` first when `state.connector` is authenticated. Only falls back to SMTP if Graph raises. If Graph fails and no SMTP host is saved, the Graph exception is surfaced directly (not swallowed by the "No SMTP host" message).
- **Auto-email after manual scan**`_maybe_send_auto_email()` in `routes/scan.py` is called from the `_run()` thread immediately after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")` from `smtp.json`; no-ops if the flag is false, no flagged items, or no recipients. Same Graph-first → SMTP-fallback pattern as the scheduler. Toggle: **Settings → Email report → Email report after manual scan** (`#st-smtpAutoEmail`), saved by `stSmtpSave()` in `scheduler.js`.
- **Gmail vs Google Workspace detection** — auth error handlers check whether the SMTP username ends in `@gmail.com` / `@googlemail.com`. If not, the account is treated as Google Workspace (custom domain) and the error message points to the Workspace admin console rather than the user's personal security settings.
## Scheduler — scan_scheduler.py + routes/scheduler.py + static/js/scheduler.js
- **Job config keys**`id`, `name`, `enabled`, `frequency` (daily/weekly/monthly), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`. Auto-migrates old single-job format; assigns UUIDs to legacy entries without one.
- **`_execute_scan(job_id)`** — the core execution method. Acquires a per-job lock (`_running_jobs` set), records a DB run via `db.begin_schedule_run()`, then either takes the report-only path (see below) or runs the full scan pipeline (M365 → file → Google), then emails and applies retention if configured. The DB run is finalised in a `finally` block so status/counts are always recorded.
- **Report-only path** — when `report_only=True`, `_execute_scan` short-circuits before the M365 auth check. It populates `_m.flagged_items` from `db.get_session_items()` if the in-memory list is empty, then calls `_send_email_report(job_cfg)` and returns. Does NOT acquire the scan lock; does NOT require M365 auth. Fails with `RuntimeError("No scan results available")` if both in-memory state and the DB are empty, which the outer `except` handler records as a failed run.
- **`_send_email_report(job_cfg)`** — builds Excel via `_m._build_excel_bytes()`, loads SMTP config, tries Graph first (if `state.connector` is authenticated), falls back to SMTP. Adjusts the email body text based on `job_cfg.get("report_only")`: "Scan completed" vs "Report on latest scan results".
- **`_m.flagged_items` and `state.flagged_items` are the same object** — `gdpr_scanner.py` assigns `_state.flagged_items = flagged_items` at startup, so both names reference the same list. In-place updates (`flagged_items[:] = ...`) in the scheduler propagate to routes and vice versa.
- **`scheduler_started` / `scheduler_done` SSE events** — broadcast at start and end of every job (including report-only). `scheduler_done` carries `flagged`, `scanned`, `emailed`, and `job_name`. Do not confuse with `scan_done` (M365) — they are separate event types.
- **UI — job card badge**`schedRenderJobs()` in `scheduler.js` adds a blue "Report only" (`m365_sched_report_only`) badge to the job name when `j.report_only` is true.
- **UI — `schedToggleReportOnly()`** — dims the Profile row (`#schedProfileRow` opacity 0.4), shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
- **Profile options merge into file sources** — saved file-source dicts (`~/.gdprscanner/file_sources.json`) only carry connection info (`id, label, path, smb_host, smb_user, smb_domain, keychain_key, sftp_*`). Per-scan options live on the profile (`options.cpr_only`, `options.ocr_lang`, `options.scan_photos`, `options.skip_gps_images`, `options.min_cpr_count`, `options.scan_emails`, `options.scan_phones`, `options.max_file_mb`). The scheduler unpacks them with `{**fs, **_fs_extra}` before calling `_m.run_file_scan(fs)` — mirroring what `startScan()` in `scan.js` does in the browser. Do not pass `fs` directly to `run_file_scan()` without merging — the file scan reads these keys with `source.get(...)` and will silently fall back to defaults, ignoring the profile's settings.
## Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered Named Entity Recognition replacing spaCy. Activated via `config.json` keys `claude_ner` (bool) and `claude_api_key` (str).
- **`ANTHROPIC_OK`** — module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths so the scanner works without the package installed.
- **`_get_claude_ner_config()`** — reads `config.json` via `app_config._load_config()` on each call. File is small and OS-cached — no startup injection needed.
- **`_ner_claude(text, api_key)`** — calls `claude-haiku-4-5-20251001`, sends text in 8 000-char chunks, parses the JSON response. Returns `[{"text": ..., "type": "NAME"|"ADDRESS"|"ORG"}]`. Thread-safe in-memory cache keyed by `hash(text)`, evicts oldest entry when > 2 000 entries.
- **Integration points**`count_pii_types()` and `find_pii_spans_in_text()` both check `_get_claude_ner_config()` before deciding whether to call Claude or spaCy. Claude path uses `re.finditer(re.escape(ent_text), text)` to recover character offsets from Claude's extracted strings.
- **`GET /POST /api/settings/claude`** — GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes the key). POST accepts `{"enabled": bool, "api_key": "..."}``api_key` is optional; omitting or sending `""` leaves the stored key unchanged.
- **`POST /api/settings/claude/test`** — makes a minimal 8-token API call and returns `{"ok": true}` or `{"ok": false, "error": "..."}`. Used by the Test button in the UI.
- **`app_config.get_claude_config()` / `save_claude_config(enabled, api_key=None)`** — the two public helpers; `api_key=None` means "keep existing key".
- **Settings tab `stTabAi` / `stPaneAi`**`switchSettingsTab('ai')` calls `stLoadAiSettings()` in `sources.js`. Shows enable toggle, masked key input with Show/Hide toggle, Save and Test buttons.
- **Do not import `anthropic` at module level in any file other than `document_scanner.py`** — the `routes/app_routes.py` test endpoint imports it locally inside the function body so the server starts without the package if needed.
- **`get_sessions(limit=50, window_seconds=300)`** — groups `scans` rows by 300 s window. Groups built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change window size independently of `get_session_items`.
- **`get_session_items(ref_scan_id=N)`** — anchors 300 s window to that scan's `started_at`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound.
- **`get_related_items(item_id, ref_scan_id, window_seconds=300)`** — self-joins `cpr_index` to find items sharing ≥1 CPR hash. Uses same 300 s symmetric window — do not change independently.
- **`GET /api/db/flagged?ref=N`** — passes `ref_scan_id` to `get_session_items`; viewer scope enforcement still applies.
- See `static/js/CLAUDE.md` for the frontend history browser behaviour and `sse_replay_done` retry fix.
## Global gotchas
@ -243,7 +97,7 @@ Optional AI-powered Named Entity Recognition replacing spaCy. Activated via `con
## Directory-scoped rules
- `routes/CLAUDE.md` — SSE constraints, scan_progress source field, file_sources, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar phase parsing, JS gotchas
- `routes/CLAUDE.md` — SSE constraints, M365 exceptions, export, preview, audit log, email, scheduler, Claude NER, viewer route, Python gotchas
- `static/js/CLAUDE.md` — profile dropdown, progress bar, SSE teardown, history browser, CPR cross-referencing, sources panel resize, viewer JS, JS gotchas
- `templates/CLAUDE.md` — CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md` — i18n conventions

View File

@ -19,6 +19,87 @@ All three scan engines must include `"source": "m365"` / `"google"` / `"file"` i
## `_scan_bytes` injection
`scan_engine.py` declares stub versions of `_scan_bytes` / `_scan_bytes_timeout` at module level. `gdpr_scanner.py` replaces them with the real `cpr_detector` implementations at startup. `routes/google_scan.py` pulls them from `gdpr_scanner` via `__getattr__`. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
## M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit `M365Error(Exception)`):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
**`M365DriveNotFound` — why it exists:** `_get()` previously fell through to `raise_for_status()` on 404, which was caught by the generic `except Exception` handler and broadcast as a red `scan_error`. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped silently. **Do not add a 404 handler to `_get()` that returns a fallback value** — that would silently mask genuine path bugs.
## Export — routes/export.py
- **`GDPRDb.get_session_sources()`** — returns a `set` of source-key strings for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone — that dict only contains sources with flagged items.
- **Excel Summary sheet** — shows all scanned sources (even with 0 items). Per-source tabs only created for sources with items.
- **ART.30 breakdown table** — iterates `scanned_sources` (not `by_source`) so Gmail, Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- **Role-filtered exports**`_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function; GPS sheet, External transfers sheet, and Art.30 tables all see only the filtered subset. Filenames get `_elever` / `_ansatte` suffix.
- **`POST /api/redact_item`** — rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, logs a `"redacted"` disposition. Source types: `local` (DOCX/XLSX/CSV/TXT/PDF, written via temp+move), `onedrive`/`sharepoint`/`teams` (Graph download → redact → PUT, requires `Files.ReadWrite.All`), `gdrive` (Drive API, requires `drive` scope), `sftp` (paramiko read/write, item must still be in `state.flagged_items`), `smb` (smbprotocol `FILE_SUPERSEDE`). **Keep `_redactExts`/`_cloudRedactExts` in `results.js` and `_REDACT_EXTS`/`_GDRIVE_MIME_MAP`/`_ALL_REDACTABLE_TYPES` in `export.py` in sync** — the button and the route must agree.
- **PDF redaction**`redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` (physical removal). Falls back to reportlab overlay if PyMuPDF absent. Text pages use `find_cpr_char_bboxes`; scanned pages use OCR at 200 DPI + `find_cpr_image_bboxes`.
## Preview — routes/database.py
`GET /api/preview/<item_id>?source_type=…&account_id=…` dispatches by `source_type`:
- **`local` / `smb`** — re-reads from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline.
- **`email`** — fetches M365 message body via Graph (requires `state.connector`).
- **`gmail`** — shows info card with "Open in Gmail" link (X-Frame-Options blocks embedding).
- **`gdrive`** — returns `https://drive.google.com/file/d/{id}/preview` iframe.
- **All other values** (M365 files) — calls Graph `/preview` POST; tries `drive_id`-based path first, then user-drive, then `/me/drive`.
**`_source_type` must be set in `google_scan.py`** — Gmail items need `meta["_source_type"] = "gmail"` and Drive items `"gdrive"` before `_broadcast_card`. Without it, cards fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
**`state.connector` guard** — only the `email` and M365 `else` branches require M365 auth. The `local`/`smb`/`gmail`/`gdrive` branches must not gate on `state.connector` — they work in Google-only deployments.
## Compliance audit log — gdpr_db.py + routes/
- **`audit_log` table** — created by `_DDL` (`CREATE TABLE IF NOT EXISTS`), auto-appears on next server start. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- **`log_audit_event(action, detail, actor, ip)`** — module-level helper; silently no-ops on any exception. Import: `from gdpr_db import log_audit_event as _audit`.
- **`GET /api/audit_log?limit=200&action=<filter>`** — in `routes/app_routes.py`. No auth gate.
- **Recorded events**`profile_save/delete`, `token_create/revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear`, `source_add/update/delete`, `scheduler_job_save/delete`, `scan_start/stop`, `smtp_save`, `disposition`, `disposition_bulk`, `admin_pin_set/change`, `item_delete`, `item_redact`.
- **`actor` always empty** — no per-user login; field reserved for future use.
## Email sending — routes/email.py + m365_connector.py
- **`_post()` returns `{}` on empty body** — Graph `sendMail` returns HTTP 202 with no body; `r.json()` on empty raises `JSONDecodeError`. Do not revert to unconditional `r.json()`.
- **Graph preferred over SMTP**`smtp_test` and `send_report` try `_send_email_graph()` first; fall back to SMTP only if Graph raises. If Graph fails and no SMTP host saved, the Graph exception surfaces directly.
- **Auto-email after manual scan**`_maybe_send_auto_email()` in `routes/scan.py` called from the `_run()` thread after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")`; no-ops if false, no flagged items, or no recipients.
- **Gmail vs Google Workspace** — auth error handlers check if SMTP username ends in `@gmail.com`/`@googlemail.com`; custom domains are treated as Google Workspace and error message points to the Workspace admin console.
## Scheduler — scan_scheduler.py + routes/scheduler.py
- **Job config keys**`id`, `name`, `enabled`, `frequency` (daily/weekly/monthly), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`.
- **`_execute_scan(job_id)`** — acquires per-job lock (`_running_jobs` set), records DB run via `db.begin_schedule_run()`, runs M365 → file → Google pipeline, then emails and applies retention. DB run finalised in `finally`.
- **Report-only path** — when `report_only=True`, short-circuits before M365 auth check, populates `_m.flagged_items` from `db.get_session_items()` if empty, calls `_send_email_report()`. Does NOT acquire scan lock; fails with `RuntimeError("No scan results available")` if DB is also empty.
- **`_m.flagged_items` and `state.flagged_items` are the same object** — assigned at startup; in-place updates (`flagged_items[:] = ...`) propagate to both.
- **`scheduler_started` / `scheduler_done` SSE events** — separate from `scan_done` (M365). `scheduler_done` carries `flagged`, `scanned`, `emailed`, `job_name`.
- **Profile options merge into file sources** — scheduler unpacks `{**fs, **_fs_extra}` before calling `run_file_scan(fs)`. Do not pass `fs` directly — the file scan reads `source.get(...)` and silently falls back to defaults without the merge.
## Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered NER replacing spaCy. Activated via `config.json` keys `claude_ner` (bool) and `claude_api_key` (str).
- **`ANTHROPIC_OK`** — module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths.
- **`_ner_claude(text, api_key)`** — calls `claude-haiku-4-5-20251001` in 8 000-char chunks. Thread-safe cache keyed by `hash(text)`, evicts oldest when > 2 000 entries.
- **`GET/POST /api/settings/claude`** — GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes key). POST accepts `{"enabled": bool, "api_key": "..."}` — omitting `api_key` leaves stored key unchanged.
- **`POST /api/settings/claude/test`** — minimal 8-token API call; returns `{"ok": true}` or `{"ok": false, "error": "..."}`.
- **Do not import `anthropic` at module level outside `document_scanner.py`**`routes/app_routes.py` imports it locally inside the function body so the server starts without the package.
## Viewer mode — routes/viewer.py
- **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form → 403. Never skip this order.
- **Token scope** — stored as `"scope": {"role": "student"|"staff"}`, `{"user": [...], "display_name": "..."}`, or `{}` in `viewer_tokens.json`. Enforced server-side in `GET /api/db/flagged`. **Column name is `user_role`** — do not use `role`.
- **`session["viewer_scope"]`** — set at `/view` token validation. `GET /api/db/flagged` reads `session.get("viewer_scope", {})` — defaults to `{}` (unrestricted) for PIN-authenticated sessions.
- **`viewer_tokens.json` format** — `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Old bare-list format handled transparently. Do not write as bare list.
- **Rate-limit state** (`_pin_attempts` dict) — in-memory only, resets on server restart. Intentional.
- **User-scoped tokens**`scope.user` always a list; legacy single-string coerced on read. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role`+`user` scope with 400.
- **Date-range scoping**`valid_from`/`valid_to` (YYYY-MM-DD) in scope dict; filtered via lexicographic string comparison in `GET /api/db/flagged`. Server validates format and enforces `valid_from ≤ valid_to`.
- **`app.secret_key`** — derived from `machine_id` bytes so sessions survive restarts. Set once at startup; do not override.
- **Flask binds to `0.0.0.0`**`gdpr_scanner.py`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs intentionally keep `127.0.0.1`.
## Gotchas
- **`_load_settings()` return** — does NOT include `file_sources`. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to.

View File

@ -29,9 +29,49 @@ Never revert to `!!window._googleConnected` / `_fileSources.length > 0` — thos
- **`user_ids = "all"` must be deferred** — if `S._allUsers` is empty when `_applyProfile()` runs, set `window._pendingProfileAllUsers = true` instead of calling `.forEach()` on an empty array. `loadUsers()` checks this flag after populating `S._allUsers` and selects everyone. Do not remove this — reverting will silently leave all accounts unchecked whenever a profile is chosen on a fast machine before the user list loads.
- **Source checkboxes may not exist yet**`_applyProfile()` calls `renderSourcesPanel()` first if `#sourcesPanel` contains no `input[data-source-id]` nodes. Same guard used in `loadUsers()`. Without it, `querySelectorAll` returns nothing and the profile's source selection is discarded; the next `renderSourcesPanel()` call re-renders all sources as checked (their default).
## SSE teardown — scan.js
- **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. Close `S.es` only when all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- **Scheduled scans**`S._userStartedScan` is false for scheduler-triggered runs, so SSE is never closed and future scheduler events continue to arrive.
- **Two separate abort events**`state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets **both**. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort`.
- **`_check_abort()` emits `google_scan_done`, not `scan_cancelled`** — `scan_cancelled` unconditionally closes the SSE; `google_scan_done` checks whether other scans are still running before closing.
- **`scan_phase` replay sets running flags — handled by `sse_replay_done`** — the `scan_phase` handler sets running flags to `true` whenever all flags are `false` and a source keyword is found in the phase text. On page refresh this fires during SSE replay of a completed scan, temporarily making the scan appear running. The `sse_replay_done` handler retries `loadHistorySession(null)` if no scan is running and `S._historyRefScanId` is still `null` after replay. Do not remove either the flag-setting logic or the retry.
- **Google Drive uses a lazy generator, not `list()`**`iter_drive_files()` iterated directly so `_check_abort()` fires between items. Wrapping in `list()` blocks the thread for the entire enumeration.
## Scan history browser — history.js + results.js
- **`S._historyRefScanId`** — `null` = live/SSE mode; positive int = viewing a past session. Set by `loadHistorySession()`; cleared by `exitHistoryMode()`.
- **Auto-load on page load**`results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `_initialStatusChecked` guard ensures this fires at most once per page load. The `sse_replay_done` handler in `scan.js` retries if `loadHistorySession` bailed due to stale running flags set during replay.
- **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Do not hide/show from outside `history.js`.
- **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` so the outside-click handler works correctly. Do not move the picker outside this wrapper.
- **Cache invalidation**`invalidateHistoryCache()` clears `_sessions` and `_latestRefScanId`. All three `*_done` SSE handlers call `window.invalidateHistoryCache?.()`.
- **Re-scan diff** — items present in the previous session but absent from the current one are tagged `_resolved: true`, rendered with `.card-resolved` and a green ✓ badge, and NOT added to `S.flaggedData` (grid-only, cannot be bulk-selected or exported).
- **Mode transitions**`startScan()` calls `window.exitHistoryMode?.()` before clearing the grid.
## CPR cross-referencing — results.js
- **`_loadRelated(f)`** — async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with per-item shared-CPR badge. Called from `openPreview`.
- **`window._openRelated(id, itemData)`** — looks up `id` in `S.flaggedData` first, falls back to `itemData` from the API response for items not yet in the grid.
## Sources panel resize — log.js + sources.js
- **`_fitSourcesPanel()`** — called at the end of every `renderSourcesPanel()`. Clears inline height, reads `scrollHeight`, then restores a saved preference from `localStorage` (`gdpr_sources_h`) or pins to `scrollHeight`.
- **`_initSourcesResize()`** — attaches pointer-drag to `#sourcesResizeHandle`. Captures `scrollHeight` as hard max on `pointerdown`; saves to `localStorage` on release.
- **Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML** — height controlled entirely by `_fitSourcesPanel()` at runtime.
- **Do not call `_fitSourcesPanel()` before the panel has rendered**`scrollHeight` will be 0.
## Viewer mode — viewer.js
- **`window.VIEWER_MODE`** — injected by Jinja2. `auth.js` adds `viewer-mode` class to `<body>`; all hide rules are CSS (`body.viewer-mode …`) except `delBtn` which is also guarded in JS.
- **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. If `VIEWER_SCOPE.role` is set, `auth.js` pre-sets `#filterRole` and hides the dropdown.
- **Token onclick attributes** — Copy/Revoke buttons pass the token as a single-quoted JS string literal, never via `JSON.stringify` (which produces double-quoted strings that break `onclick="…"` attributes).
- **Share link base URL**`_getShareBaseUrl()` fetches `/api/local_ip` (LAN IP via UDP probe to `8.8.8.8`) so copied links are routable from other machines. Both `createShareLink` and `copyTokenLink` are `async`. Do not revert to `window.location.origin` — that produces `127.0.0.1` links.
- **Settings Security pane** — Admin PIN and Viewer PIN groups live in `stPaneSecurity`. `switchSettingsTab('security')` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`.
## Gotchas
- **`scheduler.js` strings must use `t()`** — frequency labels (`m365_sched_freq_daily/weekly/monthly`), "Next" (`m365_sched_next`), "Running..." (`m365_sched_running`), "Disabled" (`m365_sched_disabled`), empty-job text (`m365_sched_no_jobs`), and empty-history text (`m365_sched_no_runs`) all have translation keys. Do not hard-code English strings in `schedLoad()` or `schedRenderJobs()`.
- **`scheduler.js` strings must use `t()`** — frequency labels, "Next", "Running...", "Disabled", empty-job text, and empty-history text all have translation keys. Do not hard-code English strings in `schedLoad()` or `schedRenderJobs()`.
- **Scheduler UI — `schedToggleReportOnly()`** — dims the Profile row, shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
- **Profile editor accounts** — default to unchecked. Only explicitly saved `user_ids` are checked.
- **Date presets** — stored as `years * 365` (integer days). Do not use `* 365.25`.
- **`copyTokenLink` is async** — called from `onclick` attributes as a fire-and-forget (the Promise is unhandled, which is fine). It `await`s `_getShareBaseUrl()` to get the machine's LAN IP before building the URL. Do not make it synchronous or revert to `window.location.origin` directly.
- **`copyTokenLink` is async** — called from `onclick` as fire-and-forget. Do not make it synchronous.