GDPRScanner/routes/CLAUDE.md
StyxX65 d8083eb0c0 feat: interface PIN, bulk disposition tagging, Google Drive delta scan, OCR memory fixes
- Interface PIN: optional session-level auth gate for the main scanner UI
  (Settings → Security → Interface PIN). Salted SHA-256 in config.json,
  rate-limited (5 attempts/5 min per IP). /view and viewer auth exempt.
  New /login page, before_request hook, GET/POST/DELETE /api/interface/pin,
  POST /api/interface/pin/verify, POST /api/interface/logout.

- Bulk disposition tagging: Select mode (filter bar "Vælg" button) reveals
  per-card checkboxes. Bulk tag bar at bottom of grid; POST /api/db/disposition/bulk.
  Disposition stats bar (total · unreviewed · retain · delete · % reviewed)
  updates after every save.

- Google Drive delta scan: uses Drive Changes API when delta is enabled.
  Per-user token stored as gdrive:{email} in delta.json. Load-then-merge
  save avoids racing with concurrent M365 token writes.

- PDF OCR OOM fix: render one page at a time with convert_from_path
  (first_page=N, last_page=N). Added _ocr_mem_ok() psutil guard (500 MB
  threshold) before each page render across scan_pdf, redact_fitz_pdf,
  redact_pdf.

- Email test message translation fix: routes/email.py returns structured
  {ok, method, recipients} instead of a hardcoded English string;
  scheduler.js builds the translated message client-side.

- Docs: CHANGELOG, README, TODO, MANUAL-EN, MANUAL-DA all updated.
  Lang files (en/da/de) extended with bulk, interface PIN, and SMTP keys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 18:46:45 +02:00

Routes — Architecture Rules

SSE constraints

SSE routes must live in gdpr_scanner.py, not blueprints — blueprints can't stream.

M365 scan emits scan_done; Google emits google_scan_done; file scan emits file_scan_done. Never mix them up.
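One way to keep the three completion events from being mixed up is to derive the event name from the scan source in a single place. The following is a minimal sketch, not the project's actual code: `DONE_EVENTS` and `done_event_name` are hypothetical names, and only the three event names themselves come from the rule above.

```python
# Completion event per scan source, exactly as listed in the rule above.
DONE_EVENTS = {
    "m365": "scan_done",
    "google": "google_scan_done",
    "file": "file_scan_done",
}


def done_event_name(source: str) -> str:
    """Return the completion event for a scan source (hypothetical helper).

    Unknown sources raise instead of falling back to a default, so a typo
    cannot silently emit another engine's event.
    """
    try:
        return DONE_EVENTS[source]
    except KeyError:
        raise ValueError(f"unknown scan source: {source!r}") from None
```

With a helper like this, each engine would ask for its own event name rather than hard-coding a string at every emit site.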

scan_progress source field

All three scan engines must include "source": "m365" / "google" / "file" in every scan_progress SSE event. Never remove this field — the frontend uses it to route progress to the correct segment.
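A small payload builder can make the `source` field impossible to forget. This is a sketch under assumptions: `make_scan_progress` is a hypothetical helper, and the `done` and `total` fields in the usage line are illustrative only, since the real payload fields are not listed here.

```python
# The three source values named in the rule above.
VALID_SOURCES = frozenset({"m365", "google", "file"})


def make_scan_progress(source: str, **fields) -> dict:
    """Build a scan_progress payload that always carries "source".

    Hypothetical helper: `source` is a required positional argument, so an
    engine cannot build a progress payload without naming itself.
    """
    if source not in VALID_SOURCES:
        raise ValueError(f"invalid scan_progress source: {source!r}")
    return {"source": source, **fields}


# Illustrative usage; "done" and "total" are assumed field names.
payload = make_scan_progress("google", done=3, total=10)
```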

file_sources

file_sources in profiles are stored as source ID strings by the JS frontend. The scheduler resolves them via _load_file_sources() before calling run_file_scan().
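The resolution step might look roughly like the sketch below. It assumes, without confirmation from the source, that `_load_file_sources()` yields a list of dicts each carrying an `"id"` key; `resolve_file_sources` and the sample data are hypothetical.

```python
def resolve_file_sources(source_ids, known_sources):
    """Map stored source ID strings to full source definitions.

    Assumption: `known_sources` is what _load_file_sources() would return,
    modelled here as a list of dicts with an "id" key. IDs that no longer
    exist are skipped, and the order of the profile's IDs is preserved.
    """
    by_id = {src["id"]: src for src in known_sources}
    return [by_id[sid] for sid in source_ids if sid in by_id]


# Illustrative data only; the IDs and paths are made up.
known = [
    {"id": "fs-1", "path": "/srv/share"},
    {"id": "fs-2", "path": "/mnt/archive"},
]
resolved = resolve_file_sources(["fs-2", "fs-9"], known)
```

The resolved definitions, not the raw ID strings, are what would then be handed to `run_file_scan()`.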

Circular import prohibition

scan_engine.py and gdpr_scanner.py must not import each other. scan_engine imports from sse, checkpoint, app_config, cpr_detector; gdpr_scanner imports scan functions from scan_engine.

_scan_bytes injection

scan_engine.py declares stub versions of _scan_bytes / _scan_bytes_timeout at module level. gdpr_scanner.py replaces them with the real cpr_detector implementations at startup. routes/google_scan.py pulls them from gdpr_scanner via __getattr__. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
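The injection pattern can be sketched in a single file by using an in-process module object as a stand-in for `scan_engine.py`. Everything here is illustrative: `scan_item`, `real_scan_bytes`, and `inject_scanners` are hypothetical names, and the "real" function is a placeholder rather than the `cpr_detector` logic.

```python
import types

# Stand-in for scan_engine.py, built in-process so the sketch is self-contained.
scan_engine = types.ModuleType("scan_engine")


def _stub_scan_bytes(data: bytes):
    """Module-level stub: fails loudly if used before injection."""
    raise RuntimeError("_scan_bytes has not been injected yet")


def scan_item(data: bytes):
    # Resolve _scan_bytes on the module at call time, not at import time,
    # so whatever was injected at startup is the function that runs.
    return scan_engine._scan_bytes(data)


scan_engine._scan_bytes = _stub_scan_bytes
scan_engine.scan_item = scan_item


def real_scan_bytes(data: bytes):
    """Hypothetical placeholder for the cpr_detector implementation."""
    return {"size": len(data)}


def inject_scanners():
    """What gdpr_scanner.py would do at startup: replace the stub."""
    scan_engine._scan_bytes = real_scan_bytes
```

The point of the design is that the engine never imports the detector wiring from `gdpr_scanner.py`; it only exposes a slot that the application fills in, which keeps the import graph one-directional.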

Gotchas

  • _load_settings() return — does NOT include file_sources. Returns only: sources, user_ids, options, retention_years, fiscal_year_end, email_to.
  • _save_settings() clobbers profile fields — called on every M365 scan start with only M365 sources/user_ids/options. The fix in app_config.py preserves google_sources and file_sources and rebuilds sources as m365_src + google_src + file_src. Do not simplify away this merge logic.
  • loadLastScanSummary() timing — must only be called after the first /api/scan/status poll resolves (inside _sseWatchdog in results.js, guarded by _initialStatusChecked). Calling it on DOMContentLoaded shows a stale "no results" card during a live scan after a hard refresh.
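The merge described in the `_save_settings()` gotcha can be sketched as follows. This is not the code in `app_config.py`: `merge_m365_settings` is a hypothetical function, and the sample source names are invented. Only the field names and the `m365_src + google_src + file_src` ordering come from the note above.

```python
def merge_m365_settings(existing: dict, m365_update: dict) -> dict:
    """Apply an M365-only settings update without clobbering other fields.

    `m365_update` carries only M365 sources, user_ids and options. The
    Google and file source lists are taken from the existing settings, and
    the combined `sources` list is rebuilt in a fixed order.
    """
    merged = dict(existing)
    merged.update(m365_update)

    m365_src = list(m365_update.get("sources", []))
    google_src = list(existing.get("google_sources", []))
    file_src = list(existing.get("file_sources", []))

    merged["google_sources"] = google_src
    merged["file_sources"] = file_src
    merged["sources"] = m365_src + google_src + file_src
    return merged


# Illustrative data only; the source names are made up.
existing = {
    "sources": ["mail", "gdrive", "fs-1"],
    "google_sources": ["gdrive"],
    "file_sources": ["fs-1"],
    "retention_years": 5,
}
update = {"sources": ["mail", "onedrive"], "user_ids": ["u1"], "options": {}}
merged = merge_m365_settings(existing, update)
```

A plain overwrite with `update` alone would drop `gdrive` and `fs-1` from `sources`, which is the clobbering the gotcha warns about.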