Routes — Architecture Rules
SSE constraints
SSE routes must live in gdpr_scanner.py, not blueprints — blueprints can't stream.
M365 scan emits scan_done; Google emits google_scan_done; file scan emits file_scan_done. Never mix them up.
scan_start is M365-only — run_scan() broadcasts scan_start; run_file_scan() and routes/google_scan.py must NOT. The scan_start handler in _attachSchedulerListeners (scan.js) unconditionally sets S._m365ScanRunning = true. If a file scan emits scan_start, the flag is set with no matching scan_done to clear it — file_scan_done checks !S._m365ScanRunning before re-enabling the scan button, so the button stays disabled permanently after the scan completes.
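
The failure mode above can be reproduced with a small Python model of the frontend flags. This is a sketch only: the real logic lives in JavaScript in scan.js, and the class name, the `click_scan()` method, and the exact button handling are assumptions made for illustration.

```python
class ScanUiState:
    """Python model of the frontend flags (the real code is JavaScript)."""

    def __init__(self):
        self.m365_scan_running = False
        self.scan_button_enabled = True

    def click_scan(self) -> None:
        # Assumption: starting any scan disables the button.
        self.scan_button_enabled = False

    def on_event(self, name: str) -> None:
        if name == "scan_start":
            self.m365_scan_running = True          # set unconditionally
        elif name == "scan_done":
            self.m365_scan_running = False
            self.scan_button_enabled = True
        elif name == "file_scan_done":
            if not self.m365_scan_running:         # guard described above
                self.scan_button_enabled = True


# Correct file scan: no scan_start, so the button comes back.
ok = ScanUiState()
ok.click_scan()
ok.on_event("file_scan_done")

# Buggy file scan: a stray scan_start leaves the flag set forever.
buggy = ScanUiState()
buggy.click_scan()
buggy.on_event("scan_start")
buggy.on_event("file_scan_done")
```

In the second run nothing ever emits `scan_done`, so `m365_scan_running` stays true and the button never re-enables.
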
scan_progress source field
All three scan engines must include "source": "m365" / "google" / "file" in every scan_progress SSE event. Never remove this field — the frontend uses it to route progress to the correct segment.
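
One way to make the field hard to forget is to route every progress broadcast through a single builder that refuses to run without a valid source. The helper name `make_scan_progress`, the extra payload fields, and the use of a named SSE event are assumptions; the project's actual broadcast function may look different.

```python
import json

VALID_SOURCES = {"m365", "google", "file"}  # the three values named in the rule


def make_scan_progress(source: str, **fields) -> str:
    """Build an SSE frame for a scan_progress event (hypothetical helper)."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown scan source: {source!r}")
    payload = {"source": source, **fields}
    # SSE wire format: event name, one data line, blank-line terminator.
    return f"event: scan_progress\ndata: {json.dumps(payload)}\n\n"


frame = make_scan_progress("file", scanned=12, total=40)
```
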
file_sources
file_sources in profiles are stored as source ID strings by the JS frontend. The scheduler resolves them via _load_file_sources() before calling run_file_scan().
Circular import prohibition
scan_engine.py and gdpr_scanner.py must not import each other. scan_engine imports from sse, checkpoint, app_config, cpr_detector; gdpr_scanner imports scan functions from scan_engine.
_scan_bytes injection
scan_engine.py declares stub versions of _scan_bytes / _scan_bytes_timeout at module level. gdpr_scanner.py replaces them with the real cpr_detector implementations at startup. routes/google_scan.py pulls them from gdpr_scanner via __getattr__. Never import these directly in blueprint or engine modules — that breaks the circular-import barrier.
M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit M365Error(Exception)):
| Exception | Trigger | Handler |
|---|---|---|
| `M365PermissionError` | 403 Forbidden | `scan_error` broadcast with human-readable permission hint |
| `M365DeltaTokenExpired` | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| `M365DriveNotFound` | 404 Not Found on any path | `scan_phase` broadcast ("not provisioned — skipped") in `_scan_user_onedrive`; full-scan path's `except Exception: return` also silences it |
M365DriveNotFound — why it exists: _get() previously fell through to raise_for_status() on 404, which was caught by the generic except Exception handler and broadcast as a red scan_error. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped silently. Do not add a 404 handler to _get() that returns a fallback value — that would silently mask genuine path bugs.
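
A minimal sketch of the hierarchy and the status mapping follows. The class names come from the table; the helper `raise_for_graph_status` and its `is_delta` parameter are hypothetical and only illustrate where the checks sit before a generic `raise_for_status()` call.

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365PermissionError(M365Error):
    """403 Forbidden."""


class M365DeltaTokenExpired(M365Error):
    """410 Gone on a delta endpoint."""


class M365DriveNotFound(M365Error):
    """404 Not Found on any path."""


def raise_for_graph_status(status_code: int, url: str, is_delta: bool = False) -> None:
    """Raise the specific exception for a status code, if one applies."""
    if status_code == 403:
        raise M365PermissionError(f"403 for {url}")
    if status_code == 404:
        # Raise rather than return a fallback value, so path bugs stay visible.
        raise M365DriveNotFound(f"404 for {url}")
    if status_code == 410 and is_delta:
        raise M365DeltaTokenExpired(f"410 for {url}")
```

Callers can then catch `M365DriveNotFound` specifically and skip the user, while every other `M365Error` still reaches the generic error path.
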
Export — routes/export.py
- `GDPRDb.get_session_sources()`: returns a `set` of source-key strings for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone; that dict only contains sources with flagged items.
- Excel Summary sheet: shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items.
- ART.30 breakdown table: iterates `scanned_sources` (not `by_source`) so Gmail, Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- Role-filtered exports: `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function; GPS sheet, External transfers sheet, and Art.30 tables all see only the filtered subset. Filenames get an `_elever` / `_ansatte` suffix.
- `POST /api/redact_item`: rewrites a file in-place with CPR numbers replaced by `██████-████` / `█` blocks, removes the card from the grid, logs a `"redacted"` disposition. Source types: `local` (DOCX/XLSX/CSV/TXT/PDF, written via temp+move), `onedrive` / `sharepoint` / `teams` (Graph download → redact → PUT, requires `Files.ReadWrite.All`), `gdrive` (Drive API, requires `drive` scope), `sftp` (paramiko read/write, item must still be in `state.flagged_items`), `smb` (`smbprotocol` `FILE_SUPERSEDE`). Keep `_redactExts` / `_cloudRedactExts` in `results.js` and `_REDACT_EXTS` / `_GDRIVE_MIME_MAP` / `_ALL_REDACTABLE_TYPES` in `export.py` in sync; the button and the route must agree.
- PDF redaction: `redact_pdf_secure` uses PyMuPDF `page.apply_redactions()` (physical removal). Falls back to reportlab overlay if PyMuPDF is absent. Text pages use `find_cpr_char_bboxes`; scanned pages use OCR at 200 DPI + `find_cpr_image_bboxes`.
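
The difference between iterating the scanned-source set and iterating `by_source` can be shown in a few lines. The function name `summary_rows` and the sample source keys are illustrative; only the idea (one row per scanned source, zero counts included) comes from the rules above.

```python
def summary_rows(scanned_sources: set, by_source: dict) -> list:
    """One (source, item_count) row per scanned source, including zero-hit ones."""
    return [(src, len(by_source.get(src, []))) for src in sorted(scanned_sources)]


scanned = {"gmail", "gdrive", "onedrive"}               # every source that was scanned
by_source = {"onedrive": [{"id": "a"}, {"id": "b"}]}    # only sources with flagged items

rows = summary_rows(scanned, by_source)
rows_from_by_source_only = [(src, len(items)) for src, items in by_source.items()]
```

Building the table from `by_source` alone yields a single row here, which is exactly the omission the rule warns against.
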
Preview — routes/database.py
GET /api/preview/<item_id>?source_type=…&account_id=… dispatches by source_type:
- `local` / `smb`: re-reads from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline.
- `email`: fetches M365 message body via Graph (requires `state.connector`).
- `gmail`: shows info card with "Open in Gmail" link (X-Frame-Options blocks embedding).
- `gdrive`: returns `https://drive.google.com/file/d/{id}/preview` iframe.
- All other values (M365 files): calls Graph `/preview` POST; tries `drive_id`-based path first, then user-drive, then `/me/drive`.
_source_type must be set in google_scan.py — Gmail items need meta["_source_type"] = "gmail" and Drive items "gdrive" before _broadcast_card. Without it, cards fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
state.connector guard — only the email and M365 else branches require M365 auth. The local/smb/gmail/gdrive branches must not gate on state.connector — they work in Google-only deployments.
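
The ordering of the connector guard matters more than the branch bodies, so the sketch below returns only a branch label. The function name and the labels are hypothetical; the real route renders content in each branch.

```python
def preview_branch(source_type: str, connector) -> str:
    """Return the branch a preview request would take (illustrative only)."""
    if source_type in ("local", "smb"):
        return "disk"
    if source_type == "gmail":
        return "gmail_card"
    if source_type == "gdrive":
        return "gdrive_iframe"
    # Only the remaining branches need an authenticated M365 connector,
    # so the guard sits after the Google and local branches.
    if connector is None:
        raise PermissionError("M365 connector required")
    return "graph_email" if source_type == "email" else "graph_preview"
```

With the guard placed here, `preview_branch("gmail", None)` still works in a Google-only deployment, while `preview_branch("email", None)` fails as intended.
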
Compliance audit log — gdpr_db.py + routes/
- `audit_log` table: created by `_DDL` (`CREATE TABLE IF NOT EXISTS`), auto-appears on next server start. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- `log_audit_event(action, detail, actor, ip)`: module-level helper; silently no-ops on any exception. Import: `from gdpr_db import log_audit_event as _audit`.
- `GET /api/audit_log?limit=200&action=<filter>`: in `routes/app_routes.py`. No auth gate.
- Recorded events: `profile_save/delete`, `token_create/revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear`, `source_add/update/delete`, `scheduler_job_save/delete`, `scan_start/stop`, `smtp_save`, `disposition`, `disposition_bulk`, `admin_pin_set/change`, `item_delete`, `item_redact`, `app_update`.
- `actor` always empty: no per-user login; field reserved for future use.
Email sending — routes/email.py + m365_connector.py
- `_post()` returns `{}` on empty body: Graph `sendMail` returns HTTP 202 with no body; `r.json()` on an empty body raises `JSONDecodeError`. Do not revert to unconditional `r.json()`.
- Graph preferred over SMTP: `smtp_test` and `send_report` try `_send_email_graph()` first; fall back to SMTP only if Graph raises. If Graph fails and no SMTP host is saved, the Graph exception surfaces directly.
- Auto-email after manual scan: `_maybe_send_auto_email()` in `routes/scan.py` is called from the `_run()` thread after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")`; no-ops if false, no flagged items, or no recipients.
- Gmail vs Google Workspace: auth error handlers check if the SMTP username ends in `@gmail.com` / `@googlemail.com`; custom domains are treated as Google Workspace and the error message points to the Workspace admin console.
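
Two of the rules above reduce to small pure functions. Both names are hypothetical, and the case-insensitive comparison in `is_consumer_gmail` is an assumption rather than something the rules state.

```python
import json


def parse_json_body(body: bytes) -> dict:
    """Return {} for an empty body instead of raising JSONDecodeError."""
    if not body.strip():
        return {}
    return json.loads(body)


def is_consumer_gmail(smtp_username: str) -> bool:
    """True for consumer Gmail addresses; other domains count as Workspace."""
    return smtp_username.lower().endswith(("@gmail.com", "@googlemail.com"))


empty = parse_json_body(b"")             # what a 202 sendMail response carries
parsed = parse_json_body(b'{"id": "1"}')
```
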
Scheduler — scan_scheduler.py + routes/scheduler.py
- Job config keys: `id`, `name`, `enabled`, `frequency` (`daily`/`weekly`/`monthly`), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`.
- `_execute_scan(job_id)`: acquires per-job lock (`_running_jobs` set), records DB run via `db.begin_schedule_run()`, runs M365 → file → Google pipeline, then emails and applies retention. DB run finalised in `finally`.
- Report-only path: when `report_only=True`, short-circuits before M365 auth check, populates `_m.flagged_items` from `db.get_session_items()` if empty, calls `_send_email_report()`. Does NOT acquire scan lock; fails with `RuntimeError("No scan results available")` if DB is also empty.
- `_m.flagged_items` and `state.flagged_items` are the same object: assigned at startup; in-place updates (`flagged_items[:] = ...`) propagate to both.
- `scheduler_started` / `scheduler_done` SSE events: separate from `scan_done` (M365). `scheduler_done` carries `flagged`, `scanned`, `emailed`, `job_name`.
- Profile options merge into file sources: scheduler unpacks `{**fs, **_fs_extra}` before calling `run_file_scan(fs)`. Do not pass `fs` directly; the file scan reads `source.get(...)` and silently falls back to defaults without the merge.
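
Two of these rules depend on plain Python semantics and are easy to check in isolation. The source record, the option keys (`path`, `scan_photos`, `max_file_mb`), and the `SimpleNamespace` stand-ins for `state` and `_m` are all hypothetical.

```python
from types import SimpleNamespace

# Part 1: merging profile options into a file source before the scan.
fs = {"id": "src-1", "path": "/data/share"}             # hypothetical source record
_fs_extra = {"scan_photos": True, "max_file_mb": 50}    # hypothetical profile options
merged = {**fs, **_fs_extra}                            # keys in _fs_extra win on conflict

# Part 2: two names sharing one list object.
flagged_items = []
state = SimpleNamespace(flagged_items=flagged_items)
_m = SimpleNamespace(flagged_items=flagged_items)

_m.flagged_items[:] = [{"id": "a"}]                     # in-place: both names see it
shared_after_slice = state.flagged_items == [{"id": "a"}]

_m.flagged_items = [{"id": "b"}]                        # rebinding: the link is broken
shared_after_rebind = state.flagged_items is _m.flagged_items
```

Slice assignment keeps both names pointing at the same list; plain assignment silently detaches them, which is why the rule insists on `flagged_items[:] = ...`.
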
Claude NER — document_scanner.py + app_config.py + routes/app_routes.py
Optional AI-powered NER replacing spaCy. Activated via config.json keys claude_ner (bool) and claude_api_key (str, Fernet-encrypted at rest with an enc: prefix — same scheme as the SMTP password).
- `ANTHROPIC_OK`: module-level flag in `document_scanner.py`; `True` if `anthropic` is importable. Guards all Claude code paths.
- `_ner_claude(text, api_key)`: calls `claude-haiku-4-5-20251001` in 8 000-char chunks. Thread-safe cache keyed by `hash(text)`, evicts oldest when > 2 000 entries.
- Always read the key via `app_config.get_claude_api_key()`; it decrypts and transparently handles legacy plaintext. Never read `config.json["claude_api_key"]` directly; `save_claude_config()` writes it encrypted.
- `GET/POST /api/settings/claude`: GET returns `{"enabled": bool, "api_key_set": bool}` (never exposes key). POST accepts `{"enabled": bool, "api_key": "..."}`; omitting `api_key` leaves the stored key unchanged.
- `POST /api/settings/claude/test`: minimal 8-token API call; returns `{"ok": true}` or `{"ok": false, "error": "..."}`.
- Do not import `anthropic` at module level outside `document_scanner.py`; `routes/app_routes.py` imports it locally inside the function body so the server starts without the package.
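
The chunking and the bounded cache can be sketched without calling any API. Here `run_ner` is a hypothetical callable standing in for the Claude request, and `cached_ner` / `chunks` are illustrative names; only the 8 000-character chunk size, the `hash(text)` key, and the 2 000-entry limit come from the rules above.

```python
import threading
from collections import OrderedDict

CHUNK_CHARS = 8000
MAX_ENTRIES = 2000

_cache = OrderedDict()
_cache_lock = threading.Lock()


def chunks(text: str, size: int = CHUNK_CHARS) -> list:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def cached_ner(text: str, run_ner) -> list:
    """Return cached entities for text, calling run_ner(chunk) on a miss."""
    key = hash(text)
    with _cache_lock:
        if key in _cache:
            return _cache[key]
    # The slow call happens outside the lock so other threads are not blocked.
    entities = [e for chunk in chunks(text) for e in run_ner(chunk)]
    with _cache_lock:
        _cache[key] = entities
        while len(_cache) > MAX_ENTRIES:
            _cache.popitem(last=False)  # drop the oldest inserted entry
    return entities
```
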
Software update — routes/updates.py
- Git-checkout only: `_supported()` requires a `.git` dir and not `sys.frozen`. The frozen desktop build gets `{"supported": false}` and the UI hides the Settings group.
- `POST /api/update/apply`: stash-if-dirty → `merge --ff-only origin/<branch>` → pip install only if `requirements.txt` changed → audit `app_update` → `_schedule_restart()` re-execs the process via `os.execv` (same PID; works under systemd and `start_gdpr.sh`). Refuses with `code: "scan_running"` (409) while `state._scan_lock` or `state._google_scan_lock` is held.
- `apply_update()` never restarts itself; callers decide. Tests patch `_schedule_restart`; the auto-update thread calls `_restart_self()` directly.
- Auto-update thread: `start_auto_update_thread()` is called from `gdpr_scanner.py` `__main__`. Hourly tick, applies at most once per 24 h when `config.json["auto_update"]` is true; skips (and retries next tick) while a scan runs.
- `update_gdpr.sh`: standalone CLI/cron equivalent of the same logic; keep stash/ff-only/requirements behaviour in sync.
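
The gating described for the auto-update thread can be written as one pure decision function evaluated on each hourly tick. The function name and the idea of tracking a `last_applied` timestamp are assumptions; the real thread may record the 24 h window differently.

```python
DAY_SECONDS = 24 * 60 * 60


def should_apply_update(now: float, last_applied: float,
                        auto_update: bool, scan_running: bool) -> bool:
    """Decide on one hourly tick whether to apply an update."""
    if not auto_update:
        return False
    if scan_running:
        return False  # skip this tick; the next tick retries
    return now - last_applied >= DAY_SECONDS
```

Keeping the decision free of side effects makes it easy to test without patching the restart helper.
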
Viewer mode — routes/viewer.py
- `/view` auth chain: token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form → 403. Never skip this order.
- Token scope: stored as `"scope": {"role": "student"|"staff"}`, `{"user": [...], "display_name": "..."}`, or `{}` in `viewer_tokens.json`. Enforced server-side in `GET /api/db/flagged`. Column name is `user_role`; do not use `role`.
- `session["viewer_scope"]`: set at `/view` token validation. `GET /api/db/flagged` reads `session.get("viewer_scope", {})`, which defaults to `{}` (unrestricted) for PIN-authenticated sessions.
- `viewer_tokens.json` format: `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Old bare-list format handled transparently. Do not write as bare list.
- Rate-limit state (`_pin_attempts` dict): in-memory only, resets on server restart. Intentional.
- User-scoped tokens: `scope.user` is always a list; legacy single-string coerced on read. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role` + `user` scope with 400.
- Date-range scoping: `valid_from` / `valid_to` (`YYYY-MM-DD`) in scope dict; filtered via lexicographic string comparison in `GET /api/db/flagged`. Server validates format and enforces `valid_from ≤ valid_to`.
- `app.secret_key`: derived from `machine_id` bytes so sessions survive restarts. Set once at startup; do not override.
- Flask binds to `0.0.0.0`: `gdpr_scanner.py`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs intentionally keep `127.0.0.1`.
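
The scope handling above leans on two small invariants: `scope.user` is always a list, and zero-padded ISO dates compare correctly as strings. The helper names are hypothetical, and the assumption that an item's date starts with `YYYY-MM-DD` is not confirmed by the rules.

```python
import re
from datetime import date

_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def normalize_users(value) -> list:
    """Coerce a legacy single-string user scope to a list."""
    return [value] if isinstance(value, str) else list(value or [])


def validate_range(valid_from: str, valid_to: str) -> None:
    """Check strict YYYY-MM-DD format and ordering; raise ValueError otherwise."""
    for value in (valid_from, valid_to):
        if not _ISO_DATE.fullmatch(value):
            raise ValueError(f"not YYYY-MM-DD: {value!r}")
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    if valid_from > valid_to:
        raise ValueError("valid_from must not be after valid_to")


def in_range(item_date: str, valid_from: str, valid_to: str) -> bool:
    """Zero-padded ISO dates sort lexicographically, so string comparison works."""
    return valid_from <= item_date[:10] <= valid_to
```

The strict format check is what makes the string comparison safe: an unpadded value such as `2024-1-5` would sort incorrectly, so it is rejected up front.
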
Gotchas
- `_load_settings()` return: does NOT include `file_sources`. Returns only: `sources`, `user_ids`, `options`, `retention_years`, `fiscal_year_end`, `email_to`.
- `_save_settings()` clobbers profile fields: called on every M365 scan start with only M365 sources/user_ids/options. The fix in `app_config.py` preserves `google_sources` and `file_sources` and rebuilds `sources` as `m365_src + google_src + file_src`. Do not simplify away this merge logic.
- `loadLastScanSummary()` timing: must only be called after the first `/api/scan/status` poll resolves (inside `_sseWatchdog` in `results.js`, guarded by `_initialStatusChecked`). Calling it on `DOMContentLoaded` shows a stale "no results" card during a live scan after a hard refresh.
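
The merge that `_save_settings()` must keep can be sketched as a pure function. The name `merge_profile_on_save` and the exact profile layout are assumptions; the key names and the `m365_src + google_src + file_src` ordering come from the rule above.

```python
def merge_profile_on_save(existing: dict, m365_src: list,
                          user_ids: list, options: dict) -> dict:
    """Update M365 fields while keeping Google and file sources intact."""
    google_src = existing.get("google_sources", [])
    file_src = existing.get("file_sources", [])
    return {
        **existing,
        "user_ids": user_ids,
        "options": options,
        "google_sources": google_src,
        "file_sources": file_src,
        # Rebuild the combined list instead of overwriting it with M365 only.
        "sources": m365_src + google_src + file_src,
    }


existing = {
    "google_sources": ["gmail"],
    "file_sources": ["src-1"],
    "sources": ["email", "gmail", "src-1"],
}
saved = merge_profile_on_save(existing, ["email", "onedrive"], ["u1"], {})
```

A naive save that wrote `sources = m365_src` would drop `gmail` and `src-1` from the profile on every M365 scan start.
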