GDPRScanner — Claude Code Context
A GDPR compliance scanner for Danish educational and municipal organisations. Scans Microsoft 365 (Exchange, OneDrive, SharePoint, Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems for CPR numbers and PII. Produces Excel reports, GDPR Article 30 Word documents, and supports disposition tagging, bulk deletion, scheduled scans, and multi-language UI.
How to run
source venv/bin/activate
python gdpr_scanner.py # http://0.0.0.0:5100 (all interfaces)
python -m pytest tests/ -q
Architecture
Entry point: gdpr_scanner.py — Flask app, scan orchestration globals. SSE route must stay here — blueprints can't stream.
Split modules: scan_engine.py (M365 + file scan), sse.py (SSE broadcast), checkpoint.py, app_config.py (all persistence), cpr_detector.py
Blueprints in routes/ — see routes/CLAUDE.md for state/SSE rules.
Frontend: templates/index.html (SPA), static/style.css (all styles), static/js/*.js (11 ES modules + state.js). static/app.js is an archived monolith — no longer loaded.
Data dir ~/.gdprscanner/: scanner.db, config.json, settings.json, schedule.json, token.json, delta.json, checkpoint.json, smtp.json, machine_id (never delete — Fernet key), role_overrides.json, google_sa.json, google.json, src_toggles.json, app.lock, viewer_tokens.json
Non-obvious files
| File | Why it's not obvious |
|---|---|
| `app_config.py` | All persistence: profiles, settings, SMTP, lang loading, viewer tokens + PIN |
| `routes/state.py` | Shared mutable state + scan locks (not a typical Flask state file) |
| `routes/google_scan.py` | Google scan execution lives here, not in `google_connector.py` |
| `routes/viewer.py` | Viewer token + PIN API; also owns brute-force rate-limit state |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI |
| `lang/da.json` | Primary language; source of truth is `en.json` |
| `build_gdpr.py` | Desktop app builder; contains embedded `LAUNCHER_CODE` for PyInstaller |
Tests
128 tests in tests/. No integration tests for Flask routes or live M365/Google connections.
Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
Read-only access for DPOs and reviewers. Key invariants:
- `/view` auth chain: token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- `window.VIEWER_MODE`: injected by Jinja2 in `index.html`. `auth.js` reads it at startup and adds the `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks, except `delBtn` in the card builder, which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- `viewer_tokens.json` format: stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. The old bare-list format is migrated transparently on first write. Do not write the file as a bare list.
- `app.secret_key`: derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- `GET /api/db/flagged`: returns `get_session_items()` (last completed scan session, joined with dispositions). Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- Rate-limit state (`_pin_attempts` dict in `routes/viewer.py`): in-memory only, resets on server restart. This is intentional: a restart clears lockouts without a persistent store.
- Token onclick attributes: Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- Settings Security pane: Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- `stClearViewerPin` guard: validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- Share link base URL: `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin`; that produces `127.0.0.1` links useless to remote viewers.
- Flask binds to `0.0.0.0`: `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1`; do not change those to `0.0.0.0`.
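
The UDP probe behind `/api/local_ip` could look roughly like the sketch below. This is not the project's code: the function name `detect_lan_ip`, the probe port `80`, and the server-side fallback to `127.0.0.1` are all assumptions made for illustration. It relies on the fact that `connect()` on a UDP socket only selects a route and a local address, without sending a packet.

```python
import socket

FALLBACK_IP = "127.0.0.1"  # assumption: what to report when no route exists


def detect_lan_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the local address the OS would use to reach probe_host.

    Hypothetical helper; the probe port is an arbitrary choice because
    no datagram is actually sent.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # For UDP, connect() just binds the socket to the outgoing
            # interface chosen by the routing table.
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
    except OSError:
        # No network, no route, or sockets unavailable.
        return FALLBACK_IP


if __name__ == "__main__":
    print(detect_lan_ip())
```

On a machine with no route to the probe address this returns the loopback fallback, which is the case where share links are not reachable from other machines anyway.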
Sources panel resize — static/js/log.js + sources.js
- `_fitSourcesPanel()`: called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- `_initSourcesResize()`: attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML. Height is controlled entirely by `_fitSourcesPanel()` at runtime.
- Do not call `_fitSourcesPanel()` before the panel has rendered, because `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- Email body stripped at collection time: `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- `work_items` → `deque` before processing: converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- `del content` in file branch: raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have an explicit `del content`.
- `del body_text` in email branch: deleted after the `_broadcast_card` call.
- PDF OCR images freed page-by-page: in `document_scanner.scan_pdf`, `images[page_num-1] = None` immediately after OCR. Do not cache or accumulate page images.
- Memory guard: `psutil.virtual_memory().available` is checked before each M365 file download; the scan skips the file if < 300 MB is free.
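
The first two rules can be illustrated with a minimal sketch. The helper names `strip_body` and `drain`, and the callables passed in, are hypothetical and not taken from `scan_engine.py`; only the `_precomputed_body`, `body`, and `bodyPreview` keys come from the rules above.

```python
from collections import deque


def strip_body(msg: dict, extract_text) -> dict:
    """Keep only the extracted text; drop the bulky raw body fields."""
    msg["_precomputed_body"] = extract_text(msg)
    msg.pop("body", None)
    msg.pop("bodyPreview", None)
    return msg


def drain(work_items: list, process) -> list:
    """Process items one at a time, dropping each reference as we go."""
    queue = deque(work_items)
    # Assumption: the caller no longer needs the list. While any list
    # still references the items, popleft() alone cannot free them.
    work_items.clear()
    results = []
    while queue:
        meta = queue.popleft()  # the queue no longer holds this item
        body_text = meta.pop("_precomputed_body", "")
        results.append(process(meta, body_text))
        del body_text, meta  # release before fetching the next item
    return results


if __name__ == "__main__":
    raw = [{"id": 1, "body": {"content": "hello"}, "bodyPreview": "he"}]
    items = [strip_body(m, lambda m: m["body"]["content"]) for m in raw]
    print(drain(items, lambda meta, text: (meta["id"], len(text))))
```

Iterating with `enumerate()` over a list would keep every item alive until the loop ends, which is why the rules insist on the `popleft()` drain.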
Export — routes/export.py
- `GDPRDb.get_session_sources()`: returns a `set` of source-key strings (e.g. `{"gmail", "gdrive", "email"}`) for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone; that dict only contains sources with flagged items.
- Excel Summary sheet vs. per-source tabs: the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- ART.30 breakdown table: iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
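
A minimal sketch of why the summary must iterate the scanned-source set rather than `by_source`. The function `summary_rows` and its two-column row shape are hypothetical; the real tables have more columns.

```python
def summary_rows(scanned_sources: set, by_source: dict) -> list:
    """One (source, flagged_count) row per scanned source, zero hits included."""
    rows = []
    for key in sorted(scanned_sources):
        # by_source only has entries for sources with flagged items,
        # so a missing key means "scanned, nothing found".
        rows.append((key, len(by_source.get(key, []))))
    return rows


if __name__ == "__main__":
    scanned = {"gmail", "gdrive", "email"}
    flagged = {"email": ["item-a", "item-b"]}
    for row in summary_rows(scanned, flagged):
        print(row)
```

Iterating `flagged` alone would yield a single row for `email` and silently omit the two sources that were scanned and found clean.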
SSE teardown — static/js/scan.js
- Do not close `S.es` in `scan_done` if other scans are still running. M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
- Rule: close `S.es` (and reset `S._userStartedScan`) only inside the branch where all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- Scheduled scans: `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
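
The teardown rule can be modelled as a single predicate. The real logic lives in JavaScript in `scan.js`; the Python function below is only a language-neutral sketch, and `should_close_stream` and its argument names are hypothetical.

```python
def should_close_stream(finished: str, running: dict, user_started: bool) -> bool:
    """Decide whether the SSE connection may be closed after a done event.

    finished     -- which scan just reported done: "m365", "google" or "file"
    running      -- mapping of scan name to its "still running" flag
    user_started -- False for scheduler-triggered runs
    """
    others_running = any(
        active for name, active in running.items() if name != finished
    )
    # Close only for user-started scans, and only once every other
    # concurrent scan has also finished.
    return bool(user_started and not others_running)


if __name__ == "__main__":
    flags = {"m365": False, "google": True, "file": False}
    print(should_close_stream("m365", flags, True))   # Google still running
    flags["google"] = False
    print(should_close_stream("google", flags, True))  # last scan finished
```

The model shows both failure modes the rules guard against: closing while a sibling scan is active, and closing a stream that scheduler events still depend on.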
Global gotchas
- Pattern matching in Python: when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Check with `in` first and print a message if the target is not found.
- `__getattr__` on modules: only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- `JSON.stringify` inside `onclick="…"` attributes: produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler.
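
A small sketch of the first gotcha: guarding a `str.replace()` patch with an `in` check so a whitespace or quote mismatch is reported instead of silently doing nothing. The helper name `patch_once` is hypothetical.

```python
def patch_once(text: str, old: str, new: str) -> tuple:
    """Replace the first occurrence of old; report when it is absent."""
    if old not in text:
        # str.replace() would return the text unchanged without any
        # warning, so make the miss visible.
        print(f"patch target not found: {old!r}")
        return text, False
    return text.replace(old, new, 1), True


if __name__ == "__main__":
    html = '<button onclick="go()">Run</button>'
    # Exact quote style matches, so the patch applies.
    print(patch_once(html, 'onclick="go()"', 'onclick="stop()"'))
    # Single quotes do not match the double-quoted source.
    print(patch_once(html, "onclick='go()'", "onclick='stop()'"))
```

Returning the success flag alongside the text lets a build script fail fast when a patch target has drifted.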
Directory-scoped rules
- `routes/CLAUDE.md`: SSE constraints, scan_progress source field, file_sources, Python gotchas
- `static/js/CLAUDE.md`: profile dropdown, progress bar phase parsing, JS gotchas
- `templates/CLAUDE.md`: CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md`: i18n conventions