Add a Role scope dropdown to the Share modal (All roles / Ansatte / Elever).
Scope is stored as {"role": "student"|"staff"} in viewer_tokens.json and
enforced server-side in GET /api/db/flagged via session["viewer_scope"].
Client-side, #filterRole is pre-set and hidden for scoped viewers so the
constraint cannot be bypassed. Existing tokens and PIN sessions remain
unrestricted. Role badge shown on each scoped token row in the Active links list.
Files: app_config.py, routes/viewer.py, routes/database.py, gdpr_scanner.py,
templates/index.html, static/js/viewer.js, static/js/auth.js,
lang/en.json, lang/da.json, lang/de.json,
CLAUDE.md, CHANGELOG.md, README.md, MANUAL-EN.md, MANUAL-DA.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GDPRScanner — Claude Code Context
A GDPR compliance scanner for Danish educational and municipal organisations. Scans Microsoft 365 (Exchange, OneDrive, SharePoint, Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems for CPR numbers and PII. Produces Excel reports, GDPR Article 30 Word documents, and supports disposition tagging, bulk deletion, scheduled scans, and multi-language UI.
How to run
```bash
source venv/bin/activate
python gdpr_scanner.py        # http://0.0.0.0:5100 (all interfaces)
python -m pytest tests/ -q
```
Architecture
Entry point: gdpr_scanner.py — Flask app, scan orchestration globals. SSE route must stay here — blueprints can't stream.
Split modules: scan_engine.py (M365 + file scan), sse.py (SSE broadcast), checkpoint.py, app_config.py (all persistence), cpr_detector.py
Blueprints in routes/ — see routes/CLAUDE.md for state/SSE rules.
Frontend: templates/index.html (SPA), static/style.css (all styles), static/js/*.js (11 ES modules + state.js). static/app.js is an archived monolith — no longer loaded.
Data dir ~/.gdprscanner/: scanner.db, config.json, settings.json, schedule.json, token.json, delta.json, checkpoint.json, smtp.json, machine_id (never delete — Fernet key), role_overrides.json, google_sa.json, google.json, src_toggles.json, app.lock, viewer_tokens.json
Non-obvious files
| File | Why it's not obvious |
|---|---|
| `app_config.py` | All persistence: profiles, settings, SMTP, lang loading, viewer tokens + PIN |
| `routes/state.py` | Shared mutable state + scan locks (not a typical Flask state file) |
| `routes/google_scan.py` | Google scan execution lives here, not in `google_connector.py` |
| `routes/viewer.py` | Viewer token + PIN API; also owns brute-force rate-limit state |
| `static/js/viewer.js` | Share modal, token CRUD, viewer PIN settings UI |
| `lang/da.json` | Primary language; source of truth is `en.json` |
| `build_gdpr.py` | Desktop app builder; contains embedded `LAUNCHER_CODE` for PyInstaller |
Tests
128 tests in tests/. No integration tests for Flask routes or live M365/Google connections.
Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
Read-only access for DPOs and reviewers. Key invariants:
- `/view` auth chain: token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- `window.VIEWER_MODE`: injected by Jinja2 in `index.html`. `auth.js` reads it at startup and adds the `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks, except `delBtn` in the card builder, which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- `window.VIEWER_SCOPE`: injected alongside `VIEWER_MODE`. Contains the scope dict from the token (e.g. `{"role": "student"}`). An empty object `{}` means unrestricted. `auth.js` reads it at startup; if `VIEWER_SCOPE.role` is set, it pre-sets `#filterRole` to that value and hides the dropdown so the viewer cannot change it.
- Token scope: stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard.
- `session["viewer_scope"]`: set when a token is validated at `/view`. Persists for the browser session alongside `session["viewer_ok"]`. `/api/db/flagged` reads it with `session.get("viewer_scope", {})`, which defaults to `{}` (unrestricted) for PIN-authenticated sessions and legacy tokens without a scope key.
- `viewer_tokens.json` format: stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Token dicts now include `"scope": {}`. The old bare-list format and tokens without a `scope` key are handled transparently (`t.get("scope", {})`). Do not write the file as a bare list.
- `app.secret_key`: derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- `GET /api/db/flagged`: returns `get_session_items()` (last completed scan session, joined with dispositions), filtered by `session["viewer_scope"].role` when set. Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- Rate-limit state (`_pin_attempts` dict in `routes/viewer.py`): in-memory only, resets on server restart. This is intentional: a restart clears lockouts without a persistent store.
- Token onclick attributes: Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- Settings Security pane: Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- `stClearViewerPin` guard: validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- Share link base URL: `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin`, which produces `127.0.0.1` links useless to remote viewers.
- Flask binds to `0.0.0.0`: `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1`; do not change those to `0.0.0.0`.
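
The token-format and scope rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `routes/viewer.py` or `routes/database.py`: the helper names `normalize_token_store` and `filter_by_scope` are hypothetical, and the example assumes each item is a dict with a `role` key.

```python
from __future__ import annotations

from typing import Any


def normalize_token_store(raw: Any) -> dict:
    """Return the {"tokens": [...]} layout, also accepting the old bare-list format.

    Hypothetical helper: it only illustrates the compatibility rule.
    """
    store = {"tokens": raw} if isinstance(raw, list) else dict(raw)
    store.setdefault("tokens", [])
    # Legacy tokens have no "scope" key; treat them as unrestricted ({}).
    store["tokens"] = [{**t, "scope": t.get("scope", {})} for t in store["tokens"]]
    return store


def filter_by_scope(items: list[dict], scope: dict) -> list[dict]:
    """Keep only items whose role matches the scope; {} means unrestricted."""
    role = scope.get("role")
    if not role:
        return list(items)
    return [item for item in items if item.get("role") == role]


# Example data, invented for illustration.
items = [
    {"id": 1, "role": "student"},
    {"id": 2, "role": "staff"},
]
legacy = normalize_token_store([{"token": "abc"}])
student_view = filter_by_scope(items, {"role": "student"})
pin_view = filter_by_scope(items, {})  # what session.get("viewer_scope", {}) yields for PIN sessions
```

Keeping the filter on the server side means a viewer who un-hides `#filterRole` in the browser still receives only the rows the token allows.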
Sources panel resize — static/js/log.js + sources.js
- `_fitSourcesPanel()`: called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- `_initSourcesResize()`: attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML. Height is controlled entirely by `_fitSourcesPanel()` at runtime.
- Do not call `_fitSourcesPanel()` before the panel has rendered, because `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
Scan filter options — scan_engine.py
Both options live in the profile options dict and apply to all three scan engines (M365, Google, file scan).
- `skip_gps_images` (bool, default `false`): when enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii`, which rechecks `pii_fields` and `author` when GPS is skipped.
- `min_cpr_count` (int, default `1`): minimum number of distinct CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(cprs))` to preserve order. Files with faces or EXIF PII are still flagged regardless of CPR count; the threshold gates only CPR-based hits.
- File scan reads both from `source` dict keys (passed directly from the `/api/file_scan/start` payload). M365 scan reads both from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate.
- UI: sidebar controls `#optSkipGps` (toggle) and `#optMinCpr` (number); profile editor controls `#peOptSkipGps` and `#peOptMinCpr`. Both are saved/loaded by `profiles.js`.
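
How the two options combine at the flagging gate can be shown with a small sketch. It is an illustration under assumptions rather than the logic in `scan_engine.py`: the function signatures are guesses, the `gps` key in the EXIF dict is a hypothetical name, and the CPR-like strings are made up.

```python
from __future__ import annotations


def cpr_qualifies(cprs: list[str], min_cpr_count: int = 1) -> bool:
    """True when the number of distinct CPR hits reaches the threshold."""
    distinct = list(dict.fromkeys(cprs))  # order-preserving de-duplication
    # Assumption: a threshold below 1 is treated as 1, so an empty list never qualifies.
    return len(distinct) >= max(1, min_cpr_count)


def exif_has_pii(exif: dict, skip_gps_images: bool = False) -> bool:
    """True when EXIF data alone should flag the image.

    "gps" is a hypothetical key; "pii_fields" and "author" are named in the notes above.
    """
    other_pii = bool(exif.get("pii_fields") or exif.get("author"))
    if skip_gps_images:
        return other_pii  # GPS coordinates alone no longer count
    return other_pii or bool(exif.get("gps"))


def should_flag(
    cprs: list[str],
    exif: dict,
    has_faces: bool = False,
    *,
    skip_gps_images: bool = False,
    min_cpr_count: int = 1,
) -> bool:
    """Faces and EXIF PII flag regardless of CPR count; the threshold gates only CPR hits."""
    return (
        has_faces
        or exif_has_pii(exif, skip_gps_images)
        or cpr_qualifies(cprs, min_cpr_count)
    )


repeated = ["010190-0001", "010190-0001", "020280-0002"]  # made-up values, two distinct
gps_only = {"gps": (55.68, 12.57)}
```

With `min_cpr_count=3` and `skip_gps_images=True`, `should_flag(repeated, gps_only, ...)` is false in this sketch, because only two distinct numbers are present and GPS is the only EXIF signal.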
Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- Email body stripped at collection time: `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- `work_items` → `deque` before processing: converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- `del content` in file branch: raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have an explicit `del content`.
- `del body_text` in email branch: deleted after the `_broadcast_card` call.
- PDF OCR images freed page-by-page: in `document_scanner.scan_pdf`, `images[page_num-1] = None` is set immediately after OCR. Do not cache or accumulate page images.
- Memory guard: `psutil.virtual_memory().available` is checked before each M365 file download; the scan skips the file if less than 300 MB is free.
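
The collection-time stripping and the `deque` drain can be sketched as follows. This is a simplified model, not the code in `scan_engine.py`: `strip_body`, `drain`, and `enough_memory` are hypothetical names, the 300 MB threshold is assumed to mean 300 × 1024 × 1024 bytes, and the available-memory figure is passed in as a plain integer where the real scanner reads `psutil.virtual_memory().available`.

```python
from __future__ import annotations

from collections import deque
from typing import Callable

MIN_FREE_BYTES = 300 * 1024 * 1024  # assumption: "300 MB" read as MiB


def strip_body(msg: dict, body_text: str) -> dict:
    """Keep only the extracted text; drop the heavy raw fields."""
    msg["_precomputed_body"] = body_text
    msg.pop("body", None)
    msg.pop("bodyPreview", None)
    return msg


def drain(work_items: list[dict], process: Callable[[dict, str], None]) -> int:
    """Process items one by one, releasing each as soon as it is handled."""
    queue = deque(work_items)
    # The original list would otherwise keep every item alive until the end.
    work_items.clear()
    handled = 0
    while queue:
        meta = queue.popleft()
        body_text = meta.pop("_precomputed_body", "")
        process(meta, body_text)
        del body_text, meta  # nothing references the item after this point
        handled += 1
    return handled


def enough_memory(available_bytes: int) -> bool:
    """Mirror of the download guard: skip the file when too little memory is free."""
    return available_bytes >= MIN_FREE_BYTES


# Example with invented data.
msgs = [strip_body({"id": "m1", "body": {"content": "<p>hi</p>"}, "bodyPreview": "hi"}, "hi")]
seen: list[tuple[str, str, bool]] = []
count = drain(msgs, lambda meta, text: seen.append((meta["id"], text, "body" in meta)))
```

Iterating with `enumerate()` over a list would keep all items referenced until the loop ends, which is the behaviour the `popleft()` drain avoids.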
Export — routes/export.py
- `GDPRDb.get_session_sources()`: returns a `set` of source-key strings (e.g. `{"gmail", "gdrive", "email"}`) for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone, because that dict only contains sources with flagged items.
- Excel Summary sheet vs. per-source tabs: the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- ART.30 breakdown table: iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- Role-filtered exports: `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly, so the GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get an `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param; do not separate them.
SSE teardown — static/js/scan.js
- Do not close `S.es` in `scan_done` if other scans are still running. M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
- Rule: close `S.es` (and reset `S._userStartedScan`) only inside the branch where all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- Scheduled scans: `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
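
The teardown rule is easier to check as a pure function. The sketch below models it in Python instead of the browser JavaScript used in `scan.js`; `should_close_stream` and its arguments are invented for the example, and the engine names are just dictionary keys.

```python
from __future__ import annotations


def should_close_stream(running: dict[str, bool], finished: str, user_started: bool) -> bool:
    """Decide whether the event stream may be closed when one engine reports done.

    running      -- running flag per engine, e.g. {"m365": ..., "google": ..., "file": ...}
    finished     -- the engine whose done event just arrived
    user_started -- False for scheduler-triggered runs, which never close the stream
    """
    others_running = any(active for name, active in running.items() if name != finished)
    return user_started and not others_running


# M365 finishes while Google is still scanning: keep the stream open.
m365_first = should_close_stream(
    {"m365": False, "google": True, "file": False}, finished="m365", user_started=True
)

# Google is the last engine to finish: now the stream can be closed.
google_last = should_close_stream(
    {"m365": False, "google": False, "file": False}, finished="google", user_started=True
)

# Scheduler-triggered run: never close, so later scheduler events still arrive.
scheduled = should_close_stream(
    {"m365": False, "google": False, "file": False}, finished="file", user_started=False
)
```

Each done handler only inspects the flags of the other two engines, which matches the three checks listed above.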
Global gotchas
- Pattern matching in Python: when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Use an `in` check first and print if not found.
- `__getattr__` on modules: only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- `JSON.stringify` inside `onclick="…"` attributes: produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler.
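
The first two gotchas can be reproduced in a few lines. This is a minimal sketch: `patch_once` is a hypothetical helper, and `demo_mod` is a throwaway module built in memory purely to show the lookup behaviour.

```python
import types


def patch_once(text: str, old: str, new: str) -> str:
    """Replace the first occurrence of old, reporting loudly when it is absent."""
    if old not in text:
        # str.replace() would silently return the text unchanged here.
        print(f"pattern not found: {old!r}")
        return text
    return text.replace(old, new, 1)


html = '<button onclick="run()">Go</button>'
patched = patch_once(html, 'onclick="run()"', 'onclick="start()"')
unchanged = patch_once(html, "onclick='run()'", "x")  # quote style differs, so no match

# Module-level __getattr__ (PEP 562) serves attribute access from outside only.
demo_mod = types.ModuleType("demo_mod")
exec(
    "def __getattr__(name):\n"
    "    if name == 'lazy_value':\n"
    "        return 42\n"
    "    raise AttributeError(name)\n"
    "\n"
    "def read_bare():\n"
    "    return lazy_value  # bare global lookup, does not go through __getattr__\n",
    demo_mod.__dict__,
)

from_outside = demo_mod.lazy_value  # resolved through the module's __getattr__
try:
    demo_mod.read_bare()
    bare_lookup_failed = False
except NameError:
    bare_lookup_failed = True
```

The bare lookup inside `read_bare()` searches the module's globals and then builtins, so it never reaches `__getattr__`.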
Directory-scoped rules
- `routes/CLAUDE.md`: SSE constraints, scan_progress source field, file_sources, Python gotchas
- `static/js/CLAUDE.md`: profile dropdown, progress bar phase parsing, JS gotchas
- `templates/CLAUDE.md`: CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md`: i18n conventions