GDPRScanner — Claude Code Context
A GDPR compliance scanner for Danish educational and municipal organisations. Scans Microsoft 365 (Exchange, OneDrive, SharePoint, Teams), Google Workspace (Gmail, Google Drive), and local/SMB file systems for CPR numbers and PII. Produces Excel reports, GDPR Article 30 Word documents, and supports disposition tagging, bulk deletion, scheduled scans, and multi-language UI.
How to run
source venv/bin/activate
python gdpr_scanner.py # http://0.0.0.0:5100 (all interfaces)
python -m pytest tests/ -q
Architecture
Entry point: gdpr_scanner.py — Flask app, scan orchestration globals. SSE route must stay here — blueprints can't stream.
Split modules: scan_engine.py (M365 + file scan), sse.py (SSE broadcast), checkpoint.py, app_config.py (all persistence), cpr_detector.py
Google Drive delta scan — routes/google_scan.py reads scan_opts.get("delta", False) (same flag as M365). Per user, delta key is f"gdrive:{user_email}" stored in ~/.gdprscanner/delta.json alongside M365 tokens. First delta-enabled scan fetches all files then records a Changes API start page token via conn.get_drive_start_token(user_email). Subsequent scans call conn.get_drive_changes(user_email, token) (Changes API) and update the token. Token save loads the current file fresh before writing ({**current_tokens, **_new_drive_tokens}) to avoid overwriting M365 tokens written by a concurrent scan thread. Invalid/expired tokens fall back to full scan automatically. google_scan_done now includes "delta": bool and "delta_sources": int.
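The merge-on-save step above can be sketched as follows. This is a minimal illustration, not the code in routes/google_scan.py: the helper name `save_delta_tokens` and the `m365:alice` key are hypothetical, and only the `gdrive:{user_email}` key format comes from the text.

```python
import json
import tempfile
from pathlib import Path


def save_delta_tokens(path: Path, new_tokens: dict) -> dict:
    """Re-read the token file right before writing, then merge (hypothetical helper)."""
    try:
        current = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        current = {}
    # Keys written by another scan since our last read survive the merge.
    merged = {**current, **new_tokens}
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


with tempfile.TemporaryDirectory() as tmp:
    delta_file = Path(tmp) / "delta.json"
    # Pretend an M365 scan already stored a token (key name is illustrative).
    delta_file.write_text(json.dumps({"m365:alice": "tokA"}), encoding="utf-8")
    result = save_delta_tokens(delta_file, {"gdrive:alice@example.com": "tokB"})
```

Reading fresh before the write narrows the window in which a concurrent writer can be overwritten; it does not remove it entirely, which would need a lock.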
SFTP connector — sftp_connector.py provides SFTPScanner with the same iter_files() interface as FileScanner. run_file_scan() in scan_engine.py checks source.get("source_type") == "sftp" and instantiates SFTPScanner; all other file-scan code (SSE, DB, cards) is unchanged. Auth: "password" stores credential via store_sftp_password() in OS keychain; "key" loads the private key from ~/.gdprscanner/sftp_keys/<uuid> with an optional keychain passphrase. Key files are uploaded via POST /api/file_sources/upload_key (paramiko validates format). SFTP_OK flag guards graceful degradation if paramiko is not installed. Do not add source_type="sftp" handling anywhere except scan_engine.py — the rest of the pipeline is source-agnostic.
Shared content processing — all three scan engines (M365, Google, file) funnel downloaded bytes through a single function: cpr_detector._scan_bytes(content, filename). It dispatches to the correct parser by file extension. scan_engine.py uses the _scan_bytes_timeout wrapper for PDFs (subprocess + hard timeout). routes/google_scan.py uses _scan_bytes directly. Do not duplicate file-type handling in per-source code.
cpr_detector.SUPPORTED_EXTS is the single source of truth for which file extensions are scanned across all sources. file_scanner.py imports it as DEFAULT_EXTENSIONS so local/SMB scans stay in sync automatically. scan_engine.py uses it to gate M365/SharePoint/Teams file downloads. Do not maintain a separate extension list anywhere else.
_scan_bytes injection pattern — scan_engine.py defines a no-op stub for _scan_bytes / _scan_bytes_timeout at module level (avoids circular import). gdpr_scanner.py overwrites them with the real cpr_detector implementations at startup. routes/google_scan.py resolves them lazily via gdpr_scanner.__getattr__. This is intentional — do not try to import them directly in those modules.
Blueprints in routes/ — see routes/CLAUDE.md for state/SSE rules.
Frontend: templates/index.html (SPA), static/style.css (all styles), static/js/*.js (11 ES modules + state.js). static/app.js is an archived monolith — no longer loaded.
Checkpoint / resume — all three scan engines save progress to ~/.gdprscanner/checkpoint_{prefix}.json every 25 items. Prefixes: m365, google, file_{source_id}. checkpoint.py functions accept a prefix keyword (default "m365"). Use _cp_path(prefix) to get the path — do not hard-code filenames. The Scan button calls checkCheckpoint(() => startScan(false)) so a resume banner is offered before any grid clearing happens. POST /api/scan/clear_checkpoint globs and deletes all checkpoint_*.json files.
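As a rough sketch of the naming rule above, assuming the data directory from this document: `cp_path` here is a hypothetical stand-in for `_cp_path`, whose real implementation may differ.

```python
from pathlib import Path

# Data directory as described in this document.
DATA_DIR = Path.home() / ".gdprscanner"


def cp_path(prefix: str = "m365") -> Path:
    """Build the checkpoint path for one engine (stand-in for _cp_path)."""
    return DATA_DIR / f"checkpoint_{prefix}.json"


m365_cp = cp_path()
google_cp = cp_path("google")
# File scans use one checkpoint per source; "abc123" is a made-up source id.
file_cp = cp_path("file_abc123")
```

Routing every caller through one helper keeps the `checkpoint_*.json` glob used by the clear endpoint in step with the files that are actually written.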
Data dir ~/.gdprscanner/: scanner.db, config.json, settings.json, schedule.json, token.json, delta.json, checkpoint_m365.json, checkpoint_google.json, checkpoint_file_*.json, smtp.json, machine_id (never delete — Fernet key), role_overrides.json, google_sa.json, google.json, src_toggles.json, app.lock, viewer_tokens.json
Non-obvious files
| File | Why it's not obvious |
|---|---|
| app_config.py | All persistence: profiles, settings, SMTP, lang loading, viewer tokens + PIN |
| routes/state.py | Shared mutable state + scan locks (not a typical Flask state file) |
| routes/google_scan.py | Google scan execution lives here, not in google_connector.py |
| routes/viewer.py | Viewer token + PIN API; also owns brute-force rate-limit state |
| static/js/viewer.js | Share modal, token CRUD, viewer PIN settings UI |
| lang/da.json | Primary language; source of truth is en.json |
| build_gdpr.py | Desktop app builder; contains embedded LAUNCHER_CODE for PyInstaller |
Tests
182 tests in tests/. No integration tests for live M365/Google connections.
tests/test_google_scan.py — 19 tests for the Google Workspace scan module. Route tests for GET /api/google/scan/users, POST /api/google/scan/start, POST /api/google/scan/cancel. Engine tests for _run_google_scan using synchronous invocation with mocked broadcast, _scan_bytes, checkpoint.*, scan_engine._with_disposition, and gdpr_db.get_db. The clean_google_state autouse fixture releases _google_scan_lock and clears _google_scan_abort after each test.
tests/test_route_integration.py — 54 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, GET /api/db/flagged role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require session["interface_ok"] = True after PIN set — the before_request hook blocks the same endpoint once a PIN exists), scan lock release on run_scan() exception, GET /api/db/sessions shape and ordering, profile routes CRUD and rename (including the rename-after-copy regression). Uses a tmp-path ScanDB monkeypatched into routes.database._get_db — tests never touch the real database. Interface PIN tests manipulate the real config.json via setup_method/teardown_method calling clear_interface_pin().
Local-file scan fixtures — tests/fixtures/local_files/ holds 19 files for manual/UI-level testing of the file scanner. 14 should be flagged; 5 are true negatives. All CPR numbers verified against is_valid_cpr. generate_fixtures.py (requires python-docx, openpyxl, mutagen — all in venv) regenerates the binary .docx/.xlsx/.mp3/.flac/.mp4 files. Audio fixtures need 2 silent MPEG frames so mutagen can sync; FLAC uses a hand-packed STREAMINFO + Vorbis comment block; MP4 uses a minimal ftyp+moov/mvhd base that mutagen can tag.
_CPR_PREFIX_NOISE in .docx fixtures — scan_docx builds a single string by concatenating all run texts with no separators between paragraphs. If a CPR value run is immediately followed by text from the next paragraph without a word boundary, \b in CPR_PATTERN fails and the number is silently missed. The fixture generator appends a trailing " " to every value run so CPRs are always surrounded by word boundaries after concatenation. Do not remove this trailing space — the detection will silently regress.
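The word-boundary failure can be reproduced in isolation. The pattern below is a simplified, hypothetical stand-in for `CPR_PATTERN` (the real one may be stricter), and the number is a placeholder rather than a verified CPR.

```python
import re

# Simplified stand-in for CPR_PATTERN; assumed shape only.
CPR_LIKE = re.compile(r"\b\d{6}-\d{4}\b")

# Run texts joined with no separator, as scan_docx is described as doing.
runs_without_space = ["010190-1234", "Next paragraph"]
runs_with_space = ["010190-1234 ", "Next paragraph"]

glued = "".join(runs_without_space)
spaced = "".join(runs_with_space)

# In the glued text a digit is directly followed by a letter; both are word
# characters, so the trailing \b cannot match there.
missed = CPR_LIKE.findall(glued)
found = CPR_LIKE.findall(spaced)
```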
Viewer mode (#33) — routes/viewer.py + static/js/viewer.js
Read-only access for DPOs and reviewers. Key invariants:
- `/view` auth chain: token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order.
- `window.VIEWER_MODE`: injected by Jinja2 in `index.html`. `auth.js` reads it at startup and adds the `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks, except `delBtn` in the card builder, which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button.
- `window.VIEWER_SCOPE`: injected alongside `VIEWER_MODE`. Contains the scope dict from the token (e.g. `{"role": "student"}`). Empty object `{}` means unrestricted. `auth.js` reads it at startup; if `VIEWER_SCOPE.role` is set, it pre-sets `#filterRole` to that value and hides the dropdown so the viewer cannot change it.
- Token scope: stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `user_role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard. Column name is `user_role`; do not use `role`, because the DB row has no such key and the filter silently returns nothing.
- `session["viewer_scope"]`: set when a token is validated at `/view`. Persists for the browser session alongside `session["viewer_ok"]`. Read via `session.get("viewer_scope", {})` in `/api/db/flagged`; defaults to `{}` (unrestricted) for PIN-authenticated sessions and legacy tokens without a scope key.
- `viewer_tokens.json` format: stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Token dicts now include `"scope": {}`. The old bare-list format and tokens without a `scope` key are handled transparently (`t.get("scope", {})`). Do not write the file as a bare list.
- `app.secret_key`: derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it.
- `GET /api/db/flagged`: returns `get_session_items()` (last completed scan session, joined with dispositions), filtered by `session["viewer_scope"].role` when set. Used exclusively by `_loadViewerResults()` in `results.js`. Do not confuse with `get_flagged_items()` (single scan_id, no disposition join).
- Rate-limit state (`_pin_attempts` dict in `routes/viewer.py`): in-memory only, resets on server restart. This is intentional: a restart clears lockouts without a persistent store.
- User-scoped tokens (#34): scope `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}` filters `GET /api/db/flagged` by `account_id IN (list)`, covering both M365 and GWS items for the same person. `scope.user` is always stored as a list; a legacy single-string value is coerced to `[string]` on read. `scope.display_name` is used for UI only (badge, viewer header), not for filtering. File-scan items (`account_id = ""`) never appear in user-scoped views. `POST /api/viewer/tokens` rejects combined `role` + `user` scope with 400. Share modal: scope-type `<select>` (`#shareScopeType`) reveals either the role dropdown (`#shareScopeRoleWrap`) or a name-search autocomplete (`#shareScopeUserWrap`). Autocomplete reads `S._allUsers`; selecting a row stores `{ emails, display_name }` in module-level `_selectedScopeUser`; editing the input manually clears it (free-text email fallback). In viewer mode, `auth.js` shows `#viewerIdentityBadge` with `VIEWER_SCOPE.display_name`.
- Date-range scoping: tokens can carry `valid_from` and/or `valid_to` fields (`YYYY-MM-DD`) in their scope dict. `GET /api/db/flagged` filters out items whose `modified` date falls outside the range using lexicographic string comparison (ISO dates sort correctly without parsing). `POST /api/viewer/tokens` validates format and enforces `valid_from ≤ valid_to`. The share modal shows `#shareValidFrom` / `#shareValidTo` date inputs (apply to any scope type). The token list shows a green date-range badge when a range is stored. The date range is independent of the scope type and can be combined with either a role scope or a user scope; role and user remain mutually exclusive, as noted in the previous bullet.
- Token onclick attributes: Copy/Revoke buttons in `_renderTokenList()` pass the token as a single-quoted JS string literal (`'\'' + tok.token + '\''`), never via `JSON.stringify`. `JSON.stringify` produces double-quoted strings that break the surrounding `onclick="…"` HTML attribute.
- Settings Security pane: Admin PIN and Viewer PIN groups live in `stPaneSecurity`, not `stPaneGeneral`. `switchSettingsTab('security')` in `sources.js` triggers both `stLoadPinStatus()` and `stLoadViewerPinStatus()`. The Share modal Configure button opens `openSettings('security')`.
- `stClearViewerPin` guard: validates that the current-PIN field is non-empty client-side before sending the DELETE request; shows an inline error and focuses the field if empty.
- Share link base URL: `_getShareBaseUrl()` in `viewer.js` fetches `/api/local_ip` (returns the machine's LAN IP via a UDP probe to `8.8.8.8`) and substitutes it so copied links are routable from other machines. Falls back to `window.location.origin` on error. Both `createShareLink` and `copyTokenLink` are `async` and `await` this helper. Do not revert to a bare `window.location.origin`; that produces `127.0.0.1` links that are useless to remote viewers.
- Flask binds to `0.0.0.0`: `gdpr_scanner.py` default `--host`, `m365_launcher.py`, and `build_gdpr.py` all use `host="0.0.0.0"`. Internal loopback URLs (urllib exports, webview window, port probe) intentionally keep `127.0.0.1`; do not change those to `0.0.0.0`.
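The server-side scope rules in the list above (role, user list, date range) can be pictured with a small predicate. This is a sketch under assumptions, not the query used by `GET /api/db/flagged`: the function name `in_scope` is hypothetical, and it assumes `modified` is an ISO string whose first ten characters are the date.

```python
def in_scope(item: dict, scope: dict) -> bool:
    """Return True when a flagged item is visible under a viewer scope (sketch)."""
    role = scope.get("role")
    if role and item.get("user_role") != role:
        return False

    users = scope.get("user")
    if isinstance(users, str):  # legacy single-string value coerced to a list
        users = [users]
    if users and item.get("account_id") not in users:
        return False

    # ISO dates compare correctly as plain strings, so no parsing is needed.
    day = (item.get("modified") or "")[:10]
    if scope.get("valid_from") and day < scope["valid_from"]:
        return False
    if scope.get("valid_to") and day > scope["valid_to"]:
        return False
    return True


items = [
    {"id": 1, "user_role": "student", "account_id": "a@x.dk", "modified": "2026-03-10T08:00:00"},
    {"id": 2, "user_role": "staff", "account_id": "b@x.dk", "modified": "2026-03-10T08:00:00"},
    {"id": 3, "user_role": "student", "account_id": "", "modified": "2025-01-01T00:00:00"},
]

students_2026 = [i["id"] for i in items if in_scope(i, {"role": "student", "valid_from": "2026-01-01"})]
unrestricted = [i["id"] for i in items if in_scope(i, {})]
one_user = [i["id"] for i in items if in_scope(i, {"user": "a@x.dk"})]
```

An empty scope passes everything, which matches the unrestricted default for PIN-authenticated sessions, and an item with an empty `account_id` can never match a user list.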
Sources panel resize — static/js/log.js + sources.js
- `_fitSourcesPanel()`: called at the end of every `renderSourcesPanel()` call. Clears the panel's inline height, reads `scrollHeight` (natural content height), then either restores a saved smaller preference from `localStorage` (`gdpr_sources_h`) or pins the height to `scrollHeight`. This keeps the panel exactly as tall as needed to show all sources.
- `_initSourcesResize()`: attaches pointer-drag to `#sourcesResizeHandle`. On `pointerdown` it captures `scrollHeight` as the hard max; drag up shrinks, drag down is capped at that max. Saves to `localStorage` on release; clears the key if the user drags back to full height.
- Do not add a fixed `max-height` or `height` to `#sourcesPanel` in HTML; height is controlled entirely by `_fitSourcesPanel()` at runtime.
- Do not call `_fitSourcesPanel()` before the panel has rendered; `scrollHeight` will be 0. The call in `renderSourcesPanel()` is the correct hook; `_initSourcesResize()` only sets up the drag handler.
Scan filter options — scan_engine.py
All options live in the profile options dict and apply to all three scan engines (M365, Google, file scan).
- `skip_gps_images` (bool, default `false`): when enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii`, which rechecks `pii_fields` and `author` when GPS is skipped.
- `min_cpr_count` (int, default `1`): minimum number of distinct CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))`; `cprs` is a list of dicts from `extract_matches`, not strings. Do not revert to `dict.fromkeys(cprs)`, which raises `TypeError: unhashable type: 'dict'` on every file with CPR hits. Files with faces or EXIF PII are still flagged regardless of CPR count; the threshold gates only CPR-based hits.
- `cpr_only` (bool, default `false`): when enabled, items whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are skipped; only items with at least one qualifying CPR number are flagged. Implemented as a compact short-circuit at each engine's flagging gate: `if not (_cpr_qualifies and cprs) and (cpr_only or (<other PII absent>)): continue`. This preserves existing behavior when `cpr_only=False`. Sidebar toggle `#optCprOnly`; profile editor `#peOptCprOnly`.
- `ocr_lang` (str, default `"dan+eng"`): Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng`, `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. Threaded through `_scan_bytes` / `_scan_bytes_timeout` → `document_scanner.scan_pdf` / `scan_image` and the spawned PDF-OCR subprocess worker (`_worker_scan_pdf`). The OCR result cache key already included `lang`, so per-language results are cached independently. Sidebar select `#optOcrLang`; profile editor `#peOptOcrLang`.
- File scan reads all options from `source` dict keys (passed directly from the `/api/file_scan/start` payload). M365 scan reads them from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate.
- UI: sidebar controls `#optSkipGps`, `#optMinCpr`, `#optCprOnly`, `#optOcrLang`; profile editor controls `#peOptSkipGps`, `#peOptMinCpr`, `#peOptCprOnly`, `#peOptOcrLang`. All are saved/loaded by `profiles.js`.
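The `min_cpr_count` deduplication can be tried on its own. In this sketch the match dicts carry a hypothetical `page` key next to `formatted`, the numbers are placeholders, and the helper names are illustrative rather than the ones in scan_engine.py.

```python
# Shape assumed from the text: a list of dicts with a "formatted" key.
cprs = [
    {"formatted": "010190-1234", "page": 1},
    {"formatted": "010190-1234", "page": 3},
    {"formatted": "020285-5678", "page": 2},
]


def distinct_cprs(matches):
    """Order-preserving dedupe on the formatted string, not on the dict."""
    return list(dict.fromkeys(m["formatted"] for m in matches))


def cpr_qualifies(matches, min_cpr_count=1):
    """True when the file has at least min_cpr_count distinct CPR numbers."""
    return len(distinct_cprs(matches)) >= min_cpr_count


# Passing the dicts themselves fails, because dicts are unhashable.
try:
    dict.fromkeys(cprs)
    raised = False
except TypeError:
    raised = True
```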
M365 connector exceptions — m365_connector.py
Exception hierarchy (all inherit M365Error(Exception)):
| Exception | Trigger | Handler |
|---|---|---|
| M365PermissionError | 403 Forbidden | scan_error broadcast with human-readable permission hint |
| M365DeltaTokenExpired | 410 Gone on delta endpoint | Caller clears token and falls back to full scan |
| M365DriveNotFound | 404 Not Found on any path | scan_phase broadcast ("not provisioned — skipped") in _scan_user_onedrive; full-scan path's except Exception: return also silences it |
M365DriveNotFound — why it exists: _get() previously fell through to raise_for_status() on 404, which was caught by the generic except Exception handler in _scan_user_onedrive and broadcast as a red scan_error. The full-scan path (_iter_drive_folder_for) silently swallowed the same 404 via except Exception: return. Adding the specific exception makes the delta path consistent with the full-scan path: a user without a provisioned OneDrive is skipped without an error card. Common causes: no OneDrive licence, service plan disabled, drive never initialised (account never signed in), account suspended.
Do not add a 404 handler to _get() that returns a fallback value — that would silently mask genuine path bugs elsewhere. Raising M365DriveNotFound keeps the error visible to callers that need to act on it.
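A minimal sketch of the hierarchy and of a caller that treats the 404 case as a skip. The class names come from the table above; `raise_for_graph_status` and `scan_onedrive` are hypothetical, and the status mapping is simplified (the text says 410 applies on the delta endpoint only).

```python
class M365Error(Exception):
    """Base class for connector errors."""


class M365PermissionError(M365Error):
    """403 Forbidden."""


class M365DeltaTokenExpired(M365Error):
    """410 Gone on the delta endpoint."""


class M365DriveNotFound(M365Error):
    """404 Not Found."""


_STATUS_TO_ERROR = {
    403: M365PermissionError,
    404: M365DriveNotFound,
    410: M365DeltaTokenExpired,
}


def raise_for_graph_status(status_code: int, url: str) -> None:
    """Raise a specific exception instead of returning a fallback value."""
    exc = _STATUS_TO_ERROR.get(status_code)
    if exc is not None:
        raise exc(f"{status_code} for {url}")


def scan_onedrive(status_code: int) -> str:
    """Caller decides: a missing drive is a skip, other errors are reported."""
    try:
        raise_for_graph_status(status_code, "/users/x/drive")
    except M365DriveNotFound:
        return "skipped"
    except M365Error:
        return "error"
    return "ok"
```

Because the specific handler is listed before the base class, only the drive-not-found case is downgraded to a skip; callers that do not catch it still see the failure.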
Memory management — scan_engine.py
Large M365 tenants can generate enormous memory pressure. Key rules to preserve:
- Email body stripped at collection time: `_scan_user_email` calls `conn.get_message_body_text(msg)`, stores the result as `msg["_precomputed_body"]`, then deletes `msg["body"]` and `msg["bodyPreview"]` before appending to `work_items`. The processing loop reads `meta.pop("_precomputed_body", "")`. Do not re-add `body` to the `$select` query without also stripping it here.
- `body_excerpt`: 500-char plain-text preview stored per flagged email. Just before `del body_text` in M365 email processing, `meta["_body_excerpt"] = body_text[:500].strip()`. In `google_scan.py`, a regex HTML-strip of the first 3000 bytes of Gmail body data is stored the same way. `_broadcast_card` in both engines includes `"body_excerpt"` in the card dict so the excerpt flows into `flagged_items`, the checkpoint JSON, and the DB (`body_excerpt TEXT`, migration #10). The M365 email preview route falls back to `_excerpt_page()` when Graph raises or the connector is absent. The Gmail preview shows `_excerpt_page()` as primary content with the "Open in Gmail" link appended. Do not remove the excerpt before broadcasting; that is what makes preview work on checkpoint resume.
- `work_items` → `deque` before processing: converted with `deque(work_items)` and drained via `popleft()` so each item's memory is released immediately after processing. Do not convert back to a list or iterate with `enumerate()`.
- `del content` in file branch: raw download bytes are deleted as soon as `content.decode()` is done (before NER/PII counting). Both the hit and no-hit paths have an explicit `del content`.
- `del body_text` in email branch: deleted after the `_broadcast_card` call.
- PDF OCR rendered page-by-page: `document_scanner.scan_pdf` (and the redact paths) call `convert_from_path(first_page=N, last_page=N)` inside the loop, so only one page image is in memory at a time. Do NOT move back to a bulk `convert_from_path()` call; that allocates all pages at once and triggers OOM kills on large PDFs.
- OCR memory guard: `_ocr_mem_ok()` checks `psutil.virtual_memory().available >= 500 MB` before each page render. When less than that is available, the page is skipped with a printed warning and recorded as `"skipped"` in `page_methods`.
- Memory guard: `psutil.virtual_memory().available` is checked before each M365 file download; the scan skips the file if < 300 MB is free.
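The deque drain from the list above looks roughly like this. It is a sketch only: `drain` and `handle` are hypothetical names, and memory is actually freed only when nothing else (such as the original list) still references the items.

```python
from collections import deque


def drain(work_items, handle):
    """Process items front to back, dropping each reference after use (sketch)."""
    queue = deque(work_items)
    processed = 0
    while queue:
        item = queue.popleft()  # the queue no longer holds this item
        handle(item)
        processed += 1
        del item                # drop the local reference before the next one
    return processed


seen = []
count = drain(
    [{"id": 1, "_precomputed_body": "a"}, {"id": 2, "_precomputed_body": "b"}],
    # pop() mirrors the processing loop reading meta.pop("_precomputed_body", "")
    lambda meta: seen.append((meta["id"], meta.pop("_precomputed_body", ""))),
)
```

Iterating a list with `enumerate()` would keep every item alive until the loop ends, which is the behaviour the rule above is guarding against.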
Export — routes/export.py
- `GDPRDb.get_session_sources()`: returns a `set` of source-key strings (e.g. `{"gmail", "gdrive", "email"}`) for every scan in the current session window. Used by both `_build_excel_bytes()` and `_build_article30_docx()` to include zero-hit sources in summary tables. Do not derive the scanned-source set from `by_source` alone; that dict only contains sources with flagged items.
- Excel Summary sheet vs. per-source tabs: the Summary sheet shows all scanned sources (even with 0 items). Per-source tabs are only created for sources with items; an empty tab has no value.
- ART.30 breakdown table: iterates `scanned_sources` (not `by_source`) so Gmail, Google Drive, etc. appear with `0 | 0 | 0 | —` when the scan found nothing.
- Role-filtered exports: `_build_excel_bytes(role='')` and `_build_article30_docx(role='')` accept `role='student'` or `role='staff'`. A local `_items` list is built at the top of each function and used everywhere instead of `state.flagged_items` directly, so the GPS sheet, External transfers sheet, and Art.30 staff/student tables all see only the filtered subset. Route handlers read `request.args.get('role', '')` and forward it. Filenames get an `_elever` / `_ansatte` suffix. The `#filterRole` dropdown in the filter bar drives both the client-side grid filter and the export URL param; do not separate them.
- `POST /api/redact_item`: rewrites a local file in-place with CPR numbers replaced by `██████-████` / `█` blocks, then removes the card from the grid and logs a `"redacted"` disposition. Supported extensions: `.docx`, `.xlsx`, `.csv`, `.txt` (`_REDACT_EXTS`). The file is written to a temp path in the same directory as the original before `shutil.move`; this avoids cross-device rename failures on mounted volumes. Uses existing `document_scanner` functions (`redact_docx`, `redact_xlsx`, `redact_csv`, `find_pii_spans_in_text`). Only works for `source_type == "local"`; SMB/cloud files are not supported (the button is hidden on those cards). The button (✂, class `card-redact-btn`) appears in `appendCard` when `_redactable(f)` is true; it is hidden in viewer mode and for resolved items.
Scan history browser — static/js/history.js + gdpr_db.py + routes/database.py
Allows reviewing results from any past scan session without running a new scan. Key invariants:
- `S._historyRefScanId`: `null` = live/SSE mode; positive int = viewing a past session (the highest `scan_id` in that session's 300 s window). Set by `loadHistorySession()`; cleared to `null` by `exitHistoryMode()`.
- `GET /api/db/sessions` (`routes/database.py`): calls `_get_db().get_sessions()`. Returns a newest-first list; each entry has `ref_scan_id`, `started_at`, `finished_at`, `sources` (list of source-key strings), `flagged_count`, `total_scanned`, `delta` (bool). No auth restriction; viewer tokens share this endpoint.
- `get_sessions(limit=50, window_seconds=300)` (`gdpr_db.py`): groups `scans` rows by 300 s window (same window logic as `get_session_items`). Groups are built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change the window size independently of `get_session_items`.
- `get_session_items(ref_scan_id=N)` (`gdpr_db.py`): when `ref_scan_id` is given, anchors the 300 s window to that scan's `started_at`. Falls back to the latest scan when `ref_scan_id=None`. The window is symmetric: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`. Do not revert to a one-sided lower bound or historical sessions will include all newer scans.
- `GET /api/db/flagged?ref=N`: passes `ref_scan_id` to `get_session_items`; viewer scope enforcement (role/user filters) still applies. Used by both history mode and the normal post-scan viewer path.
- History banner (`#historyBanner`): shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`.
- Session picker (`#historyDropdown`): rendered inside the `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper.
- Cache invalidation: `_sessions` and `_latestRefScanId` are module-level in `history.js`. `invalidateHistoryCache()` clears both. All three `*_done` SSE handlers in `scan.js` call `window.invalidateHistoryCache?.()` so the picker reflects the newest scan after completion.
- Re-scan diff: `loadHistorySession` fetches the immediately preceding session's items after rendering the current session. Items present in the previous session but absent from the current one (compared by `id`) are tagged `_resolved: true` and appended after a `.resolved-divider` separator. `appendCard` in `results.js` adds `.card-resolved` (opacity 0.6), a green `✓ Resolved` badge, and hides the delete button for resolved items. `_setHistoryBanner` accepts an optional `resolvedCount` parameter and appends it to the banner label. Resolved items are NOT added to `S.flaggedData`; they are grid-only and cannot be bulk-selected or exported.
- Auto-load on page load: `results.js` calls `window.loadHistorySession?.(null)` once when the SSE watchdog confirms `!status.running`. `null` resolves to the latest completed session via `_fetchSessions()[0].ref_scan_id`. The `_initialStatusChecked` guard ensures this fires at most once per page load.
- Mode transitions: `startScan()` calls `window.exitHistoryMode?.()` before clearing the grid, so any history banner is dismissed and `S._historyRefScanId` is reset before SSE events start arriving.
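To illustrate why the session window has to be symmetric, here is a small SQLite sketch. The two-column `scans` table and both function names are assumptions made for the example; the real schema and queries in gdpr_db.py are not shown in this document.

```python
import sqlite3

WINDOW = 300  # seconds, matching the window size named in the text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (scan_id INTEGER PRIMARY KEY, started_at REAL)")
conn.executemany(
    "INSERT INTO scans VALUES (?, ?)",
    [(1, 1000.0), (2, 1100.0), (3, 5000.0), (4, 5200.0)],
)


def _ref_start(conn, ref_scan_id):
    row = conn.execute(
        "SELECT started_at FROM scans WHERE scan_id = ?", (ref_scan_id,)
    ).fetchone()
    return None if row is None else row[0]


def session_scan_ids(conn, ref_scan_id, window=WINDOW):
    """Symmetric window: scans within +/- window seconds of the reference."""
    ref = _ref_start(conn, ref_scan_id)
    if ref is None:
        return []
    rows = conn.execute(
        "SELECT scan_id FROM scans WHERE started_at BETWEEN ? AND ? ORDER BY scan_id",
        (ref - window, ref + window),
    )
    return [r[0] for r in rows]


def one_sided_scan_ids(conn, ref_scan_id, window=WINDOW):
    """Lower bound only, shown to demonstrate the regression being warned about."""
    ref = _ref_start(conn, ref_scan_id)
    if ref is None:
        return []
    rows = conn.execute(
        "SELECT scan_id FROM scans WHERE started_at >= ? ORDER BY scan_id",
        (ref - window,),
    )
    return [r[0] for r in rows]


old_session = session_scan_ids(conn, 2)
leaky = one_sided_scan_ids(conn, 2)
```

With only a lower bound, viewing the older session also pulls in scans 3 and 4 from the later one.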
CPR cross-referencing — gdpr_db.py + routes/database.py + static/js/results.js
- `GDPRDb.get_related_items(item_id, ref_scan_id, window_seconds=300)`: self-joins `cpr_index` to find other items in the same session window that share ≥1 CPR hash with `item_id`. Returns rows ordered by `shared_cprs DESC, cpr_count DESC`. Uses the same 300 s symmetric window as `get_session_items`; do not change the window size independently.
- `GET /api/db/related/<item_id>?ref=N` (`routes/database.py`): passes `item_id` and optional `ref_scan_id` to `get_related_items`; normalises JSON columns (same logic as `db_flagged_items`). Returns `[]` when `DB_OK` is false.
- `#previewRelated`: `<div>` inserted between `#previewMeta` and the disposition row in `index.html`. Hidden (`display:none`) when not in use; shown by `_loadRelated`.
- `_loadRelated(f)` (`results.js`): async; hides `#previewRelated` if `f.cpr_count` is 0, otherwise fetches `/api/db/related/<id>?ref=N` and renders a clickable list with a per-item shared-CPR badge. Called from `openPreview` after `loadDisposition`.
- `window._openRelated(id, itemData)` (`results.js`): resolves the target item: looks up `id` in `S.flaggedData` first (live/history grid already loaded), falls back to `itemData` from the API response (history items not yet in the grid). Calls `openPreview`.
- No new data collection: `cpr_index` already stores `(cpr_hash, item_id, scan_id)` for every CPR hit at write time. Cross-referencing is entirely a query-time operation.
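The self-join can be sketched against a toy `cpr_index`. The column names come from the text; everything else is an assumption: the session window is replaced by an explicit list of scan ids, the `cpr_count` tie-break is replaced by `item_id`, and `related_items` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpr_index (cpr_hash TEXT, item_id TEXT, scan_id INTEGER)")
conn.executemany(
    "INSERT INTO cpr_index VALUES (?, ?, ?)",
    [
        ("h1", "A", 1), ("h2", "A", 1),
        ("h1", "B", 1), ("h2", "B", 1),
        ("h1", "C", 2),
        ("h3", "D", 2),
    ],
)


def related_items(conn, item_id, scan_ids):
    """Other items sharing at least one CPR hash with item_id (sketch)."""
    if not scan_ids:
        return []
    marks = ",".join("?" for _ in scan_ids)
    sql = f"""
        SELECT other.item_id, COUNT(DISTINCT other.cpr_hash) AS shared_cprs
        FROM cpr_index AS mine
        JOIN cpr_index AS other
          ON other.cpr_hash = mine.cpr_hash
         AND other.item_id != mine.item_id
        WHERE mine.item_id = ?
          AND other.scan_id IN ({marks})
        GROUP BY other.item_id
        ORDER BY shared_cprs DESC, other.item_id
    """
    return conn.execute(sql, [item_id, *scan_ids]).fetchall()


related_to_a = related_items(conn, "A", [1, 2])
```

Only placeholders are interpolated into the SQL string; the values themselves are still bound as parameters.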
Preview — routes/database.py
GET /api/preview/<item_id>?source_type=…&account_id=… dispatches by source_type:
- `local` / `smb`: re-reads the file from disk; renders images as data URIs, text/CSV/PDF/DOCX/XLSX inline, SMB as a link card.
- `email`: fetches the M365 message body via Graph and renders it as sandboxed HTML (requires `state.connector`).
- `gmail`: Gmail's web UI cannot be embedded (`X-Frame-Options`). Shows an info card with an "Open in Gmail" link built from the stored `_url` field.
- `gdrive`: extracts the Drive file ID from `webViewLink` and returns `https://drive.google.com/file/d/{id}/preview` as an iframe. Falls back to substituting `/view` → `/preview` in the URL if the pattern doesn't match.
- All other values (M365 files: `onedrive`, `sharepoint`, `teams`, or empty): calls Graph's `/preview` POST endpoint; tries the `drive_id`-based path first, then the user-drive path, then `/me/drive`.
_source_type must be set in google_scan.py — Gmail items need meta["_source_type"] = "gmail" and Drive items "gdrive" before _broadcast_card is called. Without it, cards carry an empty source_type and fall through to the M365 branch, which calls Graph with a Gmail ID and gets a 404.
state.connector guard — only the email branch and the M365 else branch require M365 auth. The local/smb, gmail, and gdrive branches must not gate on state.connector — they work in Google-only deployments.
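For the `gdrive` branch described earlier in this section, the URL rewrite might look like the sketch below. The regular expression and the function name are assumptions; the document only states that the file ID is extracted from `webViewLink` and that `/view` → `/preview` is the fallback.

```python
import re

# Assumed pattern for Drive file links; the real code may match differently.
_DRIVE_ID = re.compile(r"/file/d/([^/?#]+)")


def drive_preview_url(web_view_link: str) -> str:
    """Turn a Drive webViewLink into an embeddable preview URL (sketch)."""
    match = _DRIVE_ID.search(web_view_link)
    if match:
        return f"https://drive.google.com/file/d/{match.group(1)}/preview"
    # Fallback when the pattern does not match: swap the trailing action.
    return web_view_link.replace("/view", "/preview")


file_link = drive_preview_url("https://drive.google.com/file/d/abc123/view?usp=drivesdk")
other_link = drive_preview_url("https://docs.google.com/document/d/xyz/view")
```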
Compliance audit log — gdpr_db.py + routes/
- `audit_log` table: created by `_DDL` (`CREATE TABLE IF NOT EXISTS`) so it appears automatically on the next server start for existing databases. No migration needed. Schema: `id, ts (Unix float), action, actor, detail, ip`.
- `ScanDB.log_audit(action, detail, actor, ip)`: inserts one record and commits immediately. `ScanDB.get_audit_log(limit, action)` returns rows newest-first.
- `log_audit_event(action, detail, actor, ip)`: module-level helper in `gdpr_db.py`; silently no-ops on any exception so call sites never raise. Import: `from gdpr_db import log_audit_event as _audit`.
- `GET /api/audit_log?limit=200&action=<filter>`: in `routes/app_routes.py`. No auth gate; same access level as other settings endpoints.
- Recorded events: `profile_save`, `profile_delete` (`routes/profiles.py`); `token_create`, `token_revoke`, `viewer_pin_set/change/clear`, `interface_pin_set/change/clear` (`routes/viewer.py`); `source_add`, `source_update`, `source_delete` (`routes/sources.py`); `scheduler_job_save`, `scheduler_job_delete` (`routes/scheduler.py`); `scan_start`, `scan_stop` (`routes/scan.py`); `smtp_save` (`routes/email.py`); `disposition`, `disposition_bulk`, `admin_pin_set/change` (`routes/database.py`); `item_delete`, `item_redact` (`routes/export.py`).
- UI: "Audit Log" tab (`stTabAuditlog` / `stPaneAuditlog`) in the Settings modal. `stLoadAuditLog()` in `sources.js` fetches and renders the table when the tab is opened; uses `window._escHtml` from `log.js`. Exported as `window.stLoadAuditLog`.
- Do not add `actor` values for end-user identity. The scanner has no per-user login, so `actor` is always empty for now. The field is reserved for future use.
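A compact sketch of the audit-log pattern described above: a table created idempotently, a writer that commits per record, and a wrapper that never raises. Column types are assumptions, and the functions here take an explicit connection, unlike the real `ScanDB` methods and the module-level `log_audit_event` helper.

```python
import sqlite3
import time

# Column names follow the schema in the text; the types are assumed.
_DDL = """CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL, action TEXT, actor TEXT, detail TEXT, ip TEXT)"""

conn = sqlite3.connect(":memory:")
conn.execute(_DDL)


def log_audit(conn, action, detail="", actor="", ip=""):
    """Insert one record and commit immediately."""
    conn.execute(
        "INSERT INTO audit_log (ts, action, actor, detail, ip) VALUES (?, ?, ?, ?, ?)",
        (time.time(), action, actor, detail, ip),
    )
    conn.commit()


def log_audit_event(conn, action, detail="", actor="", ip=""):
    """Best-effort wrapper: auditing must never break the calling route."""
    try:
        log_audit(conn, action, detail, actor, ip)
    except Exception:
        pass


def get_audit_log(conn, limit=200, action=None):
    """Return (action, detail) rows, newest first, optionally filtered."""
    sql, params = "SELECT action, detail FROM audit_log", []
    if action:
        sql += " WHERE action = ?"
        params.append(action)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


log_audit_event(conn, "profile_save", "p1")
log_audit_event(conn, "scan_start")

closed = sqlite3.connect(":memory:")
closed.close()
log_audit_event(closed, "scan_stop")  # fails internally, is swallowed
```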
SSE teardown — static/js/scan.js
- Do not close `S.es` in `scan_done` if other scans are still running. M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely.
- Rule: close `S.es` (and reset `S._userStartedScan`) only inside the branch where all concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`.
- Scheduled scans: `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive.
- `scan_start` is M365-only: `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead; these are routed correctly by the phase-source detection logic in `_attachScanListeners`.
- Two separate abort events: `state._scan_abort` (M365 + file) and `state._google_scan_abort` (Google). `POST /api/scan/stop` sets both. `_check_abort()` inside `_run_google_scan` must use the module-level `_scan_abort` alias (`= state._google_scan_abort`), not `gdpr_scanner._scan_abort` (which is the M365 event). Do not conflate them: a Google-only scan must react to Stop, and `gdpr_scanner._scan_abort` is not the right event for that path.
Email sending — routes/email.py + m365_connector.py
- `_post()` returns `{}` on empty body: `m365_connector._post()` returns `r.json() if r.content else {}`. The Graph `sendMail` endpoint returns HTTP 202 with no body on success; calling `r.json()` on an empty response raises `JSONDecodeError`. Do not change this back to an unconditional `r.json()`, which would falsely report every successful email send as an error.
- Graph preferred over SMTP: `smtp_test` and `send_report` both try `_send_email_graph()` first when `state.connector` is authenticated. They fall back to SMTP only if Graph raises. If Graph fails and no SMTP host is saved, the Graph exception is surfaced directly (not swallowed by the "No SMTP host" message).
- Auto-email after manual scan: `_maybe_send_auto_email()` in `routes/scan.py` is called from the `_run()` thread immediately after `run_scan()` returns. It reads `smtp_cfg.get("auto_email_manual")` from `smtp.json` and no-ops if the flag is false, there are no flagged items, or there are no recipients. It uses the same Graph-first → SMTP-fallback pattern as the scheduler. Toggle: Settings → Email report → Email report after manual scan (`#st-smtpAutoEmail`), saved by `stSmtpSave()` in `scheduler.js`.
- Gmail vs Google Workspace detection: auth error handlers check whether the SMTP username ends in `@gmail.com` / `@googlemail.com`. If not, the account is treated as Google Workspace (custom domain) and the error message points to the Workspace admin console rather than the user's personal security settings.
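
The Graph-first, SMTP-fallback behaviour described above might look roughly like this. It is a sketch under assumptions, not the code in `routes/email.py`: `send_with_fallback` and its parameters are hypothetical, and the two transports are passed in as plain callables so the example runs without any network access.

```python
from typing import Callable, Optional


def send_with_fallback(
    send_graph: Optional[Callable[[], None]],
    send_smtp: Callable[[], None],
    smtp_host: str,
) -> str:
    """Try Graph first; fall back to SMTP only if Graph raises.

    send_graph is None when no authenticated connector is available.
    Returns the name of the transport that delivered the message.
    """
    graph_error: Optional[Exception] = None
    if send_graph is not None:
        try:
            send_graph()
            return "graph"
        except Exception as exc:  # any Graph failure triggers the fallback
            graph_error = exc

    if not smtp_host:
        if graph_error is not None:
            # Surface the real Graph failure instead of masking it.
            raise graph_error
        raise RuntimeError("No SMTP host configured")

    send_smtp()
    return "smtp"


def failing_graph() -> None:
    """Simulated Graph failure (hypothetical)."""
    raise ConnectionError("Graph sendMail failed")


sent = []
result_ok = send_with_fallback(lambda: None, lambda: sent.append("smtp"), "")
result_fallback = send_with_fallback(
    failing_graph, lambda: sent.append("smtp"), "mail.example.org"
)
```

Re-raising the stored Graph exception when no SMTP host is configured is what keeps the original error visible to the user instead of a generic "No SMTP host" message.
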
Scheduler — scan_scheduler.py + routes/scheduler.py + static/js/scheduler.js
- Job config keys: `id`, `name`, `enabled`, `frequency` (`daily`/`weekly`/`monthly`), `day_of_week`, `day_of_month`, `hour`, `minute`, `profile_id`, `auto_email`, `auto_retention`, `retention_years`, `fiscal_year_end`, `report_only`. Stored in `~/.gdprscanner/schedule.json`. The old single-job format is auto-migrated, and legacy entries without an ID are assigned UUIDs.
- `_execute_scan(job_id)`: the core execution method. Acquires a per-job lock (`_running_jobs` set), records a DB run via `db.begin_schedule_run()`, then either takes the report-only path (see below) or runs the full scan pipeline (M365 → file → Google), then emails and applies retention if configured. The DB run is finalised in a `finally` block so status/counts are always recorded.
- Report-only path: when `report_only=True`, `_execute_scan` short-circuits before the M365 auth check. It populates `_m.flagged_items` from `db.get_session_items()` if the in-memory list is empty, then calls `_send_email_report(job_cfg)` and returns. It does NOT acquire the scan lock and does NOT require M365 auth. It fails with `RuntimeError("No scan results available")` if both in-memory state and the DB are empty, which the outer `except` handler records as a failed run.
- `_send_email_report(job_cfg)`: builds Excel via `_m._build_excel_bytes()`, loads SMTP config, tries Graph first (if `state.connector` is authenticated), and falls back to SMTP. Adjusts the email body text based on `job_cfg.get("report_only")`: "Scan completed" vs "Report on latest scan results".
- `_m.flagged_items` and `state.flagged_items` are the same object: `gdpr_scanner.py` assigns `_state.flagged_items = flagged_items` at startup, so both names reference the same list. In-place updates (`flagged_items[:] = ...`) in the scheduler propagate to routes and vice versa.
- `scheduler_started` / `scheduler_done` SSE events: broadcast at start and end of every job (including report-only). `scheduler_done` carries `flagged`, `scanned`, `emailed`, and `job_name`. Do not confuse with `scan_done` (M365); they are separate event types.
- UI, job card badge: `schedRenderJobs()` in `scheduler.js` adds a blue "Report only" (`m365_sched_report_only`) badge to the job name when `j.report_only` is true.
- UI, `schedToggleReportOnly()`: dims the Profile row (`#schedProfileRow`, opacity 0.4), shows/hides `#schedReportOnlyHint`, and forces `#schedAutoEmail` checked. Called from the checkbox `onchange` handler and at the start of `schedAddJob()` / `schedEditJob()`.
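
The shared-list behaviour noted above depends on updating the list in place. The following small illustration uses a hypothetical `state` namespace in place of the real state module; it shows why slice assignment keeps both names in sync while a plain rebind silently breaks the link.

```python
from types import SimpleNamespace

flagged_items = [{"id": 1}]

# Hypothetical stand-in for the shared state module.
state = SimpleNamespace(flagged_items=flagged_items)

# In-place slice assignment mutates the one shared list,
# so the change is visible through both names.
flagged_items[:] = [{"id": 2}, {"id": 3}]
shared_after_slice = state.flagged_items is flagged_items
seen_by_state = len(state.flagged_items)

# Rebinding the local name creates a new list; state still
# points at the old one and no longer sees updates.
flagged_items = [{"id": 4}]
shared_after_rebind = state.flagged_items is flagged_items
```

Any code path that replaces the list's contents should therefore use `flagged_items[:] = ...` (or `clear()` plus `extend()`) rather than assigning a fresh list.
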
Global gotchas
- Pattern matching in Python: when using `str.replace()` to patch JS/HTML, whitespace and quote style must match exactly. Use an `in` check first and print if not found.
- `__getattr__` on modules: only resolves `module.name` access from outside, not bare name lookups inside function bodies. Always import directly.
- `JSON.stringify` inside `onclick="…"` attributes: produces double-quoted strings that terminate the HTML attribute early. Use single-quoted JS string literals instead, or `data-*` attributes read from the handler.
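
The module `__getattr__` gotcha can be reproduced in isolation. The sketch below builds a throwaway in-memory module (the names `demo`, `lazy_value`, and `use_bare_name` are hypothetical) to show that attribute access from outside goes through `__getattr__`, while a bare name inside a function body of the same module does not.

```python
import types

SOURCE = '''
def __getattr__(name):
    # Called only for attribute access on the module object (PEP 562).
    if name == "lazy_value":
        return 42
    raise AttributeError(name)


def use_bare_name():
    # Bare name lookup goes through globals and builtins, not __getattr__.
    return lazy_value
'''

demo = types.ModuleType("demo")  # hypothetical module built in memory
exec(SOURCE, demo.__dict__)

# Attribute access from outside is resolved by the module-level __getattr__.
from_outside = demo.lazy_value

# The same name used bare inside the module raises NameError.
try:
    demo.use_bare_name()
    bare_lookup_failed = False
except NameError:
    bare_lookup_failed = True
```

This is why the note recommends importing names directly instead of relying on a module-level `__getattr__` shim from inside the module's own functions.
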
Directory-scoped rules
- `routes/CLAUDE.md`: SSE constraints, `scan_progress` source field, `file_sources`, Python gotchas
- `static/js/CLAUDE.md`: profile dropdown, progress bar phase parsing, JS gotchas
- `templates/CLAUDE.md`: CSS variable names, sizing rules, badge standard, design rules
- `lang/CLAUDE.md`: i18n conventions