diff --git a/CHANGELOG.md b/CHANGELOG.md index 9058451..a6ad53a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,28 @@ Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html --- +## [1.6.22] — 2026-04-21 + +### Added + +- **Auto-email after manual scan** — a new **Email report after manual scan** toggle in **Settings → Email report** sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as `auto_email_manual` in `smtp.json`. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected). + +- **Route integration test suite** — 44 new tests in `tests/test_route_integration.py` covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on `GET /api/db/flagged`, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on `run_scan()` exception, and `GET /api/db/sessions` shape and ordering. Total test count: 172. + +### Fixed + +- **Role scope filter silently returned nothing** — `GET /api/db/flagged` filtered rows by `row.get("role")` but the column returned from the DB is `user_role`. Role-scoped viewer tokens (`{"role": "student"}` or `{"role": "staff"}`) therefore excluded every item and returned an empty list. Fixed in `routes/database.py`. + +- **Historical session query included newer scans** — `gdpr_db.get_session_items(ref_scan_id=N)` used a lower-bounded window (`started_at >= ref.started_at - 300`) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. 
Fixed by adding an upper bound (`started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`). + +- **Scan button stuck disabled after file scan** — `run_file_scan` broadcast a `scan_start` SSE event, which the `scan_start` handler in `_attachSchedulerListeners` intercepted and set `S._m365ScanRunning = true`. When `file_scan_done` fired it checked `!S._m365ScanRunning` before re-enabling the button — finding it still `true`, the button stayed disabled permanently. No `scan_done` (M365) ever arrives to clear the flag. Fixed by removing the `scan_start` broadcast from `run_file_scan`; the `scan_phase "Files — …"` event immediately following already sets `_fileScanRunning` correctly via the phase-source detection in `_attachScanListeners`. + +- **`TypeError: unhashable type: 'dict'` during file and M365 scans** — `_distinct_cprs = list(dict.fromkeys(cprs))` in both scan paths treated `cprs` as a list of strings, but `extract_matches` returns a list of dicts (`{"formatted": "…", "page": …, …}`). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both `run_file_scan` (line 251) and `run_scan` (line 1100) by keying on `c["formatted"]`: `list(dict.fromkeys(c["formatted"] for c in cprs))`. + +- **Profile applied early lost user selection and source checkboxes** — two startup race conditions: (1) Profiles with `user_ids = "all"` applied before the M365 user list had loaded ran `.forEach()` on an empty array (no-op); when `loadUsers()` completed it defaulted all users to `selected = false` with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a `_pendingProfileAllUsers` deferred flag mirroring the existing `_pendingProfileUserIds` mechanism — `loadUsers()` applies it after populating `S._allUsers`. 
(2) If the profile was selected in the narrow window before `_loadFileSources()` returned and rendered the sources panel, `_applyProfile()` iterated zero checkboxes and the source selection was silently discarded; a subsequent `renderSourcesPanel()` call then re-rendered all sources as checked (their default). Fixed by calling `renderSourcesPanel()` in `_applyProfile()` when no source checkboxes are present in the DOM yet — same guard already used in `loadUsers()`. + +--- + ## [1.6.21] — 2026-04-20 ### Added diff --git a/CLAUDE.md b/CLAUDE.md index 1ad10d0..a72c36c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -42,7 +42,9 @@ python -m pytest tests/ -q ## Tests -128 tests in `tests/`. No integration tests for Flask routes or live M365/Google connections. +172 tests in `tests/`. No integration tests for live M365/Google connections. + +**`tests/test_route_integration.py`** — 44 Flask test-client tests covering security-sensitive paths: viewer token CRUD and scope validation, `GET /api/db/flagged` role/user scope enforcement, bulk disposition isolation, viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate (multi-step flows require `session["interface_ok"] = True` after PIN set — the `before_request` hook blocks the same endpoint once a PIN exists), scan lock release on `run_scan()` exception, `GET /api/db/sessions` shape and ordering. Uses a tmp-path `ScanDB` monkeypatched into `routes.database._get_db` — tests never touch the real database. Interface PIN tests manipulate the real `config.json` via `setup_method`/`teardown_method` calling `clear_interface_pin()`. **Local-file scan fixtures** — `tests/fixtures/local_files/` holds 13 documents for manual/UI-level testing of the file scanner. 10 should be flagged; 3 are true negatives. All CPR numbers verified against `is_valid_cpr`. `generate_fixtures.py` (requires `python-docx` + `openpyxl`, already in venv) regenerates the binary `.docx`/`.xlsx` files. 
@@ -55,7 +57,7 @@ Read-only access for DPOs and reviewers. Key invariants: - **`/view` auth chain** — token (`?token=`) → session cookie (`session["viewer_ok"]`) → PIN form (if PIN configured) → 403. Never skip this order. - **`window.VIEWER_MODE`** — injected by Jinja2 in `index.html`. `auth.js` reads it at startup; adds `viewer-mode` class to `<body>`. All hide rules are CSS (`body.viewer-mode …`), not scattered JS checks — except `delBtn` in the card builder which is also guarded in JS. Hidden in viewer mode: `.sidebar` (entire left panel), `#logWrap`, `#progressBar`, scan/stop/profile/bulk-delete buttons, share button. - **`window.VIEWER_SCOPE`** — injected alongside `VIEWER_MODE`. Contains the scope dict from the token (e.g. `{"role": "student"}`). Empty object `{}` means unrestricted. `auth.js` reads it at startup; if `VIEWER_SCOPE.role` is set, it pre-sets `#filterRole` to that value and hides the dropdown so the viewer cannot change it. -- **Token scope** — stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard. +- **Token scope** — stored as `"scope": {"role": "student"|"staff"}` or `"scope": {}` in each token dict inside `viewer_tokens.json`. Enforced in two places: server-side (`GET /api/db/flagged` skips items whose `user_role` column does not match `session["viewer_scope"].role`) and client-side (the `#filterRole` dropdown is locked). Server-side is the authoritative guard. **Column name is `user_role`** — do not use `role`; the DB row has no such key and the filter silently returns nothing. - **`session["viewer_scope"]`** — set when a token is validated at `/view`. Persists for the browser session alongside `session["viewer_ok"]`. 
Reads from `session.get("viewer_scope", {})` in `/api/db/flagged` — defaults to `{}` (unrestricted) for PIN-authenticated sessions and legacy tokens without a scope key. - **`viewer_tokens.json` format** — stored as `{"tokens": [...], "__pin__": {"hash": "…", "salt": "…"}}`. Token dicts now include `"scope": {}`. The old bare-list format and tokens without a `scope` key are handled transparently (`t.get("scope", {})`). Do not write the file as a bare list. - **`app.secret_key`** — derived from `machine_id` bytes so Flask sessions survive restarts. Set once at startup in `gdpr_scanner.py`; do not override it. @@ -80,7 +82,7 @@ Read-only access for DPOs and reviewers. Key invariants: Both options live in the profile `options` dict and apply to **all three scan engines** (M365, Google, file scan). - **`skip_gps_images` (bool, default `false`)** — When enabled, images whose only PII is GPS coordinates are not flagged. GPS data is still extracted and stored in the card `exif` field if the item is flagged by another signal (faces, EXIF author/comment). The `gps_location` special category is also suppressed. Evaluated via `_exif_has_pii` which rechecks `pii_fields` and `author` when GPS is skipped. -- **`min_cpr_count` (int, default `1`)** — Minimum number of **distinct** CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(cprs))` to preserve order. Files with faces or EXIF PII are still flagged regardless of CPR count — the threshold gates only CPR-based hits. +- **`min_cpr_count` (int, default `1`)** — Minimum number of **distinct** CPR numbers in a file before it is flagged. Deduplication uses `list(dict.fromkeys(c["formatted"] for c in cprs))` — `cprs` is a list of dicts from `extract_matches`, not strings. Do not revert to `dict.fromkeys(cprs)` — that raises `TypeError: unhashable type: 'dict'` on every file with CPR hits. Files with faces or EXIF PII are still flagged regardless of CPR count — the threshold gates only CPR-based hits. 
- **File scan** reads both from `source` dict keys (passed directly from the `/api/file_scan/start` payload). **M365 scan** reads both from `scan_opts = options.get("options", {})`. Both paths apply the same `_cpr_qualifies` / `_exif_has_pii` logic before the flagging gate. - **UI:** sidebar controls `#optSkipGps` (toggle) and `#optMinCpr` (number); profile editor controls `#peOptSkipGps` and `#peOptMinCpr`. Both are saved/loaded by `profiles.js`. @@ -124,7 +126,7 @@ Allows reviewing results from any past scan session without running a new scan. - **`S._historyRefScanId`** — `null` = live/SSE mode; positive int = viewing a past session (the highest `scan_id` in that session's 300 s window). Set by `loadHistorySession()`; cleared to `null` by `exitHistoryMode()`. - **`GET /api/db/sessions`** (`routes/database.py`) — calls `_get_db().get_sessions()`. Returns newest-first list; each entry has `ref_scan_id`, `started_at`, `finished_at`, `sources` (list of source-key strings), `flagged_count`, `total_scanned`, `delta` (bool). No auth restriction — viewer tokens share this endpoint. - **`get_sessions(limit=50, window_seconds=300)`** (`gdpr_db.py`) — groups `scans` rows by 300 s window (same window logic as `get_session_items`). Groups are built ascending, returned descending. `ref_scan_id` is the highest `scan_id` in each group. Do not change the window size independently of `get_session_items`. -- **`get_session_items(ref_scan_id=N)`** (`gdpr_db.py`) — when `ref_scan_id` is given, anchors the 300 s window to that scan's `started_at`. Falls back to latest scan when `ref_scan_id=None`. +- **`get_session_items(ref_scan_id=N)`** (`gdpr_db.py`) — when `ref_scan_id` is given, anchors the 300 s window to that scan's `started_at`. Falls back to latest scan when `ref_scan_id=None`. Window is **symmetric**: `started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300` — do not revert to a one-sided lower bound or historical sessions will include all newer scans. 
- **`GET /api/db/flagged?ref=N`** — passes `ref_scan_id` to `get_session_items`; viewer scope enforcement (role/user filters) still applies. Used by both history mode and the normal post-scan viewer path. - **History banner** (`#historyBanner`) — shown when `S._historyRefScanId` is set. Contains `#historyBannerText` (session date · sources · N items), `#historyPickerBtn` (opens `#historyDropdown`), and `#historyLatestBtn` (visible only when the viewed session is not the latest). Do not hide/show these elements from outside `history.js`. - **Session picker** (`#historyDropdown`) — rendered inside `[data-history-wrap]` container so the outside-click handler (`document` listener, closes on clicks outside `[data-history-wrap]`) works correctly. Do not move the picker outside this wrapper. @@ -137,11 +139,13 @@ Allows reviewing results from any past scan session without running a new scan. - **Do not close `S.es` in `scan_done` if other scans are still running** — M365 (`scan_done`), Google (`google_scan_done`), and File (`file_scan_done`) each emit their own done event. If M365 finishes first and the SSE is closed, the remaining done events are never received and the UI hangs at 100% indefinitely. - **Rule:** close `S.es` (and reset `S._userStartedScan`) only inside the branch where *all* concurrent scans have finished: `scan_done` checks `!S._googleScanRunning && !S._fileScanRunning`; `google_scan_done` checks `!S._m365ScanRunning && !S._fileScanRunning`; `file_scan_done` checks `!S._m365ScanRunning && !S._googleScanRunning`. - **Scheduled scans** — `S._userStartedScan` is false for scheduler-triggered runs, so the SSE connection is never closed and future scheduler events continue to arrive. +- **`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` unconditionally sets `S._m365ScanRunning = true`. 
If a file scan emits `scan_start`, the flag is set without a matching `scan_done` to clear it, and `file_scan_done` refuses to re-enable the scan button because `!S._m365ScanRunning` is false. Use `scan_phase` (file) and `google_scan_phase` (google) instead — these are routed correctly by the phase-source detection logic in `_attachScanListeners`. ## Email sending — routes/email.py + m365_connector.py - **`_post()` returns `{}` on empty body** — `m365_connector._post()` returns `r.json() if r.content else {}`. The Graph `sendMail` endpoint returns HTTP 202 with **no body** on success; calling `r.json()` on an empty response raises `JSONDecodeError`. Do not change this back to an unconditional `r.json()` — it would falsely report every successful email send as an error. - **Graph preferred over SMTP** — `smtp_test` and `send_report` both try `_send_email_graph()` first when `state.connector` is authenticated. Only falls back to SMTP if Graph raises. If Graph fails and no SMTP host is saved, the Graph exception is surfaced directly (not swallowed by the "No SMTP host" message). +- **Auto-email after manual scan** — `_maybe_send_auto_email()` in `routes/scan.py` is called from the `_run()` thread immediately after `run_scan()` returns. Reads `smtp_cfg.get("auto_email_manual")` from `smtp.json`; no-ops if the flag is false, no flagged items, or no recipients. Same Graph-first → SMTP-fallback pattern as the scheduler. Toggle: **Settings → Email report → Email report after manual scan** (`#st-smtpAutoEmail`), saved by `stSmtpSave()` in `scheduler.js`. - **Gmail vs Google Workspace detection** — auth error handlers check whether the SMTP username ends in `@gmail.com` / `@googlemail.com`. If not, the account is treated as Google Workspace (custom domain) and the error message points to the Workspace admin console rather than the user's personal security settings. 
## Global gotchas diff --git a/README.md b/README.md index e430e46..698f68e 100644 --- a/README.md +++ b/README.md @@ -589,14 +589,14 @@ python gdpr_scanner.py # GDPRScanner on port 5100 (auto-increments if in use) ### Test suite -GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, and the SQLite database. +GDPRScanner ships with a `pytest` test suite covering the CPR detection engine, configuration layer, checkpoint persistence, the SQLite database, and security-sensitive Flask routes. ```bash pip install pytest pytest tests/ ``` -**128 tests across 4 modules — all expected to pass.** +**172 tests across 5 modules — all expected to pass.** | Module | Tests | Covers | |---|---|---| @@ -604,8 +604,9 @@ pytest tests/ | `tests/test_app_config.py` | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption | | `tests/test_checkpoint.py` | 18 | Checkpoint key stability, save/load/clear, wrong-key isolation, delta token round-trip | | `tests/test_db.py` | 24 | Scan lifecycle, CPR hash-only storage, data subject lookup, dispositions, export/import cycle | +| `tests/test_route_integration.py` | 44 | Viewer token CRUD, role/user scope enforcement, bulk disposition isolation, viewer PIN, interface PIN gate, scan lock release on failure, session history ordering | -Each new module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server. +Each unit-test module (`cpr_detector.py`, `app_config.py`, `checkpoint.py`, `gdpr_db.py`) is importable in isolation without Flask or MSAL — tests run without any cloud credentials or a running server. The test suite should be run before every release and after any change to `document_scanner.py`, `cpr_detector.py`, or `gdpr_db.py`. 
CPR detection is the legal core of the tool — a false negative means a real GDPR violation goes undetected. diff --git a/SUGGESTIONS.md b/SUGGESTIONS.md new file mode 100644 index 0000000..5e078d4 --- /dev/null +++ b/SUGGESTIONS.md @@ -0,0 +1,352 @@ +# SUGGESTIONS — Feature ideas and implementation history + +This document tracks every significant feature idea: what was proposed, whether it was implemented, and why decisions were made the way they were. Read this before adding a feature — the reasoning behind past decisions is often non-obvious. + +**Status key:** ✅ Done · ✗ Won't do · ○ Open + +--- + +## §1 — M365 email, OneDrive, SharePoint, Teams scanning ✅ + +Core premise: scan Microsoft 365 tenants for CPR numbers across all major storage surfaces — Exchange mailboxes (all folders, recursive), OneDrive personal drives, SharePoint document libraries, and Teams channel file storage — using the Microsoft Graph API. + +**Implemented:** v1.0.0. The `m365_connector.py` client handles auth (application mode + delegated device-code flow), delta tokens, attachment download, and all four scan surfaces. Results stream card-by-card via SSE. + +--- + +## §2 — Incremental / delta scanning ✅ + +Re-scanning a large tenant on every run is too slow for regular compliance use. Microsoft Graph provides `/delta` endpoints for Exchange, OneDrive, and SharePoint that return only items changed since the last sync token. + +**Implemented:** v1.0.0. Delta tokens saved per-user in `~/.gdprscanner/delta.json`. Checkpoint saves mid-scan progress so interrupted runs can resume. `M365DeltaTokenExpired` exception handles the 410 Gone case by falling back to a full scan. + +--- + +## §3 — Article 9 special category detection ✅ + +CPR numbers alone do not tell you whether the data is especially sensitive. Files containing health diagnoses, criminal records, trade union membership, etc. alongside CPR numbers carry significantly higher GDPR risk and may trigger DPIA requirements (Art. 35). 
+ +**Implemented:** v1.2.0. `keywords/da.json` (459 Danish keywords, 9 Art. 9 categories). Proximity filter: a keyword only triggers when within 150 characters of a CPR number, or always if no CPRs are in the document. Purple `⚠ Art. 9` badge on result cards. Art. 30 export gains a DPIA warning when Art. 9 items are present. + +**Why proximity:** Pure keyword presence is too noisy — every GDPR policy document would be flagged. Proximity to a CPR number is a meaningful signal that the document actually concerns a specific individual. + +--- + +## §4 — Data subject lookup (Art. 15/17) ✅ + +Schools must be able to answer subject access requests: "what data do you hold about me?" and delete it on request (Art. 17). This requires a cross-source query by CPR number. + +**Implemented:** v1.0.0. SHA-256 hash of the CPR query compared against stored hashes — the plaintext CPR is never written to the database. Bulk delete with audit logging. Result count and source breakdown returned. + +--- + +## §5 — Disposition tagging and review workflow ✅ + +Without a way to mark items as reviewed, every scan produces the same undifferentiated pile. Compliance officers need to track what has been actioned, what is being retained, and what is scheduled for deletion. + +**Implemented:** v1.0.0. Five disposition states: Unreviewed / Retain (legal basis / legitimate interest / contract) / Delete-scheduled / Deleted. Filter bar, preview panel dropdown, Excel export column. Headless auto-delete runs on `delete-scheduled` items during scheduled scans. + +--- + +## §6 — Article 30 processing register export ✅ + +Danish public authorities are required to maintain a GDPR Article 30 processing register. Generating one manually from scan results is error-prone and time-consuming. + +**Implemented:** v1.0.0. Structured `.docx` export with summary, data categories, staff/student inventory split, retention analysis, compliance trend, deletion audit log, and methodology section. 
Updated in every major feature release to include new sources (Google, local/SMB), new risk categories (EXIF GPS, faces), and new fields. + +--- + +## §7 — Retention policy enforcement ✅ + +GDPR Art. 5(1)(e) requires personal data not be kept longer than necessary. Schools often use a rolling retention policy (e.g. 5 years) or a fiscal-year-end cutoff (e.g. Dec 31 per Bogføringsloven). + +**Implemented:** v1.0.0. Configurable retention years + fiscal year end. `🗓 Overdue` badge on cards whose modified date exceeds the cutoff. Bulk delete quick filter. Headless `--retention-years` + `--fiscal-year-end` flags for automated enforcement. Auto-retention flag on scheduled scans. + +--- + +## §8 — Local folder and SMB/CIFS network share scanning ✅ + +Most Danish schools also have file servers (Windows Server, Synology NAS, QNAP) that are not covered by Microsoft 365 or Google Workspace. CPR numbers stored in shared drives are a significant risk. + +**Implemented:** v1.4.0. `file_scanner.py` — unified local + SMB iterator. `smbprotocol` for direct SMB2/3 without requiring a mount. Credential storage via OS keychain (`keyring`). Results write to the same database as M365/Google items. Source badges: `📁 Local` / `🌐 Network`. + +**Why smbprotocol instead of requiring a mount:** mounts require elevated privileges and are not available in the packaged desktop app. `smbprotocol` connects directly over TCP. + +--- + +## §9 — Biometric photo scanning (Art. 9) ✅ + +Photographs of identifiable people are biometric data under GDPR Art. 9 regardless of whether they contain CPR numbers. Schools routinely have student photos in OneDrive and SharePoint. + +**Implemented:** v1.3.0. Optional `scan_photos` flag (opt-in — slower). `_detect_photo_faces()` uses OpenCV Haar cascade detection via `document_scanner`. Items flagged when `face_count > 0` even without CPR hits. `📷 N faces` badge on cards. + +**Opt-in rationale:** Haar cascade detection on large tenants adds significant scan time. 
Enable for targeted compliance audits, not routine scans. + +--- + +## §10 — Google Workspace scanning (Gmail + Drive) ✅ + +Mixed Microsoft/Google environments are common in Danish schools. Gmail and Google Drive are outside the M365 scan scope. + +**Implemented:** v1.5.9. `google_connector.py` — service account OAuth with domain-wide delegation. Gmail message + attachment iterator. Drive file iterator with automatic export of native Docs/Sheets/Slides → DOCX/XLSX/PPTX before scanning. Results write to the same database with `source_type = "gmail"` or `"gdrive"`. + +--- + +## §11 — Database export and import ✅ + +Compliance records need to be portable — for archiving, sharing with a DPO tool, or migrating between installations. + +**Implemented:** v1.2.3. `GET /api/db/export` streams a ZIP of 8 JSON files (CPR hashes only, thumbnails stripped). `POST /api/db/import` supports merge (dispositions + deletion log only) or replace (full wipe and restore). CLI flags: `--export-db`, `--import-db`, `--import-mode`. + +--- + +## §12 — Internationalisation (i18n) ✅ + +The scanner is used by Danish, German, and English-speaking staff. Hardcoded Danish strings exclude other users. + +**Implemented:** v1.0.0 with `.lang` key-value files. Migrated to flat JSON in v1.6.3 (§27). Language switching applied in-place — no page reload, scan results preserved. Three languages: Danish (primary), English, German. + +--- + +## §13 — Article 9 keyword matching compiled to regex ✅ + +Sequential `str.find()` over 459 keywords becomes measurable overhead when scanning large email bodies across thousands of items. + +**Implemented:** v1.2.3. `_load_keywords()` compiles one `re.Pattern` per Article 9 category at startup using a longest-first alternation. Short keywords retain word-boundary anchors to prevent substring false positives. ~10–50× faster for large tenants. 
+ +--- + +## §14 — Manual role overrides ✅ + +Microsoft SKU IDs are not exhaustive — new licences, benefit add-ons, and custom arrangements mean some users are always misclassified. Admins need a way to correct individual users without waiting for a SKU map update. + +**Implemented:** v1.3.2. Click the role badge on any user row to cycle: auto → student → staff → other → clear. Overrides persisted to `~/.gdprscanner/role_overrides.json`. Applied at display time and scan time so all role-filtered views are correct. + +--- + +## §15 — Named, reusable scan profiles ✅ + +Running the same scan repeatedly (e.g. all staff accounts, Email + OneDrive only, 5-year retention) requires reconfiguring the sidebar every time. Profiles should capture the full scan state and be reusable in both UI and headless/scheduled runs. + +**Implemented across multiple releases:** +- §15a (v1.2.1) — backend profile storage, migration from flat settings, profile CRUD API +- §15b (v1.2.1) — CLI flags: `--list-profiles`, `--save-profile`, `--delete-profile`, `--profile` +- §15c (v1.2.2) — profile dropdown in topbar + 💾 save button +- §15d (v1.2.3) — profile management modal (list, use, duplicate, delete) +- §15e (v1.6.3/v1.6.4) — full two-panel editor (all sidebar sections mirrored, including Google and file sources) +- §15f (v1.6.3) — scheduler uses profiles including file sources; `file_sources` saved in profiles + +--- + +## §16 — Unified source management modal ✅ + +Azure credentials, per-source toggles, and file source management were split across three separate sidebar locations. The credential form in particular belonged in a modal, not exposed in the main UI. + +**Implemented:** v1.4.1. Single **⚙ Sources** button opens a tabbed modal: Microsoft 365 tab (credentials + per-source visibility toggles), Google Workspace tab, File sources tab. The sidebar shows only the source panel with the configured sources — no credentials visible. 
+ +--- + +## §17 — Unified source management modal ✅ + +*(See §16 — these are the same feature, §16 is the canonical entry.)* + +--- + +## §18 — EXIF metadata extraction from images ✅ + +GPS coordinates in smartphone photos are Art. 9-adjacent data in a school context — they reveal where a student or staff member was. EXIF author/comment fields can contain personal data added by software (e.g. desktop publishing tools). + +**Implemented:** v1.4.4. `_extract_exif()` extracts GPS (converted to decimal degrees + Google Maps link), author/artist/copyright/description/keywords/user-comment fields from JPEG, PNG, TIFF, WEBP, HEIC. Images flagged even without CPR when GPS or PII-bearing EXIF fields are present. Runs regardless of the `scan_photos` toggle (lightweight — no CV processing). + +--- + +## §19 — Scheduled / automatic scans ✅ + +Manual scans require someone to remember to run them. GDPR compliance is an ongoing obligation — scanning should run automatically on a configurable cadence without requiring cron or Task Scheduler outside the app. + +**Implemented:** v1.5.3. In-process APScheduler with one job per enabled schedule. Supports daily/weekly/monthly, time-of-day, profile selector, auto-email, auto-retention. Config in `~/.gdprscanner/schedule.json`. Multiple independent named jobs added in v1.5.4. Scheduled scans reuse the full `run_scan()` pipeline — checkpoint, delta, broadcast, DB. + +--- + +## §20 — PDF OCR via multiprocessing ✅ + +Tesseract/Poppler subprocesses used for OCR on image-only PDFs cannot be killed from a Python thread. A hung OCR process blocks the scan thread indefinitely. + +**Implemented:** v1.6.5. `_scan_bytes_timeout()` in `cpr_detector.py` spawns a fresh subprocess via `multiprocessing.get_context("spawn")` with a 60-second hard timeout. Process tree terminated if the timeout fires. Image-only PDF detection via `pdfplumber` (text layer check) before spawning avoids OCR entirely for scanned documents — the most common cause of hangs. 
+ +**Why spawn context (not fork):** `fork` inherits Flask's open file descriptors and threading state, causing deadlocks in multiprocessing workers on macOS. `spawn` starts clean. + +--- + +## §21 — SSE event replay for mid-scan browser connections ✅ + +Opening the browser while a scan is already running (common for scheduled scans) showed nothing until the next SSE event fired. The in-progress result cards and log were lost. + +**Implemented:** v1.5.6 (replay buffer) + v1.5.8 (scheduled scan visibility). `_sse_buffer: deque(maxlen=500)` stores all broadcast events. New clients receive the full buffer replay, then switch to live events. Module identity fix (`sys.modules["m365_scanner"] = sys.modules[__name__]`) ensures the scheduler broadcasts to the same SSE queues the browser is reading. + +--- + +## §22 — SMB pre-fetch sliding window ✅ + +SMB scans were single-threaded: read file → scan → read next. On high-latency NAS connections the idle time waiting for the next read dominated scan time. A stalled NAS read also blocked the scan thread indefinitely. + +**Implemented:** v1.6.5. `_smb_collect()` phase walks the tree (directory listing only). `_iter_smb()` phase feeds files through a 5-slot `ThreadPoolExecutor` with a 60-second per-file hard timeout. Stalled reads produce an error card and the scan continues. + +--- + +## §23 — Google Workspace role classification + cross-platform identity mapping ✅ + +Google Workspace users need the same student/staff classification as M365 users for Art. 30 inventory splits and role-scoped exports. In a mixed environment, the same person has both an M365 UPN and a GWS email — they should appear as one person in the accounts list. + +**Implemented:** v1.6.3. +- OU-based role classification: `classification/google_ou_roles.json` maps Organisational Unit paths to roles (edit to match your school's structure; default: `/Elever` → student, `/Personale` → staff). 
+- `google_connector.list_users()` fetches `orgUnitPath` via `projection=full` and classifies each user. +- Cross-platform identity: M365 and GWS accounts are matched by `displayName` (not email prefix — display names are maintained from the same AD source). Matched users show a `M365+GWS` badge and share a combined row in the accounts panel. + +--- + +## §24 — Rename: M365 Scanner → GDPRScanner ✅ + +The tool now scans M365, Google Workspace, local file systems, and SMB shares. "M365 Scanner" was misleading for users setting up Google or file scanning. + +**Implemented:** v1.6.0. All files renamed (`m365_scanner.py` → `gdpr_scanner.py`, etc.). Config files renamed on first startup via migration shim — existing data preserved automatically. `m365_connector.py` intentionally unchanged (accurately describes the Microsoft Graph connector). + +--- + +## §25 — Split `gdpr_scanner.py` into focused modules ✅ + +`gdpr_scanner.py` was 9 600 lines. Every feature PR touched the same monolith, causing merge conflicts. Unit tests could not import scan logic without pulling in Flask, MSAL, and the entire app. + +**Implemented:** v1.6.1. Five new modules: `sse.py`, `checkpoint.py`, `app_config.py`, `cpr_detector.py`, `scan_engine.py`. `gdpr_scanner.py` imports and re-exports them; blueprints use `__getattr__` for lazy resolution to avoid circular imports. + +**Why `__getattr__` on the module:** blueprints were already resolving names from `gdpr_scanner` at call time. Swapping to direct imports would have required touching every blueprint route. The lazy hook keeps the diff minimal and reversible. + +--- + +## §26 — pytest test suite ✅ + +Compliance software has no tolerance for regressions in CPR detection. Manual testing is not sufficient. + +**Implemented:** v1.6.2. 
128 tests across 4 modules: `test_document_scanner.py` (CPR detection accuracy and false-positive checks), `test_app_config.py` (i18n, keywords, config, profiles, encryption), `test_checkpoint.py` (checkpoint and delta token persistence), `test_db.py` (scan lifecycle, CPR hash-only storage, dispositions). All tests pass in CI. + +--- + +## §27 — Migrate i18n format from `.lang` to JSON ✅ + +`.lang` files are a bespoke key=value format with no tooling support. JSON is standard, diff-friendly, and parseable by any editor with a JSON schema plugin. + +**Implemented:** v1.6.3. `lang/en.json`, `da.json`, `de.json` — 709 keys each, flat JSON. `app_config.py` loader prefers `.json`, falls back to `.lang` for backward compatibility. Old `.lang` files retained as fallback. + +--- + +## §28 — Personal use disposition value ✅ + +Staff members sometimes store personal files (not work-related) on work equipment. These files are outside GDPR scope per Art. 2(2)(c), but reviewers had no way to record that determination — everything had to be "retain" or "delete". + +**Implemented:** v1.6.2. New disposition: **Personal use — out of scope**. Art. 30 report labels it "Personal use — out of GDPR scope (Art. 2(2)(c))". + +--- + +## §29 — Rename `skus/` → `classification/` ✅ + +`skus/` only described M365 SKU data. The directory now also contains Google Workspace OU role mappings — the name was misleading. + +**Implemented:** v1.6.3. `skus/education.json` → `classification/m365_skus.json`. `skus/google_ou_roles.json` → `classification/google_ou_roles.json`. All path references updated. + +--- + +## §30 — Personal Google account OAuth ✅ + +Service account + domain-wide delegation requires a Google Workspace admin to configure. Personal Gmail users and small organisations without Workspace admin access were excluded. + +**Implemented:** v1.6.5. `PersonalGoogleConnector` — device-code OAuth flow (mirrors M365 delegated mode). Token persisted to `~/.gdprscanner/google_token.json`.
`list_users()` returns a single-item list so the scan engine needs no changes. Auth-mode toggle in the Sources modal (Workspace / Personal account). + +--- + +## §31 — Built-in user manual ✅ + +The scanner is used by school administrators and municipal compliance officers with no technical background. External documentation links go stale and are not available offline. + +**Implemented:** v1.6.5. `docs/manuals/MANUAL-EN.md` and `MANUAL-DA.md` — 14 sections covering all major features in plain language. `GET /manual` route converts Markdown to a self-contained HTML page with no external dependencies. **`?` button** in the topbar opens the manual in a dedicated window. Bundled in the PyInstaller app. + +--- + +## §32 — Windowed mode for Profiles, Sources, and Settings ✗ Won't do + +**Proposal:** open Profiles, Sources, and Settings as resizable windows instead of full-screen modals so the results grid remains visible alongside configuration. + +**Why not:** The workflow is sequential — configure → scan → review. There is no realistic scenario where a configuration modal and the results grid need to be open simultaneously. Sources is already visible in the sidebar during scanning. The least-work path (Option A, inline iframe) still loads the full JS stack twice and introduces message-passing complexity. The UX gain does not justify the implementation cost or the ongoing maintenance burden. + +--- + +## §33 — Read-only viewer mode with PIN/token URL ✅ + +DPOs, school principals, and compliance coordinators need to review scan results and tag dispositions without access to scan controls, Azure credentials, or settings. Giving them full admin access is not appropriate. + +**Implemented:** v1.6.14. Token-based share links (`/view?token=…`) and PIN alternative. Viewer mode hides the entire sidebar, log panel, scan/stop buttons, and delete controls. Disposition tagging remains fully functional. Viewer tokens support expiry (7d/30d/90d/1yr/never). 
PIN stored as salted SHA-256 hash. Brute-force guard: 5 failures per IP per 5 minutes. + +--- + +## §34 — User-scoped viewer tokens ✅ + +Role-scoped tokens (§33) let a DPO see all students or all staff. But an individual employee who asks "what data do you have about me?" under Art. 15 should see only their own items — not everyone in their role group. + +**Implemented:** v1.6.17. Token scope `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}` filters `flagged_items` by `account_id IN (list)`, covering both M365 and GWS items. Share modal gains a **User** scope option with searchable name autocomplete backed by the loaded account list. Viewer header shows the person's full name in a locked identity badge. + +**Why a list of emails (not a single field):** the same person has different `account_id` values in M365 (`alice@school.dk` UPN) and Google Workspace (`alice.smith@school.dk` GWS email). Both must be included to cover items from either platform. + +--- + +## §35 — Scan history browser ✅ + +After a page reload, the previous scan's results were gone — no way to return to them without running a new scan. + +**Implemented:** v1.6.17. Past scan sessions grouped by 300-second concurrent-scan window. `GET /api/db/sessions` returns a newest-first list with timestamps, sources, item count, and delta flag. Session picker dropdown in a history banner above the filter bar. Auto-loads the most recent completed session on page load when no scan is running. Starting a new scan exits history mode. + +--- + +## §36 — Interface PIN ✅ + +In school environments the scanner is often left running on a workstation in an IT room. Any passer-by could open `http://localhost:5100` and access scan results or credentials. + +**Implemented:** v1.6.21. Optional 4–8 digit PIN set in **Settings → Security → Interface PIN**. Unauthenticated requests to the main UI or API redirect to `/login`.
`/view` and viewer auth routes are completely exempt — reviewer links are unaffected. Salted SHA-256 hash stored in `config.json`. Rate-limited: 5 failures per IP per 5 minutes. + +--- + +## §37 — Google Drive delta scan ✅ + +Google Drive scans always re-downloaded every file on every run, regardless of what had changed. This made repeated scans of large Google Drives impractical. + +**Implemented:** v1.6.21. Uses the Google Drive Changes API. First delta-enabled run records a start page token per user (`gdrive:{email}` in `delta.json`). Subsequent runs call `conn.get_drive_changes()` and process only changed/new files. Invalid tokens fall back to a full scan automatically. Token save loads `delta.json` fresh before writing to avoid racing with concurrent M365 token saves. + +--- + +## §38 — Route integration tests ✅ + +Security-sensitive paths (viewer token auth, role/user scope enforcement, interface PIN gate) had no automated coverage. The only way a role-scope regression would be caught was manually testing a share link — which nobody did, and a real bug went undetected (`row.get("role")` vs. `row.get("user_role")`). + +**Implemented:** 44 Flask test-client tests in `tests/test_route_integration.py` covering: viewer token CRUD and scope validation, `GET /api/db/flagged` role and user scope enforcement, bulk disposition isolation (untouched items stay unreviewed), viewer PIN (set/verify/rate-limit/change/clear), interface PIN gate with multi-step flows, scan lock always released on `run_scan()` exception, `GET /api/db/sessions` shape and newest-first ordering. All tests run against a tmp-path in-memory database — no cloud credentials required. + +**Bugs caught and fixed:** +- `routes/database.py` role scope filter used `row.get("role")` — column is `user_role`. Role-scoped tokens returned an empty list for all users. +- `gdpr_db.get_session_items(ref_scan_id=N)` had no upper time bound — historical session queries included all subsequent scans. 
Fixed with `BETWEEN ref - 300 AND ref + 300`. + +**Why test the interface PIN gate separately:** the `before_request` hook in `gdpr_scanner.py` blocks ALL API routes (including `/api/interface/pin` itself) once a PIN is set. Multi-step PIN tests must inject `session["interface_ok"] = True` after the first PIN-set request — otherwise the gate blocks subsequent requests in the same test. + +--- + +## Open ideas + +### Streaming / generator scan pattern for very large tenants + +Current M365 scan: collect all work items first (all users' emails + files), then process. For tenants >500k emails the `work_items` deque can still be several GB even after stripping email HTML. The fix is to process each user's items inline as they are fetched — generator/streaming pattern — so memory is bounded to one user's items at a time. + +**Estimate:** 1–2 days. Requires careful refactoring of `run_scan()` in `scan_engine.py`. Not urgent until a tenant of that size is encountered. + +### Bulk redaction + +Write redacted copies of flagged files with CPR numbers replaced by `XXX XXXX-XXXX`. Would require writing back to OneDrive/SharePoint/Google Drive (upload with the same filename). Legally complex — redaction must be audited. Low priority until a school explicitly requests it. + +### Email notification on scan completion (non-scheduled) ✅ + +Auto-email now fires on manual scans when **Email report after manual scan** is enabled in Settings → Email report. Toggle stored as `auto_email_manual` in `smtp.json`. Implemented in `routes/scan.py` — `_maybe_send_auto_email()` is called from the `_run()` thread after `run_scan()` returns. Same Graph-first → SMTP-fallback pattern as scheduled scans. Only fires when there are flagged items and at least one recipient is configured. 
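The gating rules for the manual-scan auto-email described above can be sketched as a small pure function. This is an illustrative sketch, not the code in `routes/scan.py`: the helper names `normalize_recipients` and `should_send_auto_email` are hypothetical, and only the `auto_email_manual` and `recipients` keys are taken from the description.

```python
def normalize_recipients(raw):
    """Accept a list or a comma/semicolon-separated string; return a clean list."""
    if isinstance(raw, str):
        raw = raw.replace(";", ",").split(",")
    return [r.strip() for r in (raw or []) if r and r.strip()]


def should_send_auto_email(smtp_cfg, flagged_count):
    """Return the recipients to mail, or an empty list if nothing should be sent.

    Hypothetical helper: mirrors the three documented conditions
    (toggle enabled, at least one flagged item, at least one recipient).
    """
    if not smtp_cfg.get("auto_email_manual"):
        return []  # toggle is off (the default)
    if flagged_count <= 0:
        return []  # nothing flagged, so there is no report worth sending
    return normalize_recipients(smtp_cfg.get("recipients", []))
```

Keeping the decision separate from the Graph/SMTP sending makes the three conditions testable without a mail server.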
diff --git a/VERSION b/VERSION index 49e1fe3..d619516 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.6.21 +1.6.22 diff --git a/gdpr_db.py b/gdpr_db.py index 16ac73a..aa4ce59 100644 --- a/gdpr_db.py +++ b/gdpr_db.py @@ -507,9 +507,9 @@ class ScanDB: FROM flagged_items fi JOIN scans s ON fi.scan_id = s.id LEFT JOIN dispositions d ON d.item_id = fi.id - WHERE s.started_at >= ? AND s.finished_at IS NOT NULL + WHERE s.started_at BETWEEN ? AND ? AND s.finished_at IS NOT NULL ORDER BY fi.cpr_count DESC""", - (latest_start - window_seconds,), + (latest_start - window_seconds, latest_start + window_seconds), ).fetchall() result = [] for r in rows: diff --git a/lang/da.json b/lang/da.json index 3f11fde..0474c2a 100644 --- a/lang/da.json +++ b/lang/da.json @@ -364,6 +364,7 @@ "m365_smtp_recipients": "Modtagere", "m365_smtp_recipients_hint": "Adskil med komma eller semikolon", "m365_smtp_save": "Gem", + "m365_smtp_auto_email_manual": "Send rapport efter manuel scanning", "m365_smtp_send": "Send nu", "m365_smtp_saved": "Indstillinger gemt.", "m365_smtp_sending": "Sender…", diff --git a/lang/de.json b/lang/de.json index b6ea9e6..b8eddd0 100644 --- a/lang/de.json +++ b/lang/de.json @@ -364,6 +364,7 @@ "m365_smtp_recipients": "Empfänger", "m365_smtp_recipients_hint": "Komma- oder semikolongetrennt", "m365_smtp_save": "Speichern", + "m365_smtp_auto_email_manual": "Bericht nach manueller Suche senden", "m365_smtp_send": "Jetzt senden", "m365_smtp_saved": "Einstellungen gespeichert.", "m365_smtp_sending": "Senden…", diff --git a/lang/en.json b/lang/en.json index 9a0c7d1..8066ece 100644 --- a/lang/en.json +++ b/lang/en.json @@ -364,6 +364,7 @@ "m365_smtp_recipients": "Recipients", "m365_smtp_recipients_hint": "Comma or semicolon separated", "m365_smtp_save": "Save", + "m365_smtp_auto_email_manual": "Email report after manual scan", "m365_smtp_send": "Send now", "m365_smtp_saved": "Settings saved.", "m365_smtp_sending": "Sending…", diff --git a/routes/CLAUDE.md b/routes/CLAUDE.md 
index 2b96a5c..9dea2c3 100644 --- a/routes/CLAUDE.md +++ b/routes/CLAUDE.md @@ -5,6 +5,8 @@ SSE routes must live in `gdpr_scanner.py`, not blueprints — blueprints can't s M365 scan emits `scan_done`; Google emits `google_scan_done`; file scan emits `file_scan_done`. Never mix them up. +**`scan_start` is M365-only** — `run_scan()` broadcasts `scan_start`; `run_file_scan()` and `routes/google_scan.py` must NOT. The `scan_start` handler in `_attachSchedulerListeners` (scan.js) unconditionally sets `S._m365ScanRunning = true`. If a file scan emits `scan_start`, the flag is set with no matching `scan_done` to clear it — `file_scan_done` checks `!S._m365ScanRunning` before re-enabling the scan button, so the button stays disabled permanently after the scan completes. + ## scan_progress source field All three scan engines must include `"source": "m365"` / `"google"` / `"file"` in every `scan_progress` SSE event. Never remove this field — the frontend uses it to route progress to the correct segment. 
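The event rules in these notes could be captured in a small lookup, sketched below. The event and source names come from the notes above; the helper names `allowed_events` and `progress_payload` are hypothetical, and the returned event set is not meant to be exhaustive.

```python
# Completion event per scan engine, as listed in the notes above.
DONE_EVENT = {
    "m365": "scan_done",
    "google": "google_scan_done",
    "file": "file_scan_done",
}


def allowed_events(source):
    """Return lifecycle events a given engine may broadcast (hypothetical helper)."""
    if source not in DONE_EVENT:
        raise ValueError(f"unknown scan source: {source!r}")
    events = {"scan_progress", DONE_EVENT[source]}
    if source == "m365":
        # scan_start is M365-only: the frontend sets its M365 running flag on it.
        events.add("scan_start")
    return events


def progress_payload(source, **fields):
    """Build a scan_progress payload that always carries the source field."""
    if source not in DONE_EVENT:
        raise ValueError(f"unknown scan source: {source!r}")
    return {"source": source, **fields}
```

A check like `"scan_start" in allowed_events("file")` returning `False` is the kind of assertion that would have flagged the stuck-button regression in a unit test.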
diff --git a/routes/database.py b/routes/database.py index 86182da..87a4a09 100644 --- a/routes/database.py +++ b/routes/database.py @@ -193,7 +193,7 @@ def db_flagged_items(): import json as _json out = [] for row in items: - if role_filt and row.get("role", "") != role_filt: + if role_filt and row.get("user_role", "") != role_filt: continue if user_filt and (row.get("account_id", "") or "").lower() not in user_filt: continue diff --git a/routes/scan.py b/routes/scan.py index 3b1f6e1..2b6c129 100644 --- a/routes/scan.py +++ b/routes/scan.py @@ -3,11 +3,13 @@ Scan stream, start/stop, checkpoint, settings, delta """ from __future__ import annotations import threading +import logging from flask import Blueprint, jsonify, request from routes import state from app_config import ( _save_settings, _load_settings, _load_src_toggles, _save_src_toggles, + _load_smtp_config, ) from checkpoint import ( _checkpoint_key, _load_checkpoint, _clear_checkpoint, @@ -15,6 +17,51 @@ from checkpoint import ( ) bp = Blueprint("scan", __name__) +_log = logging.getLogger(__name__) + + +def _maybe_send_auto_email(): + """Send the scan report email after a manual scan if auto_email_manual is enabled.""" + try: + smtp_cfg = _load_smtp_config() + if not smtp_cfg.get("auto_email_manual"): + return + if not state.flagged_items: + return + recipients = smtp_cfg.get("recipients", []) + if isinstance(recipients, str): + recipients = [r.strip() for r in recipients.replace(";", ",").split(",") if r.strip()] + if not recipients: + return + + from routes.export import _build_excel_bytes + from routes.email import _send_report_email, _send_email_graph + import datetime as _dt + + xl_bytes, fname = _build_excel_bytes() + subject = f"GDPR Scanner — scan report {_dt.datetime.now().strftime('%Y-%m-%d')}" + body_html = ( + "" + "

☁️ GDPR Scanner — scan report" + f"Please find the latest scan report attached ({fname})." + f"Generated: {_dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}" + f"Items flagged: {len(state.flagged_items)}

" + "" + ) + + if state.connector and state.connector.is_authenticated(): + try: + _send_email_graph(subject, body_html, recipients, + attachment_bytes=xl_bytes, attachment_name=fname) + _log.info("[auto-email] report sent via Graph to %s", recipients) + return + except Exception as e: + _log.warning("[auto-email] Graph failed, trying SMTP: %s", e) + + _send_report_email(xl_bytes, fname, smtp_cfg, recipients) + _log.info("[auto-email] report sent via SMTP to %s", recipients) + except Exception as e: + _log.error("[auto-email] failed: %s", e) @bp.route("/api/scan/status") @@ -57,6 +104,7 @@ def scan_start(): from scan_engine import run_scan try: run_scan(options) + _maybe_send_auto_email() finally: state._scan_lock.release() threading.Thread(target=_run, daemon=True).start() diff --git a/scan_engine.py b/scan_engine.py index 5157b86..35dfcc5 100644 --- a/scan_engine.py +++ b/scan_engine.py @@ -193,7 +193,6 @@ def run_file_scan(source: dict): total_scanned = 0 total_flagged = 0 - broadcast("scan_start", {"sources": [label]}) broadcast("scan_phase", {"phase": f"Files \u2014 {label}"}) try: @@ -248,7 +247,7 @@ def run_file_scan(source: dict): _exif = _extract_exif(content, rel_path) # Apply filters: distinct CPR threshold and GPS suppression - _distinct_cprs = list(dict.fromkeys(cprs)) # preserve order, deduplicate + _distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs)) _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count _exif_has_pii = _exif.get("has_pii") and ( not skip_gps_images or bool(_exif.get("pii_fields") or _exif.get("author")) @@ -1097,7 +1096,7 @@ def run_scan(options: dict): _exif = _extract_exif(content, name) # Apply filters: distinct CPR threshold and GPS suppression - _distinct_cprs = list(dict.fromkeys(cprs)) # preserve order, deduplicate + _distinct_cprs = list(dict.fromkeys(c["formatted"] for c in cprs)) _cpr_qualifies = len(_distinct_cprs) >= min_cpr_count _exif_has_pii = _exif.get("has_pii") and ( not skip_gps_images or 
bool(_exif.get("pii_fields") or _exif.get("author")) diff --git a/static/js/CLAUDE.md b/static/js/CLAUDE.md index 36181a9..ca82d1d 100644 --- a/static/js/CLAUDE.md +++ b/static/js/CLAUDE.md @@ -22,6 +22,13 @@ Never revert to `!!window._googleConnected` / `_fileSources.length > 0` — thos `_PHASE_SOURCE_MAP` ordering matters — `Google Workspace` must appear before `Gmail` in the map. The email regex uses `/iu` flags — do not drop the `i`. +## Profile startup race conditions — profiles.js + users.js + +`loadProfiles()` (fast, local file) resolves before `loadUsers()` (slow, Graph API). The user can select a profile before `S._allUsers` or the sources panel is populated. + +- **`user_ids = "all"` must be deferred** — if `S._allUsers` is empty when `_applyProfile()` runs, set `window._pendingProfileAllUsers = true` instead of calling `.forEach()` on an empty array. `loadUsers()` checks this flag after populating `S._allUsers` and selects everyone. Do not remove this — reverting will silently leave all accounts unchecked whenever a profile is chosen on a fast machine before the user list loads. +- **Source checkboxes may not exist yet** — `_applyProfile()` calls `renderSourcesPanel()` first if `#sourcesPanel` contains no `input[data-source-id]` nodes. Same guard used in `loadUsers()`. Without it, `querySelectorAll` returns nothing and the profile's source selection is discarded; the next `renderSourcesPanel()` call re-renders all sources as checked (their default). + ## Gotchas - **Profile editor accounts** — default to unchecked. Only explicitly saved `user_ids` are checked. diff --git a/static/js/profiles.js b/static/js/profiles.js index 62f2346..4bbd043 100644 --- a/static/js/profiles.js +++ b/static/js/profiles.js @@ -69,6 +69,11 @@ function _applyProfile(profile) { // File sources may not be rendered yet (they load async), so store their IDs // in S._pendingProfileSources for renderSourcesPanel() to apply after re-render. 
const profileSources = profile.sources || []; + // Ensure at least M365 source checkboxes are present before reading the DOM. + // renderSourcesPanel() is idempotent and fast — safe to call here. + if (!document.querySelector('#sourcesPanel input[data-source-id]') && typeof renderSourcesPanel === 'function') { + renderSourcesPanel(); + } document.querySelectorAll('#sourcesPanel input[data-source-id]').forEach(function(cb) { cb.checked = profileSources.includes(cb.dataset.sourceId); }); @@ -181,8 +186,13 @@ function _applyProfile(profile) { // ── User selection ──────────────────────────────────────────────────────── if (profile.user_ids === 'all') { - S._allUsers.forEach(u => { u.selected = true; }); - if (S._allUsers.length) renderAccountList(); + if (S._allUsers.length) { + S._allUsers.forEach(u => { u.selected = true; }); + renderAccountList(); + } else { + // Users not loaded yet — defer until loadUsers() resolves + window._pendingProfileAllUsers = true; + } } else if (Array.isArray(profile.user_ids) && profile.user_ids.length) { window._pendingProfileUserIds = profile.user_ids.map(u => u.id || u); _applyPendingProfileUsers(); diff --git a/static/js/scheduler.js b/static/js/scheduler.js index c814ed0..8b2bee6 100644 --- a/static/js/scheduler.js +++ b/static/js/scheduler.js @@ -300,6 +300,8 @@ function stLoadSmtp() { if (tls) tls.checked = d.starttls !== false; const pw = document.getElementById('st-smtpPw'); if (pw) pw.value = d.has_password ? 
'\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022' : ''; + const ae = document.getElementById('st-smtpAutoEmail'); + if (ae) ae.checked = !!d.auto_email_manual; }).catch(function(){}); } @@ -313,7 +315,8 @@ async function stSmtpSave() { user: document.getElementById('st-smtpUser').value.trim(), from_addr: document.getElementById('st-smtpFrom').value.trim(), recipients: document.getElementById('st-smtpTo').value.split(/[,;]/).map(function(s){return s.trim();}).filter(Boolean), - starttls: document.getElementById('st-smtpTls').checked, + starttls: document.getElementById('st-smtpTls').checked, + auto_email_manual: !!(document.getElementById('st-smtpAutoEmail') || {}).checked, }; if (pw !== null) body.password = pw; st.style.color = 'var(--muted)'; st.textContent = t('m365_smtp_saving','Saving...'); diff --git a/static/js/users.js b/static/js/users.js index 82258f6..c124a9c 100644 --- a/static/js/users.js +++ b/static/js/users.js @@ -28,6 +28,11 @@ async function loadUsers() { u.selected = prevSelected.has(u.id) ? prevSelected.get(u.id) : false; }); S._allUsers = [...fetched, ...toAdd]; + // Apply deferred "select all" from a profile chosen before users loaded + if (window._pendingProfileAllUsers) { + S._allUsers.forEach(u => { u.selected = true; }); + window._pendingProfileAllUsers = false; + } renderAccountList(fetched.length <= 1); // Merge Google users separately so they're not blocked by M365 auth timing _mergeGoogleUsers(); diff --git a/templates/index.html b/templates/index.html index de96ae2..59c4387 100644 --- a/templates/index.html +++ b/templates/index.html @@ -777,6 +777,10 @@ document.addEventListener('DOMContentLoaded', applyI18n); +
+ + +
diff --git a/tests/test_route_integration.py b/tests/test_route_integration.py new file mode 100644 index 0000000..360d4d5 --- /dev/null +++ b/tests/test_route_integration.py @@ -0,0 +1,524 @@ +""" +Route integration tests — security-sensitive paths and data-correctness contracts. + +Covers: + - Viewer token CRUD and scope validation + - GET /api/db/flagged role and user scope enforcement + - POST /api/db/disposition/bulk — only updates selected items + - Viewer PIN set / verify / rate-limit / clear + - Interface PIN set / gate / clear + - Scan lock always released (even when run_scan raises) + - GET /api/db/sessions basic shape +""" +from __future__ import annotations +import time +from unittest.mock import MagicMock + +import pytest + + +# --------------------------------------------------------------------------- +# Module-level app fixture (shared with test_routes.py via flask_app) +# --------------------------------------------------------------------------- + +@pytest.fixture(scope="module") +def flask_app(): + import gdpr_scanner + gdpr_scanner.app.config["TESTING"] = True + gdpr_scanner.app.config["WTF_CSRF_ENABLED"] = False + return gdpr_scanner.app + + +@pytest.fixture() +def client(flask_app): + with flask_app.test_client() as c: + yield c + + +@pytest.fixture() +def db_patch(tmp_path, monkeypatch): + from gdpr_db import ScanDB + import routes.database, routes.export + db = ScanDB(str(tmp_path / "test.db")) + monkeypatch.setattr(routes.database, "_get_db", lambda: db) + monkeypatch.setattr(routes.database, "DB_OK", True) + monkeypatch.setattr(routes.export, "_get_db", lambda: db) + monkeypatch.setattr(routes.export, "DB_OK", True) + return db + + +@pytest.fixture() +def mock_connector(monkeypatch): + from routes import state + conn = MagicMock() + monkeypatch.setattr(state, "connector", conn) + return conn + + +@pytest.fixture(autouse=True) +def clean_state(): + from routes import state + yield + state.flagged_items.clear() + if not 
state._scan_lock.acquire(blocking=False): + pass + else: + state._scan_lock.release() + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def _seed_scan(db, items: list[dict]) -> int: + """Create a completed scan and persist items. Returns the scan_id.""" + scan_id = db.begin_scan({"sources": ["email"], "user_ids": [], "options": {}}) + for item in items: + db.save_item(scan_id, item) + db.finish_scan(scan_id, total_scanned=len(items)) + return scan_id + + +def _item(item_id: str, role: str = "staff", account_id: str = "") -> dict: + return { + "id": item_id, + "name": f"{item_id}.docx", + "source": "email", + "source_type": "email", + "account_id": account_id or f"{item_id}@school.dk", + "user_role": role, + "cpr_count": 1, + "face_count": 0, + "size_kb": 10, + "modified": "2025-01-01T00:00:00", + } + + +def _clear_viewer_pins(): + """Remove both viewer and interface PINs between tests.""" + from app_config import clear_viewer_pin, clear_interface_pin + clear_viewer_pin() + clear_interface_pin() + + +# --------------------------------------------------------------------------- +# Viewer token CRUD +# --------------------------------------------------------------------------- + +class TestViewerTokenCRUD: + def test_create_and_list(self, client): + r = client.post("/api/viewer/tokens", + json={"label": "Test token", "expires_days": 7}) + assert r.status_code == 201 + data = r.get_json() + assert "token" in data + tok = data["token"] + + r2 = client.get("/api/viewer/tokens") + assert r2.status_code == 200 + tokens = r2.get_json() + assert any(t["token"] == tok for t in tokens) + + def test_delete_existing_token(self, client): + r = client.post("/api/viewer/tokens", json={"label": "to-delete"}) + tok = r.get_json()["token"] + + r2 = client.delete(f"/api/viewer/tokens/{tok}") + assert r2.status_code == 200 + assert r2.get_json()["ok"] is True + + 
r3 = client.get("/api/viewer/tokens") + tokens = r3.get_json() + assert not any(t["token"] == tok for t in tokens) + + def test_delete_nonexistent_token_returns_404(self, client): + r = client.delete("/api/viewer/tokens/doesnotexist123") + assert r.status_code == 404 + + def test_validate_valid_token(self, client): + tok = client.post("/api/viewer/tokens", json={}).get_json()["token"] + r = client.post("/api/viewer/tokens/validate", json={"token": tok}) + assert r.status_code == 200 + assert r.get_json()["valid"] is True + + def test_validate_invalid_token(self, client): + r = client.post("/api/viewer/tokens/validate", + json={"token": "notarealtoken00000000"}) + assert r.status_code == 401 + assert r.get_json()["valid"] is False + + +class TestViewerTokenScopeValidation: + def test_role_and_user_mutually_exclusive(self, client): + r = client.post("/api/viewer/tokens", json={ + "scope": {"role": "student", "user": "alice@school.dk"} + }) + assert r.status_code == 400 + assert "mutually exclusive" in r.get_json()["error"] + + def test_invalid_role_value(self, client): + r = client.post("/api/viewer/tokens", json={ + "scope": {"role": "teacher"} + }) + assert r.status_code == 400 + assert "role" in r.get_json()["error"] + + def test_user_email_must_contain_at(self, client): + r = client.post("/api/viewer/tokens", json={ + "scope": {"user": "notanemail"} + }) + assert r.status_code == 400 + assert "email" in r.get_json()["error"].lower() + + def test_valid_role_scope_stored(self, client): + r = client.post("/api/viewer/tokens", + json={"scope": {"role": "student"}}) + assert r.status_code == 201 + assert r.get_json()["scope"] == {"role": "student"} + + def test_valid_user_scope_stored(self, client): + r = client.post("/api/viewer/tokens", json={ + "scope": { + "user": ["alice@m365.dk", "alice@gws.dk"], + "display_name": "Alice Smith", + } + }) + assert r.status_code == 201 + scope = r.get_json()["scope"] + assert scope["user"] == ["alice@m365.dk", "alice@gws.dk"] + 
assert scope["display_name"] == "Alice Smith" + + +# --------------------------------------------------------------------------- +# GET /api/db/flagged — scope enforcement +# --------------------------------------------------------------------------- + +class TestFlaggedScopeEnforcement: + def test_no_scope_returns_all_items(self, client, db_patch): + _seed_scan(db_patch, [ + _item("s1", role="student"), + _item("s2", role="staff"), + ]) + r = client.get("/api/db/flagged") + assert r.status_code == 200 + ids = {row["id"] for row in r.get_json()} + assert "s1" in ids + assert "s2" in ids + + def test_role_scope_student_excludes_staff(self, client, db_patch): + _seed_scan(db_patch, [ + _item("r1", role="student"), + _item("r2", role="staff"), + ]) + with client.session_transaction() as sess: + sess["viewer_ok"] = True + sess["viewer_scope"] = {"role": "student"} + r = client.get("/api/db/flagged") + ids = {row["id"] for row in r.get_json()} + assert "r1" in ids + assert "r2" not in ids + + def test_role_scope_staff_excludes_students(self, client, db_patch): + _seed_scan(db_patch, [ + _item("t1", role="student"), + _item("t2", role="staff"), + ]) + with client.session_transaction() as sess: + sess["viewer_ok"] = True + sess["viewer_scope"] = {"role": "staff"} + r = client.get("/api/db/flagged") + ids = {row["id"] for row in r.get_json()} + assert "t2" in ids + assert "t1" not in ids + + def test_user_scope_returns_only_matching_account_id(self, client, db_patch): + _seed_scan(db_patch, [ + _item("u1", account_id="alice@m365.dk"), + _item("u2", account_id="bob@m365.dk"), + ]) + with client.session_transaction() as sess: + sess["viewer_ok"] = True + sess["viewer_scope"] = {"user": ["alice@m365.dk"]} + r = client.get("/api/db/flagged") + ids = {row["id"] for row in r.get_json()} + assert "u1" in ids + assert "u2" not in ids + + def test_user_scope_matches_both_platform_emails(self, client, db_patch): + # Same person — M365 UPN and GWS email both in scope + 
_seed_scan(db_patch, [ + _item("p1", account_id="alice@m365.dk"), + _item("p2", account_id="alice@gws.dk"), + _item("p3", account_id="bob@m365.dk"), + ]) + with client.session_transaction() as sess: + sess["viewer_ok"] = True + sess["viewer_scope"] = {"user": ["alice@m365.dk", "alice@gws.dk"]} + r = client.get("/api/db/flagged") + ids = {row["id"] for row in r.get_json()} + assert "p1" in ids + assert "p2" in ids + assert "p3" not in ids + + def test_user_scope_case_insensitive(self, client, db_patch): + _seed_scan(db_patch, [_item("ci1", account_id="Alice@M365.dk")]) + with client.session_transaction() as sess: + sess["viewer_ok"] = True + sess["viewer_scope"] = {"user": ["alice@m365.dk"]} + r = client.get("/api/db/flagged") + ids = {row["id"] for row in r.get_json()} + assert "ci1" in ids + + def test_ref_param_loads_historical_session(self, client, db_patch): + # Push first scan >300 s into the past so it occupies its own session window. + old_id = _seed_scan(db_patch, [_item("h1")]) + db_patch._connect().execute( + "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,) + ) + db_patch._connect().commit() + _seed_scan(db_patch, [_item("h2")]) + + r = client.get(f"/api/db/flagged?ref={old_id}") + ids = {row["id"] for row in r.get_json()} + assert "h1" in ids + # h2 belongs to a different (newer) session window — must not appear + assert "h2" not in ids + + +# --------------------------------------------------------------------------- +# POST /api/db/disposition/bulk +# --------------------------------------------------------------------------- + +class TestBulkDisposition: + def test_updates_selected_items(self, client, db_patch): + _seed_scan(db_patch, [_item("b1"), _item("b2"), _item("b3")]) + r = client.post("/api/db/disposition/bulk", json={ + "item_ids": ["b1", "b2"], + "status": "retain-legal", + }) + assert r.status_code == 200 + assert r.get_json()["saved"] == 2 + + assert db_patch.get_disposition("b1")["status"] == "retain-legal" + 
assert db_patch.get_disposition("b2")["status"] == "retain-legal" + + def test_unselected_item_unchanged(self, client, db_patch): + _seed_scan(db_patch, [_item("c1"), _item("c2")]) + client.post("/api/db/disposition/bulk", json={ + "item_ids": ["c1"], + "status": "delete-scheduled", + }) + d = db_patch.get_disposition("c2") + # c2 was not in the bulk request — must remain unreviewed + assert d is None or d.get("status", "unreviewed") == "unreviewed" + + def test_missing_item_ids_returns_400(self, client, db_patch): + r = client.post("/api/db/disposition/bulk", + json={"status": "retain-legal"}) + assert r.status_code == 400 + + def test_missing_status_returns_400(self, client, db_patch): + r = client.post("/api/db/disposition/bulk", + json={"item_ids": ["x"]}) + assert r.status_code == 400 + + def test_without_db_returns_503(self, client, monkeypatch): + import routes.database + monkeypatch.setattr(routes.database, "DB_OK", False) + r = client.post("/api/db/disposition/bulk", + json={"item_ids": ["x"], "status": "retain-legal"}) + assert r.status_code == 503 + + +# --------------------------------------------------------------------------- +# Viewer PIN +# --------------------------------------------------------------------------- + +class TestViewerPin: + def setup_method(self): + _clear_viewer_pins() + + def teardown_method(self): + _clear_viewer_pins() + + def test_status_no_pin(self, client): + r = client.get("/api/viewer/pin") + assert r.status_code == 200 + assert r.get_json()["pin_set"] is False + + def test_set_and_status_reflects_set(self, client): + client.post("/api/viewer/pin", json={"pin": "1234"}) + r = client.get("/api/viewer/pin") + assert r.get_json()["pin_set"] is True + + def test_set_too_short_rejected(self, client): + r = client.post("/api/viewer/pin", json={"pin": "123"}) + assert r.status_code == 400 + + def test_set_too_long_rejected(self, client): + r = client.post("/api/viewer/pin", json={"pin": "123456789"}) + assert r.status_code == 400 
+
+    def test_set_non_digits_rejected(self, client):
+        r = client.post("/api/viewer/pin", json={"pin": "abcd"})
+        assert r.status_code == 400
+
+    def test_verify_correct_pin_sets_session(self, client):
+        client.post("/api/viewer/pin", json={"pin": "4321"})
+        r = client.post("/api/viewer/pin/verify", json={"pin": "4321"})
+        assert r.status_code == 200
+        assert r.get_json()["ok"] is True
+
+    def test_verify_wrong_pin_returns_401(self, client):
+        client.post("/api/viewer/pin", json={"pin": "4321"})
+        r = client.post("/api/viewer/pin/verify", json={"pin": "9999"})
+        assert r.status_code == 401
+
+    def test_verify_rate_limit_after_5_failures(self, client):
+        client.post("/api/viewer/pin", json={"pin": "5678"})
+        from routes.viewer import _pin_attempts
+        _pin_attempts.clear()
+        for _ in range(5):
+            client.post("/api/viewer/pin/verify", json={"pin": "0000"})
+        r = client.post("/api/viewer/pin/verify", json={"pin": "0000"})
+        assert r.status_code == 429
+        _pin_attempts.clear()
+
+    def test_change_pin_requires_current(self, client):
+        client.post("/api/viewer/pin", json={"pin": "1111"})
+        r = client.post("/api/viewer/pin",
+                        json={"pin": "2222", "current_pin": "9999"})
+        assert r.status_code == 403
+
+    def test_change_pin_with_correct_current(self, client):
+        client.post("/api/viewer/pin", json={"pin": "1111"})
+        r = client.post("/api/viewer/pin",
+                        json={"pin": "2222", "current_pin": "1111"})
+        assert r.status_code == 200
+        # Old PIN no longer valid
+        r2 = client.post("/api/viewer/pin/verify", json={"pin": "1111"})
+        assert r2.status_code == 401
+
+    def test_clear_pin_requires_current(self, client):
+        client.post("/api/viewer/pin", json={"pin": "3333"})
+        r = client.delete("/api/viewer/pin", json={"current_pin": "0000"})
+        assert r.status_code == 403
+
+    def test_clear_pin_with_correct_current(self, client):
+        client.post("/api/viewer/pin", json={"pin": "3333"})
+        r = client.delete("/api/viewer/pin", json={"current_pin": "3333"})
+        assert r.status_code == 200
+        assert client.get("/api/viewer/pin").get_json()["pin_set"] is False
+
+
+# ---------------------------------------------------------------------------
+# Interface PIN
+# ---------------------------------------------------------------------------
+
+class TestInterfacePin:
+    def setup_method(self):
+        _clear_viewer_pins()
+
+    def teardown_method(self):
+        _clear_viewer_pins()
+
+    def test_status_no_pin(self, client):
+        r = client.get("/api/interface/pin")
+        assert r.get_json()["pin_set"] is False
+
+    def test_set_and_verify(self, client):
+        r = client.post("/api/interface/pin", json={"pin": "7777"})
+        assert r.status_code == 200
+        # Gate is now active — authenticate before the status check
+        with client.session_transaction() as sess:
+            sess["interface_ok"] = True
+        assert client.get("/api/interface/pin").get_json()["pin_set"] is True
+
+    def test_non_digit_rejected(self, client):
+        r = client.post("/api/interface/pin", json={"pin": "abcd"})
+        assert r.status_code == 400
+
+    def test_set_requires_current_when_set(self, client):
+        client.post("/api/interface/pin", json={"pin": "7777"})
+        with client.session_transaction() as sess:
+            sess["interface_ok"] = True
+        r = client.post("/api/interface/pin",
+                        json={"pin": "8888", "current_pin": "0000"})
+        assert r.status_code == 403
+
+    def test_clear_requires_current(self, client):
+        client.post("/api/interface/pin", json={"pin": "7777"})
+        with client.session_transaction() as sess:
+            sess["interface_ok"] = True
+        r = client.delete("/api/interface/pin", json={"current_pin": "0000"})
+        assert r.status_code == 403
+
+    def test_clear_with_correct_current(self, client):
+        client.post("/api/interface/pin", json={"pin": "7777"})
+        with client.session_transaction() as sess:
+            sess["interface_ok"] = True
+        r = client.delete("/api/interface/pin", json={"current_pin": "7777"})
+        assert r.status_code == 200
+        assert client.get("/api/interface/pin").get_json()["pin_set"] is False
+
+
+# ---------------------------------------------------------------------------
+# Scan lock released on run_scan() exception
+# ---------------------------------------------------------------------------
+
+class TestScanLockReleasedOnError:
+    def test_lock_released_when_run_scan_raises(self, client, mock_connector,
+                                                monkeypatch):
+        import scan_engine
+        from routes import state
+
+        def _boom(opts):
+            raise RuntimeError("simulated scan failure")
+
+        monkeypatch.setattr(scan_engine, "run_scan", _boom)
+        r = client.post("/api/scan/start", json={"sources": ["email"]})
+        assert r.status_code == 200
+
+        # Wait for the background thread to finish and release the lock
+        deadline = time.time() + 2.0
+        while True:
+            acquired = state._scan_lock.acquire(blocking=False)
+            if acquired:
+                state._scan_lock.release()
+                break
+            assert time.time() < deadline, "scan lock was never released after exception"
+            time.sleep(0.05)
+
+
+# ---------------------------------------------------------------------------
+# GET /api/db/sessions
+# ---------------------------------------------------------------------------
+
+class TestDbSessions:
+    def test_returns_list(self, client, db_patch):
+        r = client.get("/api/db/sessions")
+        assert r.status_code == 200
+        assert isinstance(r.get_json(), list)
+
+    def test_completed_scan_appears_in_sessions(self, client, db_patch):
+        _seed_scan(db_patch, [_item("sess1")])
+        r = client.get("/api/db/sessions")
+        sessions = r.get_json()
+        assert len(sessions) >= 1
+        s = sessions[0]
+        assert "ref_scan_id" in s
+        assert "flagged_count" in s
+        assert s["flagged_count"] == 1
+
+    def test_sessions_ordered_newest_first(self, client, db_patch):
+        # Create two scans >300 s apart so each forms its own session window.
+
+        old_id = _seed_scan(db_patch, [_item("old1")])
+        db_patch._connect().execute(
+            "UPDATE scans SET started_at = started_at - 400 WHERE id = ?", (old_id,)
+        )
+        db_patch._connect().commit()
+        _seed_scan(db_patch, [_item("new1")])
+        sessions = client.get("/api/db/sessions").get_json()
+        assert len(sessions) == 2
+        # Newest session (highest ref_scan_id) must be first
+        assert sessions[0]["ref_scan_id"] > sessions[1]["ref_scan_id"]