# Changelog All notable changes to GDPR Scanner are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html). --- ## [Unreleased] ### Added - **GitHub Actions CI/CD — macOS build** — `.github/workflows/build.yml` now also builds a macOS `.app` bundle (`macos-13`, Intel x86_64 / Rosetta) on every push to `main` and on `v*` tags. Released as `GDPRScanner_macos_x86_64.zip`. ### Fixed - **CI — Windows artifact never uploaded** — PyInstaller `--onedir` puts the exe inside `dist/GDPRScanner/`, not at `dist/*.exe`. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips `dist\GDPRScanner\` into `GDPRScanner_windows_x64.zip` (mirroring the existing Linux step). - **`EFFORT_ESTIMATE.md`** — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project. - **Settings → Security tab** — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's **Configure** button navigates directly to the Security tab. - **Viewer mode layout** — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The `🔍 GDPRScanner` brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title. ### Fixed - **Share modal — Revoke / Copy buttons broken** — `JSON.stringify(token)` produced a double-quoted string that terminated the surrounding `onclick="…"` HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format. - **Viewer PIN — Clear PIN rejected with "current PIN is incorrect"** — clicking **Clear PIN** without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty. - **Share modal — all UI strings now translated** — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (`share_*`, `viewer_pin_*`) in `en.json`, `da.json`, and `de.json`. --- ## [1.6.14] — 2026-04-10 ### Added — read-only viewer mode (#33) A DPO, school principal, or compliance coordinator can now review scan results and tag dispositions without access to scan controls, credentials, or settings. **Token links** - New `🔗` **Share** button in the topbar opens a token management modal. - **Create** generates a 64-char hex token (`secrets.token_hex(32)`) with an optional label and expiry (7 d / 30 d / 90 d / 1 yr / never). - **Copy** copies the full `http://host:5100/view?token=…` URL to the clipboard. - **Revoke** deletes the token immediately; any browser using it is locked out on next navigation. - Tokens are stored in `~/.gdprscanner/viewer_tokens.json` with `created_at`, `expires_at`, and `last_used_at` metadata. Expired tokens are cleaned up on each list fetch. **PIN alternative** - A 4–8 digit numeric PIN can be set in **Settings → General → Viewer PIN**. - Opening `/view` without a token shows a PIN entry form (`templates/viewer_pin.html`). - Correct PIN sets a Flask session cookie (`session["viewer_ok"]`) valid for the browser session — no token needed after that. - Brute-force guard: 5 failed attempts per 5 minutes per IP returns 429. - PIN stored as salted SHA-256 inside `viewer_tokens.json` (no extra dependencies). **`/view` route** - Checks `?token=` first (validates + binds session), then existing session cookie, then PIN form (if a PIN is configured), then 403. - Serves the same `index.html` with `window.VIEWER_MODE = true` injected. - Invalid/expired tokens show `templates/viewer_denied.html`. **Viewer mode (JS)** - `auth.js` — bypasses M365 auth check entirely; adds `viewer-mode` class to ``; shows scanner screen immediately. - `results.js` — on `DOMContentLoaded` calls `_loadViewerResults()` which fetches `GET /api/db/flagged` (all items from the last completed scan session, joined with dispositions) and renders the grid directly — no SSE required. - CSS (`body.viewer-mode`) hides: Sources/Options/Accounts sidebar panels; Scan/Stop buttons; profile bar; config-group buttons; resume banner; bulk-delete button; per-card delete button; data-subject delete button; Share button. - Disposition tagging (select + Save) remains fully functional — `/api/db/disposition` has no auth guard. - Filter bar, Excel export, Art.30 export, preview panel, and log remain accessible. **New files:** `routes/viewer.py`, `static/js/viewer.js`, `templates/viewer_pin.html`, `templates/viewer_denied.html` **Files changed:** `app_config.py`, `gdpr_scanner.py`, `templates/index.html`, `static/style.css`, `static/js/auth.js`, `static/js/results.js`, `static/js/scheduler.js`, `routes/database.py` --- ### Fixed — memory exhaustion during large M365 scans Addressed root causes of runaway memory growth (reported: up to 90 GB RSS) that could crash the host machine during scans of large Microsoft 365 tenants. **`scan_engine.py`** - **Email body HTML stripped at collection time** — Graph API returns the full `body` field (raw HTML, up to ~1 MB per message) for every email fetched. Previously, all message dicts — including the raw HTML — were accumulated in `work_items` before any scanning began. For 1 000 users × 2 000 emails this could mean >100 GB in `work_items` alone. The body is now converted to plain text immediately on collection (`_precomputed_body`), and the raw `body` and `bodyPreview` keys are deleted from the dict before it is queued. The processing loop reads `_precomputed_body` via `pop()` and `del`s it after use. - **`work_items` converted to `deque` before processing** — items are now released from memory one by one via `popleft()` as they are processed, rather than keeping the entire list alive for the duration of the scan. `gc.collect()` is called immediately after conversion and after each checkpoint save. - **`content` bytes freed as early as possible in the file processing branch** — raw download bytes are now `del`'d immediately after `content.decode()` (before the expensive NER/PII pass), and also in the no-hits `else` branch where they were previously kept alive until the next loop iteration. - **`body_text` freed after use in the email branch** — `del body_text` added after `_broadcast_card` so large plain-text bodies do not linger until the next iteration. - **Memory guard before file downloads** — uses `psutil.virtual_memory().available` to skip a file download and log a warning if fewer than 300 MB of RAM are available, preventing a single large file from pushing an already-pressured machine into OOM. **`document_scanner.py`** - **PDF OCR page images freed page by page** — `convert_from_path()` renders all pages at 300 DPI before scanning begins (~26 MB per A4 page; a 100-page PDF ≈ 2.6 GB). Each rendered `PIL.Image` is now nulled out (`images[page_num-1] = None`) immediately after OCR, so only one page image is live at a time instead of the entire document. ### Changed — Sources panel is now resizable and collapsible The **KILDER** sidebar panel now behaves consistently with the other sidebar sections. - **Collapsible** — the `▾` / `▸` toggle was already wired up; collapse state is already persisted in `localStorage`. No change needed here. - **Resizable** — a drag handle (`sources-resize-handle`) added at the bottom of the panel body. Dragging up shrinks the panel (scroll appears); dragging down is capped at the panel's natural content height — you cannot expand it beyond what is needed to show all sources. Height preference persisted in `localStorage` under `gdpr_sources_h`. - **Auto-fit on render** — `_fitSourcesPanel()` is called at the end of every `renderSourcesPanel()` invocation. On first load and whenever sources are added or removed (e.g. connecting Google), the panel height snaps to exactly fit all visible sources. A previously saved smaller height is honoured only if it is still smaller than the new content height; dragging back to full height clears the saved preference. - The old `max-height: calc(5 * 26px)` fixed cap is removed. **Files changed:** `templates/index.html`, `static/style.css`, `static/js/log.js` (`_fitSourcesPanel`, `_initSourcesResize`), `static/js/sources.js`, `static/js/results.js`. --- ## [1.6.13] — 2026-04-10 ### Added — developer tooling - **`run_tests.sh`** — shell script to activate the venv and run the full test suite. Accepts any `pytest` arguments: `./run_tests.sh`, `./run_tests.sh -q`, `./run_tests.sh tests/test_app_config.py`. - **Directory-scoped `CLAUDE.md` rules** — `routes/CLAUDE.md`, `static/js/CLAUDE.md`, `templates/CLAUDE.md`, `lang/CLAUDE.md` replace the previous single-file context document. Each file is loaded automatically by Claude Code only when working in the relevant directory. ### Fixed — documentation - **`README.md` project files table** — removed four phantom entries (`Dockerfile`, `docker-compose.yml`, `.dockerignore`, `scanner_audit.jsonl`); corrected `static/app.js` description to "archived monolith — no longer loaded"; fixed manual paths (`MANUAL-EN.md` → `docs/manuals/MANUAL-EN.md`); added missing files: `scan_engine.py`, `sse.py`, `checkpoint.py`, `app_config.py`, `cpr_detector.py`, `google_connector.py`, `static/style.css`, `static/js/*.js`, `routes/google_auth.py`, `routes/google_scan.py`, `run_tests.sh`, `docs/setup/` guides. - **`docs/manuals/MANUAL-EN.md`**, **`docs/manuals/MANUAL-DA.md`** — version header updated from 1.6.11 → 1.6.13; footer updated from v1.6.8 → v1.6.13. ### Changed — blueprint migration batch 3, 4, 5 (auth, database, export — migration complete) All remaining direct `@app.route` registrations removed from `gdpr_scanner.py`. Flask now routes every API endpoint exclusively through its blueprint. Only `GET /` and `GET /api/scan/stream` (SSE) remain in `gdpr_scanner.py`. **`routes/auth.py`** — rewritten with direct imports (batch 3, 6 routes): - `MSAL_OK`, `M365Connector`, `M365Error` imported from `m365_connector` - `_load_config`, `_save_config` imported from `app_config` - Dead module-level globals `_pending_flow` and `_auth_poll_result` removed from `gdpr_scanner.py` - Routes removed: `/api/auth/status`, `/api/auth/start`, `/api/auth/poll`, `/api/auth/userinfo`, `/api/auth/signout`, `/api/auth/config` **`routes/database.py`** — rewritten with direct imports (batch 4, 15 routes): - `_get_db`, `DB_OK` from `gdpr_db`; `_set_admin_pin`, `_verify_admin_pin`, `_admin_pin_is_set` from `app_config`; `_clear_checkpoint`, `_DELTA_PATH` from `checkpoint`; `_extract_exif`, `_html_esc`, `_placeholder_svg` from `cpr_detector` - `SCANNER_OK` determined by local `import document_scanner` try/except - `db_export` improved: uses `NamedTemporaryFile` instead of `mktemp` (safer for frozen apps) - Email preview HTML: full CSS ruleset (`*, *::before, *::after`, `img`, `table`, scrollbar) from gdpr_scanner.py version restored - Routes removed: `/api/db/stats`, `/api/db/trend`, `/api/db/scans`, `/api/db/subject`, `/api/db/overdue`, `/api/db/disposition` (×2), `/api/db/deletion_log`, `/api/db/reset`, `/api/admin/pin` (×2), `/api/db/export`, `/api/db/import`, `/api/preview/`, `/api/thumb` **`routes/export.py`** — rewritten with direct imports (batch 5, 3 routes): - `_get_db`, `DB_OK` from `gdpr_db`; `_GUID_RE`, `_resolve_display_name` from `app_config`; `M365PermissionError` from `m365_connector` - `app.logger` replaced with `logging.getLogger(__name__)` - Dead `delete_item()` helper removed from `gdpr_scanner.py` (was unreachable; blueprint has its own copy) - Routes removed: `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk` **`tests/test_routes.py`** — `db_patch` fixture updated: now patches `routes.database._get_db` / `routes.database.DB_OK` and `routes.export._get_db` / `routes.export.DB_OK` (was patching `gdpr_scanner._get_db`/`gdpr_scanner.DB_OK` which no longer have any effect). Two `test_without_db_returns_503` tests updated to monkeypatch `routes.database.DB_OK` instead of `gdpr_scanner.DB_OK`. --- ## [1.6.12] — 2026-04-10 ### Fixed — profile editor save drops users from non-active role groups In `_pmgmtSaveFullEdit` (profile management editor), the save function applied the active role filter (`_pmgmtRoleActive`) to the list of checked checkboxes before saving. Since `_pmgmtFilterAccounts` hides rows via `display:none` but does not uncheck them, users from other role groups that remained checked (but hidden) were silently discarded on save. The role filter at save time is removed — all checked checkboxes are now captured regardless of which role tab is visible. --- ## [1.6.11] — 2026-04-10 ### Changed — blueprint migration batch 1 (scan + app_routes) 15 direct `@app.route` registrations removed from `gdpr_scanner.py`. Flask now routes all of these exclusively through their blueprint counterparts, which previously existed as dead code shadowed by the direct routes. **`routes/scan.py`** — rewritten with direct imports (was entirely non-functional as dead code due to bare-name `NameError`s behind the shadow): - Added `GET /api/scan/status` (new — was only in gdpr_scanner.py) - Added `GET /api/src_toggles`, `POST /api/src_toggles` (new — was only in gdpr_scanner.py) - `scan_checkpoint_info` — added missing `check_only` handling present in the gdpr_scanner.py version - All state references converted from bare names to `state._scan_lock` / `state._scan_abort`; `run_scan` imported lazily from `scan_engine` inside `_run` to avoid circular imports - `_save_settings`, `_load_settings`, `_load_src_toggles`, `_save_src_toggles` imported from `app_config` - `_checkpoint_key`, `_load_checkpoint`, `_clear_checkpoint`, `_load_delta_tokens`, `_DELTA_PATH` imported from `checkpoint` **`routes/app_routes.py`** — cleaned up: - `APP_VERSION` now computed locally from `VERSION` file (was a bare-name reference to gdpr_scanner.py global) - `_LANG_DIR` computed at module level; fixed `sys` / `_sys` alias mismatch in `get_langs` (bug in blueprint that never manifested while shadowed) - `_set_lang_override`, `_load_lang_forced` imported directly from `app_config` - `get_langs` — added missing `langs.sort()` present in the gdpr_scanner.py version **`tests/test_routes.py`** — `mock_connector` fixture simplified: no longer needs to patch `gdpr_scanner._connector` since the direct `scan/start` route is gone; `state.connector` alone is sufficient. `run_scan` stub in `test_authenticated_returns_started` updated to target `scan_engine` directly. **Routes removed from `gdpr_scanner.py`:** `/api/about`, `/api/langs`, `/api/set_lang`, `/api/lang`, `/api/scan/status`, `/api/scan/start`, `/api/scan/stop`, `/api/scan/checkpoint`, `/api/scan/clear_checkpoint`, `/api/settings/save`, `/api/settings/load`, `/api/src_toggles`, `/api/delta/status`, `/api/delta/clear` **Still in `gdpr_scanner.py`:** `GET /` (root), `GET /api/scan/stream` (SSE — cannot be in a blueprint), and the `auth`, `users`, `sources`, `database`, `export` route groups (31 routes — next batches). --- ## [1.6.10] — 2026-04-10 ### Fixed — Google Drive `exportSizeLimitExceeded` warning Native Google Workspace files too large for Drive's export API (Google's server-side limit, distinct from the 20 MB local cap) now produce a clean skip message instead of a stray `WARNING googleapiclient.http — Encountered 403 Forbidden with reason "exportSizeLimitExceeded"` in the log. A `logging.Filter` subclass is installed on the `googleapiclient.http` logger at import time to suppress the duplicate external warning; the `except HttpError` block in `_drive_iter` detects the reason and logs `[gdrive] skip '' — file too large for Google export API (exportSizeLimitExceeded)` with the file ID. ### Fixed — peak memory during large file/SMB scans (OOM risk reduction) Three targeted buffer-lifetime fixes reduce peak RSS during large scans: - **`cpr_detector.py`** — `del content` after writing the PDF bytes to a temp file in `_scan_bytes_timeout`. The 20 MB buffer was previously held in the main process for the entire duration of `p.join(timeout)` (up to 60 s), overlapping with the spawned subprocess's ~150–300 MB heap. It is now freed before the subprocess starts. - **`scan_engine.py`** — `del content` after the thumbnail block in `run_file_scan`. The raw file buffer was kept alive through card dict construction and the start of the next loop iteration; it is now freed as soon as the thumbnail (or placeholder SVG) has been generated. - **`file_scanner.py`** — `PREFETCH_WINDOW` reduced from 2 to 1. Halves the maximum number of concurrently-held SMB read buffers (from 2 × 20 MB to 1 × 20 MB). --- ## [1.6.9] — 2026-04-10 ### Changed — frontend migrated to ES modules **Phase 2 complete:** All 10 split JS files converted from `