# Changelog All notable changes to GDPR Scanner are documented here. Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html). --- ## [Unreleased] ### Fixed - **Blank results grid after a browser refresh (especially after a server restart)** — restoring the last scan session on page load was one-shot: `_sseWatchdog()` called `loadHistorySession(null)` a single time, guarded by `_initialStatusChecked`. If that attempt was blocked — a completed scan's replayed `scan_phase` event leaves a `_*ScanRunning` flag set, and the `loadHistorySession` guard then bails — nothing retried, because `sse_replay_done` (the other retry path) only fires when the SSE replay buffer is non-empty, and the buffer is empty after a server restart (so refreshing after the in-app self-update reliably showed an empty grid even though the results were in the database). The watchdog now re-attempts the restore on every 4-second poll while nothing is shown and no scan is running, clearing stale running flags first (both scan locks are confirmed free at that point). Additionally, `/api/scan/status` now reports `google_running` separately from `running` (which only ever reflected the M365 + file lock), so a refresh during a live Google scan is detected instead of being treated as idle. --- ## [1.7.7] — 2026-06-15 ### Changed - **Share modal no longer leaves a stale link in the create box** — after clicking "Create", the generated-link preview row ("Copy link:") stayed visible at the top of the modal even though the new link was already listed under Active links with its own Copy button — so it looked like the form hadn't cleared. The redundant preview row is removed; creating a link now resets the form and briefly highlights the new entry in the Active links list, where it can be copied. (The 1.7.4 fix cleared the input fields but not this preview row.) ### Added - **Reverse-proxy / HTTPS setup guide** — new `docs/setup/ZORAXY_SETUP.md` walks through putting the scanner behind Zoraxy with a Let's Encrypt certificate on a LAN-only deployment: DNS A-record to a private IP, ACME via DNS-01 challenge (HTTP-01 cannot reach a LAN-only host), proxy rule to `127.0.0.1:5100`, binding the app to loopback with `--host 127.0.0.1`, and scanner-specific verification (SSE streaming, HTTPS share links, self-update). Linked from the README (new "HTTPS / reverse proxy" section) and SECURITY.md. ### Fixed - **SECURITY.md corrections** — the web UI binds to `0.0.0.0` by default, not `127.0.0.1` as claimed; the MSAL token cache path was still the pre-1.x `~/.gdpr_scanner_config.json` (actual: `~/.gdprscanner/token.json`). --- ## [1.7.6] — 2026-06-11 ### Fixed - **Update restart leaked the listening socket and hopped to port 5101** — Werkzeug marks its server socket inheritable (`srv.socket.set_inheritable(True)`, unconditionally, for its debug reloader), so the in-app update's `os.execv` restart carried the old listening socket into the new process as a zombie listener: same PID listening on both 5100 (never accepted — clients hang) and 5101 (the actual server). The 1.7.3 `SO_REUSEADDR`/grace-period fix couldn't help because the port genuinely was occupied — by the restarting process itself. `_restart_self()` now marks every fd above stderr close-on-exec before the exec (`_mark_fds_cloexec()`, enumerating `/proc/self/fd` on Linux), so the old socket dies with the exec and the new server rebinds 5100 immediately. --- ## [1.7.5] — 2026-06-11 ### Fixed - **Stale UI after updating the server** — Flask served `/static/` files with no `Cache-Control` header, so browsers cached JS/CSS heuristically (often for days). After a server update — including the new in-app self-update, whose post-install reload hit the cache — the backend was new but the frontend stayed old, and fixes appeared "not to work" until a hard refresh. `SEND_FILE_MAX_AGE_DEFAULT = 0` now makes every static file revalidate via ETag: unchanged files answer with a cheap 304, changed files are re-fetched immediately on the next normal page load. --- ## [1.7.4] — 2026-06-10 ### Fixed - **Share modal kept stale input after creating a link** — clicking "Create" only cleared the label field; scope type, user email, date range, and expiry kept their values, so the next link silently inherited the previous link's scope settings. The form-reset logic from `openShareModal()` is now a shared `_resetShareForm()` helper called after every successful create (the generated link row stays visible for copying). --- ## [1.7.3] — 2026-06-10 ### Fixed - **App restart no longer hops to a new port** — the in-app update restart (and any quick stop/start) left connections from the previous instance in TIME_WAIT, and the startup port probe did a plain `bind()` that treats TIME_WAIT as occupied — so the restarted app silently came up on 5101 and the browser's reload poll never found it. The probe now sets `SO_REUSEADDR` (matching how Werkzeug actually binds, so an actively listening port is still detected as occupied), and the requested port gets a 10-second grace period before the auto-increment fallback kicks in, covering the brief window where the old process hasn't fully released the socket. - **Share links now respect a reverse proxy** — `_getShareBaseUrl()` rewrote every copied share link to `http://:5100` (via `/api/local_ip`), which would bypass TLS when the scanner sits behind a reverse proxy (Zoraxy, Caddy, nginx, …): a DPO opening the link would silently fall back to plain HTTP. The LAN-IP rewrite now only applies in the case it was built for — browsing the app at `localhost` over HTTP, where `window.location.origin` would produce links unusable from other machines. Any HTTPS or non-localhost origin is used as-is. --- ## [1.7.2] — 2026-06-10 ### Fixed - **Copy buttons did nothing over plain HTTP** — the share modal's "Copy" buttons (new link + active links) and the log panel's copy button called `navigator.clipboard.writeText()` directly. The Clipboard API only exists in secure contexts (HTTPS or localhost), so when the scanner is reached at `http://:5100` the call threw synchronously and the intended `execCommand` fallback never ran — the button silently did nothing. `_copyText()` in `viewer.js` now feature-detects the API, falls back to `document.execCommand('copy')`, and as a last resort shows the link in a `prompt()` for manual copying; `log.js` reuses the same helper via `window._copyText`. `_getShareBaseUrl()` now caches the LAN-IP lookup so the token-list Copy buttons copy synchronously within the click gesture (required for `execCommand`). --- ## [1.7.1] — 2026-06-10 ### Added - **Software update from the GUI** — a new **Settings → General → Software update** group lets the operator check for and install updates without touching the server shell. "Check for updates" fetches origin and shows either "You are running the latest version" or the list of pending commits; "Install update" fast-forwards the git checkout to `origin/`, reinstalls dependencies only if `requirements.txt` changed, writes an `app_update` audit-log entry, and restarts the app in place by re-exec'ing the process (`os.execv` — same PID, so it works both under systemd and when launched via `start_gdpr.sh`). The page polls until the server is back and reloads itself. Local server-side edits are auto-stashed (kept, never discarded) before the merge. Updating is refused with a clear message while any scan is running. An **"Install updates automatically"** toggle (stored in `config.json` under `auto_update`) enables a background thread that checks once a day and installs unattended, skipping (and retrying hourly) while a scan runs. The group is only shown when the app runs from a git checkout — the frozen desktop build hides it. New blueprint `routes/updates.py` with `GET /api/update/check`, `POST /api/update/apply`, `GET/POST /api/update/settings`; 11 new tests in `tests/test_updates.py` with fully mocked git. - **`update_gdpr.sh`** — standalone CLI/cron equivalent of the GUI update: fetch + fast-forward-only merge with auto-stash of local hotfixes, dependency reinstall only when `requirements.txt` changed, and a `systemctl restart` if a `gdprscanner.service` unit exists (override with `GDPR_SERVICE`). `./update_gdpr.sh --check` reports pending commits without changing anything; safe to run from cron (quiet no-op when already up to date). ### Fixed - **Delta token status hid the source count** — the "Tokens saved" line under the Δ Delta scan toggle always showed the bare translation ("Tokens gemt") because the source count only existed in the JS fallback string, which is ignored whenever the lang key exists. The translations now carry a `{n}` placeholder ("Tokens gemt for {n} kilde(r)") substituted in `checkDeltaStatus()`, and the row gained a "?" hint bubble explaining what the saved change-tokens do and that "Clear tokens" forces the next scan to be a full scan. - **Stale data-file paths in docs and UI text** — README, SECURITY.md, MAINTAINER.md, the `--headless` argparse help (`--settings`, `--reset-db`, epilog), the DB-import replace warning/confirm strings (all three languages), and two code comments still referenced the pre-1.x flat dotfile layout (`~/.gdpr_scanner_delta.json`, `~/.gdpr_scanner_smtp.json`, `~/.gdpr_scanner_machine_id`, `~/.gdpr_scanner.db`). All now point to the actual locations under `~/.gdprscanner/` (`delta.json`, `smtp.json`, `machine_id`, `scanner.db`). The legacy-migration rename tables in `gdpr_scanner.py` intentionally keep the old names. --- ## [1.7.0] — 2026-06-10 ### Added - **PDF redaction for local files** — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (`page.apply_redactions()`), which removes the underlying text data from the PDF stream — not just paints over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent. - **Google Drive file redaction** — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via `files().update()`. Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the `drive` scope (not `drive.readonly`) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`. - **SFTP file redaction** — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with `sftp.open(path, "wb")`. Source config is matched from `_load_file_sources()` by host + username; credentials are resolved from the keychain via `_resolve_sftp_credentials`. Requires the item to be in the current session's `state.flagged_items` (SFTP host info is not stored in the DB). New method: `SFTPScanner.write_file(remote_path, content)`. - **SMB file redaction** — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from `full_path` (`//host/share/…`). File is downloaded and re-uploaded using smbprotocol with `CreateDisposition.FILE_SUPERSEDE` so the file is atomically replaced. New function: `file_scanner.write_smb_file(path, content, username, password, domain)`. - **AI-enhanced NER via Claude** — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in **Settings → AI / NER**: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, `document_scanner.py` calls the Claude API (`claude-haiku-4-5-20251001`) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the `anthropic` package is not installed. API key stored in `config.json` under `claude_api_key`; toggle stored under `claude_ner`. Routes: `GET/POST /api/settings/claude`, `POST /api/settings/claude/test`. ### Changed - **Redacted and deleted cards stay in the grid until the next scan** — previously redacting (✏) or deleting (🗑) a card — or running a bulk delete — removed the affected cards from the grid and from `S.flaggedData`/`S.filteredData` immediately. Now each item is kept and marked: the card is greyed (`card-resolved` styling), shows a `✏ Redacted` (green) or `🗑 Deleted` (red) badge, and its action buttons are hidden so it can't be re-processed. The operator can see what was handled during the session; the grid is rebuilt on the next scan run, which clears the markers. Implemented with `_redacted` / `_deleted` flags in `results.js` (`appendCard`, `redactItem`, `deleteItem`, `executeBulkDelete`, `deleteSubjectItems`); handled items are also excluded from the bulk-delete match set. `POST /api/delete_bulk` now returns `deleted_ids` so the grid marks exactly the items the server actually deleted (partial failures stay active). Also fixes a latent bug in the data-subject delete flow where `renderGrid()` was called with no argument and threw, falsely reporting "Delete failed" after a successful erasure. ### Fixed - **Selected card scrolled out of view when opening the preview** — opening the preview panel narrows `.grid-area`, which reflows the `auto-fill` grid to fewer columns and moves every card to a new row. The single-frame `scrollIntoView` ran while the browser's scroll-anchoring re-adjusted `scrollTop` mid-reflow, fighting the scroll so the clicked card ended up off-screen. Fixed by disabling scroll anchoring on `.grid-area` (`overflow-anchor: none`) and deferring the scroll by two animation frames so it runs against the settled layout; the card is now centred (`block: 'center'`) instead of `'nearest'` so it stays clearly visible. - **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay. - **Settings modal too narrow for seven tabs** — widened from 640 px to 720 px so all tab labels fit on one line without wrapping. - **Card action buttons invisible in grid view** — `.card` was missing `position: relative`, so the `position:absolute` delete (🗑), redact (✏), and bulk-select checkbox elements anchored to the viewport instead of the card and were then clipped away by the card's `overflow:hidden`. They only appeared in list view, where those elements are `position:static` and flow inline. Added `position: relative` to `.card` so all three position correctly within each card. Also gave `.card-redact-btn` the same `0.35` baseline opacity as the delete button (it was `opacity:0` at rest) so it's discoverable without hovering. ### Security - **Stored XSS in the results grid** — scan-derived strings (file name, account/display name, folder, source label, modified date, image `alt`) were interpolated straight into `innerHTML` and `title=` attributes across the card, list, preview, data-subject lookup, and related-documents views. Because these values come from scanned content (e.g. a OneDrive file deliberately named with markup), a crafted filename could execute script in a reviewer's session — including a shared read-only viewer/DPO session. A new `esc()` helper in `static/js/results.js` (escapes `& < > " '`) is now applied to every untrusted field before embedding. The related-documents `onclick` JSON is also escaped with `.replace(/"/g,'"')` to match the delete/redact button pattern, closing an attribute-injection hole where a filename containing `"` could break out of the handler. - **Reflected XSS in `/api/thumb`** — the `?name=` query parameter was embedded unescaped into the placeholder SVG served as `image/svg+xml`, so opening a crafted `/api/thumb?name=