# Changelog

All notable changes to GDPR Scanner are documented here.

Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
Version numbers follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [Unreleased]

### Added

- **PDF redaction for local files** — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (`page.apply_redactions()`), which removes the underlying text data from the PDF stream — not just paints over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.

- **Google Drive file redaction** — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via `files().update()`. Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the `drive` scope (not `drive.readonly`) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: `get_drive_file_mime`, `download_drive_file_by_id`, `update_drive_file` on both `GoogleWorkspaceConnector` and `PersonalGoogleConnector`.

- **SFTP file redaction** — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with `sftp.open(path, "wb")`. Source config is matched from `_load_file_sources()` by host + username; credentials are resolved from the keychain via `_resolve_sftp_credentials`. Requires the item to be in the current session's `state.flagged_items` (SFTP host info is not stored in the DB). New method: `SFTPScanner.write_file(remote_path, content)`.

- **SMB file redaction** — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from `full_path` (`//host/share/…`). File is downloaded and re-uploaded using smbprotocol with `CreateDisposition.FILE_SUPERSEDE` so the file is atomically replaced. New function: `file_scanner.write_smb_file(path, content, username, password, domain)`.

- **AI-enhanced NER via Claude** — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in **Settings → AI / NER**: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, `document_scanner.py` calls the Claude API (`claude-haiku-4-5-20251001`) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the `anthropic` package is not installed. API key stored in `config.json` under `claude_api_key`; toggle stored under `claude_ner`. Routes: `GET/POST /api/settings/claude`, `POST /api/settings/claude/test`.

### Changed

- **Redacted cards stay in the grid until the next scan** — previously redacting a card (✏) removed it from the grid and from `S.flaggedData`/`S.filteredData` immediately. Now the item is kept and marked redacted: the card is greyed (`card-resolved` styling), shows a `✏ Redacted` badge, and its delete/redact action buttons are hidden so it can't be re-processed. The operator can see what was handled during the session; the grid is rebuilt on the next scan run, which clears the redacted markers. Implemented with a `_redacted` flag in `results.js` (`appendCard` + `redactItem`); no server change.

### Fixed

- **Cards not shown after browser refresh** — when the browser reconnected to the SSE stream after a completed scan, the `scan_phase` events in the replay buffer temporarily set `S._m365ScanRunning = true` (all running flags start at `false` after a page reload). The watchdog's `loadHistorySession` call fired in this window and bailed on the stale flag; once `scan_done` cleared the flag, `_initialStatusChecked` was already `true` so `loadHistorySession` was never retried. Fixed by having the `sse_replay_done` handler retry `loadHistorySession(null)` when no scan is running and `S._historyRefScanId` is still `null` after replay.

- **Settings modal too narrow for seven tabs** — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.

- **Card action buttons invisible in grid view** — `.card` was missing `position: relative`, so the `position:absolute` delete (🗑), redact (✏), and bulk-select checkbox elements anchored to the viewport instead of the card and were then clipped away by the card's `overflow:hidden`. They only appeared in list view, where those elements are `position:static` and flow inline. Added `position: relative` to `.card` so all three position correctly within each card. Also gave `.card-redact-btn` the same `0.35` baseline opacity as the delete button (it was `opacity:0` at rest) so it's discoverable without hovering.

### Security

- **Stored XSS in the results grid** — scan-derived strings (file name, account/display name, folder, source label, modified date, image `alt`) were interpolated straight into `innerHTML` and `title=` attributes across the card, list, preview, data-subject lookup, and related-documents views. Because these values come from scanned content (e.g. a OneDrive file deliberately named with markup), a crafted filename could execute script in a reviewer's session — including a shared read-only viewer/DPO session. A new `esc()` helper in `static/js/results.js` (escapes `& < > " '`) is now applied to every untrusted field before embedding. The related-documents `onclick` JSON is also escaped with `.replace(/"/g,'&quot;')` to match the delete/redact button pattern, closing an attribute-injection hole where a filename containing `"` could break out of the handler.

- **Reflected XSS in `/api/thumb`** — the `?name=` query parameter was embedded unescaped into the placeholder SVG served as `image/svg+xml`, so opening a crafted `/api/thumb?name=<script>…` URL directly executed script in the app origin. `cpr_detector._placeholder_svg` now HTML-escapes both the type label and the filename before embedding them in the SVG.

- **Claude API key now encrypted at rest** — the Anthropic API key was stored in plaintext in `config.json` while the SMTP password was already Fernet-encrypted. `save_claude_config()` now encrypts the key with the same machine-keyed Fernet (`_encrypt_password`); a new `get_claude_api_key()` decrypts it for use. Legacy plaintext keys are still read transparently and re-encrypted on the next save. Readers in `document_scanner.py` and `routes/app_routes.py` updated accordingly.

---

## [1.6.28] — 2026-05-28

### Added

- **Date-range scoping for viewer tokens** — tokens can now carry optional `valid_from` and `valid_to` scope fields (YYYY-MM-DD). When set, `GET /api/db/flagged` filters items whose `modified` date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces `valid_from ≤ valid_to`. All three scope dimensions (role, user, date-range) are independent and combinable.

- **CPR-only mode** — a new `cpr_only` scan option (sidebar toggle `#optCprOnly`, profile editor `#peOptCprOnly`) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. The flag already detected is still shown on cards when `cpr_only=false` (default). Gated in all three engines: file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.

- **OCR language override** — a new `ocr_lang` scan option (sidebar select `#optOcrLang`, profile editor `#peOptOcrLang`) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: `dan+eng` (default), `dan`, `eng`, `dan+eng+deu`, `dan+eng+swe`, `dan+eng+fra`. The setting flows from the UI through the profile, into all three scan engines (M365 `_scan_bytes_timeout`, M365 attachments `_scan_bytes`, M365 files `_scan_bytes`, Google `_scan_bytes` for both Gmail and Drive). The `lang` parameter is threaded through `cpr_detector._scan_bytes` → `document_scanner.scan_pdf` / `scan_image` and the spawned PDF-OCR subprocess worker. The OCR cache key already included `lang`, so per-language results are cached independently.

- **Built-in file redaction for local files** — a scissor button (`✂`) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by `██████-████` (DOCX/XLSX) or `█`-blocks (CSV/TXT), then removes the card from the grid and logs a `"redacted"` disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in `routes/export.py` (`POST /api/redact_item`) using the existing `document_scanner` redact functions; front-end in `results.js` (`redactItem`) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.

- **`DELETE /api/delete_item` route registration fix** — the `delete_item` handler in `routes/export.py` was missing its `@bp.route` decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.

- **Scheduled report-only email job** — scheduled jobs can now be configured as "report only" (toggle `#schedReportOnly`). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via `get_session_items()`. M365 authentication is not required for report-only jobs — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting `report_only=True` in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).

- **Compliance audit log** — every significant admin action is now written to an immutable `audit_log` table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via `GET /api/audit_log` (returns newest-first, max 1000 entries; filterable by `?action=`). Visible in the Settings modal under a new **Audit Log** tab; the table refreshes whenever the tab is opened. The `log_audit_event()` module-level helper in `gdpr_db.py` silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.

### Fixed

- **Stop button had no effect on Google Workspace scans** — `POST /api/scan/stop` only set `state._scan_abort` (the M365/file abort event) and never touched `state._google_scan_abort`. Separately, `_check_abort()` inside `_run_google_scan` was checking `gdpr_scanner._scan_abort` (the M365 event) instead of the module-level `_scan_abort` alias that points to `state._google_scan_abort`. Both bugs combined meant neither the Stop button nor `POST /api/google/scan/cancel` had any effect on a running Google scan. Fixed by having `scan_stop()` set both events and having `_check_abort()` use the correct module-level alias.

- **Settings tab labels wrapping to two lines** — adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry `white-space:nowrap`; `.settings-tabs` retains `flex-wrap:wrap` as a safety net on very small screens.

---

## [1.6.27] — 2026-05-27

### Added

- **Email body excerpt preserved for offline preview** — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (`body_excerpt`), the checkpoint JSON, and a new `body_excerpt` DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper `_excerpt_page()` in `routes/database.py` renders the excerpt with the same header layout as the full Graph-fetched preview.

- **Re-scan diff — resolved items in history view** — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green `✓ Resolved` badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in `history.js` (`loadHistorySession`) and `results.js` (`appendCard`).

- **Google Workspace scan test suite** — 19 new tests in `tests/test_google_scan.py` covering all three routes (`GET /api/google/scan/users`, `POST /api/google/scan/start`, `POST /api/google/scan/cancel`) and the core scan engine (`_run_google_scan`). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as `scan_file_flagged`, clean items are not, `source_type` is correctly set to `"gmail"` for Gmail items and `"gdrive"` for Drive items, and `google_scan_done` always fires with correct `flagged_count` / `total_scanned` values.

---

## [1.6.26] — 2026-04-29

### Fixed

- **Previous scan results visible when a new scan starts** — two async functions (`loadHistorySession` and `loadLastScanSummary`) could resolve after `startScan` had already cleared the grid. `loadHistorySession` would re-populate the grid with old history items; `loadLastScanSummary` would re-show the last-scan summary card. Both functions now bail early after each `await` if any of the three scan-running flags (`S._m365ScanRunning`, `S._googleScanRunning`, `S._fileScanRunning`) is set — those flags are written synchronously by `startScan` before any awaits, so the check is race-free.

- **Selected card scrolls out of view when preview panel opens** — clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. `openPreview()` now schedules a `requestAnimationFrame` after removing `.hidden` from the panel so the card is scrolled back into view (`scrollIntoView block: nearest`) once the layout has settled.

- **Gmail and Google Drive preview crashed with a 404 Graph API error** — `_source_type` was never set on Google items in `routes/google_scan.py`, so Gmail and Google Drive cards carried an empty `source_type`. The preview route in `routes/database.py` only checked for `"local"`, `"smb"`, and `"email"` before falling through to the M365 else-branch, which tried to call `https://graph.microsoft.com/.../drive/items/gmail:{id}/preview` — always a 404. Fixed by tagging Gmail items as `_source_type = "gmail"` and Google Drive items as `"gdrive"` at scan time. The preview route now handles both: Google Drive files get an embeddable `https://drive.google.com/file/d/{id}/preview` iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The `state.connector` (M365 auth) guard was also moved inside the `email` and M365 `else` branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.

---

## [1.6.25] — 2026-04-25

### Added

- **Checkpoint / resume for Google and File scans** — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (`checkpoint_google.json`, `checkpoint_file_{source_id}.json`) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. `POST /api/scan/checkpoint` returns a per-engine breakdown; `POST /api/scan/clear_checkpoint` wipes all `checkpoint_*.json` files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. `checkpoint.py` functions gained a `prefix` keyword argument (default `"m365"`) — existing M365 call sites are unchanged.

- **CPR cross-referencing (related documents)** — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects `?ref=N`). Powered by a self-join on the existing `cpr_index` table — no new data collection needed. New `GDPRDb.get_related_items(item_id, ref_scan_id)` method and `GET /api/db/related/<item_id>?ref=N` endpoint in `routes/database.py`. Frontend: `#previewRelated` div in the preview panel, `_loadRelated(f)` in `results.js`, `window._openRelated(id, itemData)` helper (looks up live `S.flaggedData` first, falls back to API response for history items).

- **Email address and Danish phone number detection** — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options **Scan for email addresses** and **Scan for phone numbers** (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as `email_count` / `phone_count` on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (`local@domain.tld`); phone regex covers 8-digit Danish numbers with optional `+45`/`0045` prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to `flagged_items`; applied automatically on first startup via `_MIGRATIONS`).

- **SFTP as a 4th file connector** — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new `SFTPScanner` class in `sftp_connector.py` implements the same `iter_files()` interface as `FileScanner`, so `run_file_scan()`, SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via `POST /api/file_sources/upload_key` and stored in `~/.gdprscanner/sftp_keys/` with `chmod 600`. SFTP sources appear with a 🔒 icon in the sources panel. Requires `paramiko>=3.4` (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.

- **`POST /api/file_sources/upload_key`** — new endpoint that validates and stores an SSH private key file, returning a `key_path` for use in the source definition.

- **SFTP entry in export SOURCE_MAP** — Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (`EDE9F7`), consistent with the existing per-source tab and summary table logic.

### Fixed

- **File source form placeholders untranslated** — all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new `data-i18n-placeholder` keys added to `en.json`, `da.json`, and `de.json`; all 12 affected `<input>` elements now carry `data-i18n-placeholder` attributes.

- **"Name" and "Auth" labels untranslated in SFTP form** — the source-name label and the Auth toggle label in the add-source panel had no `data-i18n` attributes. Added keys `m365_fsrc_name` (DA: "Navn") and `m365_fsrc_sftp_auth` (same across languages). The name label used an inner `<span data-i18n>` to preserve the required-field `*` indicator, which would have been clobbered by a `data-i18n` on the outer `<label>` element. The same clobber bug was fixed for the `m365_fsrc_label` usage in the edit form.

- **Password field placeholder showed "Stored in OS keychain" in English** — added translation key `m365_fsrc_pw_keychain_placeholder` (DA: "Gemt i OS-nøglering") and applied `data-i18n-placeholder` to the three password inputs across both forms (SMB add, SFTP add, SMB edit).

---

## [1.6.24] — 2026-04-25

### Fixed

- **Scheduler UI showed untranslated English strings** — frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in `scheduler.js` instead of using `t()`. All six call sites in `schedLoad()`, `schedRenderJobs()`, and `schedLoadHistory()` now call `t()` with the appropriate key. Three new translation keys added to `en.json`, `da.json`, and `de.json`: `m365_sched_no_jobs`, `m365_sched_running`, `m365_sched_disabled`.

---

## [1.6.23] — 2026-04-21

### Added

- **Video file metadata scanning** — `.mp4`, `.mov`, `.m4v`, `.avi`, `.mkv`, `.wmv`, `.flv`, `.webm` files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime `©xyz` atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the `gps_location` special category, exactly like a geotagged photo. AVI metadata (RIFF INFO `INAM`/`IART`/`ICMT`) is parsed without any external library. Requires `mutagen>=1.47` (added to `requirements.txt`).

- **Audio file metadata scanning** — `.mp3`, `.flac`, `.ogg`, `.m4a`, `.aac`, `.wma`, `.wav`, `.opus`, `.aiff` files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses `mutagen.File(easy=True)` which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with `exif_pii`. Fixed a silent bug in `_extract_audio_metadata` where `mutagen.File(io.BytesIO(content), filename)` was passing the BytesIO as the `filename` positional argument; corrected to `mutagen.File(fileobj=..., filename=...)`.

- **Audio and video test fixtures** — `tests/fixtures/local_files/generate_fixtures.py` now generates 6 new fixtures: `14_audio_artist_pii.mp3`, `15_audio_artist_pii.flac` (artist name → flag), `16_audio_no_pii.mp3`, `17_audio_no_pii.flac` (no tags → no flag), `18_video_gps.mp4` (GPS + artist → flag), `19_video_no_pii.mp4` (no tags → no flag). Total fixtures: 19 (14 flagged, 5 negative).

### Fixed

- **Audio and video files not appearing in local/SMB file scan** — `file_scanner.py` maintained its own hardcoded `DEFAULT_EXTENSIONS` set that was never updated when video and audio extensions were added to `cpr_detector.SUPPORTED_EXTS`. Fixed by importing `SUPPORTED_EXTS` from `cpr_detector` directly; `DEFAULT_EXTENSIONS` is now an alias for it. `cpr_detector.SUPPORTED_EXTS` is the single source of truth for all scan sources (M365, Google Drive, local, SMB).

- **Profile copy rename not reflected in left column until modal reopen** — saving a renamed profile via the full editor (`_pmgmtSaveFullEdit`) called `loadProfiles()` to refresh `S._profiles` but never called `_renderProfileMgmt()`, so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling `_renderProfileMgmt()` immediately after `loadProfiles()` and re-applying the `.active` highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.

---

## [1.6.22] — 2026-04-21

### Added

- **Auto-email after manual scan** — a new **Email report after manual scan** toggle in **Settings → Email report** sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as `auto_email_manual` in `smtp.json`. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected).

- **Route integration test suite** — 44 new tests in `tests/test_route_integration.py` covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on `GET /api/db/flagged`, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on `run_scan()` exception, and `GET /api/db/sessions` shape and ordering. Total test count: 172.

### Fixed

- **Role scope filter silently returned nothing** — `GET /api/db/flagged` filtered rows by `row.get("role")` but the column returned from the DB is `user_role`. Role-scoped viewer tokens (`{"role": "student"}` or `{"role": "staff"}`) therefore excluded every item and returned an empty list. Fixed in `routes/database.py`.

- **Historical session query included newer scans** — `gdpr_db.get_session_items(ref_scan_id=N)` used a lower-bounded window (`started_at >= ref.started_at - 300`) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. Fixed by adding an upper bound (`started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300`).

- **Scan button stuck disabled after file scan** — `run_file_scan` broadcast a `scan_start` SSE event, which the `scan_start` handler in `_attachSchedulerListeners` intercepted and set `S._m365ScanRunning = true`. When `file_scan_done` fired it checked `!S._m365ScanRunning` before re-enabling the button — finding it still `true`, the button stayed disabled permanently. No `scan_done` (M365) ever arrives to clear the flag. Fixed by removing the `scan_start` broadcast from `run_file_scan`; the `scan_phase "Files — …"` event immediately following already sets `_fileScanRunning` correctly via the phase-source detection in `_attachScanListeners`.

- **`TypeError: unhashable type: 'dict'` during file and M365 scans** — `_distinct_cprs = list(dict.fromkeys(cprs))` in both scan paths treated `cprs` as a list of strings, but `extract_matches` returns a list of dicts (`{"formatted": "…", "page": …, …}`). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both `run_file_scan` (line 251) and `run_scan` (line 1100) by keying on `c["formatted"]`: `list(dict.fromkeys(c["formatted"] for c in cprs))`.

- **Profile applied early lost user selection and source checkboxes** — two startup race conditions: (1) Profiles with `user_ids = "all"` applied before the M365 user list had loaded ran `.forEach()` on an empty array (no-op); when `loadUsers()` completed it defaulted all users to `selected = false` with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a `_pendingProfileAllUsers` deferred flag mirroring the existing `_pendingProfileUserIds` mechanism — `loadUsers()` applies it after populating `S._allUsers`. (2) If the profile was selected in the narrow window before `_loadFileSources()` returned and rendered the sources panel, `_applyProfile()` iterated zero checkboxes and the source selection was silently discarded; a subsequent `renderSourcesPanel()` call then re-rendered all sources as checked (their default). Fixed by calling `renderSourcesPanel()` in `_applyProfile()` when no source checkboxes are present in the DOM yet — same guard already used in `loadUsers()`.

---

## [1.6.21] — 2026-04-20

### Added

- **Local-file scan test fixtures** — `tests/fixtures/local_files/` contains 13 ready-made files (`.txt`, `.csv`, `.docx`, `.xlsx`) covering every detection scenario: CPR with explicit label, mod-11–valid CPR without label, post-2007 CPR with/without context keyword, protected number (day+40), multiple CPRs in one file, mixed PII (CPR + email + Art. 9 health data), and three true-negative cases (clean content, invoice false-positive, post-2007 serial number without context). All CPR numbers are mathematically valid; false-positive fixtures are verified to produce zero hits. Run `generate_fixtures.py` to regenerate the binary files.

- **Interface PIN** — optional session-level authentication gate for the main scanner interface. Set a 4–8 digit PIN in **Settings → Security → Interface PIN**; anyone reaching `http://host:5100` is redirected to `/login` and must enter the PIN before accessing scan controls, settings, or results. Viewer tokens and the `/view` route are completely unaffected — reviewers continue to use their own auth chain. The PIN is stored as a salted SHA-256 hash in `config.json`. Brute-force protection: 5 failed attempts per IP locks out for 5 minutes. A `POST /api/interface/logout` endpoint clears the session. PIN management via `GET/POST/DELETE /api/interface/pin`.

### Fixed

- **"Vælg" (select mode) button did nothing** — `toggleSelectMode`, `toggleCardSelect`, `selectAllVisible`, and `applyBulkDisposition` were defined inside an ES module but never assigned to `window`, so all `onclick` attributes calling them silently failed. Added the four missing `window.*` exports at the bottom of `results.js`.

- **Progress counter frozen at M365 total during Google/file scan** — the `scan_progress` handler in `scan.js` only updated `progressStats` and `progressEta` for `source === "m365"`. When M365 finished first, the counter stayed at its final value (e.g. "15083 / 15083 ETA 0s") for the entire duration of the Google and file scans. Fixed in two places: `scan_done` now clears the stats/ETA elements immediately when another scan is still running; `scan_progress` for Google/file sources now shows a running `"X scanned"` count (using the `scanned` field those engines already send) and clears ETA, but only while M365 is not running — M365 stats continue to dominate during concurrent scans.

- **PDF OCR kills process on large files** — `document_scanner` previously called `convert_from_path()` once for the entire PDF before the processing loop, allocating all page images in memory simultaneously. A 50-page A4 PDF at 300 DPI required ~1.3 GB in a single allocation, triggering the OS OOM killer. Fixed by rendering one page at a time with `convert_from_path(first_page=N, last_page=N)` inside the loop across `scan_pdf`, `redact_fitz_pdf`, and `redact_pdf`. Peak OCR memory is now bounded to roughly one page (~26 MB at 300 DPI) regardless of document length.

- **No bulk disposition tagging** — each result card had to be opened individually to set a disposition. Added a Select mode (filter bar "Vælg" button) that reveals per-card checkboxes. Selecting one or more items shows a bulk tag bar at the bottom of the grid with a disposition dropdown and Apply button. Calls `POST /api/db/disposition/bulk`; updates all selected items in-memory and clears the selection. "Select all visible" / "Deselect all" toggle available in the bar. Hidden in viewer mode.

- **No disposition progress summary** — added a thin stats bar between the filter bar and the grid showing total · unreviewed · retain · delete · % reviewed. Updates after every single or bulk disposition save and after each grid render. Unreviewed count is highlighted in red until everything is tagged; turns green at 100%.

- **Google Drive always did a full scan** — Drive scanning in `routes/google_scan.py` used `conn.iter_drive_files()` on every run, re-downloading every file regardless of what changed. Added Google Drive delta scan using the Drive Changes API. When `delta` is enabled in scan options, the first run records a Changes API start page token per user (`gdrive:{email}` key in `delta.json`). Subsequent runs call `conn.get_drive_changes(user_email, token)` and only process files that have been added or modified since the last scan. Invalid or expired tokens fall back to a full scan automatically. Token save loads the current `delta.json` fresh before writing to avoid racing with concurrent M365 token saves. `google_scan_done` SSE event now includes `delta` and `delta_sources` fields.

- **No memory guard before OCR page renders** — added `_ocr_mem_ok()` check (`psutil.virtual_memory().available >= 500 MB`) before each page render in all three OCR paths. Pages that would exceed the threshold are skipped and recorded as `"skipped"` in `page_methods` with a printed warning rather than crashing the scan.

---

## [1.6.20] — 2026-04-18

### Fixed

- **Graph `sendMail` reported as failure despite email being delivered** — `_post()` in `m365_connector.py` called `r.json()` unconditionally after `raise_for_status()`. The Graph `sendMail` endpoint returns HTTP 202 with an empty body on success, causing `json.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. This was caught by the `smtp_test` exception handler and surfaced as an error even though the email had been sent. Fixed by returning `r.json() if r.content else {}` so any Graph endpoint that responds with no body (sendMail, delete operations, etc.) is handled correctly.

- **Graph error hidden when SMTP host not configured** — when Graph failed and no SMTP host was saved, `smtp_test` returned the generic "No SMTP host configured" message, swallowing the actual Graph error. The `if not host` branch now surfaces the Graph exception text alongside the Mail.Send permission guidance so the real cause is visible.

- **Gmail vs Google Workspace SMTP error messages** — the auth failure handler now detects whether the username is a personal Gmail address (`@gmail.com`) or a Google Workspace custom-domain account, and shows a different message for each. Personal Gmail: existing App Password troubleshooting steps. Google Workspace: explains that SMTP access is controlled by the Workspace admin console (2-Step Verification policy, SMTP relay service), not the user's personal security settings.

---

## [1.6.19] — 2026-04-18

### Fixed

- **Gmail SMTP error message misleading when App Password already in use** — the auth failure handler in both `smtp_test` and `send_report` unconditionally told the user to "create an App Password", even when they were already using one. Gmail returns the same `535` / `Username and Password not accepted` error for a wrong app password, a revoked app password, spaces left in the 16-character code, or a wrong username — none of which are helped by the old message. The Gmail branch now lists the three most common causes (spaces in the code, revoked password, wrong username) and still links to the App Password page to generate a new one. The Microsoft personal account branch is unchanged.

---

## [1.6.18] — 2026-04-18

### Fixed

- **Art.30 and Excel exports missing GWS and local/SMB sources** — two silent failures caused Google Workspace and file-scan results to be absent from all exports after a page reload.
  - `routes/google_scan.py`: called `_db.end_scan()` (method does not exist on `GDPRDb` — the correct name is `finish_scan`). The resulting `AttributeError` was swallowed by the bare `except Exception: pass` guard, so `finished_at` was never written on GWS scan records. Since `get_session_items()` requires `finished_at IS NOT NULL`, every GWS scan was permanently invisible to both export functions.
  - `routes/google_scan.py`: emitted `"scan_done"` at completion instead of `"google_scan_done"`, causing the M365 done handler to fire for Google scans and breaking the SSE teardown logic.
  - `scan_engine.py` (`run_file_scan`): called `_db.begin_scan(sources=…, user_count=0, options=source)` with keyword arguments, but `begin_scan(self, options: dict)` only accepts a single positional dict. The `TypeError` was caught silently, leaving `_db_scan_id = None`; all subsequent `save_item` calls were skipped, so local and SMB items were never written to the database.

---

## [1.6.17] — 2026-04-18

### Added

- **Scan history browser** — results from any past scan session can now be reviewed without running a new scan. On page load, when no scan is running, the last completed session is automatically loaded into the results grid. A **History** banner appears above the filter bar showing the session date, scanned sources, and item count. A **Sessions** button in the banner opens a dropdown listing all past sessions newest-first, each showing date, time, source labels, item count, and Delta / Latest badges. Clicking a session loads its items. A **Latest scan** button (shown only when browsing a past session) jumps back to the most recent session. Starting a new scan exits history mode and takes over the grid with live SSE results. Session cache is invalidated on each scan completion so the picker always reflects the true state of the database.

  - `gdpr_db.py` — new `get_sessions(limit, window_seconds)` groups all completed scans by the 300-second concurrent-scan window and returns session summaries newest-first. `get_session_items()` gains an optional `ref_scan_id` parameter to anchor the session window to any past scan.
  - `routes/database.py` — new `GET /api/db/sessions`; `GET /api/db/flagged` now accepts `?ref=<scan_id>` to serve items for a specific historical session.
  - `static/js/history.js` (new) — `loadHistorySession(refScanId)`, `openHistoryPicker()`, `closeHistoryPicker()`, `exitHistoryMode()`, `invalidateHistoryCache()` all exposed on `window`.
  - `state.js` — `_historyRefScanId: null` tracks which session is currently displayed (`null` = live/SSE).
  - `results.js` — initial status check calls `loadHistorySession(null)` instead of `loadLastScanSummary()`.
  - `scan.js` — `startScan()` calls `exitHistoryMode()`; all three `*_done` handlers call `invalidateHistoryCache()`.

- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.

---

## [1.6.16] — 2026-04-18

### Added

- **User-scoped viewer tokens (#34)** — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a **User** option that opens a searchable name autocomplete backed by the already-loaded `S._allUsers` list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as `{"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}`. Server-side enforcement in `GET /api/db/flagged` filters `WHERE account_id IN (list)` so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (`#viewerIdentityBadge`); `#filterRole` is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (`account_id = ""`) never appear in user-scoped views — consistent with the existing role-scope behaviour.

---

## [1.6.15] — 2026-04-12

### Added

- **Role-scoped viewer tokens** — viewer token links can now be restricted to a single role so the recipient can only see student or staff items. A new **Role scope** dropdown (All roles / Ansatte / Elever) in the Share modal is selected when creating a token. The scope is stored as `"scope": {"role": "student"|"staff"}` in `viewer_tokens.json`. Enforcement is two-layered: `GET /api/db/flagged` filters items server-side using `session["viewer_scope"].role` set at token validation time; the `#filterRole` dropdown in the viewer is pre-set and hidden so the constraint cannot be bypassed client-side. Tokens without a scope field (existing tokens, PIN sessions) remain unrestricted. Role badge (Ansatte / Elever) shown on each scoped token row in the Active links list.

- **Role filter in results + role-scoped exports** — a new **Role** dropdown in the filter bar (All roles / Ansatte / Elever) narrows the results grid to staff or student items. Clicking **Excel** or **Art.30** while a role is selected exports only that group — the `?role=student|staff` param is forwarded to both export endpoints. `_build_excel_bytes()` and `_build_article30_docx()` now accept a `role` param; all internal sheets (GPS, External transfers, Art.30 staff/student tables) respect the filter. Filenames get an `_elever` or `_ansatte` suffix.

- **Scan filter options for student environments** — two new profile options reduce noise when scanning student accounts:
  - **Ignore GPS in images** (`skip_gps_images`) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
  - **Min. CPR count per file** (`min_cpr_count`, default 1) — a file is only flagged if it contains at least this many *distinct* CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
  - Both options are saved in profiles and editable in the Profile Manager editor.

- **GitHub Actions CI/CD — macOS build** — `.github/workflows/build.yml` now also builds a macOS `.app` bundle (`macos-15`, Apple Silicon ARM64) on every push to `main` and on `v*` tags. Released as `GDPRScanner_macos_arm64.zip`. (Originally `macos-13` / Intel, changed when GitHub retired that runner.)

### Fixed

- **OneDrive 404 errors during delta scans** — `GET /users/{id}/drive/root/delta` returns 404 for users with no OneDrive licence, a disabled service plan, a drive that was never provisioned (account never signed in), or a suspended account. Previously these 404s fell through to `requests.raise_for_status()` and were caught by the generic `except Exception` handler in `_scan_user_onedrive`, broadcasting a red `scan_error` card. Full scans never showed the error because `_iter_drive_folder_for` has a bare `except Exception: return`. Fixed by adding `M365DriveNotFound(M365Error)` to `m365_connector.py`, raising it from `_get()` on HTTP 404, and handling it explicitly in `_scan_user_onedrive` with a `scan_phase` broadcast ("OneDrive (user): not provisioned — skipped") before the generic exception handler.

- **CI — Windows artifact never uploaded** — PyInstaller `--onedir` puts the exe inside `dist/GDPRScanner/`, not at `dist/*.exe`. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips `dist\GDPRScanner\` into `GDPRScanner_windows_x64.zip` (mirroring the existing Linux step).
- **`EFFORT_ESTIMATE.md`** — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
- **Settings → Security tab** — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's **Configure** button navigates directly to the Security tab.
- **Viewer mode layout** — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The `🔍 GDPRScanner` brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
- **Share modal — Revoke / Copy buttons broken** — `JSON.stringify(token)` produced a double-quoted string that terminated the surrounding `onclick="…"` HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format.
- **Viewer PIN — Clear PIN rejected with "current PIN is incorrect"** — clicking **Clear PIN** without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty.
- **Share modal — all UI strings now translated** — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (`share_*`, `viewer_pin_*`) in `en.json`, `da.json`, and `de.json`.
- **Excel / ART.30 export — Gmail and Google Drive missing from summary** — `by_source` was built from flagged items only, so sources that produced zero hits were silently skipped. Both the Excel Summary sheet and the ART.30 "Breakdown by source" table now include every source that was actually scanned, showing `0` items and `0` CPR hits where nothing was found. New `GDPRDb.get_session_sources()` method reads the `sources` JSON column from all scans in the current session window to determine which sources ran.
- **Scan never finishes when M365 + Google run concurrently** — `scan_done` (M365 finished) was closing the SSE connection immediately via `S.es.close()`, even when `S._googleScanRunning` or `S._fileScanRunning` was still true. The `google_scan_done` / `file_scan_done` events therefore never arrived, leaving the progress bar stuck at 100% indefinitely. SSE teardown is now deferred until the last concurrent scan completes: `scan_done` only closes the connection if neither Google nor File is still running; `google_scan_done` and `file_scan_done` close it when they are the final scan to finish.

---

## [1.6.14] — 2026-04-10

### Added — read-only viewer mode (#33)

A DPO, school principal, or compliance coordinator can now review scan results and tag dispositions without access to scan controls, credentials, or settings.

**Token links**

- New `🔗` **Share** button in the topbar opens a token management modal.
- **Create** generates a 64-char hex token (`secrets.token_hex(32)`) with an optional label and expiry (7 d / 30 d / 90 d / 1 yr / never).
- **Copy** copies the full `http://host:5100/view?token=…` URL to the clipboard.
- **Revoke** deletes the token immediately; any browser using it is locked out on next navigation.
- Tokens are stored in `~/.gdprscanner/viewer_tokens.json` with `created_at`, `expires_at`, and `last_used_at` metadata. Expired tokens are cleaned up on each list fetch.

**PIN alternative**

- A 4–8 digit numeric PIN can be set in **Settings → General → Viewer PIN**.
- Opening `/view` without a token shows a PIN entry form (`templates/viewer_pin.html`).
- Correct PIN sets a Flask session cookie (`session["viewer_ok"]`) valid for the browser session — no token needed after that.
- Brute-force guard: 5 failed attempts per 5 minutes per IP returns 429.
- PIN stored as salted SHA-256 inside `viewer_tokens.json` (no extra dependencies).

**`/view` route**

- Checks `?token=` first (validates + binds session), then existing session cookie, then PIN form (if a PIN is configured), then 403.
- Serves the same `index.html` with `window.VIEWER_MODE = true` injected.
- Invalid/expired tokens show `templates/viewer_denied.html`.

**Viewer mode (JS)**

- `auth.js` — bypasses M365 auth check entirely; adds `viewer-mode` class to `<body>`; shows scanner screen immediately.
- `results.js` — on `DOMContentLoaded` calls `_loadViewerResults()` which fetches `GET /api/db/flagged` (all items from the last completed scan session, joined with dispositions) and renders the grid directly — no SSE required.
- CSS (`body.viewer-mode`) hides: Sources/Options/Accounts sidebar panels; Scan/Stop buttons; profile bar; config-group buttons; resume banner; bulk-delete button; per-card delete button; data-subject delete button; Share button.
- Disposition tagging (select + Save) remains fully functional — `/api/db/disposition` has no auth guard.
- Filter bar, Excel export, Art.30 export, preview panel, and log remain accessible.

**New files:** `routes/viewer.py`, `static/js/viewer.js`, `templates/viewer_pin.html`, `templates/viewer_denied.html`

**Files changed:** `app_config.py`, `gdpr_scanner.py`, `templates/index.html`, `static/style.css`, `static/js/auth.js`, `static/js/results.js`, `static/js/scheduler.js`, `routes/database.py`

---

### Fixed — memory exhaustion during large M365 scans

Addressed root causes of runaway memory growth (reported: up to 90 GB RSS) that could crash the host machine during scans of large Microsoft 365 tenants.

**`scan_engine.py`**

- **Email body HTML stripped at collection time** — Graph API returns the full `body` field (raw HTML, up to ~1 MB per message) for every email fetched. Previously, all message dicts — including the raw HTML — were accumulated in `work_items` before any scanning began. For 1 000 users × 2 000 emails this could mean >100 GB in `work_items` alone. The body is now converted to plain text immediately on collection (`_precomputed_body`), and the raw `body` and `bodyPreview` keys are deleted from the dict before it is queued. The processing loop reads `_precomputed_body` via `pop()` and `del`s it after use.
- **`work_items` converted to `deque` before processing** — items are now released from memory one by one via `popleft()` as they are processed, rather than keeping the entire list alive for the duration of the scan. `gc.collect()` is called immediately after conversion and after each checkpoint save.
- **`content` bytes freed as early as possible in the file processing branch** — raw download bytes are now `del`'d immediately after `content.decode()` (before the expensive NER/PII pass), and also in the no-hits `else` branch where they were previously kept alive until the next loop iteration.
- **`body_text` freed after use in the email branch** — `del body_text` added after `_broadcast_card` so large plain-text bodies do not linger until the next iteration.
- **Memory guard before file downloads** — uses `psutil.virtual_memory().available` to skip a file download and log a warning if fewer than 300 MB of RAM are available, preventing a single large file from pushing an already-pressured machine into OOM.

**`document_scanner.py`**

- **PDF OCR page images freed page by page** — `convert_from_path()` renders all pages at 300 DPI before scanning begins (~26 MB per A4 page; a 100-page PDF ≈ 2.6 GB). Each rendered `PIL.Image` is now nulled out (`images[page_num-1] = None`) immediately after OCR, so only one page image is live at a time instead of the entire document.

### Changed — Sources panel is now resizable and collapsible

The **KILDER** sidebar panel now behaves consistently with the other sidebar sections.

- **Collapsible** — the `▾` / `▸` toggle was already wired up; collapse state is already persisted in `localStorage`. No change needed here.
- **Resizable** — a drag handle (`sources-resize-handle`) added at the bottom of the panel body. Dragging up shrinks the panel (scroll appears); dragging down is capped at the panel's natural content height — you cannot expand it beyond what is needed to show all sources. Height preference persisted in `localStorage` under `gdpr_sources_h`.
- **Auto-fit on render** — `_fitSourcesPanel()` is called at the end of every `renderSourcesPanel()` invocation. On first load and whenever sources are added or removed (e.g. connecting Google), the panel height snaps to exactly fit all visible sources. A previously saved smaller height is honoured only if it is still smaller than the new content height; dragging back to full height clears the saved preference.
- The old `max-height: calc(5 * 26px)` fixed cap is removed.

**Files changed:** `templates/index.html`, `static/style.css`, `static/js/log.js` (`_fitSourcesPanel`, `_initSourcesResize`), `static/js/sources.js`, `static/js/results.js`.

---

## [1.6.13] — 2026-04-10

### Added — developer tooling

- **`run_tests.sh`** — shell script to activate the venv and run the full test suite. Accepts any `pytest` arguments: `./run_tests.sh`, `./run_tests.sh -q`, `./run_tests.sh tests/test_app_config.py`.
- **Directory-scoped `CLAUDE.md` rules** — `routes/CLAUDE.md`, `static/js/CLAUDE.md`, `templates/CLAUDE.md`, `lang/CLAUDE.md` replace the previous single-file context document. Each file is loaded automatically by Claude Code only when working in the relevant directory.

### Fixed — documentation

- **`README.md` project files table** — removed four phantom entries (`Dockerfile`, `docker-compose.yml`, `.dockerignore`, `scanner_audit.jsonl`); corrected `static/app.js` description to "archived monolith — no longer loaded"; fixed manual paths (`MANUAL-EN.md` → `docs/manuals/MANUAL-EN.md`); added missing files: `scan_engine.py`, `sse.py`, `checkpoint.py`, `app_config.py`, `cpr_detector.py`, `google_connector.py`, `static/style.css`, `static/js/*.js`, `routes/google_auth.py`, `routes/google_scan.py`, `run_tests.sh`, `docs/setup/` guides.
- **`docs/manuals/MANUAL-EN.md`**, **`docs/manuals/MANUAL-DA.md`** — version header updated from 1.6.11 → 1.6.13; footer updated from v1.6.8 → v1.6.13.

### Changed — blueprint migration batch 3, 4, 5 (auth, database, export — migration complete)

All remaining direct `@app.route` registrations removed from `gdpr_scanner.py`. Flask now routes every API endpoint exclusively through its blueprint. Only `GET /` and `GET /api/scan/stream` (SSE) remain in `gdpr_scanner.py`.

**`routes/auth.py`** — rewritten with direct imports (batch 3, 6 routes):
- `MSAL_OK`, `M365Connector`, `M365Error` imported from `m365_connector`
- `_load_config`, `_save_config` imported from `app_config`
- Dead module-level globals `_pending_flow` and `_auth_poll_result` removed from `gdpr_scanner.py`
- Routes removed: `/api/auth/status`, `/api/auth/start`, `/api/auth/poll`, `/api/auth/userinfo`, `/api/auth/signout`, `/api/auth/config`

**`routes/database.py`** — rewritten with direct imports (batch 4, 15 routes):
- `_get_db`, `DB_OK` from `gdpr_db`; `_set_admin_pin`, `_verify_admin_pin`, `_admin_pin_is_set` from `app_config`; `_clear_checkpoint`, `_DELTA_PATH` from `checkpoint`; `_extract_exif`, `_html_esc`, `_placeholder_svg` from `cpr_detector`
- `SCANNER_OK` determined by local `import document_scanner` try/except
- `db_export` improved: uses `NamedTemporaryFile` instead of `mktemp` (safer for frozen apps)
- Email preview HTML: full CSS ruleset (`*, *::before, *::after`, `img`, `table`, scrollbar) from gdpr_scanner.py version restored
- Routes removed: `/api/db/stats`, `/api/db/trend`, `/api/db/scans`, `/api/db/subject`, `/api/db/overdue`, `/api/db/disposition` (×2), `/api/db/deletion_log`, `/api/db/reset`, `/api/admin/pin` (×2), `/api/db/export`, `/api/db/import`, `/api/preview/<item_id>`, `/api/thumb`

**`routes/export.py`** — rewritten with direct imports (batch 5, 3 routes):
- `_get_db`, `DB_OK` from `gdpr_db`; `_GUID_RE`, `_resolve_display_name` from `app_config`; `M365PermissionError` from `m365_connector`
- `app.logger` replaced with `logging.getLogger(__name__)`
- Dead `delete_item()` helper removed from `gdpr_scanner.py` (was unreachable; blueprint has its own copy)
- Routes removed: `/api/export_excel`, `/api/export_article30`, `/api/delete_bulk`

**`tests/test_routes.py`** — `db_patch` fixture updated: now patches `routes.database._get_db` / `routes.database.DB_OK` and `routes.export._get_db` / `routes.export.DB_OK` (was patching `gdpr_scanner._get_db`/`gdpr_scanner.DB_OK` which no longer have any effect). Two `test_without_db_returns_503` tests updated to monkeypatch `routes.database.DB_OK` instead of `gdpr_scanner.DB_OK`.

---

## [1.6.12] — 2026-04-10

### Fixed — profile editor save drops users from non-active role groups

In `_pmgmtSaveFullEdit` (profile management editor), the save function applied the active role filter (`_pmgmtRoleActive`) to the list of checked checkboxes before saving. Since `_pmgmtFilterAccounts` hides rows via `display:none` but does not uncheck them, users from other role groups that remained checked (but hidden) were silently discarded on save. The role filter at save time is removed — all checked checkboxes are now captured regardless of which role tab is visible.

---

## [1.6.11] — 2026-04-10

### Changed — blueprint migration batch 1 (scan + app_routes)

15 direct `@app.route` registrations removed from `gdpr_scanner.py`. Flask now routes all of these exclusively through their blueprint counterparts, which previously existed as dead code shadowed by the direct routes.

**`routes/scan.py`** — rewritten with direct imports (was entirely non-functional as dead code due to bare-name `NameError`s behind the shadow):
- Added `GET /api/scan/status` (new — was only in gdpr_scanner.py)
- Added `GET /api/src_toggles`, `POST /api/src_toggles` (new — was only in gdpr_scanner.py)
- `scan_checkpoint_info` — added missing `check_only` handling present in the gdpr_scanner.py version
- All state references converted from bare names to `state._scan_lock` / `state._scan_abort`; `run_scan` imported lazily from `scan_engine` inside `_run` to avoid circular imports
- `_save_settings`, `_load_settings`, `_load_src_toggles`, `_save_src_toggles` imported from `app_config`
- `_checkpoint_key`, `_load_checkpoint`, `_clear_checkpoint`, `_load_delta_tokens`, `_DELTA_PATH` imported from `checkpoint`

**`routes/app_routes.py`** — cleaned up:
- `APP_VERSION` now computed locally from `VERSION` file (was a bare-name reference to gdpr_scanner.py global)
- `_LANG_DIR` computed at module level; fixed `sys` / `_sys` alias mismatch in `get_langs` (bug in blueprint that never manifested while shadowed)
- `_set_lang_override`, `_load_lang_forced` imported directly from `app_config`
- `get_langs` — added missing `langs.sort()` present in the gdpr_scanner.py version

**`tests/test_routes.py`** — `mock_connector` fixture simplified: no longer needs to patch `gdpr_scanner._connector` since the direct `scan/start` route is gone; `state.connector` alone is sufficient. `run_scan` stub in `test_authenticated_returns_started` updated to target `scan_engine` directly.

**Routes removed from `gdpr_scanner.py`:** `/api/about`, `/api/langs`, `/api/set_lang`, `/api/lang`, `/api/scan/status`, `/api/scan/start`, `/api/scan/stop`, `/api/scan/checkpoint`, `/api/scan/clear_checkpoint`, `/api/settings/save`, `/api/settings/load`, `/api/src_toggles`, `/api/delta/status`, `/api/delta/clear`

**Still in `gdpr_scanner.py`:** `GET /` (root), `GET /api/scan/stream` (SSE — cannot be in a blueprint), and the `auth`, `users`, `sources`, `database`, `export` route groups (31 routes — next batches).

---

## [1.6.10] — 2026-04-10

### Fixed — Google Drive `exportSizeLimitExceeded` warning

Native Google Workspace files too large for Drive's export API (Google's server-side limit, distinct from the 20 MB local cap) now produce a clean skip message instead of a stray `WARNING googleapiclient.http — Encountered 403 Forbidden with reason "exportSizeLimitExceeded"` in the log. A `logging.Filter` subclass is installed on the `googleapiclient.http` logger at import time to suppress the duplicate external warning; the `except HttpError` block in `_drive_iter` detects the reason and logs `[gdrive] skip '<name>' — file too large for Google export API (exportSizeLimitExceeded)` with the file ID.

### Fixed — peak memory during large file/SMB scans (OOM risk reduction)

Three targeted buffer-lifetime fixes reduce peak RSS during large scans:

- **`cpr_detector.py`** — `del content` after writing the PDF bytes to a temp file in `_scan_bytes_timeout`. The 20 MB buffer was previously held in the main process for the entire duration of `p.join(timeout)` (up to 60 s), overlapping with the spawned subprocess's ~150–300 MB heap. It is now freed before the subprocess starts.
- **`scan_engine.py`** — `del content` after the thumbnail block in `run_file_scan`. The raw file buffer was kept alive through card dict construction and the start of the next loop iteration; it is now freed as soon as the thumbnail (or placeholder SVG) has been generated.
- **`file_scanner.py`** — `PREFETCH_WINDOW` reduced from 2 to 1. Halves the maximum number of concurrently-held SMB read buffers (from 2 × 20 MB to 1 × 20 MB).

---

## [1.6.9] — 2026-04-10

### Changed — frontend migrated to ES modules

**Phase 2 complete:** All 10 split JS files converted from `<script defer>` to `<script type="module">`.

- `static/js/state.js` introduced as the shared state module — exports a single `S` object holding all previously-global mutable state (`flaggedData`, `_allUsers`, `_profiles`, `_fileSources`, `_srcPct`, scan-running flags, etc.). All 10 modules import `{ S }` from `state.js` and mutate its properties in place.
- Every function called from an inline HTML `onclick=` handler is explicitly exported via `window.fnName = fnName` at the bottom of each module (~80 exports across 10 files).
- `var LANG` retained in the inline `<script>` block (not a module) so it remains a true global accessible from all modules as a bare name.
- `app.js` retained as archive; no longer loaded by `index.html`.

### Fixed — connector.js SyntaxError caused by duplicate function declarations

`openFileSourcesModal` and `closeFileSourcesModal` were declared **twice** at module top level in `connector.js` — once as redirect stubs pointing to the new unified Sources modal, and once as the old `#fsrcBackdrop` implementations left over from the pre-unification code. In ES module strict mode, duplicate `function` declarations in the same scope are a **SyntaxError**. The engine rejected the entire module at parse time, meaning none of its ~35 `window.*` exports were ever set. Symptoms:

- **"Kilder" (Sources) button did nothing** — `window.openSourcesMgmt` was never set
- Google status dot, file source loading, and sources panel re-render all silently failed — `window.smGoogleRefreshStatus`, `window._loadFileSources` etc. were undefined
- Sources panel showed only M365 sources even when Google Workspace was configured

**Fix:** removed the stale `async function openFileSourcesModal` / `function closeFileSourcesModal` bodies (lines 511–518). The redirect stubs at lines 505–506 (`openSourcesMgmt('files')`) are the correct new behaviour. Also removed the duplicate `window.openFileSourcesModal` and `window.closeFileSourcesModal` assignments that appeared twice in the exports block.

### Fixed — Profiler modal did not open when `_renderProfileMgmt` threw

If `_renderProfileMgmt()` threw a runtime error (e.g. due to downstream failures from the connector.js parse error), `openProfileMgmtModal` would abort before reaching `classList.add('open')`, leaving the modal invisibly closed. The function now wraps both `_renderProfileMgmt()` and `_pmgmtOpenEditor()` in individual try-catch blocks. Any error is logged to the console; the modal opens regardless.

### Fixed — blocking alert on every unhandled async error

`ui.js` contained a duplicate `unhandledrejection` listener that called `alert()` for every unhandled Promise rejection. Background API calls (Google status, file sources, src_toggles) could fire these alerts at page load, and browsers that had already suppressed one alert silently blocked all subsequent ones. Removed the `alert()` handler; the `console.error` handler is retained.

---

## [1.6.8] — 2026-04-09

### Fixed — memory pressure during large scans

**SMB prefetch window reduced**
- `PREFETCH_WINDOW` reduced from 5 to 2 in `file_scanner.py`. Peak in-flight SMB memory drops from ~250 MB to ~40 MB during large network share scans.
- `MAX_FILE_BYTES` reduced from 50 MB to 20 MB — files larger than 20 MB are skipped rather than buffered in full.

**PDF subprocess concurrency limited**
- A module-level `threading.Semaphore(1)` in `cpr_detector.py` ensures at most one PDF OCR subprocess runs at a time. Previously, multiple threads could each spawn a ~200 MB subprocess simultaneously, causing OOM under load.

**Google Workspace export buffer reduced**
- `_MAX_EXPORT_BYTES` in `google_connector.py` reduced from 50 MB to 20 MB.
- `_drive_iter` now explicitly deletes the `BytesIO` buffer (`del buf`) before yielding each file's bytes, releasing the double-buffer peak immediately rather than waiting for GC.

### Fixed — Excel and Article 30 exports missing sources

**Gmail and Google Drive tabs added to Excel export**
- `SOURCE_MAP` in `routes/export.py` was missing `gmail`, `gdrive`, `local`, and `smb` entries. Items from these sources were silently dropped — they were grouped internally but never written to a sheet.
- All eight source types now have dedicated tabs: Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network.
- The same fix applies to the inline Excel builder in `gdpr_scanner.py`.

**Concurrent scan results captured in exports**
- M365, Google Workspace, and file scans each create their own `scan_id`. The previous DB fallback used `get_flagged_items()`, which only returned results for the single most-recently-completed scan — silently dropping the other sources after page reload.
- New `get_session_items(window_seconds=300)` in `gdpr_db.py` returns items from all scans whose `started_at` falls within a 5-minute session window of the latest completed scan.
- Both `export_excel()` and `export_article30()` now use `get_session_items()` as their DB fallback. `_build_article30_docx()` also uses it directly.

### Changed — "Email" source renamed to "Outlook"

The `email` source type (Microsoft Exchange mailboxes) is now consistently labelled **Outlook** everywhere:
- Source badges on result cards (`SOURCE_BADGES.email`)
- Filter bar dropdown
- `_sourceLabel()` in JS
- Excel tab label
- `m365_src_email`, `m365_filter_email`, `m365_phase_emails` in all three lang files (`en.json`, `da.json`, `de.json`)
- Article 30 report uses **Exchange (Outlook)** for the formal legal context

Rationale: with Gmail also present, "Email" was ambiguous. "Outlook" ties the source unambiguously to Microsoft 365.

### Changed — progress bar moved above log panel

- `#progressBar` moved from below the topbar to just above `#logWrap` (above the activity log).
- The bar is now a permanent placeholder — always visible, never hidden. `display: flex` is the permanent state; `display: none` is no longer used.
- Background changed from `var(--surface)` to `var(--bg)` to match the log area. Border changed from `border-bottom` to `border-top`.
- New `_clearProgressBar()` helper resets phase, stats, ETA, and file fields on scan end, leaving the bar visually empty at idle. All previous `style.display` assignments removed.

### Fixed — profile manager Cancel closes entire modal

- Clicking **Cancel** in the profile editor previously closed the editor panel but left the profile list modal open behind it. `_pmgmtCloseEditor()` now calls `closeProfileMgmt()` to dismiss the full modal.
- Dead stub `function _pmgmtCancelEdit(id) {}` removed.

### Changed — exports available without running a new scan

- The filter bar (including Excel and Art.30 export buttons) is always visible on page load.
- Exports now use `get_session_items()` as the DB fallback, so the buttons produce a complete report from the previous session immediately after page reload — no new scan required.

### Fixed — profile loading clobbered by scan start

- `_save_settings()` is called on every M365 scan start with a payload containing only M365 `sources`, `user_ids`, and `options`. It was writing this back via `_profile_from_settings()`, which has no `google_sources` field — permanently stripping Google and file source selections from the active profile after each scan.
- `_save_settings()` now preserves `google_sources` and `file_sources` from the existing profile when the payload does not include them, and rebuilds the combined `sources` array as M365 + google + file.
- `_profile_from_settings()` updated to pass through `google_sources` when present in the payload.

### Fixed — "no results" shown during live scan after hard refresh

- Hard-refreshing the browser mid-scan caused the "Ingen CPR-numre fundet" card to appear immediately, before the SSE watchdog had detected the running scan.
- `loadLastScanSummary()` is no longer called directly on `DOMContentLoaded`. It is now called inside `_sseWatchdog` on the first status poll, only if no scan is currently running (`_initialStatusChecked` flag).

### Fixed — progress bar source pill showing "Email" instead of "Outlook"

- `_PHASE_SOURCE_MAP` entry for Exchange mail phases still had `label: 'Email'`. Updated to `'Outlook'` to match the rename applied elsewhere.

### Changed — profile manager UI simplified

- Removed the redundant **×** close button from the list panel header — the editor panel's **×** already closes the entire modal.
- Removed the **Luk** (Close) button from the list panel footer — the footer now contains only **+ Ny profil**.
- The editor footer **Cancel/Annuller** button replaced with a single **Luk** button that closes the entire modal (consistent with `_pmgmtCloseEditor()` behaviour).

### Changed — log panel collapsible

- A **▾/▸** toggle button added to the left of the log header. Clicking it collapses or expands the log panel (resize handle + log body together, wrapped in `#logSectionBody`).
- State persists in `localStorage` via the existing `toggleSection` / `restoreSectionStates` mechanism (`sc_logSection` key).

### Changed — log header buttons translated

- **All**, **Errors**, and **Copy** buttons in the log header now use `data-i18n` attributes and are fully translated in all three lang files.
- Translation keys added: `btn_errors` (da: Fejl, de: Fehler), `log_copy` (da: Kopier, de: Kopieren).
- Symbol prefix `⎘` removed from the Copy button label.

### Changed — project documentation structure

- User manuals moved from project root to `docs/manuals/` (`MANUAL-DA.md`, `MANUAL-EN.md`).
- Setup guides moved from project root to `docs/setup/` (`M365_SETUP.md`, `GOOGLE_SETUP.md`).
- `routes/app_routes.py` and `build_gdpr.py` updated to reference the new manual paths.
- `README.md` links updated accordingly.

### Fixed — disposition carry-forward across scans

When a previously reviewed file reappears in a new scan it now shows its prior disposition immediately on the result card — no need to open the preview panel first.

- `get_prior_disposition(item_id)` added to `ScanDB` in `gdpr_db.py`. Returns the stored disposition status if it differs from `'unreviewed'`, otherwise `None`.
- `get_flagged_items()` and `get_session_items()` in `gdpr_db.py` now `LEFT JOIN dispositions` and return `COALESCE(d.status, 'unreviewed')` as `disposition` on every row. Exports and the results grid therefore reflect the latest review decision without an extra round-trip.
- `_with_disposition(card, db)` helper added to `scan_engine.py`. Injects the prior disposition into a card dict before it is broadcast as `scan_file_flagged`. Used at all four broadcast points:
  - `scan_engine.py` — file scan (line ~297)
  - `scan_engine.py` — checkpoint resume re-emit loop (line ~357)
  - `scan_engine.py` — M365 scan (line ~456)
  - `routes/google_scan.py` — Google Workspace scan (line ~225)
- The frontend already reads `f.disposition || 'unreviewed'` for filter matching — no JS changes required.

---

## [1.6.7] — 2026-04-06

### Fixed — emoji/symbol removal from all buttons and indicators

**All UI buttons stripped of emoji and symbol prefixes**
- Every interactive element in the topbar, filter bar, modals, and settings panels now uses plain text only. Removed: `▶`, `■`, `💾`, `✕`, `⚙`, `🕐`, `⬇`, `⬆`, `🗑`, `📋`, `☰`, `⊞`.
- Affected buttons: Scan, Stop, Save (profile), Clear (profile), Profiler/Profiles, Kilder/Sources, Indstillinger/Settings, Excel, Art.30, Slet/Delete (bulk), Liste/List, Gitter/Grid, Export (DB), Import (DB), Reset DB, scheduled scan title.
- Labels updated in `templates/index.html` and all three lang files (`da.json`, `en.json`, `de.json`).

**Filter bar — Clear button standardised**
- The `×` clear-filter button was an oversized bare symbol (`font-size: 16px`, no border). Replaced with a proper text button (`Ryd`/`Clear`/`Löschen`) matching the 26 px filter bar standard: bordered, `border-radius: 5px`, turns red on hover.
- Translation key `m365_filter_clear` added to all three lang files.

**Scheduler indicator — "Next:" label translated**
- The hardcoded `'Next: '` prefix in `schedUpdateSidebarIndicator()` is now `t('m365_sched_next', 'Next')`. Key added to all three lang files (da: `Næste`, de: `Nächste`).
- Clock emoji `🕐` removed from the indicator and from `m365_sched_title` in all lang files.

### Fixed — result card badges, progress bar on browser refresh

**Result card badges — standardised to 9 px pill style**
- All result card badges now follow the app-wide badge standard: `font-size: 9px; padding: 1px 5px; border-radius: 10px`.
- `.source-badge` (OneDrive, Exchange, Gmail, etc.) had no CSS definition at all — it now has the correct size, padding, and border-radius.
- `.cpr-badge` reduced from `10px / 2px 6px` to `9px / 1px 5px`.
- `.photo-face-badge`, `.special-cat-badge`, `.overdue-badge`, `.role-pill` reduced from `10px` / `border-radius: 4px` to `9px / 1px 5px / border-radius: 10px`.
- Removed camera emoji (📷) from the Faces badge.
- `.card-source` gains `flex-wrap: wrap` so badges wrap on narrow cards instead of overflowing.

**Progress bar — survives browser refresh**
- Refreshing the browser mid-scan no longer causes the progress bar to appear without coloured segment pills.
- Three code paths now defensively set the correct running flag and call `_renderProgressSegments()` before the track is needed:
  - `scan_start` SSE handler (sets `_m365ScanRunning`).
  - `scan_progress` SSE handler (sets the flag matching the event's `source` field — covers mid-scan reconnects where `scan_start` has scrolled out of the 500-event replay buffer).
  - `scan_phase` SSE handler (infers source from phase text; fires before `scan_progress` in the replay sequence).
  - `_sseWatchdog` (sets `_m365ScanRunning` immediately on detecting a running scan via `/api/scan/status`, which checks the M365 lock).

### Improved — scan responsiveness, UI layout, preview panel

**Scan abort responsiveness**
- Stop now takes effect within one Graph API round-trip across all collection phases. Previously, pressing Stop only checked the abort flag in the *processing* loop — the entire collection phase (email folder enumeration, OneDrive file listing, Teams channel fetching, SharePoint site iteration) ran to completion first, which could take 10+ minutes on large tenants.
- Abort checks added to: email folder loop (inside `_scan_user_email`), OneDrive items loop (delta and full modes in `_scan_user_onedrive`), Teams team loop and channel loop (inside `_scan_user_teams`), SharePoint site loop, and all outer per-user loops.
- Side effect: the scheduler no longer fails with "Manual scan already running" when a job fires shortly after the user pressed Stop — the lock is now released promptly.

**Scheduler — graceful skip on lock contention**
- When a scheduled job fires while a manual (or other scheduled) scan holds the lock, the job now logs `Skipped — a scan is already running` and returns cleanly. Previously it raised `RuntimeError("Manual scan already running")`, which was logged as a hard failure with a full traceback in the UI.

**Filter bar — always visible, full-width, 26 px**
- Filter bar was hidden until the first result arrived. It is now always visible.
- Moved from inside the left column to a direct child of `.main`, above `.content-area`. The preview panel's top edge now aligns with the grid's top edge rather than overlapping the filter bar.
- All filter bar controls standardised to `height: 26 px` (`input`, `select`, `button`) to match the topbar control standard. Redundant inline `padding`/`font-size`/`border-radius` stripped from button inline styles.

**Preview panel**
- Resizable: a 5 px drag handle on the left edge lets the user adjust the panel width. Handle uses pointer capture (`setPointerCapture`) so dragging over the iframe or releasing outside the browser window always terminates the drag cleanly. Width is persisted in `sessionStorage` and restored when the panel is next opened.
- Min width: 280 px; max width: 70% of window width.
- Fixed: clicking the close (×) button had no effect. Root cause: `panel.style.width` set by the resize logic is an inline style and overrides the CSS class `.hidden { width: 0 }`. Fix: `closePreview()` now clears `panel.style.width = ''` before adding `.hidden`; `openPreview()` restores the saved width when showing the panel.
- Email preview iframe: added `* { max-width: 100% }`, `overflow-x: hidden`, `table { table-layout: fixed }`, and `img { height: auto }` to prevent wide HTML emails from creating a horizontal scrollbar inside the 420 px panel.
- Email preview iframe scrollbar: matches the app's 4 px thin scrollbar style.

**Thin scrollbars everywhere**
- `.grid-area` (results grid) and `.log-panel` now use the same 4 px thin scrollbar style (`scrollbar-width: thin; width: 4px`) as `#accountsList` and `#sourcesPanel`. Previously they used the system-default wide scrollbar.

**Scheduler next scan indicator**
- `#schedNextIndicator` was a plain `display: block` div with no height constraint, causing it to sit taller than adjacent topbar controls. Fixed to `height: 26 px; display: inline-flex; align-items: center` with a border and border-radius matching the surrounding pill buttons.

**Log and preview resize — pointer capture fix**
- Both resize handles (`logResizeHandle`, `previewResizeHandle`) switched from `mousedown` + `document.addEventListener('mousemove'/'mouseup')` to `pointerdown` + `setPointerCapture`. The old approach lost the drag when the cursor moved over the iframe (which has its own input context) or left the browser window. Pointer capture routes all pointer events to the handle until `pointerup`/`pointercancel` regardless of cursor position.

**Manuals updated (MANUAL-DA.md, MANUAL-EN.md)**
- Version 1.6.4 → 1.6.6.
- Section 2: activity log description now mentions copy button, error filter, and resize handle.
- Section 4.4: progress bar description updated — source pill labels listed, old "current phase" wording removed.
- Section 8: profiles section updated for loader model, ✕ clear button, and explicit mention that Google/file sources are saved.

---

## [1.6.6] — 2026-04-06

### Improved — UX polish II (clusters, badges, log panel, progress bar)

**Pill clusters**
- KONTI section header: Alle / Ingen / ↻ converted from bare text links to a pill cluster (`height: 22px`), matching the pattern used in the Profile editor.
- Profile list rows (Profiler modal): Brug + Kopier grouped into a pill cluster; Slet kept as a separate standalone danger button.

**Badge sizing**
- Platform badges (M365, GWS, M365+GWS) and role badges (Ansat, Elev, Anden) standardised to `font-size: 9px; padding: 1px 5px; border-radius: 10px` across the main sidebar account list and the profile modal. Previously the sidebar used larger inline styles (`font-size: 10px; padding: 2px 7px`) that made badges visually heavier than in the modal.

**Account rows**
- Main sidebar account row padding reduced from `4px 0` to `2px 0`, matching the compact density of the profile modal account list.
- SKU debug search icon button standardised to `height: 26px` to match the adjacent role filter cluster.

**Log panel — full rebuild**
- Color-coded log levels: `.log-err` (red `var(--danger)`), `.log-ok` (green `var(--success)`), `.log-warn` (orange `#e0922a)`). Level classes were already passed to `log()` but had no CSS — all entries appeared in the same muted colour.
- Live scanning indicator: a single italic `▶ filename` line at the bottom of the log updates in place via `scan_file` SSE events. Never scrolls; clears automatically when the scan finishes. Avoids flooding the log with per-file entries.
- Copy button (`⎘ Copy`) in the log header copies all log text to clipboard; flashes `✓ Copied` for 1.5 s.
- Log level filter (`All` / `Errors`) in log header — hides info lines when Errors mode is active.
- Resizable: drag handle at the top edge of the panel resizes vertically and **snaps to the nearest full line** (row = 18 px: 16 px line-height + 2 px margin; 2–30 lines range).
- Default height set to **8 lines exactly** (`height: 154px` = 8 × 18 + 10 px padding).
- Persistent across page refresh: up to 300 lines saved to `sessionStorage`; restored on `DOMContentLoaded`; cleared on new scan start.
- Smart scroll: auto-scroll only triggers when already within 24 px of the bottom — scrolling up to read earlier entries stops the follow behaviour.

**Progress bar — segmented multi-source**
- Replaced the single `progressFill` bar with a dynamically segmented track (`#progressTrack`). One segment per active scan type (M365 / Google / Files), equal width, separated by a 1 px gap. Segments are added at scan start and removed as each source finishes.
- Color-coded: M365 = blue (`var(--accent)`), Google = dark green (`#3a7d44`), Files = purple-gray (`#7a6a9e`).
- Each segment fills independently — M365 at 80% and Google at 20% are shown simultaneously with no interference. Eliminates the `_maxPct` hack (bar stuck at 100% after first source finishes).
- Backend (`scan_engine.py`, `routes/google_scan.py`): all `scan_progress` SSE events now include `"source": "m365"` / `"google"` / `"file"`. Frontend routes each event to the correct segment by `d.source`.
- Stats (`X / Y`) and ETA only update from M365 events — the only source with meaningful totals and time estimates.

**Progress bar — phase display**
- `#progressWho` replaces the plain-text phase span. Renders a colour-coded source pill (`[Email]`, `[OneDrive]`, `[Gmail]`, `[GDrive]`, `[Local]`, etc.) followed by the user's full display name.
- Source pill uses the universal badge standard: `font-size: 9px; padding: 1px 5px; border-radius: 10px; font-weight: 500`.
- `_setProgressPhase()` identifies the source from the full phase string via `_PHASE_SOURCE_MAP`, then splits on ` — ` to extract the username. Phases without a dash (e.g. `📂 folder: 3 msg(s)`) fall back to the last known user (`_progressCurrentUser`).
- `_resolveDisplayName()` resolves email addresses in Google phase strings to the user's display name via `_allUsers`. Also strips trailing count suffixes (`: 3 file(s)`).
- Pill labels standardised: `Email`, `OneDrive`, `SharePoint`, `Teams`, `Gmail`, `GDrive`, `Local` — matching the source names used elsewhere in the UI.
- All 25 `scan_phase` strings now produce a pill: `📂` emoji maps to `Email`; `Google Workspace — email` phases resolve to display name; file scan startup uses `Files — {label}`; Google per-user phase uses `Google Workspace — {email}`.
- Source map ordering: `Google Workspace` matched before `Gmail` so the GWS startup phase shows `[Gmail]` only when no broader match applies.
- Fixed: email regex was missing the `i` flag (`/E-?mail.../u` → `/E-?mail.../iu`), causing Danish `"Indsamler e-mails"` to fall through to plain text.

**Scheduler — Google and file sources**
- Scheduled scans now run Google Workspace sources. `_build_options` extracts `google_sources` from the profile (with legacy fallback for profiles that stored gmail/gdrive inside `sources`). A separate Google scan block runs after the file scan loop using `_google_scan_lock`.

**Profile dropdown — loader model**
- Removed the selectable "Standard (sidebar)" / "Default (sidebar)" empty option. Profiles are now **loaders**, not persistent modes — selecting one pushes its settings into the sidebar; the sidebar is always the live state.
- Replaced with a `disabled` placeholder `"— Vælg profil —"` shown when no profile has been loaded.
- Added a `✕` clear button (`#profileClearBtn`) that appears next to the dropdown when a profile is active. Clicking it clears `_activeProfileId` and resets the dropdown to the placeholder **without touching the sidebar** — the loaded settings remain.
- `clearActiveProfile()` function added.
- Lang keys: `m365_profile_default` removed, `m365_profile_placeholder` added (da/en/de).

**Bug fixes**
- Profile role filter respected at scan time: `getSelectedUsers()` now filters the returned list by `_activeRoleFilter`, preventing hidden-role users from being silently included in M365 scans and profile saves via the topbar quick-save.
- Profile editor role filter respected at save time: `_pmgmtSaveFullEdit` now excludes IDs whose role doesn't match `_pmgmtRoleActive`. Prevents "select all → filter by staff → save" from silently saving student accounts that were checked but hidden.
- Profile editor role filter state reset on open: `_openEditorForProfile` resets `_pmgmtRoleActive = ''` so a stale filter from a previous session doesn't silently hide accounts when the editor is reopened.
- Google and file sources not saved in profiles: `_pmgmtSaveFullEdit` now checks whether the checkboxes are actually present in `#peSourcesPanel` (DOM query) rather than using `!!window._googleConnected` and `_fileSources.length > 0` as proxies. The async status fetches could complete after the editor opened, leaving the panel without checkboxes while the proxy read `true`, silently discarding the user's selection.
- Profile editor now re-renders `#peSourcesPanel` when `smGoogleRefreshStatus()` resolves or `_loadFileSources()` completes if the editor is open and the panel has no Google/file checkboxes yet.

---

## [1.6.5] — 2026-04-04

### Improved — UX polish pass (topbar, sidebar, clusters)

**Topbar**
- All topbar elements normalised to `height: 26px`: Scan/Stop buttons, profile dropdown, save button, config cluster, stats pill, icon buttons (🔍, ?, 🌙). Previously each had independent padding, making the topbar uneven.
- Config buttons (Profiler, Kilder, Indstillinger) extracted from `#profileBar` into a dedicated `.config-group` pill cluster separated by a `.topbar-sep` divider — visually distinct from the profile selector group.
- Data subject lookup moved from the sidebar footer into the topbar as a 🔍 icon button (left of `?`). Sidebar strip removed.

**Sidebar**
- KILDER, INDSTILLINGER, and KONTI sections are now collapsible. Each header gets a `▾`/`▸` chevron (`section-collapse-btn`). Collapse state persists in `localStorage` per section. KONTI releases its `flex:1` when collapsed.
- Role filter buttons (Alle / Ansat / Elev) converted to a pill cluster (`.role-filter-btn`) matching the topbar cluster pattern. SKU debug button stays separate.
- Date preset buttons (1 år / 2 år / 5 år / 10 år / Alle) converted to a pill cluster.
- All pill cluster buttons, input fields, and date picker set to `height: 26px` — the universal control height across the UI.
- Toggle size reduced from `36×20px` to `32×18px` with knob gap tightened from 3px to 2px. Knob-to-track ratio improved for a sleeker look.
- Role filter buttons display live counts: "Alle (277)", "Ansat (62)", "Elev (254)". Updated by `updateRoleFilterCounts()`, called from `renderAccountList()`.

**Empty state**
- On load, fetches `/api/db/stats`. If a previous scan exists, shows a summary card (hits, unique CPR subjects, items scanned, date, sources) instead of the bare placeholder. The placeholder is shown below as a "start new scan" prompt. Summary hidden when a scan starts.

### Added — Single-instance lock

- **`~/.gdprscanner/app.lock`** — an exclusive process lock is acquired at startup to prevent two instances from running simultaneously against the same database and settings files.
- **Desktop (`build_gdpr.py` launcher)**: lock is checked before Flask starts. If another instance holds the lock the app prints `"GDPRScanner is already running."` to stderr and exits immediately.
- **Server (`gdpr_scanner.py`)**: same guard in interactive web-UI mode (not headless — batch runs may legitimately coexist with a live server).
- Uses `fcntl.flock(LOCK_EX | LOCK_NB)` on macOS/Linux and `msvcrt.locking` on Windows. The OS releases the lock automatically on crash or clean exit — no stale lockfiles.

### Added — Port auto-increment + stdout port signal

- **`gdpr_scanner.py`** (server mode): if the requested port (default 5100, or `--port N`) is already in use, the server auto-increments up to 100 ports and logs a warning: `[!] Port 5100 in use — using 5101 instead`.
- **`build_gdpr.py` launcher** (desktop mode): `find_free_port()` was already present; auto-increment was already the desktop behaviour.
- Both modes emit `GDPR_PORT=<n>` (flush=True) to stdout before Flask starts — a machine-readable signal parseable by any parent process or wrapper script that needs to know the actual bound port.

### Added — Built-in user manual (#31 ✅)

- **`MANUAL-EN.md`** / **`MANUAL-DA.md`** — standalone end-user manuals in English and Danish. 14 sections covering all major features: Getting started, Sources panel, Running a scan, Understanding results, Reviewing results, Bulk actions, Profiles, Scheduler, Export & email, Article 30 report, Data subject lookup, Settings, Retention policy, and FAQ. Written for school administrators and municipal compliance officers — no technical knowledge assumed.
- **`GET /manual`** — new Flask route in `routes/app_routes.py`. Reads `?lang=da|en` (falls back to the current UI language). Finds the appropriate `.md` file relative to the project root, converts it to a fully self-contained styled HTML page, and returns it without any external dependencies.
- **`_md_to_html(md)`** — zero-dependency Markdown-to-HTML converter using only Python's `re` and `html` stdlib modules. Handles headings with anchor IDs, fenced code blocks, tables, ordered/unordered lists, blockquotes, bold, italic, inline code, links, and horizontal rules.
- **`?` button** in the topbar (right of the theme toggle) — opens the manual in a dedicated window (960×800, resizable) using the current `langSelect` value. In the packaged desktop app the window is a native pywebview window (`pywebview.api.open_manual()`); in the browser it opens via `window.open()`. Repeated clicks reuse the same window rather than spawning new ones. Does not interrupt any in-progress scan.
- Manual page: 860 px max-width layout, language switcher (DA ↔ EN), 🖨 print button, `@media print` CSS (toolbar hidden, `h2` page breaks, external link URLs appended for paper printing).

### Fixed — Manual not found in packaged app

- `MANUAL-DA.md` and `MANUAL-EN.md` were missing from the PyInstaller bundle — `build_gdpr.py` now includes all `MANUAL-*.md` files as root-level data files (`--add-data MANUAL-*.md:.`). The route already used `sys._MEIPASS` for the frozen path; the files simply weren't being copied in.
- `build_gdpr.py` `LAUNCHER_CODE` — added `open_manual(lang)` method to the `Api` class. Creates a new pywebview window for the manual URL; reuses the existing window if already open.

### Fixed — Email routing, profile source persistence, SMTP error messages

**`routes/email.py`** — structural rewrite
- Removed `__getattr__` module-level hook. Bare-name lookups inside function bodies do not go through `__getattr__` (Python resolves them via `LOAD_GLOBAL` directly from `__dict__`), so `_load_smtp_config`, `_save_smtp_config`, `_build_excel_bytes`, and `_send_report_email` all raised `NameError` at runtime when the blueprint route won instead of the app-level duplicate.
- `_load_smtp_config`, `_save_smtp_config` now imported directly from `app_config`. `_build_excel_bytes` imported from `routes.export`.
- `_send_report_email(xl_bytes, fname, smtp_cfg, recipients)` was called in three places but never defined anywhere. Now defined as a module-level helper: builds a `MIMEMultipart("mixed")` message with the Excel as a `MIMEBase` attachment and sends via the configured SMTP server.
- `_send_email_graph` moved into the blueprint (was only used by the duplicate app-level routes).

**`gdpr_scanner.py`**
- Removed four duplicate app-level routes that were masking the broken blueprint: `GET /api/smtp/config`, `POST /api/smtp/config`, `POST /api/smtp/test`, `POST /api/send_report`.
- `from routes.email import _send_report_email` added after blueprint imports so `scan_scheduler.py` (`_m._send_report_email`) and the CLI headless path both resolve the function correctly.

**SMTP error messages** (`routes/email.py`)
- All three auth/connection error handlers (smtp_test, send_report, _send_report_email) now classify errors by host type before choosing a message:
  - DNS / connection failure (`nodename nor servname`, `getaddrinfo`, `Connection refused`, timeout) → "Could not connect to SMTP server — check hostname and port."
  - Corporate M365 host (`office365`, `microsoft`) + auth error → M365 admin centre / enable Authenticated SMTP guidance.
  - Personal Microsoft host (`outlook`, `live`, `hotmail`) + auth error → App Password guidance at `account.microsoft.com/security`.
  - Gmail host + auth error → App Password guidance at Google Account Security.
  - Anything else → raw SMTP error, unmodified.
- Previously `530` (generic "authentication required") unconditionally triggered the M365 admin centre message even when the configured host was Gmail or a personal Outlook account.

**`static/app.js`** — profile source persistence
- `_pmgmtSaveFullEdit` was overwriting `google_sources` and `file_sources` with `[]` whenever the editor was opened and those checkboxes weren't rendered (Google not connected / file sources not loaded). Now preserves the profile's existing `google_sources` when `_googleConnected` is false, and `file_sources` when `_fileSources` is empty.
- `_applyProfile` built `_pendingProfileSources` by filtering against `_fileSources` — which is empty at profile-apply time (async load not yet complete), so the pending list was always empty and file source checkboxes defaulted to `checked=true` regardless of the profile. Now stores `profile.file_sources` directly (falling back to non-M365/Google IDs from `profile.sources`).
- Added `_pendingGoogleSources` (mirrors `_pendingProfileSources` for Google). Set in `_applyProfile` from `profile.google_sources`; consumed in `renderSourcesPanel()` the first time Gmail/Drive checkboxes appear (when Google connects after the profile was applied). Previously they defaulted to `checked=true`.

### Fixed — Progress bar and profile sources

**`static/app.js`**
- Progress bar fluctuated and ETA flickered when M365, Google, and file scans ran concurrently. Root cause: all three scan types broadcast `scan_progress` on the same SSE stream and their events interleave. Fixed with two changes: (1) `_maxPct` tracks the highest `pct` seen across all concurrent scans — the bar only ever moves forward; (2) ETA and stats counter are only written when the incoming event actually carries those fields (`d.eta !== undefined`, `d.total` present) — a Google/file event without ETA no longer wipes the ETA set by the M365 event a millisecond earlier.
- `progressPhase` was being overwritten with the current filename by `scan_progress` events, causing it to alternate between phase text ("Google Workspace scan…") and individual filenames. Current filename now correctly updates `progressFile` instead.
- Profile editor (`_openEditorForProfile`) only passed `profile.sources` (M365 IDs) to `_renderEditorSources` — Google and Local/SMB source checkboxes were always unchecked when reopening a saved profile. Now passes the union of `sources`, `google_sources`, and `file_sources`.

### Added — SMB pre-fetch cache (#22 ✅)

- SMB file scans now decouple directory traversal from file reads. A 5-slot sliding-window `ThreadPoolExecutor` keeps up to 5 reads in flight simultaneously, with a 60-second hard timeout per file. A stalled NAS read produces an error card in the UI and the scan continues — the scan thread is never blocked.
- **`file_scanner.py`** — `_smb_collect()` new method walks the SMB tree (directory listing only, no reads), yielding file descriptors plus `_COLLECT_SKIP` / `_COLLECT_ERROR` sentinels for over-size files and listing failures. `_iter_smb()` rewritten: phase 1 collects all candidates; phase 2 resolves sentinels immediately then feeds real files through the executor window. `PREFETCH_WINDOW = 5` and `SMB_READ_TIMEOUT = 60` constants added. Local scanner (`_iter_local`) untouched.

### Added — PDF OCR via multiprocessing (#20 ✅)

- PDF files are now scanned in local/SMB file scans. Previously excluded because Tesseract/Poppler subprocesses could hang indefinitely.
- **`cpr_detector.py`** — new `_worker_scan_pdf()` (module-level, required for `spawn` context) runs `document_scanner.scan_pdf()` in a fresh subprocess and returns results via a `multiprocessing.Queue`. New `_scan_bytes_timeout()` wraps PDF scanning: writes content to a temp file, spawns the worker via `multiprocessing.get_context("spawn")`, joins with a 60-second hard timeout, and terminates the process tree if it exceeds the limit. Non-PDF files delegate straight to `_scan_bytes()`.
- **`scan_engine.py`** — `run_file_scan()` now calls `_scan_bytes_timeout()` instead of `_scan_bytes()` for all files. Stub added to module-level injected globals.
- **`gdpr_scanner.py`** — `_scan_bytes_timeout` imported from `cpr_detector` and injected into `scan_engine`.
- **`file_scanner.py`** — `.pdf` removed from `FILE_SCAN_EXTENSIONS` exclusion; all default extensions now included.

### Fixed — Post-v1.6.4 release bugs (continued)

**`routes/google_scan.py`**
- `_run_google_scan()` crashed with `UnboundLocalError: cannot access local variable 'data'` when `user_emails` was not passed in the request. The fallback `data.get("user_emails", [])` referenced the request-handler local `data` which is not in scope inside the scan function — `data` and `options` are the same object. Removed the redundant fallback.

**`routes/export.py`** — Article 30 report
- `SOURCE_LABELS` was missing `gmail`, `gdrive`, `local`, and `smb` — all four source types rendered as raw keys in every table (inventory, Art. 9, photo, deletion audit log). Now map to "Gmail", "Google Drive", "Local files", "Network / SMB".
- Per-source breakdown table only iterated M365 sources (`email`, `onedrive`, `sharepoint`, `teams`) — Google and local/SMB findings were completely absent from the summary even when present. Loop now covers all eight source types.
- Methodology bullet (`a30_method_4`) only mentioned Microsoft Graph sources. Updated in `en.json`, `da.json`, `de.json`, and the hardcoded fallback to also mention Google Workspace (service account + domain-wide delegation) and local/SMB file shares.

**`scheduler.py`**
- Removed stale file. `scan_scheduler.py` fully supersedes it; `routes/scheduler.py` and `gdpr_scanner.py` both import from `scan_scheduler`. The old file had diverged significantly (missing UUID migration, connector auto-reconnect, file source resolution, debug SSE events).

**`templates/index.html`**
- Removed 9 unused CSS classes: `.sidebar-sub`, `.btn-secondary`, `.log-ok`, `.log-err`, `.log-warn`, `.user-bar`, `.sign-out-btn`, `.source-badge`, `.srcmgmt-coming-soon`.

### Added — Personal Google account OAuth (#30 ✅)

- Personal Google accounts can now be scanned without a service account or Workspace admin. A device-code OAuth flow (mirrors M365 delegated mode) lets a user sign in interactively with their own Google account.
- **`google_connector.py`** — new `PersonalGoogleConnector` class: `get_device_code_flow()` / `complete_device_code_flow()` static methods hit Google's device-auth endpoint; `_refresh_if_needed()` handles transparent token refresh via `google.oauth2.credentials.Credentials`; `list_users()` returns a single-item list (the signed-in user) so the scan engine needs no changes. `iter_gmail_messages()` / `iter_drive_files()` share the same iteration logic as `GoogleConnector` via extracted `_gmail_iter()` / `_drive_iter()` module-level helpers.
- Token persisted to `~/.gdprscanner/google_token.json` (chmod 600). New helpers: `save_personal_token`, `load_personal_token`, `delete_personal_token`.
- **`routes/google_auth.py`** — four new endpoints: `GET /api/google/personal/status`, `POST /api/google/personal/start`, `POST /api/google/personal/poll`, `POST /api/google/personal/signout`. Background thread blocks on `complete_device_code_flow`; frontend polls — identical pattern to M365 delegated auth.
- **`routes/state.py`** — `google_pending_flow` and `google_poll_result` added.
- **`templates/index.html`** — auth-mode toggle (Workspace / Personal account) in the Google pane; personal section with client ID/secret fields and inline device-code box (reuses `.device-code-box` CSS); workspace setup guide hidden in personal mode.
- **`static/app.js`** — `smGoogleSetMode()` switches visible sections; `smGoogleRefreshStatus()` now checks both `/api/google/auth/status` and `/api/google/personal/status` in parallel; `smGooglePersonalStart()`, `smGooglePersonalPoll()`, `smGooglePersonalSignOut()` added.
- **`lang/en.json`, `da.json`, `de.json`** — 14 new keys each.

### Fixed — Post-v1.6.4 release bugs

**`checkpoint.py`**
- Scheduled scans crashed with `string indices must be integers, not 'str'` when `user_ids` in the profile contained plain ID strings rather than dicts. `_checkpoint_key()` now handles both formats: `u["id"] if isinstance(u, dict) else u`.

**`scan_engine.py`**
- Same root cause as above: `run_scan()` now normalises `user_ids` entries to dicts at the top of the function before any access, so both plain strings and `{id, displayName, userRole}` objects work correctly.

**`scan_scheduler.py`**
- `file_sources` in profiles are stored as source ID strings by the JS frontend. The scheduler now resolves each ID to its full source dict via `_load_file_sources()` before calling `run_file_scan()`. Plain path strings are also handled as a fallback.
- Full traceback is now included in the `scheduler_error` SSE event so failures are diagnosable from the UI status panel without needing the CLI.

**`routes/app_routes.py`**
- `/api/langs` (language selector endpoint) only globbed `*.lang` files — after the v1.6.3 JSON migration the language dropdown was silently empty. Now globs both `*.json` and `*.lang` with deduplication, matching the existing logic in `gdpr_scanner.py`.

**`static/app.js`**
- Profile editor (`_pmgmtSaveFullEdit`) did not update `file_sources` or `google_sources` when the user changed source checkboxes — both fields were carried forward unchanged via `...profile`. Now splits `#peSourcesPanel` checkboxes by `data-source-type` and writes `file_sources`, `google_sources`, and `sources` explicitly on every save.

**`gdpr_scanner.py`**
- `/api/langs` only globbed `*.lang` files — after migrating to JSON, the language selector showed nothing. Now globs both `*.json` and `*.lang`, deduplicates by language code, and sorts alphabetically.
- `SOURCE_LABELS` was missing `gmail`, `gdrive`, `local`, and `smb` entries — these sources now get correct tab names in Excel export and correct labels in the Article 30 report.
- Excel export filename changed from `m365_scan_*.xlsx` to `gdpr_scan_*.xlsx`.
- Article 30 methodology paragraph now mentions Google Workspace scanning via service account with domain-wide delegation. DA and DE lang files updated to match.

**`routes/google_scan.py`**
- Gmail and Google Drive result cards showed the email address as account name instead of the user's display name. Fixed: `_user_display_map` is now built from `list_users()` and applied to each scanned item.
- Role badge (Elev/Ansat/Anden) was missing on Google results when `user_emails` came from the request rather than `list_users()`. Fixed: role map is now populated in both cases.
- Google scan now emits `google_scan_done` instead of `scan_done` so the progress bar stays open until both M365 and Google scans finish.

**`scan_engine.py`**
- File scan now emits `file_scan_done` instead of `scan_done` so the progress bar stays open until all active scan types finish.
- `pct` in both Google and file scan progress events was hardcoded at 50 — now increments from 10 to a max of 90.

**`static/app.js`**
- Progress bar now tracks three independent flags (`_m365ScanRunning`, `_googleScanRunning`, `_fileScanRunning`) and only hides when all active scans have completed.
- `google_scan_done` and `file_scan_done` SSE event handlers added.
- Source filter dropdown (search results) and bulk delete source dropdown were missing Gmail, Google Drive, Lokal, and Netværk (SMB) options.
- Profile preset buttons (1 år / 2 år / etc.) were never highlighted when applying a profile — matching used `years × 365.25` but profiles store `years × 365`. Fixed.
- `_fileScanRunning` flag set correctly at scan start from `fileSources.length`.

**`routes/state.py` / `routes/google_scan.py`**
- M365 and Google scans shared `_scan_lock` — Google now uses `_google_scan_lock` and `_google_scan_abort` so both platforms scan in parallel.

**`templates/index.html`**
- Sources, Settings and Schedule indicator moved from sidebar section header / footer into the topbar, to the right of the Profiles button.
- Source filter dropdown and bulk delete dropdown updated with Google and file source options.

**`README.md`**
- All emoji removed (role badges, action icons, status indicators). Plain text equivalents used throughout.
- `lang/da.json` and `lang/de.json` updated with Google Workspace methodology text for the Article 30 report.

---

## [1.6.4] — 2026-04-03

### Added — Full profile editor (#15e ✅)

- Two-panel modal (profile list left, full editor right). Click a profile row to edit it; the active row is highlighted.
- **+ Ny profil** button in the left panel footer — creates a blank profile and opens the editor immediately, works when no profiles exist.
- Editor sections match the sidebar exactly:
  - **Navn** — name + description fields
  - **Kilder** — same rendering as the main KILDER panel, including M365, Google Workspace, and file/SMB sources
  - **Konti** — role filter (Alle / Ansat / Elev), text search, Alle / Ingen select buttons, + Tilføj konto manual entry, platform badges (M365 / GWS / M365+GWS), role badges
  - **Indstillinger** — date picker with year presets (1/2/5/10/Alle), Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse (MB), Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder — all as toggle sliders
  - **Opbevaringspolitik** — always visible; Opbevaringsår + Regnskabsår slut dropdown
- Annuller, ×, and Gem all close the full modal. Auto-opens first profile on modal open.
- Profile editor defaults match the main window: accounts are unchecked by default; only explicitly saved `user_ids` are shown as checked.

### Fixed — Parallel M365 + Google scanning

- M365 and Google scans shared `_scan_lock` — starting both simultaneously caused "Google scan already running" immediately after scan start. Fixed: `routes/state.py` now defines `_google_scan_lock` and `_google_scan_abort` as separate threading primitives; `routes/google_scan.py` uses these instead of the M365 lock. Both platforms now scan in parallel.

### Fixed — User selection defaults

- All users now default to `selected: false` on page load (previously `true`). The profile editor follows the same rule.
- "Vælg alle" button renamed to "Alle" to match the main sidebar.

---

## [1.6.3] — 2026-04-03

### Fixed — Post-v1.6.3 release bugs

**`static/app.js`**
- Source toggle state (Email, OneDrive, SharePoint, Teams, Gmail, Google Drev) not persisted across restarts. Fixed: all toggles now save to `~/.gdprscanner/src_toggles.json` via a new `/api/src_toggles` endpoint and are restored on page load.
- Deselecting M365 sources in Source Management did not update account badges — `M365 + GWS` still shown. Fixed: badge now uses `hasM365Src` and `effectiveGws` computed inside `renderAccountList()`, and M365 source toggles now call `renderAccountList()` on change.
- Google-only scans reported wrong account count in live log (e.g. "26 konto(er)" when 1 was selected). Root cause: `getSelectedUsers()` returned all selected users including Google-only accounts. Fixed: `getSelectedUsers()` now returns only M365 users; Google users are counted separately for the log message. The "select at least one account" guard no longer blocks Google-only scans.
- Cross-platform identity matching used email prefix (`anne.hansen` before `@`) — changed to `displayName` matching since both M365 and GWS are maintained from the same AD source.
- `_onGoogleSourceToggle()` and M365 source toggles did not call `renderAccountList()` — account badges not updated when toggling sources in Source Management.

**`routes/google_auth.py`**
- Removed `/api/google/auth/sources` endpoint and `src_gmail`/`src_drive` keys from the status response — replaced by unified `/api/src_toggles` endpoint in `gdpr_scanner.py`.

**`app_config.py` / `gdpr_db.py` / `checkpoint.py` / `google_connector.py` / `m365_connector.py` / `scan_scheduler.py` / `scheduler.py` / `gdpr_scanner.py`**
- All data files moved from `~/` root into `~/.gdprscanner/` subdirectory with cleaner short names (`scanner.db`, `config.json`, `token.json`, etc.). A migration shim runs on first startup and moves existing `~/.gdpr_scanner_*` files automatically. `MAINTAINER.md` updated with new file locations.

**`scan_scheduler.py`**
- Scheduled scans ignored `file_sources` from the profile — `_build_options()` dropped them. Fixed: `file_sources` now included in opts, and `run_file_scan()` is called for each file source in the profile during a scheduled run (#15f ✅).

**`static/app.js` — profile save**
- `file_sources` in profile was hardcoded to `[]` — now saves the actual checked file sources from `buildScanPayload()` (#15f).

### Fixed — Post-release (continued)

**`routes/state.py` / `routes/google_scan.py`**
- M365 and Google scans shared `_scan_lock` — starting both simultaneously caused "Google scan already running" immediately. Fixed: Google scan now uses its own `_google_scan_lock` and `_google_scan_abort` so both platforms can run in parallel.

**`static/app.js`** — profile editor (#15e ✅)
- Profile editor drawer implemented: two-panel modal (profile list left, full editor right). Click any profile to open its editor.
- Editor sections: Navn + beskrivelse, Kilder (same rendering as main KILDER panel, including Google and file sources), Konti (with Alle / Ansat / Elev role filter, text search, Alle / Ingen select buttons, + Tilføj konto manual add), Indstillinger (full mirror of sidebar — date picker with year presets, Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse, Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder, all as toggle sliders), Opbevaringspolitik (always visible — Opbevaringsår + Regnskabsår slut).
- + Ny profil button in left panel footer — creates a blank profile and opens the editor immediately, works even when no profiles exist.
- Annuller, ×, and Gem all close the full modal (not just the editor panel).
- Auto-opens first profile's editor when modal opens.

**`static/app.js`** — defaults
- All users now default to `selected: false` on load (were `true`). Profile editor follows the same rule — only explicitly saved user_ids are shown as checked.
- "Vælg alle" button renamed to "Alle" to match the main sidebar.

**`routes/state.py`**
- Added `_google_scan_lock` and `_google_scan_abort` as separate threading primitives for Google scans.

---

### Added — Google Workspace full integration

**Accounts panel**
- Google Workspace users now appear in the Accounts panel alongside M365 users. Each row shows a platform badge: `M365` (blue) or `GWS` (green).
- Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all; check none → empty.
- Role filter (All / Ansat / Elev) works across both platforms.
- `_mergeGoogleUsers()` — dedicated async function fetches `/api/google/scan/users` and merges results into `_allUsers` independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 `loadUsers()`.

**Scanning**
- Selected Google user emails are now passed as `user_emails` to `/api/google/scan/start` — only selected accounts are scanned, not all users in the domain.
- `routes/google_scan.py` — `_scan_lock` and `_scan_abort` now imported directly from `routes.state` (previously relied on `__getattr__`, which does not resolve bare names inside function bodies — caused `NameError` on scan start).
- `user_emails` now read from the top-level request body in addition to the nested `options` dict.
- Gmail scan result cards now correctly labelled "Gmail" (source_type was `email` → mapped to "Exchange"). Fixed in `google_connector.py`.
- Gmail and Google Drive cards now show styled source badges (`badge-gmail` red tint, `badge-gdrive` blue tint). Previously fell back to unstyled.

**Profiles**
- Google sources (`gmail`, `gdrive`) and selected Google user emails are now saved to scan profiles and correctly restored on load.
- Fixed `googleSources` `const` temporal dead zone — declaration moved before use in `buildScanPayload()`.

### Added — OU-based role classification for Google Workspace (#23 Phase 1 ✅)

- **`classification/google_ou_roles.json`** — maps Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: `/Elever` → student, `/Personale` → staff.
- **`google_connector.py`** — `list_users()` fetches `orgUnitPath` via `projection=full` and classifies each user via `classify_ou_role()`.
- **`routes/google_scan.py`** — role map built from `list_users()` result; each scan card now carries the correct `user_role`.

### Added — Documentation split

- **`M365_SETUP.md`** — step-by-step Microsoft 365 setup (app registration, permissions, auth modes, headless config, troubleshooting).
- **`GOOGLE_SETUP.md`** — step-by-step Google Workspace setup (service account, domain-wide delegation, scopes, OU role mapping, troubleshooting).
- **`README.md`** — trimmed from 774 to 611 lines; setup detail moved to the two new files.

### Changed — i18n migrated from `.lang` to JSON (#27 ✅)

- `lang/en.json`, `da.json`, `de.json` — 709 keys each, standard flat JSON.
- `app_config.py` — loader now prefers `.json`, falls back to `.lang` for backward compatibility.
- Old `.lang` files retained as fallback; can be deleted once JSON files are confirmed working.

### Changed — `skus/` renamed to `classification/` (#29 ✅)

- `skus/education.json` → `classification/m365_skus.json`
- `skus/google_ou_roles.json` → `classification/google_ou_roles.json`
- All path references updated in `m365_connector.py`, `google_connector.py`, `routes/users.py`, `gdpr_scanner.py`, `build_gdpr.py`, all lang files, and `static/app.js`.

### Changed — UI polish (icons removed, badges added)

- Role filter buttons (Staff / Student), scan option labels (Delta scan, Scan photos, Retention policy), and account list role badges — all emoji removed, plain text only.
- Role badge on account rows changed from emoji icon button to plain outline pill (`Ansat` / `Elev` / `Anden`).
- Scan result cards — role icon prefix replaced with small inline badge.
- All six lang files cleaned of emoji in role, mode, option, and Art.30 inventory keys.
- Progress bar fixed at 32px height — emoji in filenames no longer push the bar taller.
- Scrollbars in Sources and Accounts panels thinned to 4px.

### Fixed — Account list / source interaction

- Deselecting all sources now empties the account list.
- Deselecting M365 sources no longer disables Accounts when Google sources are still checked.
- `_updateAccountsVisibility()` now checks all source types, not just M365.

### Fixed — Role override cycling

- Role override never cleared for users loaded with a pre-existing override (`roleOverride: true` from a previous session) because `_autoRole` was never populated from the server. Fixed: replaced `_autoRole` comparison with a step counter — after 3 clicks the override clears regardless of the original auto role.
- Role badge changed from `<span>` to `<button type="button">` inside label rows — prevents label click-forwarding to the checkbox (which caused the first user to receive the override instead of the clicked user).

---

## [1.6.2] — 2026-03-28

### Added — Google Workspace account list and source integration

- **`static/app.js`** — Google Workspace users (292 users in testing) now appear in the Accounts panel with `GWS` badge (blue = M365, green = GWS). M365 users carry `M365` badge.
- Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all accounts; check none → empty list.
- Role filter buttons (All / Ansat / Elev) work across both platforms.
- `_mergeGoogleUsers()` — dedicated function fetches `/api/google/scan/users` and merges results into `_allUsers` independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 `loadUsers()`.
- `startScan()` — selected Google user emails now passed as `user_emails` to `/api/google/scan/start`, so only the chosen accounts are scanned (previously ignored selection and scanned all users).
- **`routes/google_scan.py`** — `_scan_lock` and `_scan_abort` now imported directly from `routes.state` (previously relied on `__getattr__` which doesn't resolve bare names inside function bodies — caused `NameError` on scan start).
- `user_emails` now read from the top-level request body in addition to the nested `options` dict.

### Added — OU-based role classification for Google Workspace (#23 Phase 1)

- **`classification/google_ou_roles.json`** — new file mapping Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: `/Elever` → student, `/Personale` → staff.
- **`google_connector.py`** — `list_users()` now fetches `orgUnitPath` via `projection=full` and classifies each user via `classify_ou_role()`. Each user dict now includes `userRole` and `orgUnitPath`.

### Added — Documentation split

- **`M365_SETUP.md`** — step-by-step Microsoft 365 setup guide (app registration, permissions, auth modes, headless config, role classification, troubleshooting).
- **`GOOGLE_SETUP.md`** — step-by-step Google Workspace setup guide (service account, domain-wide delegation, OAuth scopes, OU role mapping, troubleshooting).
- **`README.md`** — trimmed from 774 to 611 lines. Auth/permissions/headless detail moved to setup guides. Two new "Microsoft 365" and "Google Workspace" sections link to the respective files.

### Changed — UI polish (icons removed)

- Role filter buttons (Staff / Student) — emoji removed, plain text only.
- Scan option labels (Delta scan, Scan photos for faces, Retention policy) — emoji removed.
- Account list role badge — replaced clickable emoji button (`👔`/`🎓`/`👤`) with plain outline pill badge (`Ansat` / `Elev`), matching the platform badge style.
- Scan result cards — role icon prefix removed from account name; replaced with small inline outline badge.
- All three lang files (`en.lang`, `da.lang`, `de.lang`) cleaned of emoji in `m365_role_staff`, `m365_role_student`, `m365_opt_delta`, `m365_opt_scan_photos`, `m365_opt_retention`, `m365_mode_delegated`, `m365_bulk_overdue_btn`, `a30_inv_staff`, `a30_inv_students`.

### Fixed — Profile save/load with Google sources

- Google sources (`gmail`, `gdrive`) and selected Google user emails now saved in scan profiles and correctly restored on load.
- `googleSources` `const` declaration moved before use in `buildScanPayload()` — fixed temporal dead zone `ReferenceError`.

### Fixed — Account list / source interaction

- Deselecting all sources now empties the account list (previously kept showing all users).
- Selecting only Google sources no longer disables the Accounts section (previously greyed out when no M365 sources were checked).
- `_updateAccountsVisibility()` now checks all source types, not just M365.

### Added — Google Workspace role classification via OU mapping (#23 Phase 1)

- **`classification/google_ou_roles.json`** — new file mapping Google Workspace Organizational
  Unit paths to roles (`student` / `staff`). Edit to match your school's OU structure;
  no code change required. Default prefixes: `/Elever` → student, `/Personale` → staff.
- **`google_connector.py`** — `list_users()` now requests `orgUnitPath` from the Admin
  Directory API and classifies each user via `classify_ou_role()`. Each user dict now
  includes `userRole` and `orgUnitPath`.
- **`routes/google_scan.py`** — role map built from `list_users()` result; scan cards
  now carry the correct `user_role` instead of always `"other"`.

### Fixed — Post-split and app runtime bugs (additional)

**`routes/database.py`**
- Settings panel showed "Scanned: 0, Flagged: 0, Scans: 0" because `get_stats()`
  returns `{}` when no scan has a `finished_at` timestamp (interrupted or first-run).
  Fixed: stats endpoint now queries `flagged_items` and `scans` tables directly so
  counts are always correct regardless of scan completion state. Stats populate on
  app start from existing DB data — no re-scan required.
- DB export produced a ZIP but nothing was downloaded in the native app because
  `URL.createObjectURL()` does not work in pywebview. Fixed: `exportDB()`,
  `exportExcel()`, and `exportArticle30()` in `static/app.js` now detect pywebview
  and call `window.pywebview.api.save_db_export()` / `save_excel()` / `save_article30()`
  which use the native macOS/Windows save dialog. Browser fallback preserved.
- Added `save_db_export()` and `save_article30()` methods to the pywebview `Api`
  class in `build_gdpr.py`. Fixed `save_excel` filename from `m365_scan_` to `gdpr_scan_`.

**`scan_engine.py`**
- `run_file_scan()` called `_db.start_scan()` which does not exist — the correct
  method is `begin_scan()`. Silent exception meant `_db_scan_id` was always `None`
  and no file scan results were ever written to the database. Fixed.

### Added — Personal use disposition value (#28)

Staff members using work equipment for private purposes will now appear in scan
results. Added `personal-use` as a disposition value so reviewers can explicitly
mark items as outside the organisation's compliance scope.

- New disposition: **Personal use — out of scope** in both UI dropdowns
- Art. 30 report labels it "Personal use — out of GDPR scope (Art. 2(2)(c))"
- Translated in EN / DA / DE

**Legal basis:** GDPR Article 2(2)(c) — processing by a natural person in the
course of a purely personal activity is outside GDPR scope.

### Added — pytest test suite (#26)

112 tests across 4 modules — all passing.

| Test module | Tests | What it covers |
|---|---|---|
| `tests/test_document_scanner.py` | 36 | `is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary edge cases |
| `tests/test_app_config.py` | 34 | i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption |
| `tests/test_checkpoint.py` | 18 | `_checkpoint_key` stability, save/load/clear, wrong-key isolation, delta token round-trip |
| `tests/test_db.py` | 24 | Scan lifecycle, `save_item`, CPR hash-only storage, `lookup_data_subject`, dispositions, export/import cycle |

**Support files:**
- `tests/conftest.py` — shared fixtures: `docx_with_cpr`, `docx_no_cpr`, `xlsx_with_cpr`, `xlsx_no_cpr`, `txt_with_art9`, `binary_garbage`, `tmp_db`
- `pytest.ini` — test discovery config

**Run with:** `pytest tests/` from the project root.

### Fixed — Six post-split runtime bugs

All bugs introduced by the #25 module split — the pre-split code had none of these.

**`gdpr_scanner.py`**
- `_current_scan_id` imported as a string binding (`from sse import _current_scan_id`), so `scan_stream()` always saw `""` — SSE replay filter excluded all events and the progress bar showed nothing. Fixed: reads `sse._current_scan_id` at call time via module reference.
- `_connector` assignment only updated the local module global, not `_state.connector`. `scan_engine.py` reads `_state.connector`, which stayed `None` after sign-in — every scan reported "Not connected to M365". Fixed: all five `_connector = ...` assignments now dual-assign `_connector = _state.connector = ...`.

**`scan_engine.py`**
- `_load_role_overrides`, `_resolve_display_name`, `_scan_text_direct` were undefined bare names inside `run_scan()` — raised `NameError` at runtime. Fixed: proper imports from `app_config` and `cpr_detector`.
- `PHOTO_EXTS` and `SUPPORTED_EXTS` were stub empty sets at import time; injection via `_se.PHOTO_EXTS = ...` replaced the module attribute but function bodies still saw the empty stubs. Fixed: `scan_engine.py` now imports these directly from `cpr_detector` at module level.
- `scan_progress` SSE event broadcasts `index` and `pct`; the UI handler read `d.completed` — progress bar was always 0%. Fixed in `static/app.js`: handler now reads `d.pct` (pre-calculated server-side) and populates `progressStats` (n / total) and `progressEta` elements that were wired in HTML but never written.
- Source collection (OneDrive, SharePoint, Teams) completed silently with no count in the live log. Fixed: broadcasts `📁 OneDrive — user: N file(s)`, `🌐 SharePoint: N file(s)`, `💬 Teams — user: N file(s)` after each successful collection.

**`cpr_detector.py`**
- `_scan_text_direct()` called `ds.scan_text()` which internally calls `extract_cpr_and_dates()` — a function that does not exist in `document_scanner.py` (pre-existing bug in that module). Result: every email body scan returned zero CPRs. Same bug affected `.txt` files and the unknown-extension fallback in `_scan_bytes()`. Fixed: all three replaced with `ds.extract_matches(text, 1, "text")` which works correctly.

**`static/app.js`**
- `scan_file_flagged` handler called `renderCards()` which is not defined anywhere — silent `ReferenceError` in the browser, cards pushed to `flaggedData` but never displayed. Fixed: replaced with `applyFilters()` which calls `renderGrid()` and shows the filter bar.
- `scan_done` handler never showed the filter bar (containing Excel and Art.30 export buttons) when results existed — only the stats numbers updated. Fixed: `scan_done` now explicitly shows the filter bar and calls `applyFilters()` when `flaggedData.length > 0`.

---

## [1.6.1] — 2026-03-28

### Changed — Split `gdpr_scanner.py` into focused modules (#25)

`gdpr_scanner.py` was 5554 lines. It is now 3591 lines and delegates to five
focused modules. No behaviour changes — all existing routes, blueprints, and
imports continue to work unchanged.

**New files:**

| Module | Lines | Contents |
|---|---|---|
| `sse.py` | 52 | `broadcast()`, `_sse_queues`, `_sse_buffer`, `_current_scan_id` |
| `checkpoint.py` | 79 | `_save_checkpoint()`, `_load_checkpoint()`, `_checkpoint_key()`, delta token load/save |
| `app_config.py` | 553 | i18n, Article 9 keywords, config, admin PIN, profiles, settings, SMTP, file sources, Fernet encryption |
| `cpr_detector.py` | 381 | `_scan_bytes()`, `_extract_exif()`, `_detect_photo_faces()`, `_make_thumb()`, `_get_pii_counts()` |
| `scan_engine.py` | 1006 | `run_scan()`, `run_file_scan()` — M365 and file-system scan orchestration |

**Changed files:**

- `gdpr_scanner.py` — imports and re-exports from all five modules; keeps Flask
  app init, `@app.route` definitions, blueprint registration, and `__main__` entry point
- `routes/state.py` — `_scan_lock` and `_scan_abort` moved here from `gdpr_scanner.py`
  so `scan_engine.py` can reference them without a circular import

**Isolation:** each new module is importable in isolation with fallback stubs,
enabling unit tests (#26) to import `cpr_detector` or `checkpoint` without
pulling in Flask, MSAL, or the full application.

---

## [1.6.0] — 2026-03-28

### Changed — Rename: M365 Scanner → GDPRScanner (#24)

The tool now scans M365, Google Workspace, local file systems, and SMB network
shares. The name "M365 Scanner" was misleading. This release renames everything
with no behaviour changes.

**Files renamed:**

| Old | New |
|---|---|
| `m365_scanner.py` | `gdpr_scanner.py` |
| `m365_db.py` | `gdpr_db.py` |
| `build_m365.py` | `build_gdpr.py` |
| `build_m365.sh` | `build_gdpr.sh` |
| `start_m365.sh` *(created by install_macos.sh)* | `start_gdpr.sh` |
| `start_m365.bat` *(created by install_windows.ps1)* | `start_gdpr.bat` |

**Config files renamed on first startup (migration shim):**

| Old `~/` path | New `~/` path |
|---|---|
| `.m365_scanner_config.json` | `.gdpr_scanner_config.json` |
| `.m365_scanner.db` | `.gdpr_scanner.db` |
| `.m365_scanner_token.json` | `.gdpr_scanner_token.json` |
| `.m365_scanner_delta.json` | `.gdpr_scanner_delta.json` |
| `.m365_scanner_settings.json` | `.gdpr_scanner_settings.json` |
| `.m365_scanner_smtp.json` | `.gdpr_scanner_smtp.json` |
| `.m365_scanner_role_overrides.json` | `.gdpr_scanner_role_overrides.json` |
| `.m365_scanner_file_sources.json` | `.gdpr_scanner_file_sources.json` |
| `.m365_scanner_machine_id` | `.gdpr_scanner_machine_id` |
| `.m365_scanner_checkpoint.json` | `.gdpr_scanner_checkpoint.json` |
| `.m365_scanner_schedule.json` | `.gdpr_scanner_schedule.json` |
| `.m365_scanner_msal_cache.bin` | `.gdpr_scanner_msal_cache.bin` |
| `.m365_scanner_lang` | `.gdpr_scanner_lang` |

The migration runs silently at startup — existing scan history, credentials,
settings, and role overrides are preserved automatically.

**Intentionally unchanged:**
- `m365_connector.py` — kept as-is; it is the Microsoft Graph connector and
  the `m365_` prefix accurately describes what it connects to
- i18n keys with the `m365_` prefix that describe M365-specific UI elements
  (Azure credential fields, device code flow screens) — the prefix is correct

**Run with:**
```
python gdpr_scanner.py [--port 5100]
```

---

## [1.5.9] — 2026-03-28

### Added — Google Workspace scanning (#10)

Organisations running mixed Microsoft/Google environments can now scan Gmail
and Google Drive alongside M365 in a single tool. The Google Workspace tab in
Source Management is now fully active (was "Coming soon" stub).

**New files:**
- `google_connector.py` — service account OAuth with domain-wide delegation;
  Gmail message + attachment iterator; Drive file iterator with automatic export
  of native Docs/Sheets/Slides → DOCX/XLSX/PPTX before scanning
- `routes/google_auth.py` — `/api/google/auth/status`, `/connect`, `/disconnect`
- `routes/google_scan.py` — `/api/google/scan/start`, `/cancel`, `/users`

**Changed files:**
- `routes/state.py` — `google_connector` slot added
- `m365_scanner.py` — Google blueprints registered; `GOOGLE_CONNECTOR_OK` /
  `GOOGLE_AUTH_OK` flags; connector auto-restored from saved key on startup
- `templates/index.html` — Google tab activated; full credentials pane with key
  file upload, admin email field, Gmail + Drive source toggles, and setup guide
- `static/app.js` — `smGoogleRefreshStatus()`, `smGoogleConnect()`,
  `smGoogleDisconnect()`, `getGoogleScanOptions()`, key file reader
- `requirements.txt`, `install_windows.ps1`, `install_macos.sh` — three new
  optional Google API dependencies
- `lang/en.lang`, `da.lang`, `de.lang` — 14 new i18n keys each

**Dependencies (optional — scanner starts without them):**
```
pip install google-auth google-auth-httplib2 google-api-python-client
```

**Setup required in Google Workspace Admin Console:**
1. Create a Google Cloud project; enable Gmail API, Drive API, Admin SDK
2. Create a service account; download the JSON key; enable domain-wide delegation
3. In Workspace Admin → Security → API Controls → Domain-wide delegation add the
   service account client ID with scopes:
   `gmail.readonly`, `drive.readonly`, `admin.directory.user.readonly`

**Scan results** write to the same SQLite database with `source_type = "gmail"`
or `"gdrive"` — Article 30 reports and data subject lookups cover both platforms
automatically.

---

## [1.5.8] — 2026-03-28

### Fixed — Scheduled scans invisible in the browser (#21)

Scheduled scans now show full live progress in the browser — progress bar,
phase text, flagged cards, and log entries — exactly like manual scans.

**Root cause (critical):** When run as `python m365_scanner.py`, the module
loads as `__main__`. The scheduler's `import m365_scanner as _m` loaded a
**second copy** of the module with its own empty `_sse_queues`. Events from
`_m.broadcast()` went nowhere — the browser's SSE connection was reading from
`__main__`'s queues.

**Fix:** `sys.modules["m365_scanner"] = sys.modules[__name__]` at the top of
the module ensures all imports share the single running instance.

### Fixed — SSE event replay for late-connecting browsers (#21)

Opening the browser mid-scan (manual or scheduled) now replays all buffered
progress events so the live log and card grid are fully populated.

**Additional root causes and fixes:**

- `_autoConnectSSEIfRunning()` only attached `scheduler_*` listeners on page
  load — replayed `scan_phase`, `scan_file_flagged`, and `scan_done` events
  were silently ignored
- Idle SSE connections died silently (Flask/Werkzeug threading); the browser
  had no live connection when a scheduled scan fired minutes/hours later

**Changes — Python (`m365_scanner.py`):**
- Module identity fix: `sys.modules["m365_scanner"] = sys.modules[__name__]`
- Added `_current_scan_id` global — unique timestamp-based ID set at the start
  of every scan (M365 and file scans) and cleared after `scan_done`
- `broadcast()` injects `scan_id` into every SSE event payload
- `scan_stream()` filters the replay buffer to only include events matching the
  current `scan_id`, preventing stale replay from previous scans
- New `sse_replay` / `sse_replay_done` marker events bracket the replayed block
  so the browser can distinguish replay from live events
- New `GET /api/scan/status` lightweight endpoint returning `{running, scan_id}`

**Changes — JavaScript (`static/app.js`):**
- Extracted `_attachScanListeners(es)` and `_attachSchedulerListeners(es)` —
  shared by both `startScan()` and `_autoConnectSSEIfRunning()`
- `_attachSchedulerListeners` now shows the progress bar on `scheduler_started`
  and hides it on `scheduler_done` / `scheduler_error`
- SSE polling watchdog (`_sseWatchdog`) checks `/api/scan/status` every 4s;
  reopens the SSE connection via `_ensureSSE()` if it has died
- `_userStartedScan` flag — `scan_done` only closes the SSE connection for
  user-initiated scans; scheduled scans keep it alive
- Fixed `es.onerror` handler — no longer silently nulls `es`

### Fixed — File scan `scan_complete` → `scan_done` event name

`run_file_scan()` was broadcasting `scan_complete` on finish, but the JS only
listens for `scan_done`. Renamed to `scan_done` with the same `total_scanned` /
`flagged_count` payload shape as M365 scans.

### Fixed — Resume scan used wrong profile

`startScan()` never told the server which profile was active. Settings were
always saved to the Default profile. Now `profile_id` is sent in the scan start
payload and `_save_settings()` accepts a `profile_id` parameter (takes
precedence over `profile_name`).

### Fixed — `install_macos.sh` launcher scripts

- `start_gdpr.sh` and `build_m365.sh` templates now use `exec python3` instead
  of `exec python` — fixes "not found" after removing python.org interpreter
- spaCy model install: creates a `pip` shim in `venv/bin/` (spaCy's
  `shutil.which("pip")` couldn't find the venv's pip3), falls back to direct
  `pip install` if `spacy download` still fails, and prepends `venv/bin` to
  PATH explicitly

### Added — Diagnostic logging

- `[run_scan]` prints sources, user count, app_mode, and a sample user entry
  at scan start — helps verify scheduled scans use the correct profile
- `[SSE]` console.log messages in the browser for `scan_phase`, `scan_done`,
  `scan_file_flagged`, `scheduler_started`, `scheduler_done`, `scheduler_error`
  — aids debugging SSE delivery issues

### Added — i18n keys (EN / DA / DE)

- `m365_sse_reconnecting` — shown when page load detects a running scan
- `m365_sse_replay_note` — logged after replayed events finish

---

## [1.5.7] — 2026-03-28

### Fixed — Missing translations in Settings modal

Several strings in the Settings → General and Settings → Scheduler tabs were
displaying in English regardless of the active language.

**Missing lang keys added** (EN / DA / DE):
- `btn_save` — Save / Gem / Speichern (used by scheduler editor Save button and others)
- `m365_settings_about` — About / Om / Über
- `m365_settings_save_pin` — Save PIN / Gem PIN / PIN speichern
- `m365_sched_freq_daily/weekly/monthly` — frequency labels in job list and editor
- `m365_sched_dow_mon` through `_sun` — day-of-week labels

**Template fixes:**
- "About" group heading now has `data-i18n="m365_settings_about"`
- "Save PIN" button uses dedicated key `m365_settings_save_pin` instead of generic `btn_save`
- Frequency and day-of-week `<option>` elements now have `data-i18n` attributes
- Scheduler job list (`schedRenderJobs`) and status update now use `t()` for frequency labels

### Changed — Theme toggle replaced with slider

The "Toggle dark / light" text button in Settings → General is replaced with a
standard toggle slider (consistent with all other toggles in the UI). The slider
reflects the current theme state when the tab opens and toggles the theme on click.

---

## [1.5.6] — 2026-03-28

### Feature — SSE event replay (#21)

Opening the browser mid-scan (e.g. while a scheduled scan is running) now
replays all buffered events so the live log and result cards populate
immediately, rather than showing nothing until the next event fires.

**`m365_scanner.py`:**
- Added `_sse_buffer: deque = deque(maxlen=500)` — a ring buffer that stores
  every `broadcast()` event
- `broadcast()` appends to the buffer before sending to SSE clients
- `run_scan()` clears the buffer at the start of each scan so stale events
  from the previous scan are not replayed
- Removed duplicate `@app.route("/api/scan/stream")` — route is now handled
  exclusively by the `routes/scan.py` blueprint

**`routes/scan.py`:**
- `scan_stream()` replays `_m._sse_buffer` immediately when a new client
  connects, then switches to live events
- All globals accessed directly via `import m365_scanner as _m` to avoid
  `__getattr__` resolution failures that caused 500 errors
- A `: connected` comment line is sent first to confirm the stream is flowing

**`static/app.js`:**
- `_autoConnectSSEIfRunning()` — new function called on `DOMContentLoaded`
  that always opens the SSE connection on page load. If a scan is already
  running, buffered events replay immediately. If the buffer is empty, no
  events fire and the log stays quiet.
- Handles `scan_phase`, `scan_progress`, `scan_start`, `scan_file_flagged`,
  `scan_done`, `scheduler_started`, `scheduler_done`, `scheduler_error` events
- `startScan()` closes and reopens the SSE connection to get a clean stream
  for each manual scan

**`m365_scanner.py` — CLI output when no browser connected:**
- `broadcast()` now prints key events to the terminal when `_sse_queues` is
  empty (i.e. no browser tab is watching), so scheduled scans are visible in
  the CLI: scan phases, file progress, errors, and completion summary

---

## [1.5.5] — 2026-03-28

### Fixed — Scheduler: multiple bugs after multi-job implementation

**`scheduler.py` renamed to `scan_scheduler.py`**

Python's stdlib includes a `sched`/`scheduler` module that was being resolved
instead of the project's own `scheduler.py`, causing `module 'scheduler' has no
attribute 'load_jobs'`. Renaming the project file to `scan_scheduler.py` eliminates
the collision entirely. All imports updated in `routes/scheduler.py` and
`m365_scanner.py`.

**Jobs with missing UUID assigned on load**

Jobs saved before the multi-job refactor had `"id": ""`. `load_jobs()` now detects
any job with a missing or empty id and assigns a fresh UUID, then rewrites the file.
This fixed "Delete failed: id required" and silent edit failures.

**Enabled toggle added to each scheduler row**

Each job row now has an inline toggle switch instead of a static ✓/— indicator.
Toggling saves the change immediately via `/api/scheduler/jobs/save`. The job
description also shows "Next: [date]" after the status fetch resolves.

**Edit no longer duplicates the job**

`_sched().reload()` inside the save route was not wrapped in its own try/except.
If APScheduler threw (e.g. not yet started), the exception propagated and caused
the save to fall through to the "create new" path. Both `reload()` calls (save and
delete) are now wrapped in `try/except: pass`.

**Delete button now works**

The delete button was passing the HTML-escaped job name through the onclick
attribute — names with apostrophes or special characters broke the JS string.
Fixed by passing only `id` and looking up the name from `_schedJobs` inside
`schedDeleteJob()`. The route and JS both have proper error handling now.

**"Not authenticated" on scheduled run**

`state.connector` is assigned once at startup (`_state.connector = _connector`)
and never updated when the user authenticates later. The scheduler now reads
`_m._connector` directly from the live `m365_scanner` module at run time,
guaranteeing it sees the current authenticated connector.
`flagged_items` and `scan_meta` reads also updated to use `_m.flagged_items`
and `_m.scan_meta` directly.

---

## [1.5.4] — 2026-03-28

### Feature — Multiple scheduled scans

The Settings → Scheduler tab now supports multiple independent named scan jobs,
replacing the previous single-job form.

**`scheduler.py`**
- Config format changed from a single dict to `{"jobs": [...]}`. Each job has
  its own `id` (UUID), `name`, and all existing fields (frequency, time, profile,
  auto-email, auto-retention).
- Old single-job `~/.m365_scanner_schedule.json` files are automatically migrated
  to the new format on first load — no manual changes needed.
- `ScanScheduler` registers one APScheduler job per enabled scan and tracks
  running state and last-run info per job independently.
- Backward-compat shims (`load_schedule_config`, `save_schedule_config`) kept
  for any existing integrations.

**`routes/scheduler.py`** — new CRUD endpoints:
- `GET  /api/scheduler/jobs` — list all jobs
- `POST /api/scheduler/jobs/save` — create or update a job (by id)
- `POST /api/scheduler/jobs/delete` — delete a job by id
- `POST /api/scheduler/jobs/run_now` — run a specific job immediately by id
- Old `/api/scheduler/config` and `/api/scheduler/run_now` kept as backward-compat shims

**`templates/index.html`** — scheduler pane replaced with a job list (styled like
File sources) and an inline editor that slides open when adding or editing. Each
row shows the job name, frequency summary, enabled/running status pill, and
▶ Run / ✏ Edit / ✕ Delete buttons. Schedule configuration lives exclusively in
the editor — nothing schedule-related appears in the sidebar except the existing
"Next: …" indicator.

**`static/app.js`** — all `sched*` functions rewritten for multi-job:
`schedLoad`, `schedRenderJobs`, `schedAddJob`, `schedEditJob`, `schedSaveJob`,
`schedDeleteJob`, `schedRunJob`, `schedCancelEdit`, `schedLoadHistory`,
`schedUpdateSidebarIndicator`.

**Lang keys added:** `m365_sched_add`, `m365_sched_name`, `m365_sched_editor_new`,
`m365_sched_editor_edit`, `m365_sched_name_required`, `m365_sched_no_runs`,
`btn_cancel` (da/en/de).

---

## [1.5.3] — 2026-03-27

### Added — Suggestion #19: Scheduled / automatic scans

In-process scheduler using APScheduler so GDPR scans run automatically on a
configurable cadence — no cron or Task Scheduler setup required.

**Backend:**
- New `scheduler.py` module wrapping APScheduler `BackgroundScheduler` with a
  single coalescing job; misfire grace time 1 hour.
- Config stored in `~/.m365_scanner_schedule.json` (daily/weekly/monthly,
  time-of-day, profile selector, auto-email, auto-retention).
- Run history persisted in new `schedule_runs` DB table (migration #7).
- `routes/scheduler.py` blueprint — `GET/POST /api/scheduler/config`,
  `GET /api/scheduler/status`, `POST /api/scheduler/run_now`,
  `GET /api/scheduler/history`.
- Scheduler starts automatically on `app.run`; status printed at boot.
- Scheduled scans reuse the full `run_scan()` pipeline (checkpoints, delta,
  broadcast, DB) — identical to interactive scans.
- Auto-email sends the Excel report via Graph or SMTP after each scheduled scan.
- Auto-retention optionally enforces the retention policy on overdue items.

**UI:**
- **Settings → Scheduler** tab — enable/disable toggle, frequency picker
  (daily/weekly/monthly), time-of-day, profile selector, auto-email and
  auto-retention toggles, status display, run history, "Run now" button.
- **Sidebar** — 🕐 next-scan indicator near the settings button; click to
  open scheduler config. Polls every 60 s.
- **Scan log** — scheduled scans appear with 🕐 prefix via SSE events
  (`scheduler_started`, `scheduler_done`, `scheduler_error`).

**Build / deps:**
- `APScheduler>=3.10` added to `requirements.txt`.
- `scheduler.py` and APScheduler hidden imports added to `build_m365.py`.
- Schedule config added to `--purge` cleanup list.
- Lang keys added for DA / EN / DE.

---

## [1.5.2] — 2026-03-27

### Fixed — File/SMB scan: image-only PDFs no longer hang the scanner

`scan_pdf()` in `document_scanner` launches Tesseract OCR and Poppler subprocesses
when a PDF has no text layer. These subprocesses cannot be killed from a Python thread,
causing the scanner to hang indefinitely on scanned documents (e.g. ESTA applications,
invoice scans).

**Fix:** Before calling `scan_pdf()`, `_scan_bytes()` now opens the PDF with `pdfplumber`
(pure Python, no subprocesses) and checks whether any page has a text layer using the
existing `is_text_page()` helper. If all pages are image-only, the file is skipped
immediately with no CPR hits — which is correct, since machine-readable CPR numbers
cannot exist in an image-only PDF.

Text-layer PDFs (the majority) pass the check and are scanned normally. Only image-only
PDFs (scanned documents) are skipped.

This replaces multiple failed approaches (`ThreadPoolExecutor` timeouts,
`shutdown(wait=False)`, extension-based skipping) that either blocked on context manager
exit or removed legitimate file types from scanning.

### Fixed — SMB scanning: multiple smbprotocol 1.14+ API changes

See v1.5.1 for details. Additional fix in this release:

- `smb_host` is now auto-derived from the path (`//host/share` → `host`) when not
  explicitly stored in the source JSON, so SMB sources saved without an explicit host
  field still connect correctly.

### Fixed — Routes blueprint: globals resolved lazily to prevent circular imports

Each route blueprint (`routes/*.py`) now uses Python's module `__getattr__` hook to
lazily resolve globals from `m365_scanner` at call time, not at import time. This
prevents the circular import that caused double blueprint registration on startup.

### Added — File source Edit button

See v1.5.1.

---

## [1.5.1] — 2026-03-27

### Fixed — SMB scanning: multiple smbprotocol 1.14+ API incompatibilities

Several functions in `file_scanner.py` used deprecated or renamed smbprotocol APIs:

- **`uuid4_str()` removed** — `Connection()` now requires a `uuid.UUID` object, not a string. Changed to `uuid.uuid4()` directly; added `import uuid` at module level.
- **`RequestedOpcodes` removed from `smbprotocol.open`** — was imported but never used; removed.
- **`FilePipePrinterAccessMask.FILE_LIST_DIRECTORY` → `DirectoryAccessMask.FILE_LIST_DIRECTORY`** — directory listing requires `DirectoryAccessMask`, not the file/pipe mask.
- **`FileDirectoryInformation` moved** — from `smbprotocol.query_info` to `smbprotocol.file_info`; import updated.
- **`FileInformationClass` enum** — `query_directory()` expects `FileInformationClass.FILE_DIRECTORY_INFORMATION` (int enum), not a class instance.
- **`query_directory()` kwargs renamed** — `file_name=` → `pattern=`, `output_buffer_length=` → `max_output=`.
- **Filename bytes** — `file_name` field now returns UTF-16-LE bytes; decoded to str with error handling.
- **`smb_host` auto-derivation** — if `smb_host` is not explicitly stored in the source JSON, it is now extracted from the path (`//host/share` → `host`). `is_smb` no longer requires `smb_host` to be pre-set.

### Fixed — SMB scanning: junk directories skipped

Added `SKIP_DIRS` constant — a set of folder names silently skipped in both local and SMB walks:

```
.recycle  .recycler  $recycle.bin  .trash  .trashes
.sync  .btsync  .syncthing
.git  .svn  .hg
__pycache__  node_modules
.spotlight-v100  .fseventsd  .temporaryitems
system volume information  lost+found
```

Local walker prunes these from `_dirs[:]` before `os.walk` descends. SMB walker checks before recursing. Hidden directories (`.` prefix) are also skipped in both.

`STATUS_END_OF_FILE` errors (zero-byte placeholder files from Bittorrent Sync, `.sync/stream_test.txt` etc.) are now silently skipped instead of logged as warnings.

### Fixed — SMB/local file scans: OCR disabled, per-file timeout added

PDF scanning via `document_scanner.scan_pdf()` would trigger Tesseract OCR on image-based PDFs (scanned forms, photos) causing single files to hang for minutes.

**`_scan_bytes_timeout()`** — new wrapper around `_scan_bytes` using `ThreadPoolExecutor` with a 30-second deadline per file. Timed-out files are logged as errors and scanning continues.

**`skip_ocr=True`** — file scan loop now passes `skip_ocr=True` to `_scan_bytes`, disabling OCR and reducing DPI to 150. Only the text layer is extracted from PDFs. This is appropriate for bulk compliance scanning where image-only PDFs rarely contain machine-readable CPR numbers.

### Added — File source Edit button

Each file source row in **⚙ Sources → File sources** now has an **✏ Edit** button between Scan and Delete. Clicking it pre-fills the add form with the existing name, path, SMB host, and username (password shown as placeholder dots). Saving with an existing ID updates the source in-place. The Add button label changes to **Save changes** while editing and reverts on save.

---

## [1.5.0] — 2026-03-27

### Refactor — HTML template and JavaScript extracted from m365_scanner.py

`m365_scanner.py` was a ~9600-line monolith containing HTML, CSS, JavaScript,
and Python all in one string. This made frontend edits unsafe (no linting,
no syntax highlighting, string escaping hazards) and diffs unreadable.

**What changed:**

- `templates/index.html` — the full HTML/CSS template (1418 lines), served via
  Flask's `render_template()` with two Jinja2 variables: `app_version` and
  `lang_json`
- `static/app.js` — all JavaScript (2832 lines), served by Flask's built-in
  static file handler at `/static/app.js`
- `m365_scanner.py` — reduced from 9586 to 5334 lines (44% smaller);
  now contains only Python: business logic, API routes, and configuration

**Flask configuration updated:**

```python
app = Flask(__name__,
            template_folder=os.path.join(BASE_DIR, "templates"),
            static_folder=os.path.join(BASE_DIR, "static"))
```

`BASE_DIR` resolves to `sys._MEIPASS` when running as a PyInstaller bundle,
or to the directory containing `m365_scanner.py` otherwise — the same pattern
already used for `lang/`, `keywords/`, and `classification/`.

**Build script updated:**

`build_m365.py` now bundles `templates/` and `static/` alongside the existing
`lang/`, `keywords/`, and `classification/` directories.

**Zero behaviour change** — the app works identically. Only the file organisation changed.

---

## [1.5.0] — 2026-03-27

### Refactor — HTML template and JavaScript extracted from m365_scanner.py

`m365_scanner.py` was a ~9600-line monolith containing HTML, CSS, JavaScript,
and Python all in one string. This makes frontend edits unsafe (no linting,
no syntax highlighting, string-escaping hazards) and diffs unreadable.

**New files:**

- `templates/index.html` — full HTML/CSS template (1452 lines) served via
  Flask `render_template()`. Two Jinja2 variables: `{{ app_version }}` and
  `{{ lang_json | safe }}`.
- `static/app.js` — all JavaScript (2832 lines) served by Flask's built-in
  static file handler at `/static/app.js`.

**Flask app updated:**

```python
app = Flask(__name__,
            template_folder=os.path.join(BASE_DIR, "templates"),
            static_folder=os.path.join(BASE_DIR, "static"))
```

`BASE_DIR` resolves to `sys._MEIPASS` when running as a PyInstaller bundle,
or the directory of `m365_scanner.py` otherwise — the same pattern already
used for `lang/`, `keywords/`, and `classification/`. `build_m365.py` updated to bundle
both new directories.

**Result:** `m365_scanner.py` reduced from 9586 to ~2100 lines of pure Python.
Zero behaviour change.

### Refactor — Routes split into Flask Blueprints

All 55 API routes extracted from `m365_scanner.py` into a `routes/` package.
Shared mutable state lives in `routes/state.py`; blueprints import from there
to avoid circular imports.

```
routes/
  __init__.py       package marker
  state.py          shared globals: connector, flagged_items, LANG, …
  auth.py           /api/auth/*                           174 lines
  users.py          /api/users/* + role overrides         222 lines
  scan.py           /api/scan/* + /api/settings/*         123 lines
  sources.py        /api/file_sources/* + /api/file_scan   93 lines
  profiles.py       /api/profiles/*                        48 lines
  email.py          /api/smtp/* + /api/send_report        210 lines
  database.py       /api/db/* + /api/admin/* + preview    536 lines
  export.py         Excel + Art.30 export + bulk delete  1177 lines
  app_routes.py     /api/about + /api/langs + /api/lang    67 lines
```

### Housekeeping — Document Scanner files removed

The following files belonged to the standalone Document Scanner product and
have been removed from this repository:

- `server.py` — Document Scanner web app
- `scanner_worker.py` — Document Scanner process-pool worker
- `build.py` — Document Scanner build script
- `build_app.sh` — Document Scanner shell build script
- `Dockerfile` — Document Scanner Docker image
- `docker-compose.yml` — Document Scanner Docker Compose file
- `doc_scanner_icon.png` — Document Scanner app icon

`requirements.txt` rewritten for the M365 Scanner only. Removed
`pdf2image`, `pytesseract`, `pypdf`, `reportlab`, `img2pdf`, and `py7zr`
(Document Scanner dependencies). Added `cryptography>=42.0` (SMTP password
encryption, already in use since v1.4.7).

---

## [1.4.8] — 2026-03-27

### Changed — Email: Microsoft Graph API preferred over SMTP

Both **Test** and **Send now** now try the Microsoft Graph API first when the
scanner is authenticated to Microsoft 365. This avoids SMTP AUTH entirely —
no port 587, no app password, no admin centre changes needed.

**New `_send_email_graph()` helper** — sends via `/me/sendMail` (delegated mode)
or `/users/{sender}/sendMail` (app mode). Supports optional Excel attachment for
the full report. Requires the `Mail.Send` Graph permission on the Azure app
registration (Application or Delegated, depending on auth mode).

**Priority order:**
1. **Microsoft Graph API** — used when connected to M365
2. **SMTP** — fallback if not connected or Graph fails

**Error surfacing** — Graph permission errors (403 / Forbidden / Mail.Send /
insufficient privileges) are now returned directly with a clear actionable
message: add `Mail.Send` permission to the Azure app registration and grant
admin consent. Previously the error was silently swallowed and the scanner
fell through to SMTP, masking the real problem.

**SMTP AUTH error** — if SMTP is used and Microsoft 365 returns error 530 5.7.57
("Client not authenticated"), the error message now includes a plain-English tip
explaining how to enable SMTP AUTH in the M365 admin centre, or how to use Graph
instead.

### Changed — Test button sends a real email to configured recipients

The SMTP **Test** button previously only verified connectivity (EHLO/STARTTLS
handshake). It now sends an actual HTML test email to the configured recipients,
making it easy to verify end-to-end delivery including spam filtering.

---

## [1.4.7] — 2026-03-27

### Security — SMTP password encrypted at rest

Previously the SMTP password was stored as plaintext in `~/.m365_scanner_smtp.json`.
It is now encrypted using **Fernet symmetric encryption** (`cryptography` library,
already a dependency).

**Implementation:**
- A random Fernet key is generated on first use and saved to
  `~/.m365_scanner_machine_id` (chmod 0o600 — owner-readable only)
- Passwords are stored as `enc:<ciphertext>` in the JSON file
- `_encrypt_password()` / `_decrypt_password()` handle the encode/decode cycle
- `_load_smtp_config()` transparently decrypts on load; `_save_smtp_config()`
  encrypts on save
- **Legacy plaintext passwords** (no `enc:` prefix) are read as-is and
  re-encrypted next time settings are saved — no migration step required
- Encrypted blobs are **machine-specific** — the ciphertext cannot be decrypted
  on another machine without the key file
- Graceful fallback to plaintext if `cryptography` is unavailable (rare)
- The GET `/api/smtp/config` endpoint never returns the password to the browser;
  it returns only `has_password: true/false`

### Fixed — EXIF `has_pii` false positives on screenshots

`_EXIF_PII_TAGS` previously included `HostComputer`, `DocumentName`, and `PageName`.
These are set automatically by macOS/Windows on every screenshot (machine name, app
name) and contain no personal data about an individual. Removed from the tag set.

Minimum content length of 3 characters added — a field must contain at least 3
non-whitespace characters to trigger a `has_pii` flag. Prevents empty or
single-character values from causing false positives.

**Affected fields retained:** `Artist`, `Copyright`, `ImageDescription`,
`UserComment`, `XPAuthor`, `XPSubject`, `XPComment`, `XPKeywords` — all fields
a human would deliberately fill with personal information.

### Fixed — Accounts section not greyed out when switching to a file-only profile

`_applyProfile()` restores source checkboxes but did not call
`_updateAccountsVisibility()` afterwards. Switching to a profile with no M365
sources selected left the accounts section fully interactive. Fixed by calling
`_updateAccountsVisibility()` immediately after the checkbox restore loop.

---

## [1.4.6] — 2026-03-27

### Changed — Excel export updated for EXIF, GPS, and file sources

**New columns in all source sheets:**
- **GPS** — ✔ tick when GPS coordinates are present in the item's EXIF data
- **EXIF author** — author/artist name extracted from EXIF metadata
- Special category column now filters out `gps_location` and `exif_pii` (represented by the dedicated GPS column instead)

**New source types in `SOURCE_MAP`:**
- `local` — 📁 Local (green tab), for files from local folder scans
- `smb` — 🌐 Network (blue tab), for files from SMB/CIFS network shares
- Both get their own sheet when results exist; skipped silently if empty

**Summary sheet:**
- Row 4: "Items with GPS data" count (shown only when non-zero)
- Summary table shifted to row 7 to accommodate (was row 6)
- Source rows now skipped when a source has zero items

**New GPS locations sheet:**
- Teal tab — created only when GPS items exist
- Columns: Name, Latitude, Longitude, Maps link (blue hyperlink), Account, Date Modified
- Auto-filter enabled; alternating row colours

**Bug fix:** dead old function body (164 lines after the `return`) removed — the previous `str_replace` only replaced the docstring, leaving unreachable code in the file.

---

## [1.4.5] — 2026-03-26

### Fixed — `_detect_photo_faces` missing after EXIF insertion

The `str_replace` that added `_extract_exif()` accidentally consumed the
`def _detect_photo_faces` function definition (it was part of the replaced
string). All image scans raised `NameError: name '_detect_photo_faces' is not
defined`. Function restored at its original position before `_scan_bytes()`.

### Fixed — Progress bar shows "undefined / undefined" during file scans

The M365 `scan_progress` SSE event sends `{index, total, pct, file, eta}`.
The file scanner sent only `{scanned, flagged}`. The JS handler blindly read
`d.index` and `d.total`, producing `undefined / undefined`.

**Fixes:**
- `run_file_scan()` now broadcasts `{scanned, flagged, file, pct}` so the
  current filename and a progress indicator are shown while scanning.
- The `scan_progress` JS handler now checks which fields are present and
  renders accordingly: `index / total` for M365 scans, `N · M flagged` for
  file scans.

### Fixed — Local file preview: PDF, XLSX, DOCX now render content

`/api/preview/<id>` for `source_type=local` previously showed only a metadata
placeholder for PDF and Office files. Now:

| Type | Preview |
|---|---|
| PDF | First 5 pages extracted via `pdfplumber`, CPR numbers highlighted in red |
| XLSX / XLSM | First 50 rows of up to 3 sheets as a styled table |
| CSV | First 50 rows as a table |
| DOCX / DOC | First 80 paragraphs as text, CPR numbers highlighted |

All fall back to a metadata card if the library is unavailable or the file
cannot be parsed. `document_scanner` (already imported) provides access to
`pdfplumber` and `openpyxl`.

---

## [1.4.4] — 2026-03-26

### Added — #18 EXIF metadata extraction from images

**New function `_extract_exif(content, filename)`** — extracts structured EXIF data from JPEG, PNG, TIFF, WEBP, and HEIC images using Pillow (already a dependency). No new packages required.

**Extracted fields:**
- **GPS coordinates** — converted from DMS rational values to decimal degrees; Google Maps link generated
- **Author / Artist / Copyright / Description / UserComment / Keywords** — checked for PII content
- **Device** — camera make and model
- **Datetime** — DateTimeOriginal or DateTime

**Behaviour changes:**
- EXIF extraction runs on all scanned images regardless of the "Scan photos" toggle — it is lightweight (no CV processing) and always relevant
- Images with GPS or PII-bearing EXIF fields are flagged even without CPR hits
- `special_category` gains `"gps_location"` and/or `"exif_pii"` entries as appropriate
- Face detection (`_detect_photo_faces`) still requires the "🖼 Scan photos for faces" opt-in

**UI:**
- **🌍 GPS badge** — teal pill on result cards (grid and list view) when GPS coordinates are present
- **Preview panel** — local image previews now show a collapsible "EXIF data" section beneath the image with GPS (clickable Google Maps link), author, date, device, and any other PII-bearing fields

**Applies to both M365 and file system scans** — OneDrive/SharePoint images and local/SMB files go through the same extraction path.

---

## [1.4.3] — 2026-03-26

### Added — General Settings modal

Three sidebar sections (✉ Email report, 🗄 Database, and the language selector + About link) have been removed from the sidebar and consolidated into a single **⚙ Settings** modal, opened via a button in the sidebar footer.

**General tab** — language selector (mirrors the hidden `langSelect`), theme toggle, and About info (version, Python, MSAL, Requests, openpyxl versions).

**Email report tab** — full SMTP configuration (host, port, username, password, from address, STARTTLS, recipients), Save, and Send now. Pre-fills from saved config. `openSmtpModal()` now redirects to this tab for backward compatibility.

**Database tab** — DB stats (total items, flagged items, scan count), ⬇ Export, ⬆ Import, and 🗑 Reset DB. `exportDB()` and `openImportDBModal()` work unchanged.

**🔍 Data subject lookup** remains as a sidebar shortcut since it is part of the active compliance workflow.

---

## [1.4.2] — 2026-03-26

### Added — Dynamic sources panel in sidebar

The sidebar sources panel is now fully dynamic. Previously the four M365 sources (Email, OneDrive, SharePoint, Teams) were hardcoded checkboxes. Now:

- **`renderSourcesPanel()`** builds the list at runtime from `_M365_SOURCES` (the four fixed M365 entries) and `_fileSources` (saved local/SMB sources). A "File sources" group header appears automatically when any file sources are configured.
- Per-source visibility toggles in the ⚙ Sources modal (Microsoft 365 tab) control which M365 sources appear in the panel. Toggling one off removes it from the panel immediately.
- File sources added in the Sources modal appear as checkboxes in the panel alongside the M365 sources, with 📁 (local) or 🌐 (SMB) icons.
- The panel shows up to 5 rows before scrolling (`max-height: calc(5 * 26px)`).
- **Profile save/restore** — file source selections are now included when saving a profile. `buildScanPayload()` merges M365 and file source IDs into `allSources`; `_applyProfile()` restores all of them. A `_pendingProfileSources` mechanism handles the async case where file sources load after the profile is applied.

### Added — Hint tooltips on Delta scan, Scan photos, Retention policy toggles

Each of the three advanced option toggles now has a circled **?** icon to the right of the label. Clicking it shows a speech bubble (fixed-positioned, `z-index: 9999`) with the hint text, positioned to the right of the icon and visible above the main content area. Only one bubble can be open at a time; clicking anywhere outside closes it.

### Changed — ⚙ Profiles button moved to topbar

The accent-coloured **⚙ Profiles** button was removed from the Database section in the sidebar. A plain **⚙ Profiles** button (matching the style of **⚙ Sources**) now appears to the right of the 💾 save button in the topbar profile bar.

### Changed — App mode badge (modeBadge) removed

The `modeBadge` button and `userBar` div have been removed from the sidebar. Connection status and mode (App / Delegated) are now shown exclusively in the Sources modal (Microsoft 365 tab) — connection info row with green/grey status dot, display name, email, and mode label.

### Fixed — Sources modal: credentials pre-filled from saved config

`smRefreshStatus()` now calls `/api/auth/status` (correct endpoint) and pre-fills Client ID, Tenant ID, and Client Secret fields from the saved config. Connects via `/api/auth/config` + `/api/auth/start`; disconnects via `/api/auth/signout` + `signOut()`.

### Fixed — File source naming: Name field required; auto-suggest from path

The "Label" field renamed to "Name" and marked required (red asterisk). `fsrcAutoName()` suggests a name as the user types the path — last path segment for local paths, `host / share` for SMB paths. The user's own name is never overwritten once typed.

### Fixed — Sources panel fixed height with scroll

`#sourcesPanel` in the sidebar now has `max-height: calc(5 * 26px); overflow-y: auto` so it shows exactly 5 rows before scrolling, regardless of how many sources are configured.

### Fixed — Fiscal year end dropdown alignment

The "Fiscal year end" label and select were previously side-by-side, causing the label to wrap on long translations (e.g. "Regnskabsårs afslutning"). Now stacked vertically (`flex-direction: column`) with `width: 100%` on the select.

### Fixed — ⚙ cog size inconsistency between Sources and Profiles buttons

Both buttons previously used `⚙️` (U+2699 + variation selector U+FE0F), which can render at emoji size rather than text size. Replaced with plain `⚙` (U+2699) in both so they render at identical size.

### Fixed — MB label removed from max attachment size picker

The "MB" text span to the right of the attachment size number input has been removed.

### Fixed — File source selections included in profiles

`buildScanPayload()` now collects both M365 and file source IDs and merges them into `allSources`, which is saved as `profile.sources`. Previously only M365 source IDs were saved.

---

## [1.4.1] — 2026-03-26

### Added — #17 Unified source management modal

Replaced the fragmented sidebar source configuration with a single **⚙️ Sources** button above the sources panel. This opens a tabbed modal:

**Microsoft 365 tab:** Azure credentials (Client ID, Tenant ID, Client Secret) moved from the auth screen into the modal — can be updated or cleared post-connect. Per-source toggles (Email, OneDrive, SharePoint, Teams) control which sources appear in the sidebar panel. Disconnect button signs out without leaving the page.

**Google Workspace tab:** Stub with "Coming soon" — placeholder for Gmail and Google Drive when implemented.

**File sources tab:** Full file source management (list, add, delete, scan) moved from the standalone "📁 File sources" sidebar row into this tab. The separate sidebar row is removed.

**Sidebar change:** The "📁 File sources" sidebar section is removed. The sources panel now has a compact **⚙️ Sources** button in its header row. The panel itself respects the per-source visibility toggles set in the modal — if a user disables OneDrive, it disappears from the panel immediately.

**Backward compatibility:** `openFileSourcesModal()` redirects to `openSourcesMgmt('files')` so any existing call sites continue to work.

---

## [1.4.0] — 2026-03-26

### Added — #8 File system scanning (local folders and SMB/CIFS network shares)

**New file: `file_scanner.py`** — unified local + network file iterator.

`FileScanner.iter_files()` yields `(relative_path, bytes, metadata)` regardless
of whether the source is a local path or a network share. All CPR scanning, card
streaming, and DB persistence stay in `m365_scanner.py` — `file_scanner.py` only
handles how files are accessed.

**Local scanning** uses `os.walk()` on any path (workstation, USB drive, or
already-mounted network share). **SMB/CIFS scanning** uses `smbprotocol` directly
without requiring a mount — supports SMB2/3 with NTLM or domain credentials.
`smbprotocol` is optional: if not installed, the scanner falls back to local-only
mode with a logged warning.

**Credential storage priority (SMB):**
1. OS keychain via `keyring` (recommended — password never touches the filesystem)
2. `NAS_PASSWORD` environment variable
3. `.env` file (chmod 600) via `python-dotenv`

Both optional dependencies (`smbprotocol`, `keyring`, `python-dotenv`) are added
to `requirements.txt` as opt-in extras.

**Results** write to the same SQLite DB as M365 items with
`source_type = "local"` or `"smb"`, so the Article 30 report and data subject
lookup cover all sources in a single view. File/network cards use 📁 and 🌐
source badges respectively.

**UI — 📁 File sources sidebar section:**

- **Manage button** → opens the File Sources modal
- **Add source form** — label, path; SMB fields (host, user, password) appear
  automatically when the path starts with `//` or `\`; host is auto-filled from
  the path
- **Per-source ▶ Scan button** — starts a scan immediately; results stream into
  the main grid via SSE exactly like an M365 scan
- **Delete** — removes a source definition (does not affect scan results already
  in the DB)
- Sources persist in `~/.m365_scanner_file_sources.json`

**New API routes:**

| Route | Method | Description |
|---|---|---|
| `/api/file_sources` | GET | List all file source definitions |
| `/api/file_sources/save` | POST | Add or update a source |
| `/api/file_sources/delete` | POST | Remove a source by id |
| `/api/file_sources/store_creds` | POST | Store SMB password in OS keychain |
| `/api/file_scan/start` | POST | Start a file scan (non-blocking) |

**New CLI flags:**

```bash
# Scan a local folder
python m365_scanner.py --scan-path ~/Documents

# Scan an SMB share (password from OS keychain)
python m365_scanner.py --scan-path //nas.school.dk/shares \
  --smb-user "DOMAIN\\henrik" --smb-keychain-key gdpr-scanner-nas

# One-time credential storage
python m365_scanner.py --smb-store-creds --smb-host nas.school.dk \
  --smb-user "DOMAIN\\henrik"

# With photo scanning and file size limit
python m365_scanner.py --scan-path //nas/staff --scan-photos --max-file-mb 100
```

**`build_m365.py`** — `file_scanner.py` added to PyInstaller datas bundle.

---

## [1.3.11] — 2026-03-26

### Fixed — Face detection: excessive false positives on background elements

Haar cascade detection with `minNeighbors=5` and `min_size=40px` was triggering
on background textures, bottle labels, artwork, and out-of-focus persons,
reporting up to 16 faces for a photo containing 1–2 actual subjects.

**Changes in `_detect_photo_faces()` (`m365_scanner.py`):**

- `min_size` raised **40 → 80 px** — eliminates detections on small background
  features; out-of-focus background persons and objects are too small in pixels
  to exceed this threshold
- `minNeighbors` raised **5 → 8** — each candidate region must be confirmed by
  8 overlapping scale-pyramid detections instead of 5; random texture patterns
  rarely survive this many confirmations

If over-detection persists on a specific image, `minNeighbors=10` and
`min_size=100` are reasonable next steps before genuine faces are missed.

### Fixed — Result cards: replaced 👤 + separate role-pill with unified role icon

The account-pill (showing the owner's display name) previously prepended a
static `👤` via CSS `::before` and rendered a separate `role-pill` span
(🎓/👔) alongside it. Both elements have been merged: the account-pill now
prefixes the display name directly with the role icon — **🎓 name** for
students, **👔 name** for staff, **👤 name** for unclassified — removing the
redundant separate badge and saving horizontal space in both grid and list view.

---

## [1.3.10] — 2026-03-26

### Changed — Role classification: fragment-first, ID-second

**Motivation:** Microsoft has reissued new UUIDs for the same licence multiple
times over the past 5–6 years (EA → A1/A3/A5 → new commerce/CSP → benefit
variants). `skuPartNumber` strings like `STANDARDWOFFPACK_FACULTY` have been
stable across all those generations while UUIDs change with every new issuance.

**New `classify_user_role()` order:**

1. **Fragment match on `skuPartNumber`** (runs first when `sku_map` available) — staff fragments checked before student across all licences, so a `STUDENT_BENEFIT` add-on cannot mask a `FACULTY` licence.
2. **SKU ID lookup from `m365_skus.json`** — fallback when `sku_map` is empty or when a licence has no recognisable fragment (e.g. Power Automate Free assigned to faculty).

Any future Microsoft SKU re-issuance is classified correctly without updating `m365_skus.json`, as long as the part number still contains `FACULTY` or `STUDENT`.

### Fixed — `m365_skus.json`: added two missing faculty SKUs

- `c2273bd0-dff7-4215-9ef5-2c7bcfb06425` — Microsoft 365 Apps for Faculty (primary licence at Gudenåskolen, absent from all previous versions)
- `f30db892-07e9-47e9-837c-80727f46fd3d` — relabelled Microsoft Power Automate Free (assigned to faculty)

---

## [1.3.9] — 2026-03-26

### Fixed — `m365_skus.json` not deployed; `build_sku_map_from_users` sampled wrong users

**File missing:** `m365_skus.json` was never copied into `classification/` on disk. `_load_sku_data()` fell back to empty sets (`staff_ids_count: 0`). Students still classified via `STUDENT` fragment; staff always `"other"`. Fix: file now shipped. Place in `GDPRScanner/classification/m365_skus.json`.

**Wrong sample:** `build_sku_map_from_users` took the first 20 alphabetical users — all students at Gudenåskolen — so it never fetched a staff part number. Fixed to sample evenly across the full list and always include the last 5 users.

---

## [1.3.8] — 2026-03-26

### Fixed — `m365_skus.json` not found in PyInstaller bundle; `🔍` SKU debug modal

`_SKU_FILE = Path(__file__).parent / ...` evaluated at class-definition time, before `sys._MEIPASS` is set in a frozen build. Replaced with `_sku_file_path()` classmethod that checks `_MEIPASS` at call time.

Added 🔍 SKU debug button to the accounts panel role-filter row. Opens a modal showing every tenant SKU ID colour-coded as 🎓 student / 👔 staff / ❓ unknown, with selectable text for pasting unknowns into `m365_skus.json`.

`/api/users/license_debug` extended: now returns `student_ids`, `staff_ids`, `runtime` block (set sizes, fragment lists, file path, sku_map entry count), and per-licence `in_staff`/`in_student`/`frag_staff`/`frag_student` trace for every user — sufficient to diagnose any classification failure without reading server logs.

---

## [1.3.7] — 2026-03-26

### Fixed — `license_debug` extended for full runtime diagnostics

`/api/users/license_debug` rewritten to expose all runtime state: `staff_ids_count`, `student_ids_count`, fragment lists, `sku_file_path`, `sku_map_entries`, and a step-by-step per-licence classification trace for every user (`in_staff`, `in_student`, `frag_staff`, `frag_student`, `skuName`).

---

## [1.3.6] — 2026-03-26

### Fixed — Staff misclassified as student: two-pass classify_user_role

**Root cause:** `f30db892-07e9-47e9-837c-80727f46fd3d` is a Microsoft *Student
Use Benefit* add-on that Microsoft automatically assigns alongside faculty
licences in Education tenants. Its `skuPartNumber` contains `"STUDENT"`. Because
the old single-pass loop checked student and staff in per-licence order, the
fragment match on this add-on fired before the authoritative faculty ID
(`94763226`) was ever reached, returning `"student"` instead of `"staff"`.

**Fix — `classify_user_role()` now uses a strict two-pass approach:**

**Pass 1 — authoritative ID match (m365_skus.json), staff before student:**
All licences are scanned for staff IDs first, then student IDs. A single faculty
SKU ID anywhere in the licence list wins regardless of what other add-on licences
appear before it.

**Pass 2 — skuPartNumber fragment match, staff before student:**
Only reached if no ID match was found. Staff fragments are checked across every
licence before student fragments — preventing a `STUDENT_BENEFIT` add-on from
masking a `FACULTY` licence later in the list.

**Result:** A staff member holding `[STUDENT_BENEFIT_ADDON, FACULTY_A1, STUDENT_DEVICE]`
is now correctly classified as `"staff"` in all cases, whether `sku_map` is
populated or not.

---

## [1.3.5] — 2026-03-26

### Fixed — Staff not recognised: always merge per-user SKU map

**Root cause:** `build_sku_map_from_users()` (which calls `/users/{id}/licenseDetails`
for up to 20 sampled users) was only called when `sku_map` was completely empty.
In practice `get_subscribed_skus()` tier 2 (`/me/licenseDetails`) always succeeds
in delegated mode, returning the signed-in admin's own license — making `sku_map`
non-empty and silently skipping the per-user sampling.

If the admin's license happened to be a faculty A1 and other staff held A3 or an
unlisted variant, those A3 users were never added to `sku_map` and fragment
matching could not fire for them, leaving them as `"other"`.

**Fix:** `build_sku_map_from_users()` is now **always called** and its results
**merged** into `sku_map`, regardless of whether `get_subscribed_skus()` already
returned entries. This guarantees that every distinct SKU ID actually in use by
any of the first 20 users gets a `skuPartNumber` entry, enabling fragment matching
for all staff variants — including those not yet listed in `m365_skus.json`.

Same merge applied in `license_debug` so the 🔍 modal also sees complete data.

---

## [1.3.4] — 2026-03-26

### Fixed — Role classification: three-tier SKU map fallback

**Root cause:** `get_subscribed_skus()` requires `Directory.Read.All` or
`Organization.Read.All`. If the Azure app registration does not have that
permission (typical delegated/device-code setups), it silently returned `{}`
and the fragment fallback never ran, leaving every user as `"other"`.

**Fix — `get_subscribed_skus()` now tries three endpoints in order:**

| Tier | Endpoint | Permission needed |
|---|---|---|
| 1 | `/subscribedSkus` | Directory.Read.All (admin) |
| 2 | `/me/licenseDetails` | User.Read only |
| 3 | `build_sku_map_from_users()` via `/users/{id}/licenseDetails` (up to 20 users) | User.Read.All |

Each tier logs how many SKU entries it found. Tier 2 always works in delegated
mode and covers the signed-in user's licenses. Tier 3 covers all distinct SKUs
used in the tenant by sampling up to 20 users. If any tier returns results, the
others are skipped.

**UI warning banner** — when every fetched user resolves to `"other"`, a red
banner appears above the accounts list: *"No users classified — click 🔍 to
diagnose."*  It disappears automatically once classification succeeds.

---

## [1.3.3] — 2026-03-26

### Fixed — Role classification: SKU debug modal + path resolution

**Problem:** Even with `classification/m365_skus.json` loading correctly, users showed as
unclassified because the tenant's actual SKU IDs were not in the file. There was
no easy way to discover which IDs to add.

**Changes:**

- **🔍 SKU debug button** — a small magnifying-glass button added to the role
  filter row (next to 🎓 Elev). Clicking it opens a modal that calls
  `GET /api/users/license_debug` and lists every unique SKU ID in the tenant,
  colour-coded: `🎓 student` / `👔 staff` / `❓ unknown`. Unknown IDs can be
  selected and copied directly into `classification/m365_skus.json`.

- **`/api/users/license_debug`** extended — now also returns `student_ids` and
  `staff_ids` arrays from the loaded SKU file so the frontend can mark each
  tenant SKU as known or unknown without a second round-trip.

- **`_sku_file_path()` classmethod** — replaced the static `_SKU_FILE` class
  attribute with a method that checks `sys._MEIPASS` first (PyInstaller bundle)
  then falls back to `Path(__file__).parent / "skus" / "m365_skus.json"`.
  The static attribute evaluated at class-definition time before `_MEIPASS` was
  set, causing the frozen app to look in the wrong directory.

- **Server-side warning** — `GET /api/users` now logs a `WARNING` to stdout
  when 0 out of N users are classified, including a sample of the unrecognised
  SKU IDs seen in the first 20 users.

- **Translated** — EN / DA / DE (3 new keys)

---

## [1.3.2] — 2026-03-26

### Fixed — Student/Staff misclassification: incomplete SKU lists + no override (#1.3.2)

**Root cause:** The hardcoded SKU lists introduced in v1.0.0 covered only ~8 student
and 6 staff SKUs. Microsoft publishes 100+ Education SKU IDs; any tenant using a SKU
not in those lists silently fell through to `"other"`, leaving users unclassified
or relying solely on the `skuPartNumber` fragment fallback — which itself was too
specific (`STANDARDWOFFPACK_STUDENT` instead of just `STUDENT`).

#### `m365_connector.py` — Expanded SKU lists and broader fragment matching

**Student set** expanded from 8 → 12 SKUs:
- Added `46c119d4` (M365 A1 for Students — student use benefit)
- Added `8fc2205d` (O365 A5 for Students)
- Added `160d616a` (O365 A3 for Students device)
- Added `a4e376bd` (M365 A1 for Students new commerce)

**Staff set** expanded from 6 → 9 SKUs:
- Added `2d61d025` (M365 A1 for Faculty — faculty use benefit)
- Added `15b1d32e` (O365 A3 for Faculty device)
- Added `ba04c29e` (M365 A1 for Faculty new commerce)

**Fragment patterns** broadened — `"STUDENT"` and `"FACULTY"` now catch all
part-number variants (`_STUDENT`, `STUDENT_`, `STUDENT_BENEFIT`, `_FAC`, etc.)
without needing to enumerate every Microsoft naming permutation.

#### `m365_scanner.py` — Manual role overrides

Because no SKU list can ever be complete, admins can now correct individual users
directly from the accounts panel:

- **🎓/👔/❓ role badge** on every user row — click to cycle:
  `auto → student → staff → other → (clear, back to auto)`
- Overridden rows show the badge in accent colour with a **✎** indicator
- Overrides persisted to `~/.m365_scanner_role_overrides.json` — survive
  restarts and re-authentication
- Applied at both display time (`/api/users`) and scan time (`_user_role_map`)
  so card badges, filter buttons, Excel Role column, and Article 30 inventory
  split all reflect the corrected role
- `GET /api/users/role_override` — returns all current overrides
- `POST /api/users/role_override` — sets or clears one override
- Override file added to `--purge` file list
- Translated — EN / DA / DE (3 new keys)

---

## [1.3.1] — 2026-03-26

### Fixed — Student/Staff role misclassification (`m365_connector.py`)

Two SKU ID collisions in `_STUDENT_SKU_IDS` / `_STAFF_SKU_IDS` caused Faculty
users to be shown as Students (and vice versa) for any tenant using A5 or A3
Education licenses:

| SKU ID | Correct role | Bug |
|---|---|---|
| `e578b273-6db4-4691-bba0-8d691f4da603` | Staff (M365 Education A5 for Faculty) | Was also in `_STUDENT_SKU_IDS` as "O365 A5 for Students" — Faculty A5 users always showed as 🎓 Student |
| `78e66a63-337a-4a9a-8959-41c6654dfb56` | Student (Office 365 A3 for Students) | Was also in `_STAFF_SKU_IDS` as "M365 A1 for Faculty (device)" — this had no effect because student is checked first, but the comment was wrong and the duplicate entry was confusing |

`classify_user_role()` checks student first, so any overlap resolves to student,
silently misclassifying all affected Faculty accounts.

**Fix:** removed `e578b273` from `_STUDENT_SKU_IDS` and `78e66a63` from
`_STAFF_SKU_IDS`. Also removed a stale duplicate of `e578b273` that appeared
twice in `_STAFF_SKU_IDS`. Added a `RuntimeWarning` guard inside
`classify_user_role()` that logs any future collision between the two sets.

**Impact:** Article 30 staff/student inventory split, role filter buttons (👔 / 🎓),
role badges on cards, and Excel Role column are all now correct for A5 and A3
Education tenants.

**Workaround until update:** use `GET /api/users/license_debug` to see the raw
SKU IDs and current classification for each user.

---

## [1.3.0] — 2026-03-26

### Added — Biometric photo scanning (#9)

**GDPR reference:** Article 9 (special categories — biometric data), Article 5(1)(b)(e), Recital 38, Databeskyttelsesloven §6

- **`PHOTO_EXTS`** — new constant covering `.jpg .jpeg .png .bmp .tiff .tif .webp .heic .heif`
- **`_detect_photo_faces(content, filename)`** — calls `ds._get_cv2()` + `ds.detect_faces_cv2()` (already in `document_scanner.py`); PIL fallback for HEIC/HEIF; `minNeighbors=5` for conservative detection; returns face count or 0 on any failure; entirely safe — exceptions swallowed silently
- **`scan_photos` option** — new boolean scan option (default `False` — opt-in); extracted from `scan_opts` alongside `delta` and `email_body`
- **`🖼 Scan photos for faces` toggle** in the Options panel, with hint: "Slower — opt in"
- **Photo items flagged even without CPRs** — a file is added to results if `face_count > 0`, even if no CPR number is found; photographs of identifiable people are Art. 9 data regardless of CPR content
- **`"biometric"` auto-injected** into `special_category` when faces are detected and `"biometric"` is not already present
- **`face_count`** field added to card payload, DB, Excel, and Article 30 report

**DB (migration #4):**
- `face_count INTEGER NOT NULL DEFAULT 0` added to `flagged_items` via auto-migration
- `save_item()` updated to persist `face_count`

**UI:**
- **`📷 N faces` badge** — teal `photo-face-badge` pill shown on cards in both grid and list view when `face_count > 0`
- **`📷 Photos / biometric` filter** added to the Special dropdown in the filter bar; `applyFilters()` handles `specialVal === 'photo'`
- `buildScanPayload()` includes `scan_photos`; `_applyProfile()` restores it when loading a profile

**Excel export:**
- `Face count` column added as column 3 (between CPR Hits and Special category); URL column index updated from 10 → 11 for hyperlink styling

**Article 30 report:**
- Summary section: `Photos with detected faces (Art. 9 biometric)` row with item + face count; explanatory note on legal basis and parental consent (Databeskyttelsesloven §6)
- New dedicated section: *Photographs and Biometric Data (Article 9)* — intro paragraph, 4-bullet retention guidance (purpose limitation, pupil consent, website removal, archiving), item table (name, account, source, faces, modified date), capped at 50 rows
- Methodology section: bullet added describing OpenCV Haar cascade detection

**Translated** — EN / DA / DE (16 new keys per language)

---

## [1.2.3] — 2026-03-26

### Added — Profile management modal (#15d)

- **⚙ Profiles button** in the sidebar Database row opens a modal listing all saved profiles
- **Each profile row** shows name (with ● active indicator), sources summary, description, and last run timestamp
- **Use** — loads the profile into the sidebar and updates the topbar dropdown; closes the modal
- **Edit** — expands an inline edit form directly in the row; saves name and description via `POST /api/profiles/save`
- **Duplicate** — creates a copy with a unique `(copy)` / `(copy 2)` suffix; reloads the list
- **Delete** — confirms, removes via `POST /api/profiles/delete`, clears `_activeProfileId` if the deleted profile was active
- Empty state shown when no profiles have been saved yet
- Translated — EN / DA / DE (14 new keys per language)

### Added — Database export/import UI (#11)

- **🗄 Database** sidebar section with **Export** and **Import** buttons (always visible; sits between Email report and User info)
- **Export button** — calls `GET /api/db/export`; triggers a browser download of a timestamped ZIP (`gdpr_export_YYYYMMDD_HHmmss.zip`) containing 8 JSON files; CPR hashes only, thumbnails stripped
- **Import modal** — file picker (`.zip` only), mode selector (Merge / Replace), replace warning panel, status line, and Import button; calls `POST /api/db/import` with multipart form data
- **`GET /api/db/export`** Flask route — generates ZIP in a temp file, streams bytes as `application/zip` attachment
- **`POST /api/db/import`** Flask route — accepts multipart `file`, `mode`, `confirm`; validates replace confirmation server-side; returns `{ok, mode, imported: {table: count}}`
- Translated — EN / DA / DE (17 new keys per language)

### Changed — Article 9 keyword matching compiled to regex (#13)

- `_load_keywords()` now compiles one `re.Pattern` per Article 9 category at startup using a longest-first alternation: `(?:keyword_a|keyword_b|…)` with `re.IGNORECASE`
- Short keywords (≤ 4 chars) retain `(?<!\w)…(?!\w)` word-boundary anchors to prevent substring false positives
- `_check_special_category()` uses the compiled patterns via `pattern.finditer()` instead of a sequential `str.find()` loop over up to 459 entries
- Startup log now reports compiled category count: `Loaded 459 keywords (9 categories compiled)`
- **Performance:** ~10–50× faster for large tenants; negligible difference for typical school tenants (~100 flagged items); meaningful saving at 1 000+ items

---

## [1.2.2] — 2026-03-21

### Added — Profile selector in topbar (15c)

- **Profile dropdown** in the topbar, between the Scan button and the spacer — shows "Default (sidebar)" plus all saved profiles with their last run date
- **💾 Save button** next to the dropdown — prompts for a name and saves the current sidebar state (sources, options, user selection, retention settings) as a named profile via `POST /api/profiles/save`
- **`onProfileChange()`** — fires when the dropdown changes; calls `_applyProfile()` to populate the sidebar controls from the selected profile
- **`_applyProfile(profile)`** — sets all source checkboxes, scan options, retention fields, and queues user selection for when the accounts list is loaded
- **`_applyPendingProfileUsers()`** — applies a profile's `user_ids` to the accounts list after `loadUsers()` completes; safe to call multiple times
- **`loadProfiles()`** — fetches `/api/profiles` and populates the dropdown; called on `onAuthenticated()`
- **`saveCurrentAsProfile()`** — collects the full `buildScanPayload()` state and posts it as a new or updated profile
- Profiles with a description show it as a tooltip on the dropdown option
- Selecting "Default (sidebar)" clears `_activeProfileId` so the sidebar is used directly with no profile applied
- **Translated** — EN / DA / DE (6 new keys)

---

## [1.2.1] — 2026-03-21

### Added — Scan profiles 15a + 15b

**15a — Backend profile storage**

- `_profiles_load()` — reads all profiles from `~/.m365_scanner_settings.json`
- `_profiles_write()` — atomic write of the full settings dict
- `_profile_from_settings()` — wraps a flat settings dict as a profile object
- `_profile_get(name_or_id)` — case-insensitive lookup by name or UUID
- `_profile_save(profile)` — insert or update a profile
- `_profile_delete(name_or_id)` — delete by name or UUID
- `_profile_touch(id, scan_id)` — updates `last_run` and `last_scan_id` after a successful scan
- **Automatic migration** — on first run, existing flat `~/.m365_scanner_settings.json` is silently wrapped into a profile named "Default"; no user action required
- **Legacy shim** — `_save_settings()` and `_load_settings()` continue to work unchanged; all existing headless setups are unaffected
- **Profile API routes** — `GET /api/profiles`, `POST /api/profiles/save`, `POST /api/profiles/delete`, `GET /api/profiles/get` for future UI use (15c/15d)

**15b — CLI profile support**

- `--list-profiles` — tabular listing of all profiles with name, sources, last run, and scan ID
- `--save-profile NAME` — saves current CLI options as a named profile; updates existing if name matches
- `--delete-profile NAME` — removes a profile by name
- `--profile NAME` — loads a named profile for `--headless` runs; populates sources, retention, fiscal year end, and email recipients from the profile; prints profile name, description, and last run before scanning
- After a successful headless scan, the active profile's `last_run` and `last_scan_id` are updated automatically

---

## [1.2.0] — 2026-03-20

### Added — Article 9 sensitive category detection (#3)

- **`keywords/da.json`** — 459 Danish keywords across 9 Article 9 categories: health, mental health, criminal (Art. 10), trade union, religion, ethnicity, political, biometric, and sexual orientation. Includes `_false_positive_guidance` for ambiguous terms and `_proximity_note` explaining the matching strategy
- **`keywords/` subfolder** — mirrors the `lang/` pattern; `keywords/en.json` and `keywords/de.json` can be added without code changes
- **`_load_keywords()`** — loads the keyword file at startup matching the active UI language; falls back to `da.json`
- **`_check_special_category(text, cprs)`** — returns a sorted list of matched Article 9 category keys; a keyword only triggers when within 150 characters of a CPR number (proximity filter); if no CPRs are present in the text, any keyword occurrence triggers
- **Card badge** — purple `⚠ Art.9 — health, criminal` pill on flagged cards showing all detected categories
- **Filter bar dropdown** — "All risk levels / Art. 9 special category" quick filter in the results grid
- **DB migration #3** — `special_category TEXT NOT NULL DEFAULT '[]'` added to `flagged_items` via auto-migration; stored as JSON array
- **`finish_scan()`** — counts special category items per scan and writes to `scan_history.special_category` for trend tracking
- **Excel export** — "Special category" column added as column 3 on all per-source sheets
- **Article 30 report** — special category item count and DPIA warning added to the summary section; "Art. 9" column added to the per-source breakdown table with purple highlighting on non-zero values
- **Translated** — EN / DA / DE (6 new keys per language)
- **Build scripts** — `keywords/` folder bundled into PyInstaller app alongside `lang/`
- **`.gitignore`** — `!keywords/*.json` added to prevent keyword files being excluded by the `*.json` catch-all

---

## [1.1.3] — 2026-03-20

### Fixed

- **Stray duplicate `_get_bytes` body** — dead code block left after `delete_drive_item_for_user` from a previous edit has been removed

### Changed — `m365_connector.py`

- **Split timeouts** — replaced all hardcoded `timeout=30` / `timeout=60` with two tuned constants:
  - `_TIMEOUT_API = (10, 45)` — 10s connect, 45s read for JSON API calls
  - `_TIMEOUT_BYTES = (10, 120)` — 10s connect, 120s read for file/attachment downloads
  - The 10s connect timeout makes hung connections fail fast; the read timeout allows slow wireless links to complete a transfer without aborting

- **Exponential backoff with retry** — all four core request methods (`_get`, `_post`, `_get_bytes`, `_delete`) now retry up to 4 times on transient network errors:
  - Retried: `ConnectionError`, `Timeout`, `ChunkedEncodingError`, `ReadTimeout`, HTTP 429, HTTP 503, HTTP 504
  - Not retried: HTTP 403 (permission), HTTP 410 (delta token expired) — raised immediately
  - Backoff: 2s → 4s → 8s between attempts (capped at 30s); 429 responses use the `Retry-After` header value
  - Intermittent wireless dropouts and brief gateway errors are now absorbed transparently without interrupting a scan

- **Streaming file downloads** — `_get_bytes` now uses `stream=True` and `iter_content(65536)` so large attachments are received in 64 KB chunks rather than one blocking read; prevents read timeouts on slow connections for large files

- **`list_users` inline timeout** — the `_fetch` helper inside `list_users` was using its own hardcoded `timeout=30`; updated to use `_TIMEOUT_API`

---

## [1.1.2] — 2026-03-20

### Fixed

- **App does not start after build** — `m365_db.py`, `scanner_worker.py`, and `VERSION` were missing from PyInstaller `datas` in `build_m365.py`; the app crashed immediately on launch because these files could not be found inside the bundle
- **`_read_app_version()` broken in both build scripts** — still searched for `APP_VERSION = "..."` as a string literal in the scanner source, but both scanners now read from the `VERSION` file; build scripts updated to read `VERSION` directly
- **`VERSION` not bundled** — `build.py` (Document Scanner) also missing the `VERSION` file in `datas`

### Added

- **`--purge` CLI flag** — permanently deletes all data files created by the scanner (SQLite database, Azure credentials, SMTP credentials, settings, checkpoint, delta tokens, language preference, OCR cache, MSAL token cache); prompts for `yes` confirmation; `--yes` skips prompt for scripted use
- **`--export-db FILE`** — exports the database to a structured ZIP archive containing 8 JSON files; thumbnails excluded; CPR stored as hashes only
- **`--import-db FILE`** — imports a previously exported ZIP; `--import-mode merge` (default) adds dispositions and deletion log only; `--import-mode replace` wipes and restores all tables; `--yes` skips confirmation on replace

---

## [1.1.1] — 2026-03-19

### Fixed

- **Layout collapse in light mode** — `.topbar` CSS rule was broken by an earlier edit; `border-bottom` and `background` properties were orphaned onto a dangling line, causing the topbar to render with no background and the Scan button to be nearly invisible
- **Sidebar missing** — `.layout` used `height: 100vh` which ignored `body` padding, causing the flex layout to overflow and the sidebar to disappear
- **macOS pywebview titlebar overlap** — content rendered behind the traffic-light buttons; fixed with `padding-top: 30px` on `body` when running inside pywebview on macOS, combined with `box-sizing: border-box` and `height: 100%` on `.layout`
- **`<option>` elements not translated** — `applyI18n()` used `el.innerHTML` on `<option>` elements; some browsers do not re-render the select's visible text when `innerHTML` is set on an already-mounted option; switched to `el.textContent` for option elements
- **Disposition filter dropdown not translated on load** — filter bar is hidden until first scan result arrives so `applyI18n()` on `DOMContentLoaded` missed it; `applyI18n()` is now called when the filter bar is first shown
- **Card delete button z-index** — added `z-index: 1` to `.card-delete-btn` so it stacks correctly within its card context

### Added

- **`--reset-db` CLI flag** — permanently drops and recreates all database tables; shows a summary of what will be deleted and requires typing `yes` to confirm
- **`--yes` flag** — skips confirmation prompts; use with `--reset-db` for scripted/automated resets
- **`ScanDB.reset()`** — new method in `m365_db.py` that drops all tables in correct foreign-key order, resets `user_version` to 0, and reopens the connection with a fresh schema

---

## [1.1.0] — 2026-03-19

### Added — M365 Scanner

- **Student / staff role classification** — O365 license SKU IDs used to classify users as 🎓 Student or 👔 Staff with no extra Azure permissions required. Hardcoded known Microsoft Education SKU IDs cover M365/Office 365 A1/A3/A5 for Students and Faculty. Fragment fallback for future SKUs.
- **Role filter in accounts panel** — All / 👔 Ansat / 🎓 Elev buttons filter the user list before selecting accounts to scan
- **Role badge on result cards** — 🎓/👔 pill shown on every card in grid and list view
- **`user_role` in SQLite DB** — stored in `flagged_items` table; DB migration applied automatically on first run
- **Licensed users only** — accounts without an assigned O365 license are excluded from the user list
- **Disposition filter in filter bar** — filter results grid by compliance disposition status
- **Headless auto-delete of `delete-scheduled` items** — items tagged for deletion are removed automatically after each headless scan
- **Deletion audit log** — every deletion logged to `deletion_log` table with timestamp, actor, reason, and legal basis
- **`GET /api/db/deletion_log`** — API endpoint for the deletion log
- **Deletion log in Article 30 report** — dedicated section with summary-by-reason table and full 7-column log
- **Article 30 — student/staff split** — Section 3 (Data Inventory) now shows Staff and Student tables separately; parental consent note added for student items (Databeskyttelsesloven §6)
- **`GET /api/users/license_debug`** — diagnostic endpoint showing raw SKU IDs and classified roles for each user
- **`_resolve_display_name()`** — resolves GUIDs and "Microsoft Konto" guest account placeholders to email address throughout UI and Article 30 report
- **Account name in Article 30** — resolved via `user_ids` stored in scan options; GUID no longer shown in any column
- **All Article 30 strings translated** — deletion log section now uses `L()` throughout; 19 new keys in EN/DA/DE
- **`VERSION` file** — single source of truth; both scanners read version at startup via `Path(__file__).parent / "VERSION"`
- **`CHANGELOG.md`** — release history and versioning policy
- **`SECURITY.md`** — responsible disclosure process
- **`CONTRIBUTING.md`** — development setup, code style, PR process
- **`LICENSE`** — AGPL-3.0 with commercial licensing note and GDPR disclaimer
- **`.gitignore`** — covers credentials, databases, audit logs, venv, build artefacts

### Fixed — M365 Scanner

- Language switching no longer reloads the page — translations applied in-place, scan results preserved
- Connect screen freeze — duplicate `renderAccountList` function definition caused a JavaScript syntax error that prevented `onAuthenticated()` from firing
- Account column in Article 30 report showing GUIDs — resolved via `_acct_map` built from stored `user_ids`
- "Microsoft Konto" / GUID display names on cards and in reports — resolved to email address

### Changed — M365 Scanner

- **Excel export** — 9 columns (was 7): added Account (display name), Role, and Disposition; URL hyperlink column index updated accordingly
- **Accounts list** — licensed users only; `assignedLicenses` post-filter applied

---

## [1.0.0] — 2026-03-19 — Initial public release

### Document Scanner (`server.py`)

- Scan PDFs, Word, Excel, CSV, and image files for Danish CPR numbers
- OCR support via Tesseract for scanned/image-based PDFs
- NER-based detection of names, addresses, phone numbers, emails, IBANs, and bank accounts via spaCy
- CPR validation: strict Modulus 11 check + century-digit verification
- Redaction modes: mask CPR only, or full anonymisation of all personal data
- Face detection and blurring in image files via OpenCV
- Risk scoring per file based on CPR count, age, and PII density
- Dry-run mode — scan without writing any output files
- JSON audit log (`scanner_audit.jsonl`) — append-only, records every action
- SQLite OCR cache (`~/.document_scanner_ocr_cache.db`) — avoids re-OCR of unchanged pages
- Web UI on port 5000 with grid and list view, live progress, drag-and-drop upload
- Standalone macOS `.app` and Windows `.exe` via PyInstaller + pywebview

### M365 Scanner (`m365_scanner.py`)

#### Scanning
- Exchange mailboxes: all folders and subfolders, recursive, language-independent using `wellKnownName` identifiers
- OneDrive, SharePoint, Teams file scanning via Microsoft Graph API
- Attachment scanning: PDF, Word, Excel inside emails
- CPR detection with the same strict validator as the Document Scanner
- NER-based PII detection (phone, IBAN, bank account, name, address, org)
- Progressive streaming — results appear card-by-card via Server-Sent Events
- Incremental / resumable scans — checkpoint saved on interruption, resume on next run
- Delta scan — Graph `/delta` endpoints fetch only changed items since last scan
- Per-item thumbnail generation — image previews and placeholder SVGs

#### Results
- Results grid with grid and list view, search, source filter, and disposition filter
- Account name and role (🎓 Student / 👔 Staff) badge on every card
- 🗓 Overdue badge on items exceeding the retention cutoff
- Preview panel with iframe preview, metadata strip, and disposition dropdown

#### Compliance features
- **Retention policy enforcement** (GDPR Art. 5(1)(e)): rolling or fiscal-year cutoff (e.g. Bogføringsloven Dec 31), 🗓 Overdue badge, bulk-delete quick filter, headless auto-delete via `--retention-years` and `--fiscal-year-end`
- **Data subject lookup** (Art. 15/17): modal, CPR hashed before query, bulk delete with audit logging
- **Disposition tagging** (Art. 5(1)(a)): Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted — filter bar, preview panel, Excel export, headless auto-delete of scheduled items
- **Deletion audit log** (Art. 5(2)): every deletion logged with timestamp, actor, reason, legal basis
- **Article 30 report** (Art. 30): structured `.docx` export — summary, data categories, data inventory (staff and student sections), retention analysis, compliance trend, deletion audit log, methodology

#### User management
- Application mode (service account) and Delegated mode (device code flow)
- License-based role classification: 🎓 Student / 👔 Staff detected from O365 SKU IDs — no extra permissions needed
- Role filter buttons in accounts panel (All / 👔 Ansat / 🎓 Elev)
- Licensed users only — accounts without an assigned license are excluded
- Display name resolution: GUIDs and "Microsoft Konto" guest placeholders resolved to email address

#### Database (`m365_db.py`)
- SQLite persistence layer alongside JSON session cache
- Tables: `scans`, `flagged_items`, `cpr_index`, `pii_hits`, `dispositions`, `scan_history`, `deletion_log`
- CPR numbers stored as SHA-256 hashes only — never in plaintext
- Schema migration support via `_MIGRATIONS` + `user_version` pragma

#### Exports
- Excel export: 9 columns including Account, Role, Disposition; per-source sheets with auto-filter
- Article 30 Word document export
- Email report via SMTP (STARTTLS / SMTPS / plain); headless `--email-to` flag

#### Headless / scheduled mode
- `--headless --output DIR --settings FILE` for cron / Task Scheduler
- `--retention-years N --fiscal-year-end MM-DD` for automated retention enforcement
- `--email-to` for automated report delivery
- Non-interactive: deletes automatically; interactive (TTY): prompts for confirmation

#### Internationalisation
- Language files: English (`en`), Danish (`da`), German (`de`)
- Language switching applies in-place — no page reload, scan results preserved

#### Installation
- `install_windows.ps1`: Python, Tesseract, Poppler, venv — all local to project folder, no system PATH changes; all downloads via `curl.exe`
- `install_macos.sh`: Homebrew, Python 3.12, Tesseract, Poppler, spaCy model
- `Dockerfile` + `docker-compose.yml` for containerised deployment
- GitHub Actions: 4 parallel build jobs (Document Scanner + M365 × Windows + Linux), auto-release on `v*` tags

---

## Versioning policy

- **PATCH** (`1.0.x`) — bug fixes, translation updates, minor UI tweaks
- **MINOR** (`1.x.0`) — new feature, new suggestion from SUGGESTIONS.md implemented
- **MAJOR** (`x.0.0`) — breaking change: DB migration required, config format change, or Azure permission requirement change

To release a new version:

```bash
# 1. Update VERSION
echo "1.1.0" > VERSION

# 2. Update CHANGELOG (add new section above [1.0.0])

# 3. Commit and tag
git commit -am "Release 1.1.0"
git tag v1.1.0
git push && git push --tags
# GitHub Actions builds and publishes automatically
```