StyxX65 034ced943e Extended document redaction to Google Drive, SFTP, SMB, and local PDFs Extends the ✂ in-place redaction feature beyond local DOCX/XLSX/CSV/TXT files to cover all remaining file source types and adds PDF support for local files.

2026-05-28 17:47:02 +02:00

206 KiB

Raw Blame History

Changelog

All notable changes to GDPR Scanner are documented here.

Format follows Keep a Changelog. Version numbers follow Semantic Versioning.

[Unreleased]

Added

PDF redaction for local files — the ✂ redact button now works on local PDF files in addition to DOCX, XLSX, CSV, and TXT. Text-based PDFs are redacted using PyMuPDF's physical redaction (page.apply_redactions()), which removes the underlying text data from the PDF stream — not just paints over it. Scanned/image-based PDFs go through the OCR bbox path: CPR positions are found via Tesseract then physically painted and sanitised. Falls back to a reportlab overlay if PyMuPDF is not installed; raises a clear error if both libraries are absent.
Google Drive file redaction — the ✂ redact button now works on native DOCX, XLSX, and PDF files stored in Google Drive (both Google Workspace service-account and personal OAuth connectors). The file is downloaded via the Drive API, redacted locally using the same PyMuPDF / python-docx / openpyxl pipeline as local files, then uploaded back as a new revision via files().update(). Google Docs/Sheets exported as DOCX are detected by MIME type and refused with a clear message (re-upload after exporting manually). Requires the drive scope (not drive.readonly) on the service-account domain-wide delegation grant; a 403 surfaces the exact Google error so admins can add the scope. Methods added: get_drive_file_mime, download_drive_file_by_id, update_drive_file on both GoogleWorkspaceConnector and PersonalGoogleConnector.
SFTP file redaction — the ✂ button now works on SFTP files (DOCX, XLSX, CSV, TXT, PDF). The file is downloaded via paramiko, redacted locally, then written back with sftp.open(path, "wb"). Source config is matched from _load_file_sources() by host + username; credentials are resolved from the keychain via _resolve_sftp_credentials. Requires the item to be in the current session's state.flagged_items (SFTP host info is not stored in the DB). New method: SFTPScanner.write_file(remote_path, content).
SMB file redaction — the ✂ button now works on SMB/CIFS network share files (DOCX, XLSX, CSV, TXT, PDF). Source config is looked up by matching the host parsed from full_path (//host/share/…). File is downloaded and re-uploaded using smbprotocol with CreateDisposition.FILE_SUPERSEDE so the file is atomically replaced. New function: file_scanner.write_smb_file(path, content, username, password, domain).
AI-enhanced NER via Claude — Named Entity Recognition (names, addresses, organisations) can now be powered by Claude Haiku instead of spaCy. Enable in Settings → AI / NER: paste an Anthropic API key, toggle on, click Test to confirm. When enabled, document_scanner.py calls the Claude API (claude-haiku-4-5-20251001) instead of spaCy for all three scan engines; results are cached in-memory per document (bounded at 2 000 entries) so repeated scans of the same file never re-charge the API. Falls back to spaCy automatically if the key is missing or the anthropic package is not installed. API key stored in config.json under claude_api_key; toggle stored under claude_ner. Routes: GET/POST /api/settings/claude, POST /api/settings/claude/test.

Fixed

Settings modal too narrow for seven tabs — widened from 640 px to 720 px so all tab labels fit on one line without wrapping.

[1.6.28] — 2026-05-28

Added

Date-range scoping for viewer tokens — tokens can now carry optional valid_from and valid_to scope fields (YYYY-MM-DD). When set, GET /api/db/flagged filters items whose modified date falls outside the range. The share modal now shows two date inputs ("Items from" / "Items until") that apply to any scope type (all/role/user). The token list shows a green date-range badge when a range is stored. The server validates format and enforces valid_from ≤ valid_to. All three scope dimensions (role, user, date-range) are independent and combinable.
CPR-only mode — a new cpr_only scan option (sidebar toggle #optCprOnly, profile editor #peOptCprOnly) makes all three scan engines skip items that have no qualifying CPR numbers. Files whose only hits are email addresses, phone numbers, detected faces, or EXIF/GPS metadata are not flagged. The flag already detected is still shown on cards when cpr_only=false (default). Gated in all three engines: file scan skip condition, M365 email flagging, M365 file flagging, and Google Gmail/Drive flagging.
OCR language override — a new ocr_lang scan option (sidebar select #optOcrLang, profile editor #peOptOcrLang) lets operators choose the Tesseract language pack(s) used when scanning scanned PDFs and images. Presets: dan+eng (default), dan, eng, dan+eng+deu, dan+eng+swe, dan+eng+fra. The setting flows from the UI through the profile, into all three scan engines (M365 _scan_bytes_timeout, M365 attachments _scan_bytes, M365 files _scan_bytes, Google _scan_bytes for both Gmail and Drive). The lang parameter is threaded through cpr_detector._scan_bytes → document_scanner.scan_pdf / scan_image and the spawned PDF-OCR subprocess worker. The OCR cache key already included lang, so per-language results are cached independently.
Built-in file redaction for local files — a scissor button (✂) appears on cards for local DOCX, XLSX, CSV, and TXT files. Clicking it rewrites the file in-place with all detected CPR numbers replaced by ██████-████ (DOCX/XLSX) or █-blocks (CSV/TXT), then removes the card from the grid and logs a "redacted" disposition. The redaction is atomic: a temp file in the same directory is written first and then moved over the original, so a crash never leaves a half-written file. Implemented in routes/export.py (POST /api/redact_item) using the existing document_scanner redact functions; front-end in results.js (redactItem) with the button hidden for non-local or unsupported-extension items and for resolved/viewer-mode cards.
DELETE /api/delete_item route registration fix — the delete_item handler in routes/export.py was missing its @bp.route decorator, so the endpoint was never registered in Flask's URL map. The route now works correctly.
Scheduled report-only email job — scheduled jobs can now be configured as "report only" (toggle #schedReportOnly). When enabled, the job skips the scan entirely and instead emails the latest scan results already in the database. If the in-memory result list is empty (e.g. after a server restart), results are loaded from the DB via get_session_items(). M365 authentication is not required for report-only jobs — email is sent Graph-first if authenticated, SMTP otherwise. Jobs fail with a clear error if no scan results are available. The job list card shows a blue "Report only" badge. Setting report_only=True in the editor automatically enables "Email report automatically" and dims the Profile field (unused for report-only runs).
Compliance audit log — every significant admin action is now written to an immutable audit_log table in the scanner database. Recorded events: profile save/delete, viewer token create/revoke, viewer/interface/admin PIN set/change/clear, file source add/update/delete, scheduler job save/delete, scan start/stop, SMTP config save, single and bulk disposition changes, item delete, and item redact. Each record stores a Unix timestamp, an action key, a human-readable detail string, and the client IP address. Accessible via GET /api/audit_log (returns newest-first, max 1000 entries; filterable by ?action=). Visible in the Settings modal under a new Audit Log tab; the table refreshes whenever the tab is opened. The log_audit_event() module-level helper in gdpr_db.py silently no-ops if the DB is unavailable, so all call sites are safe in test and offline contexts.

Fixed

Stop button had no effect on Google Workspace scans — POST /api/scan/stop only set state._scan_abort (the M365/file abort event) and never touched state._google_scan_abort. Separately, _check_abort() inside _run_google_scan was checking gdpr_scanner._scan_abort (the M365 event) instead of the module-level _scan_abort alias that points to state._google_scan_abort. Both bugs combined meant neither the Stop button nor POST /api/google/scan/cancel had any effect on a running Google scan. Fixed by having scan_stop() set both events and having _check_abort() use the correct module-level alias.
Settings tab labels wrapping to two lines — adding the Audit Log tab pushed the six-tab row past the 540 px modal width, causing "E-mailrapport" (and similar long translations) to break onto a second line. The modal is now 640 px wide and tabs carry white-space:nowrap; .settings-tabs retains flex-wrap:wrap as a safety net on very small screens.

[1.6.27] — 2026-05-27

Added

Email body excerpt preserved for offline preview — when an M365 email or Gmail message is flagged, the first 500 characters of its plain-text body are stored in the card (body_excerpt), the checkpoint JSON, and a new body_excerpt DB column (migration #10). The M365 email preview now falls back to this excerpt when Graph is unavailable (not authenticated, token expired) or when resuming from a checkpoint without a live connection. The Gmail preview now shows the stored excerpt as the primary content (with the "Open in Gmail" link appended below) rather than the previous plain link-card. A helper _excerpt_page() in routes/database.py renders the excerpt with the same header layout as the full Graph-fetched preview.
Re-scan diff — resolved items in history view — when browsing a past scan session, items that were flagged in the immediately preceding session but are no longer present in the current one are automatically appended below a "N items no longer present" divider. Resolved items are greyed out and carry a green ✓ Resolved badge; the delete button is hidden since the file is already gone. The history banner updates to show the resolved count alongside the flagged count. The diff is computed client-side by fetching the previous session's items and comparing IDs — no new API endpoint needed. Implemented in history.js (loadHistorySession) and results.js (appendCard).
Google Workspace scan test suite — 19 new tests in tests/test_google_scan.py covering all three routes (GET /api/google/scan/users, POST /api/google/scan/start, POST /api/google/scan/cancel) and the core scan engine (_run_google_scan). Route tests verify: 401 when unauthenticated, 409 when scan already running, lock released on both normal completion and exception, abort event cleared on start. Engine tests verify: CPR hits are broadcast as scan_file_flagged, clean items are not, source_type is correctly set to "gmail" for Gmail items and "gdrive" for Drive items, and google_scan_done always fires with correct flagged_count / total_scanned values.

[1.6.26] — 2026-04-29

Fixed

Previous scan results visible when a new scan starts — two async functions (loadHistorySession and loadLastScanSummary) could resolve after startScan had already cleared the grid. loadHistorySession would re-populate the grid with old history items; loadLastScanSummary would re-show the last-scan summary card. Both functions now bail early after each await if any of the three scan-running flags (S._m365ScanRunning, S._googleScanRunning, S._fileScanRunning) is set — those flags are written synchronously by startScan before any awaits, so the check is race-free.
Selected card scrolls out of view when preview panel opens — clicking a card in grid view opens the 420 px preview panel, which shrinks the grid area and reflows the card columns. The selected card was no longer visible. openPreview() now schedules a requestAnimationFrame after removing .hidden from the panel so the card is scrolled back into view (scrollIntoView block: nearest) once the layout has settled.
Gmail and Google Drive preview crashed with a 404 Graph API error — _source_type was never set on Google items in routes/google_scan.py, so Gmail and Google Drive cards carried an empty source_type. The preview route in routes/database.py only checked for "local", "smb", and "email" before falling through to the M365 else-branch, which tried to call https://graph.microsoft.com/.../drive/items/gmail:{id}/preview — always a 404. Fixed by tagging Gmail items as _source_type = "gmail" and Google Drive items as "gdrive" at scan time. The preview route now handles both: Google Drive files get an embeddable https://drive.google.com/file/d/{id}/preview iframe; Gmail messages (not embeddable) show an info card with an "Open in Gmail" link. The state.connector (M365 auth) guard was also moved inside the email and M365 else branches so Google-only setups no longer receive a 401 when opening a Gmail or Drive preview.

[1.6.25] — 2026-04-25

Added

Checkpoint / resume for Google and File scans — stopping a Google Workspace or file (local/SMB/SFTP) scan mid-way and restarting now resumes from where it left off, exactly like M365 scans have always done. Each engine writes its own checkpoint file (checkpoint_google.json, checkpoint_file_{source_id}.json) every 25 items. On restart, previously found cards are re-emitted via SSE so the grid is repopulated before new items arrive. The Scan button now always checks for a live checkpoint before starting — if one exists the resume banner is shown regardless of whether the user reloaded the page. POST /api/scan/checkpoint returns a per-engine breakdown; POST /api/scan/clear_checkpoint wipes all checkpoint_*.json files. Google users' email addresses are included in the checkpoint payload from the frontend so the server can compute a matching key. checkpoint.py functions gained a prefix keyword argument (default "m365") — existing M365 call sites are unchanged.
CPR cross-referencing (related documents) — clicking any flagged card that contains CPR hits now shows a "Related documents" section in the preview panel listing other items from the same scan session that share at least one CPR number. Items are ordered by number of shared CPRs; clicking any entry opens it in the preview panel. Works in both live mode and history mode (respects ?ref=N). Powered by a self-join on the existing cpr_index table — no new data collection needed. New GDPRDb.get_related_items(item_id, ref_scan_id) method and GET /api/db/related/<item_id>?ref=N endpoint in routes/database.py. Frontend: #previewRelated div in the preview panel, _loadRelated(f) in results.js, window._openRelated(id, itemData) helper (looks up live S.flaggedData first, falls back to API response for history items).
Email address and Danish phone number detection — all three scan engines (M365, Google Workspace, local/SMB/SFTP) can now flag files and messages containing email addresses or Danish phone numbers in addition to CPR numbers. Detection is opt-in per profile: two new toggle options Scan for email addresses and Scan for phone numbers (default off) appear in the scan options panel and profile editor. When enabled, matches are stored as email_count / phone_count on each DB row and surfaced as colour-coded badges in list view, grid view, and the preview panel. Email regex requires a structurally valid address (local@domain.tld); phone regex covers 8-digit Danish numbers with optional +45/0045 prefix and common spacing patterns. Both are deduplicated before counting. Requires DB migration (adds two INTEGER columns to flagged_items; applied automatically on first startup via _MIGRATIONS).
SFTP as a 4th file connector — SFTP servers can now be added as file sources alongside local folders, SMB shares, and cloud sources. A new SFTPScanner class in sftp_connector.py implements the same iter_files() interface as FileScanner, so run_file_scan(), SSE broadcasting, DB persistence, card building, scheduled scans, and exports work without changes. Supports password auth and SSH private key auth (RSA, Ed25519, ECDSA, DSS); passphrases stored in the OS keychain. Key files uploaded via POST /api/file_sources/upload_key and stored in ~/.gdprscanner/sftp_keys/ with chmod 600. SFTP sources appear with a 🔒 icon in the sources panel. Requires paramiko>=3.4 (optional — scanner falls back gracefully if not installed). New source-type selector (Local / Network (SMB) / SFTP) replaces the SMB path-prefix auto-detection in the add-source form.
POST /api/file_sources/upload_key — new endpoint that validates and stores an SSH private key file, returning a key_path for use in the source definition.
SFTP entry in export SOURCE_MAP — Excel and Article 30 exports render SFTP sources as "🔒 SFTP" with a purple tint (EDE9F7), consistent with the existing per-source tab and summary table logic.

Fixed

File source form placeholders untranslated — all nine placeholder texts in the Add source and Edit source forms (source name, path, SMB host/user, SFTP host/user/path, passphrase) were hardcoded English strings. Nine new data-i18n-placeholder keys added to en.json, da.json, and de.json; all 12 affected <input> elements now carry data-i18n-placeholder attributes.
"Name" and "Auth" labels untranslated in SFTP form — the source-name label and the Auth toggle label in the add-source panel had no data-i18n attributes. Added keys m365_fsrc_name (DA: "Navn") and m365_fsrc_sftp_auth (same across languages). The name label used an inner <span data-i18n> to preserve the required-field * indicator, which would have been clobbered by a data-i18n on the outer <label> element. The same clobber bug was fixed for the m365_fsrc_label usage in the edit form.
Password field placeholder showed "Stored in OS keychain" in English — added translation key m365_fsrc_pw_keychain_placeholder (DA: "Gemt i OS-nøglering") and applied data-i18n-placeholder to the three password inputs across both forms (SMB add, SFTP add, SMB edit).

[1.6.24] — 2026-04-25

Fixed

Scheduler UI showed untranslated English strings — frequency labels ("Daily", "Weekly", "Monthly"), "Next:", "Running...", "Disabled", and both empty-state messages ("No scheduled scans yet." / "No scheduled runs yet") were hardcoded English strings in scheduler.js instead of using t(). All six call sites in schedLoad(), schedRenderJobs(), and schedLoadHistory() now call t() with the appropriate key. Three new translation keys added to en.json, da.json, and de.json: m365_sched_no_jobs, m365_sched_running, m365_sched_disabled.

[1.6.23] — 2026-04-21

Added

Video file metadata scanning — .mp4, .mov, .m4v, .avi, .mkv, .wmv, .flv, .webm files are now included in all scan sources (M365 OneDrive/SharePoint/Teams, Google Drive, local/SMB). No frame or audio analysis is performed; only container metadata is extracted: GPS coordinates (iPhone/Android QuickTime ©xyz atom, ISO 6709 format), author/artist, title, comment/description, and recording date. A smartphone recording with an embedded GPS location is flagged with the gps_location special category, exactly like a geotagged photo. AVI metadata (RIFF INFO INAM/IART/ICMT) is parsed without any external library. Requires mutagen>=1.47 (added to requirements.txt).
Audio file metadata scanning — .mp3, .flac, .ogg, .m4a, .aac, .wma, .wav, .opus, .aiff files are now scanned for PII-bearing tags across all sources. Extracted fields: title, artist, album artist, composer, lyricist, conductor, author, copyright, comment, description. No audio content is transcribed. Uses mutagen.File(easy=True) which normalises tag formats across ID3 (MP3), MPEG-4 (M4A/AAC), Vorbis (FLAC/OGG), and ASF (WMA) into a unified lowercase-key interface. A voice recording saved with a student's name in the artist tag will be flagged with exif_pii. Fixed a silent bug in _extract_audio_metadata where mutagen.File(io.BytesIO(content), filename) was passing the BytesIO as the filename positional argument; corrected to mutagen.File(fileobj=..., filename=...).
Audio and video test fixtures — tests/fixtures/local_files/generate_fixtures.py now generates 6 new fixtures: 14_audio_artist_pii.mp3, 15_audio_artist_pii.flac (artist name → flag), 16_audio_no_pii.mp3, 17_audio_no_pii.flac (no tags → no flag), 18_video_gps.mp4 (GPS + artist → flag), 19_video_no_pii.mp4 (no tags → no flag). Total fixtures: 19 (14 flagged, 5 negative).

Fixed

Audio and video files not appearing in local/SMB file scan — file_scanner.py maintained its own hardcoded DEFAULT_EXTENSIONS set that was never updated when video and audio extensions were added to cpr_detector.SUPPORTED_EXTS. Fixed by importing SUPPORTED_EXTS from cpr_detector directly; DEFAULT_EXTENSIONS is now an alias for it. cpr_detector.SUPPORTED_EXTS is the single source of truth for all scan sources (M365, Google Drive, local, SMB).
Profile copy rename not reflected in left column until modal reopen — saving a renamed profile via the full editor (_pmgmtSaveFullEdit) called loadProfiles() to refresh S._profiles but never called _renderProfileMgmt(), so the left-column list was not repainted. The new name only appeared after closing and reopening the modal. Fixed by calling _renderProfileMgmt() immediately after loadProfiles() and re-applying the .active highlight to the correct row. 10 new route integration tests added for all profile API endpoints; total test count: 182.

[1.6.22] — 2026-04-21

Added

Auto-email after manual scan — a new Email report after manual scan toggle in Settings → Email report sends the Excel report to the configured recipients automatically when a manual scan completes. Disabled by default. Stored as auto_email_manual in smtp.json. Uses the same Graph-first → SMTP-fallback path as scheduled scan auto-email. Only fires when there are flagged items and at least one recipient is saved; errors are logged but never surface to the UI (the scan result is unaffected).
Route integration test suite — 44 new tests in tests/test_route_integration.py covering security-sensitive and data-correctness paths: viewer token CRUD, role and user scope enforcement on GET /api/db/flagged, bulk disposition isolation, viewer PIN set/verify/rate-limit/clear, interface PIN gate and multi-step flows, scan lock release on run_scan() exception, and GET /api/db/sessions shape and ordering. Total test count: 172.

Fixed

Role scope filter silently returned nothing — GET /api/db/flagged filtered rows by row.get("role") but the column returned from the DB is user_role. Role-scoped viewer tokens ({"role": "student"} or {"role": "staff"}) therefore excluded every item and returned an empty list. Fixed in routes/database.py.
Historical session query included newer scans — gdpr_db.get_session_items(ref_scan_id=N) used a lower-bounded window (started_at >= ref.started_at - 300) with no upper bound, so any scan that started after the historical reference was also returned. Viewing a past session in the history browser would show items from all subsequent scans as well. Fixed by adding an upper bound (started_at BETWEEN ref.started_at - 300 AND ref.started_at + 300).
Scan button stuck disabled after file scan — run_file_scan broadcast a scan_start SSE event, which the scan_start handler in _attachSchedulerListeners intercepted and set S._m365ScanRunning = true. When file_scan_done fired it checked !S._m365ScanRunning before re-enabling the button — finding it still true, the button stayed disabled permanently. No scan_done (M365) ever arrives to clear the flag. Fixed by removing the scan_start broadcast from run_file_scan; the scan_phase "Files — …" event immediately following already sets _fileScanRunning correctly via the phase-source detection in _attachScanListeners.
TypeError: unhashable type: 'dict' during file and M365 scans — _distinct_cprs = list(dict.fromkeys(cprs)) in both scan paths treated cprs as a list of strings, but extract_matches returns a list of dicts ({"formatted": "…", "page": …, …}). The deduplication crashed on the first file that contained CPR numbers, aborting the scan loop. Fixed in both run_file_scan (line 251) and run_scan (line 1100) by keying on c["formatted"]: list(dict.fromkeys(c["formatted"] for c in cprs)).
Profile applied early lost user selection and source checkboxes — two startup race conditions: (1) Profiles with user_ids = "all" applied before the M365 user list had loaded ran .forEach() on an empty array (no-op); when loadUsers() completed it defaulted all users to selected = false with nothing to override, leaving the accounts panel completely unchecked. Fixed by adding a _pendingProfileAllUsers deferred flag mirroring the existing _pendingProfileUserIds mechanism — loadUsers() applies it after populating S._allUsers. (2) If the profile was selected in the narrow window before _loadFileSources() returned and rendered the sources panel, _applyProfile() iterated zero checkboxes and the source selection was silently discarded; a subsequent renderSourcesPanel() call then re-rendered all sources as checked (their default). Fixed by calling renderSourcesPanel() in _applyProfile() when no source checkboxes are present in the DOM yet — same guard already used in loadUsers().

[1.6.21] — 2026-04-20

Added

Local-file scan test fixtures — tests/fixtures/local_files/ contains 13 ready-made files (.txt, .csv, .docx, .xlsx) covering every detection scenario: CPR with explicit label, mod-11–valid CPR without label, post-2007 CPR with/without context keyword, protected number (day+40), multiple CPRs in one file, mixed PII (CPR + email + Art. 9 health data), and three true-negative cases (clean content, invoice false-positive, post-2007 serial number without context). All CPR numbers are mathematically valid; false-positive fixtures are verified to produce zero hits. Run generate_fixtures.py to regenerate the binary files.
Interface PIN — optional session-level authentication gate for the main scanner interface. Set a 4–8 digit PIN in Settings → Security → Interface PIN; anyone reaching http://host:5100 is redirected to /login and must enter the PIN before accessing scan controls, settings, or results. Viewer tokens and the /view route are completely unaffected — reviewers continue to use their own auth chain. The PIN is stored as a salted SHA-256 hash in config.json. Brute-force protection: 5 failed attempts per IP locks out for 5 minutes. A POST /api/interface/logout endpoint clears the session. PIN management via GET/POST/DELETE /api/interface/pin.

Fixed

"Vælg" (select mode) button did nothing — toggleSelectMode, toggleCardSelect, selectAllVisible, and applyBulkDisposition were defined inside an ES module but never assigned to window, so all onclick attributes calling them silently failed. Added the four missing window.* exports at the bottom of results.js.
Progress counter frozen at M365 total during Google/file scan — the scan_progress handler in scan.js only updated progressStats and progressEta for source === "m365". When M365 finished first, the counter stayed at its final value (e.g. "15083 / 15083 ETA 0s") for the entire duration of the Google and file scans. Fixed in two places: scan_done now clears the stats/ETA elements immediately when another scan is still running; scan_progress for Google/file sources now shows a running "X scanned" count (using the scanned field those engines already send) and clears ETA, but only while M365 is not running — M365 stats continue to dominate during concurrent scans.
PDF OCR kills process on large files — document_scanner previously called convert_from_path() once for the entire PDF before the processing loop, allocating all page images in memory simultaneously. A 50-page A4 PDF at 300 DPI required ~1.3 GB in a single allocation, triggering the OS OOM killer. Fixed by rendering one page at a time with convert_from_path(first_page=N, last_page=N) inside the loop across scan_pdf, redact_fitz_pdf, and redact_pdf. Peak OCR memory is now bounded to roughly one page (~26 MB at 300 DPI) regardless of document length.
No bulk disposition tagging — each result card had to be opened individually to set a disposition. Added a Select mode (filter bar "Vælg" button) that reveals per-card checkboxes. Selecting one or more items shows a bulk tag bar at the bottom of the grid with a disposition dropdown and Apply button. Calls POST /api/db/disposition/bulk; updates all selected items in-memory and clears the selection. "Select all visible" / "Deselect all" toggle available in the bar. Hidden in viewer mode.
No disposition progress summary — added a thin stats bar between the filter bar and the grid showing total · unreviewed · retain · delete · % reviewed. Updates after every single or bulk disposition save and after each grid render. Unreviewed count is highlighted in red until everything is tagged; turns green at 100%.
Google Drive always did a full scan — Drive scanning in routes/google_scan.py used conn.iter_drive_files() on every run, re-downloading every file regardless of what changed. Added Google Drive delta scan using the Drive Changes API. When delta is enabled in scan options, the first run records a Changes API start page token per user (gdrive:{email} key in delta.json). Subsequent runs call conn.get_drive_changes(user_email, token) and only process files that have been added or modified since the last scan. Invalid or expired tokens fall back to a full scan automatically. Token save loads the current delta.json fresh before writing to avoid racing with concurrent M365 token saves. google_scan_done SSE event now includes delta and delta_sources fields.
No memory guard before OCR page renders — added _ocr_mem_ok() check (psutil.virtual_memory().available >= 500 MB) before each page render in all three OCR paths. Pages that would exceed the threshold are skipped and recorded as "skipped" in page_methods with a printed warning rather than crashing the scan.

[1.6.20] — 2026-04-18

Fixed

Graph sendMail reported as failure despite email being delivered — _post() in m365_connector.py called r.json() unconditionally after raise_for_status(). The Graph sendMail endpoint returns HTTP 202 with an empty body on success, causing json.JSONDecodeError: Expecting value: line 1 column 1 (char 0). This was caught by the smtp_test exception handler and surfaced as an error even though the email had been sent. Fixed by returning r.json() if r.content else {} so any Graph endpoint that responds with no body (sendMail, delete operations, etc.) is handled correctly.
Graph error hidden when SMTP host not configured — when Graph failed and no SMTP host was saved, smtp_test returned the generic "No SMTP host configured" message, swallowing the actual Graph error. The if not host branch now surfaces the Graph exception text alongside the Mail.Send permission guidance so the real cause is visible.
Gmail vs Google Workspace SMTP error messages — the auth failure handler now detects whether the username is a personal Gmail address (@gmail.com) or a Google Workspace custom-domain account, and shows a different message for each. Personal Gmail: existing App Password troubleshooting steps. Google Workspace: explains that SMTP access is controlled by the Workspace admin console (2-Step Verification policy, SMTP relay service), not the user's personal security settings.

[1.6.19] — 2026-04-18

Fixed

Gmail SMTP error message misleading when App Password already in use — the auth failure handler in both smtp_test and send_report unconditionally told the user to "create an App Password", even when they were already using one. Gmail returns the same 535 / Username and Password not accepted error for a wrong app password, a revoked app password, spaces left in the 16-character code, or a wrong username — none of which are helped by the old message. The Gmail branch now lists the three most common causes (spaces in the code, revoked password, wrong username) and still links to the App Password page to generate a new one. The Microsoft personal account branch is unchanged.

[1.6.18] — 2026-04-18

Fixed

Art.30 and Excel exports missing GWS and local/SMB sources — two silent failures caused Google Workspace and file-scan results to be absent from all exports after a page reload.
- routes/google_scan.py: called _db.end_scan() (method does not exist on GDPRDb — the correct name is finish_scan). The resulting AttributeError was swallowed by the bare except Exception: pass guard, so finished_at was never written on GWS scan records. Since get_session_items() requires finished_at IS NOT NULL, every GWS scan was permanently invisible to both export functions.
- routes/google_scan.py: emitted "scan_done" at completion instead of "google_scan_done", causing the M365 done handler to fire for Google scans and breaking the SSE teardown logic.
- scan_engine.py (run_file_scan): called _db.begin_scan(sources=…, user_count=0, options=source) with keyword arguments, but begin_scan(self, options: dict) only accepts a single positional dict. The TypeError was caught silently, leaving _db_scan_id = None; all subsequent save_item calls were skipped, so local and SMB items were never written to the database.

[1.6.17] — 2026-04-18

Added

Scan history browser — results from any past scan session can now be reviewed without running a new scan. On page load, when no scan is running, the last completed session is automatically loaded into the results grid. A History banner appears above the filter bar showing the session date, scanned sources, and item count. A Sessions button in the banner opens a dropdown listing all past sessions newest-first, each showing date, time, source labels, item count, and Delta / Latest badges. Clicking a session loads its items. A Latest scan button (shown only when browsing a past session) jumps back to the most recent session. Starting a new scan exits history mode and takes over the grid with live SSE results. Session cache is invalidated on each scan completion so the picker always reflects the true state of the database.
- gdpr_db.py — new get_sessions(limit, window_seconds) groups all completed scans by the 300-second concurrent-scan window and returns session summaries newest-first. get_session_items() gains an optional ref_scan_id parameter to anchor the session window to any past scan.
- routes/database.py — new GET /api/db/sessions; GET /api/db/flagged now accepts ?ref=<scan_id> to serve items for a specific historical session.
- static/js/history.js (new) — loadHistorySession(refScanId), openHistoryPicker(), closeHistoryPicker(), exitHistoryMode(), invalidateHistoryCache() all exposed on window.
- state.js — _historyRefScanId: null tracks which session is currently displayed (null = live/SSE).
- results.js — initial status check calls loadHistorySession(null) instead of loadLastScanSummary().
- scan.js — startScan() calls exitHistoryMode(); all three *_done handlers call invalidateHistoryCache().
User-scoped viewer tokens (#34) — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a User option that opens a searchable name autocomplete backed by the already-loaded S._allUsers list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as {"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}. Server-side enforcement in GET /api/db/flagged filters WHERE account_id IN (list) so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (#viewerIdentityBadge); #filterRole is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (account_id = "") never appear in user-scoped views — consistent with the existing role-scope behaviour.

[1.6.16] — 2026-04-18

Added

User-scoped viewer tokens (#34) — viewer token links can now be restricted to a specific person so the recipient sees only their own flagged files, across both M365 and Google Workspace. The Share modal's scope selector gains a User option that opens a searchable name autocomplete backed by the already-loaded S._allUsers list. Typing filters by display name or email; each row shows the person's full name, role badge, and all associated email addresses (M365 UPN and GWS email shown together for dual-platform users). Selecting a name fills the input with the display name and stores both email addresses internally. Scope is stored as {"user": ["alice@m365.dk", "alice@gws.dk"], "display_name": "Alice Smith"}. Server-side enforcement in GET /api/db/flagged filters WHERE account_id IN (list) so items from either platform are included. The viewer header shows the person's full name in a locked identity badge (#viewerIdentityBadge); #filterRole is hidden. Token rows in the Active links list show the display name badge. Free-text email entry still works as a fallback when no accounts are loaded. File-scan items (account_id = "") never appear in user-scoped views — consistent with the existing role-scope behaviour.

[1.6.15] — 2026-04-12

Added

Role-scoped viewer tokens — viewer token links can now be restricted to a single role so the recipient can only see student or staff items. A new Role scope dropdown (All roles / Ansatte / Elever) in the Share modal is selected when creating a token. The scope is stored as "scope": {"role": "student"|"staff"} in viewer_tokens.json. Enforcement is two-layered: GET /api/db/flagged filters items server-side using session["viewer_scope"].role set at token validation time; the #filterRole dropdown in the viewer is pre-set and hidden so the constraint cannot be bypassed client-side. Tokens without a scope field (existing tokens, PIN sessions) remain unrestricted. Role badge (Ansatte / Elever) shown on each scoped token row in the Active links list.
Role filter in results + role-scoped exports — a new Role dropdown in the filter bar (All roles / Ansatte / Elever) narrows the results grid to staff or student items. Clicking Excel or Art.30 while a role is selected exports only that group — the ?role=student|staff param is forwarded to both export endpoints. _build_excel_bytes() and _build_article30_docx() now accept a role param; all internal sheets (GPS, External transfers, Art.30 staff/student tables) respect the filter. Filenames get an _elever or _ansatte suffix.
Scan filter options for student environments — two new profile options reduce noise when scanning student accounts:
- Ignore GPS in images (skip_gps_images) — images whose only PII signal is an embedded GPS coordinate are not flagged. Smartphones embed location in every camera photo by default, generating large numbers of low-priority flags in school contexts. GPS data is still extracted and shown in the detail card when the image is flagged by another signal (faces, EXIF author/comment). Applies to M365, Google, and file scans.
- Min. CPR count per file (min_cpr_count, default 1) — a file is only flagged if it contains at least this many distinct CPR numbers. Set to 2 to avoid reporting a student's own consent form or registration document (one CPR) while still flagging class lists and grade sheets with multiple students' CPRs. Deduplication is by value — a CPR repeated 10 times counts as 1 distinct number. Applies to M365, Google, and file scans.
- Both options are saved in profiles and editable in the Profile Manager editor.
GitHub Actions CI/CD — macOS build — .github/workflows/build.yml now also builds a macOS .app bundle (macos-15, Apple Silicon ARM64) on every push to main and on v* tags. Released as GDPRScanner_macos_arm64.zip. (Originally macos-13 / Intel, changed when GitHub retired that runner.)

Fixed

OneDrive 404 errors during delta scans — GET /users/{id}/drive/root/delta returns 404 for users with no OneDrive licence, a disabled service plan, a drive that was never provisioned (account never signed in), or a suspended account. Previously these 404s fell through to requests.raise_for_status() and were caught by the generic except Exception handler in _scan_user_onedrive, broadcasting a red scan_error card. Full scans never showed the error because _iter_drive_folder_for has a bare except Exception: return. Fixed by adding M365DriveNotFound(M365Error) to m365_connector.py, raising it from _get() on HTTP 404, and handling it explicitly in _scan_user_onedrive with a scan_phase broadcast ("OneDrive (user): not provisioned — skipped") before the generic exception handler.
CI — Windows artifact never uploaded — PyInstaller --onedir puts the exe inside dist/GDPRScanner/, not at dist/*.exe. The artifact glob never matched, so no Windows build appeared in releases. A PowerShell packaging step now zips dist\GDPRScanner\ into GDPRScanner_windows_x64.zip (mirroring the existing Linux step).
EFFORT_ESTIMATE.md — build effort estimate document covering component-by-component hour breakdowns and complexity drivers for the project.
Settings → Security tab — new dedicated pane in the Settings modal. Admin PIN and Viewer PIN groups moved here from the General tab, which now contains only Appearance and About. The Share modal's Configure button navigates directly to the Security tab.
Viewer mode layout — the sidebar, log panel, and progress bar are now hidden in viewer mode so results fill the full window width. The 🔍 GDPRScanner brand is shown in the top-left of the topbar (replacing the sidebar header) at the same size and weight as the normal sidebar title.
Share modal — Revoke / Copy buttons broken — JSON.stringify(token) produced a double-quoted string that terminated the surrounding onclick="…" HTML attribute early, so neither button fired its handler. Both now pass the token as a single-quoted JS string literal, which is safe for the hex token format.
Viewer PIN — Clear PIN rejected with "current PIN is incorrect" — clicking Clear PIN without first typing in the Current PIN field sent an empty string to the server, which correctly rejected it. A client-side guard now validates the field is non-empty before sending the request, and focuses the input with an inline error message if it is empty.
Share modal — all UI strings now translated — the Share results modal and Viewer PIN settings group were fully hardcoded in English. All visible strings are now backed by i18n keys (share_*, viewer_pin_*) in en.json, da.json, and de.json.
Excel / ART.30 export — Gmail and Google Drive missing from summary — by_source was built from flagged items only, so sources that produced zero hits were silently skipped. Both the Excel Summary sheet and the ART.30 "Breakdown by source" table now include every source that was actually scanned, showing 0 items and 0 CPR hits where nothing was found. New GDPRDb.get_session_sources() method reads the sources JSON column from all scans in the current session window to determine which sources ran.
Scan never finishes when M365 + Google run concurrently — scan_done (M365 finished) was closing the SSE connection immediately via S.es.close(), even when S._googleScanRunning or S._fileScanRunning was still true. The google_scan_done / file_scan_done events therefore never arrived, leaving the progress bar stuck at 100% indefinitely. SSE teardown is now deferred until the last concurrent scan completes: scan_done only closes the connection if neither Google nor File is still running; google_scan_done and file_scan_done close it when they are the final scan to finish.

[1.6.14] — 2026-04-10

Added — read-only viewer mode (#33)

A DPO, school principal, or compliance coordinator can now review scan results and tag dispositions without access to scan controls, credentials, or settings.

Token links

New 🔗 Share button in the topbar opens a token management modal.
Create generates a 64-char hex token (secrets.token_hex(32)) with an optional label and expiry (7 d / 30 d / 90 d / 1 yr / never).
Copy copies the full http://host:5100/view?token=… URL to the clipboard.
Revoke deletes the token immediately; any browser using it is locked out on next navigation.
Tokens are stored in ~/.gdprscanner/viewer_tokens.json with created_at, expires_at, and last_used_at metadata. Expired tokens are cleaned up on each list fetch.

PIN alternative

A 4–8 digit numeric PIN can be set in Settings → General → Viewer PIN.
Opening /view without a token shows a PIN entry form (templates/viewer_pin.html).
Correct PIN sets a Flask session cookie (session["viewer_ok"]) valid for the browser session — no token needed after that.
Brute-force guard: 5 failed attempts per 5 minutes per IP returns 429.
PIN stored as salted SHA-256 inside viewer_tokens.json (no extra dependencies).

/view route

Checks ?token= first (validates + binds session), then existing session cookie, then PIN form (if a PIN is configured), then 403.
Serves the same index.html with window.VIEWER_MODE = true injected.
Invalid/expired tokens show templates/viewer_denied.html.

Viewer mode (JS)

auth.js — bypasses M365 auth check entirely; adds viewer-mode class to <body>; shows scanner screen immediately.
results.js — on DOMContentLoaded calls _loadViewerResults() which fetches GET /api/db/flagged (all items from the last completed scan session, joined with dispositions) and renders the grid directly — no SSE required.
CSS (body.viewer-mode) hides: Sources/Options/Accounts sidebar panels; Scan/Stop buttons; profile bar; config-group buttons; resume banner; bulk-delete button; per-card delete button; data-subject delete button; Share button.
Disposition tagging (select + Save) remains fully functional — /api/db/disposition has no auth guard.
Filter bar, Excel export, Art.30 export, preview panel, and log remain accessible.

New files: routes/viewer.py, static/js/viewer.js, templates/viewer_pin.html, templates/viewer_denied.html

Files changed: app_config.py, gdpr_scanner.py, templates/index.html, static/style.css, static/js/auth.js, static/js/results.js, static/js/scheduler.js, routes/database.py

Fixed — memory exhaustion during large M365 scans

Addressed root causes of runaway memory growth (reported: up to 90 GB RSS) that could crash the host machine during scans of large Microsoft 365 tenants.

scan_engine.py

Email body HTML stripped at collection time — Graph API returns the full body field (raw HTML, up to ~1 MB per message) for every email fetched. Previously, all message dicts — including the raw HTML — were accumulated in work_items before any scanning began. For 1 000 users × 2 000 emails this could mean >100 GB in work_items alone. The body is now converted to plain text immediately on collection (_precomputed_body), and the raw body and bodyPreview keys are deleted from the dict before it is queued. The processing loop reads _precomputed_body via pop() and dels it after use.
work_items converted to deque before processing — items are now released from memory one by one via popleft() as they are processed, rather than keeping the entire list alive for the duration of the scan. gc.collect() is called immediately after conversion and after each checkpoint save.
content bytes freed as early as possible in the file processing branch — raw download bytes are now del'd immediately after content.decode() (before the expensive NER/PII pass), and also in the no-hits else branch where they were previously kept alive until the next loop iteration.
body_text freed after use in the email branch — del body_text added after _broadcast_card so large plain-text bodies do not linger until the next iteration.
Memory guard before file downloads — uses psutil.virtual_memory().available to skip a file download and log a warning if fewer than 300 MB of RAM are available, preventing a single large file from pushing an already-pressured machine into OOM.

document_scanner.py

PDF OCR page images freed page by page — convert_from_path() renders all pages at 300 DPI before scanning begins (~26 MB per A4 page; a 100-page PDF ≈ 2.6 GB). Each rendered PIL.Image is now nulled out (images[page_num-1] = None) immediately after OCR, so only one page image is live at a time instead of the entire document.

Changed — Sources panel is now resizable and collapsible

The KILDER sidebar panel now behaves consistently with the other sidebar sections.

Collapsible — the ▾ / ▸ toggle was already wired up; collapse state is already persisted in localStorage. No change needed here.
Resizable — a drag handle (sources-resize-handle) added at the bottom of the panel body. Dragging up shrinks the panel (scroll appears); dragging down is capped at the panel's natural content height — you cannot expand it beyond what is needed to show all sources. Height preference persisted in localStorage under gdpr_sources_h.
Auto-fit on render — _fitSourcesPanel() is called at the end of every renderSourcesPanel() invocation. On first load and whenever sources are added or removed (e.g. connecting Google), the panel height snaps to exactly fit all visible sources. A previously saved smaller height is honoured only if it is still smaller than the new content height; dragging back to full height clears the saved preference.
The old max-height: calc(5 * 26px) fixed cap is removed.

Files changed: templates/index.html, static/style.css, static/js/log.js (_fitSourcesPanel, _initSourcesResize), static/js/sources.js, static/js/results.js.

[1.6.13] — 2026-04-10

Added — developer tooling

run_tests.sh — shell script to activate the venv and run the full test suite. Accepts any pytest arguments: ./run_tests.sh, ./run_tests.sh -q, ./run_tests.sh tests/test_app_config.py.
Directory-scoped CLAUDE.md rules — routes/CLAUDE.md, static/js/CLAUDE.md, templates/CLAUDE.md, lang/CLAUDE.md replace the previous single-file context document. Each file is loaded automatically by Claude Code only when working in the relevant directory.

Fixed — documentation

README.md project files table — removed four phantom entries (Dockerfile, docker-compose.yml, .dockerignore, scanner_audit.jsonl); corrected static/app.js description to "archived monolith — no longer loaded"; fixed manual paths (MANUAL-EN.md → docs/manuals/MANUAL-EN.md); added missing files: scan_engine.py, sse.py, checkpoint.py, app_config.py, cpr_detector.py, google_connector.py, static/style.css, static/js/*.js, routes/google_auth.py, routes/google_scan.py, run_tests.sh, docs/setup/ guides.
docs/manuals/MANUAL-EN.md, docs/manuals/MANUAL-DA.md — version header updated from 1.6.11 → 1.6.13; footer updated from v1.6.8 → v1.6.13.

Changed — blueprint migration batch 3, 4, 5 (auth, database, export — migration complete)

All remaining direct @app.route registrations removed from gdpr_scanner.py. Flask now routes every API endpoint exclusively through its blueprint. Only GET / and GET /api/scan/stream (SSE) remain in gdpr_scanner.py.

routes/auth.py — rewritten with direct imports (batch 3, 6 routes):

MSAL_OK, M365Connector, M365Error imported from m365_connector
_load_config, _save_config imported from app_config
Dead module-level globals _pending_flow and _auth_poll_result removed from gdpr_scanner.py
Routes removed: /api/auth/status, /api/auth/start, /api/auth/poll, /api/auth/userinfo, /api/auth/signout, /api/auth/config

routes/database.py — rewritten with direct imports (batch 4, 15 routes):

_get_db, DB_OK from gdpr_db; _set_admin_pin, _verify_admin_pin, _admin_pin_is_set from app_config; _clear_checkpoint, _DELTA_PATH from checkpoint; _extract_exif, _html_esc, _placeholder_svg from cpr_detector
SCANNER_OK determined by local import document_scanner try/except
db_export improved: uses NamedTemporaryFile instead of mktemp (safer for frozen apps)
Email preview HTML: full CSS ruleset (*, *::before, *::after, img, table, scrollbar) from gdpr_scanner.py version restored
Routes removed: /api/db/stats, /api/db/trend, /api/db/scans, /api/db/subject, /api/db/overdue, /api/db/disposition (×2), /api/db/deletion_log, /api/db/reset, /api/admin/pin (×2), /api/db/export, /api/db/import, /api/preview/<item_id>, /api/thumb

routes/export.py — rewritten with direct imports (batch 5, 3 routes):

_get_db, DB_OK from gdpr_db; _GUID_RE, _resolve_display_name from app_config; M365PermissionError from m365_connector
app.logger replaced with logging.getLogger(__name__)
Dead delete_item() helper removed from gdpr_scanner.py (was unreachable; blueprint has its own copy)
Routes removed: /api/export_excel, /api/export_article30, /api/delete_bulk

tests/test_routes.py — db_patch fixture updated: now patches routes.database._get_db / routes.database.DB_OK and routes.export._get_db / routes.export.DB_OK (was patching gdpr_scanner._get_db/gdpr_scanner.DB_OK which no longer have any effect). Two test_without_db_returns_503 tests updated to monkeypatch routes.database.DB_OK instead of gdpr_scanner.DB_OK.

[1.6.12] — 2026-04-10

Fixed — profile editor save drops users from non-active role groups

In _pmgmtSaveFullEdit (profile management editor), the save function applied the active role filter (_pmgmtRoleActive) to the list of checked checkboxes before saving. Since _pmgmtFilterAccounts hides rows via display:none but does not uncheck them, users from other role groups that remained checked (but hidden) were silently discarded on save. The role filter at save time is removed — all checked checkboxes are now captured regardless of which role tab is visible.

[1.6.11] — 2026-04-10

Changed — blueprint migration batch 1 (scan + app_routes)

15 direct @app.route registrations removed from gdpr_scanner.py. Flask now routes all of these exclusively through their blueprint counterparts, which previously existed as dead code shadowed by the direct routes.

routes/scan.py — rewritten with direct imports (was entirely non-functional as dead code due to bare-name NameErrors behind the shadow):

Added GET /api/scan/status (new — was only in gdpr_scanner.py)
Added GET /api/src_toggles, POST /api/src_toggles (new — was only in gdpr_scanner.py)
scan_checkpoint_info — added missing check_only handling present in the gdpr_scanner.py version
All state references converted from bare names to state._scan_lock / state._scan_abort; run_scan imported lazily from scan_engine inside _run to avoid circular imports
_save_settings, _load_settings, _load_src_toggles, _save_src_toggles imported from app_config
_checkpoint_key, _load_checkpoint, _clear_checkpoint, _load_delta_tokens, _DELTA_PATH imported from checkpoint

routes/app_routes.py — cleaned up:

APP_VERSION now computed locally from VERSION file (was a bare-name reference to gdpr_scanner.py global)
_LANG_DIR computed at module level; fixed sys / _sys alias mismatch in get_langs (bug in blueprint that never manifested while shadowed)
_set_lang_override, _load_lang_forced imported directly from app_config
get_langs — added missing langs.sort() present in the gdpr_scanner.py version

tests/test_routes.py — mock_connector fixture simplified: no longer needs to patch gdpr_scanner._connector since the direct scan/start route is gone; state.connector alone is sufficient. run_scan stub in test_authenticated_returns_started updated to target scan_engine directly.

Routes removed from gdpr_scanner.py: /api/about, /api/langs, /api/set_lang, /api/lang, /api/scan/status, /api/scan/start, /api/scan/stop, /api/scan/checkpoint, /api/scan/clear_checkpoint, /api/settings/save, /api/settings/load, /api/src_toggles, /api/delta/status, /api/delta/clear

Still in gdpr_scanner.py: GET / (root), GET /api/scan/stream (SSE — cannot be in a blueprint), and the auth, users, sources, database, export route groups (31 routes — next batches).

[1.6.10] — 2026-04-10

Fixed — Google Drive `exportSizeLimitExceeded` warning

Native Google Workspace files too large for Drive's export API (Google's server-side limit, distinct from the 20 MB local cap) now produce a clean skip message instead of a stray WARNING googleapiclient.http — Encountered 403 Forbidden with reason "exportSizeLimitExceeded" in the log. A logging.Filter subclass is installed on the googleapiclient.http logger at import time to suppress the duplicate external warning; the except HttpError block in _drive_iter detects the reason and logs [gdrive] skip '<name>' — file too large for Google export API (exportSizeLimitExceeded) with the file ID.

Fixed — peak memory during large file/SMB scans (OOM risk reduction)

Three targeted buffer-lifetime fixes reduce peak RSS during large scans:

cpr_detector.py — del content after writing the PDF bytes to a temp file in _scan_bytes_timeout. The 20 MB buffer was previously held in the main process for the entire duration of p.join(timeout) (up to 60 s), overlapping with the spawned subprocess's ~150–300 MB heap. It is now freed before the subprocess starts.
scan_engine.py — del content after the thumbnail block in run_file_scan. The raw file buffer was kept alive through card dict construction and the start of the next loop iteration; it is now freed as soon as the thumbnail (or placeholder SVG) has been generated.
file_scanner.py — PREFETCH_WINDOW reduced from 2 to 1. Halves the maximum number of concurrently-held SMB read buffers (from 2 × 20 MB to 1 × 20 MB).

[1.6.9] — 2026-04-10

Changed — frontend migrated to ES modules

Phase 2 complete: All 10 split JS files converted from <script defer> to <script type="module">.

static/js/state.js introduced as the shared state module — exports a single S object holding all previously-global mutable state (flaggedData, _allUsers, _profiles, _fileSources, _srcPct, scan-running flags, etc.). All 10 modules import { S } from state.js and mutate its properties in place.
Every function called from an inline HTML onclick= handler is explicitly exported via window.fnName = fnName at the bottom of each module (~80 exports across 10 files).
var LANG retained in the inline <script> block (not a module) so it remains a true global accessible from all modules as a bare name.
app.js retained as archive; no longer loaded by index.html.

Fixed — connector.js SyntaxError caused by duplicate function declarations

openFileSourcesModal and closeFileSourcesModal were declared twice at module top level in connector.js — once as redirect stubs pointing to the new unified Sources modal, and once as the old #fsrcBackdrop implementations left over from the pre-unification code. In ES module strict mode, duplicate function declarations in the same scope are a SyntaxError. The engine rejected the entire module at parse time, meaning none of its ~35 window.* exports were ever set. Symptoms:

"Kilder" (Sources) button did nothing — window.openSourcesMgmt was never set
Google status dot, file source loading, and sources panel re-render all silently failed — window.smGoogleRefreshStatus, window._loadFileSources etc. were undefined
Sources panel showed only M365 sources even when Google Workspace was configured

Fix: removed the stale async function openFileSourcesModal / function closeFileSourcesModal bodies (lines 511–518). The redirect stubs at lines 505–506 (openSourcesMgmt('files')) are the correct new behaviour. Also removed the duplicate window.openFileSourcesModal and window.closeFileSourcesModal assignments that appeared twice in the exports block.

Fixed — Profiler modal did not open when `_renderProfileMgmt` threw

If _renderProfileMgmt() threw a runtime error (e.g. due to downstream failures from the connector.js parse error), openProfileMgmtModal would abort before reaching classList.add('open'), leaving the modal invisibly closed. The function now wraps both _renderProfileMgmt() and _pmgmtOpenEditor() in individual try-catch blocks. Any error is logged to the console; the modal opens regardless.

Fixed — blocking alert on every unhandled async error

ui.js contained a duplicate unhandledrejection listener that called alert() for every unhandled Promise rejection. Background API calls (Google status, file sources, src_toggles) could fire these alerts at page load, and browsers that had already suppressed one alert silently blocked all subsequent ones. Removed the alert() handler; the console.error handler is retained.

[1.6.8] — 2026-04-09

Fixed — memory pressure during large scans

SMB prefetch window reduced

PREFETCH_WINDOW reduced from 5 to 2 in file_scanner.py. Peak in-flight SMB memory drops from ~250 MB to ~40 MB during large network share scans.
MAX_FILE_BYTES reduced from 50 MB to 20 MB — files larger than 20 MB are skipped rather than buffered in full.

PDF subprocess concurrency limited

A module-level threading.Semaphore(1) in cpr_detector.py ensures at most one PDF OCR subprocess runs at a time. Previously, multiple threads could each spawn a ~200 MB subprocess simultaneously, causing OOM under load.

Google Workspace export buffer reduced

_MAX_EXPORT_BYTES in google_connector.py reduced from 50 MB to 20 MB.
_drive_iter now explicitly deletes the BytesIO buffer (del buf) before yielding each file's bytes, releasing the double-buffer peak immediately rather than waiting for GC.

Fixed — Excel and Article 30 exports missing sources

Gmail and Google Drive tabs added to Excel export

SOURCE_MAP in routes/export.py was missing gmail, gdrive, local, and smb entries. Items from these sources were silently dropped — they were grouped internally but never written to a sheet.
All eight source types now have dedicated tabs: Outlook, OneDrive, SharePoint, Teams, Gmail, Google Drive, Local, Network.
The same fix applies to the inline Excel builder in gdpr_scanner.py.

Concurrent scan results captured in exports

M365, Google Workspace, and file scans each create their own scan_id. The previous DB fallback used get_flagged_items(), which only returned results for the single most-recently-completed scan — silently dropping the other sources after page reload.
New get_session_items(window_seconds=300) in gdpr_db.py returns items from all scans whose started_at falls within a 5-minute session window of the latest completed scan.
Both export_excel() and export_article30() now use get_session_items() as their DB fallback. _build_article30_docx() also uses it directly.

Changed — "Email" source renamed to "Outlook"

The email source type (Microsoft Exchange mailboxes) is now consistently labelled Outlook everywhere:

Source badges on result cards (SOURCE_BADGES.email)
Filter bar dropdown
_sourceLabel() in JS
Excel tab label
m365_src_email, m365_filter_email, m365_phase_emails in all three lang files (en.json, da.json, de.json)
Article 30 report uses Exchange (Outlook) for the formal legal context

Rationale: with Gmail also present, "Email" was ambiguous. "Outlook" ties the source unambiguously to Microsoft 365.

Changed — progress bar moved above log panel

#progressBar moved from below the topbar to just above #logWrap (above the activity log).
The bar is now a permanent placeholder — always visible, never hidden. display: flex is the permanent state; display: none is no longer used.
Background changed from var(--surface) to var(--bg) to match the log area. Border changed from border-bottom to border-top.
New _clearProgressBar() helper resets phase, stats, ETA, and file fields on scan end, leaving the bar visually empty at idle. All previous style.display assignments removed.

Clicking Cancel in the profile editor previously closed the editor panel but left the profile list modal open behind it. _pmgmtCloseEditor() now calls closeProfileMgmt() to dismiss the full modal.
Dead stub function _pmgmtCancelEdit(id) {} removed.

Changed — exports available without running a new scan

The filter bar (including Excel and Art.30 export buttons) is always visible on page load.
Exports now use get_session_items() as the DB fallback, so the buttons produce a complete report from the previous session immediately after page reload — no new scan required.

Fixed — profile loading clobbered by scan start

_save_settings() is called on every M365 scan start with a payload containing only M365 sources, user_ids, and options. It was writing this back via _profile_from_settings(), which has no google_sources field — permanently stripping Google and file source selections from the active profile after each scan.
_save_settings() now preserves google_sources and file_sources from the existing profile when the payload does not include them, and rebuilds the combined sources array as M365 + google + file.
_profile_from_settings() updated to pass through google_sources when present in the payload.

Fixed — "no results" shown during live scan after hard refresh

Hard-refreshing the browser mid-scan caused the "Ingen CPR-numre fundet" card to appear immediately, before the SSE watchdog had detected the running scan.
loadLastScanSummary() is no longer called directly on DOMContentLoaded. It is now called inside _sseWatchdog on the first status poll, only if no scan is currently running (_initialStatusChecked flag).

Fixed — progress bar source pill showing "Email" instead of "Outlook"

_PHASE_SOURCE_MAP entry for Exchange mail phases still had label: 'Email'. Updated to 'Outlook' to match the rename applied elsewhere.

Changed — profile manager UI simplified

Removed the redundant × close button from the list panel header — the editor panel's × already closes the entire modal.
Removed the Luk (Close) button from the list panel footer — the footer now contains only + Ny profil.
The editor footer Cancel/Annuller button replaced with a single Luk button that closes the entire modal (consistent with _pmgmtCloseEditor() behaviour).

Changed — log panel collapsible

A ▾/▸ toggle button added to the left of the log header. Clicking it collapses or expands the log panel (resize handle + log body together, wrapped in #logSectionBody).
State persists in localStorage via the existing toggleSection / restoreSectionStates mechanism (sc_logSection key).

Changed — log header buttons translated

All, Errors, and Copy buttons in the log header now use data-i18n attributes and are fully translated in all three lang files.
Translation keys added: btn_errors (da: Fejl, de: Fehler), log_copy (da: Kopier, de: Kopieren).
Symbol prefix ⎘ removed from the Copy button label.

Changed — project documentation structure

User manuals moved from project root to docs/manuals/ (MANUAL-DA.md, MANUAL-EN.md).
Setup guides moved from project root to docs/setup/ (M365_SETUP.md, GOOGLE_SETUP.md).
routes/app_routes.py and build_gdpr.py updated to reference the new manual paths.
README.md links updated accordingly.

Fixed — disposition carry-forward across scans

When a previously reviewed file reappears in a new scan it now shows its prior disposition immediately on the result card — no need to open the preview panel first.

get_prior_disposition(item_id) added to ScanDB in gdpr_db.py. Returns the stored disposition status if it differs from 'unreviewed', otherwise None.
get_flagged_items() and get_session_items() in gdpr_db.py now LEFT JOIN dispositions and return COALESCE(d.status, 'unreviewed') as disposition on every row. Exports and the results grid therefore reflect the latest review decision without an extra round-trip.
_with_disposition(card, db) helper added to scan_engine.py. Injects the prior disposition into a card dict before it is broadcast as scan_file_flagged. Used at all four broadcast points:
- scan_engine.py — file scan (line ~297)
- scan_engine.py — checkpoint resume re-emit loop (line ~357)
- scan_engine.py — M365 scan (line ~456)
- routes/google_scan.py — Google Workspace scan (line ~225)
The frontend already reads f.disposition || 'unreviewed' for filter matching — no JS changes required.

[1.6.7] — 2026-04-06

Fixed — emoji/symbol removal from all buttons and indicators

All UI buttons stripped of emoji and symbol prefixes

Every interactive element in the topbar, filter bar, modals, and settings panels now uses plain text only. Removed: ▶, ■, 💾, ✕, ⚙, 🕐, ⬇, ⬆, 🗑, 📋, ☰, ⊞.
Affected buttons: Scan, Stop, Save (profile), Clear (profile), Profiler/Profiles, Kilder/Sources, Indstillinger/Settings, Excel, Art.30, Slet/Delete (bulk), Liste/List, Gitter/Grid, Export (DB), Import (DB), Reset DB, scheduled scan title.
Labels updated in templates/index.html and all three lang files (da.json, en.json, de.json).

Filter bar — Clear button standardised

The × clear-filter button was an oversized bare symbol (font-size: 16px, no border). Replaced with a proper text button (Ryd/Clear/Löschen) matching the 26 px filter bar standard: bordered, border-radius: 5px, turns red on hover.
Translation key m365_filter_clear added to all three lang files.

Scheduler indicator — "Next:" label translated

The hardcoded 'Next: ' prefix in schedUpdateSidebarIndicator() is now t('m365_sched_next', 'Next'). Key added to all three lang files (da: Næste, de: Nächste).
Clock emoji 🕐 removed from the indicator and from m365_sched_title in all lang files.

Fixed — result card badges, progress bar on browser refresh

Result card badges — standardised to 9 px pill style

All result card badges now follow the app-wide badge standard: font-size: 9px; padding: 1px 5px; border-radius: 10px.
.source-badge (OneDrive, Exchange, Gmail, etc.) had no CSS definition at all — it now has the correct size, padding, and border-radius.
.cpr-badge reduced from 10px / 2px 6px to 9px / 1px 5px.
.photo-face-badge, .special-cat-badge, .overdue-badge, .role-pill reduced from 10px / border-radius: 4px to 9px / 1px 5px / border-radius: 10px.
Removed camera emoji (📷) from the Faces badge.
.card-source gains flex-wrap: wrap so badges wrap on narrow cards instead of overflowing.

Progress bar — survives browser refresh

Refreshing the browser mid-scan no longer causes the progress bar to appear without coloured segment pills.
Three code paths now defensively set the correct running flag and call _renderProgressSegments() before the track is needed:
- scan_start SSE handler (sets _m365ScanRunning).
- scan_progress SSE handler (sets the flag matching the event's source field — covers mid-scan reconnects where scan_start has scrolled out of the 500-event replay buffer).
- scan_phase SSE handler (infers source from phase text; fires before scan_progress in the replay sequence).
- _sseWatchdog (sets _m365ScanRunning immediately on detecting a running scan via /api/scan/status, which checks the M365 lock).

Improved — scan responsiveness, UI layout, preview panel

Scan abort responsiveness

Stop now takes effect within one Graph API round-trip across all collection phases. Previously, pressing Stop only checked the abort flag in the processing loop — the entire collection phase (email folder enumeration, OneDrive file listing, Teams channel fetching, SharePoint site iteration) ran to completion first, which could take 10+ minutes on large tenants.
Abort checks added to: email folder loop (inside _scan_user_email), OneDrive items loop (delta and full modes in _scan_user_onedrive), Teams team loop and channel loop (inside _scan_user_teams), SharePoint site loop, and all outer per-user loops.
Side effect: the scheduler no longer fails with "Manual scan already running" when a job fires shortly after the user pressed Stop — the lock is now released promptly.

Scheduler — graceful skip on lock contention

When a scheduled job fires while a manual (or other scheduled) scan holds the lock, the job now logs Skipped — a scan is already running and returns cleanly. Previously it raised RuntimeError("Manual scan already running"), which was logged as a hard failure with a full traceback in the UI.

Filter bar — always visible, full-width, 26 px

Filter bar was hidden until the first result arrived. It is now always visible.
Moved from inside the left column to a direct child of .main, above .content-area. The preview panel's top edge now aligns with the grid's top edge rather than overlapping the filter bar.
All filter bar controls standardised to height: 26 px (input, select, button) to match the topbar control standard. Redundant inline padding/font-size/border-radius stripped from button inline styles.

Preview panel

Resizable: a 5 px drag handle on the left edge lets the user adjust the panel width. Handle uses pointer capture (setPointerCapture) so dragging over the iframe or releasing outside the browser window always terminates the drag cleanly. Width is persisted in sessionStorage and restored when the panel is next opened.
Min width: 280 px; max width: 70% of window width.
Fixed: clicking the close (×) button had no effect. Root cause: panel.style.width set by the resize logic is an inline style and overrides the CSS class .hidden { width: 0 }. Fix: closePreview() now clears panel.style.width = '' before adding .hidden; openPreview() restores the saved width when showing the panel.
Email preview iframe: added * { max-width: 100% }, overflow-x: hidden, table { table-layout: fixed }, and img { height: auto } to prevent wide HTML emails from creating a horizontal scrollbar inside the 420 px panel.
Email preview iframe scrollbar: matches the app's 4 px thin scrollbar style.

Thin scrollbars everywhere

.grid-area (results grid) and .log-panel now use the same 4 px thin scrollbar style (scrollbar-width: thin; width: 4px) as #accountsList and #sourcesPanel. Previously they used the system-default wide scrollbar.

Scheduler next scan indicator

#schedNextIndicator was a plain display: block div with no height constraint, causing it to sit taller than adjacent topbar controls. Fixed to height: 26 px; display: inline-flex; align-items: center with a border and border-radius matching the surrounding pill buttons.

Log and preview resize — pointer capture fix

Both resize handles (logResizeHandle, previewResizeHandle) switched from mousedown + document.addEventListener('mousemove'/'mouseup') to pointerdown + setPointerCapture. The old approach lost the drag when the cursor moved over the iframe (which has its own input context) or left the browser window. Pointer capture routes all pointer events to the handle until pointerup/pointercancel regardless of cursor position.

Manuals updated (MANUAL-DA.md, MANUAL-EN.md)

Version 1.6.4 → 1.6.6.
Section 2: activity log description now mentions copy button, error filter, and resize handle.
Section 4.4: progress bar description updated — source pill labels listed, old "current phase" wording removed.
Section 8: profiles section updated for loader model, ✕ clear button, and explicit mention that Google/file sources are saved.

[1.6.6] — 2026-04-06

Improved — UX polish II (clusters, badges, log panel, progress bar)

Pill clusters

KONTI section header: Alle / Ingen / ↻ converted from bare text links to a pill cluster (height: 22px), matching the pattern used in the Profile editor.
Profile list rows (Profiler modal): Brug + Kopier grouped into a pill cluster; Slet kept as a separate standalone danger button.

Badge sizing

Platform badges (M365, GWS, M365+GWS) and role badges (Ansat, Elev, Anden) standardised to font-size: 9px; padding: 1px 5px; border-radius: 10px across the main sidebar account list and the profile modal. Previously the sidebar used larger inline styles (font-size: 10px; padding: 2px 7px) that made badges visually heavier than in the modal.

Account rows

Main sidebar account row padding reduced from 4px 0 to 2px 0, matching the compact density of the profile modal account list.
SKU debug search icon button standardised to height: 26px to match the adjacent role filter cluster.

Log panel — full rebuild

Color-coded log levels: .log-err (red var(--danger)), .log-ok (green var(--success)), .log-warn (orange #e0922a)). Level classes were already passed to log() but had no CSS — all entries appeared in the same muted colour.
Live scanning indicator: a single italic ▶ filename line at the bottom of the log updates in place via scan_file SSE events. Never scrolls; clears automatically when the scan finishes. Avoids flooding the log with per-file entries.
Copy button (⎘ Copy) in the log header copies all log text to clipboard; flashes ✓ Copied for 1.5 s.
Log level filter (All / Errors) in log header — hides info lines when Errors mode is active.
Resizable: drag handle at the top edge of the panel resizes vertically and snaps to the nearest full line (row = 18 px: 16 px line-height + 2 px margin; 2–30 lines range).
Default height set to 8 lines exactly (height: 154px = 8 × 18 + 10 px padding).
Persistent across page refresh: up to 300 lines saved to sessionStorage; restored on DOMContentLoaded; cleared on new scan start.
Smart scroll: auto-scroll only triggers when already within 24 px of the bottom — scrolling up to read earlier entries stops the follow behaviour.

Progress bar — segmented multi-source

Replaced the single progressFill bar with a dynamically segmented track (#progressTrack). One segment per active scan type (M365 / Google / Files), equal width, separated by a 1 px gap. Segments are added at scan start and removed as each source finishes.
Color-coded: M365 = blue (var(--accent)), Google = dark green (#3a7d44), Files = purple-gray (#7a6a9e).
Each segment fills independently — M365 at 80% and Google at 20% are shown simultaneously with no interference. Eliminates the _maxPct hack (bar stuck at 100% after first source finishes).
Backend (scan_engine.py, routes/google_scan.py): all scan_progress SSE events now include "source": "m365" / "google" / "file". Frontend routes each event to the correct segment by d.source.
Stats (X / Y) and ETA only update from M365 events — the only source with meaningful totals and time estimates.

Progress bar — phase display

#progressWho replaces the plain-text phase span. Renders a colour-coded source pill ([Email], [OneDrive], [Gmail], [GDrive], [Local], etc.) followed by the user's full display name.
Source pill uses the universal badge standard: font-size: 9px; padding: 1px 5px; border-radius: 10px; font-weight: 500.
_setProgressPhase() identifies the source from the full phase string via _PHASE_SOURCE_MAP, then splits on — to extract the username. Phases without a dash (e.g. 📂 folder: 3 msg(s)) fall back to the last known user (_progressCurrentUser).
_resolveDisplayName() resolves email addresses in Google phase strings to the user's display name via _allUsers. Also strips trailing count suffixes (: 3 file(s)).
Pill labels standardised: Email, OneDrive, SharePoint, Teams, Gmail, GDrive, Local — matching the source names used elsewhere in the UI.
All 25 scan_phase strings now produce a pill: 📂 emoji maps to Email; Google Workspace — email phases resolve to display name; file scan startup uses Files — {label}; Google per-user phase uses Google Workspace — {email}.
Source map ordering: Google Workspace matched before Gmail so the GWS startup phase shows [Gmail] only when no broader match applies.
Fixed: email regex was missing the i flag (/E-?mail.../u → /E-?mail.../iu), causing Danish "Indsamler e-mails" to fall through to plain text.

Scheduler — Google and file sources

Scheduled scans now run Google Workspace sources. _build_options extracts google_sources from the profile (with legacy fallback for profiles that stored gmail/gdrive inside sources). A separate Google scan block runs after the file scan loop using _google_scan_lock.

Profile dropdown — loader model

Removed the selectable "Standard (sidebar)" / "Default (sidebar)" empty option. Profiles are now loaders, not persistent modes — selecting one pushes its settings into the sidebar; the sidebar is always the live state.
Replaced with a disabled placeholder "— Vælg profil —" shown when no profile has been loaded.
Added a ✕ clear button (#profileClearBtn) that appears next to the dropdown when a profile is active. Clicking it clears _activeProfileId and resets the dropdown to the placeholder without touching the sidebar — the loaded settings remain.
clearActiveProfile() function added.
Lang keys: m365_profile_default removed, m365_profile_placeholder added (da/en/de).

Bug fixes

Profile role filter respected at scan time: getSelectedUsers() now filters the returned list by _activeRoleFilter, preventing hidden-role users from being silently included in M365 scans and profile saves via the topbar quick-save.
Profile editor role filter respected at save time: _pmgmtSaveFullEdit now excludes IDs whose role doesn't match _pmgmtRoleActive. Prevents "select all → filter by staff → save" from silently saving student accounts that were checked but hidden.
Profile editor role filter state reset on open: _openEditorForProfile resets _pmgmtRoleActive = '' so a stale filter from a previous session doesn't silently hide accounts when the editor is reopened.
Google and file sources not saved in profiles: _pmgmtSaveFullEdit now checks whether the checkboxes are actually present in #peSourcesPanel (DOM query) rather than using !!window._googleConnected and _fileSources.length > 0 as proxies. The async status fetches could complete after the editor opened, leaving the panel without checkboxes while the proxy read true, silently discarding the user's selection.
Profile editor now re-renders #peSourcesPanel when smGoogleRefreshStatus() resolves or _loadFileSources() completes if the editor is open and the panel has no Google/file checkboxes yet.

[1.6.5] — 2026-04-04

Improved — UX polish pass (topbar, sidebar, clusters)

Topbar

All topbar elements normalised to height: 26px: Scan/Stop buttons, profile dropdown, save button, config cluster, stats pill, icon buttons (🔍, ?, 🌙). Previously each had independent padding, making the topbar uneven.
Config buttons (Profiler, Kilder, Indstillinger) extracted from #profileBar into a dedicated .config-group pill cluster separated by a .topbar-sep divider — visually distinct from the profile selector group.
Data subject lookup moved from the sidebar footer into the topbar as a 🔍 icon button (left of ?). Sidebar strip removed.

Sidebar

KILDER, INDSTILLINGER, and KONTI sections are now collapsible. Each header gets a ▾/▸ chevron (section-collapse-btn). Collapse state persists in localStorage per section. KONTI releases its flex:1 when collapsed.
Role filter buttons (Alle / Ansat / Elev) converted to a pill cluster (.role-filter-btn) matching the topbar cluster pattern. SKU debug button stays separate.
Date preset buttons (1 år / 2 år / 5 år / 10 år / Alle) converted to a pill cluster.
All pill cluster buttons, input fields, and date picker set to height: 26px — the universal control height across the UI.
Toggle size reduced from 36×20px to 32×18px with knob gap tightened from 3px to 2px. Knob-to-track ratio improved for a sleeker look.
Role filter buttons display live counts: "Alle (277)", "Ansat (62)", "Elev (254)". Updated by updateRoleFilterCounts(), called from renderAccountList().

Empty state

On load, fetches /api/db/stats. If a previous scan exists, shows a summary card (hits, unique CPR subjects, items scanned, date, sources) instead of the bare placeholder. The placeholder is shown below as a "start new scan" prompt. Summary hidden when a scan starts.

Added — Single-instance lock

~/.gdprscanner/app.lock — an exclusive process lock is acquired at startup to prevent two instances from running simultaneously against the same database and settings files.
Desktop (build_gdpr.py launcher): lock is checked before Flask starts. If another instance holds the lock the app prints "GDPRScanner is already running." to stderr and exits immediately.
Server (gdpr_scanner.py): same guard in interactive web-UI mode (not headless — batch runs may legitimately coexist with a live server).
Uses fcntl.flock(LOCK_EX | LOCK_NB) on macOS/Linux and msvcrt.locking on Windows. The OS releases the lock automatically on crash or clean exit — no stale lockfiles.

Added — Port auto-increment + stdout port signal

gdpr_scanner.py (server mode): if the requested port (default 5100, or --port N) is already in use, the server auto-increments up to 100 ports and logs a warning: [!] Port 5100 in use — using 5101 instead.
build_gdpr.py launcher (desktop mode): find_free_port() was already present; auto-increment was already the desktop behaviour.
Both modes emit GDPR_PORT=<n> (flush=True) to stdout before Flask starts — a machine-readable signal parseable by any parent process or wrapper script that needs to know the actual bound port.

Added — Built-in user manual (#31 ✅)

MANUAL-EN.md / MANUAL-DA.md — standalone end-user manuals in English and Danish. 14 sections covering all major features: Getting started, Sources panel, Running a scan, Understanding results, Reviewing results, Bulk actions, Profiles, Scheduler, Export & email, Article 30 report, Data subject lookup, Settings, Retention policy, and FAQ. Written for school administrators and municipal compliance officers — no technical knowledge assumed.
GET /manual — new Flask route in routes/app_routes.py. Reads ?lang=da|en (falls back to the current UI language). Finds the appropriate .md file relative to the project root, converts it to a fully self-contained styled HTML page, and returns it without any external dependencies.
_md_to_html(md) — zero-dependency Markdown-to-HTML converter using only Python's re and html stdlib modules. Handles headings with anchor IDs, fenced code blocks, tables, ordered/unordered lists, blockquotes, bold, italic, inline code, links, and horizontal rules.
? button in the topbar (right of the theme toggle) — opens the manual in a dedicated window (960×800, resizable) using the current langSelect value. In the packaged desktop app the window is a native pywebview window (pywebview.api.open_manual()); in the browser it opens via window.open(). Repeated clicks reuse the same window rather than spawning new ones. Does not interrupt any in-progress scan.
Manual page: 860 px max-width layout, language switcher (DA ↔ EN), 🖨 print button, @media print CSS (toolbar hidden, h2 page breaks, external link URLs appended for paper printing).

Fixed — Manual not found in packaged app

MANUAL-DA.md and MANUAL-EN.md were missing from the PyInstaller bundle — build_gdpr.py now includes all MANUAL-*.md files as root-level data files (--add-data MANUAL-*.md:.). The route already used sys._MEIPASS for the frozen path; the files simply weren't being copied in.
build_gdpr.py LAUNCHER_CODE — added open_manual(lang) method to the Api class. Creates a new pywebview window for the manual URL; reuses the existing window if already open.

Fixed — Email routing, profile source persistence, SMTP error messages

routes/email.py — structural rewrite

Removed __getattr__ module-level hook. Bare-name lookups inside function bodies do not go through __getattr__ (Python resolves them via LOAD_GLOBAL directly from __dict__), so _load_smtp_config, _save_smtp_config, _build_excel_bytes, and _send_report_email all raised NameError at runtime when the blueprint route won instead of the app-level duplicate.
_load_smtp_config, _save_smtp_config now imported directly from app_config. _build_excel_bytes imported from routes.export.
_send_report_email(xl_bytes, fname, smtp_cfg, recipients) was called in three places but never defined anywhere. Now defined as a module-level helper: builds a MIMEMultipart("mixed") message with the Excel as a MIMEBase attachment and sends via the configured SMTP server.
_send_email_graph moved into the blueprint (was only used by the duplicate app-level routes).

gdpr_scanner.py

Removed four duplicate app-level routes that were masking the broken blueprint: GET /api/smtp/config, POST /api/smtp/config, POST /api/smtp/test, POST /api/send_report.
from routes.email import _send_report_email added after blueprint imports so scan_scheduler.py (_m._send_report_email) and the CLI headless path both resolve the function correctly.

SMTP error messages (routes/email.py)

All three auth/connection error handlers (smtp_test, send_report, _send_report_email) now classify errors by host type before choosing a message:
- DNS / connection failure (nodename nor servname, getaddrinfo, Connection refused, timeout) → "Could not connect to SMTP server — check hostname and port."
- Corporate M365 host (office365, microsoft) + auth error → M365 admin centre / enable Authenticated SMTP guidance.
- Personal Microsoft host (outlook, live, hotmail) + auth error → App Password guidance at account.microsoft.com/security.
- Gmail host + auth error → App Password guidance at Google Account Security.
- Anything else → raw SMTP error, unmodified.
Previously 530 (generic "authentication required") unconditionally triggered the M365 admin centre message even when the configured host was Gmail or a personal Outlook account.

static/app.js — profile source persistence

_pmgmtSaveFullEdit was overwriting google_sources and file_sources with [] whenever the editor was opened and those checkboxes weren't rendered (Google not connected / file sources not loaded). Now preserves the profile's existing google_sources when _googleConnected is false, and file_sources when _fileSources is empty.
_applyProfile built _pendingProfileSources by filtering against _fileSources — which is empty at profile-apply time (async load not yet complete), so the pending list was always empty and file source checkboxes defaulted to checked=true regardless of the profile. Now stores profile.file_sources directly (falling back to non-M365/Google IDs from profile.sources).
Added _pendingGoogleSources (mirrors _pendingProfileSources for Google). Set in _applyProfile from profile.google_sources; consumed in renderSourcesPanel() the first time Gmail/Drive checkboxes appear (when Google connects after the profile was applied). Previously they defaulted to checked=true.

Fixed — Progress bar and profile sources

static/app.js

Progress bar fluctuated and ETA flickered when M365, Google, and file scans ran concurrently. Root cause: all three scan types broadcast scan_progress on the same SSE stream and their events interleave. Fixed with two changes: (1) _maxPct tracks the highest pct seen across all concurrent scans — the bar only ever moves forward; (2) ETA and stats counter are only written when the incoming event actually carries those fields (d.eta !== undefined, d.total present) — a Google/file event without ETA no longer wipes the ETA set by the M365 event a millisecond earlier.
progressPhase was being overwritten with the current filename by scan_progress events, causing it to alternate between phase text ("Google Workspace scan…") and individual filenames. Current filename now correctly updates progressFile instead.
Profile editor (_openEditorForProfile) only passed profile.sources (M365 IDs) to _renderEditorSources — Google and Local/SMB source checkboxes were always unchecked when reopening a saved profile. Now passes the union of sources, google_sources, and file_sources.

Added — SMB pre-fetch cache (#22 ✅)

SMB file scans now decouple directory traversal from file reads. A 5-slot sliding-window ThreadPoolExecutor keeps up to 5 reads in flight simultaneously, with a 60-second hard timeout per file. A stalled NAS read produces an error card in the UI and the scan continues — the scan thread is never blocked.
file_scanner.py — _smb_collect() new method walks the SMB tree (directory listing only, no reads), yielding file descriptors plus _COLLECT_SKIP / _COLLECT_ERROR sentinels for over-size files and listing failures. _iter_smb() rewritten: phase 1 collects all candidates; phase 2 resolves sentinels immediately then feeds real files through the executor window. PREFETCH_WINDOW = 5 and SMB_READ_TIMEOUT = 60 constants added. Local scanner (_iter_local) untouched.

Added — PDF OCR via multiprocessing (#20 ✅)

PDF files are now scanned in local/SMB file scans. Previously excluded because Tesseract/Poppler subprocesses could hang indefinitely.
cpr_detector.py — new _worker_scan_pdf() (module-level, required for spawn context) runs document_scanner.scan_pdf() in a fresh subprocess and returns results via a multiprocessing.Queue. New _scan_bytes_timeout() wraps PDF scanning: writes content to a temp file, spawns the worker via multiprocessing.get_context("spawn"), joins with a 60-second hard timeout, and terminates the process tree if it exceeds the limit. Non-PDF files delegate straight to _scan_bytes().
scan_engine.py — run_file_scan() now calls _scan_bytes_timeout() instead of _scan_bytes() for all files. Stub added to module-level injected globals.
gdpr_scanner.py — _scan_bytes_timeout imported from cpr_detector and injected into scan_engine.
file_scanner.py — .pdf removed from FILE_SCAN_EXTENSIONS exclusion; all default extensions now included.

Fixed — Post-v1.6.4 release bugs (continued)

routes/google_scan.py

_run_google_scan() crashed with UnboundLocalError: cannot access local variable 'data' when user_emails was not passed in the request. The fallback data.get("user_emails", []) referenced the request-handler local data which is not in scope inside the scan function — data and options are the same object. Removed the redundant fallback.

routes/export.py — Article 30 report

SOURCE_LABELS was missing gmail, gdrive, local, and smb — all four source types rendered as raw keys in every table (inventory, Art. 9, photo, deletion audit log). Now map to "Gmail", "Google Drive", "Local files", "Network / SMB".
Per-source breakdown table only iterated M365 sources (email, onedrive, sharepoint, teams) — Google and local/SMB findings were completely absent from the summary even when present. Loop now covers all eight source types.
Methodology bullet (a30_method_4) only mentioned Microsoft Graph sources. Updated in en.json, da.json, de.json, and the hardcoded fallback to also mention Google Workspace (service account + domain-wide delegation) and local/SMB file shares.

scheduler.py

Removed stale file. scan_scheduler.py fully supersedes it; routes/scheduler.py and gdpr_scanner.py both import from scan_scheduler. The old file had diverged significantly (missing UUID migration, connector auto-reconnect, file source resolution, debug SSE events).

templates/index.html

Removed 9 unused CSS classes: .sidebar-sub, .btn-secondary, .log-ok, .log-err, .log-warn, .user-bar, .sign-out-btn, .source-badge, .srcmgmt-coming-soon.

Added — Personal Google account OAuth (#30 ✅)

Personal Google accounts can now be scanned without a service account or Workspace admin. A device-code OAuth flow (mirrors M365 delegated mode) lets a user sign in interactively with their own Google account.
google_connector.py — new PersonalGoogleConnector class: get_device_code_flow() / complete_device_code_flow() static methods hit Google's device-auth endpoint; _refresh_if_needed() handles transparent token refresh via google.oauth2.credentials.Credentials; list_users() returns a single-item list (the signed-in user) so the scan engine needs no changes. iter_gmail_messages() / iter_drive_files() share the same iteration logic as GoogleConnector via extracted _gmail_iter() / _drive_iter() module-level helpers.
Token persisted to ~/.gdprscanner/google_token.json (chmod 600). New helpers: save_personal_token, load_personal_token, delete_personal_token.
routes/google_auth.py — four new endpoints: GET /api/google/personal/status, POST /api/google/personal/start, POST /api/google/personal/poll, POST /api/google/personal/signout. Background thread blocks on complete_device_code_flow; frontend polls — identical pattern to M365 delegated auth.
routes/state.py — google_pending_flow and google_poll_result added.
templates/index.html — auth-mode toggle (Workspace / Personal account) in the Google pane; personal section with client ID/secret fields and inline device-code box (reuses .device-code-box CSS); workspace setup guide hidden in personal mode.
static/app.js — smGoogleSetMode() switches visible sections; smGoogleRefreshStatus() now checks both /api/google/auth/status and /api/google/personal/status in parallel; smGooglePersonalStart(), smGooglePersonalPoll(), smGooglePersonalSignOut() added.
lang/en.json, da.json, de.json — 14 new keys each.

Fixed — Post-v1.6.4 release bugs

checkpoint.py

Scheduled scans crashed with string indices must be integers, not 'str' when user_ids in the profile contained plain ID strings rather than dicts. _checkpoint_key() now handles both formats: u["id"] if isinstance(u, dict) else u.

scan_engine.py

Same root cause as above: run_scan() now normalises user_ids entries to dicts at the top of the function before any access, so both plain strings and {id, displayName, userRole} objects work correctly.

scan_scheduler.py

file_sources in profiles are stored as source ID strings by the JS frontend. The scheduler now resolves each ID to its full source dict via _load_file_sources() before calling run_file_scan(). Plain path strings are also handled as a fallback.
Full traceback is now included in the scheduler_error SSE event so failures are diagnosable from the UI status panel without needing the CLI.

routes/app_routes.py

/api/langs (language selector endpoint) only globbed *.lang files — after the v1.6.3 JSON migration the language dropdown was silently empty. Now globs both *.json and *.lang with deduplication, matching the existing logic in gdpr_scanner.py.

static/app.js

Profile editor (_pmgmtSaveFullEdit) did not update file_sources or google_sources when the user changed source checkboxes — both fields were carried forward unchanged via ...profile. Now splits #peSourcesPanel checkboxes by data-source-type and writes file_sources, google_sources, and sources explicitly on every save.

gdpr_scanner.py

/api/langs only globbed *.lang files — after migrating to JSON, the language selector showed nothing. Now globs both *.json and *.lang, deduplicates by language code, and sorts alphabetically.
SOURCE_LABELS was missing gmail, gdrive, local, and smb entries — these sources now get correct tab names in Excel export and correct labels in the Article 30 report.
Excel export filename changed from m365_scan_*.xlsx to gdpr_scan_*.xlsx.
Article 30 methodology paragraph now mentions Google Workspace scanning via service account with domain-wide delegation. DA and DE lang files updated to match.

routes/google_scan.py

Gmail and Google Drive result cards showed the email address as account name instead of the user's display name. Fixed: _user_display_map is now built from list_users() and applied to each scanned item.
Role badge (Elev/Ansat/Anden) was missing on Google results when user_emails came from the request rather than list_users(). Fixed: role map is now populated in both cases.
Google scan now emits google_scan_done instead of scan_done so the progress bar stays open until both M365 and Google scans finish.

scan_engine.py

File scan now emits file_scan_done instead of scan_done so the progress bar stays open until all active scan types finish.
pct in both Google and file scan progress events was hardcoded at 50 — now increments from 10 to a max of 90.

static/app.js

Progress bar now tracks three independent flags (_m365ScanRunning, _googleScanRunning, _fileScanRunning) and only hides when all active scans have completed.
google_scan_done and file_scan_done SSE event handlers added.
Source filter dropdown (search results) and bulk delete source dropdown were missing Gmail, Google Drive, Lokal, and Netværk (SMB) options.
Profile preset buttons (1 år / 2 år / etc.) were never highlighted when applying a profile — matching used years × 365.25 but profiles store years × 365. Fixed.
_fileScanRunning flag set correctly at scan start from fileSources.length.

routes/state.py / routes/google_scan.py

M365 and Google scans shared _scan_lock — Google now uses _google_scan_lock and _google_scan_abort so both platforms scan in parallel.

templates/index.html

Sources, Settings and Schedule indicator moved from sidebar section header / footer into the topbar, to the right of the Profiles button.
Source filter dropdown and bulk delete dropdown updated with Google and file source options.

README.md

All emoji removed (role badges, action icons, status indicators). Plain text equivalents used throughout.
lang/da.json and lang/de.json updated with Google Workspace methodology text for the Article 30 report.

[1.6.4] — 2026-04-03

Added — Full profile editor (#15e ✅)

Two-panel modal (profile list left, full editor right). Click a profile row to edit it; the active row is highlighted.
+ Ny profil button in the left panel footer — creates a blank profile and opens the editor immediately, works when no profiles exist.
Editor sections match the sidebar exactly:
- Navn — name + description fields
- Kilder — same rendering as the main KILDER panel, including M365, Google Workspace, and file/SMB sources
- Konti — role filter (Alle / Ansat / Elev), text search, Alle / Ingen select buttons, + Tilføj konto manual entry, platform badges (M365 / GWS / M365+GWS), role badges
- Indstillinger — date picker with year presets (1/2/5/10/Alle), Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse (MB), Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder — all as toggle sliders
- Opbevaringspolitik — always visible; Opbevaringsår + Regnskabsår slut dropdown
Annuller, ×, and Gem all close the full modal. Auto-opens first profile on modal open.
Profile editor defaults match the main window: accounts are unchecked by default; only explicitly saved user_ids are shown as checked.

Fixed — Parallel M365 + Google scanning

M365 and Google scans shared _scan_lock — starting both simultaneously caused "Google scan already running" immediately after scan start. Fixed: routes/state.py now defines _google_scan_lock and _google_scan_abort as separate threading primitives; routes/google_scan.py uses these instead of the M365 lock. Both platforms now scan in parallel.

Fixed — User selection defaults

All users now default to selected: false on page load (previously true). The profile editor follows the same rule.
"Vælg alle" button renamed to "Alle" to match the main sidebar.

[1.6.3] — 2026-04-03

Fixed — Post-v1.6.3 release bugs

static/app.js

Source toggle state (Email, OneDrive, SharePoint, Teams, Gmail, Google Drev) not persisted across restarts. Fixed: all toggles now save to ~/.gdprscanner/src_toggles.json via a new /api/src_toggles endpoint and are restored on page load.
Deselecting M365 sources in Source Management did not update account badges — M365 + GWS still shown. Fixed: badge now uses hasM365Src and effectiveGws computed inside renderAccountList(), and M365 source toggles now call renderAccountList() on change.
Google-only scans reported wrong account count in live log (e.g. "26 konto(er)" when 1 was selected). Root cause: getSelectedUsers() returned all selected users including Google-only accounts. Fixed: getSelectedUsers() now returns only M365 users; Google users are counted separately for the log message. The "select at least one account" guard no longer blocks Google-only scans.
Cross-platform identity matching used email prefix (anne.hansen before @) — changed to displayName matching since both M365 and GWS are maintained from the same AD source.
_onGoogleSourceToggle() and M365 source toggles did not call renderAccountList() — account badges not updated when toggling sources in Source Management.

routes/google_auth.py

Removed /api/google/auth/sources endpoint and src_gmail/src_drive keys from the status response — replaced by unified /api/src_toggles endpoint in gdpr_scanner.py.

app_config.py / gdpr_db.py / checkpoint.py / google_connector.py / m365_connector.py / scan_scheduler.py / scheduler.py / gdpr_scanner.py

All data files moved from ~/ root into ~/.gdprscanner/ subdirectory with cleaner short names (scanner.db, config.json, token.json, etc.). A migration shim runs on first startup and moves existing ~/.gdpr_scanner_* files automatically. MAINTAINER.md updated with new file locations.

scan_scheduler.py

Scheduled scans ignored file_sources from the profile — _build_options() dropped them. Fixed: file_sources now included in opts, and run_file_scan() is called for each file source in the profile during a scheduled run (#15f ✅).

static/app.js — profile save

file_sources in profile was hardcoded to [] — now saves the actual checked file sources from buildScanPayload() (#15f).

Fixed — Post-release (continued)

routes/state.py / routes/google_scan.py

M365 and Google scans shared _scan_lock — starting both simultaneously caused "Google scan already running" immediately. Fixed: Google scan now uses its own _google_scan_lock and _google_scan_abort so both platforms can run in parallel.

static/app.js — profile editor (#15e ✅)

Profile editor drawer implemented: two-panel modal (profile list left, full editor right). Click any profile to open its editor.
Editor sections: Navn + beskrivelse, Kilder (same rendering as main KILDER panel, including Google and file sources), Konti (with Alle / Ansat / Elev role filter, text search, Alle / Ingen select buttons, + Tilføj konto manual add), Indstillinger (full mirror of sidebar — date picker with year presets, Scan e-mailindhold, Scan vedhæftede filer, Maks. vedhæftet filstørrelse, Maks. e-mails pr. bruger, Delta-scanning, Søg efter ansigter i billeder, all as toggle sliders), Opbevaringspolitik (always visible — Opbevaringsår + Regnskabsår slut).
- Ny profil button in left panel footer — creates a blank profile and opens the editor immediately, works even when no profiles exist.
Annuller, ×, and Gem all close the full modal (not just the editor panel).
Auto-opens first profile's editor when modal opens.

static/app.js — defaults

All users now default to selected: false on load (were true). Profile editor follows the same rule — only explicitly saved user_ids are shown as checked.
"Vælg alle" button renamed to "Alle" to match the main sidebar.

routes/state.py

Added _google_scan_lock and _google_scan_abort as separate threading primitives for Google scans.

Added — Google Workspace full integration

Accounts panel

Google Workspace users now appear in the Accounts panel alongside M365 users. Each row shows a platform badge: M365 (blue) or GWS (green).
Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all; check none → empty.
Role filter (All / Ansat / Elev) works across both platforms.
_mergeGoogleUsers() — dedicated async function fetches /api/google/scan/users and merges results into _allUsers independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 loadUsers().

Scanning

Selected Google user emails are now passed as user_emails to /api/google/scan/start — only selected accounts are scanned, not all users in the domain.
routes/google_scan.py — _scan_lock and _scan_abort now imported directly from routes.state (previously relied on __getattr__, which does not resolve bare names inside function bodies — caused NameError on scan start).
user_emails now read from the top-level request body in addition to the nested options dict.
Gmail scan result cards now correctly labelled "Gmail" (source_type was email → mapped to "Exchange"). Fixed in google_connector.py.
Gmail and Google Drive cards now show styled source badges (badge-gmail red tint, badge-gdrive blue tint). Previously fell back to unstyled.

Profiles

Google sources (gmail, gdrive) and selected Google user emails are now saved to scan profiles and correctly restored on load.
Fixed googleSources const temporal dead zone — declaration moved before use in buildScanPayload().

Added — OU-based role classification for Google Workspace (#23 Phase 1 ✅)

classification/google_ou_roles.json — maps Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: /Elever → student, /Personale → staff.
google_connector.py — list_users() fetches orgUnitPath via projection=full and classifies each user via classify_ou_role().
routes/google_scan.py — role map built from list_users() result; each scan card now carries the correct user_role.

Added — Documentation split

M365_SETUP.md — step-by-step Microsoft 365 setup (app registration, permissions, auth modes, headless config, troubleshooting).
GOOGLE_SETUP.md — step-by-step Google Workspace setup (service account, domain-wide delegation, scopes, OU role mapping, troubleshooting).
README.md — trimmed from 774 to 611 lines; setup detail moved to the two new files.

Changed — i18n migrated from `.lang` to JSON (#27 ✅)

lang/en.json, da.json, de.json — 709 keys each, standard flat JSON.
app_config.py — loader now prefers .json, falls back to .lang for backward compatibility.
Old .lang files retained as fallback; can be deleted once JSON files are confirmed working.

Changed — `skus/` renamed to `classification/` (#29 ✅)

skus/education.json → classification/m365_skus.json
skus/google_ou_roles.json → classification/google_ou_roles.json
All path references updated in m365_connector.py, google_connector.py, routes/users.py, gdpr_scanner.py, build_gdpr.py, all lang files, and static/app.js.

Changed — UI polish (icons removed, badges added)

Role filter buttons (Staff / Student), scan option labels (Delta scan, Scan photos, Retention policy), and account list role badges — all emoji removed, plain text only.
Role badge on account rows changed from emoji icon button to plain outline pill (Ansat / Elev / Anden).
Scan result cards — role icon prefix replaced with small inline badge.
All six lang files cleaned of emoji in role, mode, option, and Art.30 inventory keys.
Progress bar fixed at 32px height — emoji in filenames no longer push the bar taller.
Scrollbars in Sources and Accounts panels thinned to 4px.

Fixed — Account list / source interaction

Deselecting all sources now empties the account list.
Deselecting M365 sources no longer disables Accounts when Google sources are still checked.
_updateAccountsVisibility() now checks all source types, not just M365.

Fixed — Role override cycling

Role override never cleared for users loaded with a pre-existing override (roleOverride: true from a previous session) because _autoRole was never populated from the server. Fixed: replaced _autoRole comparison with a step counter — after 3 clicks the override clears regardless of the original auto role.
Role badge changed from <span> to <button type="button"> inside label rows — prevents label click-forwarding to the checkbox (which caused the first user to receive the override instead of the clicked user).

[1.6.2] — 2026-03-28

Added — Google Workspace account list and source integration

static/app.js — Google Workspace users (292 users in testing) now appear in the Accounts panel with GWS badge (blue = M365, green = GWS). M365 users carry M365 badge.
Account list filters by checked sources: check only Google sources → only GWS accounts shown; check only M365 → only M365 accounts; check both → all accounts; check none → empty list.
Role filter buttons (All / Ansat / Elev) work across both platforms.
_mergeGoogleUsers() — dedicated function fetches /api/google/scan/users and merges results into _allUsers independently of M365 auth timing. Called on page load, on Google connect/disconnect, and after M365 loadUsers().
startScan() — selected Google user emails now passed as user_emails to /api/google/scan/start, so only the chosen accounts are scanned (previously ignored selection and scanned all users).
routes/google_scan.py — _scan_lock and _scan_abort now imported directly from routes.state (previously relied on __getattr__ which doesn't resolve bare names inside function bodies — caused NameError on scan start).
user_emails now read from the top-level request body in addition to the nested options dict.

Added — OU-based role classification for Google Workspace (#23 Phase 1)

classification/google_ou_roles.json — new file mapping Google Workspace Organisational Unit paths to roles. Edit to match your school's OU structure; no code change required. Default: /Elever → student, /Personale → staff.
google_connector.py — list_users() now fetches orgUnitPath via projection=full and classifies each user via classify_ou_role(). Each user dict now includes userRole and orgUnitPath.

Added — Documentation split

M365_SETUP.md — step-by-step Microsoft 365 setup guide (app registration, permissions, auth modes, headless config, role classification, troubleshooting).
GOOGLE_SETUP.md — step-by-step Google Workspace setup guide (service account, domain-wide delegation, OAuth scopes, OU role mapping, troubleshooting).
README.md — trimmed from 774 to 611 lines. Auth/permissions/headless detail moved to setup guides. Two new "Microsoft 365" and "Google Workspace" sections link to the respective files.

Changed — UI polish (icons removed)

Role filter buttons (Staff / Student) — emoji removed, plain text only.
Scan option labels (Delta scan, Scan photos for faces, Retention policy) — emoji removed.
Account list role badge — replaced clickable emoji button (👔/🎓/👤) with plain outline pill badge (Ansat / Elev), matching the platform badge style.
Scan result cards — role icon prefix removed from account name; replaced with small inline outline badge.
All three lang files (en.lang, da.lang, de.lang) cleaned of emoji in m365_role_staff, m365_role_student, m365_opt_delta, m365_opt_scan_photos, m365_opt_retention, m365_mode_delegated, m365_bulk_overdue_btn, a30_inv_staff, a30_inv_students.

Fixed — Profile save/load with Google sources

Google sources (gmail, gdrive) and selected Google user emails now saved in scan profiles and correctly restored on load.
googleSources const declaration moved before use in buildScanPayload() — fixed temporal dead zone ReferenceError.

Fixed — Account list / source interaction

Deselecting all sources now empties the account list (previously kept showing all users).
Selecting only Google sources no longer disables the Accounts section (previously greyed out when no M365 sources were checked).
_updateAccountsVisibility() now checks all source types, not just M365.

Added — Google Workspace role classification via OU mapping (#23 Phase 1)

classification/google_ou_roles.json — new file mapping Google Workspace Organizational Unit paths to roles (student / staff). Edit to match your school's OU structure; no code change required. Default prefixes: /Elever → student, /Personale → staff.
google_connector.py — list_users() now requests orgUnitPath from the Admin Directory API and classifies each user via classify_ou_role(). Each user dict now includes userRole and orgUnitPath.
routes/google_scan.py — role map built from list_users() result; scan cards now carry the correct user_role instead of always "other".

Fixed — Post-split and app runtime bugs (additional)

routes/database.py

Settings panel showed "Scanned: 0, Flagged: 0, Scans: 0" because get_stats() returns {} when no scan has a finished_at timestamp (interrupted or first-run). Fixed: stats endpoint now queries flagged_items and scans tables directly so counts are always correct regardless of scan completion state. Stats populate on app start from existing DB data — no re-scan required.
DB export produced a ZIP but nothing was downloaded in the native app because URL.createObjectURL() does not work in pywebview. Fixed: exportDB(), exportExcel(), and exportArticle30() in static/app.js now detect pywebview and call window.pywebview.api.save_db_export() / save_excel() / save_article30() which use the native macOS/Windows save dialog. Browser fallback preserved.
Added save_db_export() and save_article30() methods to the pywebview Api class in build_gdpr.py. Fixed save_excel filename from m365_scan_ to gdpr_scan_.

scan_engine.py

run_file_scan() called _db.start_scan() which does not exist — the correct method is begin_scan(). Silent exception meant _db_scan_id was always None and no file scan results were ever written to the database. Fixed.

Added — Personal use disposition value (#28)

Staff members using work equipment for private purposes will now appear in scan results. Added personal-use as a disposition value so reviewers can explicitly mark items as outside the organisation's compliance scope.

New disposition: Personal use — out of scope in both UI dropdowns
Art. 30 report labels it "Personal use — out of GDPR scope (Art. 2(2)(c))"
Translated in EN / DA / DE

Legal basis: GDPR Article 2(2)(c) — processing by a natural person in the course of a purely personal activity is outside GDPR scope.

Added — pytest test suite (#26)

112 tests across 4 modules — all passing.

Test module	Tests	What it covers
`tests/test_document_scanner.py`	36	`is_valid_cpr`, `extract_matches`, `scan_docx`, `scan_xlsx`, `_scan_bytes` — CPR detection, false-positive suppression, binary edge cases
`tests/test_app_config.py`	34	i18n loading, Article 9 keyword detection, config round-trip, admin PIN, profiles CRUD, Fernet encryption
`tests/test_checkpoint.py`	18	`_checkpoint_key` stability, save/load/clear, wrong-key isolation, delta token round-trip
`tests/test_db.py`	24	Scan lifecycle, `save_item`, CPR hash-only storage, `lookup_data_subject`, dispositions, export/import cycle

Support files:

tests/conftest.py — shared fixtures: docx_with_cpr, docx_no_cpr, xlsx_with_cpr, xlsx_no_cpr, txt_with_art9, binary_garbage, tmp_db
pytest.ini — test discovery config

Run with: pytest tests/ from the project root.

Fixed — Six post-split runtime bugs

All bugs introduced by the #25 module split — the pre-split code had none of these.

gdpr_scanner.py

_current_scan_id imported as a string binding (from sse import _current_scan_id), so scan_stream() always saw "" — SSE replay filter excluded all events and the progress bar showed nothing. Fixed: reads sse._current_scan_id at call time via module reference.
_connector assignment only updated the local module global, not _state.connector. scan_engine.py reads _state.connector, which stayed None after sign-in — every scan reported "Not connected to M365". Fixed: all five _connector = ... assignments now dual-assign _connector = _state.connector = ....

scan_engine.py

_load_role_overrides, _resolve_display_name, _scan_text_direct were undefined bare names inside run_scan() — raised NameError at runtime. Fixed: proper imports from app_config and cpr_detector.
PHOTO_EXTS and SUPPORTED_EXTS were stub empty sets at import time; injection via _se.PHOTO_EXTS = ... replaced the module attribute but function bodies still saw the empty stubs. Fixed: scan_engine.py now imports these directly from cpr_detector at module level.
scan_progress SSE event broadcasts index and pct; the UI handler read d.completed — progress bar was always 0%. Fixed in static/app.js: handler now reads d.pct (pre-calculated server-side) and populates progressStats (n / total) and progressEta elements that were wired in HTML but never written.
Source collection (OneDrive, SharePoint, Teams) completed silently with no count in the live log. Fixed: broadcasts 📁 OneDrive — user: N file(s), 🌐 SharePoint: N file(s), 💬 Teams — user: N file(s) after each successful collection.

cpr_detector.py

_scan_text_direct() called ds.scan_text() which internally calls extract_cpr_and_dates() — a function that does not exist in document_scanner.py (pre-existing bug in that module). Result: every email body scan returned zero CPRs. Same bug affected .txt files and the unknown-extension fallback in _scan_bytes(). Fixed: all three replaced with ds.extract_matches(text, 1, "text") which works correctly.

static/app.js

scan_file_flagged handler called renderCards() which is not defined anywhere — silent ReferenceError in the browser, cards pushed to flaggedData but never displayed. Fixed: replaced with applyFilters() which calls renderGrid() and shows the filter bar.
scan_done handler never showed the filter bar (containing Excel and Art.30 export buttons) when results existed — only the stats numbers updated. Fixed: scan_done now explicitly shows the filter bar and calls applyFilters() when flaggedData.length > 0.

[1.6.1] — 2026-03-28

Changed — Split `gdpr_scanner.py` into focused modules (#25)

gdpr_scanner.py was 5554 lines. It is now 3591 lines and delegates to five focused modules. No behaviour changes — all existing routes, blueprints, and imports continue to work unchanged.

New files:

Module	Lines	Contents
`sse.py`	52	`broadcast()`, `_sse_queues`, `_sse_buffer`, `_current_scan_id`
`checkpoint.py`	79	`_save_checkpoint()`, `_load_checkpoint()`, `_checkpoint_key()`, delta token load/save
`app_config.py`	553	i18n, Article 9 keywords, config, admin PIN, profiles, settings, SMTP, file sources, Fernet encryption
`cpr_detector.py`	381	`_scan_bytes()`, `_extract_exif()`, `_detect_photo_faces()`, `_make_thumb()`, `_get_pii_counts()`
`scan_engine.py`	1006	`run_scan()`, `run_file_scan()` — M365 and file-system scan orchestration

Changed files:

gdpr_scanner.py — imports and re-exports from all five modules; keeps Flask app init, @app.route definitions, blueprint registration, and __main__ entry point
routes/state.py — _scan_lock and _scan_abort moved here from gdpr_scanner.py so scan_engine.py can reference them without a circular import

Isolation: each new module is importable in isolation with fallback stubs, enabling unit tests (#26) to import cpr_detector or checkpoint without pulling in Flask, MSAL, or the full application.

[1.6.0] — 2026-03-28

Changed — Rename: M365 Scanner → GDPRScanner (#24)

The tool now scans M365, Google Workspace, local file systems, and SMB network shares. The name "M365 Scanner" was misleading. This release renames everything with no behaviour changes.

Files renamed:

Old	New
`m365_scanner.py`	`gdpr_scanner.py`
`m365_db.py`	`gdpr_db.py`
`build_m365.py`	`build_gdpr.py`
`build_m365.sh`	`build_gdpr.sh`
`start_m365.sh` (created by install_macos.sh)	`start_gdpr.sh`
`start_m365.bat` (created by install_windows.ps1)	`start_gdpr.bat`

Config files renamed on first startup (migration shim):

Old `~/` path	New `~/` path
`.m365_scanner_config.json`	`.gdpr_scanner_config.json`
`.m365_scanner.db`	`.gdpr_scanner.db`
`.m365_scanner_token.json`	`.gdpr_scanner_token.json`
`.m365_scanner_delta.json`	`.gdpr_scanner_delta.json`
`.m365_scanner_settings.json`	`.gdpr_scanner_settings.json`
`.m365_scanner_smtp.json`	`.gdpr_scanner_smtp.json`
`.m365_scanner_role_overrides.json`	`.gdpr_scanner_role_overrides.json`
`.m365_scanner_file_sources.json`	`.gdpr_scanner_file_sources.json`
`.m365_scanner_machine_id`	`.gdpr_scanner_machine_id`
`.m365_scanner_checkpoint.json`	`.gdpr_scanner_checkpoint.json`
`.m365_scanner_schedule.json`	`.gdpr_scanner_schedule.json`
`.m365_scanner_msal_cache.bin`	`.gdpr_scanner_msal_cache.bin`
`.m365_scanner_lang`	`.gdpr_scanner_lang`

The migration runs silently at startup — existing scan history, credentials, settings, and role overrides are preserved automatically.

Intentionally unchanged:

m365_connector.py — kept as-is; it is the Microsoft Graph connector and the m365_ prefix accurately describes what it connects to
i18n keys with the m365_ prefix that describe M365-specific UI elements (Azure credential fields, device code flow screens) — the prefix is correct

Run with:

python gdpr_scanner.py [--port 5100]

[1.5.9] — 2026-03-28

Added — Google Workspace scanning (#10)

Organisations running mixed Microsoft/Google environments can now scan Gmail and Google Drive alongside M365 in a single tool. The Google Workspace tab in Source Management is now fully active (was "Coming soon" stub).

New files:

google_connector.py — service account OAuth with domain-wide delegation; Gmail message + attachment iterator; Drive file iterator with automatic export of native Docs/Sheets/Slides → DOCX/XLSX/PPTX before scanning
routes/google_auth.py — /api/google/auth/status, /connect, /disconnect
routes/google_scan.py — /api/google/scan/start, /cancel, /users

Changed files:

routes/state.py — google_connector slot added
m365_scanner.py — Google blueprints registered; GOOGLE_CONNECTOR_OK / GOOGLE_AUTH_OK flags; connector auto-restored from saved key on startup
templates/index.html — Google tab activated; full credentials pane with key file upload, admin email field, Gmail + Drive source toggles, and setup guide
static/app.js — smGoogleRefreshStatus(), smGoogleConnect(), smGoogleDisconnect(), getGoogleScanOptions(), key file reader
requirements.txt, install_windows.ps1, install_macos.sh — three new optional Google API dependencies
lang/en.lang, da.lang, de.lang — 14 new i18n keys each

Dependencies (optional — scanner starts without them):

pip install google-auth google-auth-httplib2 google-api-python-client

Setup required in Google Workspace Admin Console:

Create a Google Cloud project; enable Gmail API, Drive API, Admin SDK
Create a service account; download the JSON key; enable domain-wide delegation
In Workspace Admin → Security → API Controls → Domain-wide delegation add the service account client ID with scopes: gmail.readonly, drive.readonly, admin.directory.user.readonly

Scan results write to the same SQLite database with source_type = "gmail" or "gdrive" — Article 30 reports and data subject lookups cover both platforms automatically.

[1.5.8] — 2026-03-28

Fixed — Scheduled scans invisible in the browser (#21)

Scheduled scans now show full live progress in the browser — progress bar, phase text, flagged cards, and log entries — exactly like manual scans.

Root cause (critical): When run as python m365_scanner.py, the module loads as __main__. The scheduler's import m365_scanner as _m loaded a second copy of the module with its own empty _sse_queues. Events from _m.broadcast() went nowhere — the browser's SSE connection was reading from __main__'s queues.

Fix: sys.modules["m365_scanner"] = sys.modules[__name__] at the top of the module ensures all imports share the single running instance.

Fixed — SSE event replay for late-connecting browsers (#21)

Opening the browser mid-scan (manual or scheduled) now replays all buffered progress events so the live log and card grid are fully populated.

Additional root causes and fixes:

_autoConnectSSEIfRunning() only attached scheduler_* listeners on page load — replayed scan_phase, scan_file_flagged, and scan_done events were silently ignored
Idle SSE connections died silently (Flask/Werkzeug threading); the browser had no live connection when a scheduled scan fired minutes/hours later

Changes — Python (m365_scanner.py):

Module identity fix: sys.modules["m365_scanner"] = sys.modules[__name__]
Added _current_scan_id global — unique timestamp-based ID set at the start of every scan (M365 and file scans) and cleared after scan_done
broadcast() injects scan_id into every SSE event payload
scan_stream() filters the replay buffer to only include events matching the current scan_id, preventing stale replay from previous scans
New sse_replay / sse_replay_done marker events bracket the replayed block so the browser can distinguish replay from live events
New GET /api/scan/status lightweight endpoint returning {running, scan_id}

Changes — JavaScript (static/app.js):

Extracted _attachScanListeners(es) and _attachSchedulerListeners(es) — shared by both startScan() and _autoConnectSSEIfRunning()
_attachSchedulerListeners now shows the progress bar on scheduler_started and hides it on scheduler_done / scheduler_error
SSE polling watchdog (_sseWatchdog) checks /api/scan/status every 4s; reopens the SSE connection via _ensureSSE() if it has died
_userStartedScan flag — scan_done only closes the SSE connection for user-initiated scans; scheduled scans keep it alive
Fixed es.onerror handler — no longer silently nulls es

Fixed — File scan `scan_complete` → `scan_done` event name

run_file_scan() was broadcasting scan_complete on finish, but the JS only listens for scan_done. Renamed to scan_done with the same total_scanned / flagged_count payload shape as M365 scans.

Fixed — Resume scan used wrong profile

startScan() never told the server which profile was active. Settings were always saved to the Default profile. Now profile_id is sent in the scan start payload and _save_settings() accepts a profile_id parameter (takes precedence over profile_name).

Fixed — `install_macos.sh` launcher scripts

start_gdpr.sh and build_m365.sh templates now use exec python3 instead of exec python — fixes "not found" after removing python.org interpreter
spaCy model install: creates a pip shim in venv/bin/ (spaCy's shutil.which("pip") couldn't find the venv's pip3), falls back to direct pip install if spacy download still fails, and prepends venv/bin to PATH explicitly

Added — Diagnostic logging

[run_scan] prints sources, user count, app_mode, and a sample user entry at scan start — helps verify scheduled scans use the correct profile
[SSE] console.log messages in the browser for scan_phase, scan_done, scan_file_flagged, scheduler_started, scheduler_done, scheduler_error — aids debugging SSE delivery issues

Added — i18n keys (EN / DA / DE)

m365_sse_reconnecting — shown when page load detects a running scan
m365_sse_replay_note — logged after replayed events finish

[1.5.7] — 2026-03-28

Several strings in the Settings → General and Settings → Scheduler tabs were displaying in English regardless of the active language.

Missing lang keys added (EN / DA / DE):

btn_save — Save / Gem / Speichern (used by scheduler editor Save button and others)
m365_settings_about — About / Om / Über
m365_settings_save_pin — Save PIN / Gem PIN / PIN speichern
m365_sched_freq_daily/weekly/monthly — frequency labels in job list and editor
m365_sched_dow_mon through _sun — day-of-week labels

Template fixes:

"About" group heading now has data-i18n="m365_settings_about"
"Save PIN" button uses dedicated key m365_settings_save_pin instead of generic btn_save
Frequency and day-of-week <option> elements now have data-i18n attributes
Scheduler job list (schedRenderJobs) and status update now use t() for frequency labels

Changed — Theme toggle replaced with slider

The "Toggle dark / light" text button in Settings → General is replaced with a standard toggle slider (consistent with all other toggles in the UI). The slider reflects the current theme state when the tab opens and toggles the theme on click.

[1.5.6] — 2026-03-28

Feature — SSE event replay (#21)

Opening the browser mid-scan (e.g. while a scheduled scan is running) now replays all buffered events so the live log and result cards populate immediately, rather than showing nothing until the next event fires.

m365_scanner.py:

Added _sse_buffer: deque = deque(maxlen=500) — a ring buffer that stores every broadcast() event
broadcast() appends to the buffer before sending to SSE clients
run_scan() clears the buffer at the start of each scan so stale events from the previous scan are not replayed
Removed duplicate @app.route("/api/scan/stream") — route is now handled exclusively by the routes/scan.py blueprint

routes/scan.py:

scan_stream() replays _m._sse_buffer immediately when a new client connects, then switches to live events
All globals accessed directly via import m365_scanner as _m to avoid __getattr__ resolution failures that caused 500 errors
A : connected comment line is sent first to confirm the stream is flowing

static/app.js:

_autoConnectSSEIfRunning() — new function called on DOMContentLoaded that always opens the SSE connection on page load. If a scan is already running, buffered events replay immediately. If the buffer is empty, no events fire and the log stays quiet.
Handles scan_phase, scan_progress, scan_start, scan_file_flagged, scan_done, scheduler_started, scheduler_done, scheduler_error events
startScan() closes and reopens the SSE connection to get a clean stream for each manual scan

m365_scanner.py — CLI output when no browser connected:

broadcast() now prints key events to the terminal when _sse_queues is empty (i.e. no browser tab is watching), so scheduled scans are visible in the CLI: scan phases, file progress, errors, and completion summary

[1.5.5] — 2026-03-28

Fixed — Scheduler: multiple bugs after multi-job implementation

scheduler.py renamed to scan_scheduler.py

Python's stdlib includes a sched/scheduler module that was being resolved instead of the project's own scheduler.py, causing module 'scheduler' has no attribute 'load_jobs'. Renaming the project file to scan_scheduler.py eliminates the collision entirely. All imports updated in routes/scheduler.py and m365_scanner.py.

Jobs with missing UUID assigned on load

Jobs saved before the multi-job refactor had "id": "". load_jobs() now detects any job with a missing or empty id and assigns a fresh UUID, then rewrites the file. This fixed "Delete failed: id required" and silent edit failures.

Enabled toggle added to each scheduler row

Each job row now has an inline toggle switch instead of a static ✓/— indicator. Toggling saves the change immediately via /api/scheduler/jobs/save. The job description also shows "Next: [date]" after the status fetch resolves.

Edit no longer duplicates the job

_sched().reload() inside the save route was not wrapped in its own try/except. If APScheduler threw (e.g. not yet started), the exception propagated and caused the save to fall through to the "create new" path. Both reload() calls (save and delete) are now wrapped in try/except: pass.

Delete button now works

The delete button was passing the HTML-escaped job name through the onclick attribute — names with apostrophes or special characters broke the JS string. Fixed by passing only id and looking up the name from _schedJobs inside schedDeleteJob(). The route and JS both have proper error handling now.

"Not authenticated" on scheduled run

state.connector is assigned once at startup (_state.connector = _connector) and never updated when the user authenticates later. The scheduler now reads _m._connector directly from the live m365_scanner module at run time, guaranteeing it sees the current authenticated connector. flagged_items and scan_meta reads also updated to use _m.flagged_items and _m.scan_meta directly.

[1.5.4] — 2026-03-28

Feature — Multiple scheduled scans

The Settings → Scheduler tab now supports multiple independent named scan jobs, replacing the previous single-job form.

scheduler.py

Config format changed from a single dict to {"jobs": [...]}. Each job has its own id (UUID), name, and all existing fields (frequency, time, profile, auto-email, auto-retention).
Old single-job ~/.m365_scanner_schedule.json files are automatically migrated to the new format on first load — no manual changes needed.
ScanScheduler registers one APScheduler job per enabled scan and tracks running state and last-run info per job independently.
Backward-compat shims (load_schedule_config, save_schedule_config) kept for any existing integrations.

routes/scheduler.py — new CRUD endpoints:

GET /api/scheduler/jobs — list all jobs
POST /api/scheduler/jobs/save — create or update a job (by id)
POST /api/scheduler/jobs/delete — delete a job by id
POST /api/scheduler/jobs/run_now — run a specific job immediately by id
Old /api/scheduler/config and /api/scheduler/run_now kept as backward-compat shims

templates/index.html — scheduler pane replaced with a job list (styled like File sources) and an inline editor that slides open when adding or editing. Each row shows the job name, frequency summary, enabled/running status pill, and ▶ Run / ✏ Edit / ✕ Delete buttons. Schedule configuration lives exclusively in the editor — nothing schedule-related appears in the sidebar except the existing "Next: …" indicator.

static/app.js — all sched* functions rewritten for multi-job: schedLoad, schedRenderJobs, schedAddJob, schedEditJob, schedSaveJob, schedDeleteJob, schedRunJob, schedCancelEdit, schedLoadHistory, schedUpdateSidebarIndicator.

Lang keys added: m365_sched_add, m365_sched_name, m365_sched_editor_new, m365_sched_editor_edit, m365_sched_name_required, m365_sched_no_runs, btn_cancel (da/en/de).

[1.5.3] — 2026-03-27

Added — Suggestion #19: Scheduled / automatic scans

In-process scheduler using APScheduler so GDPR scans run automatically on a configurable cadence — no cron or Task Scheduler setup required.

Backend:

New scheduler.py module wrapping APScheduler BackgroundScheduler with a single coalescing job; misfire grace time 1 hour.
Config stored in ~/.m365_scanner_schedule.json (daily/weekly/monthly, time-of-day, profile selector, auto-email, auto-retention).
Run history persisted in new schedule_runs DB table (migration #7).
routes/scheduler.py blueprint — GET/POST /api/scheduler/config, GET /api/scheduler/status, POST /api/scheduler/run_now, GET /api/scheduler/history.
Scheduler starts automatically on app.run; status printed at boot.
Scheduled scans reuse the full run_scan() pipeline (checkpoints, delta, broadcast, DB) — identical to interactive scans.
Auto-email sends the Excel report via Graph or SMTP after each scheduled scan.
Auto-retention optionally enforces the retention policy on overdue items.

UI:

Settings → Scheduler tab — enable/disable toggle, frequency picker (daily/weekly/monthly), time-of-day, profile selector, auto-email and auto-retention toggles, status display, run history, "Run now" button.
Sidebar — 🕐 next-scan indicator near the settings button; click to open scheduler config. Polls every 60 s.
Scan log — scheduled scans appear with 🕐 prefix via SSE events (scheduler_started, scheduler_done, scheduler_error).

Build / deps:

APScheduler>=3.10 added to requirements.txt.
scheduler.py and APScheduler hidden imports added to build_m365.py.
Schedule config added to --purge cleanup list.
Lang keys added for DA / EN / DE.

[1.5.2] — 2026-03-27

Fixed — File/SMB scan: image-only PDFs no longer hang the scanner

scan_pdf() in document_scanner launches Tesseract OCR and Poppler subprocesses when a PDF has no text layer. These subprocesses cannot be killed from a Python thread, causing the scanner to hang indefinitely on scanned documents (e.g. ESTA applications, invoice scans).

Fix: Before calling scan_pdf(), _scan_bytes() now opens the PDF with pdfplumber (pure Python, no subprocesses) and checks whether any page has a text layer using the existing is_text_page() helper. If all pages are image-only, the file is skipped immediately with no CPR hits — which is correct, since machine-readable CPR numbers cannot exist in an image-only PDF.

Text-layer PDFs (the majority) pass the check and are scanned normally. Only image-only PDFs (scanned documents) are skipped.

This replaces multiple failed approaches (ThreadPoolExecutor timeouts, shutdown(wait=False), extension-based skipping) that either blocked on context manager exit or removed legitimate file types from scanning.

Fixed — SMB scanning: multiple smbprotocol 1.14+ API changes

See v1.5.1 for details. Additional fix in this release:

smb_host is now auto-derived from the path (//host/share → host) when not explicitly stored in the source JSON, so SMB sources saved without an explicit host field still connect correctly.

Fixed — Routes blueprint: globals resolved lazily to prevent circular imports

Each route blueprint (routes/*.py) now uses Python's module __getattr__ hook to lazily resolve globals from m365_scanner at call time, not at import time. This prevents the circular import that caused double blueprint registration on startup.

Added — File source Edit button

See v1.5.1.

[1.5.1] — 2026-03-27

Fixed — SMB scanning: multiple smbprotocol 1.14+ API incompatibilities

Several functions in file_scanner.py used deprecated or renamed smbprotocol APIs:

uuid4_str() removed — Connection() now requires a uuid.UUID object, not a string. Changed to uuid.uuid4() directly; added import uuid at module level.
RequestedOpcodes removed from smbprotocol.open — was imported but never used; removed.
FilePipePrinterAccessMask.FILE_LIST_DIRECTORY → DirectoryAccessMask.FILE_LIST_DIRECTORY — directory listing requires DirectoryAccessMask, not the file/pipe mask.
FileDirectoryInformation moved — from smbprotocol.query_info to smbprotocol.file_info; import updated.
FileInformationClass enum — query_directory() expects FileInformationClass.FILE_DIRECTORY_INFORMATION (int enum), not a class instance.
query_directory() kwargs renamed — file_name= → pattern=, output_buffer_length= → max_output=.
Filename bytes — file_name field now returns UTF-16-LE bytes; decoded to str with error handling.
smb_host auto-derivation — if smb_host is not explicitly stored in the source JSON, it is now extracted from the path (//host/share → host). is_smb no longer requires smb_host to be pre-set.

Fixed — SMB scanning: junk directories skipped

Added SKIP_DIRS constant — a set of folder names silently skipped in both local and SMB walks:

.recycle  .recycler  $recycle.bin  .trash  .trashes
.sync  .btsync  .syncthing
.git  .svn  .hg
__pycache__  node_modules
.spotlight-v100  .fseventsd  .temporaryitems
system volume information  lost+found

Local walker prunes these from _dirs[:] before os.walk descends. SMB walker checks before recursing. Hidden directories (. prefix) are also skipped in both.

STATUS_END_OF_FILE errors (zero-byte placeholder files from Bittorrent Sync, .sync/stream_test.txt etc.) are now silently skipped instead of logged as warnings.

Fixed — SMB/local file scans: OCR disabled, per-file timeout added

PDF scanning via document_scanner.scan_pdf() would trigger Tesseract OCR on image-based PDFs (scanned forms, photos) causing single files to hang for minutes.

_scan_bytes_timeout() — new wrapper around _scan_bytes using ThreadPoolExecutor with a 30-second deadline per file. Timed-out files are logged as errors and scanning continues.

skip_ocr=True — file scan loop now passes skip_ocr=True to _scan_bytes, disabling OCR and reducing DPI to 150. Only the text layer is extracted from PDFs. This is appropriate for bulk compliance scanning where image-only PDFs rarely contain machine-readable CPR numbers.

Added — File source Edit button

Each file source row in ⚙ Sources → File sources now has an ✏ Edit button between Scan and Delete. Clicking it pre-fills the add form with the existing name, path, SMB host, and username (password shown as placeholder dots). Saving with an existing ID updates the source in-place. The Add button label changes to Save changes while editing and reverts on save.

[1.5.0] — 2026-03-27

Refactor — HTML template and JavaScript extracted from m365_scanner.py

m365_scanner.py was a ~9600-line monolith containing HTML, CSS, JavaScript, and Python all in one string. This made frontend edits unsafe (no linting, no syntax highlighting, string escaping hazards) and diffs unreadable.

What changed:

templates/index.html — the full HTML/CSS template (1418 lines), served via Flask's render_template() with two Jinja2 variables: app_version and lang_json
static/app.js — all JavaScript (2832 lines), served by Flask's built-in static file handler at /static/app.js
m365_scanner.py — reduced from 9586 to 5334 lines (44% smaller); now contains only Python: business logic, API routes, and configuration

Flask configuration updated:

app = Flask(__name__,
            template_folder=os.path.join(BASE_DIR, "templates"),
            static_folder=os.path.join(BASE_DIR, "static"))

BASE_DIR resolves to sys._MEIPASS when running as a PyInstaller bundle, or to the directory containing m365_scanner.py otherwise — the same pattern already used for lang/, keywords/, and classification/.

Build script updated:

build_m365.py now bundles templates/ and static/ alongside the existing lang/, keywords/, and classification/ directories.

Zero behaviour change — the app works identically. Only the file organisation changed.

[1.5.0] — 2026-03-27

Refactor — HTML template and JavaScript extracted from m365_scanner.py

m365_scanner.py was a ~9600-line monolith containing HTML, CSS, JavaScript, and Python all in one string. This makes frontend edits unsafe (no linting, no syntax highlighting, string-escaping hazards) and diffs unreadable.

New files:

templates/index.html — full HTML/CSS template (1452 lines) served via Flask render_template(). Two Jinja2 variables: {{ app_version }} and {{ lang_json | safe }}.
static/app.js — all JavaScript (2832 lines) served by Flask's built-in static file handler at /static/app.js.

Flask app updated:

app = Flask(__name__,
            template_folder=os.path.join(BASE_DIR, "templates"),
            static_folder=os.path.join(BASE_DIR, "static"))

BASE_DIR resolves to sys._MEIPASS when running as a PyInstaller bundle, or the directory of m365_scanner.py otherwise — the same pattern already used for lang/, keywords/, and classification/. build_m365.py updated to bundle both new directories.

Result: m365_scanner.py reduced from 9586 to ~2100 lines of pure Python. Zero behaviour change.

Refactor — Routes split into Flask Blueprints

All 55 API routes extracted from m365_scanner.py into a routes/ package. Shared mutable state lives in routes/state.py; blueprints import from there to avoid circular imports.

routes/
  __init__.py       package marker
  state.py          shared globals: connector, flagged_items, LANG, …
  auth.py           /api/auth/*                           174 lines
  users.py          /api/users/* + role overrides         222 lines
  scan.py           /api/scan/* + /api/settings/*         123 lines
  sources.py        /api/file_sources/* + /api/file_scan   93 lines
  profiles.py       /api/profiles/*                        48 lines
  email.py          /api/smtp/* + /api/send_report        210 lines
  database.py       /api/db/* + /api/admin/* + preview    536 lines
  export.py         Excel + Art.30 export + bulk delete  1177 lines
  app_routes.py     /api/about + /api/langs + /api/lang    67 lines

Housekeeping — Document Scanner files removed

The following files belonged to the standalone Document Scanner product and have been removed from this repository:

server.py — Document Scanner web app
scanner_worker.py — Document Scanner process-pool worker
build.py — Document Scanner build script
build_app.sh — Document Scanner shell build script
Dockerfile — Document Scanner Docker image
docker-compose.yml — Document Scanner Docker Compose file
doc_scanner_icon.png — Document Scanner app icon

requirements.txt rewritten for the M365 Scanner only. Removed pdf2image, pytesseract, pypdf, reportlab, img2pdf, and py7zr (Document Scanner dependencies). Added cryptography>=42.0 (SMTP password encryption, already in use since v1.4.7).

[1.4.8] — 2026-03-27

Changed — Email: Microsoft Graph API preferred over SMTP

Both Test and Send now now try the Microsoft Graph API first when the scanner is authenticated to Microsoft 365. This avoids SMTP AUTH entirely — no port 587, no app password, no admin centre changes needed.

New _send_email_graph() helper — sends via /me/sendMail (delegated mode) or /users/{sender}/sendMail (app mode). Supports optional Excel attachment for the full report. Requires the Mail.Send Graph permission on the Azure app registration (Application or Delegated, depending on auth mode).

Priority order:

Microsoft Graph API — used when connected to M365
SMTP — fallback if not connected or Graph fails

Error surfacing — Graph permission errors (403 / Forbidden / Mail.Send / insufficient privileges) are now returned directly with a clear actionable message: add Mail.Send permission to the Azure app registration and grant admin consent. Previously the error was silently swallowed and the scanner fell through to SMTP, masking the real problem.

SMTP AUTH error — if SMTP is used and Microsoft 365 returns error 530 5.7.57 ("Client not authenticated"), the error message now includes a plain-English tip explaining how to enable SMTP AUTH in the M365 admin centre, or how to use Graph instead.

Changed — Test button sends a real email to configured recipients

The SMTP Test button previously only verified connectivity (EHLO/STARTTLS handshake). It now sends an actual HTML test email to the configured recipients, making it easy to verify end-to-end delivery including spam filtering.

[1.4.7] — 2026-03-27

Security — SMTP password encrypted at rest

Previously the SMTP password was stored as plaintext in ~/.m365_scanner_smtp.json. It is now encrypted using Fernet symmetric encryption (cryptography library, already a dependency).

Implementation:

A random Fernet key is generated on first use and saved to ~/.m365_scanner_machine_id (chmod 0o600 — owner-readable only)
Passwords are stored as enc:<ciphertext> in the JSON file
_encrypt_password() / _decrypt_password() handle the encode/decode cycle
_load_smtp_config() transparently decrypts on load; _save_smtp_config() encrypts on save
Legacy plaintext passwords (no enc: prefix) are read as-is and re-encrypted next time settings are saved — no migration step required
Encrypted blobs are machine-specific — the ciphertext cannot be decrypted on another machine without the key file
Graceful fallback to plaintext if cryptography is unavailable (rare)
The GET /api/smtp/config endpoint never returns the password to the browser; it returns only has_password: true/false

Fixed — EXIF `has_pii` false positives on screenshots

_EXIF_PII_TAGS previously included HostComputer, DocumentName, and PageName. These are set automatically by macOS/Windows on every screenshot (machine name, app name) and contain no personal data about an individual. Removed from the tag set.

Minimum content length of 3 characters added — a field must contain at least 3 non-whitespace characters to trigger a has_pii flag. Prevents empty or single-character values from causing false positives.

Affected fields retained: Artist, Copyright, ImageDescription, UserComment, XPAuthor, XPSubject, XPComment, XPKeywords — all fields a human would deliberately fill with personal information.

Fixed — Accounts section not greyed out when switching to a file-only profile

_applyProfile() restores source checkboxes but did not call _updateAccountsVisibility() afterwards. Switching to a profile with no M365 sources selected left the accounts section fully interactive. Fixed by calling _updateAccountsVisibility() immediately after the checkbox restore loop.

[1.4.6] — 2026-03-27

Changed — Excel export updated for EXIF, GPS, and file sources

New columns in all source sheets:

GPS — ✔ tick when GPS coordinates are present in the item's EXIF data
EXIF author — author/artist name extracted from EXIF metadata
Special category column now filters out gps_location and exif_pii (represented by the dedicated GPS column instead)

New source types in SOURCE_MAP:

local — 📁 Local (green tab), for files from local folder scans
smb — 🌐 Network (blue tab), for files from SMB/CIFS network shares
Both get their own sheet when results exist; skipped silently if empty

Summary sheet:

Row 4: "Items with GPS data" count (shown only when non-zero)
Summary table shifted to row 7 to accommodate (was row 6)
Source rows now skipped when a source has zero items

New GPS locations sheet:

Teal tab — created only when GPS items exist
Columns: Name, Latitude, Longitude, Maps link (blue hyperlink), Account, Date Modified
Auto-filter enabled; alternating row colours

Bug fix: dead old function body (164 lines after the return) removed — the previous str_replace only replaced the docstring, leaving unreachable code in the file.

[1.4.5] — 2026-03-26

Fixed — `_detect_photo_faces` missing after EXIF insertion

The str_replace that added _extract_exif() accidentally consumed the def _detect_photo_faces function definition (it was part of the replaced string). All image scans raised NameError: name '_detect_photo_faces' is not defined. Function restored at its original position before _scan_bytes().

Fixed — Progress bar shows "undefined / undefined" during file scans

The M365 scan_progress SSE event sends {index, total, pct, file, eta}. The file scanner sent only {scanned, flagged}. The JS handler blindly read d.index and d.total, producing undefined / undefined.

Fixes:

run_file_scan() now broadcasts {scanned, flagged, file, pct} so the current filename and a progress indicator are shown while scanning.
The scan_progress JS handler now checks which fields are present and renders accordingly: index / total for M365 scans, N · M flagged for file scans.

Fixed — Local file preview: PDF, XLSX, DOCX now render content

/api/preview/<id> for source_type=local previously showed only a metadata placeholder for PDF and Office files. Now:

Type	Preview
PDF	First 5 pages extracted via `pdfplumber`, CPR numbers highlighted in red
XLSX / XLSM	First 50 rows of up to 3 sheets as a styled table
CSV	First 50 rows as a table
DOCX / DOC	First 80 paragraphs as text, CPR numbers highlighted

All fall back to a metadata card if the library is unavailable or the file cannot be parsed. document_scanner (already imported) provides access to pdfplumber and openpyxl.

[1.4.4] — 2026-03-26

Added — #18 EXIF metadata extraction from images

New function _extract_exif(content, filename) — extracts structured EXIF data from JPEG, PNG, TIFF, WEBP, and HEIC images using Pillow (already a dependency). No new packages required.

Extracted fields:

GPS coordinates — converted from DMS rational values to decimal degrees; Google Maps link generated
Author / Artist / Copyright / Description / UserComment / Keywords — checked for PII content
Device — camera make and model
Datetime — DateTimeOriginal or DateTime

Behaviour changes:

EXIF extraction runs on all scanned images regardless of the "Scan photos" toggle — it is lightweight (no CV processing) and always relevant
Images with GPS or PII-bearing EXIF fields are flagged even without CPR hits
special_category gains "gps_location" and/or "exif_pii" entries as appropriate
Face detection (_detect_photo_faces) still requires the "🖼 Scan photos for faces" opt-in

UI:

🌍 GPS badge — teal pill on result cards (grid and list view) when GPS coordinates are present
Preview panel — local image previews now show a collapsible "EXIF data" section beneath the image with GPS (clickable Google Maps link), author, date, device, and any other PII-bearing fields

Applies to both M365 and file system scans — OneDrive/SharePoint images and local/SMB files go through the same extraction path.

[1.4.3] — 2026-03-26

Three sidebar sections (✉ Email report, 🗄 Database, and the language selector + About link) have been removed from the sidebar and consolidated into a single ⚙ Settings modal, opened via a button in the sidebar footer.

General tab — language selector (mirrors the hidden langSelect), theme toggle, and About info (version, Python, MSAL, Requests, openpyxl versions).

Email report tab — full SMTP configuration (host, port, username, password, from address, STARTTLS, recipients), Save, and Send now. Pre-fills from saved config. openSmtpModal() now redirects to this tab for backward compatibility.

Database tab — DB stats (total items, flagged items, scan count), ⬇ Export, ⬆ Import, and 🗑 Reset DB. exportDB() and openImportDBModal() work unchanged.

🔍 Data subject lookup remains as a sidebar shortcut since it is part of the active compliance workflow.

[1.4.2] — 2026-03-26

Added — Dynamic sources panel in sidebar

The sidebar sources panel is now fully dynamic. Previously the four M365 sources (Email, OneDrive, SharePoint, Teams) were hardcoded checkboxes. Now:

renderSourcesPanel() builds the list at runtime from _M365_SOURCES (the four fixed M365 entries) and _fileSources (saved local/SMB sources). A "File sources" group header appears automatically when any file sources are configured.
Per-source visibility toggles in the ⚙ Sources modal (Microsoft 365 tab) control which M365 sources appear in the panel. Toggling one off removes it from the panel immediately.
File sources added in the Sources modal appear as checkboxes in the panel alongside the M365 sources, with 📁 (local) or 🌐 (SMB) icons.
The panel shows up to 5 rows before scrolling (max-height: calc(5 * 26px)).
Profile save/restore — file source selections are now included when saving a profile. buildScanPayload() merges M365 and file source IDs into allSources; _applyProfile() restores all of them. A _pendingProfileSources mechanism handles the async case where file sources load after the profile is applied.

Added — Hint tooltips on Delta scan, Scan photos, Retention policy toggles

Each of the three advanced option toggles now has a circled ? icon to the right of the label. Clicking it shows a speech bubble (fixed-positioned, z-index: 9999) with the hint text, positioned to the right of the icon and visible above the main content area. Only one bubble can be open at a time; clicking anywhere outside closes it.

Changed — ⚙ Profiles button moved to topbar

The accent-coloured ⚙ Profiles button was removed from the Database section in the sidebar. A plain ⚙ Profiles button (matching the style of ⚙ Sources) now appears to the right of the 💾 save button in the topbar profile bar.

Changed — App mode badge (modeBadge) removed

The modeBadge button and userBar div have been removed from the sidebar. Connection status and mode (App / Delegated) are now shown exclusively in the Sources modal (Microsoft 365 tab) — connection info row with green/grey status dot, display name, email, and mode label.

smRefreshStatus() now calls /api/auth/status (correct endpoint) and pre-fills Client ID, Tenant ID, and Client Secret fields from the saved config. Connects via /api/auth/config + /api/auth/start; disconnects via /api/auth/signout + signOut().

Fixed — File source naming: Name field required; auto-suggest from path

The "Label" field renamed to "Name" and marked required (red asterisk). fsrcAutoName() suggests a name as the user types the path — last path segment for local paths, host / share for SMB paths. The user's own name is never overwritten once typed.

Fixed — Sources panel fixed height with scroll

#sourcesPanel in the sidebar now has max-height: calc(5 * 26px); overflow-y: auto so it shows exactly 5 rows before scrolling, regardless of how many sources are configured.

The "Fiscal year end" label and select were previously side-by-side, causing the label to wrap on long translations (e.g. "Regnskabsårs afslutning"). Now stacked vertically (flex-direction: column) with width: 100% on the select.

Fixed — ⚙ cog size inconsistency between Sources and Profiles buttons

Both buttons previously used ⚙️ (U+2699 + variation selector U+FE0F), which can render at emoji size rather than text size. Replaced with plain ⚙ (U+2699) in both so they render at identical size.

Fixed — MB label removed from max attachment size picker

The "MB" text span to the right of the attachment size number input has been removed.

Fixed — File source selections included in profiles

buildScanPayload() now collects both M365 and file source IDs and merges them into allSources, which is saved as profile.sources. Previously only M365 source IDs were saved.

[1.4.1] — 2026-03-26

Replaced the fragmented sidebar source configuration with a single ⚙️ Sources button above the sources panel. This opens a tabbed modal:

Microsoft 365 tab: Azure credentials (Client ID, Tenant ID, Client Secret) moved from the auth screen into the modal — can be updated or cleared post-connect. Per-source toggles (Email, OneDrive, SharePoint, Teams) control which sources appear in the sidebar panel. Disconnect button signs out without leaving the page.

Google Workspace tab: Stub with "Coming soon" — placeholder for Gmail and Google Drive when implemented.

File sources tab: Full file source management (list, add, delete, scan) moved from the standalone "📁 File sources" sidebar row into this tab. The separate sidebar row is removed.

Sidebar change: The "📁 File sources" sidebar section is removed. The sources panel now has a compact ⚙️ Sources button in its header row. The panel itself respects the per-source visibility toggles set in the modal — if a user disables OneDrive, it disappears from the panel immediately.

Backward compatibility: openFileSourcesModal() redirects to openSourcesMgmt('files') so any existing call sites continue to work.

[1.4.0] — 2026-03-26

Added — #8 File system scanning (local folders and SMB/CIFS network shares)

New file: file_scanner.py — unified local + network file iterator.

FileScanner.iter_files() yields (relative_path, bytes, metadata) regardless of whether the source is a local path or a network share. All CPR scanning, card streaming, and DB persistence stay in m365_scanner.py — file_scanner.py only handles how files are accessed.

Local scanning uses os.walk() on any path (workstation, USB drive, or already-mounted network share). SMB/CIFS scanning uses smbprotocol directly without requiring a mount — supports SMB2/3 with NTLM or domain credentials. smbprotocol is optional: if not installed, the scanner falls back to local-only mode with a logged warning.

Credential storage priority (SMB):

OS keychain via keyring (recommended — password never touches the filesystem)
NAS_PASSWORD environment variable
.env file (chmod 600) via python-dotenv

Both optional dependencies (smbprotocol, keyring, python-dotenv) are added to requirements.txt as opt-in extras.

Results write to the same SQLite DB as M365 items with source_type = "local" or "smb", so the Article 30 report and data subject lookup cover all sources in a single view. File/network cards use 📁 and 🌐 source badges respectively.

UI — 📁 File sources sidebar section:

Manage button → opens the File Sources modal
Add source form — label, path; SMB fields (host, user, password) appear automatically when the path starts with // or \; host is auto-filled from the path
Per-source ▶ Scan button — starts a scan immediately; results stream into the main grid via SSE exactly like an M365 scan
Delete — removes a source definition (does not affect scan results already in the DB)
Sources persist in ~/.m365_scanner_file_sources.json

New API routes:

Route	Method	Description
`/api/file_sources`	GET	List all file source definitions
`/api/file_sources/save`	POST	Add or update a source
`/api/file_sources/delete`	POST	Remove a source by id
`/api/file_sources/store_creds`	POST	Store SMB password in OS keychain
`/api/file_scan/start`	POST	Start a file scan (non-blocking)

New CLI flags:

# Scan a local folder
python m365_scanner.py --scan-path ~/Documents

# Scan an SMB share (password from OS keychain)
python m365_scanner.py --scan-path //nas.school.dk/shares \
  --smb-user "DOMAIN\\henrik" --smb-keychain-key gdpr-scanner-nas

# One-time credential storage
python m365_scanner.py --smb-store-creds --smb-host nas.school.dk \
  --smb-user "DOMAIN\\henrik"

# With photo scanning and file size limit
python m365_scanner.py --scan-path //nas/staff --scan-photos --max-file-mb 100

build_m365.py — file_scanner.py added to PyInstaller datas bundle.

[1.3.11] — 2026-03-26

Fixed — Face detection: excessive false positives on background elements

Haar cascade detection with minNeighbors=5 and min_size=40px was triggering on background textures, bottle labels, artwork, and out-of-focus persons, reporting up to 16 faces for a photo containing 1–2 actual subjects.

Changes in _detect_photo_faces() (m365_scanner.py):

min_size raised 40 → 80 px — eliminates detections on small background features; out-of-focus background persons and objects are too small in pixels to exceed this threshold
minNeighbors raised 5 → 8 — each candidate region must be confirmed by 8 overlapping scale-pyramid detections instead of 5; random texture patterns rarely survive this many confirmations

If over-detection persists on a specific image, minNeighbors=10 and min_size=100 are reasonable next steps before genuine faces are missed.

Fixed — Result cards: replaced 👤 + separate role-pill with unified role icon

The account-pill (showing the owner's display name) previously prepended a static 👤 via CSS ::before and rendered a separate role-pill span (🎓/👔) alongside it. Both elements have been merged: the account-pill now prefixes the display name directly with the role icon — 🎓 name for students, 👔 name for staff, 👤 name for unclassified — removing the redundant separate badge and saving horizontal space in both grid and list view.

[1.3.10] — 2026-03-26

Changed — Role classification: fragment-first, ID-second

Motivation: Microsoft has reissued new UUIDs for the same licence multiple times over the past 5–6 years (EA → A1/A3/A5 → new commerce/CSP → benefit variants). skuPartNumber strings like STANDARDWOFFPACK_FACULTY have been stable across all those generations while UUIDs change with every new issuance.

New classify_user_role() order:

Fragment match on skuPartNumber (runs first when sku_map available) — staff fragments checked before student across all licences, so a STUDENT_BENEFIT add-on cannot mask a FACULTY licence.
SKU ID lookup from m365_skus.json — fallback when sku_map is empty or when a licence has no recognisable fragment (e.g. Power Automate Free assigned to faculty).

Any future Microsoft SKU re-issuance is classified correctly without updating m365_skus.json, as long as the part number still contains FACULTY or STUDENT.

Fixed — `m365_skus.json`: added two missing faculty SKUs

c2273bd0-dff7-4215-9ef5-2c7bcfb06425 — Microsoft 365 Apps for Faculty (primary licence at Gudenåskolen, absent from all previous versions)
f30db892-07e9-47e9-837c-80727f46fd3d — relabelled Microsoft Power Automate Free (assigned to faculty)

[1.3.9] — 2026-03-26

Fixed — `m365_skus.json` not deployed; `build_sku_map_from_users` sampled wrong users

File missing: m365_skus.json was never copied into classification/ on disk. _load_sku_data() fell back to empty sets (staff_ids_count: 0). Students still classified via STUDENT fragment; staff always "other". Fix: file now shipped. Place in GDPRScanner/classification/m365_skus.json.

Wrong sample: build_sku_map_from_users took the first 20 alphabetical users — all students at Gudenåskolen — so it never fetched a staff part number. Fixed to sample evenly across the full list and always include the last 5 users.

[1.3.8] — 2026-03-26

Fixed — `m365_skus.json` not found in PyInstaller bundle; `🔍` SKU debug modal

_SKU_FILE = Path(__file__).parent / ... evaluated at class-definition time, before sys._MEIPASS is set in a frozen build. Replaced with _sku_file_path() classmethod that checks _MEIPASS at call time.

Added 🔍 SKU debug button to the accounts panel role-filter row. Opens a modal showing every tenant SKU ID colour-coded as 🎓 student / 👔 staff / ❓ unknown, with selectable text for pasting unknowns into m365_skus.json.

/api/users/license_debug extended: now returns student_ids, staff_ids, runtime block (set sizes, fragment lists, file path, sku_map entry count), and per-licence in_staff/in_student/frag_staff/frag_student trace for every user — sufficient to diagnose any classification failure without reading server logs.

[1.3.7] — 2026-03-26

Fixed — `license_debug` extended for full runtime diagnostics

/api/users/license_debug rewritten to expose all runtime state: staff_ids_count, student_ids_count, fragment lists, sku_file_path, sku_map_entries, and a step-by-step per-licence classification trace for every user (in_staff, in_student, frag_staff, frag_student, skuName).

[1.3.6] — 2026-03-26

Fixed — Staff misclassified as student: two-pass classify_user_role

Root cause: f30db892-07e9-47e9-837c-80727f46fd3d is a Microsoft Student Use Benefit add-on that Microsoft automatically assigns alongside faculty licences in Education tenants. Its skuPartNumber contains "STUDENT". Because the old single-pass loop checked student and staff in per-licence order, the fragment match on this add-on fired before the authoritative faculty ID (94763226) was ever reached, returning "student" instead of "staff".

Fix — classify_user_role() now uses a strict two-pass approach:

Pass 1 — authoritative ID match (m365_skus.json), staff before student: All licences are scanned for staff IDs first, then student IDs. A single faculty SKU ID anywhere in the licence list wins regardless of what other add-on licences appear before it.

Pass 2 — skuPartNumber fragment match, staff before student: Only reached if no ID match was found. Staff fragments are checked across every licence before student fragments — preventing a STUDENT_BENEFIT add-on from masking a FACULTY licence later in the list.

Result: A staff member holding [STUDENT_BENEFIT_ADDON, FACULTY_A1, STUDENT_DEVICE] is now correctly classified as "staff" in all cases, whether sku_map is populated or not.

[1.3.5] — 2026-03-26

Fixed — Staff not recognised: always merge per-user SKU map

Root cause: build_sku_map_from_users() (which calls /users/{id}/licenseDetails for up to 20 sampled users) was only called when sku_map was completely empty. In practice get_subscribed_skus() tier 2 (/me/licenseDetails) always succeeds in delegated mode, returning the signed-in admin's own license — making sku_map non-empty and silently skipping the per-user sampling.

If the admin's license happened to be a faculty A1 and other staff held A3 or an unlisted variant, those A3 users were never added to sku_map and fragment matching could not fire for them, leaving them as "other".

Fix: build_sku_map_from_users() is now always called and its results merged into sku_map, regardless of whether get_subscribed_skus() already returned entries. This guarantees that every distinct SKU ID actually in use by any of the first 20 users gets a skuPartNumber entry, enabling fragment matching for all staff variants — including those not yet listed in m365_skus.json.

Same merge applied in license_debug so the 🔍 modal also sees complete data.

[1.3.4] — 2026-03-26

Fixed — Role classification: three-tier SKU map fallback

Root cause: get_subscribed_skus() requires Directory.Read.All or Organization.Read.All. If the Azure app registration does not have that permission (typical delegated/device-code setups), it silently returned {} and the fragment fallback never ran, leaving every user as "other".

Fix — get_subscribed_skus() now tries three endpoints in order:

Tier	Endpoint	Permission needed
1	`/subscribedSkus`	Directory.Read.All (admin)
2	`/me/licenseDetails`	User.Read only
3	`build_sku_map_from_users()` via `/users/{id}/licenseDetails` (up to 20 users)	User.Read.All

Each tier logs how many SKU entries it found. Tier 2 always works in delegated mode and covers the signed-in user's licenses. Tier 3 covers all distinct SKUs used in the tenant by sampling up to 20 users. If any tier returns results, the others are skipped.

UI warning banner — when every fetched user resolves to "other", a red banner appears above the accounts list: "No users classified — click 🔍 to diagnose." It disappears automatically once classification succeeds.

[1.3.3] — 2026-03-26

Problem: Even with classification/m365_skus.json loading correctly, users showed as unclassified because the tenant's actual SKU IDs were not in the file. There was no easy way to discover which IDs to add.

Changes:

🔍 SKU debug button — a small magnifying-glass button added to the role filter row (next to 🎓 Elev). Clicking it opens a modal that calls GET /api/users/license_debug and lists every unique SKU ID in the tenant, colour-coded: 🎓 student / 👔 staff / ❓ unknown. Unknown IDs can be selected and copied directly into classification/m365_skus.json.
/api/users/license_debug extended — now also returns student_ids and staff_ids arrays from the loaded SKU file so the frontend can mark each tenant SKU as known or unknown without a second round-trip.
_sku_file_path() classmethod — replaced the static _SKU_FILE class attribute with a method that checks sys._MEIPASS first (PyInstaller bundle) then falls back to Path(__file__).parent / "skus" / "m365_skus.json". The static attribute evaluated at class-definition time before _MEIPASS was set, causing the frozen app to look in the wrong directory.
Server-side warning — GET /api/users now logs a WARNING to stdout when 0 out of N users are classified, including a sample of the unrecognised SKU IDs seen in the first 20 users.
Translated — EN / DA / DE (3 new keys)

[1.3.2] — 2026-03-26

Fixed — Student/Staff misclassification: incomplete SKU lists + no override (#1.3.2)

Root cause: The hardcoded SKU lists introduced in v1.0.0 covered only ~8 student and 6 staff SKUs. Microsoft publishes 100+ Education SKU IDs; any tenant using a SKU not in those lists silently fell through to "other", leaving users unclassified or relying solely on the skuPartNumber fragment fallback — which itself was too specific (STANDARDWOFFPACK_STUDENT instead of just STUDENT).

`m365_connector.py` — Expanded SKU lists and broader fragment matching

Student set expanded from 8 → 12 SKUs:

Added 46c119d4 (M365 A1 for Students — student use benefit)
Added 8fc2205d (O365 A5 for Students)
Added 160d616a (O365 A3 for Students device)
Added a4e376bd (M365 A1 for Students new commerce)

Staff set expanded from 6 → 9 SKUs:

Added 2d61d025 (M365 A1 for Faculty — faculty use benefit)
Added 15b1d32e (O365 A3 for Faculty device)
Added ba04c29e (M365 A1 for Faculty new commerce)

Fragment patterns broadened — "STUDENT" and "FACULTY" now catch all part-number variants (_STUDENT, STUDENT_, STUDENT_BENEFIT, _FAC, etc.) without needing to enumerate every Microsoft naming permutation.

`m365_scanner.py` — Manual role overrides

Because no SKU list can ever be complete, admins can now correct individual users directly from the accounts panel:

🎓/👔/❓ role badge on every user row — click to cycle: auto → student → staff → other → (clear, back to auto)
Overridden rows show the badge in accent colour with a ✎ indicator
Overrides persisted to ~/.m365_scanner_role_overrides.json — survive restarts and re-authentication
Applied at both display time (/api/users) and scan time (_user_role_map) so card badges, filter buttons, Excel Role column, and Article 30 inventory split all reflect the corrected role
GET /api/users/role_override — returns all current overrides
POST /api/users/role_override — sets or clears one override
Override file added to --purge file list
Translated — EN / DA / DE (3 new keys)

[1.3.1] — 2026-03-26

Fixed — Student/Staff role misclassification (`m365_connector.py`)

Two SKU ID collisions in _STUDENT_SKU_IDS / _STAFF_SKU_IDS caused Faculty users to be shown as Students (and vice versa) for any tenant using A5 or A3 Education licenses:

SKU ID	Correct role	Bug
`e578b273-6db4-4691-bba0-8d691f4da603`	Staff (M365 Education A5 for Faculty)	Was also in `_STUDENT_SKU_IDS` as "O365 A5 for Students" — Faculty A5 users always showed as 🎓 Student
`78e66a63-337a-4a9a-8959-41c6654dfb56`	Student (Office 365 A3 for Students)	Was also in `_STAFF_SKU_IDS` as "M365 A1 for Faculty (device)" — this had no effect because student is checked first, but the comment was wrong and the duplicate entry was confusing

classify_user_role() checks student first, so any overlap resolves to student, silently misclassifying all affected Faculty accounts.

Fix: removed e578b273 from _STUDENT_SKU_IDS and 78e66a63 from _STAFF_SKU_IDS. Also removed a stale duplicate of e578b273 that appeared twice in _STAFF_SKU_IDS. Added a RuntimeWarning guard inside classify_user_role() that logs any future collision between the two sets.

Impact: Article 30 staff/student inventory split, role filter buttons (👔 / 🎓), role badges on cards, and Excel Role column are all now correct for A5 and A3 Education tenants.

Workaround until update: use GET /api/users/license_debug to see the raw SKU IDs and current classification for each user.

[1.3.0] — 2026-03-26

Added — Biometric photo scanning (#9)

GDPR reference: Article 9 (special categories — biometric data), Article 5(1)(b)(e), Recital 38, Databeskyttelsesloven §6

PHOTO_EXTS — new constant covering .jpg .jpeg .png .bmp .tiff .tif .webp .heic .heif
_detect_photo_faces(content, filename) — calls ds._get_cv2() + ds.detect_faces_cv2() (already in document_scanner.py); PIL fallback for HEIC/HEIF; minNeighbors=5 for conservative detection; returns face count or 0 on any failure; entirely safe — exceptions swallowed silently
scan_photos option — new boolean scan option (default False — opt-in); extracted from scan_opts alongside delta and email_body
🖼 Scan photos for faces toggle in the Options panel, with hint: "Slower — opt in"
Photo items flagged even without CPRs — a file is added to results if face_count > 0, even if no CPR number is found; photographs of identifiable people are Art. 9 data regardless of CPR content
"biometric" auto-injected into special_category when faces are detected and "biometric" is not already present
face_count field added to card payload, DB, Excel, and Article 30 report

DB (migration #4):

face_count INTEGER NOT NULL DEFAULT 0 added to flagged_items via auto-migration
save_item() updated to persist face_count

UI:

📷 N faces badge — teal photo-face-badge pill shown on cards in both grid and list view when face_count > 0
📷 Photos / biometric filter added to the Special dropdown in the filter bar; applyFilters() handles specialVal === 'photo'
buildScanPayload() includes scan_photos; _applyProfile() restores it when loading a profile

Excel export:

Face count column added as column 3 (between CPR Hits and Special category); URL column index updated from 10 → 11 for hyperlink styling

Article 30 report:

Summary section: Photos with detected faces (Art. 9 biometric) row with item + face count; explanatory note on legal basis and parental consent (Databeskyttelsesloven §6)
New dedicated section: Photographs and Biometric Data (Article 9) — intro paragraph, 4-bullet retention guidance (purpose limitation, pupil consent, website removal, archiving), item table (name, account, source, faces, modified date), capped at 50 rows
Methodology section: bullet added describing OpenCV Haar cascade detection

Translated — EN / DA / DE (16 new keys per language)

[1.2.3] — 2026-03-26

⚙ Profiles button in the sidebar Database row opens a modal listing all saved profiles
Each profile row shows name (with ● active indicator), sources summary, description, and last run timestamp
Use — loads the profile into the sidebar and updates the topbar dropdown; closes the modal
Edit — expands an inline edit form directly in the row; saves name and description via POST /api/profiles/save
Duplicate — creates a copy with a unique (copy) / (copy 2) suffix; reloads the list
Delete — confirms, removes via POST /api/profiles/delete, clears _activeProfileId if the deleted profile was active
Empty state shown when no profiles have been saved yet
Translated — EN / DA / DE (14 new keys per language)

Added — Database export/import UI (#11)

🗄 Database sidebar section with Export and Import buttons (always visible; sits between Email report and User info)
Export button — calls GET /api/db/export; triggers a browser download of a timestamped ZIP (gdpr_export_YYYYMMDD_HHmmss.zip) containing 8 JSON files; CPR hashes only, thumbnails stripped
Import modal — file picker (.zip only), mode selector (Merge / Replace), replace warning panel, status line, and Import button; calls POST /api/db/import with multipart form data
GET /api/db/export Flask route — generates ZIP in a temp file, streams bytes as application/zip attachment
POST /api/db/import Flask route — accepts multipart file, mode, confirm; validates replace confirmation server-side; returns {ok, mode, imported: {table: count}}
Translated — EN / DA / DE (17 new keys per language)

Changed — Article 9 keyword matching compiled to regex (#13)

_load_keywords() now compiles one re.Pattern per Article 9 category at startup using a longest-first alternation: (?:keyword_a|keyword_b|…) with re.IGNORECASE
Short keywords (≤ 4 chars) retain (?<!\w)…(?!\w) word-boundary anchors to prevent substring false positives
_check_special_category() uses the compiled patterns via pattern.finditer() instead of a sequential str.find() loop over up to 459 entries
Startup log now reports compiled category count: Loaded 459 keywords (9 categories compiled)
Performance: ~10–50× faster for large tenants; negligible difference for typical school tenants (~100 flagged items); meaningful saving at 1 000+ items

[1.2.2] — 2026-03-21

Added — Profile selector in topbar (15c)

Profile dropdown in the topbar, between the Scan button and the spacer — shows "Default (sidebar)" plus all saved profiles with their last run date
💾 Save button next to the dropdown — prompts for a name and saves the current sidebar state (sources, options, user selection, retention settings) as a named profile via POST /api/profiles/save
onProfileChange() — fires when the dropdown changes; calls _applyProfile() to populate the sidebar controls from the selected profile
_applyProfile(profile) — sets all source checkboxes, scan options, retention fields, and queues user selection for when the accounts list is loaded
_applyPendingProfileUsers() — applies a profile's user_ids to the accounts list after loadUsers() completes; safe to call multiple times
loadProfiles() — fetches /api/profiles and populates the dropdown; called on onAuthenticated()
saveCurrentAsProfile() — collects the full buildScanPayload() state and posts it as a new or updated profile
Profiles with a description show it as a tooltip on the dropdown option
Selecting "Default (sidebar)" clears _activeProfileId so the sidebar is used directly with no profile applied
Translated — EN / DA / DE (6 new keys)

[1.2.1] — 2026-03-21

Added — Scan profiles 15a + 15b

15a — Backend profile storage

_profiles_load() — reads all profiles from ~/.m365_scanner_settings.json
_profiles_write() — atomic write of the full settings dict
_profile_from_settings() — wraps a flat settings dict as a profile object
_profile_get(name_or_id) — case-insensitive lookup by name or UUID
_profile_save(profile) — insert or update a profile
_profile_delete(name_or_id) — delete by name or UUID
_profile_touch(id, scan_id) — updates last_run and last_scan_id after a successful scan
Automatic migration — on first run, existing flat ~/.m365_scanner_settings.json is silently wrapped into a profile named "Default"; no user action required
Legacy shim — _save_settings() and _load_settings() continue to work unchanged; all existing headless setups are unaffected
Profile API routes — GET /api/profiles, POST /api/profiles/save, POST /api/profiles/delete, GET /api/profiles/get for future UI use (15c/15d)

15b — CLI profile support

--list-profiles — tabular listing of all profiles with name, sources, last run, and scan ID
--save-profile NAME — saves current CLI options as a named profile; updates existing if name matches
--delete-profile NAME — removes a profile by name
--profile NAME — loads a named profile for --headless runs; populates sources, retention, fiscal year end, and email recipients from the profile; prints profile name, description, and last run before scanning
After a successful headless scan, the active profile's last_run and last_scan_id are updated automatically

[1.2.0] — 2026-03-20

Added — Article 9 sensitive category detection (#3)

keywords/da.json — 459 Danish keywords across 9 Article 9 categories: health, mental health, criminal (Art. 10), trade union, religion, ethnicity, political, biometric, and sexual orientation. Includes _false_positive_guidance for ambiguous terms and _proximity_note explaining the matching strategy
keywords/ subfolder — mirrors the lang/ pattern; keywords/en.json and keywords/de.json can be added without code changes
_load_keywords() — loads the keyword file at startup matching the active UI language; falls back to da.json
_check_special_category(text, cprs) — returns a sorted list of matched Article 9 category keys; a keyword only triggers when within 150 characters of a CPR number (proximity filter); if no CPRs are present in the text, any keyword occurrence triggers
Card badge — purple ⚠ Art.9 — health, criminal pill on flagged cards showing all detected categories
Filter bar dropdown — "All risk levels / Art. 9 special category" quick filter in the results grid
DB migration #3 — special_category TEXT NOT NULL DEFAULT '[]' added to flagged_items via auto-migration; stored as JSON array
finish_scan() — counts special category items per scan and writes to scan_history.special_category for trend tracking
Excel export — "Special category" column added as column 3 on all per-source sheets
Article 30 report — special category item count and DPIA warning added to the summary section; "Art. 9" column added to the per-source breakdown table with purple highlighting on non-zero values
Translated — EN / DA / DE (6 new keys per language)
Build scripts — keywords/ folder bundled into PyInstaller app alongside lang/
.gitignore — !keywords/*.json added to prevent keyword files being excluded by the *.json catch-all

[1.1.3] — 2026-03-20

Fixed

Stray duplicate _get_bytes body — dead code block left after delete_drive_item_for_user from a previous edit has been removed

Changed — `m365_connector.py`

Split timeouts — replaced all hardcoded timeout=30 / timeout=60 with two tuned constants:
- _TIMEOUT_API = (10, 45) — 10s connect, 45s read for JSON API calls
- _TIMEOUT_BYTES = (10, 120) — 10s connect, 120s read for file/attachment downloads
- The 10s connect timeout makes hung connections fail fast; the read timeout allows slow wireless links to complete a transfer without aborting
Exponential backoff with retry — all four core request methods (_get, _post, _get_bytes, _delete) now retry up to 4 times on transient network errors:
- Retried: ConnectionError, Timeout, ChunkedEncodingError, ReadTimeout, HTTP 429, HTTP 503, HTTP 504
- Not retried: HTTP 403 (permission), HTTP 410 (delta token expired) — raised immediately
- Backoff: 2s → 4s → 8s between attempts (capped at 30s); 429 responses use the Retry-After header value
- Intermittent wireless dropouts and brief gateway errors are now absorbed transparently without interrupting a scan
Streaming file downloads — _get_bytes now uses stream=True and iter_content(65536) so large attachments are received in 64 KB chunks rather than one blocking read; prevents read timeouts on slow connections for large files
list_users inline timeout — the _fetch helper inside list_users was using its own hardcoded timeout=30; updated to use _TIMEOUT_API

[1.1.2] — 2026-03-20

Fixed

App does not start after build — m365_db.py, scanner_worker.py, and VERSION were missing from PyInstaller datas in build_m365.py; the app crashed immediately on launch because these files could not be found inside the bundle
_read_app_version() broken in both build scripts — still searched for APP_VERSION = "..." as a string literal in the scanner source, but both scanners now read from the VERSION file; build scripts updated to read VERSION directly
VERSION not bundled — build.py (Document Scanner) also missing the VERSION file in datas

Added

--purge CLI flag — permanently deletes all data files created by the scanner (SQLite database, Azure credentials, SMTP credentials, settings, checkpoint, delta tokens, language preference, OCR cache, MSAL token cache); prompts for yes confirmation; --yes skips prompt for scripted use
--export-db FILE — exports the database to a structured ZIP archive containing 8 JSON files; thumbnails excluded; CPR stored as hashes only
--import-db FILE — imports a previously exported ZIP; --import-mode merge (default) adds dispositions and deletion log only; --import-mode replace wipes and restores all tables; --yes skips confirmation on replace

[1.1.1] — 2026-03-19

Fixed

Layout collapse in light mode — .topbar CSS rule was broken by an earlier edit; border-bottom and background properties were orphaned onto a dangling line, causing the topbar to render with no background and the Scan button to be nearly invisible
Sidebar missing — .layout used height: 100vh which ignored body padding, causing the flex layout to overflow and the sidebar to disappear
macOS pywebview titlebar overlap — content rendered behind the traffic-light buttons; fixed with padding-top: 30px on body when running inside pywebview on macOS, combined with box-sizing: border-box and height: 100% on .layout
<option> elements not translated — applyI18n() used el.innerHTML on <option> elements; some browsers do not re-render the select's visible text when innerHTML is set on an already-mounted option; switched to el.textContent for option elements
Disposition filter dropdown not translated on load — filter bar is hidden until first scan result arrives so applyI18n() on DOMContentLoaded missed it; applyI18n() is now called when the filter bar is first shown
Card delete button z-index — added z-index: 1 to .card-delete-btn so it stacks correctly within its card context

Added

--reset-db CLI flag — permanently drops and recreates all database tables; shows a summary of what will be deleted and requires typing yes to confirm
--yes flag — skips confirmation prompts; use with --reset-db for scripted/automated resets
ScanDB.reset() — new method in m365_db.py that drops all tables in correct foreign-key order, resets user_version to 0, and reopens the connection with a fresh schema

[1.1.0] — 2026-03-19

Added — M365 Scanner

Student / staff role classification — O365 license SKU IDs used to classify users as 🎓 Student or 👔 Staff with no extra Azure permissions required. Hardcoded known Microsoft Education SKU IDs cover M365/Office 365 A1/A3/A5 for Students and Faculty. Fragment fallback for future SKUs.
Role filter in accounts panel — All / 👔 Ansat / 🎓 Elev buttons filter the user list before selecting accounts to scan
Role badge on result cards — 🎓/👔 pill shown on every card in grid and list view
user_role in SQLite DB — stored in flagged_items table; DB migration applied automatically on first run
Licensed users only — accounts without an assigned O365 license are excluded from the user list
Disposition filter in filter bar — filter results grid by compliance disposition status
Headless auto-delete of delete-scheduled items — items tagged for deletion are removed automatically after each headless scan
Deletion audit log — every deletion logged to deletion_log table with timestamp, actor, reason, and legal basis
GET /api/db/deletion_log — API endpoint for the deletion log
Deletion log in Article 30 report — dedicated section with summary-by-reason table and full 7-column log
Article 30 — student/staff split — Section 3 (Data Inventory) now shows Staff and Student tables separately; parental consent note added for student items (Databeskyttelsesloven §6)
GET /api/users/license_debug — diagnostic endpoint showing raw SKU IDs and classified roles for each user
_resolve_display_name() — resolves GUIDs and "Microsoft Konto" guest account placeholders to email address throughout UI and Article 30 report
Account name in Article 30 — resolved via user_ids stored in scan options; GUID no longer shown in any column
All Article 30 strings translated — deletion log section now uses L() throughout; 19 new keys in EN/DA/DE
VERSION file — single source of truth; both scanners read version at startup via Path(__file__).parent / "VERSION"
CHANGELOG.md — release history and versioning policy
SECURITY.md — responsible disclosure process
CONTRIBUTING.md — development setup, code style, PR process
LICENSE — AGPL-3.0 with commercial licensing note and GDPR disclaimer
.gitignore — covers credentials, databases, audit logs, venv, build artefacts

Fixed — M365 Scanner

Language switching no longer reloads the page — translations applied in-place, scan results preserved
Connect screen freeze — duplicate renderAccountList function definition caused a JavaScript syntax error that prevented onAuthenticated() from firing
Account column in Article 30 report showing GUIDs — resolved via _acct_map built from stored user_ids
"Microsoft Konto" / GUID display names on cards and in reports — resolved to email address

Changed — M365 Scanner

Excel export — 9 columns (was 7): added Account (display name), Role, and Disposition; URL hyperlink column index updated accordingly
Accounts list — licensed users only; assignedLicenses post-filter applied

[1.0.0] — 2026-03-19 — Initial public release

Document Scanner (`server.py`)

Scan PDFs, Word, Excel, CSV, and image files for Danish CPR numbers
OCR support via Tesseract for scanned/image-based PDFs
NER-based detection of names, addresses, phone numbers, emails, IBANs, and bank accounts via spaCy
CPR validation: strict Modulus 11 check + century-digit verification
Redaction modes: mask CPR only, or full anonymisation of all personal data
Face detection and blurring in image files via OpenCV
Risk scoring per file based on CPR count, age, and PII density
Dry-run mode — scan without writing any output files
JSON audit log (scanner_audit.jsonl) — append-only, records every action
SQLite OCR cache (~/.document_scanner_ocr_cache.db) — avoids re-OCR of unchanged pages
Web UI on port 5000 with grid and list view, live progress, drag-and-drop upload
Standalone macOS .app and Windows .exe via PyInstaller + pywebview

M365 Scanner (`m365_scanner.py`)

Scanning

Exchange mailboxes: all folders and subfolders, recursive, language-independent using wellKnownName identifiers
OneDrive, SharePoint, Teams file scanning via Microsoft Graph API
Attachment scanning: PDF, Word, Excel inside emails
CPR detection with the same strict validator as the Document Scanner
NER-based PII detection (phone, IBAN, bank account, name, address, org)
Progressive streaming — results appear card-by-card via Server-Sent Events
Incremental / resumable scans — checkpoint saved on interruption, resume on next run
Delta scan — Graph /delta endpoints fetch only changed items since last scan
Per-item thumbnail generation — image previews and placeholder SVGs

Results

Results grid with grid and list view, search, source filter, and disposition filter
Account name and role (🎓 Student / 👔 Staff) badge on every card
🗓 Overdue badge on items exceeding the retention cutoff
Preview panel with iframe preview, metadata strip, and disposition dropdown

Compliance features

Retention policy enforcement (GDPR Art. 5(1)(e)): rolling or fiscal-year cutoff (e.g. Bogføringsloven Dec 31), 🗓 Overdue badge, bulk-delete quick filter, headless auto-delete via --retention-years and --fiscal-year-end
Data subject lookup (Art. 15/17): modal, CPR hashed before query, bulk delete with audit logging
Disposition tagging (Art. 5(1)(a)): Unreviewed / Retain (legal/legitimate/contract) / Delete-scheduled / Deleted — filter bar, preview panel, Excel export, headless auto-delete of scheduled items
Deletion audit log (Art. 5(2)): every deletion logged with timestamp, actor, reason, legal basis
Article 30 report (Art. 30): structured .docx export — summary, data categories, data inventory (staff and student sections), retention analysis, compliance trend, deletion audit log, methodology

User management

Application mode (service account) and Delegated mode (device code flow)
License-based role classification: 🎓 Student / 👔 Staff detected from O365 SKU IDs — no extra permissions needed
Role filter buttons in accounts panel (All / 👔 Ansat / 🎓 Elev)
Licensed users only — accounts without an assigned license are excluded
Display name resolution: GUIDs and "Microsoft Konto" guest placeholders resolved to email address

Database (`m365_db.py`)

SQLite persistence layer alongside JSON session cache
Tables: scans, flagged_items, cpr_index, pii_hits, dispositions, scan_history, deletion_log
CPR numbers stored as SHA-256 hashes only — never in plaintext
Schema migration support via _MIGRATIONS + user_version pragma

Exports

Excel export: 9 columns including Account, Role, Disposition; per-source sheets with auto-filter
Article 30 Word document export
Email report via SMTP (STARTTLS / SMTPS / plain); headless --email-to flag

Headless / scheduled mode

--headless --output DIR --settings FILE for cron / Task Scheduler
--retention-years N --fiscal-year-end MM-DD for automated retention enforcement
--email-to for automated report delivery
Non-interactive: deletes automatically; interactive (TTY): prompts for confirmation

Internationalisation

Language files: English (en), Danish (da), German (de)
Language switching applies in-place — no page reload, scan results preserved

Installation

install_windows.ps1: Python, Tesseract, Poppler, venv — all local to project folder, no system PATH changes; all downloads via curl.exe
install_macos.sh: Homebrew, Python 3.12, Tesseract, Poppler, spaCy model
Dockerfile + docker-compose.yml for containerised deployment
GitHub Actions: 4 parallel build jobs (Document Scanner + M365 × Windows + Linux), auto-release on v* tags

Versioning policy

PATCH (1.0.x) — bug fixes, translation updates, minor UI tweaks
MINOR (1.x.0) — new feature, new suggestion from SUGGESTIONS.md implemented
MAJOR (x.0.0) — breaking change: DB migration required, config format change, or Azure permission requirement change

To release a new version:

# 1. Update VERSION
echo "1.1.0" > VERSION

# 2. Update CHANGELOG (add new section above [1.0.0])

# 3. Commit and tag
git commit -am "Release 1.1.0"
git tag v1.1.0
git push && git push --tags
# GitHub Actions builds and publishes automatically

206 KiB Raw Blame History Unescape Escape

Changelog

[Unreleased]

Added

Fixed

[1.6.28] — 2026-05-28

Added

Fixed

[1.6.27] — 2026-05-27

Added

[1.6.26] — 2026-04-29

Fixed

[1.6.25] — 2026-04-25

Added

Fixed

[1.6.24] — 2026-04-25

Fixed

[1.6.23] — 2026-04-21

Added

Fixed

[1.6.22] — 2026-04-21

Added

Fixed

[1.6.21] — 2026-04-20

Added

Fixed

[1.6.20] — 2026-04-18

Fixed

[1.6.19] — 2026-04-18

Fixed

[1.6.18] — 2026-04-18

Fixed

[1.6.17] — 2026-04-18

Added

[1.6.16] — 2026-04-18

Added

[1.6.15] — 2026-04-12

Added

Fixed

[1.6.14] — 2026-04-10

Added — read-only viewer mode (#33)

Fixed — memory exhaustion during large M365 scans

Changed — Sources panel is now resizable and collapsible

[1.6.13] — 2026-04-10

Added — developer tooling

Fixed — documentation

Changed — blueprint migration batch 3, 4, 5 (auth, database, export — migration complete)

[1.6.12] — 2026-04-10

Fixed — profile editor save drops users from non-active role groups

[1.6.11] — 2026-04-10

Changed — blueprint migration batch 1 (scan + app_routes)

[1.6.10] — 2026-04-10

Fixed — Google Drive exportSizeLimitExceeded warning

Fixed — peak memory during large file/SMB scans (OOM risk reduction)

[1.6.9] — 2026-04-10

Changed — frontend migrated to ES modules

Fixed — connector.js SyntaxError caused by duplicate function declarations

Fixed — Profiler modal did not open when _renderProfileMgmt threw

Fixed — blocking alert on every unhandled async error

[1.6.8] — 2026-04-09

Fixed — memory pressure during large scans

Fixed — Excel and Article 30 exports missing sources

Changed — "Email" source renamed to "Outlook"

Changed — progress bar moved above log panel

Fixed — profile manager Cancel closes entire modal

Changed — exports available without running a new scan

Fixed — profile loading clobbered by scan start

Fixed — "no results" shown during live scan after hard refresh

Fixed — progress bar source pill showing "Email" instead of "Outlook"

Changed — profile manager UI simplified

Changed — log panel collapsible

Changed — log header buttons translated

Changed — project documentation structure

Fixed — disposition carry-forward across scans

[1.6.7] — 2026-04-06

Fixed — emoji/symbol removal from all buttons and indicators

Fixed — result card badges, progress bar on browser refresh

Improved — scan responsiveness, UI layout, preview panel

[1.6.6] — 2026-04-06

Improved — UX polish II (clusters, badges, log panel, progress bar)

206 KiB

Raw Blame History

Fixed — Google Drive `exportSizeLimitExceeded` warning

Fixed — Profiler modal did not open when `_renderProfileMgmt` threw

Changed — i18n migrated from `.lang` to JSON (#27 ✅)

Changed — `skus/` renamed to `classification/` (#29 ✅)

Changed — Split `gdpr_scanner.py` into focused modules (#25)

Fixed — File scan `scan_complete` → `scan_done` event name

Fixed — `install_macos.sh` launcher scripts